Source-cited AI for accounting: the bar that should disqualify every tool that can’t meet it

Jiesen Li Advisory

There is a question every CPA partner is asking right now, in some version: can I actually use this stuff on client work?

The honest answers in circulation cluster at two extremes. One camp says no — the hallucination risk is too high, the profession’s standards are too strict, the liability exposure is too real. The other camp says yes, enthusiastically, and points at productivity demos that fall apart the moment a partner asks the model to show its work.

Both camps are wrong, and they are wrong for the same reason: they are arguing about the wrong metric.

“The model is 95% accurate” is not an accounting metric

When a vendor tells you their AI is 95% accurate at categorizing transactions, or 92% accurate at drafting memos, or 97% accurate at extracting fields from a 1099, what they are quietly telling you is that some non-trivial percentage of outputs is wrong, and that nothing in the output tells a downstream user which ones.

That is fine for marketing copy. It is fine for first drafts of internal email. It is not fine for accounting workpapers.

The standard the profession actually operates under is not accuracy. It is defensibility. A workpaper has to be re-performable by a reviewer, traceable to its source, and able to stand up in front of an auditor or regulator who is going to ask, in some form: show me where you got this. A 95%-accurate tool that cannot answer that question is, for accounting purposes, a 0%-defensible tool.

This is why “AI accuracy” benchmarks are the wrong frame. The right frame is source-citation.

The right metric: every output traceable to a source

The bar is simple to state. For every numerical figure, every accounting conclusion, every memo paragraph that an AI produces, the user has to be able to click through to the underlying source — the specific general ledger row, the specific document page, the specific paragraph of the standard — that the output is grounded in.

Not “the model was trained on accounting literature.” Not “the model has read the standards.” Not “the model is highly confident in this answer.” A literal, addressable, re-openable source.

If the tool cannot do that, it does not belong on a workpaper. Full stop.
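
One way to picture the bar is as a data-shape requirement: every claim carries at least one source reference that a reviewer can re-open. The sketch below is illustrative only. The names SourceRef, Claim, and untraceable_claims, the three source kinds, and the locator format are assumptions made for this example, not a description of any particular product.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical source kinds mirroring the three examples in the text:
# a general ledger row, a document page, a paragraph of a standard.
SOURCE_KINDS = {"gl_row", "document_page", "standard_paragraph"}


@dataclass(frozen=True)
class SourceRef:
    kind: str     # expected to be one of SOURCE_KINDS
    locator: str  # illustrative format, e.g. "gl_2024.csv#row=1042"

    def is_addressable(self) -> bool:
        # Addressable here means: a known kind plus a non-empty locator.
        return self.kind in SOURCE_KINDS and bool(self.locator.strip())


@dataclass
class Claim:
    text: str
    sources: List[SourceRef] = field(default_factory=list)


def untraceable_claims(claims: List[Claim]) -> List[Claim]:
    """Return the claims that lack at least one addressable source."""
    return [c for c in claims if not any(s.is_addressable() for s in c.sources)]


claims = [
    Claim("Rent expense for March was 12,500.00.",
          [SourceRef("gl_row", "gl_2024.csv#row=1042")]),
    Claim("The lease is an operating lease.", []),  # no source attached
]
failing = untraceable_claims(claims)
for claim in failing:
    print("No addressable source:", claim.text)
```

Under this check, "the model was trained on accounting literature" has nowhere to go: it is not a kind of source, and it has no locator.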

What that means architecturally

This is not a UX feature that vendors can bolt on. It is an architectural commitment that has to be made before a single line of product code is written.

The model has to be operating in a retrieval-augmented configuration, meaning that for any question it answers, the system first retrieves the specific source material the answer should be based on, then asks the model to reason over that material and that material only — and to cite back to it. The model’s parametric memory — what it “knows” from training — is not the answer source. The retrieved documents and transaction rows are the answer source. The model is the reasoning layer over them.

This sounds like a technical distinction. It is actually the entire game. A tool that lets the model answer from its training data, then dresses up the response with citations after the fact, is a tool that will hallucinate plausibly-cited nonsense. A tool that constrains the model to only reason over retrieved sources, and forces every output to point at the specific row or page that produced it, is a tool that can be used on real work.
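
The ordering described above (retrieve first, reason only over what was retrieved, cite back to it) can be sketched in a few lines. This is a deliberately minimal toy: the ledger rows, field names, and function names are invented for illustration, and the "reasoning layer" is a plain sum standing in for a model call.

```python
from typing import Dict, List

# Toy general ledger extract. Row ids and fields are illustrative assumptions.
LEDGER: List[Dict] = [
    {"row_id": 101, "account": "6100 Rent", "period": "2024-03", "amount": 12500.00},
    {"row_id": 102, "account": "6100 Rent", "period": "2024-04", "amount": 12500.00},
    {"row_id": 103, "account": "6200 Utilities", "period": "2024-03", "amount": 830.25},
]


def retrieve(account: str, period: str) -> List[Dict]:
    """Step 1: fetch only the rows the answer is allowed to rest on."""
    return [r for r in LEDGER if r["account"] == account and r["period"] == period]


def answer_from_sources(account: str, period: str) -> Dict:
    """Step 2: reason over the retrieved rows only, and cite each one used."""
    rows = retrieve(account, period)
    if not rows:
        # Nothing was retrieved, so there is nothing to ground an answer in.
        return {"answer": None, "citations": []}
    total = sum(r["amount"] for r in rows)
    return {
        "answer": f"{account} for {period} totals {total:.2f}",
        "citations": [r["row_id"] for r in rows],
    }


result = answer_from_sources("6100 Rent", "2024-03")
print(result["answer"], "| rows:", result["citations"])
```

The design point is that citations are produced by the same step that produces the answer. They are the rows the computation actually consumed, not references attached afterwards.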

The second posture is what retrieval-augmented generation (RAG) looks like when it is implemented rigorously. Most accounting AI tools on the market today do not implement it that way. Many implement it as a thin veneer over a model that is still mostly answering from training data. The difference is not visible in a demo. It is brutally visible the first time a partner asks, “show me the row.”

What this disqualifies

This bar disqualifies a lot of what is currently being sold to accounting firms.

It disqualifies any tool that produces narrative accounting analysis without addressable citations to source documents or transaction data. It disqualifies any tool whose “citations” point to summaries it generated rather than primary sources. It disqualifies any tool that cannot, on request, show you the chunk of source material that produced a given sentence. It disqualifies any tool that asks you to trust the model.

It does not disqualify AI from the engagement. It disqualifies most of the AI currently on the market from the engagement.

The AICPA and Circular 230 angle

This is not just a quality preference. It is a professional-standards posture.

The AICPA Code of Professional Conduct requires due professional care, which has always meant that a CPA’s work product has to be supportable. Circular 230, governing tax practice, imposes diligence and competence requirements that include the obligation to base conclusions on adequate factual and legal foundations. Neither standard was written with AI in mind, and neither needs to be amended to apply: a workpaper or memo whose conclusions cannot be traced to their factual basis fails both regimes, whether a human or a model produced it.

A CPA who relies on AI-generated content that cannot be traced to its sources is not being efficient. They are signing work product they cannot defend.

Source-citation is not a feature. It is the line between professional practice and professional malpractice.

How jiesen.ai implements it

The platform is built around this constraint as a starting condition, not a refinement. Every output — forensic reconciliation result, ASC memo paragraph, valuation calculation, close commentary line — is generated against retrieved source material and ties back to the specific row, document page, or standard citation that produced it. The model does not get to draw on its training data for accounting conclusions. It reasons over what was retrieved, and it cites where the reasoning came from.

This produces slower demos. It produces shorter answers. It produces, occasionally, the response “I cannot answer this from the sources provided” — which is the right response when it is true, and a response no tool optimized for “accuracy benchmarks” will ever give you.
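
The refusal behavior can be sketched as a simple gate: answer only when every fact the question depends on was found in a retrieved source. This is a hypothetical illustration, not the platform's implementation. The function name, the fact names, and the locator strings are all assumptions.

```python
from typing import Dict, List

ABSTAIN = "I cannot answer this from the sources provided"


def answer_or_abstain(required_facts: List[str], retrieved: Dict[str, str]) -> str:
    """Answer only if every required fact was found in a retrieved source.

    `retrieved` maps a fact name to the locator of the source supporting it.
    """
    missing = [f for f in required_facts if f not in retrieved]
    if missing:
        # Name what is missing so the reviewer knows which source to supply.
        return f"{ABSTAIN} (missing: {', '.join(missing)})"
    cited = "; ".join(f"{f} [{retrieved[f]}]" for f in required_facts)
    return f"Supported by: {cited}"


retrieved = {"lease_term": "lease.pdf#page=3"}  # illustrative locator
partial = answer_or_abstain(["lease_term", "discount_rate"], retrieved)
full = answer_or_abstain(["lease_term"], retrieved)
print(partial)
print(full)
```

A system tuned only to maximize a benchmark score has no incentive to take the first branch. A system built for defensibility treats it as a normal outcome.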

It also produces output that survives a partner review, because every claim has an audit trail.

The technical implementation — chunking strategy, retrieval architecture, source-row addressability across 425,000-row general ledger extracts — is covered in detail in a companion piece. The point here is upstream of that detail. The point is that source-citation is the bar, and any tool that cannot meet it does not belong in the engagement, regardless of how impressive its demos are.

The standard the profession should adopt

Firms evaluating AI tools should add one question to every diligence checklist:

For any output your system produces, can the user click through to the specific source that produced it?

If the answer is no, the tool is not a candidate for client work. It may be useful for internal brainstorming, training scenarios, or marketing drafts. It is not useful for workpapers.

If the answer is yes, the next question is whether the citations are real — whether they point at primary source material (transaction rows, document pages, standard paragraphs) or at the model’s own intermediate outputs dressed up to look like sources.
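
That second question can be made mechanical if each citation records where the cited artifact came from. The sketch below assumes such an origin label exists. The taxonomy, the Citation class, and the "model_summary" label are hypothetical, chosen only to illustrate the distinction between primary sources and the model's own intermediate outputs.

```python
from dataclasses import dataclass
from typing import List

# Assumed taxonomy of primary material, following the examples in the text.
PRIMARY_ORIGINS = {"transaction_row", "document_page", "standard_paragraph"}


@dataclass(frozen=True)
class Citation:
    locator: str
    origin: str  # "model_summary" marks text the model itself generated


def non_primary(citations: List[Citation]) -> List[Citation]:
    """Return citations that point at anything other than primary sources."""
    return [c for c in citations if c.origin not in PRIMARY_ORIGINS]


cites = [
    Citation("gl_2024.csv#row=1042", "transaction_row"),
    Citation("summary_17", "model_summary"),
]
flagged = non_primary(cites)
for c in flagged:
    print("Not a primary source:", c.locator)
```

A check like this is only as trustworthy as the origin label, so the label has to be set by the retrieval layer when the material is ingested, not by the model when it writes the answer.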

Most firms will not ask these questions because the vendor demos do not invite them. Within a few quarters, the firms that do ask will have AI capability they can deploy on real engagements, and the firms that do not ask will have tools they cannot. The gap is going to widen.

The profession does not need a new standard to govern AI. The standards already in force — due professional care, diligence, supportable conclusions — already require source-citation. What the profession needs is the willingness to enforce the standards it already has, against vendors that would prefer to be evaluated on demos.

The bar is straightforward. The tools that meet it are usable on client work. The tools that do not should not be.

jiesen.ai is the educational AI workbench for working CPAs and finance teams, operated by Jiesen Li Advisory. Every output is source-cited. Inputs are not used to train models. Open the platform or engage the firm for signed deliverables.