AI Evals: What the Checkmarks Actually Prove
Stage: pre-launch, evaluation
In March 2026, a major health system quietly paused the rollout of an agentic care-coordination assistant. The model had passed every offline eval the vendor ran. The integration passed every unit test. The agent shipped. Inside ten weeks, a cardiology fellow flagged a pattern. The agent was producing risk stratifications that cited the 2017 ACC/AHA heart failure guideline for a class of patients the 2023 revision had reclassified. The outputs were fluent, confident, and formatted exactly like the training examples. They were also reasoning from retired criteria. The vendor’s eval suite tested whether the output matched the training distribution. It did. The training distribution was the old guideline.
That scenario, or a less fortunate version of it, is going to be the defining eval failure of 2026. Not a hallucinated answer invented from nothing. A fluent, correctly formatted answer produced against a reference the system no longer has any right to cite, and an eval suite that confirmed the agent was doing exactly what it was trained to do.
The mental model that served you for twenty years breaks in that scenario. The checkmarks are still there. What they prove is not. This chapter is about what changed, and what the PM has to own to close the gap.
Chapter 1 introduced non-determinism as a first principle. Chapter 4 introduced the runtime artifacts. Evals are the bridge: how do you know, before shipping, that the runtime is doing what the spec asked for, and not producing a confident wrong answer that looks like success? There are three places the old QA mental model breaks, and a fourth that is often treated as separate but deserves equal weight.
What Evals Actually Are
A grounding first. An AI evaluation framework is the way you check whether the system behaves correctly, reliably, and safely before and after deployment. Not whether the code compiles. Whether the system makes the decisions it was designed to make, under conditions that resemble the real world.
Evals are not a complete validation system. They are partial observability tools. They approximate behavior under controlled conditions. They bound uncertainty. They do not prove correctness. The PM who treats a passing eval suite the way they treat a passing QA matrix has imported a precision the tool does not provide.
The structural mapping between QA and AI evals is direct: epics become capability definitions, acceptance criteria become eval criteria, unit tests become single-step evals, integration tests become trajectory evals, and regression tests become continuous evals re-run after every model update. The division of ownership is identical: engineering runs the evals, the PM owns what success means.
The critical difference: in QA, one passing test proves the behavior is fixed. In AI evals, one passing test is a sample from a distribution. Binary pass/fail is not a sufficient quality gate for a non-deterministic system. Evals test the agent. Production monitoring (Chapter 6) tests the agent-plus-supervisor system. Both are required, and they are measuring different things.
The mapping is close enough to be useful and different enough to be dangerous if you import QA instincts unchanged.
| Traditional QA | AI eval analog | Key difference |
|---|---|---|
| Epic | Agent capability definition | The unit of work is a decision, not a feature |
| Acceptance criteria | Eval criteria (task success rate, intent resolution accuracy, tool call correctness) | Defined on distributions, not single cases |
| Unit tests | Single-step evals (one reasoning step or tool invocation) | Output is non-deterministic; one pass is a sample, not a proof |
| Integration tests | Trajectory evals (full multi-step chain end-to-end) | Multi-step workflows compound failure probability |
| QA test plan | Eval suite | The dataset is the product; what you test is what you optimize for |
| Traceability matrix | LLM-as-judge rubric mapped to original criteria | Judge itself introduces error; must be calibrated to a human-labeled reference |
| UAT | Human review of flagged agent trajectories | You cannot UAT every sample; you UAT the failures the system surfaces |
| Regression tests | Continuous evals | Re-run on every model update, not only on code changes |
Table 5.1. Mapping QA to AI evals.
Where the Old Mental Model Breaks
The three places the QA mental model breaks are the deterministic assumption, component-level testing, and execution observability. Each has its own failure mode. Each needs a different PM artifact.
The determinism problem. In traditional software QA, a passing test proves the code will behave correctly. Given the same input, the same code produces the same output, every time. One green checkmark means the behavior is fixed.
In agentic AI, one green checkmark is a sample from a distribution. Even in systems configured for low variance, non-determinism persists at the system level due to retrieval variability, external tool responses, and context differences across sessions. The same agent, given the same task on different days, may succeed, fail, or take a path nobody anticipated.
The practical consequence: binary pass/fail is not a sufficient quality gate. Leading teams use repeated sampling and distribution-based metrics, tracking success rates across multiple runs. The practice has a name in the eval literature: pass@k. You run the eval k times, where k is typically between five and ten for internal release gates, higher for published benchmarks, and report the success rate across the distribution. A team that presents one green checkmark on a non-deterministic system has not told you whether the agent passed. They have told you it passed once.
Pass@k is the practice of running an eval k times (typically five to ten for internal release gates) and reporting the success rate across the distribution rather than a single binary result. In a non-deterministic system, a single passing run proves very little. A pass rate of 8 out of 10 tells you something meaningful about the system’s reliability. A pass rate of 1 out of 1 tells you almost nothing.
Two numbers from the distribution matter for a release decision. The P50, the median pass rate, is the expected performance in a typical interaction. The P10, the pass rate at the worst ten percent of runs, determines user experience at scale. An 85 percent P50 with a 40 percent P10 is a system that succeeds in most interactions and fails badly in the cases that hit the tail. The P10 is the number most teams never calculate. When your engineering team reports eval results, the first question is: how many times was each eval run, and what is the P10?
The value of pass@k is in the shape of the distribution, not in a single green check.
| Result | k | Interpretation |
|---|---|---|
| 1 of 1 passes | 1 | Almost no information about reliability |
| 8 of 10 pass | 10 | Approximates the real success rate; adequate for most internal release gates |
| 48 of 50 pass | 50 | Strong evidence, near production-ready |
Table 5.2. Pass@k in practice.
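For readers who want the arithmetic concrete, here is a minimal sketch of one way to operationalize the distribution view: compute a pass rate per scenario across k repeated runs, then read the P50 and P10 off the resulting distribution. The scenario names, the k of ten, and the crude percentile method are illustrative, not a prescribed implementation.

```python
from statistics import median

def pass_rate(runs: list) -> float:
    """Fraction of the k repeated runs that passed for one eval scenario."""
    return sum(runs) / len(runs)

def release_gate_summary(results: dict) -> dict:
    """Distribution view of pass@k results: P50 is typical behavior,
    P10 is the tail that dominates user experience at scale."""
    rates = sorted(pass_rate(r) for r in results.values())
    p50 = median(rates)
    p10 = rates[int(0.10 * (len(rates) - 1))]  # crude nearest-rank percentile
    return {"p50_pass_rate": p50, "p10_pass_rate": p10}

# Hypothetical scenarios, each run k=10 times.
results = {
    "refund_request": [True] * 9 + [False],
    "address_change": [True] * 6 + [False] * 4,
    "order_status":   [True] * 10,
}
print(release_gate_summary(results))  # {'p50_pass_rate': 0.9, 'p10_pass_rate': 0.6}
```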
An LLM-as-judge setup uses a language model to score the output of another language model against a rubric. It scales eval grading because human-labeled evaluation does not keep up with the volume of iterations modern teams run. The tradeoff: the judge introduces its own error rate, which must be characterized against a sample of human-labeled examples. A false-pass rate of 10 percent in the judge translates directly into a quality gate with a 10 percent blind spot.
If the team cannot tell you the judge’s calibration error, the team cannot tell you what the passing scores mean.
One sharpening worth carrying with you on LLM-as-judge, because the bias literature is now specific. Research on judge models has documented several systematic biases that affect what passes. Judges prefer longer answers, even when length does not correlate with correctness. Judges prefer the answer presented first when comparing two outputs side by side, a position effect that is independent of quality. Judges prefer outputs generated by models from their own family, because the judge is more familiar with the syntactic patterns of its own family. None of these biases are subtle in a controlled experiment. All three are present in production LLM-as-judge deployments by default. The judge is not a neutral grader. It is an instrument with documented systematic errors. If the team has not characterized those errors against human-labeled reference examples, the eval scores have a noise floor the team has not measured.1
One specific calibration metric belongs in every PM’s vocabulary on this. True Negative Rate, or TNR, is the proportion of actual failure cases the judge correctly labels as failures. If you give the judge a hundred outputs that humans have labeled as failed, and the judge correctly flags ninety of them, TNR is ninety percent. The other ten passed the judge despite being wrong; that ten percent is the false-pass rate, the inverse of TNR. TNR is the metric the judge is graded on, and it is the metric most teams do not produce. The PM’s question to engineering at launch review is not “does the judge work” but “what is the judge’s True Negative Rate against a human-labeled failure set, and what minimum TNR did we set as the calibration target before this number was measured?” If the team set the target after seeing the result, the calibration is post-hoc rationalization. If the team did not measure TNR at all, the eval gate is operating at an unknown noise floor and the passing scores cannot be interpreted. Three artifacts before sign-off: the human-labeled failure set, the measured TNR, and the pre-declared minimum TNR. Without all three, the LLM-as-judge layer is performing eval theater.
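A minimal sketch of what the calibration check looks like in code, assuming a human-labeled reference set and stored judge verdicts; the function and field names are illustrative, not a platform API.

```python
def judge_calibration(human_labels: dict, judge_verdicts: dict, min_tnr: float = 0.90) -> dict:
    """Compare judge verdicts against human labels on a reference set.
    human_labels / judge_verdicts: example_id -> "pass" or "fail".
    TNR is the fraction of human-labeled failures the judge also flags as failures;
    the false-pass rate (1 - TNR) is the blind spot in the quality gate."""
    failures = [i for i, label in human_labels.items() if label == "fail"]
    if not failures:
        raise ValueError("Calibration needs human-labeled failures to grade the judge against.")
    caught = sum(1 for i in failures if judge_verdicts[i] == "fail")
    tnr = caught / len(failures)
    return {
        "labeled_failures": len(failures),
        "tnr": tnr,
        "false_pass_rate": 1 - tnr,
        "meets_pre_declared_target": tnr >= min_tnr,  # min_tnr must be set before measurement
    }
```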
The compound probability problem. This one has no equivalent in traditional software development, which is why it catches experienced PMs off guard.
Suppose the agentic system runs a ten-step workflow. Every single step passes at 95 percent accuracy in isolation. In a traditional system, you sign off.
In an agentic system, 95 percent accuracy across ten steps produces an end-to-end success rate of roughly 60 percent. The math: 0.95 to the tenth power is 0.598. End-to-end reliability is the product of the steps, not their average. This assumes the steps are independent, which rarely holds in practice. Errors are often correlated, and some workflows include recovery steps that partially compensate. The actual degradation varies. What the math guarantees is that component-level accuracy and system-level reliability are different numbers, and that difference grows with every step added to the chain.
If the eval suite tests each component in isolation, the system has not been tested. The components have. This is the sentence to quote to a finance audience that cannot yet see why “our model passes evals at 95 percent” does not mean “our product is reliable.”
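The compound probability arithmetic fits in a few lines. This sketch assumes the steps fail independently, which, as noted above, rarely holds exactly; treat the output as a planning estimate, not a measurement.

```python
from math import prod

def projected_end_to_end_success(step_accuracies: list) -> float:
    """End-to-end reliability is the product of per-step accuracies, not their average.
    Assumes independent failures; correlated errors or recovery steps shift the real number."""
    return prod(step_accuracies)

# Ten steps at 95 percent each project to roughly 60 percent end to end.
print(projected_end_to_end_success([0.95] * 10))  # about 0.6 end to end
```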
The background failure problem. In traditional QA, execution is observable. The code either ran or it did not.
An agentic system can produce a semantically correct output, pass the eval, and never have triggered the underlying action. In production deployments, audits have revealed cases where “order updated and closed” messages never corresponded to any API call. The agent wrote the right words, the automated judge scored the output as accurate, the test passed. No state validation layer. Invisible to metrics until someone looked directly at the system state.
A background failure occurs when the agent produces a correct-looking output without performing the underlying action. The agent says “order confirmed” but no API call was made. The eval scores the output as correct because it checks what the agent said, not what actually changed. Semantic validation reads what the agent said. State validation checks what actually changed in the target system.
Both are required. A release gate that runs one and calls it done has half a gate.
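A sketch of what the two halves of the gate look like side by side, using a hypothetical order-management client; neither the client nor the field names correspond to a real API.

```python
def semantic_check(agent_output: str) -> bool:
    """Semantic validation: does the text claim the action happened?
    This is the half an LLM judge or string match typically covers."""
    return "order confirmed" in agent_output.lower()

def state_check(order_id: str, orders_api) -> bool:
    """State validation: did the system of record actually change?
    orders_api is a hypothetical client for the target system."""
    record = orders_api.get_order(order_id)
    return record is not None and record.get("status") == "confirmed"

def action_eval_passes(agent_output: str, order_id: str, orders_api) -> bool:
    # Both halves are required; either one alone is half a gate.
    return semantic_check(agent_output) and state_check(order_id, orders_api)
```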
The Scope of Classifier Metrics
A short digression on the metric toolkit most PMs and engineers bring to this work.
Physicians, data scientists, and ML engineers trained on classifier evaluation carry a familiar set of metrics into agent evaluation: accuracy, sensitivity, specificity, AUC, precision, recall, F1. These metrics retain meaning at the single-step level inside an agentic system. If one step of a workflow is a tool selector, a PII detector, a refusal classifier, or any other binary decision against a ground truth, classifier metrics apply directly. Tool call accuracy evaluators on enterprise eval platforms are examples of classifier metrics at single-step granularity.
The same metrics break down for the agent end to end. A five-step workflow with four correct steps and one wrong step produces 80 percent step accuracy and 0 percent task success if the wrong step is the critical one. No classifier metric expresses that fact. Compound probability and end-to-end task success do. Evaluation-platform practitioners are direct about it: standard ML metrics like precision, recall, and accuracy, and NLP metrics like BLEU and BERTScore, do not cut it for multi-step agents.
The practical rule: use classifier metrics for single-step validators inside the agent (router selection, PII detection, refusal detection, tool call accuracy). Use pass@k, end-to-end task success, and compound probability for the agent as a whole. A team reporting classifier metrics on the trajectory is reporting a number that cannot, mathematically, tell you whether the agent works.
One reconciliation note. Evaluation platforms still expose accuracy, precision, recall, and F1 on agent dashboards because the metrics are cheap to compute over tool-call events and familiar to any engineer who has shipped a classifier. Exposure does not mean sufficiency. When you see them in your tooling, read them as single-step diagnostics, not as an end-to-end quality signal.
| Metric | What it actually tells you | Use for | Misleads when |
|---|---|---|---|
| Accuracy, precision, recall, F1 | Rate of correct single-step classifications against ground truth | Binary validators inside the agent: router, PII detector, refusal classifier | Applied to multi-step trajectories; 80 percent step accuracy can hide 0 percent task success if the wrong step is critical |
| AUC, sensitivity, specificity | Discrimination of a single-step binary classifier across thresholds | Classifiers with a confidence threshold you can tune | Applied to agent trajectories; there is no single threshold to optimize end to end |
| Pass@k | Success distribution across k runs of the same eval | Non-deterministic agent trajectories, k between 5 and 50 depending on the release gate | k=1, which is a single sample and almost no signal; k chosen without variance reported alongside |
| End-to-end task success | Whether the agent delivered the user’s intended outcome | Release decisions on the agent as a whole | Taken alone, without step-level diagnostics, when a failure is being investigated |
| Compound probability | Projected end-to-end success from per-step accuracy | Pre-release reliability projection from component evals | Assumes step independence; correlated failures and recovery steps shift the actual number |
| Trajectory match | Whether the agent followed a reference path | Workflows with enumerable correct paths | Workflows with legitimate path variation; scores valid alternatives as failures |
| State validation | Whether the target system actually changed | Every agent action with state-change semantics | Read-only operations where no state change is expected |
Table 5.3. The eval metrics you will see on platform dashboards, when to use them, and when they mislead.
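The 80-percent-step-accuracy, zero-percent-task-success gap described above is easy to demonstrate. A sketch, with an invented five-step trajectory and an invented notion of which steps are critical:

```python
def step_accuracy(steps: list) -> float:
    """Classifier-style metric: fraction of steps judged correct in isolation."""
    return sum(s["correct"] for s in steps) / len(steps)

def task_success(steps: list, state_validated: bool) -> bool:
    """End-to-end metric: every critical step correct and the final state verified."""
    return all(s["correct"] or not s["critical"] for s in steps) and state_validated

# Hypothetical trajectory: four steps right, the one critical step wrong.
trajectory = [
    {"name": "parse_intent",  "correct": True,  "critical": True},
    {"name": "lookup_policy", "correct": True,  "critical": False},
    {"name": "select_tool",   "correct": False, "critical": True},   # wrong tool chosen
    {"name": "draft_reply",   "correct": True,  "critical": False},
    {"name": "send_reply",    "correct": True,  "critical": False},
]
print(step_accuracy(trajectory))                       # 0.8
print(task_success(trajectory, state_validated=True))  # False
```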
The Coverage Problem
The fourth failure mode is easy to treat as “a Chapter 10 problem” because it is about what you did not test. Treating it that way is how teams end up shipping the gap. It belongs in the evals discussion.
Evals only test what someone thought to test. Every scenario in the eval suite represents a hypothesis someone had before launch about how the system might behave. The long tail of production behavior, the edge cases nobody anticipated, the input combinations that only emerge at scale, the adversarial patterns users discover in real use, none appear in the pre-launch eval suite. Most consequential production failures happen outside the scenarios anyone thought to include.
The dataset is the product. What you choose to test becomes what you optimize for. The scenarios you include in the eval suite define the edges of what the system will be validated against, and most production failures happen outside those edges. This is not a tooling problem. It is a coverage problem, and it belongs to whoever owns the criteria.
This is the same mechanism behind the widely cited pattern that roughly 95 percent of AI proofs-of-concept succeed in the demo and fail when moved to production. The PoC was built on curated data that matched the team’s assumptions. The eval suite was built on scenarios the team thought to include. Production arrived with live, messy, real-world data that matched neither. The mechanism is identical. The scale is different.
Coverage is not a scoring problem. You can have a perfectly designed rubric with excellent calibration and still be measuring the wrong things. The hardest question in eval design is not “how do we score this?” It is “what are we missing?”
A coverage statement is the PM artifact that makes this question answerable. It documents, explicitly, which user intents were tested, which failure modes were included, which adversarial inputs were considered, and, critically, which scenarios were known but deliberately deferred. That last category is the one most teams omit, because listing what you chose not to test requires admitting what you do not know. That is exactly why it belongs in the release decision package. If the team cannot produce a coverage statement, they cannot characterize the gap between what was tested and what production will look like.
A coverage statement is a PM artifact, not an engineering document, that lists: which user intents were tested, which failure modes were included, which adversarial inputs were considered, and which scenarios were known but deliberately deferred. The last category is the most important.
Listing what you chose not to test requires intellectual honesty about what you do not know. That honesty belongs in the release decision package, not in a post-mortem six months later.
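What the artifact can look like in practice: a minimal sketch of a coverage statement as structured data, small enough to live in the release decision package. Every field name and entry here is illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class CoverageStatement:
    """A PM artifact for the release package; field names are illustrative, not a standard."""
    intents_tested: list
    failure_modes_included: list
    adversarial_inputs_considered: list
    deferred_scenarios: dict          # scenario -> reason deferred; the part most teams omit
    input_fidelity_assumptions: list = field(default_factory=list)  # inputs the evals trusted as-is

coverage = CoverageStatement(
    intents_tested=["refund_request", "order_status", "address_change"],
    failure_modes_included=["tool_timeout", "ambiguous_intent", "partial_retrieval"],
    adversarial_inputs_considered=["prompt_injection_in_forwarded_email"],
    deferred_scenarios={"multi_account_households": "no labeled examples yet; revisit after launch"},
    input_fidelity_assumptions=["CRM order status is current within 24 hours"],
)
```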
One more category of failure belongs in any honest coverage statement, and it is the hardest to catch because the eval cannot see it. Evals can also pass when the input data is misleading in ways the eval does not check. The model was faithful to what it received. The received data was wrong in a way nobody anticipated: a stale record, a mislabeled field, an upstream system returning a default value when the real value was missing, a classification that was accurate at training time and has since shifted. The eval scores the output as correct because the output was correct given the input. The input was the failure. Chapter 10 names this as the sixth hallucination category, upstream-data-wrong. In the eval context, the handle is: your coverage statement should list the inputs whose fidelity the eval assumed, and how that assumption was tested.
And one specific adversarial coverage gap worth naming, because it is documented and growing. RAG poisoning. If the agent retrieves from a knowledge base, an attacker who can write to that knowledge base can shape what the agent reasons over. Researchers running the PoisonedRAG framework reported a ninety percent attack success rate when injecting just five malicious texts per target question into a knowledge base containing millions of texts. The model behaves as designed. The retrieval works as designed. The corrupted evidence is what the agent reads, and the eval that scored the output against the retrieved evidence scores it as correct, because the output was faithful to the input. The coverage statement should name the retrieval corpus as a trust boundary, and the eval should include adversarial retrieval scenarios. Most do not.2
A Model Update Is a Deployment
One point that needs its own callout. Foundation model providers update their models on their own schedule. A model update that your team did not deploy can still change your product’s behavior. Eval literature now treats model version changes as deployment events. So should you.
The PM artifacts: a model version policy declaring which provider versions are in which environments and what evidence is required to promote a new version; a regression eval suite re-run against every model update, with a pass threshold specified in advance; a vendor communication channel that surfaces upcoming version changes with enough notice to evaluate them; and a rollback path to the prior model, tested and rehearsed, not just documented. Most teams do not have any of these. Most teams discover they needed them after a silent model change broke something that had been working for a month.
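A sketch of what the promotion gate can look like once the policy exists. The thresholds, version labels, and structure are illustrative; the point is that the thresholds are declared before the candidate’s results are seen and that a rehearsed rollback is part of the gate.

```python
def can_promote_model(regression_results: dict, policy: dict) -> bool:
    """Gate a provider model update the way a code deployment is gated.
    regression_results: eval name -> pass rate across repeated runs on the candidate version.
    policy: pre-declared minimum pass rates plus a tested rollback target."""
    thresholds_met = all(
        regression_results.get(name, 0.0) >= minimum
        for name, minimum in policy["min_pass_rates"].items()
    )
    rollback_ready = bool(policy.get("rollback_version")) and policy.get("rollback_rehearsed", False)
    return thresholds_met and rollback_ready

policy = {
    "min_pass_rates": {"refund_workflow": 0.90, "triage_routing": 0.85},  # set before results were seen
    "rollback_version": "model-2025-11-01",   # hypothetical prior version label
    "rollback_rehearsed": True,
}
candidate_results = {"refund_workflow": 0.93, "triage_routing": 0.88}
print(can_promote_model(candidate_results, policy))  # True
```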
The Metric That Did Not Matter: DAX Copilot
One case study worth carrying with you, because it illustrates a category of eval failure that does not show up in any of the boxes above and yet is the most expensive.
The DAX Copilot is an AI clinical documentation assistant used by physicians to reduce the administrative burden of note-writing. A 2024 randomized controlled trial published in NEJM AI by Tipirneni and colleagues, with two hundred and fifteen physicians, found high adoption rates and positive physician feedback. The product was well-designed for its contract: a copilot that assists with documentation while the human reviews before submitting. Acceptance criteria were met. The contract was correct.
The trial found no statistically significant improvement in documentation time or physician cognitive load compared to the control group.3
The product worked. The eval suite passed. The metric tracked, adoption, was not the metric that mattered. Time saved was the metric that mattered. Cognitive load reduced was the metric that mattered. The team that designed the trial did not connect the contract (a copilot for documentation efficiency) to the outcome measurement that would have proven the contract was being delivered.
This is an AI literacy failure before it is an evaluation failure. The lesson applies to any agentic feature where “people are using it” is treated as evidence that “it is working.” Adoption is a leading indicator of potential value. It is not a measure of delivered value. Pass@k, compound probability, and state validation are useless if the metric being passed is the wrong metric. The PM owns the connection between the contract and the outcome measurement, and the failure to make that connection is not detectable by any eval suite, because the eval suite was set up against the wrong target.
Evals Are One Signal, Not the Gate
Leading teams treat evals as one signal among several. A passing eval suite informs the release decision. It does not make it.
Responsible agentic deployments pair offline evals with production monitoring: canary releases to a small percentage of traffic, shadow testing against live inputs, feedback loops that surface failures as they occur, and drift detection that catches behavioral changes introduced by model updates. The most important evals often happen after deployment. Real-world variability exposes behaviors no offline suite can fully anticipate.
The PM who waits for every eval to pass before signing off has the wrong mental model. The PM who ships with eval coverage, production monitoring, and a clear incident response path has the right one.
What the PM Owns
Engineering runs the evals. The PM owns what success means. In practice that means owning seven specific things.
Define what graceful failure looks like. An agent that cannot complete a task has two options: fail silently or fail legibly. Only one is acceptable, and the PM specifies which.
Set the end-to-end pass rate, not just component thresholds. This is a PM accountability item, not an engineering detail. If the team is reporting step-level accuracy, ask for the system-level number across ten sequential runs. The question that surfaces it immediately: “What is the end-to-end success rate on the full workflow, run ten times?” If they cannot answer it, the system has not been tested end to end.
Own the coverage decision. Which scenarios are in the eval suite, and which were deliberately deferred, is a product decision with real consequences, not a test-plan detail. The coverage statement is the artifact; the decision is yours. If the team cannot produce the statement, the release package is incomplete. Sign off on the gaps, not just the scores.
Require state validation alongside semantic validation. Before sign-off, ask: did the evaluation verify that actions were actually taken, not just that the output text said they were?
Know whether the automated judge is calibrated. In LLM-as-judge setups, false-pass rates can be high, and without human-labeled reference examples they go unmeasured. The bias literature is now specific: longer-answer, position, and same-family preferences are documented and present by default. If the team cannot characterize the false-pass rate against a human reference, they do not know whether the quality gate is catching the failures that matter.
Own the model version policy. Versions are not an engineering detail. A change in the model is a change in the product. What is promoted, when, on what evidence, is your artifact. If a vendor pushes a version change on a Friday and production behavior shifts on Saturday, the rollback plan was yours to design.
Define the acceptable tradeoffs. Success is not a single number. The acceptable balance between cost, latency, reliability, and human oversight determines how eval criteria are weighted. That weighting is a PM judgment, not an engineering default. A team that ships without making this weighting explicit will ship to whatever tradeoff the last test configuration happened to encode.
The Checkmarks Are Still There
The release decision meeting for an agentic system looks the same from the outside. Someone updated the JIRA board. The dev lead has a demo. The QA lead has a results report. The checkmarks are still there.
What has changed is what they prove. A passing eval suite tells you the agent behaved correctly in the scenarios someone thought to test, under controlled conditions, at one moment in time. It bounds uncertainty. It does not eliminate it.
The teams that understand this before they ship do three things differently. They pair offline evals with production monitoring from day one. They treat the first weeks of deployment as an extension of the eval process, not its conclusion. And they build incident response into the release plan before anyone asks for it.
A fourth discipline belongs alongside those three, and most teams do not notice it until too late. The eval suite itself is an observation instrument, and like every observation instrument in an agentic system, it has a useful life pegged to frontier model release cadence. A suite that was calibrated against the model you launched on does not, without re-calibration, calibrate the model you are running eighteen months later. Chapter 8 covers how evals and every other observation instrument degrade over time, and what the maintenance calendar for them looks like.
The teams that carry the traditional checkmark model into agentic AI discover the gaps the same way most gaps get discovered: in production, by users, in ways that were never in the test suite. The conference room at the top of the next chapter is the one they did not want to be sitting in. Someone picks up a phone. The agent was confirming work it never did. For six months.
That is what Chapter 6 is about.
Notes
- LLM-as-judge bias literature: Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,” NeurIPS 2023, arXiv 2306.05685, documents position bias and length bias systematically. Same-model-family bias is treated in subsequent work on judge calibration. The practical implication, for the PM holding a quality gate, is that the false-pass rate of the judge must be characterized against a sample of human-labeled examples or the eval scores have a noise floor the team has not measured.
- Zou et al., “PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation,” USENIX Security 2025. The ninety-percent attack success rate at five malicious texts in a knowledge base of millions is the headline finding. The retrieval corpus as a trust boundary is treated more fully in Chapter 6’s data observability section.
- Tipirneni et al., “DAX Copilot RCT,” NEJM AI, 2024, n=215. The case is included here because it is the cleanest published example of a well-designed eval suite passing against the wrong outcome metric. The lesson generalizes: adoption is not delivered value, and any eval gate that uses adoption as the success criterion has imported the wrong target.