Part III · Design · Chapter 9

Chapter 9: Evals: What the Checkmarks Prove

An eval is one probabilistic signal, not a deterministic gate. A green checkmark on a non-deterministic system tells you the agent passed once. And a model update you did not make is still a deployment you are accountable for evaluating.

In early 2026 a health system paused the rollout of a care-coordination agent that had passed every offline eval the vendor ran and every unit test the integration team wrote. Ten weeks in, a cardiology fellow noticed the agent was stratifying a class of patients against a heart-failure guideline that had been revised three years earlier. The outputs were fluent, confident, formatted exactly like the training examples, and reasoning from criteria the field had retired. The eval suite had tested whether the output matched the training distribution. It did. The training distribution was the old guideline.

That is the eval failure of this era, and it is worth being precise about what kind of failure it is. Not a hallucination invented from nothing, which evals are getting good at catching. A correctly formatted answer produced against a reference the system no longer has any right to cite, confirmed by a suite that was doing exactly what it was built to do. The checkmarks were all green. What they proved had quietly stopped being what the team thought they proved. The previous chapter drew the line between structural correctness, which an eval can certify, and domain correctness, which only an expert can. This chapter is about the first half of that line: what evals actually measure, where the QA instinct you have carried for twenty years breaks, and what stays yours to own when the green checkmark and the correct decision come apart.

What an eval is, and what it is not

An eval checks whether the system makes the decisions it was designed to make under conditions that resemble the real world. Not whether the code compiles; whether the behavior is right. That much maps cleanly onto QA, and the mapping is close enough to be useful: a capability definition stands in for the epic, eval criteria for acceptance criteria, single-step evals for unit tests, trajectory evals for integration tests, continuous evals re-run after every model change for regression tests. Engineering runs all of it; you own what success means. The structure is familiar enough that the temptation is to import the rest of the QA instinct unchanged, and that is exactly where it goes wrong.

The eval suite runs against a golden dataset: a curated set of inputs paired with the outputs you would accept, the edge cases you know about, and the unsafe responses the agent must never produce. That dataset is the thing the suite grades against, and assembling it is product work, not test engineering, because deciding what counts as an acceptable answer to an ambiguous case is a judgment about the product, not a fact about the code. Keep the distinction clean: the golden dataset is the data, the eval suite is the harness that runs against it. And notice what the golden dataset replaces. A user story carried an acceptance criterion, one line, binary, “when the user does X, the system does Y.” That worked because the system was deterministic and Y was a single correct answer. An agent’s behavior is a distribution, and a distribution has no single Y. So the acceptance criterion the story used to hold cannot live in a story anymore; it moves into the golden dataset and the eval, which can express “right most of the time, and here is what the wrong tail is allowed to look like.” This is the first crack in the work-unit a PM has used for twenty years, and a later chapter takes up where the crack leads. For now: the eval set is not a test artifact bolted onto the backlog. For the agent’s behavior, it is becoming the backlog.

The difference that matters is this: in QA, one passing test proves the behavior is fixed. The same input produces the same output, so one green check settles the question. In an agentic system one passing test is a single sample from a distribution. Retrieval varies, tool responses vary, context differs across sessions, and the same agent given the same task on different days may succeed, fail, or take a path no one anticipated. So a binary pass is not a quality gate; it is one draw. The practice that addresses this is repeated sampling: run each eval case some number of times, typically five to ten for an internal gate, and report the pass rate across the runs. The eval literature’s pass@k is a close relative but a more lenient one: it asks whether at least one of k attempts succeeds, which flatters an agent that has to be right every time, so ask which number you are being shown. Eight of ten tells you something about reliability. One of one tells you almost nothing. When a team presents a single green checkmark on a non-deterministic system, they have not told you the agent passed. They have told you it passed once, and they may not know the difference.

Pass rate, not pass. A non-deterministic agent gives a different answer each time you run it, so one eval pass tells you almost nothing. Run each case many times and look at the spread, not the average. An agent that is right 85 percent of the time on a typical case can still be right only 40 percent of the time on its hardest cases, and at production volume those hard cases are not rare; they are simply spread unevenly across users, so someone meets them constantly. The average hides this; the worst slice reveals it. The question that surfaces it: how many times did you run each eval case, and how bad was the worst slice?
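The arithmetic behind that question is small enough to sketch. The Python below is an illustration, not a harness: the function name `pass_rate_report`, the slice labels, and the sample results are all hypothetical, and it assumes only that each eval case was run several times and tagged with a slice.

```python
from collections import defaultdict

def pass_rate_report(runs):
    """Summarize repeated eval runs.

    `runs` is a list of (slice_name, case_id, passed) tuples, one per run.
    Returns the overall pass rate, the pass rate per slice, and the worst slice.
    """
    if not runs:
        raise ValueError("no runs to summarize")

    per_slice = defaultdict(list)
    for slice_name, _case_id, passed in runs:
        per_slice[slice_name].append(bool(passed))

    overall = sum(1 for _, _, passed in runs if passed) / len(runs)
    by_slice = {name: sum(results) / len(results) for name, results in per_slice.items()}
    worst_slice = min(by_slice, key=by_slice.get)
    return {"overall": overall, "by_slice": by_slice, "worst_slice": worst_slice}

# Made-up data: ten runs of a typical case, ten runs of a hard one.
runs = (
    [("typical", "refund-simple", True)] * 9
    + [("typical", "refund-simple", False)] * 1
    + [("hard", "refund-partial-shipment", True)] * 4
    + [("hard", "refund-partial-shipment", False)] * 6
)
report = pass_rate_report(runs)
```

On this invented data the overall rate looks tolerable while the hard slice passes only four runs in ten, which is the gap a single averaged number would hide.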

The three places the old model breaks

There are three failure modes the QA instinct does not see coming, and naming them is half the defense.

The first is determinism, which the pass-rate discipline above already addresses: you stop reading a green check as proof and start reading it as a sample. The second has no analog in traditional software, which is why it catches experienced people off guard. Call it compounding. Suppose an agent runs a ten-step workflow and every step is 95 percent accurate in isolation. In a traditional system you would sign off on 95 percent. Here the end-to-end success rate is the product of the steps, not their average: 0.95 to the tenth power is about 0.60. Ten strong components compound into a coin-flip of a system. This assumes the steps are independent, which rarely holds exactly, so the real number drifts, but the direction is guaranteed: component accuracy and system reliability are different numbers, and the gap widens with every step you add. If the suite tests each component in isolation, the components have been tested and the system has not. This is the sentence worth carrying into a finance review, where “our model passes evals at 95 percent” is doing a lot of unearned work.
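The compounding arithmetic can be checked in a few lines. This is a sketch under the same independence assumption stated above; `end_to_end_success` is a hypothetical helper name, and the twenty-step case is added only to show the direction of the gap.

```python
import math

def end_to_end_success(step_accuracies):
    """Probability that every step succeeds, assuming steps fail independently."""
    return math.prod(step_accuracies)

# Ten steps at 95 percent each: roughly 0.60 end to end.
ten_steps = end_to_end_success([0.95] * 10)

# Doubling the workflow length at the same per-step accuracy: roughly 0.36.
twenty_steps = end_to_end_success([0.95] * 20)
```

The per-step number never changed between the two cases; only the length of the workflow did.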

The third is the one I have watched cost the most, because nothing flags it. Call it the invisible action. An agentic system can produce a semantically correct output, pass the eval, and never have performed the underlying action. Audits have turned up “order updated and confirmed” messages that corresponded to no API call at all. The agent wrote the right words, the automated judge scored the words as correct, the test went green, and nothing happened in the target system. Reading what the agent said is semantic validation. Checking what actually changed is state validation, and they are different gates. A release that runs the first and skips the second has half a gate, and the missing half is the one that fails silently.
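A minimal sketch of the two gates side by side, in Python. Everything here is hypothetical: the in-memory `order_store` stands in for whatever target system the agent acts on, and the string match stands in for an automated judge. The point is only that the two checks read different things.

```python
def semantic_check(agent_message):
    """Semantic validation: does the text claim the right outcome?"""
    return "order updated and confirmed" in agent_message.lower()

def state_check(order_store, order_id, expected_status):
    """State validation: did the target system actually change?"""
    return order_store.get(order_id, {}).get("status") == expected_status

def release_gate(agent_message, order_store, order_id, expected_status):
    """Pass only when both the words and the resulting state are right."""
    return semantic_check(agent_message) and state_check(
        order_store, order_id, expected_status
    )

# Invented invisible action: the agent says the right words,
# but the order record was never touched.
order_store = {"A-1001": {"status": "pending"}}
message = "Order updated and confirmed."

text_only_result = semantic_check(message)
full_gate_result = release_gate(message, order_store, "A-1001", "confirmed")
```

Here the text-only check goes green and the combined gate does not, because the record still reads pending.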

The three eval breaks. Determinism: a green check is one sample, not a proof; report the pass rate and the worst slice. Compounding: end-to-end reliability is the product of the steps, so 95 percent ten times is about 60 percent; test the trajectory, not just the components. The invisible action: a correct-looking output can sit on top of an action that never happened, so validate the state change, not just the text. Each break has a different shape, and each needs its own check; none of the three is visible to a team reporting a single number.

When the judge is the instrument

At production volume, human grading does not keep up, so teams use a model to score the other model against a rubric. That scales, and it imports a quieter problem: the judge is an instrument with documented systematic error. Judges prefer longer answers whether or not length tracks correctness, prefer whichever answer is presented first in a pairwise comparison, and prefer outputs from their own model family because the patterns are familiar. None of these are subtle effects in a controlled test, and all of them are present by default in production judging. The judge is not a neutral grader; it has a noise floor, and the only way to know the floor is to measure the judge against a set of outputs humans have already labeled. The number that matters is how many genuine failures the judge correctly catches. If the team cannot tell you that, against a human-labeled set, with a threshold they committed to before they saw the result, the eval scores have a blind spot the team has not measured, and the size of the blind spot is unknown rather than zero.
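Measuring the judge against human labels comes down to one ratio. The sketch below is illustrative only: `judge_failure_recall`, the threshold value, and both label lists are invented, and it assumes each output has a human verdict and a judge verdict where True means "this output is a failure."

```python
def judge_failure_recall(human_labels, judge_labels):
    """Share of human-confirmed failures that the judge also flagged."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    total_failures = sum(1 for h in human_labels if h)
    if total_failures == 0:
        raise ValueError("human-labeled set contains no failures to measure against")
    caught = sum(1 for h, j in zip(human_labels, judge_labels) if h and j)
    return caught / total_failures

# Hypothetical threshold, committed to before looking at the result.
THRESHOLD = 0.90

# Made-up labels: humans found four failures; the judge flagged two of them
# and raised one false alarm on an output humans accepted.
human = [True, True, True, True, False, False, False, False, False, False]
judge = [True, True, False, False, False, False, False, False, True, False]

recall = judge_failure_recall(human, judge)
judge_is_trusted = recall >= THRESHOLD
```

With these invented labels the judge catches half of the real failures, well under the committed threshold, so every score it produced would carry a known and sizable blind spot rather than an unknown one.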

The fourth break the suite cannot see

The three breaks live inside the eval. The fourth lives in what the eval never contained. An eval only tests what someone thought to test; every scenario in the suite is a hypothesis a person had before launch about how the system might behave. The long tail, the input combinations that only appear at scale, the adversarial patterns users discover by using the thing, none of it is in the pre-launch suite, and most consequential production failures live precisely there. This is the same mechanism behind the worn statistic that most AI pilots shine in the demo and fail in production. The demo ran on curated data that matched the team’s assumptions; the suite ran on scenarios the team thought to include; production arrived with live, messy data that matched neither. Same mechanism, different scale.

The artifact that makes this answerable is a coverage statement: which intents were tested, which failure modes were included, which adversarial inputs were considered, and, the category most teams omit, which scenarios were known and deliberately deferred. That last line is uncomfortable because writing it down means admitting what you chose not to cover, which is exactly why it belongs in the release package and not in a postmortem six months on. Coverage is not a scoring problem. You can have a beautifully calibrated rubric and still be measuring the wrong things, and the hardest question in eval design is never “how do we score this,” it is “what are we missing.”

A model update is a deployment

One point earns its own line because teams discover it the hard way. Foundation-model providers update on their own schedule, and an update you did not make can change your product’s behavior overnight. The eval literature now treats a model version change as a deployment event, and so should you. The artifacts are unglamorous and rarely present: a version policy stating which provider versions run in which environments and what evidence promotes a new one, a regression suite re-run on every model change against a threshold set in advance, a vendor channel that surfaces upcoming changes with enough lead time to evaluate them, and a rollback to the prior model that has been rehearsed rather than merely documented. Most teams have none of these and find out they needed them when a silent version change breaks something that worked for a month, on a Saturday, with no one having shipped a thing.
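The version policy and the pre-committed threshold can be written down in a form that leaves no room to improvise on the day. This is a sketch, not a recommendation: `VersionPolicy`, `promotion_decision`, the model version strings, and the 0.92 threshold are all hypothetical, and the regression pass rate is assumed to come from re-running the existing suite against the candidate model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VersionPolicy:
    pinned_model: str      # version currently approved for production
    min_pass_rate: float   # threshold set before the regression run

def promotion_decision(policy, candidate_model, regression_pass_rate):
    """Decide whether a new provider model version may replace the pinned one."""
    if candidate_model == policy.pinned_model:
        return "no-change"
    if regression_pass_rate >= policy.min_pass_rate:
        return "promote"
    return "hold-and-keep-pinned"

# Invented example: the provider ships a new version and the regression
# suite comes back under the threshold agreed in advance.
policy = VersionPolicy(pinned_model="provider-model-2026-01", min_pass_rate=0.92)
decision = promotion_decision(policy, "provider-model-2026-03", 0.88)
```

The design choice worth noticing is that the threshold lives in the policy object, fixed before the run, so the decision cannot be renegotiated after the number is known.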

The metric that did not matter

An eval suite answers one question well: did the agent do the thing you specified? It is silent on a harder one: was the thing you specified worth doing? A suite can pass every case, the rubric calibrated and the coverage honest, and still certify a product that moves no needle anyone cares about, because the target was set before the first run and the suite only ever measured faithfulness to it. This is the failure no score catches, because it is upstream of the score. The clearest version comes from healthcare, where the metric that passed and the metric that mattered came apart in plain sight.

DAX Copilot: the right pass, the wrong target. A clinical documentation assistant, well designed for its contract: it drafts the note, the physician reviews before signing. A randomized trial of more than two hundred physicians found high adoption and warm feedback, and a real improvement in how physicians rated their documentation burden. The eval criteria were met; the contract was honored. But on the outcome the business case actually rested on, time saved, the same trial found no statistically significant change against the control.

The product worked, and it even moved a real secondary measure. But the metric the case was built on, time saved, did not move, and adoption, the metric everyone watched, was never the one that mattered. No eval suite catches this, because the suite was pointed at the wrong target before the first run. Adoption is a leading indicator of possible value, not a measure of delivered value, and the link between the contract and the outcome that proves it was delivered is the one thing no eval can own for you.

Evals are a signal; you are the gate

Engineering runs the evals. You own what success means, and that turns into a short list of things only you can hold. You define what graceful failure looks like, because an agent that cannot finish a task will either fail silently or fail legibly and only one is acceptable. You insist on the end-to-end number, run many times, not the flattering component averages. You own the coverage decision and sign off on the gaps in writing, not just the scores. You require state validation alongside semantic validation. You ask whether the judge is calibrated and refuse to read scores whose noise floor is unknown. You own the version policy, because a change in the model is a change in the product. And you set the tradeoff among cost, latency, reliability, and oversight, because if you leave it implicit the system ships to whatever the last test configuration happened to encode.

The release meeting for an agentic system looks, from the outside, identical to the one you have always run. Someone updated the board, the dev lead has a demo, the checkmarks are green. What changed is what they prove: that the agent behaved correctly in the scenarios someone thought to test, under controlled conditions, at one moment in time, against a model that may not be the one running next month. That is worth having. It is not the same as knowing the product is reliable, and the distance between the two is yours. Take the agent you are closest to shipping and find the one number everyone in the room trusts. Then ask what it would look like for that number to be green while the decision underneath it is wrong, and whether anything in your current gate would catch it.
