Part III · Failures  ·  Chapter 12

The Green Checkmark Nobody Owned

Picture a health system rolling out a care-coordination agent that has passed every offline evaluation the vendor ran and every test the integration team wrote. The checkmarks are all green. Ten weeks in, a cardiology fellow notices something they did not: the agent is sorting a class of heart-failure patients against a clinical guideline that the field revised three years earlier. The outputs are fluent, confident, formatted exactly like the examples the agent learned from, and reasoning from criteria that medicine has retired. The evaluation suite tested whether the agent’s output matched the data it was trained on. It did. The training data was the old guideline. Everything the suite was built to check, it checked, and passed, and what it proved had quietly stopped being what anyone thought it proved.

This is a chapter about a single green checkmark and how many different people have to have done their jobs for it to mean what the release meeting assumes it means. The prior book in this series treated evaluation as a two-party arrangement, and stated it cleanly: engineering runs the evals, the product manager owns what success means. That division is correct as far as it goes, and this book keeps it. But the cardiology failure did not happen because engineering ran the evals badly or because the product manager defined success badly. Engineering ran exactly the evals it was given, and they passed. The product manager’s definition of success, the agent stratifies patients correctly, was met by every measure the suite contained. The failure lived in a place neither of those two roles was looking, and the gap between them is a job, and on that team the job belonged to no one.

What the green checkmark actually certifies

Start with what an evaluation is, because the whole chapter depends on being precise about it. An eval checks whether the agent makes the decisions it was designed to make under conditions that resemble the real world. It runs against a golden dataset: a curated set of inputs paired with the outputs a team would accept, the edge cases it knows about, and the answers the agent must never give. The suite grades the agent against that dataset, and a green result means the agent’s behavior matched the dataset’s expectations.

The first crack is in an assumption nobody states: the green checkmark is only as good as the golden dataset, and the golden dataset is a human artifact that can be wrong, incomplete, or out of date. The cardiology agent passed because the golden dataset encoded the old guideline as the correct answer. The suite did not malfunction. It faithfully certified the agent against a reference that had expired, and faithfully reported green, because grading the agent against the dataset is exactly what it is built to do and the dataset is exactly what it was given. A green checkmark certifies that the agent matches the golden dataset. It certifies nothing about whether the golden dataset is right. And the question of whether the golden dataset is right is not an engineering question and not, by itself, a product-management question. It is a domain question, and it is the first of the seats the two-party model leaves empty.

There are deeper cracks inside the eval itself, and they matter because each one is a place a different person has to be paying attention. A green check on a non-deterministic agent is one sample, not a proof: run the same case again and the agent, drawing on varying retrieval and context, may answer differently, so the honest number is the pass rate across many runs and the shape of the worst slice, not a single check. End-to-end reliability compounds downward: ten steps each ninety-five percent reliable produce a system around sixty percent reliable, because the rates multiply rather than average, so a suite that tests each component in isolation has tested the components and not the system. And an agent can produce a perfectly correct-looking output on top of an action that never happened, the confirmation message for an update that no system ever received, so checking what the agent said and checking what actually changed are two different gates and a suite that runs only the first has half a gate. Each of these is real, and each of them is a discipline somebody has to own, and naming who is the work of this chapter.

Three roles where the prior book saw two

Lay the failure modes next to the people and a third seat appears between the two the old model named, and a fourth standing just behind it.

Engineering still runs the evals, in the sense of building and operating the harness, the thing that executes the suite and reports the result. That has not changed and it is real work. But sitting between engineering and the product manager is a set of responsibilities that the two-party model quietly assigned to whoever happened to pick them up, and on most teams that meant no one held them on purpose. Who curates the golden dataset, and keeps it current as the world it encodes changes. Who decides the suite runs each case ten times and reports the worst tenth rather than the average. Who validates that the automated judge, the model scoring the other model at production volume, is calibrated against outputs humans have already labeled, because a judge has documented, systematic biases, toward longer answers, toward the answer shown first, toward its own model family, and an uncalibrated judge has a noise floor no one has measured. Who writes the coverage statement that says which scenarios were tested and, the line teams omit, which were knowingly left out. Who treats a vendor’s silent model update as the deployment it is and re-runs the regression suite when the ground shifts. None of those is harness-building, and none of them is defining what success means at the product level. They are their own discipline, the discipline of making the green checkmark trustworthy, and that discipline is a role: the eval owner.

And behind the eval owner stands the domain expert, owning the one thing neither engineering nor the eval owner nor the product manager can supply: whether the answer the golden dataset calls correct is actually correct in the world. The prior book drew this line precisely, structural correctness an eval can certify, domain correctness only an expert can, and then, writing from the product manager’s seat, folded the consequence back onto the product manager anyway. But a product manager cannot know that a heart-failure guideline was revised three years ago. A cardiologist can, and only a cardiologist can, and the agent had been running for ten weeks against a retired standard precisely because the person who could see it was not in the loop that built or maintained the golden dataset. Domain correctness is not a thing the product manager owns and delegates. It is a thing the domain expert owns, and the team either built a seat for that ownership or discovered its absence the way this health system did, by luck, ten weeks late, when a fellow happened to read the output closely.

The seam is the checkmark itself

So return to the green checkmark and watch how many hand-offs it silently depends on, because this is the shape of the failure and it is worth seeing whole.

The product manager defined success: the agent must stratify patients correctly. Correct, and not enough on its own, because “correctly” points at a standard the product manager cannot themselves verify. Engineering built the harness and ran the suite, faithfully, against the golden dataset it was handed. The golden dataset encoded a clinical guideline that someone, at some point, had captured as the correct answer, and that capture was never owned by anyone whose job was to keep it current, so it aged silently while everything around it stayed green. The domain expert who could have caught the expired guideline was consulted at the start, maybe, or maybe not, but was certainly not a standing owner of the dataset’s correctness, because the two-party model had no seat for that ownership and so the responsibility evaporated into the gap between “engineering runs it” and “the PM owns success.” Every individual did their job. The checkmark went green. And the checkmark was a lie that no single person told, assembled out of a correct harness, a faithful suite, an expired dataset, and a missing owner, each of which was fine on its own.

That is the seam: one role produces a signal, other roles depend on it, and the failure lives in a hand-off no one believed was theirs. The product manager trusted the checkmark because engineering produced it. Engineering produced it faithfully because the dataset said so. The dataset said so because no one owned keeping it honest. The green checkmark is not a fact the team can read off a dashboard. It is the end of a chain of custody, and a chain of custody is only as strong as its weakest owned link, and the link nobody owns is the one that breaks. On a team that has closed this seam, a green eval is not done when the suite passes; it is done when the eval owner can state the pass rate and the judge’s calibration and the coverage gaps, and the domain expert has signed that the golden dataset still reflects the world, and only then does the product manager read the green checkmark as meaning what the release meeting needs it to mean.

Four hands behind one checkmark

It takes four people to make a green checkmark mean what the release meeting assumes it means, and a checkmark with any one of them missing certifies less than the room thinks while looking exactly as green.

Engineering owns the harness: building the eval infrastructure, running the suite, producing the number. The eval owner owns the trustworthiness of that number: the golden dataset’s construction and upkeep, the pass-rate discipline rather than the single run, the judge’s calibration against a human-labeled set with a threshold committed before the result is seen, the coverage statement including the deferred scenarios, and the treatment of every model update as a deployment that triggers a regression run. The domain expert owns correctness: whether the answers the golden dataset blesses are right in the world, and whether the standards it encodes are current, which is the one judgment no amount of eval engineering can manufacture and the one the cardiology case rested on. And the product manager owns the gate: what success means, what graceful failure looks like, which tradeoffs among cost and latency and reliability the product will accept, and the decision to ship or not, made on a green checkmark they now have reason to trust because three other people stand behind it.

There is one discipline that binds the four, and it is the eval owner’s to drive: the green checkmark must come with its provenance, not just its color. A pass that arrives as a single green light is a pass with its chain of custody hidden, and a hidden chain of custody is where the expired guideline lives. A pass that arrives with its pass rate, its worst slice, its judge calibration, its coverage gaps, and a domain sign-off is a pass the product manager can actually gate on, because each link in the chain has a name attached. The difference between those two checkmarks is invisible in the release meeting, where both render as the same green. It becomes visible ten weeks later, in a cardiology ward, when someone reads the output closely enough to notice what the checkmark never checked. The clinical-AI literature has a name for the failure this prevents, the attributability gap, the situation where a system made a consequential call and no one can say which judgment, whose threshold, which sign-off it actually rests on, so that when it is wrong the accountability bounces from hand to hand and lands nowhere. A checkmark that carries its provenance is the audit surface refusing to let the accountability bounce: every link in the chain has a name, so the question “who decided this” has an answer before anyone has to ask it under oath.

Take the agent your team is closest to shipping and find the one eval number everyone in the room trusts. Then ask, of specific people and not the dashboard: who keeps the golden dataset current, who calibrated the judge, who signed that the answers it grades against are still correct in the world, and what it would look like for that number to be green while the decision underneath it is wrong. If the answer to any of those is that no one owns it, you have not found a gap in the test plan. You have found the cardiology ward, ten weeks early, while there is still time to put a name on the link that breaks. That question, and the others each chapter in this part leaves you with, are gathered into one instrument at the back of the book, the seam audit, to be run on one agent in one sitting.