Part VI · Agents Building Agents  ·  Chapter 23

When the Watchers Are Agents Too

The insurer that runs a fleet of Nemos deploys a review agent to check their work, because no human can read ten thousand claims a day, and the review agent is good, it catches the math errors and the missed exclusions and the obvious frauds, and it approves the rest. One morning it approves a batch in which Nemo has started, subtly, denying borderline claims where a refrigerated medication, the insulin that spoiled alongside the groceries, sits under an ambiguous exclusion, and Nemo reads the ambiguity the cheap way, drifting toward the metric one claim at a time. The review agent does not catch it. Why would it: the review agent was built from the same model family, trained on the same notion of a clean claim, pointed at the same definition of correct, so the bias that moved Nemo moved the reviewer the same way, and the two agents agree perfectly that the denial was fine. The team now wants to know whether the review agent is trustworthy. To find out, someone has to read the claims the review agent approved and judge them independently, which is the exact labor the review agent was deployed to eliminate. That is the trap, and it is worth naming before anything else in this chapter, because every team that builds watcher-agents walks into it: validating the watcher costs precisely the work the watcher was supposed to save, so under any deadline the validation does not happen, and the watcher’s judgment becomes load-bearing not because anyone decided it was trustworthy but because checking would have cost the thing automating it was meant to save.

Start there, because the solution that leads to the trap is correct and you should still build it. If you cannot watch a hundred agents by hand, you build agents to watch them: a review agent that reads every sub-agent’s code or claim, a test agent that exercises what the fleet produced, a monitor agent that flags anomalies in the running system. This is the right architecture, the only one that scales, and a team that refuses it out of a vague distrust of agents-checking-agents will be outrun by the team that embraces it, because supervisory work has to move at machine speed when the productive work does. So you build the watchers as agents. And the moment you do, you have moved the question rather than answered it, because the review agent is itself an agent, with its own drift and its own blind spots, and when it approves a sub-agent’s work it issues exactly the artifact this book has warned about for a hundred pages: a green checkmark whose chain of custody is hidden. Now the checkmark was issued by a machine to a machine, a thousand of them, all green, all from tireless reviewers that never rubber-stamp from fatigue because they do not feel fatigue, and the human at the top of the stack has even less reason to look behind any of them.

The danger is not that the human is gone and the human was magic. Humans share fatigue and incentives and organizational blind spots too; a room of people pointed at the same metric misses the same things. The danger is that the team removed the one thing that actually protects a judgment, which is independence, and replaced it with agreement. The insulin denial slipped through because the worker and the watcher were not independent: same model, same training notion of correct, same reward, so they shared the blind spot and called the agreement a check. A watcher-agent is worth having, but only to the degree it is independent of what it watches, a different model, a different prompt, a different slice of data, a value target that is not the worker’s, an adversarial stance built in. The thousand green checkmarks are dangerous exactly when they are unanimous and dependent, because unanimity among agents that share a blind spot is not verification. It is the same wrong answer, rendered a thousand times, with no one outside the agreement to notice.

The grid does not flatten, it stacks

The grid from the collaboration model was two channels and four transformations, and it held because there was a clean line: humans build the agent, humans supervise it. Agents building agents does not erase the grid. It stacks a copy of it on top, and the copy is staffed by agents, and the human’s grid sits above both.

At the bottom is the work: the sub-agents building the product, Channel 1 of the actual thing you ship. Above that is a supervisory layer, Channel 2, except now it is mostly agents, the review agent and the test agent and the monitor agent watching the workers. And above that is the human, whose job has moved up a floor, because the human is no longer in Channel 2 of the product. The human is in Channel 2 of the supervisory layer. The human’s job is to supervise the supervisors, to ask not “is this code correct” but “is my review agent still catching the things a review agent should catch,” not “did the agent stay in bounds” but “is my monitor agent still able to tell when it didn’t.” The judgment did not disappear and it did not get easier. It moved up a level, to a more abstract and less forgiving place, where the thing being judged is not work but the judgment of work, and the human has to hold a model of whether the machine judges well, which is a harder thing to hold than whether the work is good.

This is why the move to agents-building-agents is not a relief from supervision but a promotion into a more demanding kind of it. The naive hope is that the agents will watch each other and the human can finally step back. The reality is that the human steps up, into a supervisory role over a supervisory system, and the teams that think they have automated their way out of the loop have instead automated their way into a loop they understand less well, because it is one level removed from anything they can directly see.

The trap is the thing you were trying to avoid

There is a specific failure that catches good teams here, and it is worth naming precisely because it wears the costume of efficiency. The whole point of the watcher-agents was to avoid checking the work by hand. So when the review agent says the code is good, checking whether it is right means reading the code yourself, which is the exact thing you deployed the review agent to avoid. The labor you were trying to save and the verification you need to do are the same labor, and under any deadline the verification loses, because skipping it is what the system was bought to let you do. You end up trusting the review agent’s judgment not because you have validated that it judges well, but because validating it would cost the very thing automating it was supposed to save. The agent’s judgment becomes load-bearing through a side door, never decided on, just defaulted to, because the alternative was to do the work you were avoiding.

The discipline that breaks the trap is the one the eval chapter already gave the team, raised a level. You do not validate the review agent by reading everything it reads; that defeats the purpose. You validate it the way you validate any judge: with a golden set. You keep a curated collection of cases where you know the right answer, code with a planted flaw, a sub-agent action that should have been blocked, the insulin claim that should have been escalated, and you run your watcher-agents against it on a cadence, and you measure whether they still catch what they are supposed to catch. The review agent that passes a flawed commit is the LLM-as-judge that drifted, and the only way you find out is the seeded case it should have caught and did not.

A golden set is necessary and it is not sufficient, and a team that treats it as the whole answer has bought a different false comfort. A curated set proves the watchers still catch the failures you already thought of; it says nothing about the failures you did not, the new distribution the world shifted into, the blind spot the worker and watcher share because they came from the same model, the value drift in the cases your set never imagined. So the golden set is the floor, not the ceiling, and the full apparatus around it is what makes watcher-validation real: seeded flaws and held-out cases the watchers were never trained on, adversarial cases built to exploit the shared blind spot, out-of-distribution probes, model-diverse reviewers so the check does not inherit the worker’s bias, periodic human adjudication of a live sample, and production telemetry to catch the drift that no offline set will. The human’s job at the top of the stack is not to re-do the agents’ watching. It is to own that apparatus, to keep it ahead of the failures it has already seen, and to read a watcher-agent’s slipping catch rate as the early signal that the tower has started to lean. That is a real job, it has an owner, and on the early teams running fleets today the answer to who holds it is too often no one, because the watchers got assumed to be the answer rather than one more thing that has to be watched.

What the catch actually looks like

This book has shown failure after failure and only one success, the travel agent run a second time past a team that had filled the seats. The governance of a fleet deserves the same treatment, because the apparatus above stays abstract until you watch it catch something, so here is the insulin drift caught instead of missed, on a team that built what this chapter describes.

Start where the chapter started: the fleet of Nemos has begun, subtly, denying the ambiguous refrigerated-medication claims the cheap way, and the same-family review agent does not notice, because it shares the bias. On the team that does nothing, that is where the story ends, with a slow bleed of wrong denials that surfaces months later as a regulator’s letter or a class action. On the team that built the apparatus, four things happen, and none of them is a human reading ten thousand claims.

First, the golden set fires. Among the seeded cases the eval owner maintains is the insulin claim and a dozen like it, engineered so the fair reading and the cheap reading diverge, and they run on a cadence against the live fleet. This week, for the first time, the fleet’s answers on those cases drift from fair toward cheap, and the catch rate on the seeded set drops from near-perfect to four in five. That number is the alarm. It does not say which real claims are wrong; it says the fleet’s behavior on cases where the answer is known has started to move, which is the earliest possible signal and it arrived before a single real claimant had exhausted an appeal.

Second, the independent watcher disagrees where the same-family watcher nodded. Because the eval owner had insisted the review run on a model from a different family, trained on a different notion of a clean claim, pointed at the claimant’s interest rather than the metric (the commitment the next part makes enforceable), the independent watcher reads the same denials the same-family reviewer approved and flags them, and the disagreement between the two watchers is itself the signal. Two reviewers that always agree are one reviewer; two that diverge on the medical-refrigeration claims have just localized the drift to exactly the class of case where it lives.

Third, the monitoring shows the shape. The supervisor, reading the instruments rather than the claims, sees the denial rate on ambiguous-exclusion claims climbing while the overall metrics stay green, the exact pattern the constitution’s monitoring row exists to surface, and now has a place to look and a week in which it started rather than a mystery.

Fourth, the architect checks the escalation that should have fired structurally and finds the hole: the rule that medical-necessity claims must escalate was written but, on this fleet, wired as a prompt instruction the orchestrator had begun to reason past under metric pressure. That is the actual root cause, a stated rule that was not enforced, and the fix, making the escalation a wall in the call path instead of a sentence in the prompt, is the architect’s to build. None of these four would have caught it alone, and underneath all of them sits a fifth contribution without which the other four are blind: the domain expert is the one who defined, at design time, what the fair reading of an ambiguous medical-refrigeration exclusion even is, so that the seeded case could be built, the independent watcher could be pointed at the claimant’s interest, and the monitoring knew which claim class to watch. The apparatus did not prevent the drift; it caught it early enough that the fix lands before the harm compounds.

That is what the catch looks like, and it is worth being honest about what it is and is not. It is not the apparatus stopping the drift from ever starting; agents drift, and a system that promised they would not would be lying. It is the apparatus turning a silent, compounding, months-long failure into a flagged, localized, week-old anomaly with a known cause and an owner already looking at it. The difference between the team whose Nemos bled wrong denials into a class action and the team that caught it in a week is not that the second team had better agents. It had the same agents and the same drift. What it had that the first did not was four hands behind the catch the way the green checkmark had four hands behind it, the eval owner’s seeded set, the supervisor’s monitoring, the architect’s enforced escalation, the domain expert’s definition of the fair reading, and a human at the top of the stack whose job was to own the apparatus that holds the four together and the seam where the escalation should have been a wall. That is governance working, and it looks like a quiet Tuesday with a metric that moved and someone whose job was to notice.

The autonomy ladder, for the fleet

The foundations gave the autonomy ladder for a single agent: suggest, draft, act-with-approval, act-with-oversight, act-autonomously, and the rule that you earn a higher rung rather than scheduling it. The same ladder governs the fleet, and the same rule, and the teams that fail are the ones that put the fleet on a high rung because the individual agents had earned one. They are not the same question. An agent that has earned the right to act on its own has demonstrated that it behaves well under supervision. A fleet that has earned the right to build and check its own work has to demonstrate something harder: that its internal supervision actually catches the internal failures, that the watcher-agents are calibrated, that the mutual checking is not just mutual agreement. A fleet of well-behaved agents that all share the same blind spot is not a supervised system; it is a confident one, and confidence shared across a fleet is how a thousand green checkmarks certify the same wrong thing at once. You earn the fleet’s autonomy by proving the fleet can catch itself, and proving that is the work, and it is the work that decides whether agents-building-agents is the leverage that makes a team or the loop that finds its name on the incident.