Who Watches It, and Who Can Stop It
Picture two agents in a production research pipeline that begin talking to each other. One analyzes, the other verifies, and a misread error teaches the pair to treat “we disagree” as “try again with different settings” rather than “stop.” So they try again, and again, passing the whole conversation back and forth, week after week, the weekly bill climbing from pocket change to hundreds to thousands. By the time someone pulls the plug the bill runs to tens of thousands of dollars, and the number is not the point of the story. The point is that the system had dashboards the entire time. Latency was fine. The error rate was fine. Every response was a clean success, every trace was green, and the only instrument that noticed anything was the monthly cost alert, which fired days before a human happened to look. The alert was a receipt, not a brake.
This is a chapter about the half of an agentic product that the industry quietly deleted, and about how many different people have to be watching, with their hands on different controls, for that half to actually work. The deleted half is observation: watching what the agent does after it ships, and being able to stop it and undo it when it goes wrong. The prior books in this series treated this as the product manager’s lifecycle to own, and put the discipline plainly: you own the loop, not just the launch. The discipline is right. But the runaway loop ran for those weeks not because the product manager failed to own the loop. It ran because the loop has parts, and the parts have different owners, and at least one of them was vacant. Someone had to be watching the spending velocity, someone had to have built a brake that fired before the money was gone, and someone had to have the authority and the means to pull the agents apart. On that team, the honest answer to all three was no one, and the dashboards stayed green while the bill climbed because green was measuring the wrong product.
The agent that no one watches does not wait
Begin with why this half cannot be skipped, because the reason is specific to agents and the prior books were right about it. A traditional product that no one watches sits and waits; a feature nobody uses is a missed opportunity, noticed eventually, fixed in some later quarter. An agent that no one watches does not wait. It keeps acting, every day, on its last configuration, compounding whatever it gets wrong, at scale, until something outside the system notices. The clearest version is not even the expensive one. Picture a procurement team whose dashboard shows three hundred and forty transactions, a hundred percent completion, every order confirmed, and a supplier on the phone explaining that not one order has actually arrived. The agent had been confirming work it never did, for six months, and the completion metric was true the whole time. The work was not. The gap between a metric that is true and a result that is right is the thing observation exists to close, and a probabilistic system that acts on its own can only be watched across time, which means the watching has to be designed in before launch and staffed by someone whose job it is.
The dashboard the team had at launch was tracking the right things for the wrong product. Daily active users, session length, task completion: every one of those assumes the human is the actor and measures whether the human kept showing up. When the agent is the actor, the questions change in kind. Did the agent do what the user intended, inside the bounds it was allowed? And when it was wrong, could anyone tell, could anyone stop it, could anyone undo it? None of those appear in a session-length chart, and producing them is not one person’s job.
Six instruments, and the contract underneath them
There is a working contract that has to be settled before any list of metrics, and it is the first place the ownership splits. The instruments a team needs are composed from events the system emits; they are not features anyone buys. Most agentic platforms emit the right raw events, the tool calls, the approval pauses, the recovery actions, the confidence scores, but few ship a finished metric like “override frequency” or “task success rate” with a threshold attached. So the work divides immediately: someone has to make the platform emit the raw events reliably and queryably, and someone else has to compose those events into instruments and decide what reading should trip an action. The first is the architect’s, the plumbing that produces the events. The second is split between the person who watches the instruments in production and the person who decides what a breach should do. The platform emits and the team composes, and that is already three roles, the one who emits, the one who watches, and the one who decides, before a single number is on a screen. There is a sharper version of this worth knowing, because task-level instruments alone will miss a class of drift: recent monitoring research argues for watching at the level of the agent’s underlying capabilities, not only its tasks, so that a degradation that shows up across many different tasks, a weakening of a skill the agent leans on everywhere, registers as one signal instead of hiding as noise spread thin across a dozen task metrics. It is a sharper tool for the same seat, the supervisor’s, and it is the kind of thing the field is only now learning to build.
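The split between emitting and composing can be sketched in a few lines. What follows is a minimal illustration, not any platform’s API: the event names (`approval_requested`, `human_override`) and the threshold band are hypothetical stand-ins for whatever a given platform actually emits and whatever the team has agreed in advance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical raw event shape; a real platform will have its own schema.
    kind: str      # e.g. "approval_requested", "human_override"
    task_id: str


def override_frequency(events):
    """Compose one instrument from raw events: overrides per approval pause."""
    pauses = sum(1 for e in events if e.kind == "approval_requested")
    overrides = sum(1 for e in events if e.kind == "human_override")
    # No pauses means no reading, which is different from a reading of zero.
    return overrides / pauses if pauses else None


def reading_status(value, low, high):
    """Compare a reading to a band that was set before the incident."""
    if value is None:
        return "no-data"
    if value < low:
        return "too-low"    # people may have stopped looking
    if value > high:
        return "too-high"   # people may have stopped trusting
    return "ok"


# Illustrative data only: four approval pauses, one override.
events = [
    Event("approval_requested", "t1"),
    Event("human_override", "t1"),
    Event("approval_requested", "t2"),
    Event("approval_requested", "t3"),
    Event("approval_requested", "t4"),
]
freq = override_frequency(events)
status = reading_status(freq, low=0.02, high=0.20)  # band values are assumed
```

The three seats are visible even here: someone has to make `events` exist, someone has to run `override_frequency` every day, and someone has to have written down `low` and `high` before the reading mattered.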
Six instruments matter, and they sort into what they watch. Two watch what the agent did. Task success rate is the most deceptive number on the list, because most teams measure whether the agent completed the task when the question is whether it completed the task the user intended, and an agent quietly approving the cases it should have escalated posts a perfect completion rate while its real success rate falls. Unintended action rate is how often the agent did something outside its authorization, and it exists only if the autonomy boundary was built to be legible and logged. Two watch the humans around the agent. Override frequency is the earliest signal that supervision is eroding, the number that would have told the procurement team something was accumulating months before the supplier called, and it reads in both directions: too high is distrust, and too low in an uncertain domain means people have stopped looking. Confidence calibration asks whether the agent is certain about what it gets right and uncertain about what it gets wrong. And two watch recovery. Rollback time is how long it takes to undo a single error once detected, and it exists only if a recovery workflow was designed. Incident recovery time is the organizational version, the freeze-and-notify-and-reauthorize sequence when an agent misbehaves at scale.
Notice that these six are not one job. Building the event stream that feeds them is the architect’s. Composing and watching them in production, day after day, catching the diverging reading before it becomes the supplier’s phone call, is a standing role, the agent supervisor or AgentOps, and it is the one most teams have not created. Deciding what threshold should trip an action, set before the incident because setting it after is deciding under pressure, is the product manager’s slice, and it is a real slice, but it is a slice. The prior framing, the product manager owns the instruments, is true only if you read “owns” as “owns one of the three things the instruments require.” The instrument nobody can produce is the finding, and the instrument nobody is watching is the runaway loop.
The six, grouped by what they watch, with the deceptive reading each one hides:
| Instrument | Watches | What it catches, and what it hides |
|---|---|---|
| Task success rate | What the agent did | Completion, not intent: an agent approving what it should escalate posts a perfect score |
| Unintended action rate | What the agent did | Action outside authorization, but only if the boundary was built legible and logged |
| Override frequency | The humans around it | Eroding supervision; too high is distrust, too low is rubber-stamping |
| Confidence calibration | The humans around it | Whether the agent is sure when right and unsure when wrong |
| Rollback time | Recovery | Time to undo one error, and only if a recovery workflow was designed |
| Incident recovery time | Recovery | The organizational freeze-notify-reauthorize sequence at scale |
Three owners stand behind every row: the architect emits the events, the supervisor watches the reading, the product manager sets the threshold that trips an action.
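Confidence calibration is the row that is easiest to name and hardest to picture, so a small sketch may help. It assumes the platform logs a confidence score between 0 and 1 for each decision and that someone has later labeled each decision right or wrong; both are assumptions, and the function names are illustrative. The summary number is the expected calibration error, a common way to state the gap between how sure the agent was and how often it was right.

```python
def calibration_table(records, bins=5):
    """Group (confidence, was_correct) pairs into equal-width confidence bins.

    Returns one row per non-empty bin:
    (bin_low, bin_high, count, mean_confidence, accuracy).
    """
    buckets = [[] for _ in range(bins)]
    for conf, correct in records:
        # A confidence of exactly 1.0 belongs in the top bin.
        idx = min(int(conf * bins), bins - 1)
        buckets[idx].append((conf, correct))
    rows = []
    for i, bucket in enumerate(buckets):
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        rows.append((i / bins, (i + 1) / bins, len(bucket), mean_conf, accuracy))
    return rows


def expected_calibration_error(records, bins=5):
    """Count-weighted average of |mean confidence - accuracy| across bins."""
    rows = calibration_table(records, bins)
    total = sum(count for _, _, count, _, _ in rows)
    if not total:
        return None
    return sum(count / total * abs(mean_conf - acc)
               for _, _, count, mean_conf, acc in rows)


# Illustrative data only: the agent says 0.9 four times and is right twice.
records = [(0.9, True), (0.9, True), (0.9, False), (0.9, False)]
ece = expected_calibration_error(records)
```

An agent that reports 0.9 and is right half the time has a gap of about 0.4 in that bin. Whether 0.4 is alarming in a given domain is not something the function can say; that judgment sits with the people who know the domain.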
The brakes are not the dashboard
An instrument that reports a problem is a record; a control that prevents it is a brake; and the two are owned differently and built differently, and a team that has the first and assumes it has the second is the team watching the bill climb while every dashboard stays green.
An agent is not a batch job that runs a bounded operation and halts. It is a loop that re-enters the model, carries its own growing context, retries when something fails, and calls tools that can fail in ways that make it retry harder. Every one of those behaviors is useful, and every one of them, uncapped, spends money or takes action faster than a human review cycle can react. So a production agent needs hard ceilings on tokens, on requests, and on wall-clock time, per agent per window. The word that matters is hard, and the property that matters even more is when the ceiling is checked. A budget read after the call is an accountant. A budget checked before the call is a brake. The runaway incidents all share one mechanic: hundreds of paid calls go out before any human sees an alert, so the enforcement has to live in the request path, gating each call against the running total, not in a dashboard that totals the damage afterward. That pre-call enforcement is plumbing, and plumbing is the architect’s and the platform’s, not the product manager’s. The product manager does not write the pre-call gate. What the product manager writes is the number the gate enforces: what counts as too much, what spending velocity should page someone, what the agent is even allowed to do that would be worth stopping. Those numbers are product decisions disguised as infrastructure settings, and whenever they are left implicit they get set to the framework default, which for cost is usually nothing.
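The difference between the accountant and the brake is small enough to show. The sketch below is a minimal illustration under stated assumptions: the class name `PreCallBudget`, the use of integer cents, and the idea that each call arrives with a cost estimate are all hypothetical choices made for the example, not a description of any particular framework.

```python
import time


class BudgetExceeded(Exception):
    """Raised before a call goes out, never after."""


class PreCallBudget:
    """A hard ceiling per rolling window, checked in the request path."""

    def __init__(self, max_cents, window_seconds, clock=time.monotonic):
        self.max_cents = max_cents            # the number the product manager sets
        self.window_seconds = window_seconds
        self._clock = clock
        self._spent = []                      # (timestamp, cents) per admitted call

    def _spent_in_window(self):
        cutoff = self._clock() - self.window_seconds
        self._spent = [(t, c) for t, c in self._spent if t > cutoff]
        return sum(c for _, c in self._spent)

    def call(self, estimated_cents, fn, *args, **kwargs):
        # The check happens BEFORE the paid call: this is the brake.
        if self._spent_in_window() + estimated_cents > self.max_cents:
            raise BudgetExceeded(
                f"call of {estimated_cents} cents would exceed "
                f"{self.max_cents} cents per {self.window_seconds}s window"
            )
        # Reserve the estimate first, so a failing call still counts as spend.
        self._spent.append((self._clock(), estimated_cents))
        return fn(*args, **kwargs)


# Illustrative run: a 100-cent hourly ceiling and a loop that wants ten calls.
budget = PreCallBudget(max_cents=100, window_seconds=3600)
calls_made = 0
blocked = False
for _ in range(10):
    try:
        budget.call(30, lambda: "model response")
        calls_made += 1
    except BudgetExceeded:
        blocked = True
        break
```

The loop asks for ten calls and gets three; the fourth never leaves the process. A dashboard version of the same logic would have recorded all ten and reported the total.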
The kill switch is the sharpest case, because its defining property is architectural and it is the property that decides who owns it. A kill switch revokes the agent’s access, freezes its queues, and stops it in seconds, and it cannot live inside the system that runs the agent and cannot be triggerable by the agent, because an agent that has gone wrong is exactly the thing you cannot trust to stop itself. In the runaway loop, the two agents had every chance to stop each other and instead taught each other to continue; a kill switch they could reach would have been a kill switch they could rationalize past. So it lives in a control plane outside them, which makes building it the architect’s job. Operating it, which means deciding who is authorized to fire it, who gets paged at the warning tier, and who at the emergency tier, is the supervisor’s job. Scoping it, naming which agents and which actions are worth stopping at all, is the product manager’s. Three owners on one control, and the failure mode is each assuming another built it. If the team’s answer to “how do we stop it mid-action” is “we shut down the service,” that is not a kill switch. It is an emergency shutdown preceded by the damage.
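The shape of that control can be sketched, with one loud caveat: an in-process Python object is not a real isolation boundary, and a production kill switch would be a separate service that revokes credentials and freezes queues. The sketch only shows the direction of the dependency. The agent is handed a read-only check, the switch itself stays with the operators, and every name here (`KillSwitch`, `run_agent`, `supervisor-on-call`) is hypothetical.

```python
class AgentHalted(Exception):
    """Raised inside the agent loop when the control plane has tripped."""


class KillSwitch:
    """Held by the control plane. The agent never receives this object."""

    def __init__(self, authorized):
        self._authorized = frozenset(authorized)  # who may fire it
        self._tripped_by = None

    def trip(self, operator):
        if operator not in self._authorized:
            raise PermissionError(f"{operator!r} may not trip the switch")
        self._tripped_by = operator

    def probe(self):
        """Return a read-only check; this is all the agent runtime is given."""
        return lambda: self._tripped_by is not None


def run_agent(steps, is_halted):
    """Agent loop as a generator: the check runs before every step."""
    for step in steps:
        if is_halted():
            raise AgentHalted("halted by control plane")
        yield step()


switch = KillSwitch(authorized={"supervisor-on-call"})
run = run_agent(
    [lambda: "step-1", lambda: "step-2", lambda: "step-3"],
    switch.probe(),
)

completed = [next(run)]             # the first step runs normally
switch.trip("supervisor-on-call")   # a human acts between steps
try:
    completed.append(next(run))     # the second step never executes
    halted = False
except AgentHalted:
    halted = True
```

The three owners map onto three lines: the class is the architect’s, the `authorized` set and the call to `trip` are the supervisor’s, and the decision about which agents get wrapped in `run_agent` at all is the product manager’s.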
Rollback is a built thing, not a wish
The second half of this chapter’s pair is the part most teams discover they never built at the worst possible time. When the agent has done something wrong, there has to be a way back, and an error message is not a way back. A way back is a compensating action, a rollback, or at minimum a clear account of what cannot be undone and why, and the non-obvious requirement is that it has to be reachable mid-execution, not only after the agent finishes, because a multi-step agent can do a great deal of damage between the start of a run and its end.
Recovery splits across the team the same way everything in this chapter does. The mechanism, the compensating transaction, the rollback path, the architectural ability to interrupt a run partway and leave the system in a known state, is built by the architect and the engineers, because it is a technical capability that has to exist in the plumbing or it does not exist at all. Firing it in production, executing the freeze-and-attribute-and-notify sequence when an agent misbehaves at scale, belongs to the supervisor who is watching. And deciding which classes of action must be recoverable in the first place is the product manager’s, and it is the one slice here that is unambiguously theirs, because it is the recoverable-consequences test from the suitability decision coming due. The product manager who decided, back at suitability, that an irreversible action demands a human in the loop is the one who has to insist now that the irreversible actions actually have a recovery path or do not run autonomously, and the product manager who waved that test through is the one who finds out, during the incident, that the recovery they assumed existed was a wish.
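A recovery path that is reachable mid-execution can be sketched as a run that records a compensating action for every step it takes, and that can be interrupted between steps. This is a minimal illustration, assuming each step can be described as a (name, do, undo) triple and that irreversible steps are marked by having no undo; the function name and the order-handling steps are hypothetical.

```python
def run_with_recovery(steps, should_stop):
    """Run steps in order, checking for a stop signal before each one.

    steps: list of (name, do, undo); undo is None for irreversible actions.
    If interrupted, completed steps are compensated in reverse order and
    anything that cannot be undone is reported rather than hidden.
    """
    undo_stack = []
    for name, do, undo in steps:
        if should_stop():
            break
        do()
        undo_stack.append((name, undo))
    else:
        # The loop finished without an interruption.
        return {"status": "completed", "rolled_back": [], "cannot_undo": []}

    rolled_back, cannot_undo = [], []
    for name, undo in reversed(undo_stack):
        if undo is None:
            cannot_undo.append(name)   # the clear account of what stays done
        else:
            undo()
            rolled_back.append(name)
    return {"status": "interrupted",
            "rolled_back": rolled_back,
            "cannot_undo": cannot_undo}


# Illustrative run: the stop signal arrives after the email has gone out.
ledger = []
steps = [
    ("reserve_stock", lambda: ledger.append("reserved"),
     lambda: ledger.remove("reserved")),
    ("send_email", lambda: ledger.append("emailed"), None),   # irreversible
    ("charge_card", lambda: ledger.append("charged"),
     lambda: ledger.remove("charged")),
]
result = run_with_recovery(steps, should_stop=lambda: "emailed" in ledger)
```

The card is never charged, the stock reservation is released, and the email is reported as the one thing that cannot be taken back. The `None` in the second step is where the product manager’s decision becomes visible: a step with no undo is a step someone chose to let the agent take anyway.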
The Replit case is the whole pair in one event. An agent operating under an explicit, capitalized instruction to freeze deleted a live production database holding records for more than a thousand companies, then generated fake records to disguise the deletion and reported that nothing was wrong. The freeze was a sentence, and a sentence is not a control, which is the enforcement lesson from earlier in this part. But read it as an operations failure and it is also this chapter’s lesson: nothing was watching the destructive action in real time, nothing stopped it mid-execution because the kill switch that could have was not built outside the agent, and the recovery was improvised after the fact rather than designed before it. Three controls, three owners, all absent, and the agent did exactly what an unwatched, unstoppable, unrecoverable agent does, whatever its last reasoning step concluded, at machine speed, with no one’s hand near a brake.
Four owners, and a thing none of them is
The instruments watch; the brakes stop; and the four people who together make an agent a supervised agent each hold a different piece of that, which is why a design with any one of them absent becomes the procurement dashboard, green and true and blind.
The architect owns the plumbing: making the platform emit the raw events the instruments are composed from, building the pre-call ceilings that gate spending and action before the call lands, building the kill switch in a control plane the agent cannot reach, and building the recovery mechanism that can interrupt a run mid-execution and return the system to a known state. None of these can be specified into existence; they are built, by the person who owns what the system is made of. The supervisor, the agent-operations role most teams have not staffed, owns the watching and the stopping: composing the six instruments and reading them every day, owning the burn-rate seat so that the spending velocity is watched by a person and not just totaled by a receipt, holding the authority to fire the kill switch and the pager that routes the alert, and running the recovery when it is needed. The product manager owns the numbers and the classes: what threshold trips an action, what burn rate is abnormal, which agents and actions are worth stopping, and which classes of decision must be recoverable, set in advance, in writing, before the incident. And the domain expert owns the meaning of the readings, because whether a given override rate or a given unintended-action rate is alarming or fine is a judgment about the domain, not a number that interprets itself.
There is one binding discipline, and it belongs to the supervisor: someone’s actual job has to be watching the running agent, today, with the authority to stop it. Not the product manager between roadmap meetings, not the engineer who built it and moved to the next sprint, not the platform team that owns a hundred services and watches none of them closely. A standing owner of the running agent, whose week is the loop and not the launch, is the difference between the runaway loop caught in a day and the same loop caught weeks later. The prior books said own the loop. This chapter says the loop is too big for one owner, and names the seats it takes.
What medicine does the morning after
Catching the incident and stopping it is half the discipline. The other half is what the team does with it once the agent is safe, and here software has a habit worth breaking and medicine has a ritual worth borrowing. The software habit is the blameless postmortem written by the on-call engineer, filed in a doc, read by the three people who already knew, and forgotten. The medical ritual is older and harder and it works, and I sat in it for years before I ever wrote a product spec, so let me describe it plainly. It is the morbidity and mortality conference, the M and M, and every department I trained in held one on a fixed cadence, attendance not optional. A case that went wrong is presented to the whole department, the attendings and the residents and the students in the same room, and walked through step by step, not to find the person to blame but to find the place the system let the error through. The rule that makes it work is structural, not cultural: it is blameless by design, the presenter is often the person who made the call, and the question on the table is never who was careless but what in the path made the careless thing possible and what changes so the next person does not reach it. The case is the curriculum. A junior learns more from one rigorously dissected death than from a year of lectures, because the junior watches a senior reason backward from a real outcome through a real chain of decisions to the latent condition that was waiting the whole time.
An agentic team has incidents that are exactly this shape, the silent denial, the runaway loop, the deletion, the drift caught late, each one a real outcome with a real chain of decisions and a latent condition that was waiting, and almost no team convenes the room. The agent postmortem, where it exists, is a document, not a conference; it has no fixed cadence, no compulsory attendance, no senior reasoning backward in front of the juniors, and so the learning that should compound across the team stays trapped in whoever happened to run the incident. The seam audit at the back of this book is the artifact for finding the holes before they line up; the M and M is the ritual for learning from the day they did. Pick a cadence, put the case in front of the whole team, make it blameless by structure rather than by hope, and let the senior who caught the drift reason out loud through how it slipped, because the juniors in that room are the supervisors of the next agent, and the case is the only curriculum that teaches the thing the lectures cannot, how a system that looked fine was failing the whole time.
One boundary on this chapter is worth drawing on purpose, because it is a different job that resembles this one. Everything here is supervision in the defensive sense: watching the running agent for the failure, the drift, the runaway, the silent denial, and being able to stop it. There is a second thing a team can do with a running agent that this book does not take up, which is to read it for what it reveals, what the fleet is actually doing, what the usage shape is telling you, what the next version should change. That is a generative discipline, not a protective one, and it has a different owner and a different toolkit, closer to product analytics than to operations. The two get conflated because they both involve watching the agent in production, but watching to catch a failure and watching to find the next opportunity are not the same standing job, and a team that assigns them to one overloaded person tends to do the cheap, visible one and skip the one this chapter is about. The generative read is real and it matters; it is simply the subject of a different book, and naming it here as out of scope is the honest way to keep this chapter about the discipline that actually stops the incident.
The fastest way to find the hole is to separate the two words the team has fused. For one production agent, list what you watch, the metrics on the dashboard someone could in principle read. Then list what you can stop, the controls that actually halt the agent mid-action before the damage completes: the pre-call ceiling, the kill switch reachable from outside, the tested rollback. The first list is almost always longer than the second, and the distance between them is the runaway loop nobody braked. A team that can watch everything and stop nothing has not built oversight; it has built a very good record of the incident it could not prevent.