Part IV · Operate  ·  Chapter 11

Chapter 11: You Can’t Measure What You Didn’t Design

The observation phase is the half of the product lifecycle teams quietly deleted, and for an agent that deletion is fatal, because an unmonitored agent does not sit and wait; it keeps acting. The platform emits the raw events; you compose them into instruments; and when an instrument cannot be produced, that is a finding about a surface you never designed, not a gap in your data pipeline.

Picture a Thursday afternoon. A supplier calls your procurement team: she has been getting order confirmations all week and wants to check the delivery schedule. Someone pulls up the dashboard while she is still on the line: three hundred and forty transactions, a hundred percent completion, every entry marked confirmed. Not one order has arrived. The agent has been confirming work it never did, for six months, and the completion metric was real the whole time. The work was not. The scenario is a composite, but every piece of it is drawn from failures that have already happened; assemble them and this is the shape they take.

That gap, between a metric that is true and a result that is right, is what this chapter is about, and it exists because the agent is probabilistic. A deterministic system that passes once passes forever, so you can instrument it after you build it. An agent has to be watched across time, which means the instruments have to be designed in before launch, because the behavior you need to see is behavior that only appears in production and only matters if someone is looking. Most teams are not looking, and not out of carelessness. They deleted the part of the lifecycle where looking happens.

The phase nobody ships

The product lifecycle was never a line; it was a loop. Discover, define, build, launch, and then the half that mattered most for learning: observe what happens in the real world, fix what broke, feed it back into the next release. Over the past fifteen years that back half quietly disappeared. Agile accelerated the build, MVP culture reframed speed as the virtue, continuous delivery made shipping itself feel like the unit of success, and none of that was wrong in principle. In practice it compressed the loop by eliminating the observation phase, which became a retrospective bullet point instead of a resourced stage. This is the house of cards again, applied to observability: observation was the layer that stacked highest and got deferred first, and a deterministic product could absorb the deferral. An agent cannot. For a traditional product the cost of not watching is missed opportunity, a feature nobody uses, a retention dip that takes a quarter to notice; the tool sits and waits and does nothing wrong while no one looks. An agent that no one watches does not sit and wait. It keeps acting, every day, on its last-known configuration, compounding whatever it gets wrong, autonomously, at scale, until someone outside the system, a supplier on a phone, notices what the dashboard could not say.

The launch dashboard, meanwhile, was tracking the right things for the wrong product. Daily active users, session length, task completion: every one of those presumes the human is the actor and measures whether the human kept showing up. In an agentic system the agent acts, and the questions that matter are different in kind. Did the agent do what the user intended, inside the bounds they authorized, and when it was wrong, could the user tell, could they stop it, could they recover? None of those appear in a session-length chart.

Platform emits, you compose

One thing has to be settled before any list of metrics, because it is the working contract for all of them. The instruments you need are composed from events the platform emits; they are not features you buy. Most agentic platforms in 2026 emit the right raw events (tool invocations, approval pauses, recovery actions, confidence scores, incident tickets), but few ship “override frequency” or “task success rate” as a configurable metric with a threshold. That cuts two ways. A PM who waits for the platform to ship the instrument waits forever. And a vendor page that claims to ship the instruments as primitives is usually selling distributed trace capture with a model-as-judge bolted on top and relabeled. What you own is the composition: which events are required, how they combine, what threshold signals intervention. What the platform owes you is the events, reliably and queryably. “Platform emits, you compose” is the line to hold when a vendor blurs it.
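To make the contract concrete, here is a minimal sketch of one composition: override frequency built from raw events. The event names ("approval_pause", "human_override"), the Event shape, and both thresholds are assumptions for illustration; your platform will name its events differently, and the thresholds are yours to set.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical raw event; real platforms use their own schemas."""
    kind: str      # e.g. "approval_pause", "human_override", "tool_invocation"
    task_id: str


def override_frequency(events: Iterable[Event]) -> Optional[float]:
    """Share of tasks that hit an approval pause and were then overridden."""
    events = list(events)
    paused = {e.task_id for e in events if e.kind == "approval_pause"}
    overridden = {e.task_id for e in events if e.kind == "human_override"}
    if not paused:
        # No approval events at all: the instrument cannot be produced,
        # which is itself the finding.
        return None
    return len(paused & overridden) / len(paused)


# Assumed thresholds. Both directions matter: too high and too low.
OVERRIDE_FLOOR = 0.02
OVERRIDE_CEILING = 0.30


def needs_review(rate: Optional[float]) -> bool:
    """True when the reading is missing or outside the expected band."""
    return rate is None or rate < OVERRIDE_FLOOR or rate > OVERRIDE_CEILING
```

The platform's share of the work is the list of events; everything inside the two functions is the part you own.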

The six instruments

The instruments map directly onto the four runtime artifacts from the design chapter and the agent’s place on the autonomy ladder, and that mapping is the point: if the artifact was not designed, the instrument cannot be produced, and the missing instrument is how you discover the missing design.

Two instruments watch what the agent did. Task success rate is the most deceptive metric on the list, because most teams measure whether the agent completed the task when the question is whether it completed the task the user intended. The refund agent makes the trap concrete: “refund issued” is completion, and it is easy to measure and reassuring on a dashboard, but the success question is whether the refund was one a manager would endorse, and an agent quietly approving the cases it should have escalated posts a perfect completion rate while its real success rate falls. The procurement agent is the same divergence in its starkest form, a hundred percent completion and zero percent task success, the agent reporting done on work it never did. Track only completion and you will never see either gap. Unintended action rate is how often the agent did something outside its authorization, and it exists only if the autonomy boundary was built to be legible and logged; if you cannot produce this number, the boundary was never a real product surface, and if it climbs, the boundary is too wide or the people around the agent do not understand what it may do.
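The difference between completion and success is easier to see as two separate calculations over the same records. This is a sketch under assumptions: the TaskRecord fields are hypothetical, and "outcome_verified" stands in for whatever independent check fits your domain (goods received, a manager's endorsement of a sampled refund).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TaskRecord:
    """Hypothetical per-task record assembled from platform events."""
    completed: bool                    # the agent reported the task done
    outcome_verified: Optional[bool]   # independent check; None if never checked
    out_of_bounds: bool                # action fell outside the logged authorization


def completion_rate(records: List[TaskRecord]) -> Optional[float]:
    if not records:
        return None
    return sum(r.completed for r in records) / len(records)


def task_success_rate(records: List[TaskRecord]) -> Optional[float]:
    """Success counts only tasks whose outcome was independently verified."""
    checked = [r for r in records if r.outcome_verified is not None]
    if not checked:
        return None  # nothing was ever verified, so success is unknown
    return sum(r.completed and r.outcome_verified for r in checked) / len(checked)


def unintended_action_rate(records: List[TaskRecord]) -> Optional[float]:
    if not records:
        return None
    return sum(r.out_of_bounds for r in records) / len(records)


# The procurement scenario: every order confirmed, none delivered.
procurement = [TaskRecord(True, False, False) for _ in range(340)]
```

Run against the procurement records, the first function returns a perfect score and the second returns zero, which is the whole argument in two numbers.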

Two instruments watch the humans around the agent. Override frequency is the one teams misread in both directions. A high rate looks like a trust failure and sometimes is; a rate that stays too low in a domain with real uncertainty is just as worth investigating, because it usually means people cannot override easily or have stopped paying attention. Tracked over time it is the earliest signal that supervision is eroding, the number that would have told the procurement team something was accumulating months before the supplier called. Confidence calibration asks whether the agent is certain about what it gets right and uncertain about what it gets wrong; an agent that presents everything with equal confidence builds the fragile kind of trust that holds until the first visible failure and then falls off a cliff, and if it never exposes uncertainty, calibration cannot be measured and the cliff is somewhere ahead of you, unmarked.
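Calibration can be checked with a simple binned comparison of stated confidence against observed accuracy. The sketch below uses expected calibration error, a common summary measure that this chapter does not prescribe; the input format (a confidence between 0 and 1 paired with whether the agent was right) and the bin count are assumptions.

```python
from typing import List, Optional, Tuple

Prediction = Tuple[float, bool]  # (stated confidence in [0, 1], was the agent correct)


def calibration_table(predictions: List[Prediction], n_bins: int = 5):
    """One row per non-empty bin: (bin_low, bin_high, mean_confidence, accuracy, count)."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in predictions:
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 lands in the top bin
        bins[idx].append((conf, correct))
    rows = []
    for i, items in enumerate(bins):
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((i / n_bins, (i + 1) / n_bins, mean_conf, accuracy, len(items)))
    return rows


def expected_calibration_error(predictions: List[Prediction], n_bins: int = 5) -> Optional[float]:
    """Count-weighted average gap between confidence and accuracy across bins."""
    total = len(predictions)
    if total == 0:
        return None  # confidence was never exposed, so calibration cannot be measured
    return sum(
        count / total * abs(mean_conf - accuracy)
        for _, _, mean_conf, accuracy, count in calibration_table(predictions, n_bins)
    )
```

An agent that reports 0.9 on everything and is right half the time produces a gap of about 0.4; an agent whose certainty is earned produces a gap near zero.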

Two instruments watch recovery. Rollback time is how long it takes to undo a specific error once detected, and it exists only if a recovery workflow was designed; if the answer to a consequential mistake is an email thread and a manual fix, rollback time is measured in days and logged nowhere. Incident recovery time is the organizational version, the freeze-attribute-notify-reauthorize sequence when an agent misbehaves at scale, and if that was never designed, this instrument measures the speed of improvisation, which is slow, variable, and instructive exactly once.
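Rollback time only exists if detection and restoration are both timestamped somewhere. A minimal sketch, assuming each incident is a dict with hypothetical "detected_at" and "restored_at" fields, the latter left as None while the error is still being fixed by hand:

```python
from datetime import datetime, timedelta
from statistics import median


def rollback_summary(incidents):
    """Summarize rollback time in minutes and count errors never restored."""
    incidents = list(incidents)
    minutes = [
        (i["restored_at"] - i["detected_at"]).total_seconds() / 60
        for i in incidents
        if i.get("restored_at") is not None
    ]
    return {
        "median_minutes": median(minutes) if minutes else None,
        "worst_minutes": max(minutes) if minutes else None,
        "unresolved": len(incidents) - len(minutes),
    }


# Illustrative data only: two designed rollbacks, one email-thread fix, one still open.
t0 = datetime(2026, 3, 2, 9, 0)
incidents = [
    {"detected_at": t0, "restored_at": t0 + timedelta(minutes=10)},
    {"detected_at": t0, "restored_at": t0 + timedelta(minutes=30)},
    {"detected_at": t0, "restored_at": t0 + timedelta(days=2)},
    {"detected_at": t0, "restored_at": None},
]
summary = rollback_summary(incidents)
```

The median hides the email-thread case; the worst value and the unresolved count are where the undesigned recovery path shows up.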

The six instruments.
  • Task success rate. Did it do what the user intended, not just complete the task?
  • Unintended action rate. How often did it cross a boundary?
  • Override frequency. Is supervision holding?
  • Confidence calibration. Is the agent’s certainty earned?
  • Rollback time. Can a single error be undone?
  • Incident recovery time. Can a systemic failure be contained?

Each maps to a runtime artifact. If an instrument cannot be produced, the corresponding surface was not designed, and that absence is the finding. A diverging reading is a sprint input, a specific story for the next cycle, not a line item for a quarterly review.

Real-time when the consequence is faster than the alert

The six instruments are mostly retrospective, and for most systems that is the right altitude: review the day or the week, surface divergence, feed it back. Some actions cannot wait for retrospect. The nine-second deletion from the design chapter is the case: a dashboard would have caught it the next morning, and the next morning was several hours after the data and its backups were gone. Real-time observability is the pattern for actions whose consequence arrives faster than a human can respond, and it is not a faster dashboard. It is the kill switch firing before the action commits, which means the work is upstream, in the autonomy boundary, not downstream in the monitor. The PM questions are concrete: which actions in this agent’s tool set execute in seconds and cannot be undone, what is the kill-switch surface for those, who has the authority to fire it, and what does the agent do when it fires mid-action: hold the queue, roll back, or surface the incomplete state? If the team’s answer to “how do we stop it mid-action” is “we shut down the service,” that is not a kill switch. It is an emergency shutdown preceded by the damage.
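The difference between a kill switch and a shutdown is where the check sits. The sketch below shows one of the options named above, holding the queue and surfacing the incomplete state; it does not attempt rollback. The KillSwitch class, the run_plan function, and the execute callback are hypothetical names, not any platform's API.

```python
import threading


class KillSwitch:
    """A switch any authorized person can fire; checked before every commit."""

    def __init__(self):
        self._tripped = threading.Event()
        self.tripped_by = None

    def fire(self, who):
        self.tripped_by = who   # record who exercised the authority
        self._tripped.set()

    def is_tripped(self):
        return self._tripped.is_set()


def run_plan(steps, kill_switch, execute):
    """Run (action, args) steps in order, checking the switch before each one.

    Returns (committed, held): what already happened, and the queue that was
    held when the switch fired, so the incomplete state can be surfaced.
    """
    committed, held = [], []
    for idx, (action, args) in enumerate(steps):
        if kill_switch.is_tripped():
            held = steps[idx:]   # hold the rest of the queue, commit nothing more
            break
        execute(action, args)
        committed.append((action, args))
    return committed, held
```

Because the check runs before each commit, a switch fired during step two stops step three from ever executing, which is the property a service shutdown cannot give you.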

The wrong answer usually comes from the data

Every conversation about monitoring an agent eventually reaches the same question: where does the wrong answer actually come from? More often than teams want to admit, not from the model’s reasoning but from the data it was handed. The model does its job; the substrate is wrong, and infrastructure monitoring sees none of it because it is not looking at the data. So the monitoring architecture has to reach the data layer too, watching freshness, completeness, the integrity of the identifiers the agent joins on, and the accuracy of the relationships it traverses, because the tools built to watch analytics pipelines were not built to watch retrieval freshness or provenance. Trustworthy output requires trustworthy input. This failure is distinctive enough, and invisible enough to the instruments above, that it gets the next chapter to itself; here it is enough to say that a monitoring design that watches the agent reason while ignoring what it reasons from is watching the wrong half.
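A data-layer check can be as plain as the sketch below, which inspects a record before the agent reasons from it. Everything named here is an assumption for illustration: the field names ("updated_at", "supplier_id"), the freshness window, and the idea that a set of known identifiers is available to test the join key against.

```python
from datetime import datetime, timedelta, timezone


def check_record(record, now, max_age, required_fields, known_ids):
    """Return a list of findings about one input record; empty means it passed."""
    findings = []
    # Freshness: how old is the data the agent is about to act on?
    if now - record["updated_at"] > max_age:
        findings.append("stale")
    # Completeness: are the fields the agent depends on actually populated?
    missing = [f for f in required_fields if record.get(f) in (None, "")]
    if missing:
        findings.append("incomplete:" + ",".join(missing))
    # Identifier integrity: does the join key resolve to something real?
    if record.get("supplier_id") not in known_ids:
        findings.append("dangling_join_key")
    return findings


# Illustrative record: three weeks old, no delivery date, unknown supplier.
now = datetime(2026, 1, 31, tzinfo=timezone.utc)
record = {
    "updated_at": datetime(2026, 1, 10, tzinfo=timezone.utc),
    "supplier_id": "S-999",
    "delivery_date": None,
}
findings = check_record(
    record,
    now=now,
    max_age=timedelta(days=7),
    required_fields=["supplier_id", "delivery_date"],
    known_ids={"S-001", "S-002"},
)
```

None of these findings would appear in an infrastructure dashboard, because the service was up and the query succeeded.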

The metric is the test of the design

Here is the reversal that agentic products force on teams trained on SaaS instrumentation. In SaaS you design the product and instrument it afterward. Here the metric is the design requirement stated in advance. You design the approval moment because you need to measure override frequency. You design the recovery workflow because rollback time has to become a sprint target. You design the audit surface because traceability is a product commitment. So when a team cannot produce override frequency, the override surface was never made legible; when they cannot produce rollback time, recovery was never treated as a surface; when they cannot draw a calibration curve, confidence was never exposed in measurable form. The absence of the metric is not a data-infrastructure problem to be fixed later. It is proof that the corresponding phase of the lifecycle was never built, which is exactly why the instruments belong in the design brief and not the postmortem.

Scale makes the stakes literal. A large enterprise assistant deployment can field on the order of half a million questions a working day, which is what these systems look like when they work, and at that volume a one-percent unintended-action rate is five thousand unintended actions a day. The same rate at two hundred interactions a day is two. The instruments are scale-dependent, so the thresholds have to be modeled for the volume you will actually hit, not the pilot, and the review has to be sampling-based and exception-escalated, because any observation design that assumes a human reads every output assumes a workforce you do not have.
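The arithmetic and the review pattern can both be written down in a few lines. The volumes and the one-percent rate come from the paragraph above; the function names, the sample rate, and the is_exception predicate are assumptions for illustration.

```python
import random


def expected_daily_events(volume_per_day, rate):
    """Expected count of a given event at a given daily volume."""
    return volume_per_day * rate


def select_for_review(interactions, sample_rate, is_exception, seed=None):
    """Escalate every exception; sample the rest at random for human review."""
    rng = random.Random(seed)
    return [
        item for item in interactions
        if is_exception(item) or rng.random() < sample_rate
    ]


at_scale = expected_daily_events(500_000, 0.01)   # about 5,000 a day
at_pilot = expected_daily_events(200, 0.01)       # about 2 a day
```

A threshold that a person can review by hand at pilot volume becomes a staffing problem at production volume, which is why the sample rate and the exception rule belong in the design and not in the incident review.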

The lifecycle management is yours. Not the launch, the loop. Engineering builds the instrumentation; you define what to instrument, why each reading matters to the next release, and what threshold trips an action, set before the incident, because setting it after is deciding under pressure. So pick one deployed agent and write the threshold for each of the six instruments. The one whose breach should page someone tonight is the one you most need to have designed before tonight, and if you cannot produce one of the six at all, you have just found the surface you shipped without.