You Can’t Measure What You Didn’t Design
Stage: post-launch, observation
The call came from a supplier. She had been receiving order notifications all week and wanted to confirm the delivery schedule. Someone on the procurement team pulled up the dashboard: 340 transactions, 100 percent completion, every entry marked confirmed.
She was still on the line.
Not one order had arrived. The agent had been confirming work it never did, for six months.
The completion metric was real. The work was not.
The metrics in this chapter exist because the agent is probabilistic. If it were deterministic, one pass would be one pass forever. Because it is not, behavior has to be observed across time, and the instruments must be designed in before launch. That is the through-line of this chapter.
The Phase Nobody Ships
The traditional product lifecycle was not a straight line. It was a loop. You discovered, defined, built, and launched, but launch was never supposed to be the last step. It was the checkpoint where the work shifted from building to learning. The observation phase followed: you monitored what was happening in the real world, fixed what was broken, and collected the feedback that seeded the next release.
What happened over the past fifteen years is that one half of that loop quietly disappeared.
Agile culture accelerated the build cycle. MVP culture reframed speed as virtue. Continuous delivery made shipping feel like the unit of success. None of these were wrong in principle. But in practice, they compressed the loop by eliminating the back half. Ship fast. Celebrate the launch. Move to the next roadmap item. The observation phase became a retrospective bullet point, not a product phase.
Applied to observability, the disappearance of the observation phase means instrumentation gets deferred until after launch. Deterministic MVPs could absorb that deferral. Agentic MVPs cannot. The supervisory half of the loop is the probabilistic system’s nervous system. Defer it, and you ship without one.
Most enterprise product teams today do not have a named, resourced, designed observation phase. They have a support queue, an NPS survey sent at thirty days, and a dashboard built for the launch presentation that nobody updated after week two.
For a SaaS product, the cost of not monitoring is missed opportunity. A feature nobody uses. A workflow with friction nobody reported. A retention drop that took a quarter to diagnose. The tool sits and waits. It does nothing wrong while no one is watching.
Agents are different. An agent that is not monitored does not sit and wait. It continues to act. Every day without a corrective loop is another day the agent operates according to its last known configuration, inside its last-understood autonomy boundary, guided by instructions that may no longer match what users intend. A SaaS product with no observation phase stagnates. An agentic product with no observation phase compounds its errors autonomously, at scale, until a supplier calls.
The Six Months
The six months between launch and the supplier’s call were not a careless gap. They were the predictable output of something more fundamental: trust building naturally, in the absence of a system designed to hold supervision stable.
In the first weeks after launch, the procurement team was paying close attention. Spot checks came back clean. A manager who had been approving every batch above a certain threshold stopped scheduling the weekly review, because three months of clean data made it feel like unnecessary overhead. Someone adjusted the approval threshold upward because the agent had never made a mistake at the original level. None of these decisions were announced or logged. Each was a rational response to observed reliability, made by a reasonable person in the moment.
By month six, the autonomy level the team had actually granted the agent bore no resemblance to what was designed at launch. The agent was not running at the authorized level. It was running at the level the team had drifted into, through a series of small trust updates that were never made visible.
Two mechanisms produce this supervisory drift, usually at the same time. Complacency is the long-running finding from aviation human factors: as reliability increases, operator vigilance decreases, not because operators become careless, but because sustained reliability is cognitively indistinguishable from reliability you should trust unconditionally. Expectation is the more actionable framing for a PM. Users do not only drift into trust over time; they arrive pre-trained by every other AI tool they have used: hand off, accept, move on. Policy cannot fix complacency. Onboarding cannot undo the expectation. The supervisory surface has to push against a reflex set before the product opened, and hold it across the weeks when reliability silently erodes attention.
What the Old Metrics Were Measuring
The launch dashboard tracked the right things for the wrong product.
Daily active users measures whether users found the tool worth returning to. Session length measures whether the workflow held attention long enough to complete. Task completion rate measures whether the interaction was designed well enough for a human to reach the destination. These metrics presuppose a system where the human is the actor.
In an agentic system, the agent acts. The questions that matter are fundamentally different: did the agent do what the user intended, within the bounds they authorized, and when it got something wrong, could the user tell, could they stop it, and could they recover?
Those questions do not appear in a session-length chart.
Platform Emits, PM Composes
One point belongs ahead of the six-instrument list, because it is the working contract for everything that follows.
The six instruments below are composed from events the platform emits, not shipped as named features on the dashboard you bought. Most enterprise AI and agentic platforms in 2026 emit the right raw events: tool invocations, approval pauses, recovery actions, incident tickets, confidence scores. Few ship “override frequency” or “task success rate” as a built-in metric with a threshold you configure.
That distinction matters twice. First, a PM who waits for the platform feature will wait forever. Second, a vendor page claiming to ship the six instruments as primitives is usually offering distributed trace capture plus an LLM-as-a-judge on top, relabeled. What you own as PM is the composition: which events are required, how they combine into each instrument, what threshold signals intervention. What the platform owes you is the events, reliably and queryably. That division (platform emits, PM composes) is the working contract for every metric in this chapter. Appendix A, the platform taxonomy, is where to look when a vendor page blurs the line.
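As a minimal sketch of that contract, the composition can be written down as data: each instrument lists the raw event types it needs, and a check reports which instruments the platform's event stream cannot yet support. The event names and the `REQUIRED_EVENTS` mapping are illustrative assumptions, not any vendor's schema.

```python
# Hypothetical event-type names; a real platform will use its own schema.
REQUIRED_EVENTS = {
    "task_success_rate": {"task_completed", "outcome_verified"},
    "unintended_action_rate": {"tool_invocation", "authorization_check"},
    "override_frequency": {"approval_pause", "override"},
    "confidence_calibration": {"confidence_score", "outcome_verified"},
    "rollback_time": {"error_detected", "recovery_completed"},
    "incident_recovery_time": {"incident_opened", "incident_resolved"},
}


def missing_events(emitted: set) -> dict:
    """Return, per instrument, the required event types the platform does not emit."""
    gaps = {}
    for instrument, required in REQUIRED_EVENTS.items():
        absent = required - emitted
        if absent:
            gaps[instrument] = absent
    return gaps


# Example: a platform that emits some, but not all, of the assumed events.
emitted = {"task_completed", "tool_invocation", "approval_pause",
           "override", "confidence_score", "incident_opened", "incident_resolved"}
gaps = missing_events(emitted)
```

Any instrument that shows up in `gaps` cannot be composed yet, which gives the PM a concrete list to take to the vendor or to engineering.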
The Six Instruments
The metrics an agentic product requires map directly onto the four runtime design artifacts from Chapter 5 and the autonomy level of the agent. If those artifacts were not designed, these metrics cannot be produced, which is precisely the point. The absence of a metric is a design finding, not a data infrastructure gap.
Each one is a design requirement stated in advance, not a KPI added after launch.
Six instruments constitute the observation phase for any agentic product: task success rate, unintended action rate, override frequency, confidence calibration, rollback time, and incident recovery time. If any cannot be produced, the corresponding product surface was not designed. That is a design gap, not a data gap.
Each instrument is a sprint input: a diverging reading generates a specific user story in the next sprint, not a quarterly review discussion.
Observe
Task success rate. The most deceptive metric on the list. Most teams track whether the agent completed the task. The correct version measures whether it completed the task the user intended. The procurement agent in this chapter’s opening had a 100 percent completion rate and a zero percent task success rate. This failure mode is called a background failure: the agent finishes, no error state registers, and the result is wrong. What divergence tells you: if task success diverges from task completion, you have background failures in production.
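The gap between completion and success can be sketched in a few lines. This assumes each task record carries an `outcome_verified` flag set by some independent check (for the procurement case, something like a delivery receipt); that field and the `TaskRecord` type are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    completed: bool          # the agent reported the task finished
    outcome_verified: bool   # an independent check confirmed the intended result


def completion_rate(tasks):
    return sum(t.completed for t in tasks) / len(tasks)


def task_success_rate(tasks):
    return sum(t.completed and t.outcome_verified for t in tasks) / len(tasks)


def background_failure_rate(tasks):
    """Share of all tasks that were marked complete but never confirmed."""
    return completion_rate(tasks) - task_success_rate(tasks)


# The opening scenario: 340 orders marked confirmed, none delivered.
orders = [TaskRecord(completed=True, outcome_verified=False) for _ in range(340)]
```

On the opening scenario, completion comes out at 1.0 and task success at 0.0, so the entire volume is background failure.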
Unintended action rate. How often the agent took an action it was not explicitly authorized to take. The metric that surfaces boundary violations, but only if the autonomy boundary was designed to be legible and logged. What divergence tells you: if this metric is missing, the boundary was never designed as a product surface. If it rises, the boundary is too broad or the people around the system do not understand what the agent is permitted to do.
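A minimal sketch of this instrument, assuming the autonomy boundary is expressed as an explicit set of authorized action names and every action the agent takes is logged by name. The action names are invented for illustration.

```python
# Hypothetical boundary: the actions the agent is explicitly authorized to take.
AUTHORIZED_ACTIONS = {"create_purchase_order", "send_order_notification"}


def unintended_action_rate(action_log, authorized=AUTHORIZED_ACTIONS):
    """Fraction of logged actions that fall outside the authorized set."""
    if not action_log:
        return 0.0
    unintended = [a for a in action_log if a not in authorized]
    return len(unintended) / len(action_log)


log = ["create_purchase_order", "send_order_notification",
       "raise_approval_threshold", "create_purchase_order"]
rate = unintended_action_rate(log)
```

The computation is trivial; the hard part is the precondition. If there is no `AUTHORIZED_ACTIONS` equivalent to compare against, the metric has nothing to read.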
Monitor
Override frequency. The metric most teams misread. A high rate looks like a trust failure; sometimes it is. A rate that is persistently too low, in a domain with genuine uncertainty, is equally worth investigating; it may mean users cannot override easily, or that they have stopped paying attention. What divergence tells you: tracked over time, override frequency is the earliest signal that supervision is eroding before the agent drifts past its authorized boundary. It is the metric that tells you, before the supplier calls, that something has been accumulating for months.
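One way to track this over time, sketched under assumptions: overrides and presented decisions are counted per week, and erosion is flagged when the recent average falls well below the early baseline. The three-week window and the 0.5 ratio are arbitrary illustrative values, and a falling rate can also mean the agent genuinely improved, so the flag is a prompt to investigate, not a verdict.

```python
def weekly_override_frequency(weeks):
    """weeks: list of (overrides, decisions_presented) tuples, oldest first."""
    return [o / d if d else 0.0 for o, d in weeks]


def supervision_eroding(rates, window=3, drop_ratio=0.5):
    """Flag when the mean of the last `window` weeks falls below
    `drop_ratio` times the mean of the first `window` weeks."""
    if len(rates) < 2 * window:
        return False  # not enough history to compare
    baseline = sum(rates[:window]) / window
    recent = sum(rates[-window:]) / window
    return baseline > 0 and recent < drop_ratio * baseline


weeks = [(12, 100), (10, 100), (11, 100), (6, 100), (3, 100), (1, 100), (0, 100)]
rates = weekly_override_frequency(weeks)
eroding = supervision_eroding(rates)
```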
Confidence calibration. A well-calibrated agent expresses high certainty on things it gets right and surfaces uncertainty on things it gets wrong. An overconfident agent builds fragile trust: adoption that holds until the first visible failure, then collapses faster than it was built. What divergence tells you: if the agent never exposes uncertainty, calibration cannot be measured, and the trust graph will have a cliff in it somewhere ahead.
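Calibration can be summarized with expected calibration error, a common measure that compares stated confidence with observed accuracy inside confidence bins. The sketch below assumes the agent exposes a confidence value between 0 and 1 for each decision and that each decision is later marked right or wrong; the bin count is an arbitrary choice.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between stated confidence and observed accuracy,
    computed over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# An overconfident agent: always reports 0.95, right only half the time.
overconfident = expected_calibration_error([0.95] * 10, [True] * 5 + [False] * 5)
# A calibrated agent: reports 0.5 and is right half the time.
calibrated = expected_calibration_error([0.5] * 4, [True, True, False, False])
```

A score near zero means stated confidence tracks accuracy; the overconfident example scores about 0.45.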
Fix
Rollback time. How long it takes from detecting an error to completing recovery for that specific case. This metric only exists if a recovery workflow was designed as a product surface. If the response to a consequential error is an email thread and a manual correction, rollback time is measured in days and logged nowhere. If a compensating workflow was built, staged, and accessible from launch, rollback time is measured in minutes and becomes a sprint target.
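If the recovery workflow logs a detection timestamp and a recovery timestamp per case, rollback time reduces to a duration calculation. The sketch below reports the median in minutes; the timestamps are invented example data.

```python
from datetime import datetime
from statistics import median


def rollback_minutes(cases):
    """cases: list of (detected_at, recovered_at) datetimes for individual errors.
    Returns the median detection-to-recovery time in minutes."""
    durations = [(recovered - detected).total_seconds() / 60
                 for detected, recovered in cases]
    return median(durations)


cases = [
    (datetime(2026, 3, 2, 9, 0), datetime(2026, 3, 2, 9, 12)),
    (datetime(2026, 3, 4, 14, 30), datetime(2026, 3, 4, 14, 38)),
    (datetime(2026, 3, 9, 11, 5), datetime(2026, 3, 9, 11, 50)),
]
median_rollback = rollback_minutes(cases)
```

The median is used here so that one slow case does not dominate the reading; a team could equally track a high percentile if the slow cases are the ones that matter.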
Incident recovery time. The organizational equivalent of rollback time. When an agent misbehaves at scale, the response involves freezing the agent, attributing the failure, notifying affected users, and reauthorizing before resuming. If that sequence was not designed, incident recovery time measures the speed of improvisation. It will be slow, variable, and instructive exactly once.
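Because incident recovery is a sequence, it helps to measure each stage rather than only the total. The sketch below assumes one timestamp per stage; the stage names follow the sequence described above but are otherwise hypothetical, as is the example incident.

```python
from datetime import datetime

# Assumed stage names, following the freeze / attribute / notify / reauthorize sequence.
STAGES = ["detected", "frozen", "attributed", "users_notified", "reauthorized"]


def stage_minutes(timestamps):
    """timestamps: dict mapping each stage name to the datetime it was reached.
    Returns minutes elapsed between each stage and the one before it."""
    return {
        cur: (timestamps[cur] - timestamps[prev]).total_seconds() / 60
        for prev, cur in zip(STAGES, STAGES[1:])
    }


incident = {
    "detected": datetime(2026, 3, 2, 9, 0),
    "frozen": datetime(2026, 3, 2, 9, 5),
    "attributed": datetime(2026, 3, 2, 11, 5),
    "users_notified": datetime(2026, 3, 2, 11, 35),
    "reauthorized": datetime(2026, 3, 2, 13, 35),
}
breakdown = stage_minutes(incident)
total_minutes = sum(breakdown.values())
```

In this invented example the freeze takes five minutes while attribution and reauthorization take two hours each, which points the next sprint at those stages rather than at the freeze mechanism.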
Real-Time vs. Retrospective Observability
The six instruments are mostly retrospective. They tell you what happened. For most agentic systems, that is the right altitude: review the past day or week, surface divergence, feed the findings back into the next sprint. Retrospective observation is the workhorse of the observation phase.
Some classes of action require something stronger. The PocketOS case in Chapter 5 closed in nine seconds, which is faster than any alert-and-respond cycle is designed to operate. A coding agent with broad write access to production data did not need a dashboard. It needed a kill-switch that the autonomy boundary fired before the action executed. By the time a downstream alert fired, the data and the volume backups had both been deleted by the same operation. The retrospective dashboard would have caught it the next morning. The next morning was too late.
Real-time observability is the architectural pattern for actions where the timescale of consequence is shorter than the timescale of human response. The PM design questions are concrete. Which actions in this agent’s tool boundary execute in seconds and have irreversible consequences? For those, what is the kill-switch surface, who has authority to invoke it, and how is the agent designed to fail safely when it does? When does the action commit, and is there a brief commit-delay window (the Gmail Undo Send pattern) during which a fast intervention can reverse the action? What is the agent’s fallback when the kill-switch fires mid-execution: an incomplete state, an automatic rollback, a held queue?
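The commit-delay idea can be sketched as a held action that only executes once its window has elapsed and that a kill-switch can cancel in the meantime. This is an illustration of the pattern, not a production design: the `HeldAction` class is hypothetical, and time is passed in as plain seconds so the behavior is easy to follow.

```python
class HeldAction:
    """An irreversible action that commits only after a delay window,
    unless a kill-switch cancels it first."""

    def __init__(self, name, execute, submitted_at, delay_seconds):
        self.name = name
        self._execute = execute
        self.commit_at = submitted_at + delay_seconds
        self.state = "held"

    def cancel(self):
        """Kill-switch: succeeds only if the action has not committed yet."""
        if self.state == "held":
            self.state = "cancelled"
            return True
        return False

    def tick(self, now):
        """Commit the action once the delay window has elapsed."""
        if self.state == "held" and now >= self.commit_at:
            self._execute()
            self.state = "committed"
        return self.state


deleted = []
action = HeldAction("drop_volume", lambda: deleted.append("volume"),
                    submitted_at=0, delay_seconds=10)
action.tick(now=5)            # still inside the window, nothing executes
cancelled = action.cancel()   # kill-switch fires in time
action.tick(now=20)           # no effect: the action was cancelled
```

The design questions from the paragraph above map onto the parameters: `delay_seconds` is the timescale the PM specifies, and what happens to a cancelled action (here, it is simply dropped) is the fallback behavior that still has to be decided.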
Engineering owns the implementation. The PM owns the question of which actions are in scope for real-time intervention, what the timescale must be, and what the failure mode looks like when the kill-switch is invoked. If the team’s answer to “how do we stop the agent mid-action” is “we shut down the service,” that is not a kill-switch. That is an emergency shutdown preceded by damage.
Multi-Agent Observability
One brief note on a class of system that is becoming more common in 2026, because the trace surface gets harder when agents call agents.
A single-agent system has a single trace: goal, steps, tool calls, action, result. A multi-agent system has a trace per agent plus a coordination trace that records which agent invoked which other agent, what context was passed between them, and how disagreements were resolved. The events you need to compose the six instruments are still there. They are spread across multiple traces that need to be stitched together to reconstruct what happened.
Most platforms in 2026 do not handle multi-agent trace correlation cleanly. OpenTelemetry has primitives for cross-service tracing that are being extended to agent-to-agent calls, but the standardization is incomplete. The PM question to ask engineering when multi-agent architecture is on the roadmap is concrete: can we reconstruct, after an incident, the full coordination trace across all agents involved in the workflow, including which agent passed which context to which other agent and at what point a disagreement was resolved? If the answer is partial, the incident-recovery-time metric will be larger than the team is estimating, because reconstruction will be the first task before recovery can begin.
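A rough way to test that question is to try stitching the per-agent spans into one tree and see what is left dangling. The sketch below borrows the parent/child span idea from distributed tracing but is not OpenTelemetry's API; the field names and the example spans are assumptions.

```python
def stitch(spans):
    """spans: list of dicts with 'span_id', 'parent_id' (None for the root), and 'agent'.
    Returns (children_by_parent, orphans). Orphans are spans whose parent was
    never recorded, which means the coordination trace is incomplete."""
    known = {s["span_id"] for s in spans}
    children = {}
    orphans = []
    for s in spans:
        parent = s["parent_id"]
        if parent is None:
            continue  # root span, nothing to link
        if parent in known:
            children.setdefault(parent, []).append(s["span_id"])
        else:
            orphans.append(s["span_id"])
    return children, orphans


spans = [
    {"span_id": "a1", "parent_id": None, "agent": "planner"},
    {"span_id": "b1", "parent_id": "a1", "agent": "pricing"},
    {"span_id": "c1", "parent_id": "a1", "agent": "approval"},
    {"span_id": "d1", "parent_id": "x9", "agent": "notifier"},  # caller's span was not captured
]
children, orphans = stitch(spans)
```

A non-empty `orphans` list is the "partial" answer in concrete form: someone will have to reconstruct by hand who invoked the notifier before recovery can start.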
The coordination cost from Chapter 3 is a useful anchor here too. Coordination is not just expensive in tokens. It is also expensive in trace complexity. The PM who is sold on multi-agent architecture for capability reasons should price in coordination cost on both dimensions before signing off.
Data Observability
Every discussion of agentic monitoring eventually surfaces the same question: where does the wrong answer actually come from? The answer is more often than teams want to admit: from the data the agent was given, not from the model’s reasoning.
An agent reasoning over stale pricing data will approve purchase orders that miss market conditions. An agent working from a knowledge base with outdated product policies will give customers guidance the company stopped supporting six months ago. An agent whose knowledge graph has incorrect entity mappings will classify support tickets into the wrong workflows, confidently, every time, with no error signal at the application layer. The model is doing its job. The substrate is wrong. Infrastructure monitoring sees none of it; it is not looking at data quality.
Data observability is the discipline of detecting when the data the agent reasons from is not what the agent thinks it is. Five properties matter for an agent. Data freshness: is the retrieval layer pulling from current sources, or from a snapshot that is now stale? Data completeness: are there gaps that will cause the agent to reason from partial evidence without flagging the gap? Referential integrity: do the identifiers the agent uses to connect concepts actually connect to the right things, or has an entity-mapping change broken the joins? Context availability: is the context the agent needs to make a correct decision accessible in a form the agent can use, or is it locked behind a permission the agent does not have? Knowledge graph mapping accuracy: are the semantic relationships the agent traverses actually correct?
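Three of those five properties lend themselves to simple mechanical checks, sketched below with invented field names and data. Context availability and knowledge graph mapping accuracy depend on the permission model and the graph itself, so they are left out of this illustration.

```python
from datetime import datetime, timedelta


def is_fresh(last_updated, now, max_age):
    """Freshness: the source was refreshed within the allowed age."""
    return now - last_updated <= max_age


def missing_fields(record, required):
    """Completeness: required fields that are absent or empty."""
    return [f for f in required if record.get(f) in (None, "")]


def dangling_refs(records, key, valid_ids):
    """Referential integrity: identifiers that point at nothing."""
    return [r[key] for r in records if r[key] not in valid_ids]


now = datetime(2026, 3, 10, 12, 0)
fresh = is_fresh(datetime(2026, 3, 1, 12, 0), now, max_age=timedelta(days=1))
gaps = missing_fields({"sku": "A-100", "price": None},
                      required=["sku", "price", "currency"])
broken = dangling_refs([{"supplier_id": "S1"}, {"supplier_id": "S7"}],
                       "supplier_id", valid_ids={"S1", "S2"})
```

Each check would run against the data the agent retrieves, at or near the moment it retrieves it, so that a stale snapshot or a broken join raises a signal before it becomes a confident wrong answer.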
Tools like Monte Carlo, Acceldata, and Great Expectations monitor freshness, schema drift, and volume anomalies on data warehouse layers. They were built for analytics pipelines. They do not natively monitor vector index freshness, embedding drift, or whether retrieved document provenance is intact. The data observability stack and the retrieval stack are still largely separate concerns in most enterprise architectures, with no unified surface showing whether the data the agent is reasoning from is trustworthy at the moment it reasons from it.
Chapter 6 referenced PoisonedRAG: a documented attack pattern where a small number of malicious texts injected into a knowledge base can achieve a high attack success rate. The model behaves as designed. The retrieval works as designed. The corrupted evidence is what the agent reads, and the eval that scored the output against the retrieved evidence scores it as correct, because the output was faithful to the input. The data layer is not just an upstream concern. It is part of the trust boundary the agent operates inside.
Trustworthy AI results require trustworthy data. That is not a philosophical position. It is an architectural requirement. The monitoring architecture that does not include the data layer is watching the agent reason while ignoring the quality of what it is reasoning from.
The Metric Is the Test of the Design
There is a forcing function in the measurement problem that exposes the design gap directly.
If a team cannot produce the override frequency metric, the override surface was not designed to be legible or logged. If they cannot produce rollback time, recovery was not treated as a product surface. If they cannot produce a confidence calibration curve, confidence was not surfaced to the user in a measurable form.
The absence of the metric is not a data infrastructure problem. It confirms that the corresponding phase of the product lifecycle was never built.
This is the reversal agentic products require from teams trained on SaaS instrumentation. In traditional SaaS thinking, you design the product and instrument it afterward. In agentic product design, the metric is the design requirement stated in advance. You design the approval moment because you need to measure override frequency. You design the recovery workflow because rollback time must become a sprint target. You design the audit surface because traceability is a product commitment, not a logging exercise.
There is a second-order point hidden in this reversal that is worth naming separately: the interface is the instrument. What you can measure about an agent depends on how the user is allowed to interact with it. Constrain the interaction to forced-choice answers and you can only measure agreement and confidence on those choices. Constrain it to short prompts and you cannot measure how the agent behaves when reasoning unfolds across multiple turns. The instrument cannot read what the interface does not allow to happen.
Medicine has known this for fifty years. The patient history alone, in one well-replicated set of outpatient studies, produced the correct diagnosis in the large majority of encounters. The physical exam and lab work added confirmation, not diagnosis. The conversation was the diagnostic instrument. If you constrain the interface to remove the conversation, you remove the instrument that produces the answer.
The same pattern appeared in AI triage research. A widely cited study found that a leading model under-triaged a large share of emergency presentations, a rate that alarmed patient-safety researchers. A later partial replication ran the same models under naturalistic conditions: free-text descriptions, clarifying questions allowed, multiple turns, rather than clinician-authored vignettes with forced single-turn answers. The same models that scored near the floor under forced choice recovered most of the way to correct under free text. The model did not change. The interface did. The original study had measured the interface, not the model. (The replication and its figures are in Appendix C and developed in the companion volume.)
This is what your observation phase must surface. Not just whether the agent gave the right answer, but whether the interface let the agent demonstrate its actual capability. If your override frequency metric is high, the supervisory layer is working. If your task success rate is low, you may have a model problem, or you may have an interface that prevents the model from showing its work. The six instruments cannot distinguish those two cases unless the observation design includes the interface itself as an experimental variable.
The PM consequence: if you ship a forced-choice interface for a complex agentic decision, your dashboards will tell you the model is unsafe. They will not be wrong. They will also be measuring the interface you chose, not the model you bought. Design the interface as deliberately as you design the metric.
Scale: Five Hundred Thousand Questions
One point of scale worth naming as context. Microsoft’s internal deployment of its agentic assistants reports that employees ask the system on the order of five hundred thousand questions every working day. That volume is not an outlier. It is what enterprise agentic deployments look like when they work. The implication for observation design: sampling-based review is mandatory, exception-based escalation is mandatory, and the six instruments must be dashboarded, not pulled manually. If your observation phase design assumes a human reviews every output, your design assumes a workforce you do not have.
The volume point also cuts the other way. A one-percent unintended-action rate at five hundred thousand questions per day is five thousand unintended actions per day. The same rate at two hundred interactions per day is two. The six instruments are scale-dependent. Your thresholds must be modeled for the volume you will actually hit, not for the pilot.
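The arithmetic is worth making explicit, because it runs in both directions: a rate and a volume give a daily count, and a review capacity and a volume give the highest rate the team can absorb. The capacity figure of 50 reviews per day is a made-up number for illustration.

```python
def expected_daily_count(rate, daily_volume):
    """Expected number of events per day at a given rate and volume."""
    return rate * daily_volume


def max_rate_for_capacity(review_capacity_per_day, daily_volume):
    """Highest rate a team can absorb if every flagged event needs human review."""
    return review_capacity_per_day / daily_volume


at_scale = expected_daily_count(0.01, 500_000)   # one percent at enterprise volume
at_pilot = expected_daily_count(0.01, 200)       # the same rate at pilot volume
ceiling = max_rate_for_capacity(50, 500_000)     # hypothetical capacity: 50 reviews/day
```

With these inputs the same one-percent rate yields about five thousand events a day at scale and two at pilot, and a team that can review fifty cases a day would need the rate down near 0.01 percent at full volume.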
The PM Owns the Loop
Product lifecycle management is a PM responsibility. Not the launch. The loop.
Engineering builds the instrumentation. The PM defines what to instrument, why each metric matters to the next release, and what decision each signal should enable. That means owning the six instruments before the first production incident, not as a retrospective task. It also means setting the thresholds that trigger action: what override frequency triggers an autonomy demotion, what incident recovery time is unacceptable, what confidence calibration gap signals a model problem. Set those thresholds before launch, because setting thresholds after an incident is deciding under pressure.
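Thresholds set before launch can live in a small, reviewable table that pairs each band with the action already agreed for it. The sketch below is one possible shape; the numbers are placeholders, not recommendations, and the metric keys are assumptions.

```python
# Illustrative thresholds agreed before launch; values are placeholders.
THRESHOLDS = {
    "override_frequency": {"min": 0.02, "max": 0.30, "action": "review autonomy level"},
    "unintended_action_rate": {"max": 0.005, "action": "tighten autonomy boundary"},
    "rollback_minutes": {"max": 30, "action": "redesign recovery workflow"},
}


def triggered_actions(readings, thresholds=THRESHOLDS):
    """Return the pre-agreed action for every reading outside its band."""
    actions = []
    for metric, value in readings.items():
        rule = thresholds.get(metric)
        if rule is None:
            continue  # no threshold agreed for this metric
        too_low = "min" in rule and value < rule["min"]
        too_high = "max" in rule and value > rule["max"]
        if too_low or too_high:
            actions.append((metric, rule["action"]))
    return actions


actions = triggered_actions({"override_frequency": 0.01,
                             "unintended_action_rate": 0.002,
                             "rollback_minutes": 45})
```

Override frequency carries a lower bound as well as an upper one, reflecting the earlier point that a persistently low rate is itself worth investigating.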
In practical terms: override frequency and unintended action rate belong in every sprint review. Rollback time and incident recovery time belong in every post-incident review and feed directly into design changes. Confidence calibration and task success rate inform whether the agent is ready to climb the Autonomy Ladder or needs to stay where it is.
The teams furthest along treat the observation phase as a design brief, not a postmortem. Launch is not the end of the product lifecycle. It is the moment the observation phase begins.
- The Six Instruments are sprint inputs, not KPIs. If a metric cannot be produced, the product surface was never designed; the absence is the finding.
- Platform emits, PM composes: the platform owes you events; you own the composition into each instrument and its threshold.
- The interface is the instrument: what the interface does not allow to happen, the metric cannot read.
- For fast-consequence actions design a kill-switch, not a dashboard; retrospective observation covers everything else.