Part III · The Craft · Chapter 11

Chapter 11: The Steady State

Six months after the gate, the refund-triage agent is a good citizen.

The suite is green and stays green. The six instruments report what healthy looks like: task success steady, unintended actions at zero for nine straight weeks, override frequency low and flat, rollback untested because nothing has needed rolling back. The supervisor reads the dials every day and the dials say the same thing every day. The team that built the agent has moved to the next agent, which is what teams do, and the weekly status line for this one has become a single word. Stable.

Nothing is failing. And in the queue of cases the agent hands back to humans, something is being said that nobody is positioned to hear.

Four out of every ten escalations are the same case: the goodwill exception, the loyal customer a few days past the window. The agent escalates them because the boundary says it must, and the human reviewers approve nearly all of them, which means several hundred times a quarter a person is being paid to say yes to a question the data has already answered. Meanwhile a particular phrasing keeps arriving in the free text, customers describing items as defective in a specific, slightly off way that the adversarial cases in your golden set taught you to read as gaming, except these customers are not gaming; they missed an event the delivery was for, there is no remedy called “late for the thing I bought it for,” and they have learned that “defective” is the only word the system listens to. And the partial-refund authority the agent has carried since launch, the power to resolve a case with half a remedy, has fired exactly never, because the brief authorized it and no one ever defined when.

None of this is on any dashboard, because dashboards were built to answer one question: is the agent safe. Nobody built the surface that answers the other question: what is the agent learning about your product that you are not hearing. This chapter is about that question, which the field has the data for and has not yet made a practice of asking.

Two reads of the same telemetry

Everything the running agent emits can be read two ways, and the field has built the profession for only one of them.

The first read is defensive. It watches for harm: the drift, the burn rate, the boundary violation, the loop that will not stop. Where teams have built anything at all, they have built this read properly: the six instruments, a kill switch, a standing seat whose whole week is the running agent. Nothing in this chapter takes any of it back. The defensive read is the floor. You do not get to do anything else until someone’s actual job is watching the running agent with the authority to stop it.

But notice what the defensive read does with a healthy week: it files it. Nine green weeks produce nine identical reports and zero decisions, because the defensive read treats the absence of failure as the absence of information. And the absence of failure is not the absence of information. It is the condition under which the second read becomes possible.

The second read is generative. It treats the running agent as the richest discovery instrument you have ever owned. Consider what you used to pay for the data the agent now produces as exhaust. You ran interview programs to learn what users actually wanted, and got a dozen conversations a quarter, mediated by what people could remember and were willing to say. You shipped surveys into the void. You watched session recordings of people clicking. Now look at the queue. Every escalation is a user interview that wrote itself: a real case, in the user’s own words, that your product’s judgment could not resolve, sorted and timestamped and attached to an outcome. Every override is a senior employee disagreeing with your written policy, on the record, with reasons. Every workaround is a feature request from a user who never filed one, expressed in the most honest currency there is, the words they had to distort to get what they needed. The agent’s day is a thousand structured encounters with what your customers actually mean, and the defensive read files it all under “no incidents.”

Everything you have built so far asks whether the agent is behaving. The steady-state read asks what the behavior is telling you to build. Same telemetry. Different question. Different owner, as it turns out, and the difference is the seam this chapter has to mark.

The hour, run once

The read is not a vibe and it does not need new infrastructure; everything it uses, the defensive layer already collects. What it needs is an hour, and the hour has a shape. Here is one, run on the refund agent, late on a Thursday. The numbers stay illustrative, as in the last chapter. The shape is the tool.

Before you open anything, write three sentences: what you expect this week’s usage to show. Which category you expect at the top of the escalation queue, roughly what override rate you expect and where, what you think has changed since the gate. This is Chapter 8’s commit-first discipline pointed at the fleet, and it does here what it does everywhere: it turns the session from confirmation into measurement. Open the console first and you will find what the console shows you and call it what you expected. Commit first, and the gap between your three sentences and the actual week is a reading on how well you understand your own product in production.

Now open the escalation queue, and do not read cases; read shapes. You expected a long tail, the odd residue too strange for policy. What you find is a spike: four of every ten escalations are the same case, the goodwill exception, the loyal customer a few days past the window, and the reviewers approve nearly all of them. That is not residue. That is a boundary decision waiting in a queue, costing reviewer hours while it waits. The thirty-day window came from a policy written before the agent existed, and your own senior reviewers have been quietly legislating the correction one case at a time. Call this signal escalation clustering, and notice that its release decision arrives with the evidence half-collected: every one of those approved escalations is a labeled case.
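Reading shapes rather than cases can be sketched in a few lines. What follows is a minimal illustration, not a description of any real console: the record layout, the category names, and the ten sample rows are all assumptions, chosen so that the goodwill exception lands at four in ten. The sketch computes, per category, its share of the queue and how often reviewers approve it.

```python
from collections import Counter

# Hypothetical escalation records: (category, reviewer_decision).
# Category names and rows are illustrative, not taken from a real system.
escalations = [
    ("goodwill_exception", "approved"),
    ("goodwill_exception", "approved"),
    ("goodwill_exception", "approved"),
    ("goodwill_exception", "approved"),
    ("damaged_in_transit", "approved"),
    ("damaged_in_transit", "denied"),
    ("wrong_item", "approved"),
    ("duplicate_charge", "approved"),
    ("subscription_dispute", "denied"),
    ("other", "denied"),
]


def escalation_shape(records):
    """Return, per category, its share of the queue and its approval rate.

    Categories come back ordered from most to least frequent, so a spike
    shows up as the first entry rather than being buried in the tail.
    """
    totals = Counter(category for category, _ in records)
    approved = Counter(
        category for category, decision in records if decision == "approved"
    )
    n = len(records)
    return {
        category: {
            "share": count / n,
            "approval_rate": approved[category] / count,
        }
        for category, count in totals.most_common()
    }


shape = escalation_shape(escalations)
top_category = next(iter(shape))
```

A large share paired with a high approval rate is the pattern the paragraph above calls a boundary decision waiting in a queue. Where the thresholds sit is a judgment the sketch does not make for you.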

Next, the override sheet, read in both directions. Where do humans say no to the agent, and where do they keep saying yes to what it hands them? Reviewers approving ninety-plus percent of an escalated category means the boundary is set tighter than the judgment it routes to, which is the agent asking for a promotion in the only language it has. Reviewers reversing the agent’s own decisions somewhere means the opposite, a line looser than the people who own the outcome can live with. The rule for the first case is already on your shelf: autonomy is earned by demonstrated competence in the specific failure modes that matter, never scheduled. This sheet is where the earning becomes visible. The promotion case for auto-approving goodwill exceptions is not an opinion in a roadmap meeting; it is two hundred labeled cases, an override precision number, and an eval set you can grow from production. Call the signal override disagreement, and notice it is what earned, not scheduled, looks like as a workflow rather than a slogan.
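The two-directional read can be sketched the same way. Everything here is an assumption made for illustration: the `CategoryStats` fields, the sample counts, and above all the thresholds (`promote_at`, `tighten_at`, `min_cases`), which a real team would set from its own risk tolerance. The sketch only produces candidates; the release decision stays with the person who owns the gate.

```python
from dataclasses import dataclass


@dataclass
class CategoryStats:
    name: str
    escalated: int           # cases the agent handed to a human
    escalated_approved: int  # of those, how many the reviewer approved
    decided: int             # cases the agent resolved on its own
    decided_reversed: int    # of those, how many a human later reversed


def read_both_directions(stats, promote_at=0.9, tighten_at=0.1, min_cases=50):
    """Split categories into promotion candidates and tightening candidates.

    promote: reviewers approve at least `promote_at` of what is escalated,
             so the boundary may be tighter than the judgment it routes to.
    tighten: humans reverse at least `tighten_at` of the agent's own
             decisions, so the line may be looser than its owners accept.
    Categories with fewer than `min_cases` in a direction are skipped,
    because a rate computed on a handful of cases is not evidence.
    """
    promote, tighten = [], []
    for s in stats:
        if s.escalated >= min_cases:
            if s.escalated_approved / s.escalated >= promote_at:
                promote.append(s.name)
        if s.decided >= min_cases:
            if s.decided_reversed / s.decided >= tighten_at:
                tighten.append(s.name)
    return promote, tighten


# Illustrative numbers only.
sheet = [
    CategoryStats("goodwill_exception", 200, 188, 0, 0),
    CategoryStats("standard_return", 20, 10, 500, 5),
    CategoryStats("damaged_item", 60, 30, 300, 45),
]

promote, tighten = read_both_directions(sheet)
```

The `min_cases` floor is the code-level version of earned, not scheduled: a category does not become a candidate until there are enough labeled cases to argue from.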

Then a stranger comparison, two lists side by side: every authority the agent holds, against the action log. The partial-refund power, granted at launch, has fired exactly never. An authority that never fires is one of two problems wearing the same line item. Either it should not exist, in which case it is blast radius waiting for a failure mode, the over-broad unused credential that security reviews exist to catch, or it should exist and cannot fire because the brief never defined its trigger, in which case there is a remedy your product offers in theory and withholds in practice. Both answers are product decisions. Call it the untouched boundary, and note that this is the cheapest audit you will ever run: two lists, one diff, every line that never fires either scope to revoke or behavior to specify.
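The two-list diff is small enough to show whole. This is a sketch under stated assumptions: the authority names and the action log are invented for illustration, and a real log would be read from wherever the agent's actions are recorded.

```python
# Hypothetical list of authorities granted in the brief.
granted = {"full_refund", "partial_refund", "escalate", "request_photo"}

# Hypothetical action log: one entry per action the agent actually took.
action_log = [
    "full_refund",
    "escalate",
    "full_refund",
    "request_photo",
    "escalate",
]


def untouched_authorities(granted, action_log):
    """Return every granted authority that never appears in the action log."""
    return sorted(set(granted) - set(action_log))


never_fired = untouched_authorities(granted, action_log)
```

The diff only tells you which lines never fire. Whether each one is scope to revoke or behavior to specify is the product decision the paragraph above describes.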

Fourth, trawl the free text for what users are learning to say. The phrasing that keeps arriving, “defective” written in a specific, slightly off way, reads like the gaming your golden set taught the agent to resist. Except these customers are not gaming. They missed an event the delivery was for, there is no remedy called late-for-the-thing-I-bought-it-for, and they have learned that “defective” is the only word the system listens to. Users bending vocabulary around a need is the oldest move in product usage, and in an agentic product they write it down, in sentences, daily. Your product is now trained to defeat its own most articulate feature request. Call this the workaround trace, and read it as some of the most honest discovery you will collect this quarter: a remedy that does not exist, named by the people who need it, in the disguise they were forced to wear.

Last, watch the humans who work alongside the agent, because their behavior is a measurement nobody enters into a system. Which outputs have the reviewers stopped checking, and which do they still redo by hand? The parallel spreadsheet that quietly retired in month three is trust, earned and banked. The one still maintained in month nine, through nine green weeks, is a verdict on your supervisory design, delivered by the only jury that counts, and it is telling you which surface, which explanation, which calibration display would earn the spreadsheet’s retirement. Call it the trust topology, the deference data of your own disagreement ledger, collected at workforce scale.

Then close the hour with the only output that matters: one page, dated. The three commit sentences and what the week actually showed. One line per signal. And a single recommendation, stated as a release decision, or the explicit sentence “no move this week.” A boundary to shift with its labeled cases. A behavior to specify for the authority that never fires. A remedy to spec from the workaround cluster. A surface to build where trust has stalled. The steady-state page joins the record you have been keeping since Part II, and a quarter’s worth of these pages, read in sequence, is something close to a history of your product written from what it did rather than from what the team intended.
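If it helps to see the page as a structure, here is one possible shape. The class name, field names, and rendering format are all assumptions made for illustration; the only parts taken from the text are the contents of the page and the default recommendation, which is the explicit "no move this week."

```python
from dataclasses import dataclass, field
from datetime import date

NO_MOVE = "no move this week"


@dataclass
class SteadyStatePage:
    week_of: date
    expected: list[str]  # the three commit sentences, written before opening anything
    observed: list[str]  # what the week actually showed
    signals: dict[str, str] = field(default_factory=dict)  # one line per signal
    recommendation: str = NO_MOVE  # a single release decision, or the explicit no-move

    def render(self) -> str:
        """Render the page as plain text, one section per heading."""
        lines = [f"Steady-state read, week of {self.week_of.isoformat()}"]
        lines.append("Expected:")
        lines += [f"  - {sentence}" for sentence in self.expected]
        lines.append("Observed:")
        lines += [f"  - {sentence}" for sentence in self.observed]
        lines.append("Signals:")
        lines += [f"  {name}: {note}" for name, note in self.signals.items()]
        lines.append(f"Recommendation: {self.recommendation}")
        return "\n".join(lines)


# Illustrative content only; the date is arbitrary.
page = SteadyStatePage(
    week_of=date(2025, 1, 9),
    expected=["Long tail at the top of the queue."],
    observed=["Four in ten escalations are goodwill exceptions."],
    signals={"escalation clustering": "goodwill exception, approved nearly always"},
)
text = page.render()
```

Defaulting the recommendation to the no-move sentence keeps the page honest: a week with nothing to release still produces a dated record.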

Five signals, then, for the recap your notes will want: the cluster, the disagreement, the untouched boundary, the workaround, the trust topology. None of them appears on a dashboard built for harm. All of them were in the refund agent’s exhaust by month six, waiting for someone whose question was not “is it safe.”

Product managers in regulated health industries will recognize this scenario, because it is where the read most obviously changes a roadmap. A remote-monitoring product is generating exactly the exhaust this chapter describes. The dashboards are green: engagement healthy, alerts firing, retention holding. What the defensive read files as “working” is hiding a cluster in the part of the data nobody has been assigned to read, a population of users repeatedly bumping into a boundary the product drew conservatively at launch and never revisited, doing a manual workaround to get what they need because the supported path withholds it. The boundary was the safe call on day one and has quietly become the wrong call by month six, and the only place it shows is in the behavior of the people working around it. The roadmap move that follows does not come from an interview program or a survey. It comes from reading what the running product is already telling anyone whose question is not “is it safe” but “what is this teaching us to build.”

The loop closes

Notice what the steady-state read does to the shape of Part III, because the three crafts you have built are not a list. They are a cycle, and this chapter is the arc that closes it.

The brief defined the agent’s behavior and its boundaries, with an eval set as its teeth. The gate tested the behavior against the brief and you signed it. The steady state reads what production did to both: which boundaries the world is pushing on, which cases the eval set never imagined, which judgments the humans keep correcting. And those readings are the raw material of the next brief. The goodwill cluster becomes a boundary revision with two hundred labeled cases attached. The workaround trace becomes a new behavior specified in the next Executable Brief. The production cases that surprised everyone walk into the golden set, which is how the last chapter’s claim, the golden set as the new backlog, stops being a metaphor and becomes a workflow: the suite grows the way a backlog grows, from the field, case by case, each one a hypothesis the next gate will test. Brief, gate, read, brief. The agent does not just execute the product. Run this loop and the agent becomes the chief instrument by which the product learns what it should become.

This is also, quietly, the answer to a fear that has been sitting under this whole book. The fear is that the practitioner who maintains all this judgment has nothing left to point it at, that the agent ate the job. Look at the loop again. Every turn of it ends in a decision only a human with domain judgment and product authority can make: where the boundary moves, what the remedy is, whether the trust is earned. The agent multiplied the evidence available for those decisions a thousandfold and did not move one of them a millimeter closer to making itself. The steady state is not the agent running unattended. It is the largest standing supply of judgment calls your career has ever offered, arriving weekly, pre-sorted, in the words of your own customers.

Whose hour it is

One seam to mark before this chapter hands off, because unmarked seams are where the expensive failures live.

The defensive read has an owner; the supervision column staffs it, and the watching-and-stopping work belongs there. The eval owner grades; the mechanics of the suite belong there. The steady-state read belongs to neither, and assigning it to either breaks it. Hand it to the supervisor and it becomes anomaly hunting, because that is what the supervisor’s instruments and incentives are built for; the goodwill cluster does not look like an anomaly, it looks like Tuesday. Hand it to the eval owner and it becomes coverage analysis, cases the suite is missing, which is one signal of five. The read is discovery, and discovery against running behavior is product work: it requires the seam vantage, the person who holds the brief in one hand and the customer in the other and can recognize that an escalation pattern is a boundary decision and a workaround is a feature request. That is the gate owner. That is you. One hour a week, on the calendar like the regime in Chapter 8, and defended the same way, because it will lose the negotiation with any individual Tuesday and win it over any quarter.

The supervision column watches so that you can read. That is the deal, and both halves of it have to be staffed.

The read, though, is only as good as the instruments you bring to it. You just spent a chapter reading production through five signals that did not exist in any framework you were trained on, and that should raise a question about the rest of the kit: which of the instruments in your head still measure anything, and which only produce confident-looking artifacts about a world that has quietly stopped existing. Somewhere on a wall near you, beautifully rendered in three colors, hangs a map of a journey nobody takes anymore.

Eval Literacy for the Gate Owner Which Frameworks Still Hold