Part VI · Carry the Weight · Chapter 22

Chapter 22: A Day in the Life of the Agentic PM

The book opened on a week and closes on one. Not the before-and-after of the field snapshot, but a single week lived forward, where the hours the agent freed have gone into the work the rest of the book named. The shape of that week is the argument: the job did not shrink, it shifted, and the new center of gravity is judgment, supervision, and the weight of a decision no human in the building actually made.

The whole book has been a set of disciplines; the question this chapter answers is what they look like assembled into five ordinary days. The week below is a composite, close enough to real that you will recognize your own in it, and it is deliberately not a heroic week. Nothing in it goes catastrophically wrong. That is the point. This is what the week looks like when the disciplines are working, and it is quieter and more demanding than the week when something is on fire.

Monday: deciding what should exist

The week opens with a proposal that has momentum and a champion, which is the dangerous kind. Someone senior wants an agent to handle a class of exceptions that a small team currently works by hand, and the demo a vendor showed last week was impressive. Three years ago Monday would have gone to a status deck. This Monday goes to the suitability sheet, and the first hour is spent not building anything but stating the repeating problem in one sentence with no mention of AI, which is harder than it sounds and kills more bad ideas than any test that follows. The problem survives that sentence. Then come the four tests and the honest cost model (the architecture multiplier rather than the model’s sticker price), run against the task’s value and volume. The agent clears three tests and stumbles on the fourth, the measure of good output, because the team can compute a quality score but no one would actually stake the decision on it. That is the finding. The morning’s deliverable is not a green light; it is a written sentence naming exactly where the idea is weak and a decision to defer until the measure is one you would defend to your CFO. The afternoon’s harder conversation is telling the champion that the demo was real and the product is not yet, which is a different sentence and the one the job now turns on.

Tuesday: a prototype to decide, and a team to align

A different bet, further along, where the question is desirability rather than suitability: will the people with this problem actually want the thing. Tuesday morning is a vibe-coded prototype, built in an afternoon’s worth of hours, and the discipline is remembering what it is for. It is a learning instrument, not a demo and not a product, and the one thing that keeps it honest is naming the riskiest assumption and the specific result that would kill the idea before building it. The prototype goes in front of someone who has the problem, and they reach for something it does not do, which is the most useful hour of the week because it is data a spec review could never produce. The afternoon is the seam. The prototype proved its point, and the work now is handing engineering the learning, not the code, and walking the runtime ownership map with the people who own the domains the PM does not: who owns the kill switch, who owns rollback, who designs the approval moment, which domain expert certifies the correctness no eval can. By the end of the day every domain has a name against it, and the two that resolved to “me” or to “no one” are the real output of Tuesday, because those are the domains about to fail.

Wednesday: designing the behavior

Wednesday is the day most people would not have recognized as product work three years ago. An agent is being built, and the job is designing what it does, when it asks permission, what trace it leaves, and what happens when it is wrong. The morning is the four runtime artifacts written as four sentences, and the test is brutal in its simplicity: the artifact you cannot state in one sentence is the one you have not designed. The boundary sentence is easy; the recovery sentence is the one that exposes how little thought has gone into the agent being wrong mid-action. The afternoon is the oversight decision, output class by output class: which of these need a human looking at each individual one before it executes, and which are fine on aggregate review. The decision turns on harm asymmetry and reversibility, and for the two output classes that land on transaction-level review, the real question is the uncomfortable one: how many seconds the reviewer will actually have. A transaction-level gate funded like a sampling budget is the performative review the book warned about. Wednesday ends with a security pass that did not exist as a discipline two years ago and now has frameworks: every external input untrusted by default, every tool’s blast radius bounded, the destructive actions set to require approval by construction rather than by a sentence in the prompt.

Thursday: operating the loop

Thursday is the half of the job that begins at launch, and it is the day the agents already in production get their attention. The morning starts with the boring number that matters most: the three ceilings on the highest-volume agent (tokens, requests, wall-clock), and the confirmation that each is enforced before the call rather than logged after, because the difference between those two is the difference between a brake and a receipt. The burn-rate alert has a name attached to each tier, and the name at the emergency tier is a real person, not a shrug. Midday is the observation review of the six instruments. The one that earns its keep this week is override frequency: it has drifted low on an agent that has been reliable for months, which is either good news or the early signal that the supervisor has stopped really looking. The afternoon is the slow, unglamorous discipline the operate chapters kept insisting on. First, a re-calibration date on the monitoring instruments, because a frontier model shipped six weeks ago and the instruments built before it are now measuring an agent that no longer quite exists. Second, a check that the consequential decisions are leaving a sealed record from which a stranger could reconstruct them in five years, because the appeal, when it comes, will come long after the agent has been rebuilt.
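The brake-versus-receipt distinction can be made concrete with a minimal sketch. Everything named here (`Ceilings`, `Budget`, `guarded_call`, the token estimate passed in by the caller) is a hypothetical illustration, not the API of any particular agent platform. The only point it demonstrates is ordering: the check runs, and can refuse, before the call is made.

```python
import time
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised before a call is made, so the call never happens."""


@dataclass
class Ceilings:
    # Hypothetical limits; real values come from your own cost model.
    max_tokens: int
    max_requests: int
    max_seconds: float


class Budget:
    def __init__(self, ceilings, clock=time.monotonic):
        self.ceilings = ceilings
        self.clock = clock
        self.started = clock()
        self.tokens_used = 0
        self.requests_made = 0

    def check(self, estimated_tokens):
        """The brake: refuse the call if it would cross any ceiling."""
        if self.requests_made + 1 > self.ceilings.max_requests:
            raise BudgetExceeded("requests")
        if self.tokens_used + estimated_tokens > self.ceilings.max_tokens:
            raise BudgetExceeded("tokens")
        if self.clock() - self.started > self.ceilings.max_seconds:
            raise BudgetExceeded("wall-clock")

    def record(self, actual_tokens):
        """The receipt: bookkeeping after the fact, useful but not a brake."""
        self.tokens_used += actual_tokens
        self.requests_made += 1


def guarded_call(budget, call, estimated_tokens):
    """Run `call` only if the budget allows it.

    `call` is assumed to return a (result, actual_tokens) pair.
    """
    budget.check(estimated_tokens)  # raises before `call` runs
    result, actual_tokens = call()
    budget.record(actual_tokens)
    return result
```

A system that only has `record` is logging after the fact. The question to ask of a production agent is whether anything equivalent to `check` sits in the call path and can actually refuse.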

Friday: the human system and the weight

Friday is people. The morning is a supervisor who is not failing at the job so much as drowning in it, because the oversight role was bolted onto someone already full, which means the team shipped the agent and assumed the supervisor, and the assumed product is the one failing first. Fixing it takes time on a calendar and an artifact that person owns, not a memo. Midday turns to a quieter erosion: a skill the agent now performs that a senior engineer used to, and the three questions that decide whether that is leverage or loss. Are the experts still practicing it? Are the juniors still learning it? Is anyone still checking the agent’s version against their own judgment? Two of the three answers this week are no, which is the kind of finding that does not page anyone and compounds for a year. And the week closes where the book’s gravity is heaviest, on the person the product is actually for and will never see: the one inside the error rate, whose sparse record makes the agent confidently wrong about them, who would expect a human to have decided and would be harmed to learn none did. Friday afternoon is not a triumphant close. It is the unglamorous work of naming that person and asking whether anyone designed for them, the work that shows up in no metric and that no one will notice you did until the week you did not.

What the week proves

Look at the week as a shape rather than a list. Almost none of it was the production work that used to own the calendar; the decks drafted themselves, the PRD read like an adversary’s checklist rather than a week of writing, the grooming and the summaries were a check rather than a creation. And almost all of it was hard in a way the old week was not, because every hour was spent on a judgment that is now exposed, a behavior that had to be designed under uncertainty, a system that acts faster than anyone can watch, or an accountability that sits in a gap no person in the building chose to hold. The freed hours did not become free time. They became the work that used to live in the cracks, now with room to breathe, plus a whole category of work that did not exist before. That is the spine, lived rather than argued: the job did not shrink, it shifted, and the new shape is heavier than the old one, not lighter. The reason this book is more practical than theoretical is that the shift is not coming. It is on this week’s calendar, and the only open question is whether the hours went where the work actually is or quietly disappeared into checking the agent.

So map your own next week against the six parts of this book: deciding, prototyping, designing, operating, the human system, and the weight. The part that gets the least of your time is the tell. If it is a deliberate choice, defend it. If it is a default, it is the part of the new job you have not yet started doing, and it is almost certainly the part that will be on someone’s incident review before it is on your calendar.

If you have a live agent and no Channel 2: start here

The book asks you to design a great deal: four runtime artifacts, six instruments, the audit record, the supervisor role, the affected-person analysis. A reader with one agent already in production and no time cannot do all of it at once. So if you do nothing else, do these five things, in this order. They are sequenced by what fails first and hurts most, not by the order of the chapters.

Set the autonomy boundary and make it enforceable before the action, not after. Write down, today, the single line the agent may not cross without a human (the dollar limit, the action class, the irreversible step), and confirm the system stops it at the gate rather than logging it once it has happened. An agent acting past a boundary that exists only on a slide is the failure that arrives fastest.

Make sure there is a reachable kill switch you do not have to ask permission to use. If stopping the agent requires a deploy, a ticket, or a meeting, you cannot stop it in time. This is one afternoon of work and it is the difference between an incident and a catastrophe.

Stand up the two instruments that catch silent failure: task-success-versus-completion, and override frequency. You do not need all six this week. These two tell you whether the agent is doing what it claims and whether the humans around it have quietly stopped watching, which are the two failures that compound invisibly for months.

Name the human who is accountable when the agent is wrong, out loud, in writing. Not a committee. One person who owns the bad outcome. A team that cannot answer this question answers it in the moment by blaming whoever is nearest, and that is how good people get burned for a system nobody designed.

Run the affected-person check once. Ask who absorbs this agent’s output and never sees it, and whether the error rate is the same for them as for everyone else. This is an hour, and it is the one item on the list that no platform will ever do for you.
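The two instruments and the affected-person check reduce to a few ratios, and a rough sketch may help show how little is needed to start. The record shape below is an assumption made for illustration: each case carries a `completed` flag, an `endorsed` field (True, False, or None if no human reviewed it), and a grouping field such as `record` for sparse versus rich customer history. None of these names come from a real platform.

```python
from collections import defaultdict


def completion_and_success(cases):
    """Return (completion_rate, success_rate).

    Completion counts cases the agent finished. Success counts finished
    cases that a human reviewer endorsed, among those actually reviewed.
    """
    total = len(cases)
    completed = [c for c in cases if c["completed"]]
    reviewed = [c for c in completed if c["endorsed"] is not None]
    completion_rate = len(completed) / total if total else 0.0
    if not reviewed:
        return completion_rate, None  # no reviews yet, so success is unknown
    success_rate = sum(1 for c in reviewed if c["endorsed"]) / len(reviewed)
    return completion_rate, success_rate


def override_frequency(reviews):
    """Share of reviews where the human changed the agent's decision.

    `reviews` is a list of booleans, True meaning "overridden".
    """
    return sum(reviews) / len(reviews) if reviews else None


def error_rate_by_group(cases, key):
    """Error rate among reviewed cases, split by the field named `key`."""
    counts = defaultdict(lambda: [0, 0])  # group -> [errors, reviewed]
    for c in cases:
        if c["endorsed"] is None:
            continue  # unreviewed cases say nothing about correctness
        counts[c[key]][1] += 1
        if not c["endorsed"]:
            counts[c[key]][0] += 1
    return {group: errors / n for group, (errors, n) in counts.items()}
```

A wide gap between the two rates from `completion_and_success` is the silent failure in move three. A gap between groups in `error_rate_by_group` is the question in move five. Neither number decides anything on its own; each tells you where a human needs to look.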

To make the list concrete, run it against the refund agent the book has followed. Move one: the boundary is the five-hundred-dollar ceiling and the fraud-and-window triggers, and you confirm the payment service rejects an over-limit refund before it executes, not after. Move two: the kill switch is a flag the refund agent checks before issuing, reachable without a deploy, so you can stop it issuing anything the moment you need to. Move three: you stand up task-success-versus-completion (is “refund issued” actually a refund a manager would endorse?) and override frequency on the escalated cases (have the reviewers quietly started rubber-stamping?). Move four: you name the support lead who owns a wrongful refund, in writing, before the first one happens. Move five: you ask who absorbs a wrong refund decision and never sees it, the customer wrongly denied, the one flagged as fraud who was not, and whether the agent is wrong about them at the same rate as everyone else. Five moves, one afternoon and one hour, on an agent you already understand. That is the whole supervisory layer in its smallest runnable form.
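Moves one and two on the refund agent can be sketched as a single gate that runs before any refund is issued. This is an illustration under stated assumptions, not the book's implementation: `KillSwitch` here is an in-memory stand-in for whatever runtime flag store you can flip without a deploy, the fraud and refund-window triggers are assumed to arrive as precomputed booleans, and treating a refund of exactly five hundred dollars as within the ceiling is a choice made for the example.

```python
from dataclasses import dataclass
from enum import Enum

REFUND_CEILING_CENTS = 500_00  # the five-hundred-dollar ceiling, in cents


class Decision(Enum):
    ISSUE = "issue"
    ESCALATE = "escalate"  # hand to a human reviewer
    HALTED = "halted"      # kill switch is on; nothing is issued


@dataclass
class RefundRequest:
    amount_cents: int
    fraud_flag: bool       # assumed to be computed upstream
    outside_window: bool   # assumed to be computed upstream


class KillSwitch:
    """Hypothetical stand-in for a flag readable at runtime without a deploy."""

    def __init__(self):
        self._stopped = False

    def stop(self):
        self._stopped = True

    def is_stopped(self):
        return self._stopped


def gate(request, kill_switch):
    """Decide before acting. The kill switch is checked first."""
    if kill_switch.is_stopped():
        return Decision.HALTED
    if request.amount_cents > REFUND_CEILING_CENTS:
        return Decision.ESCALATE
    if request.fraud_flag or request.outside_window:
        return Decision.ESCALATE
    return Decision.ISSUE


def process(request, kill_switch, issue_refund):
    """Call `issue_refund` only when the gate allows it."""
    decision = gate(request, kill_switch)
    if decision is Decision.ISSUE:
        issue_refund(request)
    return decision
```

The design choice worth noticing is that `issue_refund` is unreachable except through `gate`. If the payment call can be made from anywhere else in the codebase, the boundary is back on the slide.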

Everything else in this book deepens these five. But if your agent is live and your Channel 2 is an assumption, this is the Monday that stops the assumption from becoming the postmortem.
