Part III · The Craft · Chapter 13

Chapter 13: Operating Inside the Org

Two PMs raised the same concern in the same review, and only one of them is still in the conversation a week later.

The product is an agent that auto-approves vendor invoices under a threshold. The concern, in both their heads, is identical and correct: nobody has decided what happens when the agent pays an invoice it should have held, and there is no fast way to claw a payment back. Dana said it first. She said it the way most of us would, plainly and with conviction: this feels risky, we are moving too fast, we have not thought through what happens when it is wrong. She meant every word. The room heard a brake. The VP nodded, said good flag, let’s not let perfect be the enemy of shipped, and the meeting moved on. Dana is now, quietly, the person who is not bought in.

Marcus had the same concern. He put a one-page document on the table titled “Rollback and recovery for invoice auto-pay,” and it had four short sections and three of them said “open” in red. The room could not move past it the way it moved past Dana, because it was not an objection; it was a gap in the launch, written in the launch’s own language, with the missing pieces named. Engineering started filling in the red sections in the meeting. Marcus is now the person the VP messages before the next review.

Same concern. Same room. Same week. The difference was not courage and it was not seniority. It was the form the concern arrived in. This chapter is about that form, because an organization right now is calibrated to ship, and a PM who cannot deploy judgment into a room calibrated to ship is a maintained capability that never reaches the field. Part II kept your judgment alive. The gate and the brief gave it two places to land. This is the third place, and it is the one where being right is the least of what is required.

Why being right is not enough

The conventional advice is to speak up early, escalate, bring data. It assumes a room calibrated to receive the data. That room does not exist in this cycle. The board wants an agent count. Demos close cycles. The CEO mandate is real and it is pointed at velocity. In that environment a PM who raises architectural or governance concerns in an open meeting does not read as rigorous. They read as slow, and slow is the one thing the room has been organized to punish.

So the move is not to argue harder, because the argument is not where you are losing. The concern is usually correct; Dana’s was. The delivery is what put her on the wrong side of the room. The move is to change what the room is looking at, and there are four ways to do it, each one trading a gut-feeling objection for something the room can inspect, track, or measure. The dialogues that follow are composites, built to show the move, not transcripts of real meetings.

Move one: build the artifact, not the argument

The most reliable way to shift a room is to put a document on the table that the room now has to respond to. For agentic systems the useful documents are the four runtime artifacts that make an agent governable: the autonomy boundary, the approval moment, the audit surface, the recovery workflow. Half-drafted is enough.

Watch the difference in the room.

The argument version: “I really think we’re not ready. There’s too much that could go wrong here and I don’t think we’ve considered the failure cases.” The VP: “What specifically?” The PM: “Just, broadly, the risk profile.” The VP: “Let’s capture that as a risk and keep moving.”

The artifact version: “I drafted the recovery workflow for this. Here’s what we have and here’s what’s open.” She slides over one page. “Boundary’s defined, that’s green. Approval moment’s defined, green. Audit surface, we can reconstruct what the agent did, green. Recovery workflow, red, because right now if it pays a wrong invoice there’s no path to claw it back inside the settlement window. That last one’s the launch blocker, and it’s a two-day spike to close.” The VP, now looking at the page: “Can eng scope the spike this week?”

Nothing changed about the concern. What changed is that it arrived as specification rather than objection, so engineering answers it instead of debating it, security and compliance attach to it because the language matches their intake forms, and the not-bought-in framing collapses, because the PM is not asking for a pause. She is doing the work the launch requires. Artifacts are not agreement. They are forced specificity, and forced specificity is a thing a room calibrated to ship cannot route around, because it looks exactly like shipping.

I learned this the slow way, by first doing it wrong. Early on, watching a team treat security and recovery as the sprint that always moved to the next sprint, I raised it as a concern, repeatedly, in the language of worry, and I became the person whose worry the room had learned to absorb and move past. The thing that finally changed the conversation was not a better argument. It was a one-page document that laid out the four runtime artifacts every agentic system needs before release, the autonomy boundary, the approval moment, the audit surface, the recovery workflow, with each marked done or open. The open rows were no longer my opinion; they were unfinished launch work in the launch’s own language, and the room could not file them under “noted.” The same gap I had been voicing for weeks got scoped and assigned in a single meeting once it stopped being an objection and became a specification with red cells in it.

Move two: translate the concern into CEO math

The CEO is not asking a technical question. They are asking a capital-allocation question, and a concern that does not land as a number does not land. The durable frame is unit economics, because agentic systems have a cost structure that looks nothing like traditional software, and the costs that kill them, human review time, failed-transaction rework, coordination overhead, are exactly the ones the demo hides.

The dialogue:

The risk version: “I’m worried this is going to be expensive to run once the review load shows up.” The CFO, politely: “We’ll watch the run-rate.”

The math version: “Our per-successful-task cost on this agent is about four dollars once you load in the human review on everything it escalates. Our per-handoff cost, when it routes to a person, is about eleven. Those curves cross at roughly forty percent escalation. Right now the agent escalates sixty percent of invoices, because we tuned it conservative for launch, which means at today’s escalation rate it’s a negative-contribution feature. It gets to positive only when we can safely push the auto-approve rate up, and that depends on the recovery workflow we were just talking about.” The CFO, to the CEO: “So the rollback work isn’t a tax, it’s what makes the unit economics turn positive.”

That last sentence is the one the CFO quotes back to the CEO, and it is the sentence that moves, because it reframes the safety work as the thing standing between this product and a positive margin rather than the thing slowing it down. The reference point every CEO already knows is Klarna, the company that went furthest on agentic customer service and then publicly rebalanced back toward humans when agent count stopped being the right metric. You do not have to say slow down. You have to show the curve, and the curve says it for you, in the only language the room’s most senior person is actually optimizing.
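The dialogue does not say how the forty percent crossover is derived, so what follows is only one simple model that would produce it, not the model behind the composite. The eleven-dollar handoff cost comes from the dialogue. The per-invoice agent run cost and the manual baseline are assumptions, picked so the break-even lands near forty percent. The four-dollar loaded figure is left out because the text does not define how it is loaded.

```python
# Hypothetical cost model for "showing the curve".
# HANDOFF_COST is from the dialogue; the other two constants are assumptions.
HANDOFF_COST = 11.00     # dollars per invoice escalated to a person
AGENT_RUN_COST = 0.60    # dollars per invoice the agent touches (assumed)
MANUAL_BASELINE = 5.00   # dollars per invoice handled the old way (assumed)


def blended_cost(escalation_rate: float) -> float:
    """Cost per invoice when a share of invoices is escalated to a person."""
    return AGENT_RUN_COST + escalation_rate * HANDOFF_COST


def contribution(escalation_rate: float) -> float:
    """Savings per invoice versus the manual baseline; negative means a loss."""
    return MANUAL_BASELINE - blended_cost(escalation_rate)


def break_even_escalation() -> float:
    """Escalation rate at which the agent costs the same as the baseline."""
    return (MANUAL_BASELINE - AGENT_RUN_COST) / HANDOFF_COST


if __name__ == "__main__":
    print(f"break-even escalation: {break_even_escalation():.0%}")
    for rate in (0.60, 0.40, 0.25):
        print(f"escalation {rate:.0%}: contribution ${contribution(rate):+.2f}")
```

Under these assumed inputs the contribution is negative at sixty percent escalation and turns positive only below the break-even, which is the shape of the argument the PM makes. With your own numbers the crossover will sit somewhere else; the point is that the room sees a curve rather than a worry.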

Move three: replace “when does it ship” with “what is our rollback time”

Every agentic program is tracking when does it ship, and that question selects for one behavior, launching. It does not select for landing. The single substitution that changes the room is to ask, instead, what is our rollback time when this agent takes a wrong action in production. That one question forces the team to name the observation layer, and six instruments make the answer concrete: task success rate, unintended action rate, override frequency, confidence calibration, rollback time, incident recovery time. If the team cannot answer for even one of them, the team is not ready, and the gap becomes visible without anyone raising their voice.
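The readiness test in that paragraph can be sketched as a small check. The six instrument names are the ones listed above; the function names and the sample readings are hypothetical, and a missing number is modeled as None.

```python
from typing import Dict, List, Optional

# The six instruments named in the text, in the order it lists them.
INSTRUMENTS = (
    "task success rate",
    "unintended action rate",
    "override frequency",
    "confidence calibration",
    "rollback time",
    "incident recovery time",
)


def unanswered(readings: Dict[str, Optional[float]]) -> List[str]:
    """Return the instruments the team has no number for."""
    return [name for name in INSTRUMENTS if readings.get(name) is None]


def ready(readings: Dict[str, Optional[float]]) -> bool:
    """Ready only when every one of the six instruments has an answer."""
    return not unanswered(readings)


if __name__ == "__main__":
    # Hypothetical readings: two instruments answered, four not.
    sample = {"task success rate": 0.97, "override frequency": 0.12}
    print("ready:", ready(sample))
    print("open:", ", ".join(unanswered(sample)))
```

The output of a check like this is the list of open instruments, which is the work between now and ship rather than a verdict on the ship date.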

The dialogue:

The brake version: “I don’t think we should ship until we’re more confident this won’t go wrong.” The VP: “We’re never going to be a hundred percent confident. At some point we ship and learn.”

The rollback version: “I’m not asking to delay the ship date. I’m asking one question we should be able to answer before it. When this agent pays a wrong invoice in production, what’s our rollback time, start to finish, payment out the door to payment recovered?” Silence. The eng lead: “We... don’t have a clean number for that.” The PM: “Then that’s the work between now and ship, not a reason to move the date. What’s the number we’re aiming for, and can we get there by Thursday?”

The PM who asks when does it ship is asking the room’s question and gets the room’s answer. The PM who asks what is our rollback time has changed the question the room is optimizing, from launch to survival, and has done it without once being the person who said no. The cautionary cases are public: agentic deployments that stayed live after their flaws were exposed because nobody had built a fast path to contain them. The primary failure in those was never the model. It was the absence of a specified control surface and a fast-enough rollback. That is the category of failure every concerned PM is actually flagging, whether they name it or not. Name it, as a number, and the room has to look at it.

Move four: use governance as a procurement filter

Internal governance conversations rarely move a shipping team, because internally governance reads as a committee that meets quarterly and slows things down. External governance moves teams, because it reads as revenue. The move is to bring the buyer’s questionnaire into the room.

The dialogue:

The committee version: “We really need to get our governance story together before we ship this.” The VP: “Loop in the governance team, they can review post-launch.”

The procurement version: “Here are the four questions our next enterprise customer is going to put in the contract. Audit surface, rollback SLA, human-review architecture, incident disclosure. Here’s what we can answer today. We can answer audit surface. We cannot answer rollback SLA, which is the same gap we’ve been circling all meeting. With regulated buyers this is a closed-won-or-lost question, not a nice-to-have, and the health systems and banks are already asking it.” The VP: “So this isn’t compliance, it’s a deal blocker on the enterprise pipeline.” The PM: “It’s the same work either way. I’m just saying which budget it comes out of.”

Governance as a committee document is a tax. Governance as a revenue filter is a moat. Present it as the second thing, never the first, and the same control surface you could not get funded as risk reduction gets funded as pipeline enablement. The EU AI Act attaches legal obligation to several of these categories for high-risk systems, which means in some markets the questionnaire is not the buyer’s preference; it is the law’s, and the room understands a law it cannot ship around.

The question underneath all four

Strip the politics away and every room shipping agentic AI right now is quietly asking one question: when this fails in public, what happens to us. That is not a cynical question. It is the correct one, and it is the only question that aligns the CEO, the CFO, the head of engineering, the head of legal, and you. Every one of the four moves is a different way of answering it. The artifact says here is how we contain it. The math says here is what it costs us either way. The rollback time says here is how fast we recover. The procurement filter says here is the obligation we are exposed to. A PM who answers that question, in those four forms, is not the skeptic in the room. They are the person the room turns to when the story breaks. You do not slow the launch down. You make the launch answerable.

The translation nobody wrote down: explaining the agent upward

The four moves get the concern into the room. There is a second communication the job requires and almost nobody has written down, which is explaining the agent’s behavior upward, to someone who controls the budget and does not hold the runtime in their head.

Frame the shift plainly. The agentic enterprise externalizes the fast, intuitive, often-wrong part of reasoning into the machine and leaves the human doing the deliberate, skeptical, slow part, the supervision of outputs that look fluent and might be wrong. Your VP is now, whether they know it or not, the System 2 over your System 1 agent, and they will make resourcing decisions about a thing whose behavior they understand only through you. So the upward translation is its own craft, and it has two parts.

The first is locating the agent on the autonomy ladder in the exec’s language. Not “the agent has high autonomy in this domain,” which means nothing to a VP, but “this agent acts on its own up to a five-thousand-dollar invoice and routes everything above to a human, which means on a normal day it handles eighty percent of volume without us, and the twenty percent it sends up is where your team’s judgment still lives.” That sentence tells a budget-holder exactly what they own and exactly what they are exposed to, in terms they can act on. The autonomy ladder is your map. Their version of it is one sentence about what the agent does alone and what comes to a person.
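A minimal sketch of that one-sentence translation, assuming the boundary is a single dollar threshold as in the example. The five-thousand-dollar limit is from the text; the function names and the sample invoices are hypothetical. Amounts are kept in integer cents to avoid floating-point surprises at the boundary.

```python
from typing import List

AUTO_APPROVE_LIMIT_CENTS = 5_000 * 100  # the five-thousand-dollar boundary from the text


def route(invoice_cents: int) -> str:
    """Return 'agent' when the invoice is inside the boundary, else 'human'."""
    return "agent" if invoice_cents <= AUTO_APPROVE_LIMIT_CENTS else "human"


def exec_summary(invoices_cents: List[int]) -> str:
    """One sentence about what the agent does alone and what comes to a person."""
    if not invoices_cents:
        raise ValueError("need at least one invoice to summarize")
    alone = sum(1 for amount in invoices_cents if route(amount) == "agent")
    share = alone / len(invoices_cents)
    return (
        f"The agent acts alone up to ${AUTO_APPROVE_LIMIT_CENTS // 100:,} and handled "
        f"{share:.0%} of invoices without us; the other {1 - share:.0%} came to a person."
    )


if __name__ == "__main__":
    # Hypothetical day of invoices, in cents.
    day = [12_000, 499_999, 500_000, 500_001, 1_825_000]
    print(exec_summary(day))
```

The boundary lives in one constant and the sentence is generated from it, so the version the exec hears cannot drift from the version the agent runs.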

The second part is expectation-setting before the first incident, not after. The single most valuable conversation you can have with an exec is the one where you tell them, while everything is calm, that this agent will at some point take a wrong action, that this is a property of probabilistic systems and not a sign the project failed, that here is the rate you should expect and here is what happens when it does, the boundary that catches it and the rollback that recovers it. An exec who hears that in calm reads the first incident as the system working as designed. An exec who hears about the failure mode for the first time during the incident reads it as betrayal, and you become the person who did not warn them. Same incident, opposite outcome, and the only variable is whether you set the expectation while you still had the room’s attention and none of its panic.

The post-mortem for agent behavior

Eventually one of the agent’s actions is wrong in production, and how you write up that failure is its own deliverable, distinct from anything you wrote before, because an agent-behavior post-mortem is not an outage post-mortem.

An outage post-mortem asks why the system stopped doing what it was supposed to do. The agent-behavior post-mortem asks the harder question: the system did exactly what it was built to do, and the outcome was still wrong, so where was the judgment supposed to live and why did it not catch this. The system was up the whole time. It was not a bug in the usual sense. It was a decision the agent was allowed to make, made wrong, inside boundaries that either held or did not.

So the post-mortem must contain three things the outage template has no field for. First, whether the boundary held. Did the agent act inside the autonomy it was granted, in which case the boundary itself was set wrong and the fix is a product decision, or did it act outside its boundary, in which case the control failed and the fix is an engineering one. These are different failures with different owners and the post-mortem has to say which. Second, which instrument caught it, or which instrument should have and did not. The six observation instruments exist precisely so that a wrong action is caught by telemetry rather than by a supplier’s phone call, and a post-mortem that does not name the instrument gap has not found the real failure, which is usually that nobody was watching the dial that would have shown it. Third, what changes, stated as a specific edit to the boundary, the eval set, or the instrument, not as a vow to be more careful. A post-mortem whose action item is increased vigilance is a post-mortem that will be re-run, verbatim, in a quarter.

Chapter 8 already showed you the end state of review-shaped motion, the reviewer with seconds per denial and nothing reviewed. The agent-behavior post-mortem is the instrument that keeps the loop honest, because it forces the question of whether the review that was supposed to catch this could possibly have caught it at the speed and the depth it was actually performed. If the answer is no, the fix is not a sterner reviewer. It is a redesigned boundary or a new instrument, and the post-mortem is where that gets decided in writing.

That template, the agent-behavior post-mortem with its three required sections, is the artifact this chapter leaves you, and it joins the record you have been building since Part II.
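One way to keep the three sections from being skipped is to make them fields a record cannot be filed without. The sketch below is an assumption about how that might look, not a prescribed format; the class name, field names, and sample values are hypothetical, while the three sections and the product-versus-engineering split come from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Section three must be an edit to one of these, never a vow to be more careful.
ALLOWED_CHANGE_TARGETS = {"boundary", "eval set", "instrument"}


@dataclass
class AgentPostMortem:
    acted_inside_boundary: bool        # section 1: did the agent stay inside its granted autonomy?
    caught_by: Optional[str]           # section 2: the instrument that caught it, if any
    should_have_caught: Optional[str]  # section 2: the instrument that should have, if none did
    change_target: str                 # section 3: what gets edited
    change_description: str            # section 3: the specific edit

    def owner(self) -> str:
        # Inside the boundary: the boundary was set wrong, a product decision.
        # Outside the boundary: the control failed, an engineering fix.
        return "product" if self.acted_inside_boundary else "engineering"

    def problems(self) -> List[str]:
        """Return what is still missing; an empty list means all three sections are filled."""
        issues = []
        if self.caught_by is None and self.should_have_caught is None:
            issues.append("name the instrument that caught it or should have")
        if self.change_target not in ALLOWED_CHANGE_TARGETS:
            issues.append("state the change as an edit to the boundary, the eval set, or an instrument")
        if not self.change_description.strip():
            issues.append("describe the specific edit")
        return issues


if __name__ == "__main__":
    # Hypothetical write-up whose only action item is vigilance.
    vague = AgentPostMortem(False, None, None, "vigilance", "")
    print(vague.owner(), vague.problems())
```

A write-up whose action item is increased vigilance fails the check on section three, which is the failure the text warns will otherwise be re-run in a quarter.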

The dissent ledger, and when to stop

One more record, and it points back at you rather than the agent.

Every concern you raise in these four forms is a prediction. You are predicting that the gap matters, that the failure mode is real, that the room is moving too fast on this specific thing. Predictions can be scored. So log them, the way the proficiency regime had you log your overrides of the model: the concern, the date, what you predicted would go wrong, and when ground truth arrives, whether it did. Read the ledger quarterly and compute your dissent precision: when you raised a concern, how often were you right.

This is calibration data about you, and it does two jobs. It tells you whether your judgment in the room is sharp or whether you have drifted into reflexive caution, which is its own failure mode, the PM who objects to everything and so is heard on nothing. And it gives your concerns standing, because a PM who can say “the last four times I flagged a rollback gap, three of them became incidents” is no longer offering an opinion; they are quoting a track record, and a track record is an artifact too.

It also tells you when to stop. The proficiency regime had a demotion rule, a threshold written in advance that fires when the numbers go bad, applied to your own reliance on the model. Apply the same standard to your own dissent. If your dissent precision falls, if you are raising concerns that ground truth keeps not validating, that is not a signal to raise them louder. It is a signal that your read of this domain has drifted and your objections need recalibrating before they need amplifying. The room routes around the PM who is always concerned for the same reason it routes around the agent that always escalates: a signal that never varies carries no information. Knowing when to stop dissenting is part of the craft of dissent, and the ledger is how you know, rather than guess.
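The ledger and its stop rule can be sketched in a few lines. The fields follow the text (the concern, the date, the prediction, and the outcome once ground truth arrives). The threshold, the minimum sample size, and all names and entries are hypothetical; the only requirement the text makes is that the threshold be written down in advance.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Hypothetical values, fixed before the quarter starts rather than after reading the results.
RECALIBRATE_BELOW = 0.5
MIN_RESOLVED = 4


@dataclass
class Dissent:
    raised_on: date
    concern: str
    prediction: str
    came_true: Optional[bool] = None  # None until ground truth arrives


def dissent_precision(ledger: List[Dissent]) -> Optional[float]:
    """Share of resolved concerns that ground truth validated; None if nothing has resolved."""
    resolved = [d for d in ledger if d.came_true is not None]
    if not resolved:
        return None
    return sum(d.came_true for d in resolved) / len(resolved)


def should_recalibrate(ledger: List[Dissent]) -> bool:
    """Fire the stop rule only once enough concerns have resolved to mean something."""
    resolved = [d for d in ledger if d.came_true is not None]
    if len(resolved) < MIN_RESOLVED:
        return False
    return sum(d.came_true for d in resolved) / len(resolved) < RECALIBRATE_BELOW


if __name__ == "__main__":
    # Hypothetical ledger: four rollback-gap flags, three of which became incidents.
    ledger = [
        Dissent(date(2025, 1, 14), "rollback gap", "wrong payment not recoverable", True),
        Dissent(date(2025, 2, 3), "rollback gap", "duplicate payout goes unnoticed", True),
        Dissent(date(2025, 2, 20), "rollback gap", "refund path exceeds settlement window", False),
        Dissent(date(2025, 3, 11), "rollback gap", "vendor dispute with no audit trail", True),
        Dissent(date(2025, 3, 28), "escalation load", "review queue overflows", None),
    ]
    print(f"dissent precision: {dissent_precision(ledger):.0%}")
    print("recalibrate:", should_recalibrate(ledger))
```

Unresolved entries are excluded from the score, so a concern counts for or against you only once ground truth has arrived. The minimum sample is there so that one bad call early in the quarter does not trip the rule.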

You can do the job now. You can write the brief that makes the decision instead of deferring it, read the gate instead of the dot, sort the tools that still cut from the ones that only shine, and carry all of it into a room calibrated to ship without becoming the person that room routes around. The capability is built, maintained, and deployed. The last part of this book asks the question that capability cannot answer on its own, which is where the job itself is going, and what it means to hold a seat the market has not yet learned how to price.
