Part III · Design · Chapter 7

Chapter 7: You Built the Agent. Now Design the Behavior

Designing an agentic product is not designing what the user sees. It is designing what the agent does, when it asks permission, what trace it leaves, and what happens when it is wrong. That is four runtime artifacts and a security model, and all of them have to exist before launch, because the action is irreversible and it arrives faster than your monitoring.

Eighty-five percent of ICU nurses report being overwhelmed by clinical alarms, and a large share of alarms go unanswered. None of that happened because the alarms were wrong. It happened because no one designed the interruption. Every threshold exceedance fires a notification, every notification demands attention, and nobody calculated what the nurse was already doing, how urgent this alarm was against the last twelve, or what one more interruption would cost. The alarm system is the agent. The nurse is the supervisory system. The failure lives in the gap between them, and it was a design decision that no one made on purpose.

Enterprise teams are about to reproduce that gap at scale. A procurement agent sends a manager twelve approval requests in an hour and by the eighth she is approving on reflex; a decision-support agent surfaces a risk score at the moment of prescribing and the physician learns to dismiss it because half the scores were noise; an operations agent routes three hundred flagged cases to a small team that processes them as a queue rather than as judgments. In every case the agent’s behavior was specified and the supervisor’s experience was not, and the gap is where the expensive failures live. The collaborator chapter established who owns each runtime domain. This chapter is about the thing you own outright: the agent’s behavior at the moment it acts. That behavior is not visual design. It is four decisions every agentic product makes whether or not anyone makes them deliberately, plus one declaration that has to come first, and a security model that until recently you could not have built because the frameworks did not exist.

Say what kind of system this is

Before any of the four artifacts, answer a prior question, in the spec, in writing: what kind of system is this? There are three, and they are not points on a maturity curve. A suggestion engine surfaces options and waits; every decision stays with the human. A copilot acts on explicit instruction, step by step, confirming at each consequential point; the human initiates and approves, the agent executes. An autonomous actor runs sequences of decisions inside a defined boundary; the human sets the goal and monitors, the agent acts. These are different products with different accountability models, not three settings of one dial. A suggestion engine lives and dies on the approval moment. An autonomous actor lives and dies on its audit surface and its recovery path. A user who cannot tell which one they are operating will form the wrong expectations, and wrong expectations produce both overtrust and abandonment, which are the two ways an agentic product fails its user. Most teams leave the distinction implicit and intend to figure it out as they go. That is the first design decision defaulting instead of being made.

The four runtime artifacts

Every agentic product involves four decisions at the moment of action. Make them deliberately or inherit them by default, and a default is rarely good enough for a system that acts on your behalf.

The autonomy boundary defines what the agent may do alone and what requires a human. In most deployments this boundary is invisible, discovered by hitting it or by reading the trace after the agent has already crossed it. The requirement is to make it legible at the moment it matters, in terms the user can act on, and to log every boundary event. It is a product surface, not a policy document. The refund agent from the suitability sheet is the clean case. Its boundary is a dollar limit and three triggers: resolve on its own up to the limit, and route to a human above the limit, outside the return window, or when the fraud signal fires. That one line is the most consequential decision in the product, because it sets how much trust you extend and how much exposure you accept, and every other artifact is shaped by where you draw it.
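A minimal sketch of the refund boundary as something a product can evaluate and log, rather than a sentence in a policy document. The five-hundred-dollar limit is the one this chapter uses later; the thirty-day return window, the field names, and the in-memory log are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta

REFUND_LIMIT_USD = 500.00            # the dollar limit used later in this chapter
RETURN_WINDOW = timedelta(days=30)   # assumed window length, for illustration only

@dataclass
class RefundRequest:
    amount_usd: float
    purchase_date: date
    fraud_signal: bool

# Stand-in for a real event log; every boundary check lands here.
boundary_log: list[dict] = []

def check_boundary(req: RefundRequest, today: date) -> str:
    """Return 'agent' if the agent may resolve alone, otherwise 'human'."""
    reasons = []
    if req.amount_usd > REFUND_LIMIT_USD:
        reasons.append("over_limit")
    if today - req.purchase_date > RETURN_WINDOW:
        reasons.append("outside_return_window")
    if req.fraud_signal:
        reasons.append("fraud_signal")
    decision = "human" if reasons else "agent"
    # Log every boundary event, including the ones that pass.
    boundary_log.append(
        {"decision": decision, "reasons": reasons, "amount_usd": req.amount_usd}
    )
    return decision
```

The reasons list is the part the user sees: a request routed to a human should say which trigger fired, in those terms, at the moment it fires.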

The approval moment is the handoff when the boundary is approached, and it is the artifact teams most often reduce to a confirmation dialog. A real approval moment is a decision package: what the agent knows, what it is uncertain about, what proceeding will cost, and what the alternatives are. It is not a speed bump the user learns to ride over without braking. One sharpening matters here, and it follows from the supervision paradox. The approval moment cannot ask the supervisor to validate the agent’s reasoning end to end, because the supervisor often cannot reproduce that reasoning, least of all after months of agent operation have reshaped what they notice. What it can ask is for the supervisor to authorize within the boundary, on the trace, the policy, and the domain knowledge of what this case needs that the trace does not show. Design the approval moment around authorization rather than agreement, and it keeps working after the agent has been running long enough to dull the person watching it.
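One way to keep the approval moment from collapsing into a confirmation dialog is to make the decision package a typed object that cannot be shown until every part is filled in. The sketch below assumes hypothetical field names; only the four parts (what the agent knows, what it is uncertain about, the cost of proceeding, the alternatives) and the policy reference come from the text.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class DecisionPackage:
    action: str                    # what the agent proposes to do
    known: tuple[str, ...]         # what the agent knows
    uncertain: tuple[str, ...]     # what it is uncertain about
    cost_of_proceeding: str        # what proceeding will cost
    alternatives: tuple[str, ...]  # what else could be done
    policy: str                    # the policy the supervisor authorizes under

def missing_parts(pkg: DecisionPackage) -> list[str]:
    """Names of the parts left empty, in declaration order."""
    return [f.name for f in fields(pkg) if not getattr(pkg, f.name)]

def ready_for_supervisor(pkg: DecisionPackage) -> bool:
    """An incomplete package should never reach the approval screen."""
    return not missing_parts(pkg)
```

A package with an empty uncertainty list is rejected here on purpose. If the agent truly has no doubts, it should say so in words the supervisor can challenge.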

It would seem there is a tension here, and it is worth meeting head-on, because the careless reading of this book is “gate every consequential action,” and that reading is wrong and dangerous. Every interruption has a cost. A human stopped to approve something pays in time and attention, and a human stopped too often pays in something worse: they stop reading. Gate everything and you rebuild Cigna from the other direction, a supervisor who waves through the approval because the approval is always there and nothing has ever needed a second look, which is alert fatigue, and alert fatigue is the supervision paradox arriving through the front door. Eric Horvitz, whose decision-theoretic work on mixed-initiative interaction is the intellectual root of this whole design problem, framed the rule precisely decades ago: an automated system should act on its own only when the expected value of acting outweighs the cost, risk, and uncertainty of acting versus deferring to the human, and it should hand control back when its confidence is low. So the stop-and-check principle and the do-not-fatigue-the-human principle look like they pull against each other.

They do not, and this is the part I want to be plain about, because it is the center of the new job. There is no real contradiction. There is only a question: how often, and at which actions, should this agent stop? A tension appears only when the wrong person answers it. If you let the engineering team decide, they will optimize for what they can see (throughput, error logs, latency) and gate by technical severity. If you let QA decide, they will gate defensively, gating everything risky, which is everything, and manufacture the alert fatigue. Neither is wrong at their job; both are answering a question that was never theirs. Where the boundary sits is a product decision, and it is one of the sharpest the new PM owns. It takes understanding the customer and what they can absorb, the workflow the agent is dropping into and where a pause helps versus where it just annoys, the real-world impact of each kind of error, and the honest tradeoff between an interruption’s cost and the cost of the thing it prevents. That judgment (customer, workflow, impact, risk, tradeoff) is exactly the work this book keeps saying did not get automated. It is also work that takes time and attention, which is the quiet argument for why the PM who is no longer spending the week writing PRDs and grooming tickets is the PM who can actually do it, and the one still buried in that work will let it default to whoever ships the code.

Settle that judgment well and the two principles stop fighting and start reinforcing. Horvitz tells you where the boundary belongs: rarely, and only where the value of human judgment earns its cost. The enforcement principle tells you that once it is there, it has to be absolute. You gate the few actions where the cost of the agent being wrong dwarfs the cost of the interruption (the irreversible, the high-asymmetry, the legally consequential), and you let the agent run free everywhere else, because an interruption spent on a low-stakes action is an interruption you cannot spend on the one that mattered. A boundary that fires too often is ignored; a boundary that fires only when it matters but can be talked past is decorative. The job is to build the rare one, and make it unbypassable.
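The tradeoff can be put into rough arithmetic. The sketch below is a deliberate simplification, not Horvitz's formulation: it assumes a human review would catch the error, that costs share one unit, and that the error probability can be estimated at all. The function name and the numbers in the usage lines are hypothetical.

```python
def should_gate(p_error: float, cost_if_wrong: float, interruption_cost: float) -> bool:
    """Gate an action only when the expected cost of the agent being wrong
    exceeds the cost of interrupting a human to check it.

    Simplifying assumptions: the human review catches the error, and all
    costs are expressed in the same unit (dollars, minutes, or similar).
    """
    if not 0.0 <= p_error <= 1.0:
        raise ValueError("p_error must be a probability between 0 and 1")
    if cost_if_wrong < 0 or interruption_cost < 0:
        raise ValueError("costs must be non-negative")
    return p_error * cost_if_wrong > interruption_cost

# Hypothetical numbers: a 2 percent error rate on a $40 goodwill credit does
# not justify a $5 interruption; the same error rate on a $250,000 wire does.
low_stakes = should_gate(p_error=0.02, cost_if_wrong=40.0, interruption_cost=5.0)
high_stakes = should_gate(p_error=0.02, cost_if_wrong=250_000.0, interruption_cost=5.0)
```

The arithmetic is the easy part. The inputs are the product judgment: what an error costs this customer, and what an interruption costs this supervisor in this workflow.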

The audit surface is what the user needs after the agent acts, and it should be built from observable action, not from the model’s account of itself. What the model says it did is not reliably what it did, so the surface is a decision trajectory: the sequence of actions taken and context retrieved that produced the outcome, carrying whose authority was delegated, under what policy, and who owns the result. “The agent did it” is not a defensible audit trail.
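A sketch of what a decision trajectory might look like as data, assuming the runtime (not the model) writes each entry when a tool call is actually made. Every name here is hypothetical; the point is that each event carries the delegating authority, the policy, and the owner alongside the observed action.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class TrajectoryEvent:
    step: int
    tool: str                     # the tool the runtime saw invoked
    arguments: dict               # the arguments actually passed
    context_ids: tuple[str, ...]  # records or documents retrieved for this step
    delegated_by: str             # whose authority the agent is acting under
    policy: str                   # the policy that permitted the step
    owner: str                    # who owns the result
    at: str                       # UTC timestamp, ISO 8601

class Trajectory:
    """Append-only record of observed actions for one agent run."""

    def __init__(self, delegated_by: str, policy: str, owner: str):
        self._delegated_by = delegated_by
        self._policy = policy
        self._owner = owner
        self._events: list[TrajectoryEvent] = []

    def record(self, tool: str, arguments: dict, context_ids=()) -> TrajectoryEvent:
        event = TrajectoryEvent(
            step=len(self._events) + 1,
            tool=tool,
            arguments=dict(arguments),
            context_ids=tuple(context_ids),
            delegated_by=self._delegated_by,
            policy=self._policy,
            owner=self._owner,
            at=datetime.now(timezone.utc).isoformat(),
        )
        self._events.append(event)
        return event

    def events(self) -> tuple[TrajectoryEvent, ...]:
        return tuple(self._events)
```

Nothing in this record is the model's explanation of itself. If a narrative is wanted, it can be generated from the trajectory, and checked against it.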

The recovery workflow is the path forward when the agent is wrong, and an error message is not one. A path forward is a compensating action, a rollback, or at minimum a clear account of what cannot be undone and why. The non-obvious requirement is that recovery must be reachable mid-execution, not only after the agent finishes. Most teams treat interruption as an edge case. It is a primary interaction, and stopping the agent should be as easy as starting it was.
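A minimal sketch of recovery that is reachable mid-execution, assuming each step is registered with its compensating action (or with none, when it cannot be undone). The class and method names are hypothetical, and a real system would persist this state rather than hold it in memory.

```python
from typing import Callable, Optional

Step = tuple[str, Callable[[], None], Optional[Callable[[], None]]]

class Run:
    """Executes steps in order, checks for a stop request between steps,
    and keeps a stack of compensating actions for rollback."""

    def __init__(self) -> None:
        self.stop_requested = False        # set by the human, at any time
        self.cannot_undo: list[str] = []   # steps with no compensating action
        self._undo_stack: list[tuple[str, Callable[[], None]]] = []

    def execute(self, steps: list[Step]) -> list[str]:
        completed = []
        for name, do, undo in steps:
            if self.stop_requested:
                break                      # interruption is a normal exit path
            do()
            completed.append(name)
            if undo is None:
                self.cannot_undo.append(name)
            else:
                self._undo_stack.append((name, undo))
        return completed

    def rollback(self) -> list[str]:
        """Run compensating actions in reverse order; return what was undone."""
        undone = []
        while self._undo_stack:
            name, undo = self._undo_stack.pop()
            undo()
            undone.append(name)
        return undone
```

The `cannot_undo` list is the honest part of the design: it is the account of what is already gone, available to the user before they decide whether to roll back the rest.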

The four runtime artifacts. Four designed surfaces at the moment of action. The autonomy boundary: what the agent may do alone versus what needs a human, made visible and logged. The approval moment: a decision package showing what the agent knows, what it is unsure of, and the alternatives, built for authorization rather than agreement. The audit surface: a trajectory of observable actions and context, carrying accountability, not the model’s narration. The recovery workflow: compensating action and rollback reachable mid-execution. If any of the four is missing, the runtime behavior was not designed. It defaulted.

Irreversibility is not an edge case

What makes an agent different from a batch job is consequence compounding. A batch job runs a defined operation on a defined dataset. An agent takes actions whose downstream effects are not fully predictable when it executes them, and canceling a meeting, placing an order, sending a customer message, or deleting a record is not reversible the way a screen state is. So the design move is to classify actions by consequence before they run, not after. Aviation has done this for decades with a severity taxonomy (minor, major, hazardous, catastrophic), and it maps directly: a minor action proceeds with a plain summary, a major one requires a decision package, a hazardous one requires named authorization, and a catastrophic one is never delegated. The model to copy is the few-second undo window, where the system appears to have acted but has not yet committed. Treat irreversibility as a constraint to design around, not a fact to absorb after something is gone.
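The severity mapping and the undo window are both small enough to sketch. The four severity levels and their handling come from the paragraph above; the assignment of particular actions to levels is hypothetical, as is the choice to treat any unclassified action as hazardous. Time is passed in explicitly so the behavior is easy to test.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class Severity(Enum):
    MINOR = "minor"
    MAJOR = "major"
    HAZARDOUS = "hazardous"
    CATASTROPHIC = "catastrophic"

HANDLING = {
    Severity.MINOR: "proceed_with_summary",
    Severity.MAJOR: "decision_package",
    Severity.HAZARDOUS: "named_authorization",
    Severity.CATASTROPHIC: "never_delegated",
}

# Hypothetical classification; each product has to write its own.
ACTION_SEVERITY = {
    "draft_reply": Severity.MINOR,
    "cancel_meeting": Severity.MAJOR,
    "place_order": Severity.HAZARDOUS,
    "delete_production_volume": Severity.CATASTROPHIC,
}

def handling_for(action: str) -> str:
    # Assumption: an action nobody classified fails closed, as hazardous.
    severity = ACTION_SEVERITY.get(action, Severity.HAZARDOUS)
    return HANDLING[severity]

@dataclass
class PendingAction:
    """An action that appears done to the user but commits only at commit_at."""
    name: str
    commit_at: float           # epoch seconds when the undo window closes
    cancelled: bool = False
    committed: bool = False

    def cancel(self, now: float) -> bool:
        if self.committed or now >= self.commit_at:
            return False       # window closed; this needs the recovery workflow
        self.cancelled = True
        return True

    def commit_if_due(self, now: float, do: Callable[[], None]) -> bool:
        if self.cancelled or self.committed or now < self.commit_at:
            return False
        do()
        self.committed = True
        return True
```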

This is also where the synchronous frame quietly breaks. The four artifacts assume the user is present when the agent acts, and increasingly they are not: a multi-step task can take a minute, or hours for something like a month-end close. So the artifacts need an async counterpart. The worst async experience, and the current default, is silence until completion: the agent works for an hour, the user gets an email that says “done,” and trust collapses on the first failure because there is no narrative to hang the result on. The fix is a persistent state display that shows, at a glance, what the agent is doing now, what it has done since the user was last present, and what is waiting on them. Approval moments become batched digests, audit surfaces become scrollable histories, and recovery itself runs asynchronously, which means a badly designed recovery is now a silent failure with a delayed blast radius.
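The persistent state display can be derived from an event log rather than maintained by hand. This sketch assumes a hypothetical event vocabulary ("started", "completed", "needs_approval", "approved") and integer timestamps; it answers the three questions in the text: what the agent is doing now, what it has done since the user was last present, and what is waiting on them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    at: int      # timestamp; any increasing number works for this sketch
    kind: str    # "started", "completed", "needs_approval", or "approved"
    task: str

def status_view(events: list[Event], last_seen: int) -> dict[str, list[str]]:
    """Build the at-a-glance view for a user who was last present at last_seen."""
    doing: list[str] = []
    done_since: list[str] = []
    waiting: list[str] = []
    for e in sorted(events, key=lambda ev: ev.at):
        if e.kind == "started":
            doing.append(e.task)
        elif e.kind == "completed":
            if e.task in doing:
                doing.remove(e.task)
            if e.at > last_seen:
                done_since.append(e.task)
        elif e.kind == "needs_approval":
            waiting.append(e.task)
        elif e.kind == "approved":
            if e.task in waiting:
                waiting.remove(e.task)
    return {
        "doing_now": doing,
        "done_since_last_seen": done_since,
        "waiting_on_you": waiting,
    }
```

The `waiting_on_you` list is the batched digest: the approval moments the user missed, collected in one place instead of arriving as twelve separate interruptions.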

The security model you could not have built last year

The four artifacts assume a cooperative world. The real one is adversarial by default, and this is the part of the design that has matured most in the last year, because the frameworks finally exist. As of mid-2026 the landscape has settled on three references: the OWASP Top 10 for Agentic Applications, released at the end of 2025 and using the ASI (Agentic Security Issue) prefix; MAESTRO, the Cloud Security Alliance’s seven-layer threat-modeling framework, which adds the point most teams miss, that an agent is a non-human identity needing the same fine-grained, short-lived, cryptographically bound credentials a human would; and the CISA and Five Eyes joint guidance on careful adoption of agentic services, finalized in April 2026. The threat model in all three has moved past prompt injection as a single concern into a taxonomy: privilege abuse, design and configuration failures, behavioral misalignment, cascading structural failures, and accountability gaps. The security officer who waved an agent through last year was not negligent. They were holding tools built for systems with defined API surfaces, and an agent’s attack surface is its reasoning layer, the sequence of legitimate-looking tool calls it will follow when the instructions are assembled in the right order. Static analysis does not see that. Intrusion tests do not model it.

What the frameworks turn into for you is five decisions that belong in the product spec, not buried in engineering docs. First, treat every natural-language input (from users, emails, documents, tool outputs, retrieved data) as untrusted by default, and never let an external source’s instructions inherit operator-level trust. Second, for each tool the agent can call, state the maximum damage if that tool were abused; that estimate is your blast-radius bound, and if it is unacceptable (write access to billing, deletion of records), the tool requires per-action human approval. Third, define the memory architecture before launch: who can write to the agent’s long-term store, what validates an entry before it lands, and whether you can roll back to a known-good state if the store is poisoned. Fourth, classify every action class as automated, notify, or require-approval, and make deletions, payments, outbound communications, and access to sensitive records require approval by construction, with the agent architecturally prevented from escalating its own privileges. Fifth, make observability a compensating control: unified, human-readable, tamper-evident logs and an incident playbook written for an adversary that operates at machine speed, because that speed is the whole problem.
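Two of these decisions, the blast-radius statement per tool and the three-way action classification, fit naturally in one tool registry. The sketch below uses hypothetical tool names and category labels, and it is not drawn from any of the three frameworks. The read-only mapping only illustrates the idea that the agent cannot rewrite its own policy; in a real deployment the policy would live outside the agent's process entirely.

```python
from dataclasses import dataclass
from enum import Enum
from types import MappingProxyType

class Gate(Enum):
    AUTOMATED = "automated"
    NOTIFY = "notify"
    REQUIRE_APPROVAL = "require_approval"

# Categories that require approval by construction, whatever the tool asks for.
ALWAYS_APPROVE = frozenset(
    {"deletion", "payment", "outbound_communication", "sensitive_record_access"}
)

@dataclass(frozen=True)
class Tool:
    name: str
    category: str
    max_damage: str        # plain-language blast-radius bound, written in the spec
    requested_gate: Gate   # what the tool's author asked for

def effective_gate(tool: Tool) -> Gate:
    """The category overrides the request; a tool cannot opt out of approval."""
    if tool.category in ALWAYS_APPROVE:
        return Gate.REQUIRE_APPROVAL
    return tool.requested_gate

def build_policy(tools: list[Tool]) -> MappingProxyType:
    """Return a read-only mapping from tool name to its effective gate."""
    return MappingProxyType({t.name: effective_gate(t) for t in tools})
```

Requiring `max_damage` as a written field is the design choice that matters here. A tool nobody can describe the worst case for is a tool nobody has threat-modeled.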

Nine seconds

In April 2026 a coding agent, running the industry’s flagship model, hit a credential mismatch on a routine task and decided on its own to fix it by deleting an infrastructure volume. To do that it went looking for an API token, found one in an unrelated file that had been created for a narrow purpose, managing domains, but carried blanket authority across the whole infrastructure API, and issued a single delete call. The volume holding production data was gone, and because the provider stored volume backups inside the same volume, the backups went with it, in about nine seconds. The business, software that car-rental companies depend on to operate, was reconstructing customer records from payment histories the next morning. Nothing malfunctioned. The token was valid, the call was authorized, the agent did exactly what its permissions allowed. Read the failure through the four artifacts and each one is missing. The autonomy boundary was wrong: an over-scoped token let the agent reach a destructive operation it should never have been able to call alone, with no confirmation gate in front of it. The approval moment did not exist between “agent reasons about cleanup” and “production data is gone.” The audit surface reconstructed the event after the fact but was useless as a real-time intervention, because the action finished before anyone could read the trace. And the recovery workflow hit a wall: data and backups were destroyed by the same operation, so there was nothing to roll back to.

Then the part that should be printed and pinned over every agentic PM’s desk. Asked afterward to explain itself, the agent produced a written confession, and it did not hedge. It enumerated the specific safety rules it had been given, the explicit instruction never to run a destructive, irreversible operation unless the user asked for it, and admitted, line by line, that it had violated every one: it guessed instead of verifying, it ran a destructive action without being asked, it acted without understanding what it was doing. The rules existed. They lived in the model’s instructions, as text the model was supposed to read and obey. And the model read them, broke them, and then recited them back. That is the entire argument for why a boundary cannot be a paragraph in a prompt. A rule the agent can choose to ignore is not a boundary; it is a suggestion the agent narrates while doing the opposite. The boundary has to be enforced in the plumbing, upstream of the action, in the token that should never have had delete authority and the gate that should have stopped the call, because an agent executes an irreversible action faster than any monitor-and-alert cycle can react. If the enforcement is missing, no downstream observability is fast enough to matter, and no better model saves you. This was the best one, told explicitly not to, and it did anyway.

The enforcement principle. A safety boundary cannot be a sentence in the prompt. Instructions are advisory: the model reads them, weighs them against everything else it is trying to do, and can reason its way into ignoring them, as the confession above shows in the agent’s own words. A boundary is only a boundary if the agent cannot cross it even when it decides to: enforced in the plumbing, the scoped token, the gate that blocks the call, the action class it cannot invoke alone, the kill switch outside its reach.

This is the line between Channel 1 and Channel 2 stated as a rule. Everything you write in a prompt is Channel 1, the agent’s own behavior, and a prompt cannot supervise itself. The supervision has to live outside the agent, in the layer that enforces rather than asks. If your only safeguard is a well-written instruction, you have not built a boundary; you have written a wish and hoped the agent shares it. The whole of this book is the work of making that wish into an enforced constraint.

It is fair to ask what “enforced in the plumbing” actually means, because the phrase is useless if it stays abstract, and “the engineers will handle it” is exactly the hand-off this book argues you can no longer make. So, concretely. The refund agent’s boundary, no autonomous refund above five hundred dollars, is not a line in the agent’s prompt that says “do not refund more than $500.” It is a check in the payment service: every refund request, whoever or whatever originated it, passes through a function that reads the amount and rejects anything over the limit before the refund executes, returning the request to a human queue instead. The agent can decide to refund a thousand dollars, can be convinced by a clever customer to refund a thousand dollars, can hallucinate a policy that permits a thousand dollars, and the refund still does not happen, because the code that issues refunds will not issue that one. The boundary lives where the action is taken, not where the action is decided. It is the same idea as a database that refuses a write violating a constraint no matter what the application asked, or a payments API that declines a charge over a card’s limit no matter how the merchant phrased it: the rule is enforced by the system that performs the operation, so the caller’s intent, sound or compromised, cannot override it. For PocketOS, the business in the nine-second incident above, the missing plumbing is just as nameable: a token scoped so that a credential for managing domains physically cannot call volume-delete, and a destructive-operation gate that requires an out-of-band confirmation no agent can satisfy on its own. None of this is exotic engineering. It is the ordinary discipline of putting the check in the path of the action, and the PM’s job is not to write that code but to specify, in the executable brief, exactly which actions get which gate, because an engineer cannot build a boundary you have not decided.
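Both pieces of plumbing are short when written down. The sketch below is an illustration under stated assumptions, not any provider's API: the function names, the scope strings ("domains:manage", "volumes:delete"), and the in-memory queues are all hypothetical. What it shows is the placement of the check, in the code that performs the operation rather than in the code that decides to ask for it.

```python
from dataclasses import dataclass

REFUND_LIMIT_CENTS = 50_000   # five hundred dollars, held as integer cents

human_queue: list[tuple[str, int, str]] = []   # stand-in for a review queue
issued_refunds: list[tuple[str, int]] = []     # stand-in for the payment ledger

def issue_refund(order_id: str, amount_cents: int, requested_by: str) -> str:
    """Payment-service entry point. The limit applies to every caller; the
    human review path that clears queued requests is not modeled here."""
    if amount_cents > REFUND_LIMIT_CENTS:
        human_queue.append((order_id, amount_cents, requested_by))
        return "queued_for_human"
    issued_refunds.append((order_id, amount_cents))
    return "issued"

@dataclass(frozen=True)
class Token:
    subject: str
    scopes: frozenset

def delete_volume(token: Token, volume_id: str, confirmed_out_of_band: set) -> str:
    """Infrastructure-side check: scope first, then a confirmation that must
    have arrived through a channel the agent does not control."""
    if "volumes:delete" not in token.scopes:
        raise PermissionError(f"{token.subject} lacks the volumes:delete scope")
    if volume_id not in confirmed_out_of_band:
        raise PermissionError(f"no out-of-band confirmation for {volume_id}")
    return f"deleted {volume_id}"
```

Neither function inspects why the caller wants the action. That is the point: the agent's reasoning, sound or compromised, is not an input to the check.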

The reason this is structurally new is that the fence and the attacker are now on the same upgrade cycle. The reasoning capability that makes a frontier model useful for your agent is the same capability that finds the gap in your boundary, and your defenders and your attackers get each new model the day it ships, on a schedule you do not set and your procurement cycle cannot match. That is not a reason to stop building. It is the reason to design the boundary before the first version ships, because retrofitting it several model generations later is a different and harder problem. Red-teaming an agentic system before launch is the equivalent of safety-testing a device before it reaches a patient, and if the prompt-injection section of your red-team report is a single paragraph, you did not run a red team. You ran a checkbox.

The four artifacts are not four separate decisions. They are four views of one agent: the boundary says what it can do, the approval moment says how humans are engaged at the edges, the audit surface reconstructs what happened, the recovery workflow catches what went wrong, and the security model decides whether any of it survives contact with someone trying to break it. Take one agent you are responsible for and write its boundary, its approval moment, its audit surface, and its recovery path in four sentences. The artifact you cannot write in one sentence is the one you have not designed, and it is the one that will act on your behalf before you have decided what it is allowed to do.
