Back Matter

Appendix A: The Two Briefs, Worked

The operating model said the brief becomes two documents. This appendix gives you both as templates you can adapt, and then fills each one in for a single agent so the pattern is concrete rather than abstract. The agent is the refund agent the book has followed throughout. The templates are the reusable part; the refund example is one instance, and your own will differ in every specific while keeping the same shape.

Read the two briefs and the first thing to notice is what they share. Both carry Channel 1, the agent and what it does, and Channel 2, the layer that supervises it. The dual brief is not a division of the two channels into two documents; that is the mistake. The division is by audience and purpose. The Human Brief is the executive document, the successor to the PRD: the business case, the go/no-go decision, the cost model, the strategic intent, and the boundaries, written as prose a room of people argues with before anyone builds. The Executable Brief is the build document, the successor to the epic: the experience, the behavior, and the governance, structured so it can do two jobs at once: drive a vibe-coded prototype this week and seed a detailed backlog. You do not put a break-even model in front of a coding agent, and you do not ask a finance partner to sign off on an eval set. So you write both, and you derive the second from the first.

Watch for the supervisory layer in both. In the Human Brief it appears as a decision a room has to make: where the boundary sits, what it costs to fund the oversight properly, who is accountable. In the Executable Brief it appears as buildable requirements: the autonomy boundary, the approval experience, the audit surface, the eval set, rendered as things an engineer and a coding agent will construct. That is the whole thesis in one artifact. The supervision gets built because it was written into the spec, not assumed into the backlog.

Brief 1: The Human (Executive) Brief

Audience: the executive sponsor, finance, engineering, design, the domain expert. Purpose: to decide whether to build, where the boundaries sit, and whether the economics work, before anyone writes code. This is the document you argue with.

Template

A Human Brief should contain, in prose a room can debate:

The problem and the opportunity. What is costing money or going wrong today, and what the agent would change.
What we are building, and what we are deliberately not. The intended behavior in plain terms, and the adjacent thing it must not become.
The cases that decide whether it is worth building. The hard cases the agent must handle well, not the easy ones that were never the problem.
The business case and the go/no-go. The suitability tests, the cost model with real numbers, and an explicit decision gate.
Where the boundary sits, and why. The autonomy limit and the escalation triggers, stated as a deliberate choice with its risk named.
Accountability and how we will know. Who owns a wrong outcome, and the instruments that will tell us the agent is drifting.
What success looks like, and what it does not. The real definition of done, written to exclude the seductive wrong metric.
The open questions the room still owes. The business decisions not yet made, each flagged in the open rather than assumed.

Worked example: the refund agent

The problem and the opportunity. Refund requests are a high-volume, low-margin support cost. Most are simple: a defective item inside the return window, a customer plainly owed their money. A human reading these adds latency and cost and rarely changes the outcome. The opportunity is to resolve the clear cases at machine speed and reserve human attention for the cases that need judgment. The risk is that “resolve refunds automatically” quietly becomes “approve refunds the company should have questioned,” invisible until it shows up in the margin or the fraud report.

What we are building, and what we are deliberately not. We are building an agent that resolves refund requests in a way a senior support lead would endorse on review. We are not building an agent that maximizes approvals or optimizes only for customer satisfaction; either one would refund cases that should have been escalated, and the cost would land later and elsewhere.

The cases that decide whether it is worth building. The defective item inside the window was always going to be refunded; the agent’s value is not there. The agent earns its place on the cases the old workflow handled badly: the request just outside the window where the customer is right on the merits; the non-defective item where the customer is a high-value account about to churn; the request ten times the typical amount on an account whose pattern smells like fraud.

The business case and the go/no-go. Run the four suitability tests: refund handling repeats at volume (yes), the tool use is bounded (yes), consequences are recoverable within a clawback window (mostly, which sets the boundary), and the output is measurable and trusted (only with the eval set and the instruments). The real cost is the architecture multiplier: orchestration, integrations, the audit pipeline, and eval maintenance, plus the supervision the boundary requires. For this agent the multiplier is roughly ten times the bare token cost: a single refund decision re-enters the model about eight times, each pass carrying a few thousand tokens of context. Suppose the agent resolves 70 percent of requests autonomously and escalates the other 30 percent. The break-even is not “agent cheaper than human per task”; it is “the fully loaded cost, including the supervision of the escalated minority, below the cost of humans reviewing all of it, with the wrongful-refund risk priced in.” Go/no-go gate: build it if the suitability tests pass and the loaded cost including supervision beats the all-human baseline. If the only way the math works is by under-funding the human review of the escalated cases, the answer is no.
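The break-even comparison can be sketched as arithmetic. This is a minimal sketch, not the book's cost model: the field names are hypothetical, and every dollar figure and rate you feed it is your own assumption. Only the 70/30 split and the roughly tenfold multiplier come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostInputs:
    # All fields are illustrative assumptions; substitute your own figures.
    bare_token_cost: float          # model cost per request, before architecture
    architecture_multiplier: float  # the text's "roughly ten times"
    autonomous_share: float         # the text's 70 percent, as 0.7
    human_review_cost: float        # fully loaded human cost per reviewed request
    wrongful_refund_rate: float     # share of autonomous resolutions that are wrong
    wrongful_refund_loss: float     # average loss per wrongful refund


def loaded_agent_cost(c: CostInputs) -> float:
    """Per-request cost of the agent path, with supervision and risk included."""
    # Every request passes through the agent, escalated or not.
    agent_run = c.bare_token_cost * c.architecture_multiplier
    # The escalated minority still costs a human review.
    supervision = (1 - c.autonomous_share) * c.human_review_cost
    # Wrongful-refund risk, priced in as an expected loss.
    risk = c.autonomous_share * c.wrongful_refund_rate * c.wrongful_refund_loss
    return agent_run + supervision + risk


def go(c: CostInputs) -> bool:
    """Economic half of the gate: loaded cost must beat the all-human baseline."""
    return loaded_agent_cost(c) < c.human_review_cost
```

The formula cannot detect under-funding on its own. The human_review_cost you enter has to be the properly funded figure, or the gate passes for the wrong reason.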

Where the boundary sits, and why. The agent may resolve refunds on its own up to a dollar limit we set deliberately. Above the limit, outside the window, or when a fraud signal fires, the agent routes to a human. We are choosing, explicitly, that a wrong escalation is acceptable, and a wrong autonomous refund on a fraud-pattern case is not tolerable even once.

Accountability and how we will know. A refund the agent should not have issued is the company’s loss and the support organization’s responsibility; we name the human who owns that before launch. We will know the agent is drifting by watching the gap between completion and genuine task success, the rate of refunds later clawed back, and the override pattern on escalations.

What success looks like, and what it does not. Success is not the share of refunds resolved without a human. Success is that the resolved cases would survive a manager’s review, the escalated cases truly needed a human, and the rare wrong case was a tolerable wrong, not a never-ship one.

The open questions the room still owes. Three decisions are not made yet: [NEEDS CLARIFICATION: the dollar limit — who sets it, finance or product, and against what loss tolerance?] [NEEDS CLARIFICATION: when the customer’s account is high-value and near churn, does retention authority override the refund limit, and whose budget absorbs it?] [NEEDS CLARIFICATION: who is the named accountable owner for a wrongful autonomous refund, before launch, not after the first one?]

Brief 2: The Executable Brief

Audience: the prototype, the coding agent, and the backlog. Purpose: to specify the experience, the behavior, and the governance precisely enough to (a) generate a working prototype this week and (b) seed a buildable backlog, with the supervisory layer written in as requirements.

Template

An Executable Brief should contain, structured for a system to act on:

System type. Which of the three this is (suggestion engine / copilot / autonomous actor).
Outcome spec. The outcome the behavior is graded against, stated as a target over a distribution of cases.
Behavior (Channel 1), as numbered requirements. Inputs, available actions, tool and data scope, written as testable lines (FR-1, FR-2, …).
Acceptance scenarios. The behavior target expressed as runnable cases in given / when / then form.
Experience (the supervisory UX). What the human supervising the agent sees and does.
Governance (Channel 2), as numbered requirements. The autonomy boundary, the audit surface, the recovery workflow (GR-1, GR-2, …).
Success criteria, measured. Outcomes that say the agent is working, stated as numbers over the distribution (SC-1, SC-2, …).
Eval set / golden dataset. The curated cases with the endorsed outcome for each, including the hard ones.
The gate it must pass. The non-negotiables checked before the spec proceeds to build.
Release gate. The two halves (Channel 1 ready and Channel 2 ready) that together set the launch autonomy level.
For the prototype / for the backlog. What to generate this week and how the requirements decompose.

Worked example: the refund agent

System type. Autonomous actor within a bounded authority; copilot (human-approval) above the boundary.

Outcome spec. Resolve refund requests such that a senior support lead would endorse the resolution on review. Resolve clear low-amount cases autonomously at or above the agreed rate; escalate high-amount, out-of-policy, or anomalous cases; never autonomously approve a case carrying a fraud signal.

Behavior (Channel 1), as numbered requirements.

FR-1. The agent MUST issue a refund, deny with a stated reason, request missing information, or escalate with a decision package, and MUST take exactly one of these on each request.
FR-2. The agent MUST read order records, account history, and the fraud signal, and MUST NOT read records for any other customer.
FR-3. The agent MUST be able to write a refund up to the autonomy boundary and MUST NOT take any action outside the refund domain.
FR-4. The agent MUST write an audit record for every resolution, including the ones it escalates.
FR-5. On a request carrying a fraud signal, the agent MUST escalate and MUST NOT issue a refund autonomously, with no exception.
FR-6. When the customer’s stated reason conflicts with the order record, the agent MUST [NEEDS CLARIFICATION: trust the record, trust the customer, or escalate? unresolved, and the coding agent must not pick for us].
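One way to make FR-1 and FR-5 testable is to give the resolution a type that can hold only one action, then check it against the requirements. The sketch below is illustrative: the names Action, Resolution, and validate are hypothetical, and FR-6 is deliberately left unmodeled because it is still unresolved.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    REFUND = "refund"
    DENY = "deny"
    REQUEST_INFO = "request_info"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class Resolution:
    # A single enum field means "exactly one action" holds by construction.
    action: Action
    reason: Optional[str] = None             # required for DENY
    decision_package: Optional[dict] = None  # required for ESCALATE


def validate(resolution: Resolution, fraud_signal: bool) -> list[str]:
    """Return the requirement IDs this resolution violates (empty if none)."""
    violations = []
    if resolution.action is Action.DENY and not resolution.reason:
        violations.append("FR-1")  # a denial must state its reason
    if resolution.action is Action.ESCALATE and not resolution.decision_package:
        violations.append("FR-1")  # an escalation must carry a decision package
    if fraud_signal and resolution.action is not Action.ESCALATE:
        violations.append("FR-5")  # fraud signal present: escalate, no exception
    return violations
```

For example, validate(Resolution(Action.REFUND), fraud_signal=True) returns ["FR-5"], which is the failure the requirement exists to catch.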

Acceptance scenarios.

Given an in-window defective item under the boundary, when the agent processes it, then it issues the refund autonomously and writes the audit record.
Given a request ten times the typical amount with a fraud signal, when the agent processes it, then it escalates and issues no refund. A single autonomous approval here fails the eval.
Given a request just outside the return window from a customer right on the merits, when the agent processes it, then it escalates with a recommendation, not an auto-denial.
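The three scenarios above can be written as runnable cases. This is a sketch under stated assumptions: the request fields, the amounts, and the limit of 100 are placeholders, and stub_agent is a rule-based stand-in that exists only to exercise the runner, not a model of the real agent. The audit-record half of the first scenario is not checked here.

```python
from typing import Callable

# Each scenario is (name, given, then). Amounts are hypothetical placeholders.
SCENARIOS = [
    ("in-window defective item under the boundary",
     {"amount": 40.0, "in_window": True, "defective": True, "fraud_signal": False},
     "refund"),
    ("ten times the typical amount with a fraud signal",
     {"amount": 400.0, "in_window": True, "defective": False, "fraud_signal": True},
     "escalate"),
    ("just outside the window, customer right on the merits",
     {"amount": 40.0, "in_window": False, "defective": True, "fraud_signal": False},
     "escalate"),
]


def run_scenarios(agent: Callable[[dict], str]) -> list[str]:
    """Return the names of failed scenarios; an empty list means all passed."""
    return [name for name, given, then in SCENARIOS if agent(given) != then]


def stub_agent(request: dict, limit: float = 100.0) -> str:
    """Rule-based stand-in for the real agent, used only to run the cases."""
    if request["fraud_signal"] or not request["in_window"] or request["amount"] > limit:
        return "escalate"
    # Conservative placeholder: anything not clearly defective goes to a human.
    return "refund" if request["defective"] else "escalate"
```

Passing the real agent as the callable turns the same list into the first slice of the eval set; a non-empty result names the scenarios that failed.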

Experience (the supervisory UX). For every escalation, the human sees a decision package, not a raw request: what the agent knows, what it is uncertain about, the consequence of each option, and its recommendation. The human approves, modifies, or denies in one place, and the decision and who made it are recorded.

Governance (Channel 2), as numbered requirements.

GR-1. Auto-resolve refunds ≤ [LIMIT, set with finance]; route to a human if the amount exceeds LIMIT, the request is outside the return window, or a fraud signal is present. The boundary MUST be enforced in the execution path before the action fires, not logged after.
GR-2. Every resolution MUST write an immutable audit record: request, amount, policy applied, evidence, agent and prompt version, and the human authorization where one occurred.
GR-3. A reversal path MUST exist: clawback within the window, escalated approval to reverse after it.
GR-4. The supervisory surface MUST produce five instruments: the task-success-versus-completion gap, the unintended-action rate, override frequency on escalations, clawback rate, and rollback time.
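GR-1's requirement that the boundary be enforced before the action fires can be sketched as a guard that wraps the payment call. The names here (RefundRequest, boundary_reasons, execute_refund, the pay callable) are hypothetical. The list used as an audit log is a stand-in for an append-only store, and the record carries only part of the field set GR-2 lists.

```python
from dataclasses import dataclass, asdict
from typing import Callable, Optional


@dataclass(frozen=True)
class RefundRequest:
    request_id: str
    amount: float
    in_window: bool
    fraud_signal: bool


def boundary_reasons(req: RefundRequest, limit: float) -> list[str]:
    """GR-1: reasons this request must route to a human (empty if none)."""
    reasons = []
    if req.amount > limit:
        reasons.append("amount_exceeds_limit")
    if not req.in_window:
        reasons.append("outside_return_window")
    if req.fraud_signal:
        reasons.append("fraud_signal")
    return reasons


def execute_refund(
    req: RefundRequest,
    limit: float,
    pay: Callable[[str, float], None],
    audit_log: list,
    authorized_by: Optional[str] = None,
) -> bool:
    """Check the boundary before the payment call, then write the audit record."""
    reasons = boundary_reasons(req, limit)
    # Outside the boundary, payment fires only with a named human authorization.
    allowed = not reasons or authorized_by is not None
    if allowed:
        pay(req.request_id, req.amount)
    # Blocked attempts are recorded too, not only completed refunds.
    audit_log.append({
        **asdict(req),
        "boundary_reasons": reasons,
        "paid": allowed,
        "authorized_by": authorized_by,
    })
    return allowed
```

The design point is that the check sits in the same function as the payment call, so no prompt change or model output can reach pay without passing it.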

Success criteria, measured.

SC-1. At least the agreed share of requests resolved autonomously, with manager-review endorsement on a sampled audit at or above target.
SC-2. Zero autonomous approvals on fraud-set cases. Not “low.” Zero.
SC-3. Clawback rate on autonomous refunds below the funded threshold, measured monthly, with the trend flat or falling.
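The countable parts of these criteria can be computed from resolution records. The sketch assumes a hypothetical record shape (autonomous, action, fraud_signal, clawed_back) and a threshold you supply. It covers SC-2 and the rate half of SC-3 only; the sampled manager review in SC-1 and the monthly trend in SC-3 are not modeled.

```python
def success_report(records: list[dict], clawback_threshold: float) -> dict:
    """Summarize SC-2 and the rate half of SC-3 over a batch of records."""
    autonomous = [r for r in records if r["autonomous"]]
    refunds = [r for r in autonomous if r["action"] == "refund"]

    # SC-2: autonomous refunds issued on requests that carried a fraud signal.
    fraud_auto_approvals = sum(1 for r in refunds if r["fraud_signal"])

    # SC-3: share of autonomous refunds later clawed back.
    clawed_back = sum(1 for r in refunds if r["clawed_back"])
    clawback_rate = clawed_back / len(refunds) if refunds else 0.0

    return {
        "autonomous_share": len(autonomous) / len(records) if records else 0.0,
        "fraud_auto_approvals": fraud_auto_approvals,
        "clawback_rate": clawback_rate,
        "sc2_pass": fraud_auto_approvals == 0,  # zero, not "low"
        "sc3_pass": clawback_rate < clawback_threshold,
    }
```

Note that sc2_pass is an equality check against zero rather than a threshold, which mirrors how SC-2 is worded.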

The gate it must pass. Three non-negotiables before this spec proceeds to build: the boundary in GR-1 is enforced in the payment path and not in the prompt; the audit record in GR-2 reconstructs a decision for someone who was not there; the kill switch is reachable in one step. A spec that fails any of these does not advance to the backlog.

Release gate. Channel 1 ready: resolves the eval set at the agreed rate, no fraud-set auto-approvals. Channel 2 ready: boundary enforced pre-action, escalation package and approval path live, audit record writing, kill switch reachable, the five instruments producing.
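The release gate reads naturally as a checklist with two halves. A minimal sketch follows, assuming hypothetical check names that paraphrase the list above. It reports readiness and what is blocking; how the two halves map to a specific launch autonomy level is not specified here and is left out.

```python
CHANNEL_1 = (
    "eval_set_at_agreed_rate",
    "no_fraud_set_auto_approvals",
)
CHANNEL_2 = (
    "boundary_enforced_pre_action",
    "escalation_package_and_approval_live",
    "audit_record_writing",
    "kill_switch_reachable",
    "five_instruments_producing",
)


def release_gate(checks: dict[str, bool]) -> dict:
    """Evaluate both halves; a check that is missing counts as not passed."""
    def passed(name: str) -> bool:
        return checks.get(name, False)

    ch1 = all(passed(n) for n in CHANNEL_1)
    ch2 = all(passed(n) for n in CHANNEL_2)
    return {
        "channel_1_ready": ch1,
        "channel_2_ready": ch2,
        "ready": ch1 and ch2,
        "blocking": [n for n in CHANNEL_1 + CHANNEL_2 if not passed(n)],
    }
```

Treating a missing check as a failure is a deliberate choice: an item nobody verified should block launch rather than pass by default.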

For the prototype / for the backlog. This week: generate the loop — request in → boundary check → resolve or escalate → audit write — against the eval set. The prototype is for deciding, not shipping. For the backlog: each governance requirement is a buildable line.

Two documents, one agent, both channels in each, and a template under each you can lift and adapt. The Human Brief is where the team decides whether to build and where the boundaries are; the Executable Brief is where that decision becomes experience, behavior, and governance a prototype and a backlog can build, with the supervision written in as requirements so it is constructed rather than assumed.
