Back Matter

Appendix B: The Field Manual

How to use this. Each chapter closed with a question to apply to your own product. This appendix gathers those exercises in one place, in book order, stripped to the doing. Run each one against a real agent, not a hypothetical. The line you cannot complete is the finding, every time.

Part I: Decide

AI literacy: name the four layers. Take one agent on your roadmap and name all four layers: its model, its orchestration, its tools, its context. The layer you can describe least precisely is the layer your quality problems will come from.

Autonomy: place it on the ladder, honestly. Place one product on the Autonomy Ladder at the rung it has earned, not the one the roadmap wants. Then write the single demonstrated competence that would justify moving it up one rung. If you cannot name it, you do not yet have permission to climb.

Suitability: fill the sheet. For one candidate agent, run the four tests (repeats at volume, bounded tool use, recoverable consequences, measurable and trusted) and the break-even. If all four pass and the break-even holds, build it. If any test fails, write the one sentence explaining why you are building it anyway, and notice whether that sentence would survive being read aloud to your CFO.
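
The sheet can be kept as a small record so the decision rule is explicit. What follows is a minimal Python sketch, not a prescribed format. The class and field names are hypothetical. The break-even is taken as a yes/no input because its calculation is not restated in this appendix, and treating a failed break-even the same way as a failed test is an assumption of the sketch.

```python
from dataclasses import dataclass

# The four tests plus the break-even; names are illustrative only.
CHECKS = (
    "repeats_at_volume",
    "bounded_tool_use",
    "recoverable_consequences",
    "measurable_and_trusted",
    "break_even_holds",
)


@dataclass(frozen=True)
class SuitabilitySheet:
    repeats_at_volume: bool
    bounded_tool_use: bool
    recoverable_consequences: bool
    measurable_and_trusted: bool
    break_even_holds: bool  # computed elsewhere; assumed to be a yes/no here
    justification: str = ""  # the one sentence owed if anything fails

    def failed_checks(self) -> list[str]:
        """Names of every check that did not pass, in sheet order."""
        return [name for name in CHECKS if not getattr(self, name)]

    def verdict(self) -> str:
        if not self.failed_checks():
            return "build"
        if self.justification.strip():
            return "build anyway: read the justification aloud"
        return "blocked: write the justification sentence"
```

The code cannot judge whether the sentence would survive the CFO. It only makes it impossible to proceed without having written one.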

Part II: Prototype & Collaborate

How the work splits: walk the shape. For your next agent, walk the six seams. Did you run the suitability test and a separate affected-populations scan before the brief? Is the autonomy boundary set before you design the agent? Is there one human brief and a derived executable brief? Is the agent’s behavior written as outcome-centric specs graded by an eval set? Is the release gate two halves, Channel 1 ready and Channel 2 ready? After launch, who owns the governance stream that never ends?

Vibe coding: prototype to decide. Take one roadmap bet. Name the single riskiest assumption, the one-day prototype that would test it, and the specific result that would kill the idea. Build it, put it in front of someone who has the problem, record what it proved or disproved, and hand engineering the learning, not the code.

Collaboration: name the owner of every domain. For one agent, name the owner of each of the five runtime domains (goal definition, control plane / kill switch, rollback, supervisory UX, correctness certification) and the four launch co-owners. Any domain that resolves to “me” or to no one is an unowned domain about to fail.
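
The ownership check is mechanical enough to script. The sketch below is one possible shape in Python, with a hypothetical function name. It covers only the five runtime domains, because the four launch co-owners are not enumerated in this appendix.

```python
# The five runtime domains named in the exercise.
RUNTIME_DOMAINS = (
    "goal definition",
    "control plane / kill switch",
    "rollback",
    "supervisory UX",
    "correctness certification",
)

# Answers that count as "unowned" under the exercise's rule.
NON_OWNERS = {"", "me", "no one"}


def unowned_domains(owners: dict[str, str]) -> list[str]:
    """Return every domain whose owner is missing, blank, 'me', or 'no one'."""
    flagged = []
    for domain in RUNTIME_DOMAINS:
        owner = owners.get(domain, "").strip().lower()
        if owner in NON_OWNERS:
            flagged.append(domain)
    return flagged
```

An empty list means every domain has a named owner. It does not mean that person knows they own it.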

Part III: Design

Behavior: the four artifacts as four sentences. For one agent, write the four runtime artifacts as four sentences (boundary, approval moment, audit surface, recovery workflow), then run the five security decisions. Any artifact you cannot state in one sentence is undesigned.

Oversight: which kind does each output need? For each output class the agent produces, assign program-level or transaction-level oversight using the four axes (harm asymmetry, reversibility, regulatory exposure, per-review cost). For every output you mark transaction-level, confirm a competent human could exercise judgment in the time actually available.

Evals: run the sign-off as a checklist. For one agent: (1) end-to-end pass rate across ten or more runs, with the worst slice; (2) state validation present for every action with a state change; (3) judge calibrated against a human-labeled failure set; (4) coverage statement; (5) model version policy plus a rehearsed rollback. Any item you cannot produce is a known gap to sign off on explicitly.
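
Item (1) is the one that most often gets reported as a single number. The Python sketch below computes the overall pass rate together with the worst slice. It assumes each run is logged as a (slice name, passed) pair; the function name and the input shape are hypothetical.

```python
from collections import defaultdict


def pass_rates(runs):
    """Summarise end-to-end runs.

    runs: iterable of (slice_name, passed) pairs, one per run.
    Returns (overall_rate, worst_slice_name, worst_slice_rate).
    """
    totals = defaultdict(lambda: [0, 0])  # slice -> [passes, runs]
    for slice_name, passed in runs:
        totals[slice_name][0] += int(bool(passed))
        totals[slice_name][1] += 1

    n_runs = sum(count for _, count in totals.values())
    if n_runs < 10:
        # The checklist asks for ten or more runs before the number means anything.
        raise ValueError(f"only {n_runs} runs; the checklist asks for ten or more")

    overall = sum(passes for passes, _ in totals.values()) / n_runs
    by_slice = {name: passes / count for name, (passes, count) in totals.items()}
    worst = min(by_slice, key=by_slice.get)
    return overall, worst, by_slice[worst]
```

Reporting the worst slice next to the aggregate is the point: a healthy average can hide a slice that fails half the time.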

Part IV: Operate

Guardrails: three ceilings and a name. For one production agent, write the three hard ceilings as numbers: tokens (per run and per rolling window), requests per window, wall-clock per run. Confirm each is enforced before the call, not after. Verify the kill switch lives outside the agent’s runtime. Name the human paged at each burn-rate tier.
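
"Enforced before the call" means the check refuses the request rather than reporting on it afterwards. A minimal Python sketch of that ordering follows. All names are hypothetical, the token figure is assumed to be an estimate made before the request, and the kill switch is deliberately absent because it belongs outside the agent's runtime.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Callable


class CeilingExceeded(RuntimeError):
    """Raised before a call that would break a hard ceiling."""


@dataclass(frozen=True)
class Ceilings:
    tokens_per_run: int
    tokens_per_window: int
    requests_per_window: int
    wall_clock_s_per_run: float
    window_s: float  # length of the rolling window, in seconds


class Guard:
    def __init__(self, ceilings: Ceilings, clock: Callable[[], float] = time.monotonic):
        self.c = ceilings
        self.clock = clock
        self.events = deque()  # (timestamp, tokens) for admitted calls in the window
        self.run_started = None
        self.run_tokens = 0

    def start_run(self) -> None:
        self.run_started = self.clock()
        self.run_tokens = 0

    def admit(self, estimated_tokens: int) -> None:
        """Call BEFORE the model request; raises instead of letting it through."""
        if self.run_started is None:
            raise RuntimeError("call start_run() first")
        now = self.clock()
        # Drop admitted calls that have aged out of the rolling window.
        while self.events and now - self.events[0][0] >= self.c.window_s:
            self.events.popleft()
        window_tokens = sum(tokens for _, tokens in self.events)

        if now - self.run_started > self.c.wall_clock_s_per_run:
            raise CeilingExceeded("wall-clock per run")
        if self.run_tokens + estimated_tokens > self.c.tokens_per_run:
            raise CeilingExceeded("tokens per run")
        if window_tokens + estimated_tokens > self.c.tokens_per_window:
            raise CeilingExceeded("tokens per window")
        if len(self.events) + 1 > self.c.requests_per_window:
            raise CeilingExceeded("requests per window")

        # Reserve the budget only after every check has passed.
        self.events.append((now, estimated_tokens))
        self.run_tokens += estimated_tokens
```

The wall-clock check here only fires at the next call. Stopping a call that is already in flight needs a timeout on the request itself, which this sketch does not show.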

Observation: a threshold per instrument. For one deployed agent, write the threshold for each of the six instruments (task-success-versus-completion gap; unintended action rate; override frequency; confidence calibration; rollback time; incident recovery time). Mark which breach pages someone tonight and which feeds the next sprint.
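
The exercise amounts to a six-row table plus a routing rule. The Python sketch below is one way to write it down. The threshold numbers are placeholders, not recommendations, and the sketch assumes every instrument is expressed so that a higher reading is worse (calibration as an error, for example).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instrument:
    name: str
    threshold: float     # breach when the reading exceeds this value
    pages_tonight: bool  # True: page someone; False: feed the next sprint


# Placeholder thresholds; replace every number with your own.
INSTRUMENTS = (
    Instrument("task_success_vs_completion_gap", 0.05, False),
    Instrument("unintended_action_rate", 0.01, True),
    Instrument("override_frequency", 0.20, False),
    Instrument("confidence_calibration_error", 0.10, False),
    Instrument("rollback_time_minutes", 15.0, True),
    Instrument("incident_recovery_time_minutes", 60.0, True),
)


def triage(readings: dict[str, float]) -> tuple[list[str], list[str]]:
    """Split breached instruments into (page tonight, next sprint)."""
    page, sprint = [], []
    for inst in INSTRUMENTS:
        if inst.name not in readings:
            # A missing reading is itself a finding, so fail loudly.
            raise KeyError(f"no reading for {inst.name}")
        if readings[inst.name] > inst.threshold:
            (page if inst.pages_tonight else sprint).append(inst.name)
    return page, sprint
```

Raising on a missing reading is a deliberate choice in this sketch: an instrument that silently stops reporting should not look the same as one that is within its threshold.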

Silent degradation: date the last re-calibration. For one deployed agent: (1) write the launch-time scope statement and diff it against what the agent does today; (2) list every substrate that can drift and who watches each; (3) date the last re-calibration of every monitoring instrument; (4) put the currency question in the vendor contract; (5) schedule one external audit on a held-out sample.

Audit: build the sealed decision artifact. For one consequential decision class, specify the sealed decision artifact with all six components: decision record, model-version reference, data and retrieval-corpus version, governance record, human-intervention record, appeal record. Store write-once and signed at write time.
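
A rough Python sketch of "write-once and signed at write time" is shown below. The component keys and function names are hypothetical. HMAC with a shared key stands in for whatever signing scheme you actually use, and exclusive-create file mode stands in for real write-once storage; it only prevents overwriting through this code path.

```python
import hashlib
import hmac
import json

# The six components named in the exercise; key names are illustrative.
COMPONENTS = (
    "decision_record",
    "model_version",
    "data_and_corpus_version",
    "governance_record",
    "human_intervention_record",
    "appeal_record",
)


def _canonical(artifact: dict) -> bytes:
    # Stable serialisation so the same content always signs the same way.
    return json.dumps(artifact, sort_keys=True, separators=(",", ":")).encode("utf-8")


def seal(artifact: dict, key: bytes) -> dict:
    """Refuse incomplete artifacts, then sign the complete one."""
    missing = [name for name in COMPONENTS if name not in artifact]
    if missing:
        raise ValueError(f"missing components: {missing}")
    signature = hmac.new(key, _canonical(artifact), hashlib.sha256).hexdigest()
    return {"artifact": artifact, "signature": signature}


def verify(sealed: dict, key: bytes) -> bool:
    expected = hmac.new(key, _canonical(sealed["artifact"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sealed["signature"])


def write_once(path: str, sealed: dict) -> None:
    # Mode "x" raises FileExistsError if the path already exists.
    with open(path, "x", encoding="utf-8") as f:
        json.dump(sealed, f, sort_keys=True)
```

Any later edit to a component makes `verify` return False, which is what lets the artifact serve as evidence and not merely as a log entry.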

Part V: The Human System

Deference: find the disagreement. For one agent where a human reviews the output, find a recent case where the agent and the human disagreed. Which was right, and did the workflow help the reviewer decide or did they defer by habit?

Change management: name the supervisor. Name the supervisor for one agent you have shipped. Is it a real role, with time on a calendar and an artifact they own? Or a responsibility bolted onto someone already full?

The Loop Test. Pick one agent where your design says a human reviews the output, and run the Loop Test: time, skill, attention. If it fails any one of the three, you have found the chapter’s point sitting in your own architecture.

Skill erosion: the three questions. Pick one skill your agent now performs that a human on your team used to perform. Ask: are your experts still practicing it, are your juniors still learning it, and is anyone still checking the agent’s version against their own judgment?

The agent as team member: write the onboarding packet. Write the one-page onboarding packet for one deployed agent: its scope of work, its access scope, who manages it, how you re-evaluate it on the next upgrade, how you decommission it.

Part VI: Carry the Weight

The people the agent never sees. For one agent: (1) name every party affected beyond the user; (2) run a stratified performance analysis across those populations before launch; (3) enumerate the actions the agent will never take regardless of instruction; (4) apply the disclosure heuristic; (5) confirm the moral architecture is enforced in the product, not described in a deck.

The before/after week, and the week ahead. Sketch your week three years ago against now. Then map your next week against the six parts. The part that gets the least of your time is the tell.

One diagnostic question per phase

  • Decide: would the suitability sentence survive being read aloud to your CFO?
  • Prototype: which runtime domain resolves to “me” or to no one?
  • Design: which of the four artifacts can you not state in one sentence?
  • Operate: which ceiling is still unset, and who is paged when it breaks?
  • Human system: is the supervisor a role with time on a calendar, or an assumption?
  • Carry the weight: who is inside the error rate, and did anyone design for them?

Red flags

  • A single green checkmark presented as proof on a non-deterministic system.
  • A “freeze” or a boundary that lives in the prompt rather than the architecture.
  • A monthly cost alert described as a safeguard.
  • A transaction-level review funded at a sampling budget.
  • A monitoring instrument no one has re-calibrated since the last frontier model shipped.
  • A consequential decision with a six-month log retention.
  • A supervisor who is “everyone.”
  • An agent doing work no launch review authorized.
  • An aggregate accuracy number with no one asking who is in the error rate.