Chapter 11 · Field reference


This chapter is reference material. Use it before a build decision, before a launch sign-off, or whenever a sprint review reveals something unexpected. Each section maps to a chapter in this guide and condenses the key decisions into a working tool.

The two threads of this book, probabilistic systems and dual-product responsibility, appear in every section below. The checklists exist because probabilistic systems cannot be verified by single-pass testing, and the dual-product structure cannot be specified without explicit artifacts for both the agent and the supervisor. If any checklist below feels rote, re-read it with the two threads in view. The item is almost always there because of one of them.


1. Agent Candidacy Checklist

(Chapter 3)

Four conditions must hold before committing to build. If any one fails, do more scoping first.

  • Volume: The decision repeats at meaningful scale (daily or weekly, not quarterly). How many times per day or week does this decision get made today?
  • Measurability: You can score a sample of past decisions as right or wrong without significant disagreement. Can you build a ground-truth dataset from historical decisions?
  • Bounded tools: You can enumerate every system the agent needs to access, and nothing beyond. What is on the tool list, and what is explicitly off it?
  • Recoverable consequences: The worst-case failure, at the expected error rate, is survivable. What does recovery cost, in time and money, per incident?

Break-even check. At what monthly task volume does this agent break even against the current best alternative? Is that volume realistic within twelve months? If either question lacks a clear answer, the scoping is not done.

Cost correction. Take the number in the business case. Multiply by 1.5. That is the number to present to finance. The interactive cost calculator in Chapter 3 lets you stress-test the assumptions across the six variables that drive break-even.
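
A minimal sketch of the break-even arithmetic, not a substitute for the Chapter 3 calculator: the cost variables, their names, and the numbers below are illustrative assumptions, and the 1.5 correction is applied to the fixed cost before the division.

```python
def monthly_break_even_volume(fixed_monthly_cost: float,
                              variable_cost_per_task: float,
                              cost_per_task_today: float) -> float:
    """Tasks per month at which the agent's total cost equals today's process."""
    savings_per_task = cost_per_task_today - variable_cost_per_task
    if savings_per_task <= 0:
        return float("inf")   # the agent never breaks even on cost alone
    return fixed_monthly_cost / savings_per_task

# Cost correction: take the business-case number and multiply by 1.5
# before presenting it to finance.
business_case_fixed_cost = 8_000.0    # illustrative monthly cost of running the agent
corrected_fixed_cost = business_case_fixed_cost * 1.5

volume = monthly_break_even_volume(
    fixed_monthly_cost=corrected_fixed_cost,
    variable_cost_per_task=0.60,      # illustrative: inference plus human review
    cost_per_task_today=4.00,         # illustrative: current best alternative
)
print(f"Break-even volume: {volume:,.0f} tasks/month")
# Then the second question: is that volume realistic within twelve months?
```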


2. Autonomy Ladder Rubric

(Chapters 2, 4)

Movement up the ladder is earned, not scheduled. Each promotion requires documented evidence.

Stage | What the agent does | What the PM must design | Evidence required to advance
1. Augmentation | Surfaces options; human decides | Recommendation quality; confidence legibility | Calibration tested; override rate stable; no unintended action incidents
2. Limited automation | Executes bounded actions with approval | Approval moment; decision package; rollback on rejection | Approval friction measured; edge cases handled; error rate within spec
3. Semi-autonomous | Acts within scope; escalates outside it | Escalation triggers; audit surface; recovery workflow | 30 days clean at Stage 2; supervision training complete; incident response tested

Table 11.1. Autonomy ladder: stages, PM design focus, and evidence to advance.

Demotion triggers. Define before launch what causes a step back down: unintended action rate above threshold, override frequency diverging from baseline, incident recovery time trending up. If demotion thresholds are not defined at launch, they will be defined under pressure after the first incident.

The earned-not-scheduled criterion. Movement up the ladder requires demonstrated competence in the specific failure modes that matter for this decision type, a real-time or near-real-time signal that catches errors before they compound, and a defined demotion path. If the team is proposing a count-based or calendar-based promotion, the autonomy is being scheduled, not earned. The Utah Doctronic case in Chapter 3 is the canonical example of how scheduled autonomy fails.
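
One way to make the demotion triggers concrete before launch is to write the thresholds down as data and check the live instrument readings against them. A minimal sketch, with illustrative field names and threshold values rather than recommended ones.

```python
from dataclasses import dataclass

@dataclass
class DemotionThresholds:
    """Defined at launch, not under pressure after the first incident. Values illustrative."""
    max_unintended_action_rate: float = 0.005   # per executed action
    max_override_divergence: float = 0.15       # absolute drift from the override baseline
    max_incident_recovery_hours: float = 4.0

def demotion_triggers(reading: dict, baseline_override_rate: float,
                      t: DemotionThresholds) -> list[str]:
    """Return the triggers that fired; any non-empty result is a step back down the ladder."""
    fired = []
    if reading["unintended_action_rate"] > t.max_unintended_action_rate:
        fired.append("unintended action rate above threshold")
    if abs(reading["override_rate"] - baseline_override_rate) > t.max_override_divergence:
        fired.append("override frequency diverging from baseline")
    if reading["incident_recovery_hours"] > t.max_incident_recovery_hours:
        fired.append("incident recovery time trending up")
    return fired

print(demotion_triggers(
    {"unintended_action_rate": 0.002, "override_rate": 0.40, "incident_recovery_hours": 6.0},
    baseline_override_rate=0.22,
    t=DemotionThresholds(),
))
```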


3. Runtime Design Worksheet

(Chapter 4)

For each major workflow the agent handles, answer these four questions before design is considered complete.

Autonomy boundary. What may the agent do without any human interaction? What requires a human decision before proceeding? Is the boundary logged for every event? How does the user see the boundary at the moment it matters?

Approval moment. What information does the human see when asked to approve? How is uncertainty represented? (Not “are you sure?” A decision package, not a speed bump.) What alternatives are presented? What happens if no decision is made within the defined window? And critically, given the supervision paradox: is the approval framed as authorization within a designed boundary, or as validation of the agent’s reasoning? If the latter, the framing will not survive the predictable erosion of supervisor competence.
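
A minimal sketch of a decision package as a typed structure, so each worksheet question becomes a required field rather than a reviewer's memory. The field names and defaults are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import timedelta

@dataclass
class DecisionPackage:
    """What the human sees at the approval moment. Field names are illustrative."""
    proposed_action: str                     # what the agent intends to do
    evidence_summary: str                    # observable actions and retrieved context
    confidence: float                        # calibrated 0..1
    uncertainty_note: str                    # what the agent is least sure about
    alternatives: list[str] = field(default_factory=list)   # other options considered
    decision_window: timedelta = timedelta(hours=4)          # defined response window
    on_timeout: str = "escalate"             # what happens if no decision is made
    accountable_owner: str = "unassigned"    # named person for this action type
```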

Audit surface. What is the decision trajectory available after the action? (Observable actions and retrieved context, not chain-of-thought.) Who is the named accountable person for this action type? Is the name visible in the product UI, not only in the governance document? Is delegated authority logged and traceable?

Recovery workflow. What compensating action is available if the agent was wrong? Is rollback reachable mid-execution, not only after completion? What cannot be undone, and how is that communicated before execution? For actions whose consequence timescale is shorter than alert-and-respond cycles (the PocketOS case in Chapter 4), is there a kill-switch built into the autonomy boundary upstream of execution?
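
A minimal sketch of a boundary check that sits upstream of execution, assuming a simple in-process gate; the action names, the placeholder tool call, and the placeholder log write are illustrative, not a real integration.

```python
IRREVERSIBLE_ACTIONS = {"send_payment", "delete_record", "submit_order"}  # illustrative

class KillSwitchEngaged(Exception):
    """The autonomy boundary is closed; nothing executes."""

def run_action(action: str, payload: dict) -> dict:
    """Placeholder for the real tool call."""
    return {"status": "executed", "action": action}

def log_boundary_event(action: str, payload: dict, result: dict) -> None:
    """Placeholder for the audit-log write; every boundary event is logged."""
    print(f"boundary event: {action} -> {result['status']}")

def execute_with_boundary(action: str, payload: dict, *,
                          kill_switch_on: bool, approved: bool) -> dict:
    """Gate every execution before the action runs, not after the alert fires."""
    if kill_switch_on:
        raise KillSwitchEngaged("autonomy boundary closed upstream of execution")
    if action in IRREVERSIBLE_ACTIONS and not approved:
        # Irreversible actions never run on agent authority alone.
        return {"status": "held_for_approval", "action": action}
    result = run_action(action, payload)
    log_boundary_event(action, payload, result)
    return result

print(execute_with_boundary("send_payment", {"amount": 120},
                            kill_switch_on=False, approved=False))
```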

Async state. If the workflow is long-running, what does the async state display look like? How does the user return to context after the agent has completed work in the background? What is the notification channel, and how does it handle a batched digest?


4. Eval Sign-Off Checklist

(Chapter 5)

Before approving a release based on eval results, confirm the following. Yes to all is required.

  • Distribution, not a single pass. Were evals run at least five times? Is a pass rate reported, not a binary result? Is the P10 reported alongside the median? (See the sketch after this checklist.)
  • End-to-end threshold set. Is there a system-level success rate defined, not just component accuracy? Was the question “What is the end-to-end success rate on the full workflow, run ten times?” asked and answered?
  • State validation performed. Did evaluation verify actual system state changes, not only semantic output correctness?
  • Judge calibration documented. Is the false-pass rate of the automated judge characterized? Are the documented bias patterns (longer-answer, position, same-family) accounted for in the calibration?
  • Coverage statement produced. Are the scenarios in the eval suite explicitly documented? Which user intents, failure modes, and adversarial inputs were tested? Which known gaps were deliberately deferred? Does the coverage statement list the inputs whose fidelity the eval assumed, to address upstream-data-wrong failure? Does it include adversarial retrieval (RAG poisoning) scenarios?
  • Graceful failure defined. Is the agent’s expected behavior when it cannot complete a task specified and tested?
  • Production monitoring in place. Is there a canary or shadow test active from day one of launch?
  • Incident response path documented. Before launch, not after the first incident.
  • Model version policy in effect. Is there a regression eval re-run against every model update, and a rollback path to the prior version, tested?
  • Outcome metric matches contract. Is the eval suite measuring the metric that proves the contract is being delivered, or just adoption? The DAX Copilot case in Chapter 5 is the warning.
  • Acceptable tradeoffs named. Is the intended balance between cost, latency, reliability, and human oversight documented, and are the eval criteria weighted accordingly?
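
A minimal sketch of the distribution reporting the first item in this checklist asks for: run the suite k times and report the spread, median, and P10 instead of a single binary result. The eval runner here is a random stand-in for a real harness.

```python
import random
import statistics

def run_eval_suite(seed: int) -> float:
    """Stand-in for one full run of the eval suite; returns end-to-end success rate."""
    random.seed(seed)
    return random.uniform(0.82, 0.95)   # illustrative: replace with the real harness

def eval_distribution(k: int = 10) -> dict:
    """Run the suite k times and report the spread, not a single pass/fail."""
    rates = sorted(run_eval_suite(seed=i) for i in range(k))
    p10 = statistics.quantiles(rates, n=10)[0]   # first decile: the bad run, not the average one
    return {
        "runs": k,
        "median": round(statistics.median(rates), 3),
        "p10": round(p10, 3),
        "worst": round(rates[0], 3),
    }

print(eval_distribution(k=10))
```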

5. Six Instruments Dashboard

(Chapter 6)

These six instruments constitute the observation phase. If any cannot be produced, the corresponding product surface was not designed. That is a design gap, not a data gap.

Instrument | What it measures | What divergence tells you
Task success rate | Agent completed the task the user intended, not just that a task was completed | Background failures; agent operating outside intended scope
Unintended action rate | Actions the agent took that were not authorized | Boundary design failures; boundary not legible or logged
Override frequency | How often supervisors modify or reject agent actions | Too high: trust or approval design problem. Too low: passive supervision or automation bias
Confidence calibration | Whether expressed certainty matches actual accuracy | Overconfident agent building fragile trust
Rollback time | Time from error detection to completed recovery | Recovery is improvisation, not a designed surface
Incident recovery time | Time from scale misbehavior to full reauthorization | Organizational response was not designed before it was needed

Table 11.2. Six instruments of production observation and what divergence signals.

Sprint input rule. Each instrument is a sprint input, not a KPI. A diverging reading generates a specific user story in the next sprint.

Instrument maintenance. Add a maintenance column for each instrument: owner, last re-calibrated, next scheduled. Schedule re-calibration on the frontier-model-release clock, not the calendar. If the last re-calibration predates the most recent frontier model generation in your stack, the instrument is measuring the agent you used to have, not the one you have now. No instrument ships without a named owner and a re-calibration date on the roadmap.
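
A minimal sketch of the maintenance column as data, with staleness keyed to the frontier-model generation in the stack rather than the calendar. The instrument names, owners, and generation tags are illustrative assumptions.

```python
from dataclasses import dataclass

CURRENT_MODEL_GENERATION = "gen-5"    # illustrative tag for the frontier model in your stack

@dataclass
class Instrument:
    name: str
    owner: str                         # no instrument ships without a named owner
    recalibrated_at_generation: str    # generation in place at the last re-calibration
    next_scheduled: str                # date already on the roadmap

    def is_stale(self) -> bool:
        """Stale if the last re-calibration predates the current model generation."""
        return self.recalibrated_at_generation != CURRENT_MODEL_GENERATION

dashboard = [
    Instrument("override_frequency", "j.doe", "gen-4", "2026-09-01"),
    Instrument("confidence_calibration", "a.lee", "gen-5", "2026-11-15"),
]
for inst in dashboard:
    if inst.is_stale():
        print(f"{inst.name}: measuring the agent you used to have (owner: {inst.owner})")
```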


6. Channel 2 Checklist

(Chapter 7)

Before launch, confirm that supervision was designed as a product, not left to documentation and training.

  • Supervisor role does not require a new workflow. It slots into an existing one.
  • Approval moments appear where decisions are already being made, not in a separate monitoring dashboard.
  • Supervision training includes practice on what to do when the agent is wrong or unavailable.
  • Override frequency will be logged and reviewed in sprint cycles, not only quarterly.
  • Shadow workflow risk assessed: does the supervision interface give enough confidence that teams will not maintain parallel manual processes?
  • Accountability is named: one person per action type carries the outcome. That name is in the product UI, not only in the policy document.
  • Observability, data lineage, and model transparency are available to the supervisor at the moment of decision.
  • Environmental reversion risk assessed: which conditions must the product import from the structured setting to the production environment to hold the supervisory behavior stable?
  • Apprenticeship pipeline considered: how will the next generation of supervisors build the judgment the agent is about to remove from the work?
  • Recurrent-proficiency-without-AI plan in place. The aviation institutional model applied to whichever profession your supervisors come from. Schedule, cadence, and accountability assigned.

7. The Currency Question at Contract

(Chapter 8)

At contract signing and at every renewal, ask the vendor these five questions in writing. Absence of an answer is an answer.

  • Corpus: On what training corpus was the current model trained, and when was it last refreshed?
  • Curation: Who curates the corpus, with what retraction policy, and how are retractions propagated to deployed models?
  • Retractions: Which known retractions or regulatory reversals in our domain have been pulled, and on what schedule?
  • Change notes: Does the vendor publish behavioral change notes at each model release, and are those notes scoped to the tasks our agent performs?
  • Re-ask cadence: When will these questions be re-asked, and does the re-ask sit on the renewal calendar as a contractual event, not a best effort?

Risk register rule. If the vendor declines any of the five, the decline is a named entry in the product risk register, with an owner, a mitigation, and a re-review date. Silence is not neutral.
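
A minimal sketch of the risk register rule as a record, so a decline cannot stay silent: each declined question becomes a named entry with an owner, a mitigation, and a re-review date. Field names and the example entry are illustrative.

```python
from dataclasses import dataclass

@dataclass
class VendorDeclineEntry:
    """One declined Currency Question becomes one named risk-register entry."""
    question: str          # "Corpus", "Curation", "Retractions", "Change notes", "Re-ask cadence"
    declined_on: str
    owner: str             # the named person who carries this risk
    mitigation: str
    re_review_date: str    # sits on the renewal calendar, not on a best-effort list

register = [
    VendorDeclineEntry(
        question="Retractions",
        declined_on="2026-03-01",
        owner="pm.vendor-risk",
        mitigation="Independent quarterly spot-check of known retractions in our domain",
        re_review_date="2026-09-01",
    ),
]
```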


8. External Audit Trigger

(Chapter 8)

Commission an independent external audit when any one of the following conditions holds. The audit tests the monitoring, not the agent. If the monitoring is healthy, the audit confirms it cheaply. If the monitoring is not, the audit is how you find out before the incident.

  • Autonomy promotion. The agent is about to move up the Autonomy Ladder. The audit establishes a clean performance baseline at the new rung.
  • Eighteen months since the last audit. Regardless of visible health. The half-life of your observation instruments is already past.
  • Frontier model generation change. The foundation model underneath has turned a generation since the last audit. Your instrument calibration has reset whether you re-calibrated or not.
  • Post-incident reauthorization. After any incident that forced a demotion, the re-promotion requires independent evidence, not vendor or internal evidence.

Independence definition. The auditor has no commercial relationship with the vendor, did not help design the agent, and does not use a held-out sample the vendor curated. Vendor-selected benchmarks are vendor marketing, not audit.

What the audit produces. A stratified performance report, a confirmation or refutation of each observation instrument’s current calibration, a coverage statement of what was not tested, and a recommendation on whether the monitoring can be trusted for the next twelve months.


9. The PM’s Working Practice With AI

(Chapters 1, 4)

Building agentic products well requires the PM’s own practice to match. Using AI to draft PRDs faster is still operating as a bridge. The upgrade is building decision infrastructure that challenges your thinking before conclusions reach a human room.

There are five levels of working with AI in your own practice. Most PMs are at Level 1 or 2. The gap is a specification problem, not a capability problem.

  • Level 1 (Query). You ask, the AI answers. A fast reference tool.
  • Level 2 (Frame). You write a structured prompt with role, context, task, and constraints. Output improves immediately.
  • Level 3 (Persona). You build a persistent collaborator with a defined framework, documented failure modes, and an explicit instruction to never agree just to agree. This is the highest-return investment in the stack.
  • Level 4 (Council). Multiple personas with conflicting frameworks run against the same problem. The system generates friction, not consensus.
  • Level 5 (Digital twin). Each persona is grounded in the documented decision record of a real figure: their mental models, their signature questions, their specific failure modes.

The Persona Template

One persona, built well, raises the ceiling of what AI does for you more than anything else in this list. Five elements. Skip any one and the persona will agree with you when it should not.

Element | What to specify
Role | Who is this person? What is their job and their stake in the decision they are reviewing?
Framework | How do they evaluate decisions? What lens do they use first? Charlie Munger’s inversion, McKinsey’s seven-S, the supervision paradox, the four suitability conditions, the cost model. Whatever they reach for.
Constraints | What claims will they refuse to accept without evidence? What language patterns will they push back on? (“Strategic alignment” without specifics. “Best practices” without sources. Confidence without uncertainty bounds.)
Failure mode | What does this persona over-weight or miss? Naming it keeps the persona honest. (A skeptical CFO over-weights short-term cost and under-weights option value. Saying so prevents the persona from becoming a cartoon.)
Integrity constraint | One rule that prevents the persona from agreeing just to agree. Example: “Must push back on any revenue projection not supported by unit economics.” Or: “Must surface at least one assumption I am making that the data does not support.”

Table 11.3. The five-element persona template.
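
A minimal sketch of the five elements assembled into a persona system prompt. The skeptical-CFO content is taken from the table above as an illustration; the point is that leaving any field blank is an invitation for the persona to agree with you.

```python
PERSONA_TEMPLATE = """You are {role}.

Framework: evaluate every decision through {framework}.
Constraints: refuse to accept {constraints} without evidence.
Known failure mode (stay honest about it): {failure_mode}.
Integrity constraint: {integrity_constraint}"""

skeptical_cfo = PERSONA_TEMPLATE.format(
    role="a skeptical CFO reviewing an agent build proposal, accountable for the budget",
    framework="unit economics and the cost model, inversion first",
    constraints=('"strategic alignment" without specifics, "best practices" without sources, '
                 "confidence without uncertainty bounds"),
    failure_mode="over-weights short-term cost and under-weights option value",
    integrity_constraint="Must push back on any revenue projection not supported by unit economics.",
)
print(skeptical_cfo)
```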

The Council Session: Six Roles

Level 4 working in production. When a decision matters and you need it stress-tested before it enters a room of humans, run it through six personas with irreconcilable frameworks. The value is not the individual outputs. It is the friction between them.

Role | The question they ask
Authority | What does precedent say? What has been tried before, and what happened?
Optimist | What is the best-case path? What would need to be true for this to work better than expected?
Pragmatist | What can actually be executed? What are the real constraints on time, money, and people?
Guardrail | What could go wrong? What is the failure mode that would hurt most?
Security | What are the adversarial risks? Who could exploit this, and how?
Expert | What does the technical or domain evidence say? What do people who have done this before know that we do not?

Table 11.4. The six-role council. Run a decision through all six before the meeting.

The council is not a substitute for the human room. It is the stress-test that gets you to the human room with a Point of View statement already pressure-tested. The PM walks in having heard the six objections that would have surfaced in the meeting. The conversation that follows is sharper because the easy objections have already been answered.
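
A minimal sketch of a council run, assuming a generic ask_model(system, prompt) call that stands in for whatever model API you use; the placeholder returns canned text so the structure, not a vendor SDK, is what the sketch shows.

```python
COUNCIL = {
    "Authority": "What does precedent say? What has been tried before, and what happened?",
    "Optimist": "What is the best-case path? What would need to be true for this to work better than expected?",
    "Pragmatist": "What can actually be executed? What are the real constraints on time, money, and people?",
    "Guardrail": "What could go wrong? What is the failure mode that would hurt most?",
    "Security": "What are the adversarial risks? Who could exploit this, and how?",
    "Expert": "What does the technical or domain evidence say? What do practitioners know that we do not?",
}

def ask_model(system: str, prompt: str) -> str:
    """Placeholder for a call to whichever model API you use."""
    return f"[{system.split('.')[0]}] response to: {prompt}"

def run_council(decision: str) -> dict[str, str]:
    """One decision, six irreconcilable lenses, answers collected side by side."""
    return {
        role: ask_model(
            system=f"You are the {role} persona. Your question: {question}",
            prompt=decision,
        )
        for role, question in COUNCIL.items()
    }

for role, answer in run_council("Promote the billing agent to Stage 3 next quarter?").items():
    print(f"{role}: {answer}")
```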

Level 4 in practice: the nine-agent workshop

One Sunday I ran a full design-thinking workshop without the hotel, the flights, or the sticky notes. I built nine AI agents, each configured with a different lens on a healthcare problem I had been genuinely uncertain about: whether a hospital-based data platform could meaningfully address heart failure readmissions. Roughly one in four discharged patients returns within thirty days. Most of those readmissions are considered preventable. The technology to close the gap exists. The business case looked straightforward.

Three hours and under ten dollars in API fees later, the workshop had reframed the entire problem.

It was the community physician agent who said it first. When a heart failure patient walked into her clinic three days after discharge, she often did not have the discharge medication list. She was doing medication reconciliation from the patient’s memory and whatever documents they had brought. The patient was attempting to comply. They were complying with the wrong information.

The hospitalist agent had not said this, because this failure happens after her patient leaves. The clinical architect had not said this, because his lens is the data infrastructure inside the hospital. It came from the one agent positioned to see the patient on the other side of the handoff, and it landed with the particular weight of an observation that is obvious in retrospect and invisible beforehand.

What followed was a finding that cut directly against my original hypothesis. The problem was not inside the hospital. It was in the community, in the gap between discharge and the moment a community physician, a neighborhood pharmacist, or a health worker at a federally qualified health center could act.

That is Level 4 working. AI agents did not replace the workshop. The human workshop still matters: the community health worker who knows things no configuration captures, the executive with institutional authority, the customer who tells you something surprising in person. The point is that when those people enter the conversation, they can enter a different one. A Point of View statement already written, already stress-tested across nine irreconcilable perspectives, already carrying the finding that the problem is not where everyone assumed.

One note on the stack. Prompt engineering is a temporary skill, not a durable investment. Every generation of computing has required humans to speak the machine’s language, and each layer has been abstracted away. The next abstraction will arrive. The durable investment is not a more elaborate prompt. It is a specification that survives the abstraction: what framework, what failure modes, what refusal criteria, what diagnostic question. Those are the artifacts that port to whatever tool replaces the current one.

The diagnostic question after any AI-assisted session: Did this produce something I could not have reached without the AI’s specific framework, or a well-organized version of what I already believed? If the latter more often than the former, the personas need stronger failure modes.


10. Obligations Checklist

(Chapter 10)

Before launch, confirm that the product accounts for the people it affects, not just the people who use it.

  • Affected-person identification. For every consequential action the agent takes, the downstream person (beyond the user) has been identified and the outcome they experience has been mapped.
  • Stratified performance analysis. Model accuracy has been broken down by demographic, geographic, and population subgroups. No systematic underperformance for any affected group has been left unaddressed.
  • Hallucination type coverage. The eval suite explicitly addresses which of the six hallucination clusters (factual errors, outdated references, spurious correlations, fabricated sources, incomplete reasoning, upstream-data-wrong) it is designed to catch, and which remain open gaps.
  • Delegation boundary defined. Decisions that require human accountability have been identified and excluded from full automation, regardless of model accuracy. The non-delegable list is enumerated and enforced at runtime.
  • Constitutional Runtime Layer in place. A runtime-enforced refusal mechanism exists for non-delegable actions, not only a design-time policy. (See the sketch after this checklist.)
  • Governance in the product. Every governance policy adopted for AI has a corresponding product surface (not just a document). Escalation paths, authorization requirements, and boundary events are visible in the UI and logged.
  • Regulatory framework as floor, not ceiling. The product’s safety architecture does not depend solely on the regulatory human-in-the-loop framing. The supervision paradox literature is acknowledged, and the product has its own safety mechanisms that do not assume supervisor competence is preserved across the deployment’s operational life.
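
A minimal sketch of the Constitutional Runtime Layer check referenced in the list above: the non-delegable list is data, and the refusal runs on every action at execution time, regardless of model confidence. The action names are illustrative.

```python
NON_DELEGABLE = {                      # illustrative; enumerate your own list explicitly
    "deny_insurance_claim",
    "terminate_employment",
    "change_medication_order",
}

class NonDelegableAction(Exception):
    """Raised at runtime, not only written into a design-time policy."""

def constitutional_check(action: str, model_confidence: float) -> None:
    """Refuse full automation of non-delegable decisions, regardless of accuracy."""
    if action in NON_DELEGABLE:
        raise NonDelegableAction(
            f"'{action}' requires human accountability; confidence "
            f"{model_confidence:.2f} does not change that."
        )

# The check fires even when the model is very confident.
try:
    constitutional_check("deny_insurance_claim", model_confidence=0.99)
except NonDelegableAction as err:
    print(err)
```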

11. One Diagnostic Question Per Phase

Before build: Can I answer all four suitability tests and name the break-even volume honestly?

Before design sign-off: Can I specify what the agent may do, what it may not do, what happens when it is wrong, and who is accountable for each action type, with the accountable name surfaced in the product UI?

Before launch: Do I have all six observation instruments instrumented, and is production monitoring active from day one?

Before increasing autonomy: Does the agent’s behavior over the past thirty days justify moving up the ladder, or am I just optimistic because nothing went visibly wrong? And: is this an earned promotion based on demonstrated failure-mode coverage, or a scheduled one based on a count?

After any incident: Could the supervisor have caught this before it compounded, given the interface they had? If not, what surface was missing? And: was the supervisor in a competence state the framework assumed, or had the deployment been eroding that competence?

Before any release affecting a new population: Have I stratified performance for the people this agent will now affect? Would I be comfortable explaining this agent’s error rate to the person in the six percent?

At month eighteen: Have the observation instruments been re-calibrated against the current frontier model generation, has the Currency Question been re-asked of the vendor, and has an external audit been scoped? If none of the three, the agent I am operating is not the agent I launched.

Quarterly, throughout the deployment: When was the last time we measured whether our supervisors can actually catch the agent’s failures, with no AI assistance? If the answer is “never,” the safety case is unevidenced.


12. Red Flags

Use this section when you are asked to evaluate someone else’s agentic project, or to audit your own. Each flag is a signal that the team has not done the work described in this book. Each maps to a chapter.

“The model passes all our evals.” (Chapter 5) Passing evals is necessary, not sufficient. Ask for pass@k, coverage statement, state validation, and judge calibration. If any is missing, the evals are telling you less than the team thinks.

“We will figure out the context later.” (Chapter 1) The context is the product. If the team does not know which data, with what fidelity, with what relationships, the agent will reason over, you are looking at the next quietly-rolled-back shopping assistant.

“Training will handle the supervision problem.” (Chapter 7) Training creates structured-setting behavior. The production environment erases it. If Channel 2 is a training slide rather than a designed product surface, the agent is shipping without a nervous system.

“Our supervisors are well trained.” (Chapters 1, 7) Trained when. Trained on what. Practicing what skill, on what cadence, without AI assistance. If the answer is “they were trained at onboarding,” the supervision paradox has had however many months or years since then to erode the training the deployment depends on.

“We cannot produce the override frequency metric yet.” (Chapter 6) The approval moment was not designed to be logged. The absence of the metric is a design finding, not a data gap.

“The dashboard is running fine.” (Chapter 8) Ask when the dashboard was last re-calibrated, and against which frontier model generation. An instrument that has not been re-calibrated since the last model release is measuring the agent you used to have, not the one you have now. A green dashboard is a design artifact before it is evidence.

“The vendor has not published release notes on behavioral change.” (Chapter 8) Silence on behavioral change is not evidence of stability. It is the absence of the artifact the buyer needs to detect silent degradation. Put the release-notes obligation in the contract and the renewal calendar. Absence of the answer is the answer.

“We have not run a full eval since last year.” (Chapter 8) The eval harness is an observation instrument with an eighteen-month useful life, pegged to frontier model release cadence. If the last full eval predates the last generation turn, the team is not monitoring the agent in production. They are remembering how it performed in staging.

“A governance committee is reviewing our approach.” (Chapter 9) A committee reviewing the approach is not governance. A committee reviewing the product surfaces that enforce governance, where those surfaces exist, is governance. Ask to see the surfaces.

“The agent has three hundred skills out of the box.” (Chapter 1) Three hundred skills is three hundred permissions your organization just inherited. Ask which ones the agent may combine in ways the vendor did not plan for, and who owns the answer.

“We will handle irreversibility in a later phase.” (Chapter 4) Irreversibility is not a phase. It is a constraint to design around from sprint one. If the team is deferring it, the MVP is a house of cards.

“The kill-switch is the service shutdown.” (Chapters 4, 6) That is not a kill-switch. That is an emergency shutdown preceded by damage. For actions whose consequence timescale is shorter than alert-and-respond cycles, the autonomy boundary upstream of execution is the kill-switch. The PocketOS case in Chapter 4 is the warning.

“The affected person is out of scope.” (Chapter 10) The affected person is the person the product is for. Out of scope means the product has no moral architecture. Ask where Maria, or the equivalent, appears in the product.

“Our model is ninety-four percent accurate.” (Chapter 10) Who is in the six percent. If the team cannot stratify the error rate, the product is shipping with a hidden disparate-performance liability.

“Chain-of-thought shows exactly what the agent did.” (Chapters 1, 4) Chain-of-thought is generated text, not an audit log. Research on model explanations has shown that written reasoning and actual computational path can diverge. If the audit surface is built on chain-of-thought rather than observable action, it is not a defensible audit trail.

“We will worry about prompt injection if it becomes a problem.” (Chapter 4) Prompt injection is not an edge case. It is the single most exploited vulnerability of LLM-based agents in 2026. If the red-team report has one paragraph on injection, there was no red team.

“The security review was done at launch.” (Chapter 8) Security posture has a half-life. The threat environment evolves on the same clock as the model. A launch-time security review is a snapshot of a moving target. The fifth degradation vector applies.

“We will earn the next rung after thirty days clean.” (Chapters 2, 3) That is scheduled, not earned. Thirty days clean tells you nothing about whether the agent has demonstrated competence on the failure modes that matter for the next rung. The Utah Doctronic case is the warning.

“The agent works. Now we ship.” (Chapter 2) You have built half a product. The other half is the supervisory system around it. You are not ready to ship until both are deliberately designed.


13. Reading List for the PM

The framework in this book is grounded in a body of writing developed over the past several years on the intersection of AI product management, healthcare AI, enterprise data infrastructure, and the human-systems side of autonomous deployments. For readers who want to go deeper on specific topics, the following pointers map roughly to chapters.

Foundations and AI literacy (Chapter 1). “The Healthcare AI Spectrum,” on the seven generations of healthcare AI in simultaneous production. “The Quiet Erosion,” on the cognitive-decay literature underneath the supervision paradox. “The Last Generation That Can Supervise AI,” on the Bainbridge irony as it applies to clinical and software domains in 2026.

Cost and suitability (Chapter 3). “The Cost Model Your Business Case Is Missing,” the SAP Community version of the brownfield-versus-greenfield argument. “Utah Climbed the Autonomy Ladder. Nobody Designed the Rungs,” on earned-versus-scheduled autonomy. “Stop Waiting for Clean Data,” on the design-for-incompleteness principle.

Runtime design and security (Chapter 4). “You Built the Agent. Nobody Designed the Experience,” the original four-runtime-artifacts framing. “Security Was the Next Sprint,” on the STAC paper, the 847-deployment audit, and the OpenClaw incident. “The Agent Worked, Limitless and Unguarded,” on the flea-magnet metaphor and the fence-and-model upgrade-cycle problem.

Evaluation (Chapter 5). “What the Checkmarks Actually Prove,” the SAP Community version of the three-eval-breaks framework. The DAX Copilot case study is treated in detail in Chapter 5 of this book.

Observation (Chapter 6). “The Stack Is Green. The Agent Is Wrong,” on the five places the traditional monitoring contract breaks for agentic systems, and on data observability as the foundation for everything above it. “You Cannot Measure What You Did Not Design,” on the observation phase as the missing half of the lifecycle.

Change management (Chapter 7). “Your Agent Worked. Your Users Bypassed It,” on environmental reversion and the Stankowski study. “What Physicians Know That Cannot Be Written Down,” on the medium constraint that makes the supervision paradox structural rather than contingent. “The Education Model Is Cracking,” on the four apprenticeship conditions and what AI does to each.

Silent degradation (Chapter 8). “Silent Degradation: What a Deployed Clinical AI Looks Like at Month Eighteen,” on the Epic Sepsis Model and the four (now five) drift vectors. “The Architect Who Should Have Read JAMA,” on why the regulatory signals appear in medical journals before they appear as RFP requirements.

Governance (Chapter 9). “Governance: The Word Nobody Agrees On,” on the data-AI-master-data spectrum that most enterprises already partially have. “Why Healthcare AI Governance Isn’t What You Think It Is,” on architecture-versus-committee governance. “The 3 A.M. Problem,” on adaptive governance for clinical-pace decisions. “The Guide Is Not the Business,” on the Michelin Condition.

Obligations to the affected person (Chapter 10). “What You Owe the People Who Will Never Be in the Room.” “Not All AI Errors Look Like Errors,” on the Kim et al. hallucination taxonomy. “Trained on the Wrong End of the Story,” on first-contact training data mismatch. “Do No Harm, Encoded,” on the Constitutional Runtime Layer.

All of the articles cited above are at data-decisions-and-clinics.com. The SAP Community series, written for PMs working inside the SAP ecosystem, runs from posts 0 through 9.


The Loop

Every red flag on this list reduces to one of two root causes: a missing probabilistic mental model, or a missing Channel 2 artifact. Name which of the two is on the table, and the conversation sharpens immediately.

If you can answer the diagnostic questions in Section 11 honestly and concretely, at each phase, not just at launch, and you can run through Section 12 without any of the red flags landing on a project you own, you are doing the job this guide was written for: not just building agents, but designing the supervisory systems that make them safe, effective, and worth the complexity they add.

That is the work vertical SaaS PMs are uniquely positioned to do. The domain expertise you accumulated in years of workshops, customer calls, and hard launches is not obsolete. It is exactly what is needed to draw the boundaries an agent requires. The frameworks still work. The judgment call is the same. The surface is different.

You built the agent. You designed the supervisor. You held the loop together against the deployment that was reshaping it. That is the job. Now go do it on Monday morning.