Reference

References and Sources

A note on sourcing. The cases in this book fall into a few kinds, and the kind is signaled by whether a source appears below. Documented public incidents and peer-reviewed findings are sourced here. Litigation-stage allegations are presented as alleged, both in the text and in this list. The author’s own career and clinical experience is told in the first person in the text and noted as such here. A few illustrative composites carry a pattern without pointing at any one company; these are deliberately left unsourced, and the absence of a citation is the signal that the case is constructed rather than reported. Several constructs in this book (the two channels, the grid, the autonomy ladder, the two briefs, the runtime artifacts, the supervision paradox) are the author’s frameworks, developed across this series; they are attributed to the prior books rather than to an external source. Full formal citations are compiled at typesetting; figures in a fast-moving field are stamped to mid-2026 and should be re-verified before publication.

Part I: Foundations

What you are actually building. The two-channel operating model (Channel 1 build, Channel 2 supervise), the “every agentic product is two products” frame, and the no-single-owner corollary are the author’s frameworks, carried forward from the prior books in this series (Why Agentic AI Products Fail and its predecessor). The supervision paradox, that a human supervising a reliable automated system is trained out of vigilance by its very reliability, rests on Lisanne Bainbridge, “Ironies of Automation” (Automatica, 1983), and the automation-complacency literature that followed (Raja Parasuraman and Dietrich Manzey, “Complacency and Bias in Human Use of Automation,” Human Factors, 2010). Both are developed in full in Part III.

The autonomy ladder and earning rope. The autonomy ladder and the earned-versus-scheduled-autonomy distinction are author frameworks. The telehealth case (a telehealth product advancing its prescribing authority on a schedule rather than on demonstrated evidence that supervision could keep up) is the Doctronic / Utah prescription-renewal pilot conducted under a state regulatory sandbox, 2026 (New England Journal of Medicine Perspective, NEJMp2601148); the “autonomy on a schedule” reading is the author’s interpretation of the pilot, stated as such.

Deciding what to build. The go/no-go, the two briefs (Human Brief and Executable Brief), build-to-decide, and the suitability tests are author frameworks from the prior books, introduced fresh here. The NetWeaver 198-page strategy document (SAP, 2004) and the planning-to-development ratio (planning historically roughly three times development, collapsing across the Agile and agentic eras while the discipline persists) are the author’s own experience and argument, and run as a throughline across the book. The education-app breach (a featured no-code app whose inverted authentication exposed roughly 18,697 records) is a security-researcher disclosure carried from the prior book (Vibe Graveyard coverage, 2025–2026), presented as a researcher’s report. The travel-agent opener is an illustrative composite.

The triad premise. The triad-fitted-to-screens argument and the seam-as-unit-of-failure are author analysis. The four-in-a-box career arc is the author’s own organizational experience.

Part II: The Work Reshapes

Is there still a UX? The behavioral failure modes are drawn from the human-factors literature: automation complacency (Parasuraman and Manzey, 2010); vigilance decay (Norman Mackworth’s radar-operator experiments, the Mackworth Clock, and the vigilance research after it); mode confusion (Nadine Sarter and David Woods, “How in the World Did We Ever Get into That Mode?,” Human Factors, 1995). The controlled finding that explanations move reliance without improving its accuracy is Jakob Schoeffer, Maria De-Arteaga, and Niklas Kühl, “Explanations, Fairness, and Appropriate Reliance in Human-AI Decision-Making” (CHI 2024); the wider synthesis is Romeo and Conti, “Exploring automation bias in human-AI collaboration” (AI & Society, 2025). The deskilling example draws on Budzyń et al., “Endoscopist deskilling risk after exposure to artificial intelligence in colonoscopy: a multicentre, observational study” (The Lancet Gastroenterology & Hepatology, 2025), reporting unaided adenoma detection falling from 28.4 percent to 22.4 percent after endoscopists worked with AI assistance. Design-engineer hiring (Vercel, Linear, Stripe, Cursor) reflects current job-market language; “behavioral designer” is this book’s proposed name for an unstaffed responsibility, not an established market title. The “Nemo” claims agent is an illustrative composite; the orchestration and audit capabilities it shows are drawn from 2026 claims-automation vendors, not a single shipping product.

Who wrote this code, and who answers for it? The quantitative spine is Faros AI, AI Engineering Report 2026: The Acceleration Whiplash (two years of telemetry across roughly 22,000 developers and more than 4,000 teams): code churn up 861 percent, bugs per developer up 54 percent (from 9 percent in the 2025 report), incidents per pull request up 242.7 percent, 31.3 percent of pull requests merging with no review, and median senior review time up 441.5 percent. The trust-verification gap is from Sonar’s January 2026 developer survey. The finding that experienced developers were measurably slower on familiar repositories despite feeling faster is the METR randomized trial (July 2025); METR’s February 2026 follow-up reports that a successor study could not cleanly reproduce the measurement and that developers are likely more sped up by early-2026 tools, so the figure stands as the early-2025 result rather than a current claim. Security-flaw rates draw on a 2024–2026 cluster: Veracode’s multi-model analysis (roughly 45 percent critical-flaw rate), the Georgetown CSET review, and the CodeRabbit analysis of roughly 470 pull requests (about 1.7 times the issues). Jensen Huang’s statement that engineers should direct and supervise the machine’s code rather than write code themselves is paraphrased from his early-2026 remarks (the text does not quote him verbatim). The labor signal, employment for young (ages 22–25) software developers down nearly 20 percent from its late-2022 peak, is Erik Brynjolfsson et al., “Canaries in the Coal Mine?” (Stanford Digital Economy Lab, ADP payroll microdata, November 2025); the paper’s headline figure is a ~16 percent relative decline across the most AI-exposed occupations for that age cohort after firm-level controls, with the ~20 percent figure reflecting the software-developer-specific raw decline, and it is presented as directional with the methodological caveats stated in the text. The control-surface mechanics and the agent-guidance-file-as-onboarding construct are grounded in the primary vendor documentation for the major coding agents; spec-driven development and the “one requirement, two builders” framing are treated as a promising discipline and one of this book’s own frames, not as an industry standard. The payments-team “house of cards” is an illustrative composite.

Is there still a Scrum? The velocity-and-quality data on AI-assisted coding (Faros, above; DORA 2025) is robust; direct evidence on how AI is changing team ceremonies is sparse and is flagged as such in the text. Scrum.org’s reframing of velocity as a “vanity metric” in the AI era and its launch of AI training for Scrum Masters (2026) are official publications of the certifying body (“From Velocity to Agent Efficiency: Evidence-Based Management for the AI Era,” Scrum.org, 2026). Agent-run-ceremony capabilities are documented features of Atlassian’s Rovo agents in Jira (Team ’26). The finding that informal communication predicts team performance more strongly than formal process is MIT Human Dynamics Lab research (Alex “Sandy” Pentland and colleagues; Pentland, “The New Science of Building Great Teams,” Harvard Business Review, April 2012), pre-AI team science. The association between AI adoption and reduced psychological safety is Kim, Kim and Lee, “The dark side of artificial intelligence adoption” (Humanities and Social Sciences Communications, a Nature portfolio journal, s41599-025-05040-2; 381 employees across three time points; the sample is not software-specific, a limitation noted in the text). The agile-role reduction (Capital One’s roughly 1,100-role cut, January 2023) is documented; the stronger claim that agile roles were disproportionately targeted industry-wide is deliberately not made. The agile-to-waterfall convergence is the author’s argument, explicitly offered as a frame rather than a documented finding; AWS’s AI-Driven Development Lifecycle (Raja SP, AWS, 2025) is cited both as corroboration of the ceremony breakdown and as the convergence built in practice though not named as such.

Whose job is it that the whole thing holds? The architect material draws on the 2026 architect/feasibility research: the distinction between commoditized implementation feasibility and the scarcer fit/consequence feasibility, the enterprise production-failure data (a reported 81 percent of enterprise leaders citing increased production issues tied to AI-generated code), and the “Architecture Without Architects” pattern of silent, prompt-driven architectural decisions (2026 preprints, cited for the collective weight of the pattern). Architect compensation and posting trends are a 2026 snapshot. The two-feasibilities framing and the access-control example (prompt instruction versus enforcing façade) are the author’s.

What is left for the PM? The convergence-on-judgment framing is the author’s, consistent with 2026 commentary on the Product Engineer and Design Engineer. The convergence-of-posture-versus-division-of-labor reconciliation (“merge to decide, separate to ship”) is this book’s analysis. The go/no-go, the executable brief, and the planning-to-execution ratio are as in Part I.

Part III: The Work Itself

Silent degradation. The phenomenon and the six drift vectors are carried from the prior book. The Epic sepsis model figures (a marketed discrimination score in the low 0.8s; an external validation near 0.63 with roughly one-third sensitivity, years into deployment) are from the external validation (Wong et al., JAMA Internal Medicine, 2021, University of Michigan), with the later-version sequel from Tarabichi et al. The claim that most machine-learning models degrade over time absent visible input change is Vela et al. (Scientific Reports, 2022), directional for LLM agents rather than proven for them. The benchmark-drift demonstration is the Chen, Zaharia, and Zou controlled comparison, labeled as a research benchmark.

The boundary and the wall. The nine-second deletion is the PocketOS incident of 24 April 2026: a Cursor coding agent running a frontier model deleted the company’s production database and its co-located backups in about nine seconds, using an over-scoped Railway infrastructure token found in an unrelated file, then produced a written account enumerating the safety rules it had violated. Sources: founder Jer Crane’s first-hand account, and trade coverage (The Register, 27 April 2026; Fast Company; Hackread; Zenity); later reporting indicates Railway recovered the data and patched the endpoint, so the case is used for the structural lesson (over-scoped tokens, destructive APIs, backup coupling) rather than for irretrievable loss. The business is described by role (automotive operations software) rather than named in the text. This is a separate incident from the 2025 Replit production-database deletion (see “Who watches it”), and the two are kept distinct. The enforcement principle (a safety boundary must be enforced upstream of the action, not stated in the prompt) and the four runtime artifacts are author frameworks; the underlying security references (OWASP Top 10 for Agentic Applications; the Cloud Security Alliance’s MAESTRO; CISA and Five Eyes joint guidance, 2026) are documented in the prior book.

The green checkmark nobody owned. The evaluation concepts (golden dataset, pass@k, reliability compounding, LLM-as-judge bias and calibration) are carried from the prior book; judge-bias findings draw on the LLM-as-judge literature (Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,” NeurIPS 2023, on position, verbosity, and self-enhancement bias). The opening cardiology case (a care-coordination agent passing its eval suite while stratifying patients against a guideline revised three years earlier, caught by a clinician ten weeks into rollout) is an illustrative composite. The attributability gap and “accountability ping-pong” are the terms of Patil, Myers & Dai, “Protecting clinical value judgment in the age of AI” (npj Digital Medicine, 9:269, 2026), which traces the gap to Zeiser (Science and Engineering Ethics, 2024). The emergence of dedicated evaluation-engineering roles at frontier labs is documented in this book’s research streams.

Who watches it, and who can stop it. The two-agent runaway loop (two cooperating agents retrying for weeks into tens of thousands of dollars while dashboards stayed green, caught by a monthly cost alert) and the procurement-agent case (a true completion metric over work never done) are illustrative composites, carried from the prior book. The Replit production-database deletion (an agent ignoring a capitalized freeze instruction, then fabricating records) is documented (July 2025; The Register and Fast Company; founder account by Jason Lemkin) and carried from the prior book. The six observation instruments, the kill-switch-outside-the-agent’s-runtime principle, and the claim that the major providers do not ship a production-grade hard pre-call cost cut-off, an enforcement that evaluates a call before it executes, are from the prior book, documented against 2025–2026 provider billing behavior. As of mid-2026 the providers offer monthly spend ceilings and budget alerts, and Google has added mandatory account-tier caps and optional project spend caps that pause traffic once a monthly target is reached (with a short enforcement lag); none of these is a per-call gate, so the runaway-loop exposure the chapter describes, many paid calls firing before any aggregate threshold is hit, remains the team’s to close. The argument for capability-level monitoring tracks alongside the task-level instruments draws on Kellogg, Ye, Hu, Savova, Wallace & Bitterman (npj Digital Medicine, 9:375, 2026).
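The distinction drawn above between an aggregate spend ceiling and a per-call gate can be made concrete with a minimal sketch. It is illustrative only: `PreCallGate`, `BudgetExceeded`, `run_loop`, the dollar figures, and the idea of a caller-supplied cost estimate are all assumptions made for this example, not any provider’s API and not the implementation described in the prior book. The point it shows is the ordering: the check runs before the paid call fires, so a retry loop stops at the cap rather than at the next monthly alert.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised before a call executes when it would breach the run's cap."""


@dataclass
class PreCallGate:
    # Hypothetical per-run cap in dollars, enforced by the team's own code.
    cap_usd: float
    spent_usd: float = 0.0

    def authorize(self, estimated_cost_usd: float) -> None:
        # Evaluate the call BEFORE it executes; refuse if it would cross the cap.
        if self.spent_usd + estimated_cost_usd > self.cap_usd:
            raise BudgetExceeded(
                f"call of ${estimated_cost_usd:.2f} would exceed cap ${self.cap_usd:.2f}"
            )
        # Reserve the spend only once the call has been authorized.
        self.spent_usd += estimated_cost_usd


def run_loop(gate: PreCallGate, call_cost_usd: float, max_attempts: int) -> int:
    """Simulate a runaway retry loop; return how many paid calls actually fired."""
    fired = 0
    for _ in range(max_attempts):
        try:
            gate.authorize(call_cost_usd)
        except BudgetExceeded:
            break  # the gate, not a later billing alert, ends the loop
        fired += 1  # stand-in for the real paid model call
    return fired


if __name__ == "__main__":
    gate = PreCallGate(cap_usd=5.0)
    print(run_loop(gate, call_cost_usd=0.5, max_attempts=1000))
```

In this sketch the gate lives in the calling code, outside the agent’s own runtime, which is the placement the kill-switch principle argues for; a real deployment would also need a trustworthy cost estimate per call, which this example simply assumes.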

Built before, authorized at the moment. The insurer case (a health insurer using an AI-assisted platform to review medical-necessity claims; physicians signing more than 300,000 denials over roughly two months, about 1.2 seconds each) is the Cigna PxDx matter, presented as litigation-stage allegation (ProPublica, 2023; Kisting-Leung v. Cigna). The appeal-overturn allegation associated with this space belongs to the separate UnitedHealth / naviHealth nH Predict litigation and is deliberately not attributed to Cigna here; the chapter’s argument rests on the rapid-denial figure alone. The agentic security model and frameworks, and the regulatory specifics (EU AI Act oversight provisions; the 2025 California law (SB 1120) requiring a licensed physician or qualified health professional to make a medical-necessity determination; the 2026 Medicare prior-authorization human-validation pilot; consumer-credit adverse-action reason requirements), are documented in the prior book and current law.

The people the agent erodes. Bainbridge’s “Ironies of Automation” (1983) and the automation-bias findings (Parasuraman and Manzey, 2010; Dratsch et al., “Automation Bias in Mammography,” Radiology, 2023, in which expert readers shifted toward incorrect AI BI-RADS suggestions) are carried from the prior book. The Loop Test, the three erosions (deskilling, never-skilling, cognitive surrender), the airline manual-flying-hours analogy, and the pipeline-collapse structural prediction are author frameworks; the pipeline-collapse claim is flagged as a structural prediction. The re-vantaging of the three erosions across the engineering manager, the institution, the product manager, the agent supervisor, and organizational leadership is this book’s analysis. The claim that an uncertainty-aware model improves accuracy on the cases a human acts on (versus one reporting flat confidence) is Zhou et al., “Uncertainty-aware large language models for explainable disease diagnosis” (npj Digital Medicine, 8:690, 2025); certainty inflation is the author’s label for the failure mode, not the paper’s term.

Part IV: The Collaboration Model

The model is old, the column is new. The position that the credible authorities on team organization refuse to propose new team types for agentic work, and treat existing structures as the foundation, reflects the 2026 stance of the Team Topologies authors with corroborating commentary from Microsoft and ThoughtWorks. The empty-Channel-2-column data (a large share of generative-AI deployments running without production observability; the absence of an established compensation band or accountable owner for agent operations; the framing of 2026 as the year teams shift from building agents to running them) is from this book’s research streams. The AWS AI-DLC method is the foil, quoted and paraphrased from its published definition. The closed-loop critique of the PM-as-hero discourse, and the Google AI-PM hiring material used as a second foil (Jaclyn Konzelmann, Google Labs, six characteristics; product-manager compensation medians from Levels.fyi, which are not AI-PM-specific), are this book’s analysis; the 2×4 grid and the empty-column reading are the author’s frameworks.

Filling the grid; what flows between the cells. The work-unit shift (from user story to outcome-centric spec graded by an eval set), the boundary map replacing the journey map, and the two briefs are carried from the prior book. The MRD-compresses-into-agent-run-research claim is flagged in the text as the author reasoning slightly ahead of the evidence. The re-vantaging of each artifact from single-owner to shared-owner is this book’s analysis.

Part V: The Org Model

Start with the box you already have. The career arc of the team-collaboration unit (two-in-a-box, the product trio, four-in-a-box) is the author’s experience and the documented evolution of these models. The grid-as-antidote-to-status and the box-grows framing are author analysis.

The team member nobody hired. The cardiac-center anecdote (the author as the physician on shift learning from twenty-year cardiac nurses, judging them by behavior under pressure rather than credentials) is the author’s own clinical experience, carried from the first handbook in this series. The agent-as-team-member framing, the missing-category argument, the know-your-agent concept, and the emerging titles (agent supervisor, agent quality lead, AI operations manager) are carried from the prior book. The guidance-file-as-onboarding construct (CLAUDE.md / AGENTS.md and equivalents) is documented practice on engineering teams. The claim that the engineer is the existence proof for the supervisor role, and the boundary that the engineer is the local prototype rather than the organizational answer for business agents, are this book’s analysis; the author’s account of supervising an AI agent in the writing of this book is the author’s own.

Who sits in the new seats. The staffing-by-scale account and “merge to decide, separate to ship” applied to team composition are this book’s analysis. The capability set (architect, eval owner, agent supervisor, domain expert, context owner, orchestration engineer, red team) is presented as a checklist of work to be owned rather than a fixed roster. The compensation contrast (high-six-figure-and-above median compensation for product managers at frontier AI labs against a thin-to-absent market for agent-supervision roles) draws on public compensation data (Levels.fyi medians for the product-manager role at OpenAI, Meta, Google, and Anthropic, which are not AI-PM-specific); the figures are a 2026 snapshot and the contrast, not the precise numbers, is the load-bearing claim. The naming of the engineering manager as a load-bearing seat, and the reassignment of change management from the product manager to the engineering manager and organizational leadership, are this book’s analysis, on the same logic as the skill-erosion reassignment in Part III. The treatment of AI literacy as a gradient that gates supervision-seat staffing draws on Yang Xin et al. (npj Digital Medicine, 9:344, 2026); the individual-level treatment is left to a separate book.

Part VI: Agents Building Agents

The multi-agent tooling. The named 2026 frameworks and their capabilities are cited for the documented orchestration patterns and must be reconfirmed against primary documentation before print, as the tooling moves quickly: OpenAI’s Agents SDK (handoffs, agents-as-tools, guardrails, tracing, human-in-the-loop); Claude Code’s subagents and its experimental, off-by-default agent teams (with the documented limit that a subagent cannot itself spawn subagents); the orchestrator-and-worker patterns in LangGraph and CrewAI. Verify before print which platforms permit nested delegation, which keep multi-agent teams experimental, and which support handoffs, agents-as-tools, tracing, and guardrails; the in-text prose is deliberately cautious about maturity and should not be tightened into a claim the docs do not support. The recursing-surface artifacts (delegation policy, agent lineage trace, rollback tree) and the framing of each delegation as a machine-speed hiring decision are the author’s analysis. The PocketOS / Cursor / Railway deletion is the documented Part III case (sourced there, including the reporting that Railway later recovered the data and patched the endpoint); the structural lesson, over-scoped tokens and destructive APIs and backup coupling, is what the fleet version inherits. The supervisor-agent pattern (review, test, monitoring agents watching worker agents) is documented practice; the independence argument and the golden-set-is-floor-not-ceiling apparatus are the author’s, extending the evaluation discipline of the foundations and Part III. The claim that fleet governance is the competitive differentiator is the author’s argument from the commoditization mechanism, not a measured forecast, and is stated as such.

Part VII: Carry the Weight

The constitution. The behavioral-governance material is carried from the prior book in this series and re-vantaged from single-owner to distributed ownership. The term constitution is used in the runtime, product-team sense (an agent’s behavioral contract enforced through the team’s evaluation, monitoring, and escalation) and is distinguished in the text from Anthropic’s training-time Constitutional AI (Bai et al., “Constitutional AI: Harmlessness from AI Feedback,” arXiv:2212.08073, 2022; “Claude’s Constitution,” Anthropic, 2023, updated 2026), which operates on the model during training; the two are the same word for different layers of the stack. OpenAI’s Model Spec is noted as an adjacent behavior-specification artifact. The book-specific operational definitions of supervision and governance are flagged in the text as narrower than the wider governance literature (e.g. the NIST AI Risk Management Framework), which is acknowledged rather than displaced. The prompt-policy-guardrail enforcement ladder, the constitution-as-braid, the three-stage maturity arc (offered explicitly as a proposed model, not a law), and the directly-responsible-owner refinement are this book’s framing. Nemo’s worked constitution and its enforcement-mapping table are an illustrative composite (Nemo is the food-spoilage claims agent introduced in Part II); the insulin / ambiguous-medical-refrigeration-exclusion case is constructed to make the value-conflict concrete and is not a reported claim. The closing re-vantage quotes and overturns the prior book’s final line, “the product manager is the architecture that does not exist yet”; the replacement claim, that the architecture is a team, is this book’s thesis, presented as conviction.

Notes

Notes follow the series house style: a single book-end “References and Sources” section organized by chapter, in prose, with no per-chapter numbered markers in the body (matching the prior two books). Composites are intentionally absent from the citation lists: the “Nemo” claims agent and its constitution, the insulin case, the payments-team house of cards, the cardiology expired-guideline case, the two-agent runaway loop, the procurement agent, and the travel-agent opener are illustrative and carry no external citation by design. The 2026 figures and the Faros, METR, Stanford, and Scrum.org primary sources, and the multi-agent framework capabilities, were verified against primaries in June 2026 (see the verification register); they should be re-confirmed close to the print date, as the field moves quickly.
