Chapter 10 · Across all stages, obligations

What You Owe the People Your Agent Will Never See


In March 2023, a patient named Maria, a forty-two-year-old community-health-center client in a mid-sized city, filled out an intake form that a clinical-decision-support agent summarized into a two-line risk assessment. The agent flagged her as low-risk for cardiac events. The physician reviewing that morning’s queue saw the summary, confirmed it, and scheduled a routine follow-up for six weeks later. Eleven weeks later, Maria died of a preventable myocardial infarction. The coroner’s report found four risk factors in her chart that the agent’s summary had compressed away.

Maria was never in a beta cohort. She had never filled out a user-research survey. She had never seen the product. She was the person the product was actually for, and no one on the team had ever seen her face.

The previous nine chapters are about building agentic products well. This chapter is about building them responsibly. These are not the same thing.

A product can have a well-designed autonomy boundary, a carefully crafted approval moment, a full eval suite, six observation metrics in production, and a supervisory interface that holds trust stable over time. It can do all of this and still cause harm, because the design was optimized for the user operating the system and never accounted for the person downstream who lives with the consequence.

That person is almost never in your user research. They are not in your beta cohort. They are not in the sprint review. They are the patient whose pathology result was processed by your clinical decision support tool. The loan applicant whose file was scored by your credit model. The job candidate whose resume was ranked by your screening agent. The supplier whose payment was delayed by your procurement system’s autonomous decision. Her name is Maria, or one of a thousand other names the product will never know.

These people will never give you feedback. They will never appear in your analytics. They will never file a support ticket. And the quality of your product, measured honestly, depends on what happens to them.


The User and the Affected Person Are Not the Same

In every previous generation of enterprise software, the user and the person affected by the decision were usually the same, or close enough that the distinction did not drive design choices. The procurement manager who used the ERP was also the person accountable for the purchase order. The clinician who used the EMR was also the person making the treatment decision.

Agentic AI widens this gap. An agent that processes claims affects patients who never see the interface. An agent that scores applications affects candidates who never interact with the product. An agent that manages a supply chain affects small suppliers whose payment timing is determined by a model they have no visibility into.

Channel 2, as defined in Chapter 2, focused on the supervisor. The affected person is a third party outside both channels. Designing for them is a further obligation, not an extension of the supervisor’s interface. The user is optimizing their workflow. The affected person is experiencing a consequence. Missing that distinction is how teams end up with a great Channel 2 and a Maria.

This gap has a practical design consequence. If your product optimization targets are adoption, task completion, and time saved for the user, you are measuring the experience of the person who operates the system. You are not measuring the experience of the person who absorbs its output. Those metrics can be green while the affected person’s experience degrades, because nobody designed a signal for it.

The design requirement: for every consequential action the agent takes, identify who is affected beyond the user, what outcome they experience, and whether there is any signal in your system that would surface harm to that person before it compounds.
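One lightweight way to make that requirement auditable is an affected-party record kept per consequential action class. The sketch below is illustrative, not a prescribed schema; every field name is an assumption about what your product would need to track.

```python
from dataclasses import dataclass

@dataclass
class AffectedPartyRecord:
    """One record per consequential action class the agent can take.

    Field names are illustrative; the point is that each field needs a real
    answer before the action class ships."""
    action_class: str                 # e.g. "summarize intake form into risk score"
    user: str                         # who operates the system
    affected_party: str               # who lives with the consequence
    outcome_experienced: str          # what the affected person actually experiences
    harm_signal: str | None = None    # the signal that would surface harm early
    signal_owner: str | None = None   # who watches that signal in production

    def gaps(self) -> list[str]:
        """Return the unanswered questions for this action class."""
        missing = []
        if not self.harm_signal:
            missing.append("no signal would surface harm to the affected person")
        if not self.signal_owner:
            missing.append("no one owns the harm signal")
        return missing
```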


Equity Means Looking Inside the Error Rate

A model at ninety-four percent accuracy looks like a success. The question you have to ask is: who is in the six percent?

The six percent is not a random six percent. In a probabilistic system, errors cluster. A credit model trained on historical data will underperform for first-generation borrowers who built creditworthiness through patterns that were never in the training set. A hiring algorithm trained on past successful hires will systematically discount candidates whose career paths do not match the historical template. A clinical risk model trained on one demographic will miss early warning signs in patients whose presentation patterns differ from the training population.

The aggregate accuracy looks fine. The harm is invisible in the summary. Who is in the six percent is the question.

Concept
Disparate Performance

Aggregate accuracy hides population-level failure. A model that is ninety-four percent accurate overall can simultaneously be failing a specific population entirely. This does not show up in summary metrics. It does not trigger an alert. It requires someone with enough domain knowledge and enough conviction to stratify performance by population before launch, and to refuse to ship when the stratification reveals systematic underperformance for a group.

If you did not build this check into your definition of done, nobody else will do it. The populations inside the error rate are almost always the ones least represented in your discovery process.

This is not a technical problem. The tools to measure disparate performance exist: fairness metrics, demographic stratification, subgroup analysis, the broad category of ML-evaluation tools that break overall accuracy down into performance on specific population subgroups. The engineering is straightforward. The question is whether anyone on the product team made it their responsibility to run the analysis and act on the findings.

There is a structural source of disparate performance that most PMs do not name, and it explains why the errors cluster where they do. Chapter 4 introduced first-contact training data mismatch: models are trained on documentation written after an outcome was known (the completed case note, the resolved ticket, the adjudicated claim), and deployed at first contact where the critical signals are incomplete or missing. The populations underrepresented in historical documentation (new-immigrant patients, first-generation loan applicants, non-native speakers, candidates whose career paths do not match the template) are systematically underserved because their first-contact signals were never recorded in the training data at sufficient volume. Disparate performance is the outcome; first-contact mismatch is often the mechanism. The affected person is the one whose documentation was always sparse.

The PM’s obligation: before any agent reaches production, require a stratified performance analysis across the populations the agent will affect. If the analysis reveals systematic underperformance for any group, that is a product defect, not a statistical footnote. Ship when it is fixed, not when the aggregate number looks acceptable.
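A minimal sketch of what that stratified check could look like in practice, assuming a hypothetical predictions table with a ground-truth label, the agent's prediction, and a population attribute; the five-point gap threshold is an illustrative policy choice, not a standard:

```python
import pandas as pd

MAX_GAP = 0.05  # illustrative: maximum tolerated accuracy gap vs. the aggregate

def stratified_report(df: pd.DataFrame) -> pd.DataFrame:
    """Break aggregate accuracy down by population subgroup."""
    overall = (df["prediction"] == df["label"]).mean()
    return (
        df.assign(correct=df["prediction"] == df["label"])
          .groupby("population")["correct"]
          .agg(accuracy="mean", n="count")
          .assign(gap_vs_overall=lambda g: overall - g["accuracy"])
          .sort_values("gap_vs_overall", ascending=False)
    )

def release_gate(report: pd.DataFrame) -> bool:
    """Treat systematic subgroup underperformance as a shipping blocker, not a footnote."""
    failing = report[report["gap_vs_overall"] > MAX_GAP]
    if not failing.empty:
        print("Blocked: subgroups underperforming the aggregate:")
        print(failing)
        return False
    return True
```

The gate belongs in the definition of done: if the report cannot be produced because the population attribute was never collected, that absence is itself a finding.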


Not All AI Errors Look Like Errors

Chapter 5 covered the mechanics of evaluation: pass@k, compound probability, background failures. This section addresses a different problem: the structure of the errors themselves.

Recent research from MIT, Harvard Medical School, and Google Research (Kim et al., 2025) mapped how AI hallucinates in clinical settings. The taxonomy they built applies to any high-stakes AI deployment. The failures are not random noise. They are structured patterns of fabrication, each with a distinct detection challenge.

Factual errors. The model states something false, contradicts data it was given, or overrides information that was explicitly provided. The most dangerous subtype: the model applies a value from its training distribution rather than the value in the input. A drug dosage that ignores the patient’s weight. A contract summary that states the liability cap incorrectly despite the correct figure being in the document. The output does not look wrong.

Outdated references. The model draws on information that was accurate at the time of training but has since been revised. Guidelines change. Standards are updated. Regulations are amended. The recommendation is delivered with the same authority as current-standard information.

Spurious correlations. The model merges information from unrelated contexts into a single output that sounds coherent but is internally incoherent. Both underlying data points may be real. The combination is a fabrication. This cluster is the most likely to pass initial review, because it uses the correct vocabulary and produces an answer that feels right.

Fabricated sources. The model invents references, certifications, studies, or standards that do not exist. It does not flag uncertainty. The fabrication is presented in the same format as a legitimate citation.

Incomplete chains of reasoning. The model reaches a conclusion without completing the logical steps that should have produced it. It skips the differential. It ignores relevant factors. The endpoint is stated confidently. The path that should have led there is missing.

There is a sixth category worth naming, because it is the one most likely to pass every model-level eval and most likely to cause harm to the affected person. Upstream-data-wrong. The model was faithful to the input it received. The input was wrong in a way nobody anticipated: a stale record, a mislabeled field, an upstream system returning a default value when the real value was missing, a classification that was accurate at training time but has since drifted. The output was correct given the input. The input was the failure. Standard hallucination taxonomies describe what the model fabricated. This category describes what the model faithfully reproduced from a complete-looking but misleading input. It passes every model-level eval. It is a data-integrity problem that has been redefined as a model problem because the model is what the user sees.

Concept
The Hallucination Taxonomy, Extended

AI errors are not random. They fall into five structured patterns: factual errors, outdated references, spurious correlations, fabricated sources, and incomplete reasoning chains (Kim et al., 2025). A sixth category is worth naming, because it escapes every model-level eval: upstream-data-wrong, where the model was faithful to a misleading input.

The critical insight: the first two categories can be caught by someone who knows the facts or can check the citations. The last three original categories pass initial review because they use correct vocabulary and arrive in the format of a correct answer. The sixth, upstream-data-wrong, passes every model-level eval and can only be caught by someone who looks at the upstream data integrity. Knowing which cluster an error belongs to changes what you do about it.

The categories differ in how they fail and, more importantly, in who is positioned to catch them. That determines where the detection belongs in the system.

Category | Description | Who can catch it
Factual errors | Contradict known facts or the input provided | Domain reviewer with access to the ground truth
Outdated references | Use superseded guidelines, standards, or regulations | Reviewer with date-aware validation
Spurious correlations | Coherent but internally incoherent blends across unrelated contexts | Domain expert with cross-source check
Fabricated sources | Invented citations, certifications, or standards | Reference verification
Incomplete reasoning | Conclusion reached without visible steps | Require explicit reasoning trace
Upstream-data-wrong | Faithful to a wrong or stale input | Data-integrity and lineage review (not catchable at the model level)

Table 10.1. Hallucination categories and detection strategy.

The practical consequence for PMs: if your eval suite and production monitoring treat all errors as one category, you are measuring frequency without understanding structure. A detection strategy designed for factual errors (check the output against the input) will miss spurious correlations entirely. It will also miss upstream-data-wrong entirely, because the input itself was the failure. The coverage statement from Chapter 5 should include which hallucination types the eval suite is designed to catch, and which, particularly the sixth, require a separate data-integrity review.
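One way to make that coverage statement explicit is a per-category declaration of where detection lives. The category names below follow Table 10.1; the mechanism names are hypothetical placeholders for whatever checks your pipeline actually runs.

```python
from dataclasses import dataclass

@dataclass
class DetectionCoverage:
    category: str            # hallucination category from Table 10.1
    mechanism: str           # where the check lives (names are illustrative)
    covered_by_evals: bool   # can the model-level eval suite catch it?

COVERAGE = [
    DetectionCoverage("factual errors", "output-vs-input consistency check", True),
    DetectionCoverage("outdated references", "date-aware guideline validation", True),
    DetectionCoverage("spurious correlations", "cross-source domain review", False),
    DetectionCoverage("fabricated sources", "reference verification", True),
    DetectionCoverage("incomplete reasoning", "required reasoning-trace check", True),
    DetectionCoverage("upstream-data-wrong", "data-integrity and lineage review", False),
]

def coverage_statement(coverage: list[DetectionCoverage]) -> None:
    """Print which categories the eval suite covers and which need a separate review."""
    for c in coverage:
        where = "eval suite" if c.covered_by_evals else "separate review outside the evals"
        print(f"{c.category:22s} -> {c.mechanism} ({where})")
```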


What Should Never Be Delegated

Some decisions should not be delegated to an agent, regardless of how accurate the model is.

This is not a technical claim. It is a moral one. There are decisions where the act of explanation, the ability to be questioned, and the willingness to say “I made this call and here is why” are part of what makes the decision legitimate. An algorithm can be accurate. It cannot be accountable in the way that some decisions require.

There is a useful analog from medicine. Physicians carry an internalized constraint that no algorithm replicates: a version of the Hippocratic oath enforced not by an external regulator but by the physician’s own runtime behavior at the moment of decision. A physician who sees that a proposed intervention is wrong does not defer to the protocol. She refuses, explains, and carries the accountability for the refusal. That refusal is a runtime constraint, enforced at the moment the action would be taken, by a person who has internalized the principle.

AI systems do not have this layer. They have design-time constraints (what the system was trained to do), deployment-time constraints (what the governance layer authorizes), and output filters (what the response can contain). None of these is a runtime-enforced ethical constraint in the same sense. The system cannot refuse on moral grounds. It can only refuse on technical grounds, and the technical grounds are designed into it by someone who anticipated the refusal in advance. If the situation was not anticipated, the system does not refuse.

Concept
The Constitutional Runtime Layer

This is a product design gap that matters more than most teams recognize. Responsible agentic systems need two layers: (1) ethical principles declared explicitly, not assumed, and (2) a runtime architecture that enforces those principles before output reaches the user, with mechanisms for refusal that do not depend on every situation being anticipated in advance.

Physicians carry this as an internalized oath. Pilots carry it as the two-challenge rule and mandatory go-around authority. Agentic systems need an equivalent: a runtime-enforced layer that can refuse an action on principled grounds even when the technical path to action is open. Without it, every action the agent can technically take is an action it will eventually take, if the conditions align.

The PM artifact: an enumerated list of actions the agent will never take regardless of instruction, a runtime mechanism that enforces the list, and an audit trail that records refusals as first-class events, not error conditions.
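A minimal sketch of that artifact, assuming hypothetical action names and a simple append-only audit log; the non-delegable list itself is the product decision, not something a framework supplies:

```python
import json
import time

# Hypothetical non-delegable actions; the real list is enumerated and reviewed
# by the product team, not inferred by the model.
NON_DELEGABLE = {
    "deny_insurance_coverage",
    "reject_loan_application",
    "send_clinical_risk_assessment_to_patient",
}

def runtime_guard(action: str, payload: dict, audit_log_path: str = "refusals.jsonl") -> dict:
    """Enforce the non-delegable list at the moment of action.

    Refusals are recorded as first-class events, not error conditions."""
    if action in NON_DELEGABLE:
        event = {
            "ts": time.time(),
            "event": "refusal",   # a first-class event, not an exception
            "action": action,
            "reason": "non-delegable: requires a named human decision",
            "payload_summary": {k: str(v)[:80] for k, v in payload.items()},
        }
        with open(audit_log_path, "a") as f:
            f.write(json.dumps(event) + "\n")
        return {"status": "refused", "route_to": "human_review"}
    return {"status": "allowed"}
```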

The delegation boundary is not always obvious, and it shifts depending on the stakes, the context, and the populations involved. But the obligation to find it before you ship is not negotiable. The PM who does not think about this before the product is built will be forced to think about it after something goes wrong, under conditions that are far less forgiving.

A useful heuristic: if the affected person would reasonably expect a human to have made the decision, and would be materially harmed by learning that no human was involved, the decision should not be fully delegated. That does not mean the agent cannot assist. It means the human must be in the loop at the moment of consequence, with enough information to exercise genuine judgment, not just rubber-stamp an automated output.

One further elaboration. The decisions that should never be delegated are not a finite list that a committee defines in a document. They are a moving surface, because the moral weight of a decision depends on the context, the population affected, the reversibility of the consequence, and the legitimacy of the authority being transferred. The PM’s ongoing job is to surface these decisions as they appear in the product, test them against the affected-person heuristic, and encode the answer as a non-delegable boundary in the product.

Some categories consistently clear this bar: decisions that deprive someone of a resource they had (credit, coverage, employment, custody); decisions that expose someone to physical or medical risk they did not consent to; decisions where the affected person would have no path to appeal if the decision goes wrong. In healthcare, credit, and many hiring workflows, this is not just good architecture. Under the EU AI Act, these are classified as high-risk systems: human oversight and the ability to interrupt or override are legal requirements for market entry, not PM preferences. The design work is making that oversight real in the product rather than nominal in a policy document. If any of these decisions is on your agent’s action list without an explicit human accountability stage, the product was not built with the obligations of this chapter in view.


When the Supervisor in the Loop Is Being Eroded by the Loop

One more subsection belongs in any honest treatment of obligations to the affected person: the one that ties the supervision paradox from Chapter 1 to the regulatory framework that purports to protect Maria.

Healthcare AI regulation, like most AI regulation, rests on the premise of a competent human in the loop. The FDA’s Clinical Decision Support guidance is built around Criterion 4: a qualified clinician independently reviews the AI recommendation before acting. The EU AI Act’s high-risk-system designation requires “human oversight measures” that include “the ability to intervene” and “the ability to override the system’s output.” Both frameworks assume the human in the loop has the capability to perform the review.

That assumption is not being tested. It is being assumed. And the deployment patterns the regulatory framework permits are the same patterns that the supervision paradox literature shows erode the supervisor’s capability over time. The Budzyń ACCEPT data on three-month deskilling. The Anthropic 17 percent comprehension deficit on coding tasks. The NEJM review’s never-skilling category. The Klarna supervisor population reshaped by twelve months of agent escalations. None of these are clinical edge cases. They are the structural condition under which Criterion 4 and the EU AI Act’s human-oversight requirements are being deployed.

Maria’s case is the operational version. The clinical decision support agent flagged her low-risk. The physician reviewing the queue confirmed it. By the regulatory framework, the loop ran as designed. By the supervision-paradox literature, the question is whether the physician’s ability to catch the agent’s compression of four risk factors into a two-line summary was the unimpaired ability the regulator imagined or the eroded ability the deployment had been shaping for the previous twelve to eighteen months. The regulatory framework cannot tell us, because the framework does not measure the supervisor’s state.

The PM design implication is concrete and uncomfortable. The product cannot rely on the regulatory human-in-the-loop framing as the safety mechanism. The framework is necessary. It is structurally insufficient. The product needs its own safety architecture that does not assume supervisor competence is preserved. The Constitutional Runtime Layer above is one part of that architecture: a non-delegable list enforced at the moment of action, regardless of whether the supervisor would have caught the error. The data-integrity review for upstream-data-wrong errors is another: a check that does not depend on a supervisor reading every output. Disparate performance stratification before launch is a third: it catches population-level failures that no supervisor, however unimpaired, could detect from individual cases.

None of this is a critique of regulation. The EU AI Act and the FDA CDS guidance are doing what regulation can do at the framework level. The PM’s obligation is to recognize that compliance with those frameworks is the floor, not the ceiling. The affected person depends on the safety architecture the PM built into the product, not on the safety architecture the regulator imagined when they wrote the rule.1


Moral Architecture, Not a Moral Committee

Chapter 9 treated governance as an operational product surface: authority, escalation, logged boundary events, policy encoded into UI. That treatment is correct and not repeated here. This section addresses the moral-accountability dimension of governance, which is distinct.

A governance committee is not moral architecture. A moral architecture is a set of enforced constraints that operate at the moment of consequence, not in a quarterly review. The moral architecture of a hospital is not the ethics committee. It is the attending physician’s signature on the order, the nurse’s mandate to question an order that looks wrong, the morbidity-and-mortality review that forces an institution to confront its own failures publicly, and the license the regulator can revoke. These are enforced constraints with real consequences. The committee is where the constraints get calibrated. The constraints themselves live in the product.

Agentic systems inherit the same structure or they do not have one. If the moral architecture is a slide deck, there is no architecture. If it is a runtime-enforced layer with a non-delegable list, a named accountability person per action class, a refusal mechanism, and a public incident-disclosure posture, there is architecture. The difference is whether the constraints operate at the moment of consequence or only in a retrospective conversation.


The Obligations Are Not Optional

Each of these obligations (designing for the affected person, stratifying performance across populations, understanding the structure of AI errors including the upstream-data-wrong category, defining what should never be delegated with the Constitutional Runtime Layer in mind, recognizing that the regulatory human-in-the-loop framing is structurally insufficient under the supervision paradox, and embedding the moral architecture in the product) points at the same thing: there are people affected by your product who will never give you feedback, never appear in your analytics, and never be in the room when you make the decisions that shape their experience.

These are not features you add when the roadmap has room. They are not compliance checkboxes. They are the difference between a product that generates value and a product that generates value for some people while quietly causing harm to others.

The PMs who carry these obligations are not slower. They build better products. They build products that survive the first regulatory inquiry, the first public failure, and the first moment when someone asks “who was responsible for this?” and expects a real answer.

In an industry moving as fast as agentic AI, that seriousness is exactly what the market will eventually select for. The only question is whether you built it in from the start or tried to retrofit it after the incident. Maria’s family will not ask politely.

Notes

  1. The supervision-paradox-meets-regulation argument is consistent with the recent peer-reviewed literature naming the “AI supervisory fallacy” (Chen, Pfeffer, Longhurst, BMJ Digital Health 2026), which identifies the same structural gap from a different angle. The Budzyń ACCEPT colonoscopy result, the Abdulnour NEJM 2025 deskilling-mis-skilling-never-skilling taxonomy, and the EASA SIB 2025-09 manual-flying-skills bulletin together constitute the empirical anchor for the claim that the regulatory framework’s human-in-the-loop assumption is structurally weakening at exactly the moment the framework is being deployed at scale. Friedman, “The Last Generation That Can Supervise AI,” data-decisions-and-clinics.com, 2026, develops the argument at length.