
Chapter 8: Two Kinds of Human-in-the-Loop

“Human-in-the-loop” names two different things. Program-level oversight asks whether the system is right in aggregate. Transaction-level review asks whether this output reaches the person without a human checking it first. They are different design choices with different failure modes, and assuming the first covers the second is the error the courts are now treating as no oversight at all.

A health insurer ran an AI system to review medical-necessity claims. A physician signed each denial, so on paper a human was in the loop on every decision. Internal records later showed physicians signing off on more than three hundred thousand claims in a two-month window, at roughly one and a half seconds per claim. The litigation further alleged that of the small fraction of denials patients appealed, around nine in ten were overturned, a reversal rate that, if it holds, says the decisions patients contested were wrong far more often than not. There was a human in the loop on every single one of those denials, and there was no human oversight of any of them.

That gap is the subject of this chapter, and it is not a story about a bad actor. It is a story about a design confusion that good teams fall into constantly, because the term “human-in-the-loop” hides two entirely different commitments inside one phrase. One asks a question about the system: across thousands of decisions, is this thing performing the way we said it would? The other asks a question about a single decision: before this specific output changes this specific person’s life, did a human with the competence and the time actually look? You can satisfy the first completely and have nothing of the second, which is exactly what an insurer builds when it audits aggregate denial rates while a physician spends a second and a half per claim. The chapter on evals covered the system question. The chapter on designing behavior covered the approval moment. This chapter is about deciding, for each output your agent produces, which of the two kinds of oversight that output requires, and refusing to let one masquerade as the other.

The two questions

Program-level oversight is review of the system in aggregate. A team samples outputs, tracks success and failure rates, watches for drift, and audits performance against the standard the product committed to. It is periodic, statistical, and retrospective, and it is the right instrument for the question “is the agent, as a system, behaving acceptably across the population it serves.” Most of what a mature team calls monitoring is program-level, and most of an eval suite is program-level by nature: it characterizes a distribution.

Transaction-level review is a human looking at an individual output before or immediately after it executes, with the authority and the means to stop it. It is per-decision, contemporaneous, and prospective. It is the right instrument for the question “should this output, for this person, proceed without a human in the way.” The approval moment from the design chapter is a transaction-level mechanism. A stop button a person actually watches is transaction-level. A monthly accuracy report is not.

The two are not redundant, and neither contains the other. Program-level oversight tells you the system denies claims correctly 92 percent of the time. It tells you nothing about whether the wrongful denial that reached a particular patient last Tuesday was caught, because in a statistical view that denial is a rounding error and to the patient it is the whole event. Transaction-level review catches that denial and tells you nothing about whether the system as a whole is drifting toward a higher error rate next quarter. You need both, for different reasons, and the design failure is using one to discharge your obligation to the other.

One caution before you reach for transaction-level review as the answer: it has a second failure mode, which this chapter only sets up and a later one takes apart. Funding the review properly fixes the Cigna problem, which is a budget problem: it puts a real human with real time on each consequential case. It does not fix the problem of speed. When the agent acts faster than any human can review, no amount of funding rescues the per-decision loop, because the constraint is no longer the budget but the clock; a reviewer who needs ten seconds to judge a case the agent decides in fifty milliseconds is decorative at any staffing level. For those cases the answer is not a faster or better-funded reviewer but a different control entirely: a hard limit the agent cannot cross, a class of action it may not take alone, a pre-commitment rather than a real-time veto. Transaction-level review is the right mechanism when the volume and speed leave a funded human real time to judge. When they do not, you are in the territory the human-in-the-loop-fails chapter maps, and the fix lives there, not here.
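The distinction between a budget problem and a clock problem can be made concrete with a small check. The Python sketch below is illustrative only: the function name, the three inputs, and the idea of a "commit deadline" (how long the action can wait before it takes effect) are assumptions introduced here, not terms from the chapter.

```python
from enum import Enum


class Constraint(Enum):
    CLOCK = "clock"    # no staffing level helps; use a hard limit instead
    BUDGET = "budget"  # review is possible in principle but underfunded
    NONE = "none"      # a funded human has real time to judge


def binding_constraint(
    judgment_seconds: float,             # time a competent human needs per case (assumed input)
    commit_deadline_seconds: float,      # how long the action can wait before committing (assumed input)
    funded_seconds_per_decision: float,  # reviewer time actually budgeted per case (assumed input)
) -> Constraint:
    """Say which constraint, if any, makes per-decision review decorative."""
    # The clock comes first: if the action commits before a human could
    # finish judging it, more reviewers do not change the outcome.
    if judgment_seconds > commit_deadline_seconds:
        return Constraint.CLOCK
    # Otherwise review is feasible, and the question is whether it is funded.
    if funded_seconds_per_decision < judgment_seconds:
        return Constraint.BUDGET
    return Constraint.NONE
```

With the chapter's own numbers, `binding_constraint(10, 0.05, 600)` returns `Constraint.CLOCK`: a ten-second judgment cannot gate a fifty-millisecond decision however much reviewer time is funded. A case that can wait a day but is funded at a second and a half against a hypothetical five-minute judgment comes back as `Constraint.BUDGET`.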

Which one an output needs

Not every output earns transaction-level review; most do not, and demanding it everywhere is its own failure, because a review budget spent on low-stakes outputs is a review budget unavailable for the ones that matter. Four questions decide it, and they are worth asking in order.

The first is harm asymmetry. When the agent is wrong, is the cost of a false positive roughly the same as a false negative, or wildly different? An agent that occasionally recommends the wrong article is symmetric and cheap on both sides. An agent that denies a cancer patient post-acute care has a catastrophic false positive and a trivial false negative, and that asymmetry alone can require per-decision review regardless of how good the aggregate numbers look. The second is reversibility, and it is the axis with the strongest grounding in where the law actually lands. The strictest per-transaction requirements cluster around decisions that cannot be undone or that affect a life while they stand: a medical-necessity denial, a credit denial, an employment termination. For an advisory output a human then acts on, aggregate oversight is usually enough, because the human action is itself the transaction-level check. The third is regulatory exposure, which increasingly removes the choice from your hands. The fourth is the cost of reviewing each transaction, which is real and is the reason the first three matter: if per-decision review were free you would do it everywhere, and the entire discipline is about spending a finite review capacity where harm asymmetry and irreversibility say it belongs.
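The first three questions can be sketched as an ordered check over an output class. This is a minimal Python illustration under stated assumptions: the `OutputClass` fields, the relative cost scale, and the `ASYMMETRY_THRESHOLD` value are all hypothetical, and a ratio of costs is a crude stand-in for a judgment that also depends on how large the harm is in absolute terms. Per-review cost does not appear as a branch, because in the argument above it is the reason the other three matter rather than a test of its own.

```python
from dataclasses import dataclass
from enum import Enum


class Oversight(Enum):
    PROGRAM = "program-level"
    TRANSACTION = "transaction-level"


@dataclass(frozen=True)
class OutputClass:
    name: str
    false_positive_cost: float  # relative harm when the agent wrongly acts (e.g. wrongly denies)
    false_negative_cost: float  # relative harm when the agent wrongly holds back
    reversible: bool            # can the decision be undone before it affects a life?
    advisory: bool              # does a human act on the output before it takes effect?
    regulated: bool             # does a rule require per-decision human review?


# Hypothetical threshold: how lopsided the two error costs must be
# before asymmetry alone calls for per-decision review.
ASYMMETRY_THRESHOLD = 10.0


def choose_oversight(oc: OutputClass) -> tuple[Oversight, str]:
    """Return the oversight kind and the first question that decided it."""
    low = min(oc.false_positive_cost, oc.false_negative_cost)
    high = max(oc.false_positive_cost, oc.false_negative_cost)
    if high == 0:
        ratio = 1.0            # no harm on either side: symmetric and cheap
    elif low == 0:
        ratio = float("inf")   # harm on one side only
    else:
        ratio = high / low

    if ratio >= ASYMMETRY_THRESHOLD:
        return Oversight.TRANSACTION, "harm asymmetry"
    # An advisory output is checked by the human who then acts on it.
    if not oc.reversible and not oc.advisory:
        return Oversight.TRANSACTION, "irreversibility"
    if oc.regulated:
        return Oversight.TRANSACTION, "regulatory exposure"
    return Oversight.PROGRAM, "low stakes"
```

An article recommender with equal, small costs on both sides lands on program-level oversight; a care denial with a cost ratio of a thousand to one is sent to transaction-level review by the first question alone, before reversibility or regulation is consulted.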

Two kinds of human-in-the-loop. Program-level oversight: periodic, statistical review of the system in aggregate. Right for “is the agent behaving acceptably across the population?” The eval suite and the monitoring dashboard live here. Transaction-level review: a competent human looking at an individual output, with authority to stop it, before or immediately as it executes. Right for “should this output, for this person, proceed un-checked?” The approval moment and the watched stop button live here. Choose per output type using harm asymmetry, reversibility, regulatory exposure, and per-review cost. The default error is assuming program-level oversight discharges a transaction-level obligation.

What the law already decided

The regulatory direction is no longer ambiguous, and a PM shipping into a consequential domain should treat transaction-level review as the presumption rather than the exception. In the EU, the AI Act’s oversight provision requires, for every high-risk system, that a person be able to interpret the output, decline to use it, and interrupt the system; for remote biometric identification it goes further and requires that each identification be confirmed by at least two competent people before any action is taken, which is the clearest per-decision mandate on the books. In US healthcare the requirements are the most explicit: a 2025 California law bars health plans from using AI as the sole basis to deny care and requires a licensed physician to make the medical-necessity determination, not merely approve the algorithm’s, and a Medicare prior-authorization pilot starting in 2026 requires a human clinician to validate each AI-flagged denial before it is final. Several more states passed equivalent laws in a single legislative session, so this is a trend, not an outlier. In consumer credit, the adverse-action rules require a specific, accurate reason for each denial and explicitly forbid “the model is too complex to explain” as a defense, which forces a human to understand each decision well enough to articulate it. Financial regulators now frame per-decision human sign-off as the expected practice for any AI that influences a customer outcome or executes a transaction. And the most recently enacted comprehensive US AI law gives a consumer the right to human review of the specific adverse decision, not of the system’s aggregate performance.

Read across all of it and the same sentence keeps surfacing, and it is the one to carry out of this chapter: the presence of a human in the loop is not the same as human oversight. The insurer in the opening had a human present on every denial. What the courts are now being asked to define is what meaningful review means at the transaction level, and the working answer is that a signature applied at a second and a half per claim is presence performing as oversight, which is to say it is the absence of oversight wearing its uniform.

The hybrid most systems actually need

In practice you are not choosing one kind for the whole product. You are assigning a kind to each output class, and most real systems end up layered. The low-consequence, reversible, symmetric outputs run on program-level oversight: ship them, sample them, watch the aggregate, and react if the distribution moves. The high-consequence or irreversible or regulated outputs carry a transaction-level gate: a competent human, with the time and the authority and the context the trace cannot supply, looks before the action commits. The skill is drawing that line carefully per output type and then resourcing the transaction-level side for real, because the failure mode is not usually choosing wrong on paper. It is choosing transaction-level review on paper and then funding it like program-level oversight, which is how you arrive at one reviewer, three hundred thousand decisions, and a second and a half each.
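A layered system of this kind can be sketched as a router that sends each output class down one of two paths. The Python below is an illustration, not a reference design: the `Router` class, its field names, the `approve` callback standing in for a human reviewer, and the default sampling rate are all assumptions made for the example.

```python
import random
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Router:
    # Output classes that must pass a human before they commit (assumed names).
    gated_classes: set[str]
    # Hypothetical callback: blocks until a human reviewer approves or rejects.
    approve: Callable[[str, str], bool]
    # Fraction of ungated outputs kept for retrospective, aggregate review.
    sample_rate: float = 0.05
    rng: random.Random = field(default_factory=random.Random)
    audit_sample: list[tuple[str, str]] = field(default_factory=list)

    def handle(self, output_class: str, output: str) -> bool:
        """Return True if the output may commit."""
        if output_class in self.gated_classes:
            # Transaction-level gate: nothing commits without a human decision.
            return self.approve(output_class, output)
        # Program-level path: ship it, and keep a sample for the aggregate audit.
        if self.rng.random() < self.sample_rate:
            self.audit_sample.append((output_class, output))
        return True
```

The router only draws the line. Whether the gate means anything depends on what sits behind `approve`: a callback that returns in a second and a half is the failure described above, implemented faithfully.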

Cigna: presence is not oversight. A health insurer used an AI-assisted platform to review claims for medical necessity, with a physician signing each denial, so by the workflow’s design a human was in the loop on every decision. Documents surfaced in litigation showed physicians signing more than three hundred thousand claims over about two months, roughly one and a half seconds per review; of the denials that were appealed, the litigation alleged around nine in ten were overturned.

The defense was that a human reviewed each denial. The allegation was that the review was performative: present in the workflow, absent in substance. The team had a transaction-level obligation, the irreversible denial of care to an individual, and met it with a transaction-level mechanism funded at a program-level budget. A human at one and a half seconds per decision is not a careful reviewer working quickly; that person is an aggregate auditor relabeled as a per-decision gate, and the relabeling is the failure.

For one output type your agent produces, decide which kind of oversight it requires, and then write the harm-asymmetry sentence that justifies the choice, the one that names what a false positive costs against a false negative. If the answer is transaction-level, look at what you have actually budgeted for that review, in seconds per decision, and ask whether a person could exercise judgment in that time or only a signature. The gap between those two is where a later chapter, on why human-in-the-loop fails, begins, because designing the gate is the easy part. Keeping a human truly in it, month after month, is the part that fails.
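The budget check in this exercise is simple arithmetic, and writing it down makes the gap hard to ignore. The Python sketch below uses hypothetical function names; the three hundred thousand decisions and the second and a half come from the case above and are approximate, while the five-minute review time in the usage note is an assumption chosen only for illustration.

```python
def seconds_per_decision(reviewers: int, hours_per_reviewer: float, decisions: int) -> float:
    """Reviewer time actually available for each decision, in seconds."""
    if decisions <= 0:
        raise ValueError("decisions must be positive")
    return reviewers * hours_per_reviewer * 3600 / decisions


def review_hours_needed(decisions: int, seconds_each: float) -> float:
    """Total reviewer hours required to give every decision `seconds_each`."""
    return decisions * seconds_each / 3600
```

At a second and a half each, `review_hours_needed(300_000, 1.5)` comes to 125 hours for the whole two-month caseload. At an assumed five minutes per claim, `review_hours_needed(300_000, 300)` comes to 25,000 hours. The distance between those two totals is the difference between a signature and a judgment, stated as a staffing number.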
