Chapter 12: The Record Is Complete. The Picture Is Wrong.
The wrong answer usually comes from the data, not the model. The agent reasons faithfully over a record that is comprehensive and wrong, and the output is grounded, specific, and traceable to a real source, which is exactly why the human reviewing it signs off. This is the failure the human-in-the-loop was never built to catch, and seeing it requires the supervisory layer to grow a dimension that watches the data, not just the agent.
A consumer health platform recently assembled my medical record for me. Years of encounters across multiple providers, pulled from connected systems into one view, and it asked me to confirm which items were still relevant. The record was extensive. It was also, in any clinically meaningful sense, wrong. There were conditions listed that had resolved more than a decade ago. There were medications prescribed once for a problem that cleared and never came back. There was an allergy that had been formally evaluated and removed from my chart at one institution, still listed as active everywhere else, because the correction had never traveled. The record was comprehensive and it was dangerously wrong, and those two facts were true at the same time, in the same document, with nothing on the page to tell them apart.
Now imagine an agent reasoning from that record instead of me. It would not invent anything. It would read the allergy that is no longer real, the condition that resolved a decade ago, the medication long discontinued, reason correctly from every one of them, and produce a confident, well-grounded, specifically cited recommendation that is wrong at the root. This chapter is about that failure, and it is not a healthcare problem. It is the most common way an agentic product fails in production, and it is the one your supervisory layer is least equipped to catch, because the layer you built to catch errors was built to catch a different kind of error.
The model is doing its job; the substrate is wrong
Ask where an agent’s wrong answer actually comes from, and the answer, more often than teams want to admit, is the data, not the reasoning. An agent reasoning over stale pricing approves purchase orders that miss the market. An agent working from a knowledge base of retired policies gives customers guidance the company stopped supporting six months ago. An agent whose entity mappings are broken routes every ticket of a certain type into the wrong workflow, confidently, every time, with nothing wrong at the application layer. In each case the model did exactly what it was built to do. The substrate was wrong, and infrastructure monitoring saw none of it, because infrastructure monitoring is not looking at whether the data is true. It is looking at whether the pipe is flowing.
This is worth separating cleanly from the kind of failure everyone already knows about, because the whole industry response to AI error is built around the other kind. We have been taught that hallucinations are confabulations, facts the model invents with no grounding, and the prescribed defense is the human in the loop: a reviewer who checks the output, traces it to a source, and flags what does not check out. That defense works for fabrications. It does not work for this. When the agent cites a real record, a real policy, a real price, a real allergy entry, the reviewer is not looking at a hallucination in any sense they were trained to recognize. They are looking at a grounded, well-reasoned output from a flawed premise. The reviewer approves it because, against the source, it checks out. Against reality, it does not. The check they were trained to run is a check the failure passes.
You could call this a data-quality problem and not be wrong, since bad input has produced bad output since the first database. But that framing misses what is new and dangerous about it, which is not that the input was wrong but that the agent launders the wrong input into an output that defeats the review designed to catch error. The usual descriptions of AI hallucination, the factual error, the outdated reference, the fabricated source, all share a tell: the output is wrong in a way a careful human reviewer can catch, because the model invented something. Here the model invents nothing. It reasons correctly over a wrong premise and produces an answer that is fluent, specific, internally consistent, and cites a real source, because the source is real; only the record it points to was stale or incomplete. The reviewer trained to catch fabrication checks the output against the source, finds the source exists and says what the agent claims, and signs off. The check passes precisely because the failure is upstream of what the check inspects. Call it data-layer hallucination: not a sixth item on a list, but a failure that wears the costume of a verified answer and walks straight through the gate built to stop the others.
This is the iceberg from the literacy chapter, surfacing where it does the most damage. The record the agent read was the tip: the values were all there, comprehensive, specific, real. What stayed below the water was the meaning, that the allergy had been retired, that the condition had resolved, that the medication was long discontinued. The data moved between systems and the context that would have marked those entries as no longer true did not move with it. The agent reasoned over the tip and never saw the rest of the iceberg, because the rest had been left behind in the system the record came from.
What the agent reasons over
Start with a sentence that should be obvious and is routinely not owned by anyone: it is the product manager’s job to know what the agent reasons over. Not engineering’s, not the data team’s, yours, the way it was always your job to know what a feature was for. An agent’s output is a function of what it was given, so the inputs are not an implementation detail beneath the product. They are the product. A PM who can describe the agent’s behavior but not what feeds it has described half of it, and it is the dangerous half they left out.
What the agent reasons over has two parts, and the four-layer stack already named them. Data is the fact: the glucose reading is 9.1, the patient is on metformin, a retinal scan was done in March. Context is what connects the facts into meaning: that the glucose and the metformin and the retinal scan all belong to the same managed diabetes, that the metformin is treating the condition the glucose reflects, that the scan is the monitoring the diagnosis requires. Data is the dots. Context is the lines between them. An agent can have every fact right and still be wrong, if the lines connecting them are missing or drawn to the wrong places, and it can have the lines right and be wrong because one of the dots is stale. Both arrive at the same place, a confident answer from a flawed premise, and a PM who cannot say which one failed cannot fix either.
There is a third thing the PM owns, and it is the one most often left to default: the channels the data and context arrive through. A fact is not just a fact; it is a fact from somewhere, and where it came from changes how likely it is to be wrong. Take a shipping address. It can come straight from the verified account record, or it can come from a channel that looks just as authoritative and is far more fragile: the customer gives a phone number, a lookup returns an address, and the agent now holds an address that is the product of a user-entered value and an inference on top of it, two chances to be wrong stacked into one field. To the model the two addresses are identical, a string in the right format. They are not identical. One is a fact and the other is a hypothesis wearing a fact’s clothes, and the difference is the channel each arrived through. Defining which inbound channels feed the agent, and how far each one is trusted, is a design decision, and it is yours to make. An agent that ships to the looked-up address with the same confidence as the account address has not been given bad data; it has been given a channel design no one decided on purpose. The PM names the channels, ranks their trust, and decides what the agent is allowed to do on the strength of each: ship on the verified record, confirm before acting on the inferred one. Leave that undecided and the agent will reason over whatever reaches it, weighting a stacked inference and an audited record exactly the same, which is to say not weighting them at all.
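One way to make that decision explicit is to write it down as a policy the agent has to consult before it acts. The sketch below is a minimal illustration, not a reference design: the channel names, the trust ranks, and the allowed_action function are all hypothetical, chosen to mirror the shipping-address example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Trust(IntEnum):
    """Hypothetical trust ranks; a higher value means more trusted."""
    INFERRED = 1      # derived by a lookup on top of another value
    USER_ENTERED = 2  # typed in by the customer, unverified
    VERIFIED = 3      # audited system-of-record value


# Assumed channel names for illustration; a real product would name its own.
CHANNEL_TRUST = {
    "account_record": Trust.VERIFIED,
    "customer_typed": Trust.USER_ENTERED,
    "phone_lookup": Trust.INFERRED,
}


@dataclass(frozen=True)
class Sourced:
    """A value that carries the channel it arrived through."""
    value: str
    channel: str


def allowed_action(field: Sourced) -> str:
    """Decide what the agent may do on the strength of the channel alone."""
    trust = CHANNEL_TRUST.get(field.channel)
    if trust is None:
        return "reject"  # unknown channel: nobody decided to trust it
    if trust >= Trust.VERIFIED:
        return "act"
    return "confirm_first"


# The same string, arriving through two different channels.
account = Sourced("12 Elm St", "account_record")
looked_up = Sourced("12 Elm St", "phone_lookup")
print(allowed_action(account))    # act
print(allowed_action(looked_up))  # confirm_first
```

The two values are byte-for-byte identical, and the policy still treats them differently, because the decision is keyed on the channel rather than on the string.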
There is a reason the agent needs all of this held to a higher standard than you would ever demand of a person, and it is worth naming because it reframes the whole job. An experienced human reasons with something the agent does not have. A physician looks at a patient and feels that something is off before any test says so. A seasoned accountant reads a clean dashboard and senses that the numbers do not add up. A professor reads a competent paper and knows it is not original. Call it gut, judgment, pattern learned over a career; whatever it is, it is a mix of the quantitative data in front of them and a qualitative sense built from everything they have seen before, and it works as a silent completeness check. It is the faculty that notices the gap, distrusts the stale figure, asks the question the record did not prompt. The agent has none of it. It cannot feel that a record is too clean, cannot sense the missing piece, cannot bring a career of prior cases to the one in front of it. So the burden moves. The completeness, the freshness, the context that a human supplies from experience, the agent can only get from the input, which means the bar for what you feed it is higher than the bar a person would need, not lower. The agent is not a faster expert with a gut. It is a precise reasoner with none, and it is only as right as what you hand it.
Two failures, pulling in opposite directions
What makes the data layer hard is that it fails in two directions at once, and the two are mirror images.
The first is the gap you cannot see. Connecting records is not the same as completing them. Information lives across dozens of systems that do not all reach one another, so a record that looks whole is missing the parts that never traveled. In the clinical version, only about a sixth of patients with lab-confirmed chronic kidney disease have it on their problem list, which means that for the rest the diagnosis is clinically active and completely invisible to any system reading the record. The agent observes what is connected and reasons from it, and the gap is not marked as a gap. It appears as the absence of a problem, which reads as health. A patient whose monitoring lives in an unconnected specialist clinic looks, in the assembled view, like a patient who needs no monitoring. The enterprise version is the same shape: the agent that cannot see the contract amendment stored in the system nobody connected will reason, correctly and disastrously, from the original terms.
The second runs the other way, the noise you cannot filter. Records are additive. Diagnoses accumulate, prescriptions stack, statuses are entered once and rarely retired in a way that travels. The data standards have the right fields for this, a status that can say active, resolved, inactive, but the fields are filled inconsistently at the source, defaulted to active because updating them takes time nobody has, and overwritten in transit when the mapping layer guesses. A condition entered as active in 2015 and never touched arrives in 2026 still marked active. The pipeline did its job correctly and the information is wrong. So the agent is not reasoning from a current picture; it is reasoning from a comprehensive archive stripped of the annotations that separate then from now. Same confident wrong answer, opposite cause: the first failure is a premise that is absent, the second is a premise that is polluted, and the record gives you no way to tell which one you are looking at, or whether you are looking at both.
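A crude version of the missing annotation can be reconstructed at read time. The sketch below assumes each entry carries a status and the date a person last confirmed it, and that the product has chosen a review window. The RecordEntry type, its field names, and the five-year threshold are illustrative assumptions, not part of any data standard. The function does not decide that an old entry has resolved; it only marks its "active" status as unverified.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RecordEntry:
    """Hypothetical shape of one entry in an assembled record."""
    label: str
    status: str          # "active", "resolved", or "inactive"
    last_reviewed: date  # when a person last confirmed the status


def annotate(entry: RecordEntry, today: date, max_age_years: int = 5) -> str:
    """Return the status the agent should see, marking stale 'active' entries."""
    if entry.status != "active":
        return entry.status
    age_years = (today - entry.last_reviewed).days / 365.25
    if age_years > max_age_years:
        # Flag only. Whether the condition resolved is not ours to conclude.
        return "active_unverified"
    return "active"


today = date(2026, 1, 15)
old = RecordEntry("condition entered in 2015", "active", date(2015, 3, 1))
recent = RecordEntry("current prescription", "active", date(2025, 11, 2))
print(annotate(old, today))     # active_unverified
print(annotate(recent, today))  # active
```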
Both of those are failures of data, the dots being missing or stale. Context fails the same two ways, and it is harder to see. A gap in context is a connection that was never made: the agent holds the glucose and the metformin as two unrelated facts because nothing told it they belong to the same condition, so it reasons about each in isolation and misses what they mean together. Noise in context is a connection drawn wrong: the agent sees the metformin and concludes diabetes, when the same drug is prescribed for several other things, and now every downstream step inherits a diagnosis the data never supported. The dots can be perfect and the lines still absent or false. A reviewer scanning the output will see correct facts and a confident conclusion and have no way to notice that the line between them was never there, or was drawn to the wrong dot.
What the supervisory layer has to watch
Here is the move for a product manager, and it follows directly from the fact that the human reviewer cannot catch this by reading the output. If the failure is invisible at the moment of review, then review is the wrong place to defend against it. The defense has to live where the input enters the agent’s reasoning, which means the supervisory layer needs a dimension that watches what the agent reasons over, not just how the agent behaves. That is not an abstract ask. The properties are nameable, and they split along the same line as before. Two watch the dots. Freshness: is retrieval pulling from current sources or a stale snapshot. Completeness: are there gaps that will make the agent reason from partial evidence without knowing it is partial. Two watch the lines. Referential integrity: do the identifiers the agent uses to connect things actually connect to the right things. Mapping accuracy: are the relationships it traverses actually correct. And one watches the channels: provenance, does the agent know where each input came from, so a typed-in guess and an audited record are not weighted the same, and so a corpus that has been quietly poisoned does not get read with the authority of a system of record. None of these show up in latency, error rate, or any green dashboard, and the tools that watch data quality for analytics pipelines were not built to watch a vector index’s freshness or whether a retrieved document’s provenance is intact. In most enterprises the data-observability stack and the retrieval stack are separate concerns with no single surface telling you whether what the agent is reasoning from is trustworthy at the moment it reasons from it. That surface is something you currently have to build.
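What such a surface might check can be sketched in a few lines. Everything in the example is an assumption made for illustration: the RetrievedDoc fields, the known_entities map, the thirty-day threshold, and the finding names are hypothetical. Mapping accuracy in particular is reduced here to comparing an entity's claimed type against an expected one, which is far cruder than a real check would need to be.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class RetrievedDoc:
    """Hypothetical shape of one input handed to the agent."""
    doc_id: str
    entity_id: str            # identifier the agent uses to connect this doc
    entity_type: str          # what the mapping layer says the entity is
    synced_at: datetime       # last sync from the system of record
    channel: Optional[str]    # provenance; None if it was lost in transit


def check_inputs(docs, required_ids, known_entities, now,
                 max_age=timedelta(days=30)):
    """Return (property, doc_id) findings. An empty list means nothing was
    flagged by these checks, not that the inputs are true."""
    findings = []
    present = {d.doc_id for d in docs}
    for missing in sorted(set(required_ids) - present):
        findings.append(("completeness", missing))
    for d in docs:
        if now - d.synced_at > max_age:
            findings.append(("freshness", d.doc_id))
        if d.entity_id not in known_entities:
            findings.append(("referential_integrity", d.doc_id))
        elif known_entities[d.entity_id] != d.entity_type:
            findings.append(("mapping_accuracy", d.doc_id))
        if d.channel is None:
            findings.append(("provenance", d.doc_id))
    return findings


now = datetime(2026, 1, 15)
docs = [
    RetrievedDoc("price-sheet", "sku-1", "product", datetime(2025, 6, 1), "erp"),
    RetrievedDoc("policy-7", "acct-9", "customer", datetime(2026, 1, 10), None),
]
known = {"sku-1": "product", "acct-9": "vendor"}
findings = check_inputs(docs, ["price-sheet", "policy-7", "amendment-2"], known, now)
for finding in findings:
    print(finding)
# ('completeness', 'amendment-2')
# ('freshness', 'price-sheet')
# ('mapping_accuracy', 'policy-7')
# ('provenance', 'policy-7')
```

The checks run before the agent reasons, on the inputs rather than the output, which is the only point at which these findings are still visible.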
A smarter data layer does exist in research, and the principle is worth carrying even where you cannot build the whole thing, because it draws a line a PM has to respect. If the system knows a patient has diabetes, it already knows what a complete record for that patient should contain, the recent labs, the standard medication or a documented reason for its absence, the annual imaging, so when those are missing it can flag the gap as a gap rather than read it as nothing. The same logic applies to any agent over any domain: from what you know is present, you can compute what should be present and is not. But there is a boundary inside that move, and it is the sharp one. Using a data point to expand what you look for is not the same as using it to infer what is true. A diabetes drug is prescribed for several conditions; concluding the diagnosis from the prescription would be wrong most of the time in some populations. The reverse-lookup expands the search radius; it does not conclude. The moment the data layer decides what the patient probably has, it has stopped doing data engineering and started doing clinical reasoning, which it has no authority to do, and in regulated domains no license to do. Expansion without inference. The data layer’s job is to know what is missing, not to guess what it means.
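The boundary can be made concrete. The sketch below assumes a hand-maintained table of what a complete record should contain for a confirmed condition; the table contents and every name in it are illustrative placeholders, not clinical guidance. Note what the code refuses to do: find_gaps takes confirmed conditions as input, and search_hints returns things to look for, each marked unconfirmed, never a diagnosis.

```python
# Hypothetical table: confirmed condition -> items a complete record should hold.
EXPECTED = {
    "diabetes": {
        "recent_labs",
        "standard_medication_or_documented_reason",
        "annual_imaging",
    },
}

# Hypothetical table: a medication -> conditions worth searching the record for.
# "other_indication" is a placeholder for the drug's other uses.
POSSIBLY_RELATED = {
    "metformin": ["diabetes", "other_indication"],
}


def find_gaps(confirmed_conditions, items_present):
    """From what is known to be present, compute what should be present and is not."""
    gaps = {}
    for condition in confirmed_conditions:
        missing = EXPECTED.get(condition, set()) - set(items_present)
        if missing:
            gaps[condition] = sorted(missing)
    return gaps


def search_hints(medications):
    """Expand the search radius from a medication without concluding anything."""
    hints = []
    for med in medications:
        for condition in POSSIBLY_RELATED.get(med, []):
            hints.append({"look_for": condition, "because": med,
                          "status": "unconfirmed"})
    return hints


print(find_gaps(["diabetes"], ["recent_labs"]))
# {'diabetes': ['annual_imaging', 'standard_medication_or_documented_reason']}

# A prescription alone yields no gaps, because it confirms no condition.
print(find_gaps([], ["metformin"]))  # {}
print([h["status"] for h in search_hints(["metformin"])])
# ['unconfirmed', 'unconfirmed']
```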
The check no platform runs yet
Notice what this chapter has not done: it has not told you whether watching the data is your job or the platform’s. That is deliberate, and it is the honest position, because the answer is moving. Right now no major platform ships a surface that tells you whether the data your agent is reasoning from is true at the moment it reasons from it, the way no platform once shipped encryption or failover, so the work falls to you, by hand, per product. Over time some of this will migrate into the foundation, the way every cross-cutting concern eventually does. But the window where it is unbuilt is the window you are in, and in that window the agent that quietly reasons over a comprehensive, wrong record is shipping in production while its dashboard stays green. Aggregating data is not the same as understanding it, and a record being complete is not the same as a record being right. The teams that learn the difference will not just have agents that answer. They will have agents whose answers you can trust, which is the only kind worth shipping. Take one agent you run and ask the question the dashboard cannot: not is the data flowing, but is it true, and who would know if it were not.