Part III · Failures · Chapter 10

The Calendar the Agent Could Not See

In late 2022 a man booked a flight to his grandmother’s funeral on the strength of what an airline’s website told him. He had asked the airline’s chatbot about bereavement fares, and the chatbot told him he could book now and apply for the reduced rate within ninety days. That was wrong. The airline’s actual bereavement-fare policy, sitting on its own website a click away, said the opposite: the reduced rate could not be claimed after the fact. Two sources inside one company disagreed about what was true, and the man relied on the one he was handed. When he applied for the refund the airline refused, and then made an argument worth pausing on: that it could not be held responsible for the chatbot, because the chatbot was, in effect, a separate entity responsible for its own answers. In early 2024 a tribunal rejected that, in a line that has been quoted ever since: the chatbot is still just a part of the airline’s website, and the airline is responsible for all the information on it. The man got his refund. The airline got a precedent it did not want.

The case is usually filed under hallucination: the chatbot made something up. Filed correctly, it is something else, and that something else is the subject of this chapter. The chatbot did not invent a fare out of nothing. It served an answer that had once been true, or true on some page, while the authoritative version said otherwise, and nothing in the system knew the two had come apart or which one governed. That is not a model being creative. It is a supersession failure: a new truth and an old truth coexisting in one company, with no rule for which wins and no owner for the question. It produced a confident, fluent, completely wrong answer that a person acted on and a tribunal had to correct.

The book has been carrying this chapter’s case since its first page. The travel agent from the opening booked a non-refundable fare into a cancelled trip because the cancellation lived in a calendar the agent could not see, and the sentence that named the failure was that nobody on the team could say whose job it had been to notice that the agent could not see it. The map app in the degradation chapter routed me to an art fair that had closed half a year earlier, fluent and confident and wrong, because its picture of the world had gone stale and nothing announced it. Three failures, three industries, one shape: an agent reasoning over a picture of the world that was incomplete, or out of date, or internally contradictory, and no one whose job was to know. This chapter is about that picture, where it comes from, how it rots, and the seat no team has staffed to keep it true.

Two questions the pipeline never asks

The prior books in this series put the agent’s data in two places, and both placements are correct and both are incomplete. They put the data the agent can reach under the architect’s plumbing, the interfaces and the access, which is right: someone has to decide what the agent connects to. And they put the freshness of a vendor’s model under the currency question asked at procurement, which the degradation chapter in this very part carries: when was the training data last refreshed, on what corpus, with which retractions pulled? That question is real and this chapter does not take it back; it inherits it, because the currency of a bought model is one instance of the larger thing this chapter is about. What neither placement asks is the pair of questions the travel agent needed answered and no one owned.

The first is sufficiency: can the agent see everything its job actually requires it to see. The second is currency: is what it sees still true. The first is never asked because the demo works, and a demo only ever exercises the data that exists; the calendar the agent could not see does not show up in a demo, because the demo’s test cases all happened to live in the systems the agent was wired to. The second is never asked because nothing in the system raises its hand when a source goes stale. Staleness is silent by construction. A retrieval corpus does not email you when the world moves past it; an integration does not flash a warning when the upstream system changes a field; a vendor’s model does not announce on the Saturday it was retrained that the thing it now believes is different from the thing it believed on Friday. The picture decays quietly, and the only signal is a wrong answer, weeks later, to someone who was not in the room.

And here is the structural reason the questions go unasked, the reason that makes this a category of failure rather than a string of bad luck: a model does not refuse on missing context. Hand a human analyst a question they lack the data to answer, and a good one says “I can’t answer that, I don’t have X.” Hand the same question to an agent with the same gap, and it answers anyway, fluently, confidently, filling the hole with the most plausible completion, exactly as the travel agent booked and the chatbot quoted. Absence does not look like absence. It looks like an answer. That single property is why context failures are invisible until they are expensive: the system that has too little to go on behaves, on the surface, identically to the system that has enough.

Five ways the world and the corpus come apart

The picture an agent reasons over fails in five distinct ways, and they are worth separating because each has a different tell and a different owner.

The first is absence. The source the job requires was never wired in at all, like the calendar. The tell is that the failure is invisible in every test, because tests exercise the sources that exist and an absent source has nothing to test; you find absence only when a real case in the world needed the missing piece, and by then it is a booking.

The second is staleness. The source is wired in and the world moved past it, like the art fair and the route to a venue that had closed. The tell is that confidence stays flat while accuracy decays, because nothing in the pipeline carries a freshness date, so the agent is exactly as sure of the stale fact as it was of the fresh one.

The third is supersession, which is the hard case of staleness and the one that caught the airline. The new truth exists, somewhere in the system, and the old truth is still being served, and both are live. The tell is two answers to one question depending on which source the agent happened to retrieve, which means the failure is intermittent, which means it is the kind that passes testing and surfaces in production.

The fourth is contradiction. Two live sources disagree and the agent picks between them silently, with no rule and no record. The tell is that the pick is unlogged, so the disagreement is invisible until the person on the receiving end of the wrong pick surfaces it, the way the man surfaced the airline’s.

The fifth is provenance, and here the chapter has to draw a careful line, because provenance has two halves and only one of them is this chapter’s. The adversarial half (content deliberately injected to manipulate the agent, the poisoned document, the prompt hidden in a retrieved web page) belongs to security and the red team, and the boundary chapter and the security chapter own it; one sentence hands it forward. The half this chapter owns is the benign one, lineage: knowing where every fact in the agent’s context came from and what authority it carries, so that a number pulled from a forum post is not weighed the same as a number pulled from the system of record. Lose lineage and you cannot answer the question that every other failure in this list reduces to, which is simply: says who? Strictly, the first four are the ways the picture breaks and the fifth is the question that diagnoses all of them; it sits in the list because a team that cannot answer it cannot find the other four.
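To make lineage and the unlogged pick concrete, here is a minimal sketch, not a prescribed implementation. It assumes every fact reaches the agent as a claim tagged with its source, an authority rank, and a date; the names `Claim` and `resolve`, the rank numbering, and the sample dates and answers are all hypothetical, chosen only to echo the airline case. The point it illustrates is narrow: when two live sources disagree, the choice follows a rule and leaves a record.

```python
import logging
from dataclasses import dataclass
from datetime import date

log = logging.getLogger("agent.context")


@dataclass(frozen=True)
class Claim:
    question: str
    answer: str
    source: str          # lineage: where the fact came from
    authority_rank: int  # assumption: 1 = system of record, larger = weaker
    as_of: date          # when the source last asserted the fact


def resolve(claims):
    """Return (governing claim, dissenting claims), logging every disagreement."""
    if not claims:
        raise ValueError("no claims to resolve")
    # Highest authority wins; among equals, the most recent claim wins.
    winner = min(claims, key=lambda c: (c.authority_rank, -c.as_of.toordinal()))
    dissent = [c for c in claims if c.answer != winner.answer]
    for c in dissent:
        log.warning("%r: %s (rank %d) overrides %s (rank %d)",
                    c.question, winner.source, winner.authority_rank,
                    c.source, c.authority_rank)
    return winner, dissent


# Illustrative data only: paraphrased answers and invented dates.
question = "Can the bereavement rate be claimed after travel?"
claims = [
    Claim(question, "yes, within ninety days", "chatbot-corpus", 2, date(2022, 1, 1)),
    Claim(question, "no", "policy-page", 1, date(2022, 11, 1)),
]
winner, dissent = resolve(claims)
```

Here the policy page governs and the chatbot corpus is recorded as the dissenting source, so the disagreement is visible to the team before it is visible to a customer.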

There is a sixth thing people will want to add to this list, and it belongs to a different chapter: scope, what the agent must not see. That is the architect’s wall and the security boundary, the subject of the next chapter, and it is the inverse of this one. This chapter is about whether the agent can see enough that is true. The next is about whether it can reach things it should not. One sentence hands scope forward; the rest of this chapter stays on the side of the line where the failure is too little truth, not too much reach.

The seam is the failure

Watch the context failure arrive on a real team, because as with every failure in this part, no one is negligent and the gap is in the space between jobs.

The product manager writes the brief, and the brief assumes the data exists, because for the entire history of software the data existed; you specified the feature and the data was a thing the engineers would wire up. The assumption is invisible precisely because it is not a decision. It is the absence of one. Nobody wrote “this agent must be able to see the customer’s calendar” because nobody writes down the floor they are standing on. The engineer wires the sources that are reachable, which is the engineer’s job and a reasonable reading of it, and in the wiring, reachable quietly substitutes for sufficient, because the sources that have APIs get connected and the sources that do not are not so much rejected as never considered. The domain expert, the person who actually knows that the heart-failure guideline was revised three years ago or that the bereavement policy changed or that this data feed is authoritative and that one is a convenience copy, is never asked, because asking the domain expert is nobody’s step in the pipeline; their knowledge is exactly the knowledge of which sources are true and current, and it stays in their head. The architect built clean, well-documented interfaces to every source the spec named, and did it well, and the spec named the sources someone thought of.

Each person did their job. And the question that fell between all of them (is the picture this agent reasons over sufficient and current for the decisions we have given it the authority to make?) was no one’s, which is the exact sentence the book’s opening case already wrote: nobody could say whose job it had been to notice. That is the seam, and it has a shape worth stating as the chapter’s hinge, because the shape is why it cannot be closed once and forgotten. Sufficiency is decided at design time, in the brief and the wiring, when someone could in principle list what the agent needs to see and check it against what it can. Currency decays at run time, continuously, on the world’s schedule and not the team’s. A question that is settled once at design time and then erodes forever at run time cannot be owned by a one-time decision. It is a standing job, and a standing job is a seat.

The register, and the disciplines that already exist

The artifact that holds the seat is a source-of-truth register, and it is plain enough that the absence of it on most teams is the surprising part. For every source the agent reads, five fields. An owner, a name, because a source whose accuracy is no one’s responsibility is a liability with an API. An authority rank, which source governs when two of them disagree, decided in advance rather than discovered in a tribunal, which is precisely the rule the airline did not have. A freshness SLA, how stale this source may be before the agent’s autonomy on any decision that depends on it is reduced, which is the line connecting this register to the autonomy ladder: a source past its freshness SLA does not just throw a warning, it demotes the agent. A supersession rule, what retires this source’s claims and who pulls the trigger, so that the new truth does not have to win a race against the old one inside the retrieval layer. And a retirement workflow, how a dead source actually leaves the agent’s context, because a source nobody removed is a stale source that still has credentials.
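One way to picture a register entry is as a small record with the five fields, plus a check that turns a breached freshness SLA into a demotion. What follows is a sketch under stated assumptions: the `Autonomy` rungs are hypothetical stand-ins for the autonomy ladder, `last_verified` is a sixth value assumed to come from freshness telemetry, and dropping straight to the lowest rung is a simplification a real team would likely refine.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import IntEnum


class Autonomy(IntEnum):
    # Hypothetical rungs; the autonomy ladder may name them differently.
    SUGGEST_ONLY = 0
    ACT_WITH_APPROVAL = 1
    ACT_ALONE = 2


@dataclass(frozen=True)
class RegisterEntry:
    source: str
    owner: str                 # a named person, not a team alias
    authority_rank: int        # 1 governs when sources disagree
    freshness_sla: timedelta   # how stale the source may be
    supersession_rule: str     # what retires its claims, and who triggers it
    retirement_workflow: str   # how the source leaves the agent's context
    last_verified: datetime    # assumption: supplied by freshness telemetry


def is_breached(entry, now):
    """True when the source is older than its freshness SLA allows."""
    return now - entry.last_verified > entry.freshness_sla


def allowed_autonomy(requested, entries, now):
    """Demote to SUGGEST_ONLY if any source the decision depends on is stale."""
    if any(is_breached(e, now) for e in entries):
        return Autonomy.SUGGEST_ONLY
    return requested


# Illustrative entry; every value here is invented for the example.
policy_page = RegisterEntry(
    source="bereavement-policy-page",
    owner="fares-policy-lead",
    authority_rank=1,
    freshness_sla=timedelta(days=30),
    supersession_rule="policy revision published by the owner",
    retirement_workflow="owner removes the page from the retrieval index",
    last_verified=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
now = datetime(2024, 3, 1, tzinfo=timezone.utc)
level = allowed_autonomy(Autonomy.ACT_ALONE, [policy_page], now)
```

In this example the page was last verified sixty days before `now` against a thirty-day SLA, so the agent is held to `SUGGEST_ONLY` until someone re-verifies the source.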

The register has a companion in the executable brief, one numbered requirement that makes sufficiency a built thing rather than an assumption: the agent’s job requires sight of X, Y, and Z; it has X and Y; the Z gap is either closed before launch or accepted in writing, with a signature, by someone who owns the consequence. That is how Channel 2 gets built throughout this book, by a requirement with a number that someone has to close, and the sufficiency statement is that discipline applied to the agent’s eyes.
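The sufficiency statement is simple enough to sketch as a set difference. The class and field names below are hypothetical, as are the source names and the signer; the only idea taken from the text is that every required source is either wired, or its absence is accepted in writing by a named person, and anything else blocks launch.

```python
from dataclasses import dataclass, field


@dataclass
class SufficiencyStatement:
    required: set                  # sources the agent's job requires sight of
    wired: set                     # sources the agent can actually read
    accepted_gaps: dict = field(default_factory=dict)  # gap -> who signed

    def open_gaps(self):
        """Required sources that are neither wired nor accepted in writing."""
        return sorted(self.required - self.wired - set(self.accepted_gaps))

    def accept(self, gap, signer):
        """Record a named person's written acceptance of a gap."""
        if gap not in self.required or gap in self.wired:
            raise ValueError(f"{gap!r} is not an open gap")
        self.accepted_gaps[gap] = signer

    def ready_for_launch(self):
        return not self.open_gaps()


# Illustrative names only.
stmt = SufficiencyStatement(
    required={"bookings", "fares", "customer_calendar"},
    wired={"bookings", "fares"},
)
gaps_before = stmt.open_gaps()          # the calendar is the unclosed gap
ready_before = stmt.ready_for_launch()
stmt.accept("customer_calendar", signer="head-of-travel-ops")
ready_after = stmt.ready_for_launch()
```

The design choice worth noticing is that acceptance requires a signer: the gap can remain, but it cannot remain anonymous.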

None of this is exotic, and the fields that govern high-stakes autonomy elsewhere are the proof. Aviation treats the currency of its navigation data not as a virtue but as an airworthiness condition, with an aircraft barred from dispatch on an expired database, which is the art fair written as a rule; banking, after a crisis traced partly to risk numbers aggregated from data nobody owned, was made to keep its risk data accurate and current with named ownership and lineage; medicine convenes a standing committee to decide which sources of clinical truth are sanctioned inside the institution, which is the authority rank as a recurring meeting. The register is those disciplines written for an agent, and the point of naming them is not to import their machinery but to say that every field on the register is a question some serious, high-consequence field already decided it could not leave to chance. The agentic team is late to a problem others have been governing for decades.

The owners

The work divides the way every supervisory artifact in this book divides, across hands that each see one piece.

The context owner holds the seat. They are the register’s custodian and the sufficiency statement’s enforcer, the person who can answer the question the travel agent’s team could not: what can this agent not see, and who accepted that gap. At a small team it is a hat, split between the domain expert who knows what is true and the architect who knows what is reachable. At scale it is the role the data-platform world is already drifting toward under names like data-product owner and knowledge engineer, without yet seeing that what it is becoming is the supervisor of an agent’s picture of the world. The domain expert owns authority and supersession, because only they know that the guideline was revised and which of two sources a careful practitioner would actually trust; the register has a column only they can fill. The architect owns the plumbing’s honesty, the interfaces and, crucially, the freshness telemetry without which no SLA is watchable, because a freshness rule you cannot measure is a freshness wish. The eval owner owns proving it, with seeded stale-source and missing-context cases in the golden set, the art fair planted in the suite so the team finds the staleness in a test rather than in a tribunal. And the supervisor owns the production half, reading the freshness instruments and demoting the agent’s autonomy when a source breaches its SLA, which is the airworthiness rule from aviation applied to data and the same demotion the autonomy ladder calls for when trust has not been earned.

The binding discipline is one sentence, and it is the chapter’s: every source the agent reads has a name and a date attached, and a source with neither is not context, it is rumor with an API.
