Chapter 9 · Post-launch, drift and retirement

Silent Degradation: What Your Agent Is Doing When You Stop Watching

Stage: post-launch, drift and retirement

Last week I drove forty minutes to an art fair and found a construction site. The venue had closed six months earlier. Google Maps had taken me there, confidently. I entered the new address by hand and drove another thirty. It was not the first time Google had confidently given me a dead address.

Notice what did not happen. The app did not warn me. It did not surface a confidence interval on the address. It did not tell me when the venue record was last verified. It handed me a fluent, plausible answer, and the answer was wrong, and the only reason I found out is that I arrived at an empty lot.

This is the consumer version of something happening in production systems across every industry right now. A deployed agent gives a fluent, plausible answer. The answer is wrong. Nobody on the team knows, because nobody built the system that would know.

This chapter is about that failure mode. It has no widely used name in product management. Most enterprise agentic deployments will meet it within a year. Most will not notice it when it arrives.

Chapter 7 addressed the production dashboards and incident response surfaces a team needs on day one. Those instruments answer one question: is the agent working now? The question this chapter addresses is different and slower. Is the agent still the agent you launched, and are the instruments you built to watch it still measuring what you thought they measured? The answer drifts silently, on a timescale too slow for a real-time dashboard to surface.

The Pattern With No Name

The vocabulary for this failure mode is scattered across disciplines, and none of the fragments capture the whole.

Human factors calls part of it automation complacency. Sociology calls part of it normalization of deviance, the term Diane Vaughan coined after the Challenger investigation. Psychology calls part of it habituation. Machine learning calls part of it drift, which is itself three different phenomena labeled with one word.

Every one of those names a cognitive failure. None of them names the structural conditions underneath. The user has no mental model for how a deployed model can silently lose accuracy over time. The vendor does not publish the question the buyer would need to ask. You cannot check a box that does not exist on the form.

Call the composite condition silent degradation. The name matters because the existing vocabularies each describe one slice, and designing against one slice does not cover the others. If you solve automation complacency, you have not solved corpus drift. If you solve corpus drift, you have not solved shadow workflows. Silent degradation is the condition in which all of them compose at once, and the composition is what makes it invisible for a very long time. The condition throws no errors and produces no visible failure. The trend is downward, though the team may experience stability because their task-level metrics stay green. The mechanism is not a bug. It is the default behavior of every deployed agentic system in which observation was treated as an afterthought.

The Six Vectors

Silent degradation is the product of at least six vectors closing on each other simultaneously, and that simultaneity is what makes it hard. The first four are documented in the human-factors and ML-drift literature. The fifth (security posture decay) is specific to agentic systems and the post-launch threat environment. The sixth (scope drift) is the one a PM is most likely to discover by accident.

Concept

The Six Vectors

Attentional: vigilance falls as trust rises.
Substrate: the model, the tokenizer, the context, the corpus, the tools, the prompt, the guardrails, and the cutoff all drift independently.
Observational: the team rotates and the dashboard outlives its purpose.
Compensatory: users build shadow workflows around known failures, so task metrics stay green while job performance decays.
Security posture: the threat environment evolves as attackers learn the agent’s shape.
Scope: the agent’s mandate quietly expands through small accommodations until it is doing work nobody authorized.

Any one is manageable. The composition of all six is what produces month eighteen.

Human trust rises while vigilance falls. Three distinct mechanisms fire at once, each with its own literature: warning fatigue at the interface layer, automation complacency at the workflow layer, and automation bias at the decision layer. Users click through agent warnings at very high rates because agent outputs, warnings included, do not interrupt; they sit inline. The click-through on a non-interrupting warning is higher than most teams expect, not lower.

The substrate drifts across many axes at once. The model vendor silently updates the model. The training cutoff moves. The context window behavior changes. The tokenizer changes. The retrieval corpus shifts. The upstream tool API evolves. The system prompt accretes small improvements, each tested alone and never tested together. The guardrail policy tightens and a workflow that used to complete now pauses. Vertical software has one substrate that can drift and thirty years of regression tooling to catch it when it does. Agentic systems have eight, mostly unobserved by default, and their drift rates multiply: the composition amplifies change rather than dampening it.
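One way to make the eight axes observable is to record them at launch and diff against today. The sketch below is a minimal illustration, not a description of any platform's feature. The class name, field names, and version strings are all hypothetical, and it assumes the team can obtain an identifier or hash for each axis, which some vendors may not expose.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SubstrateSnapshot:
    """One field per drift axis named in the text.
    All values are hypothetical identifiers a team records for itself."""
    model_version: str
    training_cutoff: str
    context_window: str
    tokenizer: str
    corpus_hash: str
    tool_api_versions: str
    system_prompt_hash: str
    guardrail_policy: str


def drifted_axes(launch: SubstrateSnapshot, today: SubstrateSnapshot) -> list[str]:
    """Return the names of the axes whose recorded value changed since launch."""
    return [
        f.name
        for f in fields(SubstrateSnapshot)
        if getattr(launch, f.name) != getattr(today, f.name)
    ]


launch = SubstrateSnapshot(
    model_version="vendor-model-2026-01",
    training_cutoff="2025-06",
    context_window="200k",
    tokenizer="tok-v3",
    corpus_hash="a1f3",
    tool_api_versions="erp=4.2;search=1.9",
    system_prompt_hash="9c0e",
    guardrail_policy="policy-v7",
)

# Later: three axes moved, each through a separate small change.
today = SubstrateSnapshot(
    model_version="vendor-model-2027-03",
    training_cutoff="2025-06",
    context_window="200k",
    tokenizer="tok-v3",
    corpus_hash="77b2",
    tool_api_versions="erp=4.2;search=1.9",
    system_prompt_hash="e41d",
    guardrail_policy="policy-v7",
)

print(drifted_axes(launch, today))
# ['model_version', 'corpus_hash', 'system_prompt_hash']
```

The diff does not say whether behavior changed. It says which axes moved while nobody re-ran the tests that assumed they were fixed.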

The detection system itself degrades. The team that built the monitoring rotates. The PM who knew which metrics mattered leaves. The dashboard someone built in month two is still running in month eighteen, but nobody remembers why that particular metric was chosen. A monitoring layer without institutional memory becomes decoration.

Shadow workflows silently compensate. Users learn the agent’s quirks. They prompt around failures, ignore low-confidence outputs, double-check tricky task types in a second tool. Their metric, did I finish the task, stays green. The system’s metric, did the agent do its job, was often never being measured. From the outside, the product looks successful. From the inside, a parallel workflow is carrying the load. Shadow workflows are not only a change-management pathology. They are one of the mechanisms by which a degrading agent looks fine for eighteen months.

Security posture decays. The fifth vector belongs in any honest catalog and is the one most conspicuously missing from the standard drift literature, because it is specific to agentic systems and to the post-launch threat environment. The agent that was secure at launch faces attackers who learn its tool boundary, find prompt-injection patterns that work, and share them. The widely reported agent breaches of recent years, where a single exploit compromised hundreds of thousands of live agents at once, were not the result of agents getting weaker. They were the result of attackers getting better at exploiting them. The agent did not change. The threat environment around it did. Security posture has a half-life, and the half-life is measured in months, not years. A red-team report from launch is a snapshot of a moving target. The same prompt-injection patterns that did not work against your agent in March may work in November because attackers have collected enough patterns from the wild to find the gaps. Treat security posture as a degradation vector, and give it the same operational discipline as the first four vectors above. Re-test on a cadence. Refresh the threat model with each frontier model generation. Assume the gap between defense and attack will widen if the cadence slips.

Scope drifts. The sixth vector is the one a PM running an agentic deployment is most likely to discover by accident. The agent ships with a defined mandate. Over time, users find new uses, the team adjusts prompts to handle adjacent cases, integrations expand the tool boundary by half a step, and the agent ends up doing work nobody specifically authorized. This is not malicious and rarely deliberate. It is the cumulative effect of small accommodations, each rational in the moment. The procurement agent flags anomalies in financial postings; six months later it is also adjusting payment terms because somebody added a tool integration to handle a one-off case and the use stuck. The agent is not broken. It is simply not the agent that was launched, governed, and reviewed. Scope drift is the second-most-common cause of post-incident reviews where the post-mortem question is “why was the agent doing that” and the honest answer is “nobody decided that, exactly, but here is the chain of small decisions that put it there.” Detection requires a launch-time scope statement, a recurring review against it, and a revocation mechanism for tools that were added after launch. None of those are commodity platform features. All three are PM artifacts.
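The recurring review can be very small. The sketch below assumes the launch-time scope statement is stored as a set of authorized tool names and that the team can list the tools currently bound to the agent. Both assumptions, and every name in the example, are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ScopeStatement:
    """Launch-time record of what the agent is authorized to do (hypothetical shape)."""
    agent: str
    approved_on: date
    authorized_tools: frozenset[str]


def review_scope(scope: ScopeStatement, bound_tools: set[str]) -> dict[str, list[str]]:
    """Compare the tools bound today against the launch-time scope statement."""
    return {
        # Tools added after launch that nobody authorized: revocation candidates.
        "unauthorized": sorted(bound_tools - scope.authorized_tools),
        # Authorized tools no longer bound: the statement itself may be stale.
        "unused": sorted(scope.authorized_tools - bound_tools),
    }


scope = ScopeStatement(
    agent="procurement-agent",
    approved_on=date(2026, 1, 15),
    authorized_tools=frozenset({"read_postings", "flag_anomaly"}),
)

# Six months on, an integration added for a one-off case is still attached.
bound_today = {"read_postings", "flag_anomaly", "adjust_payment_terms"}

report = review_scope(scope, bound_today)
print(report)
# {'unauthorized': ['adjust_payment_terms'], 'unused': []}
```

A non-empty "unauthorized" list is not an incident by itself. It is the prompt for the decision nobody made: authorize the tool and update the statement, or revoke it.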

Years, Not Weeks

If you want to see how long this can run before anyone notices, the sharpest example comes from a setting where the stakes are high and someone finally audited the thing. In many hospitals there is a quiet model running in the background that scores every patient for the risk of deteriorating, a kind of early-warning system wired straight into the alerts the staff see. One of the most widely deployed of these shipped across hundreds of hospitals, integrated into the safety workflow, marketed with strong accuracy numbers. It ran for years. Then an independent team, with no relationship to the vendor, finally measured it against real outcomes and found it was performing at roughly half of what had been advertised, missing far more of the cases it was supposed to catch than anyone believed.

The number that matters in that story is not the accuracy figure. It is the duration. A safety-critical model ran for years, in a regulated industry, at half its advertised performance, before a single outside party ran the check that revealed it. No regulator caught it. No vendor disclosed it. An outside team did, out of curiosity, not because any alert or contract required it.

Now translate the pattern, because nothing about it is medical. A procurement agent whose cost-floor recommendations drift twenty percent from their trained baseline over six quarters. A support agent whose deflection rate looks unchanged while the lifetime value of a customer cohort nobody is segmenting on quietly erodes. A compliance agent whose exception-flagging rate has shifted while team turnover has erased anyone who remembers what the original rate meant. The setting changes. The mechanism does not. And almost none of these systems get the outside team with funding and curiosity. They get whatever you built to watch them.
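The procurement example shows why a check against last quarter never fires. The sketch below uses illustrative numbers only: a metric that creeps three percent per quarter, which compounds to roughly nineteen percent over six quarters. The function names and the five percent threshold are assumptions for the example.

```python
def period_over_period_alerts(values: list[float], threshold: float) -> list[int]:
    """Indices where the value moved more than `threshold` versus the previous period."""
    return [
        i for i in range(1, len(values))
        if abs(values[i] - values[i - 1]) / values[i - 1] > threshold
    ]


def baseline_alerts(values: list[float], baseline: float, threshold: float) -> list[int]:
    """Indices where the value moved more than `threshold` versus the frozen launch baseline."""
    return [
        i for i, v in enumerate(values)
        if abs(v - baseline) / baseline > threshold
    ]


# Index 0 is launch; indices 1..6 are the six following quarters.
baseline = 100.0
series = [baseline * 1.03 ** q for q in range(0, 7)]

# Each step is only 3%, so a 5% quarter-over-quarter check stays silent.
print(period_over_period_alerts(series, threshold=0.05))
# []

# Against the launch baseline, the same threshold fires from quarter 2 onward.
print(baseline_alerts(series, baseline, threshold=0.05))
# [2, 3, 4, 5, 6]
```

The design choice is the frozen baseline. A comparison that resets every period normalizes the drift it was built to catch.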

The Profession We Trust Less

Every regulated profession has a mechanism for demonstrating that its practitioners are still current. A physician must prove continuing medical education. An airline pilot takes a proficiency check every six months and a line check every year. The currency is imperfect in all of these. It is auditable in all of them.

Ask the same of a deployed AI. When was the training data last refreshed? On what corpus? Curated by whom? Which retractions were pulled? Were behavioral change notes published at the last vendor release? In most enterprise agent deployments, the vendor knows, maybe. The buyer, never.

We govern the professions we trust less more rigorously than the AI we trust more.

This is what I call the currency question: when was the training data last refreshed, on what corpus, curated by whom, with which retractions pulled, and are behavioral change notes published at each vendor release?

The silence is not neutral. It is a risk the buyer has accepted without naming. Put the answer, or the refusal to answer, in the contract. Put the re-audit on the renewal calendar. Absence of an answer is an answer.
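Recording the refusal is easier when the question has a fixed shape. The sketch below is one possible record, with hypothetical field names chosen to mirror the wording of the currency question. It treats a missing answer as data rather than as an empty cell.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CurrencyAnswer:
    """A vendor's answers to the currency question; None means no answer was given.
    Field names are hypothetical and mirror the wording in the text."""
    training_data_refreshed_on: Optional[str] = None
    corpus_description: Optional[str] = None
    curated_by: Optional[str] = None
    retractions_pulled: Optional[str] = None
    change_notes_published: Optional[bool] = None


def unanswered(answer: CurrencyAnswer) -> list[str]:
    """Return the parts of the currency question the vendor left blank."""
    return [f.name for f in fields(answer) if getattr(answer, f.name) is None]


# The vendor answered two parts. "No change notes" is an answer, not a blank.
vendor_reply = CurrencyAnswer(
    training_data_refreshed_on="2025-06",
    change_notes_published=False,
)

print(unanswered(vendor_reply))
# ['corpus_description', 'curated_by', 'retractions_pulled']
```

The list of blanks is what goes into the contract file and onto the renewal calendar.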

The regulated professions developed continuing-proficiency mechanisms because they understood that automation erodes the supervisor. Aviation built mandatory recurrent manual proficiency checks because aircraft fell out of the sky when automation failed and the pilots, who believed they could take over, discovered the assumption had not been tested in a long time. The mechanism is a structural response to a known failure mode of skilled judgment under sustained automation. AI deployment is being shipped without an equivalent. The vendors are not asked the currency question. The buyers do not know to ask it.

The Instruments Have a Shelf Life

The counter-argument to everything above is that the industry is maturing. Vendors are building guardrails. The field will figure this out.

Part of that is right. Input-output moderation is maturing. Content filters, refusal training, jailbreak defense are all visibly better than two years ago. Runtime observation of deployed agentic systems is not. Pre-deployment evaluation harnesses are not production monitors. Trace-inspection tools are not drift detectors. The observation layer that would catch slow drift does not exist for most deployed agents in 2026, and it is not on the near-term roadmap.

The deeper problem is that the field is not maturing in the direction the maturity argument assumes. It is replacing itself. Three generations of frontier models have shipped in the last eighteen months. Each generation resets the calibration of every instrument you built, because capability shifts, refusal boundaries move, and context behavior changes.

Every observation instrument built for an agentic system has a useful life of roughly eighteen months, pegged to frontier model release cadence. That is the cost nobody is budgeting for. A 2026 observation layer does not work in 2028 unless someone scheduled the re-calibration and owned it. Treat the instruments as first-class versioned product artifacts, not dashboards someone maintains on the side. If an instrument has not been re-calibrated since the last frontier model release, it is measuring the agent you used to have.
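The staleness rule in the last sentence can be checked mechanically. The sketch below assumes a team keeps a small registry of its instruments and records the date of the most recent frontier model release that affects its agent. The registry shape, instrument names, owners, and dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Instrument:
    """A monitoring instrument treated as a versioned artifact (hypothetical shape)."""
    name: str
    version: str
    owner: Optional[str]
    last_calibrated: date


def needs_attention(
    instruments: list[Instrument], last_frontier_release: date
) -> dict[str, list[str]]:
    """Flag instruments that are unowned or not re-calibrated since the last release."""
    return {
        "stale": [i.name for i in instruments if i.last_calibrated < last_frontier_release],
        "unowned": [i.name for i in instruments if i.owner is None],
    }


registry = [
    Instrument("deflection_rate", "1.2.0", "support-pm", date(2026, 2, 1)),
    Instrument("exception_flag_rate", "1.0.0", None, date(2025, 3, 10)),
    Instrument("tool_error_rate", "2.1.0", "platform-team", date(2025, 6, 20)),
]

# Assumed date of the most recent frontier release affecting this agent.
report = needs_attention(registry, last_frontier_release=date(2025, 11, 1))
print(report)
# {'stale': ['exception_flag_rate', 'tool_error_rate'], 'unowned': ['exception_flag_rate']}
```

An instrument on the stale list may still be accurate. The point is that nobody has checked since the agent underneath it changed.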

And the change is not monotonically bad. Some generation shifts fix errors the previous generation made. The buyer gets silent improvements alongside silent regressions. Uncertainty in both directions is the condition under which incident recovery time goes long, because the team cannot tell whether an anomaly is a regression or a new capability they did not plan for. Rollback is easy when things got worse. Rollback is paralysis when they might have gotten better in three places and worse in one.

Three Disciplines

The four runtime artifacts from Chapter 5 and the six observation instruments from Chapter 7 are necessary. Silent degradation adds three operating disciplines on top, and they do not live inside the product. They live in the product organization.

The first is to treat the observation instruments as versioned product artifacts, not dashboards. That means owners, version numbers, release notes, and a scheduled re-calibration cadence tied to frontier model generations. Without a version number, the instrument has no maintenance plan. Without a maintenance plan, it will drift alongside the agent it was built to observe.

The second is to ask the currency question at contract and on a cadence. The buyer side is not asking it yet. The buyer who asks first is the buyer whose agent is still doing its job at month eighteen.

The third is to commission an independent external audit. Not a vendor audit, which tests the vendor’s self-understanding. Not a compliance audit, which tests conformance to a written standard. An independent external audit, performed by a party with no commercial relationship to the vendor, that measures actual agent performance on a held-out sample the vendor did not help design. The audit tests the monitoring, not the agent. If the monitoring is healthy, the audit confirms it cheaply. If it is not, the audit is how you find out before the story becomes a headline.
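One way an auditor might compare the held-out result with what the dashboard reports is a confidence interval on the audited success rate. The sketch below uses the Wilson score interval, a standard choice for binomial proportions. The figures are illustrative only, and the function names are hypothetical. It assumes each audited case is graded simply as correct or incorrect.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        raise ValueError("audit sample is empty")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def monitoring_is_credible(reported_rate: float, correct: int, sampled: int) -> bool:
    """True if the dashboard's reported rate falls inside the audited interval."""
    low, high = wilson_interval(correct, sampled)
    return low <= reported_rate <= high


# Illustrative numbers only: the dashboard reports 92% task success;
# the auditor grades 400 held-out cases and finds 304 correct (76%).
print(monitoring_is_credible(0.92, correct=304, sampled=400))
# False

# A dashboard reporting 77% would be consistent with the same audit.
print(monitoring_is_credible(0.77, correct=304, sampled=400))
# True
```

A False here says nothing about why the numbers diverge. It says the monitoring and the held-out sample disagree by more than sampling noise would explain, which is the finding the audit exists to produce.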

Questions to bring to your team:

Which of our observation instruments have an owner today? Which will still have that owner in eighteen months?
Have we asked the currency question of our current vendors? Is the answer in the contract?
When was our last external audit, and who ran it?
Do we have a launch-time scope statement for each deployed agent? When was it last reviewed?
Which shadow workflows are our users running right now that we do not know about?

The List Nobody Wants To Be On

In 2028, there will be a list of companies whose agent did something in production that no one on the team knew it had been doing for months. Some names are already on a version of that list. The pattern played out in regulated industries over years before anyone outside the vendor ran the numbers. Your company does not get an academic team with research funding and curiosity. Your company gets whatever observation layer it built, whatever currency question it asked, and whatever external audit it commissioned.

Silent degradation is not a clinical problem. It is not a regulated-industry problem. It is what happens by default to every deployed agentic AI, in every domain, because the mechanism is attention, drift, substrate replacement, organizational memory, and a security posture exposed to attackers who learn over time. Every one of those is a human or environmental constant, not a technology artifact.

The chapter before this one closed with the instruction to design the person who is supposed to be watching the agent. The instruction this chapter adds: design the system that watches the person who is watching the agent, version it, maintain it, and accept that it expires on the clock of the next frontier model. Maturity requires stability. The field has none. Plan accordingly.

Key takeaways:

Silent degradation is six vectors closing at once: attentional, substrate, observational, compensatory, security posture, and scope. Each is manageable alone; the composition is what produces month eighteen.
Ask the currency question at contract and on a cadence: when was training data last refreshed, on what corpus, curated by whom, with which retractions pulled.
Every observation instrument has a useful life of roughly eighteen months, pegged to frontier model cadence. Version them, assign owners, and schedule re-calibration or they measure the agent you used to have.
Commission an independent external audit with no commercial relationship to the vendor. The audit tests the monitoring, not the agent; a healthy monitoring layer makes it cheap.
Shadow workflows are a degradation signal, not a change-management footnote. If users are routing around the agent, task metrics stay green while job performance quietly decays.