What You Need to Know About AI Before You Design Anything
Before I became a product manager, I worked as a physician. I trained in anesthesiology and radiology, and one of the things that stayed with me long after I left clinical practice was a heart transplant I helped manage as an anesthesiology resident.
The operation itself was complex, a team of specialists working against the clock to get a donor heart into a critically ill patient. I was at the head of the table, managing the patient’s hemodynamics alongside my attending, titrating drips, watching for instability, keeping the patient alive while the surgeons worked. But what struck me, even in the middle of it, was the information architecture of the room.
Large monitors displayed real-time telemetry: heart rate, blood pressure trends, oxygen saturation, ventilation curves, infusion rates for every drip. I watched hemodynamic trends the way an SRE watches a dashboard during a deploy. The perfusionist tracked bypass flows and core temperature on a separate console. In the corner, a nurse counted used pads and towels, writing running totals on a whiteboard, a manual but critical data point for estimating blood loss. Every surface in that room was a display, and every display served a different consumer of the same underlying patient state.
Every person was working on a different slice of the same problem. The surgeon focused on anatomy. I managed hemodynamic stability. The perfusionist managed circulation on bypass. The scrub nurse tracked instruments and sequence. No single person held the complete picture. The system worked because each role had clear authority over its domain, clear escalation paths when domains intersected, and a shared understanding of when to act independently and when to call for help. The patient survived not because any one person was brilliant, but because the supervisory architecture, the telemetry, the observability, and the real-time insight-to-action loops were deliberately designed.
That room is the closest analogy I know to what you are about to build.
An agentic AI system is not a single intelligent actor. It is a collection of capabilities, each operating on a slice of the problem, coordinated by rules you define, supervised by humans whose attention is a limited resource, and consequential in ways that compound over time. If you design only the capabilities and not the supervisory architecture, you have built an operating room where everyone can see their own monitor but nobody has agreed on who calls the code.
This chapter gives you the technical vocabulary you need to design that architecture. Not to become an AI engineer. To become the PM who can specify what the system should do, ask the right questions about how it works, and catch the failure modes before they reach production.
Five ideas will recur in every chapter that follows. Non-determinism is a permanent property, not a flaw to be fixed. Confident wrong answers happen at a measurable base rate. The tool boundary is the most concrete expression of the agent’s authority. The design challenge is not the agent’s intelligence; it is the supervisory system around it. And the supervisor’s competence is not a fixed input. It is being eroded, in measurable ways, by the same systems the supervisor is supposed to oversee. If you catch yourself in a design conversation and something feels off, it is usually one of these you have stopped accounting for.
From Rules to Reasoning
For twenty years, the software you shipped followed rules. If the order exceeds ten thousand dollars, route it to a manager. If the patient’s heart rate drops below fifty, trigger an alert. If inventory falls below the reorder point, generate a purchase order. You defined the rules. The software executed them. Deterministic, traceable, predictable.
That model works beautifully when you can enumerate the rules in advance. It breaks when the problem space is too large, too variable, or too ambiguous for anyone to write all the rules. A radiologist reading a chest X-ray is not following a decision tree. She is recognizing patterns across thousands of prior images, weighing probabilities, integrating clinical context, and arriving at a judgment she could explain but could not fully codify. That kind of reasoning, where the rules are learned from data rather than written by a person, is what machine learning makes possible.
One clarification before we continue. The phrase “AI” does not name a single technology. Healthcare alone has been running at least seven generations of it simultaneously since the 1970s: rule-based expert systems, statistical risk models, deep learning for imaging, gradient boosting on tabular data, generative language models, agentic systems, and early neurosymbolic architectures. Every one of them is still in production somewhere. This book focuses on LLM-driven agentic systems, because that is where most of the 2026 product conversations are happening. When governance, evaluation, and risk decisions land on your desk, knowing which generation of AI you are deploying is the prerequisite for not making a category error.1
Deterministic software follows explicit rules and produces the same output for the same input every time. A billing system that calculates totals, a workflow engine that routes approvals, an ERP that generates purchase orders. Probabilistic systems, including most modern machine-learning and generative AI systems, produce outputs based on learned patterns and statistical inference. Many enterprise deployments are hybrids, combining deterministic logic with probabilistic components.
The same input can produce different outputs on different runs, even with randomness minimized, because retrieval context, batching, and infrastructure routing introduce variability. This is not a bug. It is a fundamental property that changes how you test, monitor, and trust the system.
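To make the testing implication concrete, here is a minimal sketch in Python of how the test strategy shifts. The function names and the sampled-property check are illustrative assumptions, not a prescribed QA method; Chapter 5 treats evaluation properly.

```python
# Deterministic component: one passing exact-match assertion is meaningful,
# because the same input always yields the same output.
def route_order(total):
    return "manager_approval" if total > 10_000 else "auto_approve"

assert route_order(12_000) == "manager_approval"

# Probabilistic component: the same prompt can yield different outputs across runs,
# so a single exact-match test proves very little. One alternative (a sketch):
# sample repeatedly and check a property of the outputs rather than their exact text.
def agent_routing_stays_in_bounds(call_agent, prompt, runs=20):
    outputs = [call_agent(prompt) for _ in range(runs)]
    allowed = {"manager_approval", "auto_approve", "escalate_to_human"}
    return all(o in allowed for o in outputs)
```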
Machine learning is the broad category: systems that learn patterns from data rather than following hand-coded rules. Within that category, the technology driving the current wave of agentic AI is a specific architecture called a large language model.
What a Large Language Model Actually Does
A large language model is a neural network trained on enormous quantities of text. During training, the model learns statistical relationships between words, concepts, and patterns of reasoning. When you give it a prompt, it generates a response by predicting, token by token, what comes next based on what it has learned.
That description sounds simple. The implications are not.
Because the training data includes medical literature, legal documents, software code, business analysis, scientific papers, and virtually every other form of written human knowledge, the model develops capabilities that arise from training scale rather than being explicitly programmed. It can summarize a contract, write code, analyze a dataset, draft a clinical note, reason about a supply chain problem, and carry on a conversation that feels human, all from the same underlying architecture.
An LLM is a neural network trained to predict the next word, technically the next token, in a sequence based on patterns learned from vast amounts of text. Models like GPT-5, Claude, and Gemini are LLMs. They do not understand text the way a human does. They are trained on large text corpora and optimization signals, but at inference time they do not have direct access to ground truth. They are extraordinarily good at pattern matching across contexts, and without explicit grounding they can produce correct and incorrect answers with equal apparent confidence.
The practical consequence for PMs: LLMs are capable of impressive reasoning and of confident, plausible-sounding errors. Designing around this distinction is your job.
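A toy sketch of the token-by-token prediction described above may help make the point concrete. The candidate tokens and scores below are invented for illustration; real models score a vocabulary of tens of thousands of tokens at every step.

```python
# Every output token is a draw from a probability distribution over candidates,
# not a lookup against ground truth. Toy numbers, for illustration only.
import math, random

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token candidates after the prompt "The patient's potassium level is".
candidates = ["normal", "elevated", "4.1", "unknown"]
logits = [2.1, 1.9, 1.7, 0.3]

probabilities = softmax(logits)
next_token = random.choices(candidates, weights=probabilities, k=1)[0]

# The plausible-but-wrong answer and the correct one sit close together in probability.
# Nothing in this step distinguishes them; that is where the confident error comes from.
print(list(zip(candidates, [round(p, 2) for p in probabilities])), "->", next_token)
```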
Two properties of LLMs matter enormously for product design, and most PMs underestimate both.
First, LLMs are non-deterministic at the system level. Even when configured for low randomness, the same prompt can produce different outputs across sessions due to retrieval variability, context differences, and infrastructure factors. You cannot test an LLM-based system the way you test deterministic software. One passing test does not prove the behavior is fixed. This has deep implications for quality assurance, which Chapter 5 covers in detail.
Second, LLMs are confidently wrong at a base rate that varies by domain and task. They do not signal uncertainty the way a well-calibrated system should. A model that is ninety-five percent accurate on routine cases and sixty percent accurate on edge cases will present both outputs with the same apparent confidence. The user has no way to tell the difference without a designed confidence surface. Building that surface is a PM responsibility, not a model property.
One more property is worth naming now, because it will come up in Chapter 4. When an LLM is asked to show its reasoning, what it writes out is not a reliable explanation of how it arrived at the answer. Research on model explanations has shown that the written reasoning and the actual computational path can diverge, sometimes significantly. The phrase for this is chain-of-thought, and the important thing for a PM to know is that chain-of-thought is generated text, not an audit log. Treat it accordingly.2
Classifier Metrics as Product Decisions
One last vocabulary item belongs here, because the metrics most often cited around AI come from classifier evaluation and most of them collapse on agentic systems if you do not know what they measure. You do not need to be a statistician. You need to know what each one translates to as a product decision.
Sensitivity, also called recall, is the proportion of real positive cases the model correctly identifies. High sensitivity catches most real cases and produces more false alarms. Specificity is the proportion of real negative cases the model correctly dismisses. High specificity produces fewer false alarms and misses more real cases. AUC, or area under the curve, is a single number summarizing how well the model separates real positives from real negatives across all possible thresholds. A perfect model has an AUC of 1.0. Random guessing gives 0.5.
The threshold is the point at which the model decides to act, and it is the part most often misattributed to the data science team. The threshold is a product decision. Only you know what a false positive costs your users versus what a missed case costs them. A fraud model set too aggressively blocks legitimate transactions. Set too loosely, it lets fraud through. The right threshold reflects your actual cost structure, not the threshold that maximizes a statistical metric. When your team brings you a confusion matrix, the question to bring back is: what does the false positive cost the user, and what does the missed case cost the affected person, and where on that curve are we landing? Chapter 5 covers where these single-step classifier metrics break down for multi-step agent trajectories. The Chapter 1 point is only that they are product decisions before they are statistics.
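A minimal sketch of that question in code, with invented numbers. The scores, labels, and costs below are illustrative assumptions; the point is that the threshold minimizing expected cost depends entirely on the two costs you supply, which no statistical metric contains.

```python
# Threshold as a product decision: sweep thresholds and compare expected cost,
# not just sensitivity and specificity. All numbers are made up for illustration.

def confusion_at_threshold(scores, labels, threshold):
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    return tp, fp, tn, fn

# Fraud-style example: a false alarm blocks a legitimate customer (cheap, frequent);
# a missed case is fraud that goes through (expensive, rare).
scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    0,    0]

for threshold in (0.25, 0.50, 0.75):
    tp, fp, tn, fn = confusion_at_threshold(scores, labels, threshold)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    expected_cost = fp * 15 + fn * 400   # cost of a false alarm vs cost of a missed case
    print(f"threshold={threshold:.2f}  sensitivity={sensitivity:.2f}  "
          f"specificity={specificity:.2f}  expected_cost={expected_cost}")
```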
From Language Model to Agent
A language model, by itself, is a conversational tool. You ask it a question. It answers. You ask a follow-up. It responds. The interaction is bounded by the conversation. The model does not take actions in the world. It does not change state in any system. It generates text.
An agent is what happens when you give a language model the ability to act.
In technical terms, many agentic systems follow a common pattern: take a goal, break it into steps, select and use tools (APIs, databases, applications) to execute those steps, observe the results, and adjust the plan based on what is found. They operate in a loop: reason, act, observe, reason again. In production, some agents are more constrained. Workflow-style orchestrations call models at specific points. Event-driven agents are tied to a narrow trigger. For the purposes of this book, any system that can reason and act over multiple steps without a human at each step brings the same design problems.
An AI agent has four properties that distinguish it from a chatbot or a simple automation. It pursues a goal across multiple steps. It selects and uses tools to take actions in external systems. It observes the results of its actions and adjusts its approach. And it can operate with varying degrees of autonomy, from fully supervised to semi-independent. The combination of reasoning, tool use, and autonomy is what creates both the value and the risk.
This is the transition that matters for your work. Traditional automation executes a fixed sequence: if X, then Y, then Z. An agent reasons about what sequence to execute based on the goal and the current state. It can handle situations the designer did not anticipate, because it is reasoning over the problem rather than following a script. That flexibility is the value proposition. It is also the source of every design challenge in this book.
Consider a concrete example. A traditional order-processing automation checks whether the order meets predefined rules, routes it through a fixed approval chain, and updates the system. The rules are yours. The sequence is yours. If something does not fit the rules, it stops and waits for a human.
An agentic order-processing system receives a goal (“process this order”), reasons about the order characteristics, selects the appropriate approval path based on its assessment of risk and policy, executes the necessary API calls, and handles exceptions by reasoning about what to do next. If the shipping address is unusual, it might flag it for review or check it against a fraud database. If a line item is out of stock, it might propose a substitution. Each of these decisions is the agent’s, not a rule you wrote.
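For readers who want to see the shape of the loop, here is a minimal sketch of the reason-act-observe pattern applied to that order. The tool names, the scripted plan standing in for the model call, and the step budget are illustrative assumptions, not any particular framework's API.

```python
# A sketch of the agent loop: reason (propose the next step), act (call a tool),
# observe (feed the result back into the next reasoning pass).

TOOLS = {
    "check_fraud_score": lambda args: {"score": 0.12},
    "check_inventory":   lambda args: {"in_stock": False, "substitute": "SKU-1142"},
}

# Stand-in for the model: a canned plan so the sketch runs end to end.
SCRIPTED_PLAN = iter([
    {"action": "check_fraud_score", "args": {"order_id": "8841"}},
    {"action": "check_inventory", "args": {"order_id": "8841"}},
    {"action": "finish", "summary": "low fraud risk; propose substitute SKU-1142"},
])

def llm(goal, history):
    # A real agent would reason over the goal plus the history of observations here.
    return next(SCRIPTED_PLAN)

def run_agent(goal, max_steps=10):
    history = []
    for _ in range(max_steps):
        step = llm(goal, history)                  # reason
        if step["action"] == "finish":
            return step["summary"]
        observation = TOOLS[step["action"]](step["args"])  # act: the tool list bounds authority
        history.append({"step": step, "observation": observation})  # observe
    return "escalate_to_human"                     # loop budget exhausted: hand off, don't guess

print(run_agent("process order #8841"))
```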
That is powerful. It is also consequential in a way that traditional automation is not, because the agent is making judgment calls you did not specify in advance. When those calls are right, you get efficiency and flexibility. When they are wrong, you get automated confidence applied to a bad decision, at scale, without a human in the loop.
The Stack: Models, Orchestration, Tools, and Context
To have productive conversations with your engineering team, you need a working model of the technical stack. Not to build it. To ask the right questions about where risk lives.
The stack has four layers, not three. The first three are the ones most teams discuss. The fourth is the one most teams forget, and it is the one where most production failures originate.
The four-layer stack
Most teams invest heavily in layers 1 to 3. They design layer 4 by accident.
Layer 1 — Foundation Model
The LLM from a provider. Your team selects it. The provider updates it on their schedule.
Selected, not controlled. Provider updates are deployment events.
Layer 2 — Orchestration
The code your team writes: tool sequencing, retries, error handling, multi-step workflow memory.
Your team controls this layer.
Layer 3 — Tools
Every system the agent can access. Every API it can call. Every action it can take.
Scope carefully. Each tool is a permission.
Layer 4 — Context
What the agent reads before it acts: prompt, retrieved documents, customer record, previous steps.
Intelligence without context is just confidence. Most production failures trace here.
When a team says “we are adding an AI feature,” they usually mean layer 1. The work lives in all four.
The foundation model is the LLM itself: the general-purpose reasoning engine. Your team probably did not build it. They selected it from a provider (OpenAI, Anthropic, Google, Meta, or others) and are using it via API. The model’s capabilities and limitations are set by the provider. Providers update models on their own schedule and expose new versions. Even with versioning, behavioral changes can appear without any deployment on your side, especially when a provider re-points an existing model name to a newer variant. Any model update is a behavioral change in your product that you did not deploy. This matters for testing, monitoring, and governance, all covered in later chapters.
The orchestration layer is the code your team writes to coordinate the agent’s behavior: how it breaks goals into steps, how it decides which tools to use, how it handles errors and retries, how it manages context across a multi-step workflow. Most of the engineering complexity lives here, and most of the design decisions that affect user experience are made here.
The tool layer is the set of external systems the agent can access: APIs, databases, file systems, communication channels. Every tool is a permission. An agent with access to your order management system can read and write orders. An agent with access to your email system can send messages on behalf of users. The tool list is the most concrete expression of the agent’s authority, and bounding it is one of the most important design decisions you will make. Emerging standards like structured tool calling and the Model Context Protocol aim to normalize how models discover and invoke tools across systems. They simplify integration. They do not remove your responsibility to define and log the actual permissions an agent has.
Every system an agent can access is a tool. Every tool is a permission to act. The tool boundary, the explicit list of what the agent can and cannot access, is the most concrete expression of the agent’s authority. A well-defined tool boundary is narrow, enumerated, and logged. An undefined tool boundary is how an agent ends up sending emails nobody authorized or modifying records nobody intended.
Vendors slice and label this layer differently. “Tools” is used narrowly (a single API call) and broadly (any external operation). “Skills” usually means packaged capabilities that bundle a prompt plus one or more tools. “Connectors” usually means pre-built integrations with a specific system (Salesforce, SAP, Slack). The naming varies. The architectural role is the same: each is an external action your agent can take, and every one is a permission. When a vendor says your agent has three hundred skills out of the box, the PM question is which three hundred permissions your organization just inherited, and which of them the agent may combine in ways the vendor did not plan for.
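A minimal sketch of what an enumerated, logged tool boundary can look like in code. The tool names, the approval flag, and the audit-log shape are illustrative assumptions; the point is that every tool is declared, every invocation is checked against the declaration, and every action leaves a record.

```python
# An explicit tool boundary: narrow, enumerated, and logged.
import json, time

ALLOWED_TOOLS = {
    "orders.read":   {"system": "order_management", "write": False},
    "orders.update": {"system": "order_management", "write": True, "requires_approval": True},
    "email.send":    {"system": "email_gateway",    "write": True, "requires_approval": True},
}

def invoke(tool_name, args, execute, approved_by=None):
    spec = ALLOWED_TOOLS.get(tool_name)
    if spec is None:
        raise PermissionError(f"{tool_name} is outside the agent's tool boundary")
    if spec.get("requires_approval") and approved_by is None:
        raise PermissionError(f"{tool_name} requires a named human approver")
    result = execute(args)
    # Every invocation is logged: this record is what the supervisor reconstructs from later.
    print(json.dumps({"ts": time.time(), "tool": tool_name,
                      "args": args, "approved_by": approved_by}))
    return result
```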
The context layer is what the agent reads before it acts. Every agent has one whether the team named it or not: a collection of documents, a vector store, a knowledge graph, a set of retrieval strategies, short-term memory for the current session, and sometimes long-term memory across sessions. Most teams design this by accident. That is the single largest source of production failures in 2026. The next section treats it directly.
What the Agent Reads
The first three layers of the stack tell you what the agent can do. They do not tell you what the agent knows. In enterprise deployments, most agent failures do not come from bad reasoning or bad tool use. They come from bad context: incomplete data, missing semantic relationships, stale embeddings, the wrong knowledge graph, retrieval that surfaces the wrong document, or a data product that was built bottom-up and does not carry the meaning the agent needs.
Most of what this section describes lives under the label Retrieval-Augmented Generation, or RAG: using retrieval over your own data to build the context the model reasons with at inference time. In many enterprise systems, RAG quality, not model capability, is the dominant failure mode.
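A minimal sketch of the RAG pattern, with the embedding, retrieval, and model calls passed in as stand-ins rather than named as any particular vendor's API. Production systems add chunking, reranking, freshness checks, and per-user access control on what may be retrieved.

```python
# Retrieval builds the context the model reasons with at inference time.
def answer_with_retrieval(question, embed_fn, search_fn, llm_fn, top_k=5):
    query_vector = embed_fn(question)
    hits = search_fn(query_vector, top_k)       # retrieval quality, not model quality,
                                                # is the dominant enterprise failure mode
    context = "\n\n".join(doc["text"] for doc in hits)
    prompt = (
        "Answer using only the context below. If the context does not contain "
        "the answer, say that instead of guessing.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    answer = llm_fn(prompt)
    sources = [doc["id"] for doc in hits]       # keep sources so the trace is auditable
    return answer, sources
```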
The failure is so common it has a canonical enterprise case.
Consider a composite drawn from several large-retailer pilots of the last two years: a foundation-model shopping assistant launched on top of catalog data, announced with fanfare, then quietly rolled back months later when the results stopped justifying the cost. The model was fine. The model was great. The context was missing. A shopping recommendation requires more than a product description. It requires inventory state, location-specific pricing, promotion logic, household purchase history, substitution rules, returns policy by category, and a set of business constraints that live in operational systems and are meaningful only in combination. None of that traveled with the product data into the AI-capable platform. The assistant was confidently wrong in ways nobody could have predicted from testing the model in isolation.
The lesson is not that the partnership was a bad idea. The lesson is that context is the product. Intelligence without context is just confidence.
When data moves between systems, the meaning stays behind. Business rules, calculated semantics, security context, organizational conventions, delegated authority, relationship structure. What travels is the numeric or textual surface. What stays behind is the context that made the surface interpretable. Every enterprise data migration in the last twenty years has rediscovered this. Agentic AI makes it expensive in a new way, because an agent acting on decontextualized data produces confident, plausible, wrong output at scale.
The PM’s job is to name what context must travel with the data for this agent, and to reject architectures that assume the agent will figure it out from the data alone.
A useful way to think about this comes from clinical medicine, which has worked on this problem for a century. A physician taking a patient history is not merely being thorough. She is building the product. The complete history, the focused examination, the review of prior records, the collation of the lab values, the nursing home fax, the pharmacy record, the note from the community physician, all of it assembled into a coherent clinical picture, is the reasoning layer the decision runs on. The AI version is the same: what the model predicts is easy. What the model reasons with is where the hard work happens.
Three things are lost whenever enterprise data is moved without deliberate context engineering. The first is relationships: in the source system, a purchase order is connected to a vendor, a contract, a budget line, an approver, a shipment, and sometimes to a prior order that set the pattern. Export the purchase order as a row, and all of those connections are gone. The second is calculated semantics: a field called “revenue” in a dashboard is the output of a calculation, not a stored value. Export the field, and the calculation is gone. The third is security context: in the source system, who can see a record depends on role, region, and sometimes on the relationship between the record and the user’s org unit. Export the record, and the security logic stays behind. An agent reasoning over exported data is an agent reasoning without context. The PM’s job is to make sure the context travels with the data, or to bound the agent to decisions that do not depend on it.
What can you do about it?
There is a principle worth naming. Receiver-first design. Start from what the agent (or the human downstream) needs to answer the decision at hand, and work backward to the data, the retrieval, and the schema. Not the other way around. Most data products fail because they are built bottom-up: the source system has these fields, the integration layer exposes these endpoints, the data product inherits the shape. The receiver’s actual question, the one the agent has to answer, is not in the design. Top-down design produces data products that are useful. Bottom-up design produces data products that are accurate and useless.
There is a second principle. Standard first, extension last. Whatever vocabulary, ontology, or schema is standard in your domain (FHIR in healthcare, core industry schemas in retail, supplier taxonomies in procurement), use it first and extend it only where standards genuinely do not cover your case. The reason is not compliance. It is that agents from other systems, trained on industry standards, cannot parse your proprietary extensions. Every extension is a wall the next agent you integrate with cannot see over. A study in JAMIA found that more than sixty percent of FHIR extensions in large healthcare organizations are proprietary and opaque to any outside reader.3 Enterprise data has the same pattern. The PM who accepts proprietary extensions without naming the cost is making a five-year decision on a one-sprint horizon.
There is a third principle, and it is the one PMs resist the most: design for incompleteness. The healthcare data that real agents receive in production is fragmented, approximate, delayed, and missing the pieces that would have made the decision easy. Patients describe pills as “the size of a dime, sort of oval, with a line across it.” Family members filter the history. The community physician does not have the hospital’s discharge summary. Waiting for clean data before deploying the agent has been the industry’s excuse for twenty years, and it has not produced clean data. The PMs who ship useful agents in this environment design for what arrives: a minimum coherent evidence set per decision, explicit metadata about what is missing so the agent’s reasoning can be calibrated to the gaps, decoupled retrieval and reasoning so a retrieval failure does not cascade into a confident wrong answer, and purpose-scoped data products rather than reusable-for-everything data lakes. If the team’s response to a context gap is “we are waiting for the data warehouse,” the agent is not ready to ship.
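A minimal sketch of that third principle, designing for incompleteness. The decision name, the required fields, and the completeness threshold are illustrative assumptions; what matters is that the gaps are named and attached to the decision rather than silently absorbed.

```python
# A minimum coherent evidence set per decision, with explicit metadata about what is missing.
REQUIRED_EVIDENCE = {
    "approve_substitution": ["order_line", "inventory_state",
                             "substitution_rules", "customer_history"],
}

def build_evidence_set(decision, available):
    required = REQUIRED_EVIDENCE[decision]
    present = {k: available[k] for k in required if k in available}
    missing = [k for k in required if k not in available]
    return {
        "decision": decision,
        "evidence": present,
        "missing": missing,                    # named gaps, not silent absences
        "complete_enough": len(missing) <= 1,  # the threshold is itself a product decision
    }

evidence = build_evidence_set(
    "approve_substitution",
    {"order_line": {"sku": "SKU-2210"}, "inventory_state": {"in_stock": False}},
)
if not evidence["complete_enough"]:
    print("escalate to a human, with the gaps attached:", evidence["missing"])
```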
A fourth principle is more technical, and the PM needs to know it exists even if engineering owns the implementation. The semantic backbone of enterprise data is often some combination of a knowledge graph, relational systems, and vector indexes: explicit, machine-readable representations of the entities in the business (customers, suppliers, products, sites, contracts) and the relationships among them. A well-maintained knowledge graph is the closest software analog to how a physician or a senior operator reasons: connect this finding to that record to that prior event to that known relationship. Many systems achieve similar effects with hybrid architectures that stitch structured joins and retrieval together. When this semantic layer is present and maintained in some form, agents can reason across data rather than shuttle it. When it is absent, agents reason within whatever single slice they happen to retrieve. The PM who asks “what is our semantic layer, in whatever form, and who owns it?” in the first week of an agentic project will have a meaningfully different six-month trajectory than the PM who does not.
One last note on context for this chapter. Agents also carry context across time: short-term memory inside a session, and long-term memory across sessions. What the agent remembers about a user, a project, or a workflow is a designed product surface. It is also a governance surface: what may the agent remember, for how long, under what consent, with what mechanism for the user to review or delete. Most enterprise teams are building this by accident. GDPR and CCPA implications follow. Designing it on purpose is a Chapter 4 conversation. Naming it as a thing that exists is a Chapter 1 obligation.
Why “AI” and “Agentic AI” Are Different Conversations
If you have been following AI developments casually, you may think of AI as a smart assistant that helps people work faster: summarize this document, draft this email, answer this question. That is AI as a tool. Useful, often impressive, and fundamentally bounded by the human who directs it.
Agentic AI is a different category. The distinction is not about capability. It is about authority and consequence.
When AI is a tool, the human acts. The AI helps. The human retains full control over every decision and every action. The worst case is a bad suggestion that the human ignores.
When AI is an agent, the system acts. The human supervises. The agent makes decisions and takes actions that change state in real systems, sometimes in ways that are difficult or impossible to reverse. The worst case is not a bad suggestion. It is a bad decision, executed autonomously, compounded over time, discovered late. The same underlying model can operate as either, tool or agent, depending on the product design. The distinction is not technical. It is a design choice about authority, and it determines the entire risk profile of your product. When this book says “agentic system,” it means the whole assembly: model, orchestration, tools, context, guardrails, and human oversight, not a single autonomous actor.
This distinction is why the frameworks you know need updating, not replacing. Designing a tool product is a UX problem: make the suggestions useful, make the interface clear, measure adoption. Designing an agent product is a systems problem: define the authority boundary, design the supervisory experience, build the recovery workflow, instrument the observation metrics. You are designing for a system that acts when nobody is watching, and your job is to make sure that when nobody is watching, the right things still happen.
Chapter 2 introduces the label for this. Channel 1 is the agent itself. Channel 2 is the human system that supervises it. Both are products. Both require deliberate design. The rest of the book is about how to build both deliberately.
The Supervision Paradox
The fifth idea I named at the start of this chapter is the one most likely to catch you off guard, because the literature for it is recent and the language is not yet standardized. The premise is simple. Every AI deployment that requires a human supervisor depends on the supervisor remaining competent enough to perform the supervisory function. Every AI deployment that gets used at high frequency erodes the supervisor’s competence to perform that function. The framework that governs the deployment, the regulatory framework, the design-pattern framework, the operating procedure, assumes the first condition holds. The deployment itself violates it.
This is not new. Lisanne Bainbridge described what she called the ironies of automation in 1983, in a paper about industrial control systems. Her central observation: the more reliable the automated system, the more thoroughly atrophied the operator’s monitoring skill. When automation fails, it fails under conditions the operator is least prepared for. Unexpectedly. Under time pressure. In situations where the judgment required to identify and correct the failure has not been exercised in months or years. Bainbridge was describing process plants. The mechanism is general.4
The most measurable instance of the mechanism in 2026 is the one happening to your engineering team right now.
Software, Twelve Months
In a randomized controlled trial published in 2026, Anthropic, the company that builds Claude, studied software developers learning a new skill with and without AI assistance. The methodology was clean. Same task. Same skill. One group used AI assistance during the learning phase. The other did not. Then, minutes after using the concepts, both groups took a quiz on the material they had just worked with. The AI-assisted group scored seventeen percent lower than the control. Same time on task. Same concepts. Lower comprehension when the AI was withdrawn.5
One detail worth holding: the study was published by the company whose product is the cause. That is the rarer kind of self-disclosure. It deserves credit, and it removes the easy dismissal that the finding is a vendor-on-vendor attack.
The Anthropic result is a single instance of an older pattern. Bastani and colleagues at Wharton ran a four-week field experiment in 2024, a thousand high school students learning math, three conditions: GPT Base, GPT Tutor with Socratic guardrails, and no AI. Practice performance with GPT Base improved forty-eight percent. Subsequent exam performance, taken without AI, was seventeen percent below the control group. Same number, different population, different domain. The students did not perceive that their learning was being harmed. The harm was invisible to them while it was happening.6
Then there is the production side. Lightrun’s 2026 State of AI-Powered Engineering Report found that forty-three percent of AI-generated code changes require manual debugging in production environments, even after passing quality assurance and staging tests.7 The volume of AI-generated code is now projected to outstrip human review capacity by roughly forty percent. Senior engineers describe the bottleneck moving from “we cannot write code fast enough” to “we cannot review code fast enough.” Junior engineers are progressing faster on early tasks and plateauing earlier than the previous cohort, because the cognitive work that used to build foundational pattern recognition is now being done by the model.
Three findings, three populations, twelve to twenty-four months of evidence, all pointing the same way. The supervisor’s competence to evaluate the AI’s output erodes during the same period the AI’s output volume scales. The reviewer cannot keep up. And the practice that would let them catch the AI’s subtle errors, writing the code themselves, debugging it, internalizing the system’s shape, is exactly the practice the AI is replacing.
Customer Service, Three to Six Months
Klarna, the Swedish fintech company that lets consumers split purchases into installments, deployed an AI assistant in customer service in 2024 that handled the workload of roughly seven hundred agents. Cost per interaction dropped by roughly forty percent. On the metrics they chose, volume handled, average handle time, cost per interaction, the project looked like a clear win.
By mid-2025, Klarna had deliberately shifted work back to human agents for the complex emotional cases that determined whether high-value customers stayed or left. The AI was quietly underperforming in exactly those interactions, and the supervisor population was being reshaped by twelve months of triaging only the cases the AI escalated. CEO Siemiatkowski named the lesson directly: cost was too predominant an evaluation factor, and quality degraded in the cases that mattered most.8
Chapter 3 covers Klarna as a suitability and cost-modeling case. What matters for this chapter is the supervisor side. The CSR supervisor who used to handle the full distribution of cases now sees only the agent’s escalations. Their internal model of “what does normal look like, and what does a hard case look like, and where in between is the agent silently making bad calls” is built from a sample the agent shaped. The cases the agent handles invisibly badly do not appear in the escalation queue. The supervisor cannot calibrate against them, because the agent has filtered them out.
Three to six months is the rough timescale at which a CSR supervisor’s case-level intuition narrows around what the agent escalates. It is faster than the software-engineering timescale because customer support skill is built on volume, and the agent removes the volume. PMs deploying agents into operational workflows do not get six to eight years before someone audits their supervisor population. They get one product cycle, sometimes two, before the organization no longer has anyone who can tell them what the AI is doing wrong.
Medicine, Three Months
Healthcare publishes its results, including the failures, more rigorously than most industries. The clinical evidence on supervision erosion is therefore the sharpest, and the timescale is faster than most readers expect.
Budzyń and colleagues, in a multicenter observational study published in The Lancet Gastroenterology and Hepatology in 2025, tracked nineteen experienced endoscopists across four Polish centers during the first three months after AI-assisted colonoscopy was introduced. The AI highlighted suspicious lesions in real time with a green bounding box. The investigators looked specifically at what happened when those physicians performed procedures without AI, after sustained AI exposure. The adenoma detection rate fell from twenty-eight point four percent to twenty-two point four percent. An absolute drop of six percentage points. A twenty-one percent relative decline in the ability to find precancerous growths, independently, among physicians who had spent years developing exactly that skill.9
Three months. That is how long it took.
Eye-tracking research published before the Budzyń study showed the mechanism. Endoscopists under AI assistance reduce their eye travel distance during procedures. They stop scanning. The AI is scanning for them. The brain, efficient machine that it is, offloads the task it no longer needs to perform. When the AI is removed, the active visual search that experienced endoscopists had built over years is not sitting in reserve, waiting to be called on. It has atrophied. The physician who used to scan methodically now waits, without quite realizing it, for a box that is not coming.
A New England Journal of Medicine review in 2025 named the broader phenomenon and added a taxonomy that is worth carrying with you. Deskilling is the loss of a capability that existed. Mis-skilling is the development of a capability calibrated to a flawed reference. Never-skilling is the failure to acquire a foundational capability because AI was present during the entire formative window. The third is the one that should keep policymakers awake. It is not losing a skill. It is never developing it in the first place.10
The phenomenon is not specific to medicine. The Maguire studies on London taxi drivers established the general principle decades ago: spatial navigation builds the posterior hippocampus when used, and the structure shifts when GPS substitutes for it. Bohbot at McGill showed that habitual GPS users move from hippocampal spatial mapping to caudate-dependent turn-by-turn response learning. Javadi and Spiers showed real-time hippocampal disengagement during GPS-assisted navigation. The brain responds to what it is asked to do, and it stops maintaining what it is not asked to do. This is not atrophy in the pathological sense. It is ordinary neuroplasticity operating exactly as designed, just in a direction we did not intend.11
Two recent preprints from serious research groups suggest the same process is now operating on something considerably more important than navigation. Kosmyna and colleagues at the MIT Media Lab studied participants writing essays with ChatGPT, with a search engine, or unaided, and measured brain engagement using EEG. The ChatGPT group showed progressively lower neural connectivity across sessions, with the weakest executive control. When asked to write without the tool in a final session, their engagement did not recover. Shaw and Nave at Wharton ran three preregistered experiments on a Cognitive Reflection Test. When the AI was right, accuracy jumped twenty-five percentage points above baseline. When the AI was wrong, accuracy fell fifteen points below it. Confidence rose in both directions. They call the pattern cognitive surrender: the user stops constructing the answer entirely and adopts what the system produces.12
The Only Institutional Counter-Model
Aviation discovered the deskilling problem decades before the word existed in AI discourse. In September 2025, the European Aviation Safety Agency issued a Safety Information Bulletin warning that “continuous use of automated systems does not contribute to maintaining pilot manual flying skills” and could degrade the ability to handle manual flight during unusual situations. Long-haul captains on heavily automated aircraft accrue under one hour of true manual flying per year by some estimates. The EASA bulletin was not predicting a future problem. It was naming a present one.13
The institutional response is mandatory recurrent manual proficiency. Pilots are required, on a regular schedule, to demonstrate that they can fly the aircraft without automation. Not as a theoretical exercise. In the simulator, under conditions that include upset recovery, sensor failures, and off-normal situations specifically designed to test whether the skill is present or merely assumed to be present. The checks are binding. The record is maintained. The requirement is not negotiable.
No other knowledge profession has built the equivalent. Not medicine. Not software engineering. Not customer service. Not any of the fields where AI is now handling an increasing share of the cognitive work that practitioners are paid to perform. Aviation did not develop its recurrent proficiency model because the problem was theoretical. It developed it because aircraft fell out of the sky when automation failed unexpectedly and the pilots, who believed they could take over, discovered that the assumption had not been tested in a long time.
Why This Belongs in Chapter 1
Most of this book describes what to design. This section describes what is breaking underneath the thing you are designing for. The regulatory and design default for AI is “keep a human in the loop.” The systems we are deploying are eroding the human’s ability to perform that loop’s function. Both statements are true at the same time. The product manager whose design assumes the supervisor is the unimpaired version of themselves they were on day one is designing for a population that no longer exists.
The implication for the rest of this book is concrete and recurs in every chapter. Approval is not validation of the agent’s reasoning, which the supervisor often cannot independently reproduce. Approval is authorization within a designed boundary, on the basis of the trace and the policy and the supervisor’s domain knowledge of what this case requires that the trace does not show (Chapter 4). Eval coverage is not whether the agent is competent in absolute terms, but whether the supervisor can catch the failures the agent cannot self-detect (Chapter 5). Production observation is not a dashboard for the well; it is an instrument for the steadily compromised (Chapter 6). Change management is not training; it is the design of recurrent practice that holds supervisory skill stable against the deployment that erodes it (Chapter 7). Silent degradation is not just the agent drifting; it is the supervisor drifting alongside it, in opposite directions, neither monitored (Chapter 8). And the regulatory framework that depends on the human in the loop has not yet been updated for the population that is actually arriving in the loop (Chapter 10).
The argument is not that AI should not be used. The Anthropic result, the Bastani result, the Klarna efficiency, the Budzyń baseline detection rate that improved with AI present, are all real. The argument is that the safety model was designed for one world and is being deployed in a different one. The PM who treats the supervisor as a fixed input, available on day one and on day five hundred at the same competence, is making a planning assumption the evidence will not support.
Observability Literacy for the PM
One more piece of vocabulary before this chapter closes, because Chapter 6 will treat it at length and it pays to introduce the words now.
Observability is the discipline of being able to answer questions about a running system without having anticipated those questions in advance. The traditional enterprise monitoring stack, built around tools like Datadog, Dynatrace, Grafana, Prometheus, and SAP Cloud ALM, was engineered for deterministic systems. The contract is straightforward. The same input, exercised correctly, should produce the same output. Synthetic monitoring runs a fixed script at intervals and asserts a fixed result. Green means the system is behaving as designed.
That contract breaks for agentic systems in five places. The first is non-determinism: a synthetic test can verify that the LLM endpoint is reachable, the retrieval layer returns within budget, the tool-call routing works. It cannot tell you whether the agent’s judgment on the next real request will be correct. The second is semantic success: the question is not “did the workflow complete” but “was the action correct.” A two-hundred-millisecond response that takes the wrong action is worse than a five-second response that pauses for human confirmation. The third is the inversion of what performance means. Latency and throughput are no longer the primary signals. The relevant signals are: did the agent take an action, was the action authorized, was it correct, was it reversible, did a human have an opportunity to intervene. The fourth is behavioral drift: a model can shift its accuracy on a specific task without any infrastructure alert firing, because the infrastructure is healthy and the behavior is not. The fifth is blast radius: agents can execute irreversible real-world actions faster than alert-and-respond cycles are designed to operate. Chapter 6 walks through each of these in detail.14
For now, three vocabulary items are enough.
A trace is the recorded path of a request through a system. For an agent, a trace records the goal, the steps the agent considered, the tools it called, the responses it received, the context it retrieved, and the action it took. Spans are the individual segments of a trace. OpenTelemetry is the open-source standard most platforms emit traces in. When your engineering team says “we have observability,” the first PM question is whether they can show you a trace for an arbitrary historical agent action and reconstruct what happened.
The second item is the contract between platform and PM. The platform emits the events: tool invocations, approval pauses, recovery actions, incident tickets, confidence scores, retrieval queries, output text. The PM composes the metrics: override frequency, unintended action rate, task success rate, rollback time, incident recovery time, and the rest of the six instruments Chapter 6 names. Most enterprise platforms in 2026 emit the right raw events. Few ship the composed metrics as configured features. A vendor page claiming to ship those metrics out of the box is usually offering distributed trace capture plus an LLM-as-a-judge on top, relabeled. The composition is your responsibility, not the platform’s gift. If the team is waiting for the platform feature, the team is waiting forever.
The third item is the distinction between infrastructure observability (is the plumbing working) and behavioral observability (is the judgment correct). The first is mature and largely commodity. The second is emerging, partial, and often built on an LLM-as-a-judge with documented biases of its own. Healthy skepticism on the second category is the right disposition. The monitor having the same problem as the monitored is not a hypothetical concern. Chapter 6 returns to this directly.
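A minimal sketch of that division of labor between platform and PM. The event shape is an illustrative assumption; most 2026 platforms emit something equivalent as OpenTelemetry traces, and the composition into PM-level metrics is the part your team writes.

```python
# The platform emits events; the PM composes the metrics. Toy events for illustration.
events = [
    {"type": "tool_invocation", "tool": "orders.update", "approved": True},
    {"type": "human_override", "agent_action": "orders.update"},
    {"type": "tool_invocation", "tool": "email.send", "approved": False},
    {"type": "task_completed", "success": True},
    {"type": "task_completed", "success": False},
]

def composed_metrics(events):
    actions = [e for e in events if e["type"] == "tool_invocation"]
    overrides = [e for e in events if e["type"] == "human_override"]
    tasks = [e for e in events if e["type"] == "task_completed"]
    return {
        "override_frequency": len(overrides) / max(len(actions), 1),
        "unapproved_action_rate": sum(not e["approved"] for e in actions) / max(len(actions), 1),
        "task_success_rate": sum(e["success"] for e in tasks) / max(len(tasks), 1),
    }

print(composed_metrics(events))  # these composed numbers are the PM's instrument panel
```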
You now have the vocabulary to ask the right questions. Engineering owns the implementation. You own the questions, and the answers belong to both of you.
The Clinical Parallel
Healthcare has been navigating human-AI collaboration under high-stakes conditions for longer than most people realize. ECG interpretation algorithms have been running inside hospital machines since the early 1980s. Mammography computer-aided detection received FDA clearance in 1998. Anesthesiology built closed-loop monitoring architectures borrowed from aviation cockpit design over three decades.
The lessons from those deployments are directly transferable. A clinical decision support system that flags a potential diagnosis is AI as a tool. The physician reviews, decides, acts. An autonomous dosing system that adjusts medication infusion rates based on continuous monitoring is AI as an agent. The nurse supervises, intervenes when thresholds are breached, but the system acts between interventions.
The healthcare field learned that the critical design problem was not the algorithm’s accuracy. It was the supervisory architecture around it. How does the clinician know what the system is doing? How do they intervene before a bad decision compounds? How is the system designed to hold the human’s attention stable over time, when the natural tendency is to trust more as the system performs well, and how do you hold that against the slow erosion of the very skill the trust depends on?
Those are exactly the questions this book answers for enterprise product managers. Healthcare’s methodology is rigorous because the consequences of getting it wrong are measured in lives. Your consequences may be measured in dollars and reputation. The design discipline is the same.
A Brief Note on Using AI While You Build With AI
Before closing this chapter, one thing worth flagging. Most readers of this book will use AI tools while they do the work the book describes. Drafting PRDs, stress-testing frameworks, evaluating vendor claims, running the thought experiment that produces the next hypothesis. The risk is subtle. Out of the box, most AI assistants behave like helpful colleagues who agree with you too quickly. Asked to critique a plan, they list three strengths before they get to the one weakness that matters. The resulting conversation produces a well-organized version of what you already believed, not the challenge that would change your mind.
The antidote is not a different model. It is a different configuration. When you use AI in your own working practice, configure it to disagree. Give it a role, a framework, a specific instruction to surface your weakest assumption, and permission to refuse to help you write the thing you were going to write anyway. Chapter 11 returns to this as a practice discipline. For now, the Chapter 1 point is only this: the agent on your roadmap and the agent in your browser window are the same probabilistic system. Both are equally capable of producing confident, plausible, wrong answers. Treating the one in your browser more carefully will produce a more rigorous design of the one on your roadmap.
And, as the previous section just argued, the practice of writing without AI assistance some of the time is itself the maintenance of the skill that lets you catch the AI when it is wrong. You are reading a book. You are not asking a chatbot. The two are not equivalent, and the difference is part of the work.
What This Chapter Gave You
You now have the vocabulary to hold your own in technical conversations about agentic AI. You understand the difference between deterministic and probabilistic systems, what LLMs do and do not do, what makes an agent an agent, how the technical stack is layered, why context is the fourth and most often forgotten layer, why the tool-versus-agent distinction reshapes your entire design approach, why the supervisor’s competence is not a fixed input but an eroding one, and the working observability vocabulary you need to ask the right questions of your engineering team.
The rest of this book is about how to design for what you now know. Chapter 2 names the role shift explicitly: you are no longer the bridge, and pretending otherwise produces the wrong answers to the wrong problems. Chapter 3 asks whether the problem on your roadmap is the right kind of problem for an agent, and at what cost it makes sense if it is. Chapter 4 covers the four runtime artifacts every agentic product needs at the moment of action, and the security treatment that has to ship with them. Chapters 5 through 8 cover evaluation, production observation, change management for the supervisor population the agent is reshaping, and the slow degradation of agents and the instruments built to watch them. Chapters 9 and 10 widen the lens to frameworks, governance, and the people the agent affects who never touch the product. Chapter 11 compresses the whole book into a field manual you can use on a Monday morning.
If you remember nothing else from this chapter, remember the five ideas. Probabilistic, not deterministic. Confidently wrong at a base rate. Tool boundary as authority. Supervisory system as the second product. And the supervisor’s competence as a moving target, not a constant.
Hold those, and the rest of the book is a toolkit. Forget them, and the frameworks will produce well-organized answers to the wrong problems.
Notes
1. On the seven generations of healthcare AI: rule-based expert systems (MYCIN lineage, 1970s), statistical risk models (APACHE, sepsis scores, 1990s), deep learning for imaging (AlexNet era, 2012 onward), gradient boosting on EHR data (Epic deterioration index, 2015 onward), generative AI for documentation and triage (2022 onward), agentic systems with tool use (2024 onward), and emerging neurosymbolic architectures. Treated at length in Friedman, “The Healthcare AI Spectrum,” data-decisions-and-clinics.com, 2026.
2. Research on chain-of-thought faithfulness has shown that the written reasoning a model produces and the actual computational path can diverge. See for example Turpin et al., “Language Models Don’t Always Say What They Think,” arXiv 2305.04388 (2023), and subsequent literature on reasoning trace fidelity.
3. Heinze et al., on FHIR extension proliferation in large healthcare organizations. JAMIA, 2024. The headline finding (more than sixty percent of extensions in surveyed organizations are proprietary) is the basis for the FHIR “Standard First, Extension Last” modeling rule. Treated at length in Friedman, “You Moved the Data. The Meaning Stayed Behind,” data-decisions-and-clinics.com, 2026.
4. Bainbridge, L. “Ironies of Automation.” Automatica 19(6):775–779 (1983). The foundational paper. Forty years on, it is the single most-cited paper in human factors research on automation, and the central observation has held across industrial control, aviation, anesthesia, and now AI.
5. Anthropic, “How AI Assistance Impacts the Formation of Coding Skills,” Anthropic Research, 2026. Available at anthropic.com/research/AI-assistance-coding-skills. The seventeen-percent comprehension deficit is the headline finding from a randomized controlled trial; the methodology controls for time-on-task and task difficulty. The paper is unusual in that it is published by the company whose product is the cause of the effect; the disclosure is to Anthropic’s credit, and removes the obvious dismissal.
6. Bastani, H., Bastani, O., Sungu, A., Ge, H., Kabakcı, O., Mariman, R. “Generative AI Can Harm Learning.” PNAS 122(44), 2025. Originally circulated as SSRN 4895486. The forty-eight percent practice improvement and seventeen percent exam deficit numbers come directly from the GPT Base condition; the GPT Tutor (Socratic, hints-only) condition substantially mitigated the harm, which is itself an important design finding.
7. Lightrun, “State of AI-Powered Engineering Report,” 2026. The forty-three-percent debug-after-staging rate is the headline production-side finding. Adjacent: Pragmatic Engineer, “The Impact of AI on Software Engineers in 2026.”
8. Klarna’s deployment of an AI customer service assistant and the subsequent 2025 partial reversal toward human agents for complex emotional cases is publicly documented in Klarna investor communications and CEO statements (Siemiatkowski, 2025). The case is treated as a suitability and cost-modeling failure mode in Chapter 3 of this book.
9. Budzyń, K. et al. “Endoscopist Deskilling Risk after Exposure to Artificial Intelligence in Colonoscopy: A Multicentre, Observational Study.” The Lancet Gastroenterology and Hepatology 10(10):896–903 (2025). doi:10.1016/S2468-1253(25)00133-5. The study is observational, not a randomized trial, and the cohort is small (n=19); the authors are explicit about sensitivity to confounding. The direction and magnitude of the effect, observed within three months, are the salient findings for the supervision-paradox argument.
10. Abdulnour, R-EE., Gin, B., Boscardin, C.K. “Educational Strategies for Clinical Supervision of Artificial Intelligence Use.” New England Journal of Medicine 393(8):786–797 (2025). The deskilling, mis-skilling, never-skilling taxonomy is the contribution most directly relevant to this book.
11. Maguire, E.A., Gadian, D.G., Johnsrude, I.S., et al. “Navigation-Related Structural Change in the Hippocampi of Taxi Drivers.” PNAS 97(8):4398–4403 (2000). Dahmani, L., Bohbot, V.D. “Habitual Use of GPS Negatively Impacts Spatial Memory During Self-Guided Navigation.” Scientific Reports 10:6310 (2020). Javadi, A.H. et al. Nature Communications 8:14652 (2017). For the synthesis treatment, see Friedman, “The Quiet Erosion,” data-decisions-and-clinics.com, 2026.
12. Kosmyna, N., Hauptmann, A., Olwal, A., et al. “Your Brain on ChatGPT: Accumulation of Cognitive Debt When Using an AI Assistant for Essay Writing.” arXiv 2506.08872 (2025). Shaw, S.D., Nave, G. “Thinking, Fast, Slow, and Artificial: How AI Is Reshaping Human Reasoning and the Rise of Cognitive Surrender.” SSRN 6097646, Wharton Behavioral Lab (2026). Both are preprints. Both are pre-registered. Both point in the same direction.
13. European Aviation Safety Agency, Safety Information Bulletin 2025-09, “Manual Flying Skills Degradation,” September 2025. The aviation institutional model, mandatory recurrent manual proficiency, is the only knowledge-profession counter-template currently in operation. Medicine, software engineering, customer service, and most other domains where AI is now substituting for the supervisor have no equivalent.
14. For the deeper treatment, see Friedman, “The Stack Is Green. The Agent Is Wrong.” data-decisions-and-clinics.com, 2026. The five places the traditional monitoring contract breaks (non-determinism, semantic success, the inversion of performance metrics, behavioral drift, and blast radius) are developed in detail there and revisited in Chapter 6 of this book.