Not Every Problem Deserves an Agent
Stage: pre-build
The most expensive agentic AI mistake you can make is not an engineering failure. The model worked. The integration held. The agent did what it was designed to do. The problem was the problem itself.
The use case was never a good fit, and the team discovered this after six months of development, a production deployment, and an invoice that was difficult to explain to finance.
If you have spent the past decade in vertical deterministic SaaS, this is exactly where your experience can hurt you. You have a well-honed instinct to look at any repetitive workflow and imagine how software could take it over. You know how to translate “our people are drowning in manual steps” into a backlog. In deterministic systems, that instinct is mostly a strength. With agents, it can send you straight into the wrong class of problem.
This chapter is the candidacy assessment that keeps you out of it.
The rest of this book follows the PM’s work through four states. Pre-build, this chapter, decides whether an agent is the right solution. Pre-launch, Chapters 4 and 5, designs the agent’s runtime and proves it is ready to ship. Post-launch, Chapters 6, 7, and 8, observes the deployed agent, supervises its use, detects degradation, and eventually retires it. Governance (Chapter 9) and obligations to people the agent touches (Chapter 10) cut across all stages. The field manual (Chapter 11) consolidates the decisions and artifacts each stage requires.
Each chapter from here forward carries a stage tag under its title to anchor you in the loop.
Two Questions Before the First Line of Code
Every agentic project should start with two questions you were never required to answer in deterministic SaaS.
Is this problem structurally suited for an agent at all? If it is, at what task volume and error cost does an agent beat the best non-agent alternative?
The first part of this chapter is a suitability assessment. The second is the cost model most PMs never build. The third is a sidebar on what makes movement up the autonomy ladder a real engineering question rather than a calendar event.
The Four Suitability Tests
A problem is well suited for an agent when four conditions hold. If any one fails, the problem is not ready. The work is scoping, not building.
The decision repeats at volume. An agent creates value by handling many instances of the same decision type across time. A one-time strategic call, however complex, does not benefit from agency. Test question: how many times per day or week does this decision get made in the organization today?
The outcome is measurable. If you cannot define what a correct decision looks like, you cannot evaluate the agent, detect when it drifts, or calibrate user trust over time. Outcomes that depend on stakeholder politics, relationship history, or unstated organizational preferences are not measurable in the sense an agent requires. Test question: can you score a sample of past decisions as right or wrong without significant disagreement among the people who made them?
The tool use is bounded. Agents with broad system access will eventually touch something unintended. The value of an agent scales with the precision of its tool boundaries, not the breadth of its permissions. Test question: can you enumerate every system the agent needs to access to complete the task, and nothing beyond?
The consequences are recoverable. Wrong decisions happen at a predictable rate. The suitability question is whether the cost of a wrong decision, at the rate this agent will produce them, is one the organization can absorb without structural damage. Test question: what does the worst-case failure look like, and what does recovery cost?
Before committing to build an agent, four conditions must hold. (1) Volume: the decision repeats at meaningful scale. (2) Measurability: you can score past decisions as right or wrong without disagreement. (3) Bounded tools: you can enumerate every system the agent needs, and nothing beyond. (4) Recoverable consequences: the worst-case failure at the expected error rate is survivable.
If any one fails, or if you cannot answer all four with concrete examples and numbers, you are not ready to build. You are ready to keep scoping. This is the clinical candidacy assessment applied to product decisions: the procedure may be excellent, but is this patient the right candidate?
Three Places Agents Fail by Design
The evidence from real deployments converges on a consistent pattern. Agents succeed at tactical decisions: high-volume, bounded, measurable, with feedback loops that allow calibration. Recommendation ranking, document routing, alert triage, scheduling optimization, data extraction. They fail in three specific places, and those failures are not execution problems. They are problem-type problems.
Strategic decisions. Goals requiring judgment about what to optimize for, weighing competing values the organization has not reconciled, or choosing between paths whose consequences will only be legible in retrospect. These decisions do not have a ground truth the agent can learn from, because there is not a single right answer. Teams that automate them get confident output, applied consistently, on a decision whose consistency was never the goal.
Private-knowledge decisions. The most underappreciated failure dimension. An agent cannot replicate the judgment of a person who knows the relationship history, the political constraints, or the unwritten rules that govern a given decision. Sales outreach to a specific account, executive communications, strategic vendor negotiations, performance conversations, all fall into this category. Teams that automate these workflows are not saving time. They are trading human judgment for machine confidence on exactly the decisions where confidence without judgment is the core failure mode.
Asymmetric-consequence decisions. Decisions where being wrong costs meaningfully more than being right saves. Regulatory filings, customer trust events at a high-value account, medical triage in a small minority of cases that dominate the outcome distribution. Expected-value math that looks favorable in aggregate can be ruinous when the tail is fat. A ninety-eight percent success rate at this class of decision is still two outcomes in a hundred that the organization cannot afford.
The Cost Model Most PMs Never Build
Here is the question that separates teams that deploy agents successfully from teams that quietly decommission them a few months later. At what task volume does this agent break even against the best available alternative, and is that volume realistic within the planning horizon?
Most agentic business cases are built on capability assumptions. The agent can do this task. It can scale. What they omit is the cost structure, which is fundamentally different from every other technology investment you have evaluated in deterministic SaaS. Industry analysts expect a significant share of early agentic AI projects to be abandoned by 2027 due to escalating costs and unclear business value. These are not execution failures. They are cost models that were never built.
Three cost patterns appear consistently when teams look backward at the numbers: a floor price the system incurs regardless of value delivered, a per-task operational cost that collapses the unit economics the moment significant human review is required on every output, and a cost trajectory that scales with volume in ways headcount does not, with multi-agent architectures commonly consuming several times the compute of simpler designs for the same outcome. The floor alone is a significant six-figure investment in the first year once evaluation, monitoring, security review, and maintenance are included, and total cost is commonly underestimated by 40 to 60 percent relative to the vendor quote alone. Practical correction: whatever number is in the initial business case, multiply it by 1.5 before presenting it to finance. Each of the three patterns deserves a closer look.
The floor price. The minimum cost exists whether the agent handles two tasks a day or two thousand. Most business cases are modeled on optimistic volume. Almost none model what the system costs when volume comes in at half of projection. The compounding expenses sit behind the vendor quote: orchestration, human oversight, and the maintenance triggered by model updates every three to six months.
The per-task operational cost. In many service workflows, a fully loaded human interaction costs a few dollars. An agent handling similar interactions can land well below that, when it completes the task end-to-end without human review. The unit economics collapse the moment significant human review is required on every output. Recent enterprise surveys indicate that most organizations still require human validation of agent outputs. Human oversight does not disappear because you deployed an agent. It gets restructured into a recurring labor cost that must appear in the model.
The cost trajectory. One additional person costs one salary. An agent’s variable cost scales with usage: ten times the volume will drive a meaningfully higher bill, even if caching, batching, and small-model routing keep it from being literally ten times. Multi-agent orchestration can add coordination overhead that makes the curve steeper if you are not careful, with agents handing context back and forth, re-establishing state, and resolving disagreements. The crossover point where the agent is cheaper per task than the alternative exists at a specific volume and complexity threshold. That threshold must be modeled before the team commits.
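Here is that crossover as a sketch you can run, not a diagram. Every figure in it is an illustrative assumption, the annual floor, the blended inference cost per task, the share of outputs a human still reviews, standing in for the numbers your own deployment would supply.

```python
# Break-even sketch: at what monthly volume does the agent's fully loaded
# cost per task drop below the best non-agent alternative?
# All figures are illustrative assumptions, not vendor quotes.

def agent_cost_per_task(monthly_volume: int,
                        annual_floor: float = 300_000.0,    # assumed: eval, monitoring, security, maintenance
                        inference_per_task: float = 0.45,   # assumed blended token cost per task
                        review_fraction: float = 0.30,      # assumed share of outputs a human still reviews
                        review_cost: float = 1.50) -> float:  # assumed cost of one human review
    """Fully loaded agent cost per task at a given monthly volume."""
    floor_share = annual_floor / (monthly_volume * 12)
    return floor_share + inference_per_task + review_fraction * review_cost


def crossover_volume(alternative_per_task: float, step: int = 500,
                     ceiling: int = 200_000) -> int | None:
    """Smallest monthly volume at which the agent is cheaper per task, if any."""
    for volume in range(step, ceiling + 1, step):
        if agent_cost_per_task(volume) < alternative_per_task:
            return volume
    return None


if __name__ == "__main__":
    for vol in (1_000, 5_000, 20_000):
        print(f"{vol:>6} tasks/month -> ${agent_cost_per_task(vol):.2f} per task")
    print("crossover vs. a $7.40 human-handled task:", crossover_volume(7.40), "tasks/month")
```

The shape matters more than the defaults: halve the volume and the floor's share of every task doubles, which is exactly the scenario the paragraph above says almost nobody models.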
What the Token Bill Actually Measures
Every agentic cost discussion eventually lands on the word tokens, but few business cases explain what that means in practice.
A token is roughly three-quarters of a word. The model charges for every token it reads and every token it produces. Reading is cheap. Thinking and writing is expensive. Output tokens cost three to five times more than input tokens because the model generates them sequentially, one at a time, while input can be processed in parallel. This asymmetry is the first thing most cost models get wrong by omission.
The second is complexity. Standard models produce a visible response and stop. Reasoning models, and standard models configured for extended thinking, generate an internal chain of thought before producing any visible output. Those internal reasoning tokens are billed at full output rates even though the user never sees them. A billing dispute that requires the agent to work through contract terms, transaction history, and policy exceptions can generate thousands of invisible reasoning tokens before a single word of the response appears. The quality improvement is real. So is the cost multiplier. Switching from a standard model to a reasoning model for complex tasks is not a free upgrade. It is a deliberate cost decision.
The third driver is context accumulation. Retrieval-augmented generation, RAG, is the pattern where an agent retrieves relevant documents and injects their content into the conversation before responding. Those document tokens are billed as input tokens on every call in the session. If an agent retrieves a ten-page policy document at the start of a conversation and the conversation runs eight turns, that document is billed eight times. Context caching can reduce this significantly for static documents, but it requires deliberate architecture and most initial deployments do not use it. The PM implication is that RAG is not free retrieval. It is paid context injection, and the bill compounds with every additional turn.
The fourth driver is one almost no customer service cost model includes: multimodal inputs. Images processed by vision-capable models are priced differently from text, often equivalent to several hundred to several thousand text tokens depending on resolution. A scanned invoice is processed as an image, not as text. A customer who uploads a photo of a damaged product, a PDF warranty document, and a handwritten note in a single support interaction has already consumed an order of magnitude more input tokens than a purely text interaction, before the agent has reasoned or responded at all. CSR deployments that model cost on text conversations and then encounter real customer behavior, which increasingly involves attachments, face a cost structure that was never in the business case.
The five variables that drive the token bill above forecast are worth naming plainly. Reasoning depth adds invisible cost: extended-thinking models generate internal chain-of-thought tokens before producing any visible output, and those tokens are billed at full output rates. Conversation length compounds because the full history is re-sent as input context on every turn. Retrieved document size multiplies that effect, since every document injected via RAG is billed again on every subsequent call in the session. Multimodal inputs, images, scanned PDFs, handwritten documents, are processed as vision tokens rather than text, and a single customer attachment can cost as much as an entire text-only conversation. Finally, retry behavior: when an agent fails a step and retries, the failed attempt stays in the context stream, and multi-agent architectures that resolve disagreements through iteration can accumulate significant token overhead before settling on an answer. Any one of these is manageable. A CSR deployment that encounters all five simultaneously, a reasoning model handling a complex dispute, across a ten-turn conversation, with retrieved policy documents, customer-uploaded attachments, and retries on a failed tool call, is operating at a cost per interaction that bears no resemblance to the average the business case modeled.
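The paragraph above compresses into a few lines of arithmetic. The sketch below stacks all five drivers onto a single interaction at the mid-tier prices from the pricing reference table later in this chapter (Table 3.2); every token count in it is an assumption chosen for illustration, not a measurement from any deployment.

```python
# Worst-case interaction sketch: all five token-bill drivers at once.
# Prices match the mid-tier row of Table 3.2 ($3 per 1M input tokens,
# $15 per 1M output tokens); every token count is an assumption.

INPUT_PRICE = 3.00 / 1_000_000    # dollars per input token
OUTPUT_PRICE = 15.00 / 1_000_000  # dollars per output token

def interaction_cost(turns: int = 10,
                     history_tokens_per_turn: int = 600,      # assumed user + agent text added per turn
                     rag_document_tokens: int = 7_000,        # assumed ten-page policy doc, re-sent each turn
                     image_tokens: int = 1_600,               # assumed vision cost of one uploaded attachment
                     reasoning_tokens_per_turn: int = 2_000,  # assumed invisible chain of thought, billed as output
                     visible_output_per_turn: int = 300,
                     retried_turns: int = 2) -> float:
    """Dollar cost of one multi-turn interaction with all five drivers active."""
    cost = 0.0
    effective_turns = turns + retried_turns  # failed attempts stay in the context stream
    for turn in range(1, effective_turns + 1):
        # Input side: the full history so far, plus the retrieved document and
        # the attachment, are re-billed on every call.
        history = history_tokens_per_turn * turn
        input_tokens = history + rag_document_tokens + image_tokens
        # Output side: reasoning tokens are invisible but billed at output rates.
        output_tokens = reasoning_tokens_per_turn + visible_output_per_turn
        cost += input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE
    return cost

if __name__ == "__main__":
    print(f"all five drivers active: ${interaction_cost():.2f} per interaction")
    print(f"text-only, no reasoning, no retries: "
          f"${interaction_cost(rag_document_tokens=0, image_tokens=0, reasoning_tokens_per_turn=0, retried_turns=0):.2f}")
```

Strip the retrieved document, the attachment, the reasoning tokens, and the retries out of the same conversation and the cost drops by roughly a factor of six under these assumptions. That gap is the distance between the average the business case modeled and the interaction production will actually serve.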
The published numbers on this are more concrete than most business cases assume.
| Cost Layer | Typical Range | Driver | PM Implication |
|---|---|---|---|
| Floor cost, year one | $175K to $450K for enterprise deployments; evaluation and monitoring alone $60K to $120K | Orchestration, monitoring, security review, maintenance, model update cycles every 3 to 6 months | Floor exists whether volume hits projection or half of projection. Model both. |
| TCO underestimate | 40 to 60 percent below actual | Teams anchor on vendor quote and miss compounding expenses | Multiply the initial business case by 1.5 before presenting to finance |
| Multi-agent coordination | ~37% of total tokens spent on coordination | Agents re-establishing context, handing off state, resolving disagreement | Adding agents does not divide the work; it multiplies the token bill |
| Multi-agent token multiplier | ~3.5x tokens for a four-agent system vs. single-agent for the same outcome | Coordination plus retry behavior | Multi-agent architecture is a cost decision, not just a design decision |
| Agentic vs. single-shot, same task | 10x to 50x more tokens per customer service case | Multi-step planning, tool calls, retries | Unit economics only hold if the agent closes the case end-to-end |
| Traditional automation vs. agent, routing | $0.001 per task vs. $0.10 to $0.20 per task | Deterministic logic has no inference cost | For routing, agents are 100 to 200x more expensive per task |
| Traditional automation vs. agent, scheduling | $0.01 per task vs. $0.60 to $1.20 per task | Same pattern, slightly more complex logic | The cost differential for routine work is structural, not marginal |
Table 3.1. Published cost benchmarks for enterprise agentic deployments.
A concrete frame for the break-even question. Deploying an enterprise agentic workflow in a single HR function often lands in the $450,000 to $650,000 first-year range in real deployments. To achieve a twelve-month return on investment, the organization needs to displace the work of six to ten full-time employees through efficiency gains. That is not an argument against building. It is an argument for knowing the number before the project starts, and for choosing the function where that displacement is achievable and measurable. The teams that model this before building find the right use cases. The teams that discover it on the invoice have a different conversation.
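The displacement arithmetic is short enough to write down. The only inputs are the first-year cost and a fully loaded salary, and the salary here is an assumption you would replace with your own.

```python
# How many full-time equivalents must the deployment displace to return
# its first-year cost within twelve months? The cost range comes from the
# paragraph above; the fully loaded salary is an assumption.

def fte_to_break_even(first_year_cost: float,
                      fully_loaded_salary: float = 75_000.0) -> float:
    """FTE-equivalents of work the agent must displace in year one."""
    return first_year_cost / fully_loaded_salary

if __name__ == "__main__":
    for cost in (450_000.0, 650_000.0):
        print(f"${cost:,.0f} first-year cost -> {fte_to_break_even(cost):.1f} FTEs displaced")
```

If the achievable efficiency gain in the chosen function is four FTEs, the twelve-month case does not close, however good the demo.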
Brownfield vs. Greenfield: The Floor-Price Inversion
The conventional cost diagram, agentic AI starts with a higher floor price and wins at sufficient volume through lower marginal cost, is wrong as a universal claim. The floor-price comparison inverts for a large fraction of enterprise deployments. The most important variable is whether the enterprise already holds the underlying platform license.
Consider the 2025 to 2026 pricing of the platforms most enterprise PMs are working inside. SAP Joule Base is included in all SAP cloud subscriptions at no additional cost. Salesforce Agentforce Foundations gives Enterprise+ customers two hundred thousand free Flex Credits. ServiceNow folded AI capabilities into base product licenses in April 2026. Microsoft Copilot Studio is included with M365 Copilot for internal agents at no additional charge. For enterprises already running these platforms, which is most large enterprises, the incremental floor price of the agentic layer can be lower than adding a new RPA deployment. The conventional diagram is approximately correct only for greenfield deployments with no existing AI-capable platform license.
The pricing war caveat is worth naming. Current enterprise AI pricing is partially artificially suppressed. Vendors are absorbing AI costs to defend existing seat revenue and block competitive displacement during the consolidation phase. SAP CEO Christian Klein stated in March 2025 that SAP is shifting from per-user to AI-consumption pricing, because as agents replace human users, per-seat revenue collapses. When consolidation pressure forces vendors to recover their GPU and inference infrastructure investment, the floor price picture changes. Business cases built on current bundled pricing should include a repricing stress test. If the agent’s economics depend on bundled pricing that the vendor is currently absorbing, you have not modeled the cost. You have modeled the promotion.3
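A repricing stress test does not need to be elaborate. Here is a sketch, with assumed figures, of what the incremental annual cost looks like when the inference the vendor currently absorbs comes back onto your invoice.

```python
# Repricing stress test: re-run the business case with the inference cost
# the vendor is currently absorbing added back in. All figures are assumptions.

def annual_agent_cost(vendor_absorbs_inference: bool,
                      oversight_and_evals: float = 110_000.0,   # assumed: monitoring, eval pipeline, review labor
                      inference_at_list_price: float = 90_000.0) -> float:
    """Incremental annual cost of the agentic layer on an existing platform."""
    inference = 0.0 if vendor_absorbs_inference else inference_at_list_price
    return oversight_and_evals + inference

if __name__ == "__main__":
    bundled = annual_agent_cost(vendor_absorbs_inference=True)
    repriced = annual_agent_cost(vendor_absorbs_inference=False)
    print(f"bundled: ${bundled:,.0f}/yr  repriced: ${repriced:,.0f}/yr  "
          f"uplift: {repriced / bundled - 1:.0%}")
```

If the repriced number breaks the business case, the business case was the promotion.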
The Token Pricing Reference
One reference table belongs in this section, because the token-pricing asymmetry between input and output tokens is the single most consequential number most cost models omit.
| Model | Input per 1M tokens | Output per 1M tokens |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| Claude Sonnet 4.6 | $3.00 | $15.00 |
| Claude Haiku | $0.25 | $1.25 |
| Llama 3.1 70B (self-hosted) | ~$0.40 | ~$0.40 |
Table 3.2. Token pricing across major models, April 2026. Output tokens are 3 to 5 times more expensive than input tokens because output requires sequential generation while input can be processed in parallel.
Cost per task scales with complexity, not just volume. Simple tasks, email classification, FAQ, single-step output, run between two-tenths of a cent and four cents per task. Moderate tasks, customer support with two or three steps and tool use, run nine to twenty-six cents per task at mid-tier models. Complex tasks, dispute resolution with five or more steps and multiple tool calls, run fifty cents to two dollars or more per task. The complexity tax is the line where most agentic business cases break. A multi-step dispute resolution at one to two dollars per task can exceed the fully loaded handle cost of an outsourced human agent. This is the scenario the original Klarna business case did not run.
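Those per-task figures fall out of simple arithmetic over Table 3.2. In the sketch below, the prices are the table's; the token counts assigned to each complexity tier are assumptions chosen to illustrate the tiers, not measurements.

```python
# Per-task cost by complexity tier, using the Table 3.2 prices.
# The token counts per tier are illustrative assumptions.

PRICES = {  # dollars per 1M tokens: (input, output), from Table 3.2
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
    "claude-sonnet-4.6": (3.00, 15.00),
}

TIERS = {  # assumed (input_tokens, output_tokens) per task
    "simple (classification, FAQ)": (1_500, 300),
    "moderate (2-3 steps, tool use)": (12_000, 4_000),
    "complex (5+ steps, disputes)": (60_000, 25_000),
}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task: input and output billed at their own rates."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

if __name__ == "__main__":
    for tier, (inp, out) in TIERS.items():
        costs = ", ".join(f"{m}: ${task_cost(m, inp, out):.3f}" for m in PRICES)
        print(f"{tier}: {costs}")
```

The spread across tiers is far wider than the spread across models, which is why the complexity tax, not the model choice, is where most business cases break.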
Outcome-Based Pricing Changes the Risk Structure
One emerging pattern is worth knowing about, because it collapses the floor-price-versus-marginal-cost tradeoff the conventional diagram assumes. Several vendors have moved to outcome-based pricing. Intercom Fin charges ninety-nine cents per resolved conversation, with no charge for failures. Zendesk charges one dollar fifty per automated resolution. Salesforce Agentforce charges two dollars per conversation, regardless of resolution, or alternatively Flex Credits at ten cents per action.
Outcome-based pricing means the enterprise pays only for demonstrated value. When the agent fails the resolution, the vendor absorbs the cost. This is structurally different from per-token pricing in a way that matters for the suitability conversation. If outcome pricing is available for the workflow you are scoping, the cost model question becomes simpler: at what resolution rate does this vendor stay profitable, and how does that align with what the workflow can accept? If outcome pricing is not available, you are holding the failure cost yourself. That is its own cost line.
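The comparison that question implies can be written down directly. In the sketch below, the per-resolution price is the published Intercom figure above; the per-attempt token cost and the human escalation cost are assumptions.

```python
# Outcome-based vs. per-token pricing: who holds the cost of failure?
# The per-resolution price is a published figure cited above; the
# per-attempt token cost and the escalation cost are assumptions.

def outcome_priced(resolution_rate: float,
                   price_per_resolution: float = 0.99,   # e.g. the Intercom Fin figure above
                   escalation_cost: float = 7.40) -> float:
    """Expected cost per inbound contact when you pay only for resolutions."""
    return resolution_rate * price_per_resolution + (1 - resolution_rate) * escalation_cost

def token_priced(resolution_rate: float,
                 cost_per_attempt: float = 0.62,          # assumed inference cost of every attempt
                 escalation_cost: float = 7.40) -> float:
    """Expected cost per contact when every attempt is billed, resolved or not."""
    return cost_per_attempt + (1 - resolution_rate) * escalation_cost

if __name__ == "__main__":
    for rate in (0.5, 0.7, 0.9):
        print(f"resolution rate {rate:.0%}: outcome-priced ${outcome_priced(rate):.2f}, "
              f"token-priced ${token_priced(rate):.2f}")
```

The low-resolution row is the one to study: under outcome pricing the failed attempts are the vendor's cost, under per-token pricing they are yours, which is the risk-structure difference the paragraph above describes.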
Production Outcome Quality, Not Just Cost
One last set of numbers, because cost is one column of the comparison and outcome quality is the other. Recent customer service production benchmarks, drawn from 2026 industry data on AI-handled tickets versus human-handled tickets versus hybrid deployments, look like this.
| Metric | AI agent | Human agent | Hybrid (22% escalation) |
|---|---|---|---|
| Resolution time | 1.9 min | 11.4 min | Mixed |
| Cost per resolution | $0.62 | $7.40 | $2.10 |
| CSAT | 4.1 / 5.0 | 4.3 / 5.0 | 4.25 / 5.0 |
| Re-contact rate (72h) | 11.3% | 8.7% | — |
Table 3.3. Customer service production benchmarks, 2026. The re-contact rate is the hidden cost most CSAT-driven cost models miss.
The re-contact rate is the line that matters and that the cost-per-resolution figure does not capture. AI-resolved tickets are thirty percent more likely to generate a follow-up contact within seventy-two hours. That follow-up is itself an interaction with a cost, and the customer experience of having to come back is the kind of thing that does not show up in average handle time and does show up in customer lifetime value six months later. A cost model that compares dollar-sixty-two AI to dollar-seven-forty human and stops there is comparing the wrong columns.
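The adjustment is one line of arithmetic. The inputs below are the Table 3.3 benchmarks; the only assumption is what a re-contact costs, which depends on which channel the returning customer lands in.

```python
# Effective cost per settled issue once 72-hour re-contacts are counted.
# Base costs and re-contact rates are the Table 3.3 benchmarks; the cost
# of the follow-up contact is an assumption about routing.

def effective_cost(cost_per_resolution: float,
                   recontact_rate: float,
                   recontact_cost: float) -> float:
    """Cost per issue including the expected follow-up contact."""
    return cost_per_resolution + recontact_rate * recontact_cost

if __name__ == "__main__":
    print(f"AI, follow-up handled by AI:    ${effective_cost(0.62, 0.113, 0.62):.2f}")
    print(f"AI, follow-up handled by human: ${effective_cost(0.62, 0.113, 7.40):.2f}")
    print(f"Human baseline:                 ${effective_cost(7.40, 0.087, 7.40):.2f}")
```

The gap does not close, but the sixty-two-cent headline more than doubles the moment a frustrated returning customer is routed to a person, and none of that appears in the cost-per-resolution column.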
Try the Cost Model Yourself
The interactive calculator below lets you adjust the six cost variables that drive agentic break-even and see how the crossover points shift. The defaults are set to the values used in this chapter; drag the sliders to model your own deployment. The most important scenario to try: drag the outsourced cost slider to eight hundred dollars per agent per month, which is realistic for Southeast Asia operations. The agentic-complex-versus-outsourced-humans crossover disappears entirely. Agentic complex never becomes cheaper than outsourced humans at any realistic volume. This is the Klarna lesson. The business case that looked compelling was comparing against the wrong baseline.
Calculator note: cost curves are illustrative models based on published benchmarks (2025 to 2026). Sources include HfS Research, Gartner, Deloitte RPA surveys; Intercom, Zendesk, and Salesforce published pricing; and the 2026 Digital Applied Customer Service AI Statistics dataset. The calculator is not a quote for any specific vendor.
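If you do not have the interactive page in front of you, the underlying comparison is short enough to sketch. The figures below mirror the scenario described above and are assumptions, not quotes.

```python
# The Klarna lesson in one loop: a complex-task agent against an outsourced
# human team at $800 per agent per month. All figures are assumptions
# mirroring the calculator defaults described above.

OUTSOURCED_MONTHLY = 800.0       # assumed fully loaded cost per outsourced agent, per month
CASES_PER_HUMAN_MONTH = 500.0    # assumed cases one outsourced agent closes per month
AGENT_COMPLEX_PER_CASE = 1.80    # assumed agentic cost for a complex, multi-step case
AGENT_FLOOR_MONTHLY = 25_000.0   # assumed monthly share of the agentic floor cost

def outsourced_cost(volume: float) -> float:
    return (volume / CASES_PER_HUMAN_MONTH) * OUTSOURCED_MONTHLY

def agent_cost(volume: float) -> float:
    return AGENT_FLOOR_MONTHLY + volume * AGENT_COMPLEX_PER_CASE

if __name__ == "__main__":
    for volume in (5_000, 50_000, 500_000):
        print(f"{volume:>7,} cases/month: outsourced ${outsourced_cost(volume):>9,.0f}, "
              f"agent ${agent_cost(volume):>9,.0f}")
```

At $1.60 per outsourced case against $1.80 plus the floor, there is no volume at which the curves cross. The crossover only exists when the human baseline is expensive enough, which is the wrong-baseline error the original Klarna business case made.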
Under MVP culture, governance, observation, context engineering, and supervisory design tend to be deferred in favor of shipping the happy path. Each deferral is small in isolation. Stacked, they become the load-bearing weakness of the product. Agentic systems are particularly vulnerable because the deferred work is exactly the work the probabilistic behavior requires. A deterministic MVP can afford to defer monitoring for a sprint. An agentic MVP cannot.
When the house of cards collapses, it collapses in production, under load, at the moment a confident wrong answer has been applied at scale. Build the supervisory half in from the first sprint, or do not call it an MVP.
The Right Comparison
The correct comparison is not agent versus doing nothing. It is agent versus the current best alternative, which is often a human with better deterministic tooling, a traditional automation or RPA flow (RPA, robotic process automation, which scripts the user interface steps of existing systems), or a simpler AI integration (classification, extraction, summarization) without action.
| Option | Unit cost | Setup cost | Governance overhead | Best fit |
|---|---|---|---|---|
| Human with better tooling | High per task | Low | Embedded in the org | Low volume, high judgment, ambiguous inputs |
| Traditional automation or RPA | Very low | Moderate | Minimal | Deterministic, rule-based, stable interfaces |
| Constrained AI (classification, extraction, summarization) | Low | Low to moderate | Moderate | Narrow, well-bounded tasks with a human consumer |
| Full agent | Variable, often high once oversight is priced in | High | Substantial and ongoing | High volume where judgment can be bounded and consequences are recoverable |
Table 3.4. Alternatives to an agent, by workflow.
For many workflows, the right answer is not an agent. Consider a pattern that has appeared repeatedly in customer-service pilots. A team launches with a vision of an end-to-end agent that handles support cases across channels, calls tools, and closes tickets without human involvement. During the pilot, three cost drivers appear that were not in the original model: multi-step planning, retries, and tool calls drive an order of magnitude more compute per task than a single-shot model; orchestration overhead accumulates through approval gates, audit trails, and monitoring; and integrating the agent with legacy ticketing systems turns the quick win into a multi-year process redesign. When the team runs the head-to-head comparison, the fully agentic path carries a higher cost per resolved ticket with no measurable improvement in resolution rate. Leadership cancels the full-agent initiative and standardizes on a constrained assistant plus RPA pattern that should have been the baseline comparison from the start.
Traditional automation handles deterministic, rule-based processes at very low per-task cost with no inference overhead. If the task can be solved with scripted logic or RPA, an agent is usually the wrong tool. The cost differential is not marginal. It is structural.
The published numbers on RPA economics are also worth knowing, because the comparison the agent has to win is not against doing nothing but against a mature alternative with thirty years of operational learning behind it. HfS Research, Gartner, and Deloitte data converge on a consistent picture: licensing is roughly 25 to 30 percent of total RPA cost, while implementation and maintenance is 70 to 75 percent. Implementation-to-license ratio runs 3:1 to 4:1. A representative three-year RPA TCO sits near 1.4 million euros against 250,000 euros in software licenses. Annual ongoing maintenance is 40 percent or more of initial implementation cost. And 30 to 50 percent of RPA projects fail before scaling to production, by EY’s estimate. The point is not that RPA is bad. The point is that the comparison agentic AI has to beat is not the demo. It is RPA at full operational maturity, which has a cost structure most agentic business cases do not bother to model.
Agentic implementation runs $40,000 to $200,000 depending on orchestration complexity, before adding the observability and governance stack: guardrails, monitoring, eval pipeline, somewhere between $50,000 and $200,000 in build cost plus ongoing operational overhead. The observability stack has no RPA equivalent. It is a permanent operational cost unique to probabilistic systems. Gartner’s prediction that 40 percent or more of agentic AI projects will be canceled by 2027 is grounded in this layer specifically: the cost of observation infrastructure, accumulating across years, against business cases that did not budget for it.
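Putting the two cost structures side by side makes the point concrete. In the sketch below, the RPA side follows the published ratios cited above; the agentic side uses midpoints of the ranges in this section plus an assumed permanent observability overhead.

```python
# Three-year TCO sketch: mature RPA vs. an agentic deployment for the same
# routine routing workflow. RPA figures follow the ratios cited above;
# agentic figures are midpoints of the ranges in this section. Illustrative only.

TASKS_PER_YEAR = 1_000_000  # assumed annual volume of routine routing tasks

def rpa_tco_3yr(annual_license: float = 80_000.0) -> float:
    implementation = 3.5 * annual_license        # 3:1 to 4:1 implementation-to-license ratio
    annual_maintenance = 0.40 * implementation   # roughly 40% of implementation, every year
    per_task = 0.001                             # deterministic logic, no inference cost
    return 3 * annual_license + implementation + 3 * annual_maintenance + 3 * TASKS_PER_YEAR * per_task

def agent_tco_3yr() -> float:
    implementation = 120_000.0            # midpoint of the $40K to $200K range above
    observability_build = 125_000.0       # midpoint of the $50K to $200K guardrails and eval range above
    annual_observability_ops = 100_000.0  # assumed permanent operational overhead, no RPA equivalent
    per_task = 0.15                       # agentic routing cost per task, from Table 3.1
    return implementation + observability_build + 3 * (annual_observability_ops + TASKS_PER_YEAR * per_task)

if __name__ == "__main__":
    rpa, agent = rpa_tco_3yr(), agent_tco_3yr()
    print(f"RPA 3-year TCO:     ${rpa:,.0f}  (${rpa / (3 * TASKS_PER_YEAR):.3f}/task)")
    print(f"Agentic 3-year TCO: ${agent:,.0f}  (${agent / (3 * TASKS_PER_YEAR):.3f}/task)")
```

Neither column is cheap. The difference is that the agentic column's observability line never goes away, and its per-task line scales with every additional unit of volume.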
When the Objective Function Is Wrong
There is a failure mode that does not show up in the suitability tests or the cost model until it is too late: optimizing for the wrong thing.
Klarna, the Swedish fintech company that lets consumers split purchases into installments, deployed an AI assistant in customer service that handled the workload of roughly seven hundred agents. Cost per interaction dropped by roughly forty percent. On the metrics they chose, volume handled, average handle time, cost per interaction, the project looked like a clear win.
Over time, a pattern emerged. In the complex, emotionally loaded cases that determined whether high-value customers stayed or left, the assistant was quietly underperforming. It made choices that preserved short-term efficiency at the expense of long-term trust. By mid-2025, Klarna had deliberately shifted work back to human agents for those interactions. CEO Siemiatkowski was direct about the lesson: cost was too predominant an evaluation factor, and the result was lower quality.1
Klarna did not misjudge AI’s capability. It misjudged the total cost of quality degradation in edge cases and the irreversibility of bad customer interactions in the moments that mattered most. All four suitability tests had looked favorable. The fourth condition, recoverability, was harder to meet than the model assumed, at a cost that did not appear in the original dashboard. The pattern is one you will see repeatedly in the field. The agent technically did what it was asked to do, and the surrounding environment did not reward the behavior required for long-term value. Recoverability is the blind spot. Volume, measurability, and bounded tools sit on the surface of the business case. Recoverability lives in the tail of the distribution, where the metrics the team chose do not look. The system worked. The evaluation framework did not.
There is a Channel 2 dimension to the Klarna case that the original cost-model conversation rarely surfaces. Chapter 1 named it: the supervisor population the agent reshapes is often the population you would need to detect the failure mode you were not measuring. After twelve months of agent operation had filtered the case mix the human supervisors saw, the supervisors who would have caught the silent quality degradation in high-stakes cases were calibrating against a sample the agent had already pre-filtered. The cases the agent handled invisibly badly never reached the escalation queue, so no one could calibrate against them. Recoverability is harder to model than the four suitability tests imply, and the supervision paradox is part of why.
Suitability and cost can both be green while the thing you are optimizing for quietly destroys value at the edges, and while the population that would otherwise have noticed has been gradually retrained to look elsewhere.
One reconstruction of the Klarna numbers is worth carrying with you. The original February 2024 press release claimed forty million dollars in profit improvement from the AI assistant, framed in coverage as savings against existing agents. The figure was technically an estimate of cost avoidance from not hiring additional agents to handle growth volume, not savings from firing existing agents. Klarna’s headcount reduction was attributed primarily to attrition and hiring restraint. That nuance was absent from most of the coverage, including coverage that PMs were citing in their own business cases six months later.
Also absent from the original forty-million-dollar claim: the operational cost of AI APIs and infrastructure. The cost of observability, monitoring, and eval infrastructure. The customer satisfaction degradation on complex cases. The customer retention impact. The escalation premium, the higher cost of human resolution after AI has already failed and the customer is frustrated. The compliance risk and redesign cost. The rehiring and retraining cost. The institutional knowledge loss. None of those were in the press release. All of them showed up in production.
The reconstructed verdict is more honest than either the press release or the walkback alone. The forty-million-dollar claim holds as directionally correct against a counterfactual in which Klarna would have hired roughly seven hundred additional agents at fifty-seven thousand dollars fully loaded annual cost. The claim is substantially overstated against a hybrid deployment, AI for tier-one simple queries, humans for complex and emotional cases, where production data shows hybrid models with sixty to seventy percent AI and thirty to forty percent human routing close the CSAT gap to less than one-tenth of a point while maintaining roughly seventy-one percent cost reduction. Klarna optimized for one hundred percent automation when the evidence supports sixty-five to seventy-five percent as the value-maximizing deployment point. The cost model that selects the correct deployment point is not the same cost model that maximizes cost reduction. Most business cases do not yet distinguish between them.
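The distinction between the cost-minimizing and the value-maximizing deployment point can be made concrete with a small sweep over the AI-handled share. The per-resolution costs below come from Table 3.3; the share of complex cases and the retention value lost when the agent mishandles one are loud assumptions.

```python
# Sweep the AI-handled share to find the deployment point that maximizes
# value rather than cost reduction alone. Per-resolution costs are the
# Table 3.3 figures; the complex-case share and the retention value lost
# per mishandled complex case are assumptions.

AI_COST, HUMAN_COST = 0.62, 7.40
COMPLEX_SHARE = 0.30     # assumed share of contacts that are complex or emotionally loaded
RETENTION_LOSS = 25.00   # assumed downstream value lost per complex case the AI mishandles

def cost_per_contact(ai_share: float) -> float:
    return ai_share * AI_COST + (1 - ai_share) * HUMAN_COST

def quality_loss_per_contact(ai_share: float) -> float:
    # The penalty only accrues once the AI share eats into the complex tail.
    complex_handled_by_ai = max(0.0, ai_share - (1 - COMPLEX_SHARE))
    return complex_handled_by_ai * RETENTION_LOSS

if __name__ == "__main__":
    for share in (0.50, 0.65, 0.70, 0.85, 1.00):
        cost = cost_per_contact(share)
        total = cost + quality_loss_per_contact(share)
        print(f"AI share {share:.0%}: cost ${cost:.2f}, cost plus quality loss ${total:.2f}")
```

Under these assumptions the cost column keeps falling all the way to full automation while the value column bottoms out near seventy percent, which is exactly the distinction the paragraph above is drawing.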
Earned vs. Scheduled Autonomy
One sidebar before the chapter closes, because the suitability and cost questions are not the only places teams quietly skip the rigorous version. The third place is when, after deployment, the agent is moved up the autonomy ladder.
Chapter 2 introduced the autonomy ladder as an earned, not scheduled, climb. Here is what scheduled autonomy looks like in production, and what the alternative requires.
In 2025, Doctronic, a Utah-based clinical AI vendor, designed a workflow in which a physician supervised a chronic-disease prescription renewal agent for the first two hundred and fifty cases. After two hundred and fifty supervised renewals, the physician was removed from the loop, and the agent operated autonomously. The threshold was a count. Two hundred and fifty.2
Two hundred and fifty supervised renewals tells you how many times a physician said yes to the agent. It does not tell you whether the agent correctly handles the edge cases where autonomous action is inappropriate. It does not tell you whether the agent has a recovery mechanism for when it is wrong. It does not tell you what fraction of the two hundred and fifty cases were the easy ones the agent would have gotten right under any framework. A schedule is a schedule. It is not a safety criterion.
The closed-loop insulin pump is the right precedent for what earned autonomy looks like in operation. The artificial pancreas is not granted autonomy because it has been observed for thirty days. It is granted autonomy because the system has a real-time physiological feedback loop architecturally embedded: the sensor reads continuously, the model acts, and when the model errs the body signals it within minutes. The feedback loop is the recovery workflow.
Earned autonomy in any agentic context requires three properties. Demonstrated competence in the specific failure modes that matter for this decision type, not average performance across all cases. A real-time or near-real-time signal that catches errors before they compound. And a defined path back down the ladder if competence is not maintained.
Movement up the autonomy ladder is triggered by demonstrated competence in the specific failure modes that matter for this decision type, not by reaching a review count or a date on the calendar. Earned autonomy requires demonstrated failure-mode coverage, a real-time error signal, and a defined demotion path. Scheduled autonomy substitutes counts and calendar events for the safety criteria. Counts and calendar events are not safety criteria.
The PM question before moving up the ladder is not “how many times has it been right?” The PM question is “does the agent know which cases it should not handle, and what happens when it does not?”
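Written down, an earned promotion gate is a short checklist with teeth. Everything in the sketch below, the field names and the thresholds alike, is a hypothetical illustration of the three properties above, not a standard.

```python
# A promotion gate for moving an agent up one autonomy rung. The three
# checks mirror the three properties above; every field name and threshold
# is a hypothetical illustration.

from dataclasses import dataclass

@dataclass
class AutonomyEvidence:
    # Competence on the failure modes that matter, not average performance.
    failure_mode_pass_rates: dict[str, float]   # e.g. {"contraindicated_renewal": 0.99, ...}
    # A signal that catches errors before they compound.
    median_error_detection_minutes: float
    # A defined path back down the ladder.
    demotion_trigger_defined: bool

def earned_promotion(e: AutonomyEvidence,
                     min_pass_rate: float = 0.98,             # assumed per-failure-mode bar
                     max_detection_minutes: float = 60.0) -> bool:
    """True only if all three earned-autonomy properties are demonstrated."""
    covers_failure_modes = (
        len(e.failure_mode_pass_rates) > 0
        and all(rate >= min_pass_rate for rate in e.failure_mode_pass_rates.values())
    )
    fast_error_signal = e.median_error_detection_minutes <= max_detection_minutes
    return covers_failure_modes and fast_error_signal and e.demotion_trigger_defined

if __name__ == "__main__":
    # 250 supervised approvals, but no failure-mode breakdown, a week to
    # notice an error, and no demotion path:
    scheduled = AutonomyEvidence({}, median_error_detection_minutes=10_080,
                                 demotion_trigger_defined=False)
    print("scheduled autonomy passes the gate:", earned_promotion(scheduled))  # False
```

Two hundred and fifty approvals would not move any field in this structure.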
Most of the agentic deployments running in production today have scheduled their autonomy. That is a governance gap, not a governance model. Chapter 4 develops the runtime artifacts (autonomy boundary, approval moment, audit surface, recovery workflow) that make earned autonomy implementable. The Chapter 3 point is only this: pre-build, you decide whether the problem is suitable. Pre-launch, you decide whether the agent has earned its starting rung. Post-launch, you decide whether it has earned the next one. Three decisions, one discipline.
Before You Commit
Healthcare is where AI is being stress-tested under the most demanding conditions: highest stakes, most rigorous evaluation requirements, least tolerance for confident wrong answers. Clinical practice does not deploy a procedure just because it works in general. It begins with a candidacy assessment: indication, contraindication, risk-benefit ratio at the patient level. The procedure may be excellent. The question is whether this patient, at this moment, is the right candidate.
The same logic applies to agentic AI. The four suitability conditions are the candidacy assessment at the problem level. If any cannot be answered with confidence, the problem is not ready for an agent. If all four hold, one cost question remains: at what task volume does this agent break even against the current best alternative, and will you realistically reach that volume within twelve months?
If neither question has a clear number, you are not ready to build. You are ready to do more scoping.
An agent deployed into the wrong problem does not just fail to create value. It creates a specific kind of damage: automated confidence applied at scale to decisions that required judgment. That is harder to recover from than a project that was never started.
A last word before moving on. Suitability and cost are both probabilistic questions. You are not predicting whether the agent will work. You are predicting the shape of the distribution of outcomes, and whether the organization can absorb the tail. Most cost models also miss Channel 2 entirely. The supervisory interface, the training time, the incident response, the governance overhead, all of it is cost, and none of it is in the vendor quote. If those costs are not in your model, your model is not finished. It is a demo with a launch date in a different disguise.
The Five Conditions for Agentic Wins
One classification framework, drawn from the patterns visible across enterprise deployments in 2025 and 2026, is worth memorizing as a final filter before the build decision. Agentic AI wins, on combined cost and outcome, when all five of these conditions hold simultaneously.
(1) The enterprise already holds the underlying platform license, so the agentic floor price is incremental rather than greenfield. (2) The tasks involve natural language, judgment, or unstructured input, not purely deterministic logic. (3) The failure consequences are recoverable and detectable within the same session, not deferred and silent. (4) The organization has supervision capacity, including an eval pipeline, an observability stack, and a defined human escalation path. (5) The outcome is measurable at resolution level, not just task-completion level (the procurement-agent failure mode from Chapter 6 is precisely the case where task completion looked green while resolution never happened).
If any one of those five fails, the agent is not the right answer for this problem at this moment. Traditional automation, RPA, or constrained AI without action wins instead, and the cost differential is not marginal. It is structural. A team that builds an agent into a problem that fails any of the five is not making a borderline call. They are making a category error.
Three artifacts to carry into Chapter 4. The four suitability tests as the pre-build candidacy assessment. The cost model with its five token-bill drivers, its brownfield-versus-greenfield floor-price inversion, and its 1.5x correction. And the earned-versus-scheduled autonomy distinction, which the runtime design artifacts in Chapter 4 are built to support.
Notes
1. Klarna’s 2024 deployment of an AI customer service assistant and the 2025 partial reversal toward human agents for complex emotional cases is publicly documented in Klarna investor communications and CEO statements (Siemiatkowski, 2025). Chapter 1 covers the supervision-paradox dimension of the same case (the supervisor population reshaped by twelve months of agent-filtered escalations).
2. The Utah Doctronic case is treated at length in Friedman, “Utah Climbed the Autonomy Ladder. Nobody Designed the Rungs,” data-decisions-and-clinics.com, 2026. The two-hundred-and-fifty-supervised-renewal threshold is documented in the company’s public regulatory filings; the framework critique developed in the article is summarized in this chapter and revisited in Chapter 4 (autonomy boundary as runtime artifact) and Chapter 9 (the boundary map and the when-wrong spec).
3. Cost-model sources synthesized for this chapter include HfS Research on RPA licensing economics, Deloitte’s Global RPA Survey, EY’s “Get ready for robots” analysis, Gartner’s February 2026 outlook predicting cancellation of forty percent or more of agentic AI projects by 2027, the Yao et al. (arXiv:2406.12045) finding that pass@8 reliability for GPT-4-class agents in retail customer service runs under twenty-five percent, Tipirneni et al.’s NEJM AI 2024 DAX Copilot RCT, and the 2026 Digital Applied Customer Service AI Statistics dataset for production benchmarks. Vendor pricing references are drawn from SAP’s Joule documentation, Salesforce’s Agentforce and Flex Credits pages, ServiceNow’s April 2026 platform consolidation announcement, and Microsoft Learn’s Copilot Studio billing rates. Klarna case material is drawn from the February 27, 2024 Klarna press release and the May 8, 2025 Bloomberg interview with CEO Sebastian Siemiatkowski. The interactive cost calculator embedded in this chapter uses these sources for the default parameter values.