Part I · Decide · Chapter 3

Chapter 3: Not Every Problem Deserves an Agent

Whether to build an agent is a decision, gated by four tests and a real cost model, made before a line of code. Two things in the standard version of that decision need fixing: the measure of success has to be one you actually trust, not merely one you can compute, and the cost that decides the case is set by your architecture, not by which model you choose.

The cheapest agent is the one you decide not to build. Most agentic failures I have seen did not happen in production; they happened at the moment someone confused “we can build this” with “we should,” and the confusion is more tempting now than ever, because building is nearly free and a demo is nearly instant. This chapter is the discipline that goes in front of that temptation: a small set of tests, and an honest cost model, applied before the team falls in love with a prototype.

There is a name for what the discipline prevents, and it will recur through the book: the MVP house of cards. The MVP was supposed to be a decision tool. Build the smallest thing that tests an assumption, measure, then decide: persevere, pivot, or stop. The load-bearing word in that sequence is decide, and the decision that matters most is stop. What happens in practice is that stop leaves the table. The business case is sold, the headcount is allocated, the board has seen the deck, so persevere becomes the only acceptable outcome and the test balloon quietly becomes a load-bearing structure. Then a second MVP consumes the output of the first, a third orchestrates the two, and before anyone draws an architecture diagram there is a compound system in production that no one designed as a system, each card labeled “MVP,” each one load-bearing. The house stands until one layer drifts, and then it does not fall dramatically; it fails quietly, over months, because no one built the monitoring and no one owns the system as a system.

This is why the suitability question comes before the build, not after it. The discipline is not anti-MVP. It is a redefinition of what minimum has to protect: minimum viable safety, the failure modes and escalation paths defined before any exposure; minimum viable architecture, the data contracts between layers and what happens when an upstream layer degrades; minimum viable evidence, a real stop trigger at each stage; and minimum viable honesty, calling a demo a demo and an experiment an experiment rather than letting MVP language disguise which one you are actually building. The tests in this chapter are how you decide whether you are laying a slice of architecture or adding another card.

The two questions before any code

Before the suitability tests, two questions that kill most bad agent ideas on their own. First: is there a real, repeating problem here, or is “let’s add an agent” the actual driver? An astonishing share of agent proposals are solutions shopping for a problem, and the tell is that no one can state the user pain without mentioning AI. Second: if this works perfectly, what changes for the customer? If the honest answer is “not much,” you have a technology project, not a product, and it will fail in production for reasons that have nothing to do with the model.

Pass those two and you have earned the right to the four tests.

The four suitability tests

An agent is the right answer only when four conditions hold at once. Three are from the standard assessment; the fourth I have revised, because experience showed the original was too weak.

The Four Suitability Tests. Build an agent only when all four hold:

Repeats at volume. The task happens often enough that automating it pays back the design and supervision cost. A once-a-quarter task rarely clears the bar.
Bounded tool use. The actions the agent needs are finite and definable. If “do whatever it takes” is the scope, you cannot reason about the blast radius.
Recoverable consequences. When the agent is wrong, the damage can be caught and undone. Irreversible actions demand a human in the loop regardless of volume.
Measurable and trusted. You can define what “good” output is, and the measure is one you would stake a decision on. A metric you can compute but do not trust is worse than none, because it green-lights an agent you cannot actually evaluate.

The revision to the fourth test is the one that matters. The original said “measurable,” and teams duly produced a metric (an eval pass rate, a confidence score, a satisfaction proxy) and treated its existence as the green light. But a metric you do not trust is a trap: it makes the agent look governable while hiding that you cannot actually tell good output from bad. The clinical version is stark: a model scores well on a benchmark that does not capture the failure that kills the patient. The same trap exists in every domain. If you cannot point to a measure you would bet on, the task is not yet suitable, however measurable it looks.
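To make the revised fourth test concrete, here is a minimal sketch of the gate in Python. The class and field names are hypothetical, invented for illustration; the only point it encodes is that "measurable" and "trusted" are separate answers, and the fourth test needs both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuitabilityTests:
    """Answers to the four tests for one candidate agent (hypothetical structure)."""
    repeats_at_volume: bool
    bounded_tool_use: bool
    recoverable_consequences: bool
    measurable: bool       # a metric exists and can be computed
    metric_trusted: bool   # you would stake a decision on that metric


def failed_tests(t: SuitabilityTests) -> list[str]:
    """Return the names of the tests that fail; an empty list means all four hold."""
    failures = []
    if not t.repeats_at_volume:
        failures.append("repeats at volume")
    if not t.bounded_tool_use:
        failures.append("bounded tool use")
    if not t.recoverable_consequences:
        failures.append("recoverable consequences")
    # The fourth test needs both halves: a computable metric alone does not pass.
    if not (t.measurable and t.metric_trusted):
        failures.append("measurable and trusted")
    return failures


# A candidate with a metric nobody would bet on.
candidate = SuitabilityTests(True, True, True, measurable=True, metric_trusted=False)
print(failed_tests(candidate))  # ['measurable and trusted']
```

Under the original wording this candidate would have passed; under the revision it does not.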

What the token bill actually measures

Now the cost model, and here is the finding that reorders most teams’ intuition. The sticker price of a model, the dollars per million tokens, is the first number and almost never the deciding one. Token prices have fallen roughly tenfold a year; a frontier chat model that cost around thirty-six dollars per million tokens in early 2023 is now one to two dollars, with budget models a fraction of that. If your decision turns on the model’s list price, you are optimizing the wrong variable.

The real bill is set by architecture. An agent does not make one model call; it makes many, re-entering the model as context grows, calling tools, retrying when something fails. For a typical agent this multiplies the base token cost by five to fifty times per task. For agentic coding workflows it runs one hundred to five hundred times. The same task on the same model can cost a cent or a dollar depending entirely on how the loop is built. So the cost question is not “which model is cheapest.” It is “how many times will this architecture re-enter the model, and how long is the context each time,” and that is a design decision you own, not a price you shop. (The runaway version of this, an architecture with no ceiling, is its own chapter; see operational guardrails.)
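A rough way to see where the multiplier comes from is to count input tokens across the loop. The sketch below assumes the simplest possible architecture: every step re-sends the whole context, and the context grows by a fixed amount per step. It ignores output tokens, caching, and retries, and every number in it is illustrative, not measured.

```python
def loop_input_tokens(base_context: int, growth_per_step: int, steps: int) -> int:
    """Total input tokens when every step re-sends the whole, growing context."""
    return sum(base_context + i * growth_per_step for i in range(steps))


def architecture_multiplier(base_context: int, growth_per_step: int, steps: int) -> float:
    """Loop input tokens relative to a single call on the base context."""
    return loop_input_tokens(base_context, growth_per_step, steps) / base_context


# Hypothetical task: 2,000-token starting context, 1,000 tokens added per step.
print(architecture_multiplier(2000, 1000, steps=1))   # 1.0
print(architecture_multiplier(2000, 1000, steps=5))   # 10.0
print(architecture_multiplier(2000, 1000, steps=10))  # 32.5
```

Doubling the step count more than triples the bill here, because the context being re-sent is itself growing. Same model, same list price; only the loop changed.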

The break-even. An agent is worth its cost when the value of each task exceeds its fully loaded per-task cost (tokens times the architecture multiplier, plus the human supervision the agent still requires) and the task runs often enough for that margin to repay the cost of designing it. Teams compare the model’s sticker price to a human’s hourly rate and conclude the agent is obviously cheaper. The full comparison includes the 5-to-50× multiplier and the cost of the human who must still watch it. Often the agent still wins. Sometimes, once you count it all, it does not, and that is a finding worth having before you build.
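The break-even arithmetic is short enough to write down. This is a sketch under stated assumptions: the function names are mine, and the dollar figures are hypothetical inputs chosen only to show how the sticker price and the fully loaded cost diverge. It compares value and cost over the same monthly volume.

```python
def per_task_cost(base_token_cost: float, multiplier: float, supervision_cost: float) -> float:
    """Fully loaded cost of one task: tokens times the multiplier, plus supervision."""
    return base_token_cost * multiplier + supervision_cost


def monthly_margin(value_per_task: float, cost_per_task: float, tasks_per_month: int) -> float:
    """Value minus cost over the same monthly volume; positive means the agent pays."""
    return (value_per_task - cost_per_task) * tasks_per_month


# Hypothetical inputs, in dollars, for illustration only.
sticker = 0.01  # one model call at list price
cost = per_task_cost(sticker, multiplier=20, supervision_cost=0.50)

print(f"sticker price per task: ${sticker:.2f}")
print(f"fully loaded per task:  ${cost:.2f}")
print(f"margin at $1.00 value, 10,000 tasks: ${monthly_margin(1.00, cost, 10_000):,.0f}")
print(f"margin at $0.60 value, 10,000 tasks: ${monthly_margin(0.60, cost, 10_000):,.0f}")
```

With these made-up numbers the supervision term, not the token term, dominates the per-task cost, and the same agent flips from paying to losing when the task's value drops from a dollar to sixty cents.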

The brownfield inversion

One counterintuitive cost effect is worth naming. In greenfield, a new product, the floor price is whatever the agent costs to run. In brownfield, replacing a step in an existing system, the floor can invert: the agent has to be cheaper than the marginal cost of the human step it replaces, and cheaper by enough to cover the integration, change-management, and supervision costs the existing system imposes, which a greenfield build never paid. The result is that the same agent can clear the bar as a new product and fail it as a replacement, because the brownfield deployment carries costs the greenfield one does not. Decide which you are in before you run the math.
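The inversion can be sketched as two different comparisons run on the same agent. The function names, the amortization over a fixed task horizon, and all figures below are assumptions made for illustration; a real model would need your own transition costs and horizon.

```python
def clears_greenfield(value_per_task: float, agent_cost_per_task: float) -> bool:
    """New product: the agent only has to cost less than the value it delivers."""
    return value_per_task > agent_cost_per_task


def clears_brownfield(
    human_marginal_cost: float,
    agent_cost_per_task: float,
    transition_cost: float,      # integration plus change management, one-time
    tasks_over_horizon: int,     # tasks the transition cost is spread across
    added_supervision: float = 0.0,
) -> bool:
    """Replacement: the agent plus its amortized transition cost must undercut the human step."""
    amortized = transition_cost / tasks_over_horizon
    return agent_cost_per_task + added_supervision + amortized < human_marginal_cost


# Same hypothetical agent, $0.70 per task, evaluated both ways.
print(clears_greenfield(value_per_task=1.00, agent_cost_per_task=0.70))
print(clears_brownfield(
    human_marginal_cost=0.90,
    agent_cost_per_task=0.70,
    transition_cost=50_000,
    tasks_over_horizon=100_000,
))
```

The first call returns True and the second False: the agent is cheaper than the human step on its own, but not once fifty cents per task of transition cost is added to it.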

Earned versus scheduled autonomy

The last discipline connects back to the Autonomy Ladder. Once you decide to build, the question of how much autonomy to grant has a right answer and a common wrong one.

Utah Doctronic: autonomy on a schedule. A telehealth service moved its AI toward autonomous prescribing by a timetable, advancing the system’s authority on a schedule rather than on demonstrated competence at the current level. Autonomy that advances because the calendar says so, rather than because the system has earned the next rung, is autonomy granted blind. The lesson is the rule: you move up the Autonomy Ladder on evidence the system behaves at its current rung, never on a date.

The counter-case is just as instructive about going too far and walking it back.

Klarna: the rebalance. Klarna deployed an AI assistant it described as doing the work of roughly 700 human agents, a genuine and widely cited efficiency. Then it rebalanced, bringing human support back into the loop for the cases the agent handled worse than advertised. The story is not “the agent failed.” It is that the right autonomy level was discovered in production and corrected, which is what mature deployment looks like. The failure would have been refusing to walk it back to protect the headline.
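Both cases reduce to one rule that can be sketched as a function: the rung changes on evidence, in either direction, and the function has no date parameter at all. The integer rungs, the thresholds, and the names below are hypothetical; this chapter does not define the ladder's rungs numerically, and real thresholds would come from your own evals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RungEvidence:
    """Observed behavior at the current rung (hypothetical shape)."""
    tasks_reviewed: int
    error_rate: float


def next_rung(
    current: int,
    evidence: RungEvidence,
    *,
    min_tasks: int = 500,          # assumed minimum sample before promotion
    promote_below: float = 0.01,   # assumed error rate that earns the next rung
    demote_above: float = 0.05,    # assumed error rate that forces a walk-back
    top_rung: int = 4,
) -> int:
    """Move up or down the ladder on evidence only; the calendar is not an input."""
    if evidence.error_rate > demote_above:
        return max(current - 1, 0)         # walk it back
    if evidence.tasks_reviewed >= min_tasks and evidence.error_rate < promote_below:
        return min(current + 1, top_rung)  # earned the next rung
    return current                         # not enough evidence either way


print(next_rung(2, RungEvidence(tasks_reviewed=1000, error_rate=0.005)))  # 3
print(next_rung(2, RungEvidence(tasks_reviewed=50, error_rate=0.0)))      # 2
print(next_rung(2, RungEvidence(tasks_reviewed=1000, error_rate=0.08)))   # 1
```

The middle case is the scheduled-autonomy trap in miniature: a clean record over fifty tasks is not yet evidence, so the rung holds no matter how much time has passed.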

A worked suitability sheet

The chapter’s practical output is a sheet you fill in before building, not a principle to nod at. For one candidate agent: state the repeating problem in one sentence with no mention of AI; list the bounded set of tools; name what happens when it is wrong and whether that is recoverable; name the measure of good output and whether you would stake a decision on it; estimate the architecture multiplier and the fully loaded per-task cost against the task’s value and volume; and place it on the Autonomy Ladder at the rung it has earned. If any line is blank or any test fails, you have not necessarily killed the idea, but you have found exactly where it is weak, and you owe yourself one written sentence on why you are building it anyway. Whether that sentence would survive being read aloud to your CFO is a fair test of whether you believe it.

Take one agent through the sheet, and keep it; this book returns to it. A support team wants to automate refunds: a customer asks for their money back, and today a human reads the request and decides. It repeats at volume, which passes the first test. The tool set is bounded (read the order and account, issue a refund, or escalate), which passes the second. Consequences are mostly recoverable, since a wrong refund can be clawed back within a window; that passes the third, but only conditionally, and that condition is where the boundary will later have to sit. And the measure of good output is the hard one: you can measure whether a refund was issued, but “issued” is not “correct,” and you would not stake the company’s margin on the agent’s judgment about the truly ambiguous cases, so the fourth test passes only if you build the evals and the supervision that make the output trusted, not merely measurable. On the Autonomy Ladder it earns, at most, the rung where it resolves the clear cases and escalates the rest. That is a buildable agent, and a revealing one, because every place the sheet came back “yes, but” is a place the supervisory layer will have to do work. Hold onto the refund agent; it reappears as the worked example runs through the design, operation, and human-system chapters, and it is the agent whose two briefs close the book in Appendix A.
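If it helps to keep the sheet somewhere it can be checked, a minimal version fits in a few lines of Python. The structure and names here are hypothetical, one possible encoding and not a prescribed format. The refund agent's answers are taken from the walkthrough above; its cost line is left blank on purpose, because no multiplier or per-task figure has been estimated for it yet.

```python
from dataclasses import dataclass


@dataclass
class SheetLine:
    """One line of the sheet: the answer, whether it passes, and any 'yes, but' condition."""
    answer: str = ""
    passes: bool = False
    condition: str = ""  # non-empty means the line passes only conditionally


def weak_points(sheet: dict[str, SheetLine]) -> list[str]:
    """List every line that is blank, failing, or passing only conditionally."""
    notes = []
    for name, line in sheet.items():
        if not line.answer.strip():
            notes.append(f"{name}: blank")
        elif not line.passes:
            notes.append(f"{name}: fails")
        elif line.condition:
            notes.append(f"{name}: yes, but {line.condition}")
    return notes


refund_agent = {
    "problem": SheetLine("Customers ask for their money back and a human decides each request.", True),
    "tools": SheetLine("read order and account; issue refund; escalate", True),
    "when wrong": SheetLine("a wrong refund is issued", True, "clawback only works within a window"),
    "measure": SheetLine("refund correctness", True, "needs evals and supervision to be trusted"),
    "cost": SheetLine(),  # multiplier and per-task cost not yet estimated
    "rung": SheetLine("resolve the clear cases, escalate the rest", True),
}

for note in weak_points(refund_agent):
    print(note)
```

The output is three lines: the two "yes, but" conditions and the blank cost line. That list is the work still owed before the agent is built.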
