Part I · Foundations · Chapter 2

Deciding What to Build, and Whether To

The most expensive decision in an agentic product is the one nobody questioned: whether to build it at all. I have sat in the meeting where a roomful of capable people approved a project in twenty minutes, because the demo was good and the budget was there and the future was clearly pointing this way, and not one of us asked whether the thing was worth building, because asking felt like doubting the future. The project shipped. It failed, slowly and expensively, in exactly the way someone in that room could have predicted if the room had been built to let them say so. Everything in this chapter happens before a line of production code is written, because the front of the pipeline is where agentic products are won or lost, and the team that treats planning as the boring part it rushes through to get to the building is the team that builds the wrong thing quickly. There are three disciplines here: deciding whether to build, writing down what to build, and proving the bet without shipping it.

The cheapest agent is the one you don’t build

Most of the product failures I have watched did not happen in production. They happened months earlier, in the meeting where someone confused “we can build this” with “we should,” and no one had either the standing or the framework to say no. So the first discipline runs before any building: deciding what not to build, and being a team where someone is allowed to say so.

Two questions come before any framework. The first is whether there is a real, repeating problem here, or whether “let’s add an agent” is the actual driver. The tell is simple and damning: no one can state the user’s pain without mentioning AI. If the problem can only be described in terms of the solution, there is no problem, there is a technology looking for a place to land. The second question is what changes for the customer if this works perfectly. If the honest answer is “not much,” you do not have a product, you have a technology project, and it will be counted a success by everyone except the people it was supposed to help.

If those two questions survive, four tests decide suitability, and each one, failed, is a different kind of disaster. The task should repeat at volume; a once-a-quarter task rarely clears the bar, because the cost of building and supervising an agent is only justified by a problem that recurs enough to amortize it. The agent’s tool use should be bounded; if the scope is “do whatever it takes,” you cannot reason about the blast radius, and an agent whose blast radius you cannot reason about is one you cannot safely supervise. Its consequences should be recoverable; an irreversible action demands a human in the loop regardless of how high the volume is, because the whole logic of autonomy is that mistakes are cheap to undo, and where they are not, autonomy is not a saving, it is a loaded gun. And the output should be measurable and trusted, which is the test that matters most and the one teams get wrong. It is not enough that you can compute a number for “good.” You have to trust that number enough to stake a decision on it. A metric you can calculate but do not believe is worse than no metric, because it makes the agent look governable while hiding that you cannot actually tell good output from bad. The measurable-but-untrusted agent is the one that passes every dashboard and fails every customer.
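The four tests can be written down as a checklist the room fills in together. What follows is a minimal sketch, not a prescribed tool: the class name, field names, and finding messages are all hypothetical, and the only design point it tries to show is that "measurable" and "trusted" are recorded as separate answers, so a metric that can be computed but is not believed still fails.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuitabilityAnswers:
    """The room's answers for one candidate agent (all names are illustrative)."""
    repeats_at_volume: bool         # the task recurs often enough to amortize the build
    tool_use_bounded: bool          # the blast radius can be reasoned about
    consequences_recoverable: bool  # mistakes are cheap to undo
    output_measurable: bool         # a number for "good" can be computed
    metric_trusted: bool            # the team would stake a decision on that number


def suitability_findings(a: SuitabilityAnswers) -> list:
    """Return one finding per failed test; an empty list means all four passed."""
    findings = []
    if not a.repeats_at_volume:
        findings.append("volume: task does not recur enough to amortize build and supervision")
    if not a.tool_use_bounded:
        findings.append("bounds: blast radius cannot be reasoned about")
    if not a.consequences_recoverable:
        findings.append("recoverability: irreversible actions need a human in the loop")
    if not a.output_measurable:
        findings.append("measurement: no way to compute what good output is")
    elif not a.metric_trusted:
        # Computable but not believed: the dashboard passes, the customer does not.
        findings.append("trust: metric is computable but not believed")
    return findings


# An agent that passes every dashboard: measurable, but nobody trusts the number.
dashboard_darling = SuitabilityAnswers(True, True, True, True, False)
print(suitability_findings(dashboard_darling))
```

A non-empty list is not a verdict by itself; it is the agenda for the argument the room still has to hold.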

The recoverable-consequences test smuggles in a question the suitability conversation usually skips, and it is worth pulling out into the open, because skipping it is the origin of the worst failures an agentic product produces. When the consequences are not recoverable, the obvious next question is who absorbs them, and the answer is almost never the person in the room. A software team designs for the user, the person who operates the product, clicks its buttons, appears in its analytics. An agentic product has a second person, and they are usually the one the product is actually for: the applicant whose loan the agent denied, the patient whose case the agent triaged, the candidate the agent screened out, the supplier whose invoice the agent rejected. Call them the affected person. They never see the interface. They are not in your funnel, not in your usability test, not in the room where suitability is decided. They are simply the one who lives with the agent’s decision, and when the agent is wrong about them, they are the one who pays. The cruelty of it is structural: the people inside the agent’s error rate are, almost by definition, the ones least able to make you notice, because they are not your users and have no channel to you. A suitability decision made with only the user in view is a decision made with the most important person absent. Deciding whether to build the agent at all means deciding whether you can be accountable to someone who will never be in your analytics, and naming, before you build, who on the team holds that person’s interest when no metric will surface them on its own.

There is a cost trap underneath all of this, and it changes who needs to be in the room. The real bill for an agentic product is not set by the model’s price per token. It is set by the architecture, because an agent does not call the model once, it re-enters the model many times per task, five to fifty times for an ordinary agent and a hundred to five hundred for one that writes code, reasoning and re-reasoning and checking its own work. The cost question is not “which model is cheapest.” It is “how many times will this architecture re-enter the model,” and that is an architectural question, not a procurement one, which means the person who can answer it has to be present when suitability is decided. The same product can clear the bar comfortably as a greenfield build and fail it as a replacement for a system that already works, because the comparison is never agent versus nothing. It is agent versus the current best alternative, which might be a human with better tooling, and the agent has to beat that, not beat the empty field.
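The arithmetic behind that claim is short enough to sketch. The token count and price below are made-up inputs chosen only to make the multiplication visible; they are assumptions, not figures from any real price list. The call counts are the ranges given above.

```python
def cost_per_task(model_calls: int, tokens_per_call: int, price_per_million_tokens: float) -> float:
    """Model spend for one task: calls x tokens per call x price per token."""
    return model_calls * tokens_per_call * price_per_million_tokens / 1_000_000


# Illustrative inputs only (assumptions, not real prices or measurements).
TOKENS_PER_CALL = 4_000
PRICE_PER_MILLION = 5.00  # hypothetical currency units per million tokens

profiles = [
    ("ordinary agent, low", 5),
    ("ordinary agent, high", 50),
    ("coding agent, low", 100),
    ("coding agent, high", 500),
]

for label, calls in profiles:
    cost = cost_per_task(calls, TOKENS_PER_CALL, PRICE_PER_MILLION)
    print(f"{label:22s} {calls:4d} calls  {cost:6.2f} per task")
```

With the model and its price held fixed, the spread between the first row and the last is a factor of one hundred, and all of it comes from the call count. That is the sense in which the bill is set by the architecture rather than by procurement.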

Suitability is the first place the new collaboration shows its shape, because answering these questions well requires judgment no single role holds. Whether the problem repeats at volume is something the product owner knows. Whether the tool use can be bounded and what the architecture will cost to run is something the architect knows. Whether the consequences are recoverable, and what happens to the person on the receiving end when they are not, is something the domain expert knows. Whether the output can be measured and, harder, trusted, takes the eval owner and the domain expert together. The teams that treat suitability as the product manager’s solo call, decided in isolation and handed down, are the teams that discover in production what the architect would have told them in the meeting, if the meeting had included the architect and given them a real veto. And the architect is not the only one who needs that veto. The domain expert holds the interest of the affected person, the one who never sees the interface and lives with the decision, and an interest that can only advise is not held, it is noted. So the domain expert’s judgment about the affected person carries the same standing the architect’s carries about cost and blast radius: a real veto over whether the agent ships at the autonomy it was given, exercised before the build, not a worry recorded in the minutes and overridden by velocity. A team that lets the domain expert describe the harm but not stop the build has not seated the affected person at the table. It has invited them to watch. The cheapest agent is the one you decide not to build, and deciding not to build it is a team act, performed by a team that has arranged for more than one person to be able to say no.

The suitability tests decide whether to build. There is a second pass of the same tests that decides something this book will lean on for the rest of its length, and skipping it is what makes the supervision argument sound like overkill: not whether to build, but how much supervision this particular agent deserves. The honest version of this book’s thesis is not that every agent needs the full apparatus. It is that the apparatus should scale with the stakes, and the stakes are exactly what the suitability tests already measured. Run them a second time as a dial rather than a gate. How irreversible are this agent’s actions: a draft a human always edits sits at one end, a payment or a denial or a deletion at the other. How exposed is it to regulation: an internal brainstorming agent and a medical-necessity agent are not in the same world. And how high are the stakes for the affected person, the one who never sees the interface and lives with the decision: a wrong product recommendation and a wrong benefits determination are not the same wrong. An agent low on all three, reversible, unregulated, low-stakes for the person at the end, rationally gets a thin supervision column, and a book that demanded the full grid for a meeting-notes summarizer would deserve to be ignored. An agent high on any of the three earns the apparatus this book describes, and earns it before it ships, not after. The gradient is what keeps the argument honest in both directions: it concedes the products where light supervision is correct, which is precisely what makes the demand for heavy supervision credible on the products where it is not. The mistake the field makes is not under-supervising the summarizer. It is supervising the summarizer and the benefits agent the same thin amount, because no one ran the second pass and asked which kind of agent this was.
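The second pass can be sketched as a function of the three readings. The rule taken from the text is that an agent low on all three gets thin supervision and an agent high on any one earns the full apparatus; the three-step scale, the "intermediate" tier for middling readings, and every name here are assumptions added to make the sketch complete.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 0
    MEDIUM = 1   # assumed middle step; the text only describes the two ends
    HIGH = 2


def supervision_tier(irreversibility: Level,
                     regulatory_exposure: Level,
                     stakes_for_affected: Level) -> str:
    """Read the suitability tests as a dial: the highest single reading sets the tier."""
    worst = max(irreversibility, regulatory_exposure, stakes_for_affected)
    return {Level.LOW: "thin", Level.MEDIUM: "intermediate", Level.HIGH: "full"}[worst]


# A meeting-notes summarizer: reversible, unregulated, low stakes for the person at the end.
print(supervision_tier(Level.LOW, Level.LOW, Level.LOW))

# A benefits-determination agent: high on every axis, though one would have been enough.
print(supervision_tier(Level.HIGH, Level.HIGH, Level.HIGH))
```

Taking the maximum rather than an average is the design choice that matters: two comfortable readings cannot dilute one dangerous one.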

Two documents, two audiences

Once the room decides to build, the decision has to become an instruction, and here a thing that used to be allowed to blur has to come apart. Every product that gets built starts as two different things pretending to be one. There is the decision to build it, a bet made in a room full of people with budgets and doubts, and there is the instruction to build it, a precise description a system can act on. For most of the history of software we wrote these as one document, a spec, a PRD, a ticket, and got away with it because the human engineer on the other end filled the gap with judgment, read the loose requirement, understood the intent, and built the thing you meant rather than the thing you wrote. The agent does not do that. It builds the thing you wrote, exactly, at speed, and so the two documents come apart.

The first is the Human Brief, the successor to the PRD, the document a room argues with. It is written in prose, for people, and its job is to carry the decision: the problem and the opportunity, what we are building and what we are deliberately not, the hard cases that actually decide whether this is worth building rather than the easy ones that always work, the business case and the explicit go or no-go, where we are drawing the agent’s boundary and why, who is accountable and how we will know it is going wrong, and the questions we still owe an answer to. That last part matters more than it looks: a question named in the brief is one the room knows it must answer, and a question left out is one that gets answered by default, in code, by whoever ships first.

In practice the Human Brief is not written all at once, and seeing how it runs matters more than the list of its parts, because the order is the discipline. It comes in two passes around the go/no-go. The first pass is slim by design and exists only to make the bet: the problem and its size, what the thing is and the dangerous adjacent thing it must not become, the few hard cases that decide whether it is worth building, and the business case with the suitability tests and the cost model. That is the part a sponsor and a finance partner argue with, and most ideas die there, which is the point, because you do not want to have spent the rest of the brief on a thing that does not clear the gate. Only after a go does the second pass get written, and it is where the supervision gets designed: where the boundary sits and why, who is accountable and which instruments will signal drift, what success looks like and what it does not, and the open questions the room still owes. The reason to hold the second pass until after the decision is the same reason the hundred-and-ninety-eight-page strategy document described later in this chapter is a warning and not a model: planning is cheap insurance only when it is spent on the thing you have decided to build, and a team that designs the whole supervision apparatus for a product it then kills has simply moved the waste earlier. Decide on the slim brief; commit on the full one; and do not confuse the two, because the slim brief is a filter and the full brief is a blueprint, and a filter that has become a blueprint has stopped filtering.

The second is the Executable Brief, the successor to the epic, the document a system acts on. Where the Human Brief is prose, this is numbered and testable: the agent’s behavior written as requirements you can check, the acceptance scenarios in given-and-when-and-then form, the supervisory experience the human will use to oversee it, and, crucially, the governance written as its own numbered requirements, the Channel 2 rules stated as plainly and as testably as the Channel 1 behavior. It borrows its discipline from the way agentic code is now specified, numbered requirements, explicit markers where something is still unknown rather than silently assumed, scenarios that run as tests, and it adds the part those engineering practices do not give you, the supervisory content. The reason to write the governance into the Executable Brief, as numbered requirements alongside the behavior, is the most practical sentence in this chapter: the supervision gets built because it was written into the spec, not because someone remembered to. Channel 2 does not get built out of good intentions. It gets built because it was a requirement with a number on it that someone had to close.
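A fragment of such a brief might be held as data along these lines. This is a sketch under assumptions: the field names, the invoice example, and the helper function are invented for illustration, and no particular specification tool is implied. What it tries to show is the three habits named above: every requirement has a number, governance (Channel 2) sits in the same list as behavior (Channel 1), and an unknown is an explicit marker rather than a silent assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Scenario:
    given: str
    when: str
    then: str


@dataclass
class Requirement:
    number: int
    channel: int                    # 1 = agent behavior, 2 = governance
    text: str
    unknown: Optional[str] = None   # explicit marker: still undecided, not assumed
    scenarios: list = field(default_factory=list)


def not_yet_testable(brief: list) -> list:
    """Numbers of requirements that carry an unknown marker or have no scenario to run."""
    return [r.number for r in brief if r.unknown is not None or not r.scenarios]


# Hypothetical fragment for an invoice-handling agent.
brief = [
    Requirement(1, 1, "Flag invoices whose total differs from the purchase order.",
                scenarios=[Scenario("an invoice priced above its purchase order",
                                    "the agent processes it",
                                    "the invoice is flagged and not paid")]),
    Requirement(2, 2, "No invoice is rejected without a human approval.",
                scenarios=[Scenario("a flagged invoice",
                                    "the agent proposes rejection",
                                    "the rejection waits for an approver")]),
    Requirement(3, 2, "A rejected supplier can request a review.",
                unknown="Who handles the review, and within what time?"),
]

print(not_yet_testable(brief))
```

Requirement 3 is reported because its question is still open. That is the marker doing its job: the gap is visible in the spec instead of being answered later in code.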

This is also where the domain expert’s veto stops being a posture and becomes a mechanism, because a veto with nothing to attach to is just a strong opinion. The governance requirements that protect the affected person, the boundary that keeps an irreversible decision from being made unsupervised, the recovery path that lets a wrong determination be undone, the approval moment on the actions that fall hardest on the person at the end, are numbered requirements like any other, and the domain expert is their owner. Owning them means the same thing it means for any owner of a numbered requirement: the build is not done until they are closed, and they are closed when the person accountable for them says so, not when someone else decides the schedule needs them to be. That is the veto, written as a line item rather than carried as a feeling. The domain expert who can point at requirement fourteen and say it is not met, and have that block the release the way an unmet behavior requirement blocks it, holds the affected person’s interest in the only form a team actually respects, a thing that has to be closed before the thing ships.
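As a mechanism, the veto reduces to two rules: only the owner can close a requirement, and an open requirement blocks the release. The sketch below assumes a hypothetical in-memory record; the role names and the text of requirement fourteen are invented, and a real team would enforce the same rules in whatever tracker it already uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GovernanceRequirement:
    number: int
    text: str
    owner: str
    closed_by: Optional[str] = None   # None while the requirement is open


def close(req: GovernanceRequirement, signer: str) -> None:
    """Close a requirement; only the accountable owner may do it."""
    if signer != req.owner:
        raise PermissionError(
            f"requirement {req.number} can only be closed by {req.owner}")
    req.closed_by = signer


def release_blockers(reqs: list) -> list:
    """Numbers of requirements still open; any entry blocks the release."""
    return [r.number for r in reqs if r.closed_by is None]


# Hypothetical requirement fourteen, owned by the domain expert.
req_14 = GovernanceRequirement(
    14, "A wrong determination can be reversed by a human reviewer.",
    owner="domain_expert")

try:
    close(req_14, signer="delivery_lead")   # schedule pressure is not a sign-off
except PermissionError as err:
    print(err)

print(release_blockers([req_14]))
```

The attempt by anyone other than the owner fails, and the requirement stays in the blocker list until the domain expert closes it.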

You derive the second from the first, and both carry both channels. The split is not “the agent in one document and the supervision in the other.” Both briefs describe the agent and its supervision; what differs is the audience and the purpose, one to decide, the other to build. The most common way a good agent project goes wrong is that someone writes the second document and never writes the first, specifying the build in beautiful detail while never holding the argument about whether to build it, so the bet gets made by default, in the act of specifying, by people who thought they were just writing requirements. And on a team a fault line appears exactly at the hand-off between the two documents, where intent leaks: a boundary the room agreed to that never became a numbered governance requirement, a hard case discussed in the argument that never became an acceptance scenario, a question marked unknown that got quietly answered in code because no one carried it forward. The two briefs are the first place in the pipeline where the team’s judgment is written down, and the seam between them is the first place it can be lost.
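The three leaks named above can be checked mechanically if each boundary, hard case, and open question in the Human Brief carries an identifier that the Executable Brief refers back to. The sketch assumes exactly that, and the identifiers, class names, and field names are hypothetical; it is one possible traceability check, not a feature of any existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class HumanBrief:
    boundaries: set
    hard_cases: set
    open_questions: set


@dataclass
class ExecutableBrief:
    governance_covers: set    # boundaries traced to a numbered governance requirement
    scenarios_cover: set      # hard cases traced to an acceptance scenario
    marked_unknown: set       # questions still carried as explicit unknown markers
    answered_on_record: set = field(default_factory=set)  # questions the room resolved


def seam_leaks(human: HumanBrief, executable: ExecutableBrief) -> dict:
    """List what the room agreed to that never reached the Executable Brief."""
    carried = executable.marked_unknown | executable.answered_on_record
    return {
        "boundary without governance requirement":
            sorted(human.boundaries - executable.governance_covers),
        "hard case without acceptance scenario":
            sorted(human.hard_cases - executable.scenarios_cover),
        "question not carried forward":
            sorted(human.open_questions - carried),
    }


human = HumanBrief(
    boundaries={"B1-no-unsupervised-denial", "B2-read-only-ledger"},
    hard_cases={"H1-duplicate-invoice", "H2-partial-shipment"},
    open_questions={"Q1-review-deadline"},
)
executable = ExecutableBrief(
    governance_covers={"B1-no-unsupervised-denial"},
    scenarios_cover={"H1-duplicate-invoice", "H2-partial-shipment"},
    marked_unknown=set(),
)

for kind, items in seam_leaks(human, executable).items():
    print(kind, items)
```

In this invented example one boundary and one question were lost at the hand-off. An open question counts as carried forward only if it is still marked unknown or was answered on the record, which excludes the case where it was quietly answered in code.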

Build it to decide, not to ship

I joined SAP in 2004, onto the team building NetWeaver, the platform that would become SAP Business Technology Platform, and one of the first things I did there was help write a five-year strategy document for the company’s future technology stack. About twenty of us worked on it for the better part of a year. We flew tens of thousands of miles to sit with customers, ran competitive analyses, consulted analysts and partners, argued the vision with people who loved it and people who wanted nothing to change. What we produced was, by the standard of its day, magnificent: one hundred and ninety-eight pages, small font, dense with diagrams. I have always remembered one thing about it above all else. It took longer to read that document than it would later take to build the thing it described. That was not a failure. It was the doctrine working exactly as intended. The rule of the era was that planning should run something like three times as long as the building, because the expensive mistake was building the wrong thing, and a year of research was cheap insurance against a multi-year build going the wrong way.

That ratio has quietly inverted, and the inversion is the whole point of this section. The principle did not die: good planning still translates into good execution, and a team that skips the thinking still builds the wrong thing. What collapsed is the ratio between planning and building. Agile already moved the planning from one giant up-front document to small continuous slices before each sprint. The agentic shift takes it further, because the building is now so fast that the old three-to-one is impossible and the planning deliverable has shrunk from a 198-page strategy to the two briefs and the spec. That is mostly progress, and it is also a loss, because the 198 pages carried a weight the briefs sometimes cannot, the slow accumulated argument about whether the thing should exist at all, and a team that compresses planning to near zero because the building compressed to near zero has thrown away the insurance along with the waste. The new capability at the front of the pipeline has to be understood against that history, because it is precisely the thing that makes the compression both possible and tempting.

There is one more thing the team can now do at the front of the pipeline that it could not before, and it is the most useful new capability and the most dangerous. A team can build a working version of almost any product in a day. Describe it to an agent, and an hour later something runs, clicks, demos, looks for all the world like the thing. The working thing that took a day is a learning instrument that looks exactly like a product, and the entire discipline is in knowing, every single time, which one you are holding. You build the prototype to decide. You do not ship it to users.

What the fast prototype is good for is real. It shows instead of telling, surfacing reactions a document never will. It de-risks a bet before the bet is expensive, letting you find out that nobody wants the thing while the thing cost you a day instead of a quarter. It tests a job against reality, the actual messy task in the actual messy world. But all three are tests of desirability and adoption: does anyone want this, will anyone use it. The prototype is built to invalidate your riskiest assumption about whether the product should exist, and it does that beautifully. What it does not test, and cannot, is whether the thing is safe to run.

Here is where the demo lies, and the lie is structural. A prototype generated in an afternoon works under the conditions you happened to try. It is honest about the happy path and silent about everything else, and the silence reads as success, because a thing that does not visibly fail looks like a thing that works. A team built a featured education app this way; it ran, it demoed, it would have been flawless in front of any audience. Its authentication logic ran backwards, blocking the users who were logged in and letting the anonymous ones through, and it exposed eighteen thousand six hundred and ninety-seven records, and none of that was visible in the demo because the demo never tried the case that broke it. The prototype was telling the truth about the only thing it had been built to test, whether the idea was appealing, and lying by omission about everything it had not, which is everything that matters in production.
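The shape of that failure fits in a few lines. What follows is a hypothetical reconstruction of the kind of inverted check described, not the app's actual code: both function names are invented, and the point is only which cases a demo skips.

```python
def can_view_records_inverted(is_logged_in: bool) -> bool:
    """Hypothetical guard with the condition backwards."""
    return not is_logged_in


def can_view_records(is_logged_in: bool) -> bool:
    """The intended guard: only logged-in users see records."""
    return is_logged_in


def access_failures(check) -> list:
    """Try the two cases a happy-path demo can easily skip."""
    failures = []
    if not check(True):
        failures.append("logged-in user was blocked")
    if check(False):
        failures.append("anonymous visitor was let through")
    return failures


# A presenter who never logs in sees data on screen, so the demo "works".
print(can_view_records_inverted(False))

# Asking both questions explicitly tells a different story.
print(access_failures(can_view_records_inverted))
print(access_failures(can_view_records))
```

The demo exercises one input and sees a result. The two-case check costs almost nothing to write, but nobody writes it for a thing that was only built to test whether the idea was appealing.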

So the prototype hands over a learning, not a codebase. What you carry out is what it proved and disproved, the bet validated or killed. What you do not carry out is the code, which was never built to survive and will not. The fast prototype does not make the product manager a builder. It makes the product manager someone who can test a bet cheaply, and confusing the two, mistaking “I built a working version” for “I built the product,” is how the build cost does not get saved but merely moved, downstream, to the worst possible moment, the production incident, with the team’s name attached. On a team this is an ownership question wearing a methodology costume. Someone has to be the person who says, this proved the bet, now we throw the code away and build it properly, against the Executable Brief, with the supervision designed and the hard cases handled, and that someone needs the standing to stop a working demo from becoming a shipped product, especially when the demo is good, because a good demo is the most persuasive argument for skipping exactly the work that keeps the product from becoming an incident. The prototype is the bet proven; the product is the bet built to survive, and the team that loses the difference loses it in the most seductive way there is, by succeeding at the demo.
