Part II · Prototype & Collaborate · Chapter 4

Chapter 4: How the Work Splits

The agentic lifecycle is not a new process; it is the one you know with the seams moved. Two things change shape, and the rest follows. First, the work splits into two tracks at the decision point: the agent, and the layer that supervises it. The supervisory track opens the work and outlasts it. Second, the unit of work for the agent’s behavior stops being the user story and becomes a spec written around an outcome and graded by evals. This chapter says, concretely, what the new job looks like on a Monday.

Before the book takes the new job apart chapter by chapter, this chapter shows you its shape whole, so the parts that follow have somewhere to sit. The rest of the book goes deep on each piece: how to design the agent’s behavior, how to choose the oversight, how to read the evals, how to operate the thing in production, how to carry the human and moral weight. Here is the map those chapters fill in. It is not a methodology to follow step by step; it is the shape the work takes once you have decided to build an agent, and the point of seeing the shape first is that it tells you which of your old habits still fit and which quietly stopped working, before you are deep in the mechanics of any one of them.

Start from what does not change, because the comfort is real and earned. You still start with a pain worth solving. You still research the market, the competitors, the users. You still write something down, validate it, and hand work to engineering. The front of the process survives, and AI makes most of it faster: the research synthesizes in an afternoon, the competitive scan assembles itself, the first draft of everything arrives in minutes. If the change were only speed, this would be a shorter chapter. The change is where the seams fall.

The work splits in two, and the split is not symmetric

The first seam moves to the decision point. The moment you commit to building an agent, the work divides into the two channels from the spine of this book: Channel 1, the agent and what it does, and Channel 2, the layer that supervises it. The division is not a formality. The two tracks have different deliverables, often different owners, and, crucially, different start and end points.

Channel 2 opens the work, or it should. In practice boundaries and capabilities get negotiated together, and that is fine; the claim worth defending is narrower and sturdier. Of the two ways to sequence the channels, treating Channel 2 as an afterthought is the more expensive mistake, because the autonomy boundary, what the agent may do alone and what requires a human, shapes everything in Channel 1. An agent that may act up to a limit and must get approval above it is a different product, with a different interface and a different architecture, than one that runs autonomously with an audit trail. Decide the boundary late and you discover the agent you built does not fit it; the boundary was deciding the agent whether you set it first or not. So the four runtime artifacts the design chapter will spell out, the boundary, the approval moment, the audit surface, the recovery workflow, are not Channel 1 outputs. They are Channel 2’s opening move, and they become the constraints the agent is built inside. This is the part teams get wrong most often: treat Channel 2 as a parallel sibling that starts when Channel 1 starts, and you will design an agent whose capabilities do not fit the boundaries you later try to wrap around it.

It is worth seeing the cost concretely, because the abstract version of this claim is easy to nod at and expensive to ignore. The case below is a composite, assembled from the way this failure actually unfolds, and it follows the refund agent the book has been tracking.

The boundary that did not fit. A support team builds the refund agent MVP first, the natural order, because the agent is the part you can demo. They train it to resolve requests end to end: read the order, weigh the customer’s history, decide, issue the refund, send the confirmation. It works, it demos beautifully, and it ships to a pilot.

Then someone asks the question that should have come first: what is it allowed to do on its own? Compliance wants a hard ceiling, no autonomous refund above a limit, mandatory review on anything outside the return window or flagged for fraud. Reasonable, and now very expensive, because the agent was built as one fluent motion from request to resolution, with no seam where a human could be inserted. Nothing in the flow pauses, packages the decision, and waits, because the agent was never designed to stop. Stopping was not in its world.

Adding the boundary now means re-architecting the thing that already works: breaking the smooth resolution into a gated one, building the approval surface no one specified, retrofitting a decision package the agent has no notion of producing. The team spends a quarter rebuilding a working agent into a differently-shaped one that is worse at the cases it used to handle smoothly. None of this was a model failure; the agent was excellent. The sequence was wrong. The boundary was a Channel 2 decision that arrived after Channel 1 had hardened around its absence, and had it been set first, the agent would have been a different and cheaper thing from the first line of code: the one that knew how to stop.

The lesson is not that you must literally write Channel 2 before touching Channel 1; in practice the two get negotiated together. It is that the boundary is load-bearing, and a team that defers it is not saving the decision for later, it is making it by default, in the shape of an agent that assumed it would never have to stop.
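
One way to picture an agent that knows how to stop is a gate that sits between the agent’s proposal and the action. The sketch below is illustrative only: the limit value, the field names, and the `route` function are assumptions made for the example, not a design this book prescribes.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO = "auto"                # the agent may act alone
    NEEDS_APPROVAL = "approval"  # the agent must stop and wait for a human


@dataclass(frozen=True)
class RefundProposal:
    # Hypothetical fields; a real decision package would carry more context.
    amount: float
    within_return_window: bool
    fraud_flag: bool


@dataclass(frozen=True)
class Decision:
    route: Route
    reasons: tuple  # why the gate stopped the agent; shown to the approver


def route(proposal: RefundProposal, auto_limit: float) -> Decision:
    """Apply the autonomy boundary before any refund is issued."""
    reasons = []
    if proposal.amount > auto_limit:
        reasons.append("amount above autonomous limit")
    if not proposal.within_return_window:
        reasons.append("outside return window")
    if proposal.fraud_flag:
        reasons.append("fraud signal fired")
    if reasons:
        return Decision(Route.NEEDS_APPROVAL, tuple(reasons))
    return Decision(Route.AUTO, ())


small = route(RefundProposal(20.0, True, False), auto_limit=50.0)
large = route(RefundProposal(400.0, True, True), auto_limit=50.0)
print(small.route.value)  # auto
print(large.route.value, large.reasons)
```

The point of the sketch is the seam, not the rules: the agent produces a proposal, and something outside the agent decides whether the proposal becomes an action or a package for a human.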

Channel 2 also closes the work, and outlasts it. When the agent ships, Channel 1’s job is largely done; Channel 2’s is just beginning, because supervision, observation, drift detection, and the audit that survives the agent all run for as long as the agent runs. So the picture is two parallel tracks with offset ends: Channel 2 starts earlier and finishes later, and the agent lives inside that envelope. That envelope is the shield from the spine chapter, drawn as a process: the supervisory layer is in place before the agent acts and remains after it stops, which is exactly what a shield does.

The brief becomes two documents

The second seam is the brief, and it is the place the operating model concentrates the most value, because the brief is where Channel 1 and Channel 2 are synthesized into something a team and a machine can both act on. But a single brief cannot do both jobs, and trying to make it is a common, quiet mistake.

Call them the Human Brief and the Executable Brief, and the names are the job. The Human Brief is the successor to the PRD: the business case, the go/no-go, the cost model, the boundaries, written as narrative a room argues with. It states the problem, the alternatives considered and rejected, the tradeoffs, the question the product is answering, and UX, engineering, finance, and a customer can each push on it, because it shows its reasoning. The Executable Brief is the successor to the epic: the experience, the behavior, and the governance, structured and explicit, stripped to requirements and constraints, because a generation tool takes the simplest path through whatever ambiguity you leave it. The mature pattern is to keep the Human Brief as the artifact you reason and align with, and to derive the Executable Brief from it, machine-assisted and human-owned. That pattern is now visible in the spec-driven development tooling that emerged in 2025, GitHub’s Spec Kit the clearest instance, where a written specification is a living source of truth that drives the plan, the tasks, and the implementation rather than a prompt typed once and discarded. That a major platform vendor built tooling around exactly this move, deriving an executable spec instead of hand-prompting an agent, is worth noting: the Executable Brief is not a hopeful abstraction, it is the same principle others are now shipping tools to support, and the tools will change while the principle does not. One is for the argument; the other is for the build. Crucially, both carry Channel 1 and Channel 2: the split is by audience, not by channel. The supervisory layer appears in the Human Brief as a decision the room makes and in the Executable Brief as buildable requirements, which is how the supervision gets constructed instead of assumed. The brief is the center of gravity precisely because it generates both the prototype and the backlog, rather than getting handed off and going stale.

The backlog, the brief’s second consumer, matters more than it first appears: the Executable Brief is read twice, once by the coding agent that vibe-codes the MVP, and once by the engineers and architects who build the production system through whatever they actually use, whether Jira, design docs, or architecture review. The same requirement renders differently for each, and the difference is the enforcement principle in miniature. “The agent must not read personal data” becomes, in the MVP, an instruction the agent follows as far as it cooperates; in production it becomes an architectural control, a data-access layer that strips the personal fields before the agent can see them, so the boundary holds whether the agent cooperates or not. One brief, two implementations, hardening from a prompt to a structure as it moves from the thing that proves the bet to the thing that ships. Appendix A works both briefs in full for the refund agent, with a template under each you can adapt.
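
The production rendering of “the agent must not read personal data” can be sketched as a thin access layer. This is a minimal illustration under stated assumptions: the field names in `PERSONAL_FIELDS` and the `OrderStore` and `AgentOrderView` classes are hypothetical, and in a real system the boundary would more likely sit at a service or process edge than inside one Python module.

```python
from typing import Any, Mapping

# Hypothetical field names; a real system would take these from a data classification.
PERSONAL_FIELDS = frozenset({"name", "email", "phone", "address"})


def redact(record: Mapping[str, Any]) -> dict:
    """Return a copy of the record with personal fields removed."""
    return {k: v for k, v in record.items() if k not in PERSONAL_FIELDS}


class OrderStore:
    """Stand-in for the production data source."""

    def __init__(self, rows: Mapping[str, Mapping[str, Any]]):
        self._rows = rows

    def get(self, order_id: str) -> Mapping[str, Any]:
        return self._rows[order_id]


class AgentOrderView:
    """The only handle the agent is given; it returns redacted copies only."""

    def __init__(self, store: OrderStore):
        self._store = store

    def get(self, order_id: str) -> dict:
        return redact(self._store.get(order_id))


store = OrderStore({"A1": {"order_id": "A1", "email": "x@example.com", "total": 42.0}})
view = AgentOrderView(store)
print(view.get("A1"))  # {'order_id': 'A1', 'total': 42.0}
```

Nothing in the sketch depends on what the agent was told. The personal fields are gone before the agent’s code path begins, which is the difference between an instruction and a control.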

It is worth seeing where that tooling stops, because the gap is the case for the Human Brief in one stroke. Spec Kit has an artifact that sits above the spec and even calls it a constitution, the principles every spec must honor, checked at a gate before the build proceeds. But the word is doing different work there than it does in the literacy chapter: Spec Kit’s constitution holds engineering principles (test-first, simplicity, library boundaries), which is to say how the team builds. It is not the agent’s behavioral contract and it is not the business case. It holds no cost model, no go/no-go, no statement of what the product deliberately will not become, no name against the question of who owns a wrong outcome, and no value-priority for the agent itself. The leading spec-driven tool borrowed the most loaded word in AI governance for a page of build discipline, and left both layers the word usually carries, the agent’s behavioral contract and the human’s decision to build, as an exercise for the reader. That is not a criticism of the tool; it is the pattern this book is about, arriving in the place you would least expect it. The executable layer races ahead and the human-decision layer is a stub, even in the best instance of the form. The Human Brief is that stub, filled in. It is the artifact the tooling assumes you already have and does not help you write.

From the executable spec you can do the thing the next chapter takes up in full: generate a working prototype in hours and put it in front of someone who has the problem. One discipline comes attached: the prototype is for deciding, not for shipping, and saying so out loud is what stops a persuasive demo from being quietly hardened into production. The prototype validates the bet. The spec, and what comes next, builds the product.

The unit of work changes, and this is the disruptive part

This is the seam that will unsettle a working product manager most, and it is worth being exact, because the careless version of this claim is wrong and a sharp PM will know it.

For two decades the unit of work has been the user story and its acceptance criterion: when the user does X, the system does Y, and you verify it with a test that passes or fails. That primitive is not arbitrary. It works because the system is deterministic, there is a single correct Y, and “pass or fail” is a real answer. None of those three holds for an agent’s behavior. The agent’s output is a distribution, not a single Y. “Correct” is a judgment, not a binary. And a test that runs once tells you nothing, because the next run may differ. So the acceptance criterion the user story was built to carry has nowhere to live. You cannot write “when the patient presents with these symptoms, the agent produces this triage” as a pass/fail line, because the real target is “produces a good triage most of the time, and here is what the wrong cases are allowed to look like.” That is not an acceptance criterion. It is an eval against a golden dataset: a curated set of inputs paired with the answers you would accept and the failures you must never ship, which the evals chapter takes up in full.
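
The claim that a single run tells you nothing can be made concrete with a toy. In the sketch below, `stub_agent` is a hypothetical stand-in for a model call that answers correctly about nine times in ten; the 0.9 figure, the labels, and the function names are all assumptions for illustration.

```python
import random


def stub_agent(case: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a model call: usually right, sometimes not."""
    return "urgent" if rng.random() < 0.9 else "routine"


def acceptable_rate(case: str, accepted: set, runs: int, seed: int = 0) -> float:
    """Run the same input many times and report the share of acceptable outputs."""
    rng = random.Random(seed)
    hits = sum(stub_agent(case, rng) in accepted for _ in range(runs))
    return hits / runs


rate = acceptable_rate("chest pain, shortness of breath", {"urgent"}, runs=200)
print(f"acceptable on {rate:.0%} of runs")
```

A pass/fail test would report whichever answer happened to come back on the one run it made. The rate is the thing you can actually set a bar against.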

So the work-unit, for the agent’s behavior, shifts from the user story to an outcome-centric spec: a statement of the outcome you want, the bounds the agent must stay inside, and the eval set that grades whether it got there, including what acceptable failure looks like. The word outcome-centric is the load-bearing part. A deterministic spec says do X then Y. An outcome-centric spec says achieve this outcome, within these bounds, and here is how we will know, which is the only kind of specification a probabilistic system can actually be held to. Engineering cannot build to a spec that does not say what acceptable failure looks like, and a user story has no field for that. Teams still generating epics from an agentic brief, which is most teams, are translating probabilistic work into a paradigm built for software that does what it is told, and the false comfort of a full backlog hides that the most important thing, what wrong is allowed to look like, was never written down.

The boundary is where the claim is easy to overstate. The deterministic shell around the agent, the orchestration, the tool integrations, the screens the user clicks, is still ordinary software, and it still gets ordinary user stories. What resists the story is the probabilistic core, the agent’s behavior itself. So the user story is not dead; its territory has shrunk to the parts of the system that still behave deterministically. And the direction is worth noticing: as agents absorb more of the work that used to be deterministic code, that territory keeps contracting. The story is not wrong today. The share of the product it can describe is getting smaller, and the part it cannot describe is the part that now matters most.

The same behavior, written both ways

The claim is abstract until you see the same piece of work written as a story and then as a spec, so here is one behavior expressed both ways. Take a support agent that issues refunds, because the shape is the same whether your domain is a help desk, a claims desk, or a clinic, and the takeaway is portable: the moment an agent makes a judgment rather than executing a rule, the story stops being able to hold it.

The old way writes it as a user story. As a customer, when I request a refund for a defective item, the agent issues the refund, so that I am made whole without waiting for a human agent. Acceptance criteria: given a valid order within the return window, when the customer requests a refund, then the agent issues it and sends confirmation. It is a clean story. It would pass review in any backlog grooming in the world. And it is the wrong primitive, because it describes the case that was never hard. The defective item inside the window was always going to be refunded. What the story does not, and cannot, say is what the agent should do with the request that is technically outside the window but the customer is right on the merits, or the one where the item is not defective but the customer is about to churn and is worth more than the refund, or the one where the refund amount is ten times the others and the pattern smells like fraud. Those are the cases the agent will actually be judged on, and the story has no field for any of them. “Issues the refund” is a pass/fail line for a decision that is not pass/fail.

The new way writes it as an outcome-centric spec, and the spec has parts the story does not. The outcome: the agent resolves refund requests in a way a reasonable manager would endorse on review, favoring customer trust on small amounts and human judgment on large or anomalous ones. The bounds (these come from Channel 2, the autonomy boundary): the agent may issue refunds up to a set amount without approval; above it, or outside the return window, or when the fraud signal fires, it must route to a human rather than decide. The eval set, graded against a golden dataset: a curated set of refund requests, each paired with the outcome a senior support lead actually endorsed, including the hard ones, the generous exception, the justified denial, the escalation. The agent’s behavior is scored against that set, not against a single right answer, and the target is stated as a distribution: resolve the clear cases autonomously at or above some rate, escalate the ambiguous ones, and never auto-approve a case from the fraud set. And the part the story structurally cannot carry, what acceptable failure looks like: a wrong escalation, sending a borderline case to a human who did not strictly need to see it, is tolerable and expected; a wrong auto-refund on a fraud-pattern case is a never-ship failure, the kind where a single occurrence fails the eval. The escalation rule is the seam back to Channel 2: every case the agent declines to decide is a case the supervisory layer has to be designed to catch.
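
The four parts of the spec can be written down as data and graded mechanically. What follows is a rough sketch, not the Appendix A template: the class names, the three action labels, and the 0.75 threshold are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    case_id: str
    endorsed: str            # what the support lead endorsed: approve, deny, or escalate
    fraud_set: bool = False  # True if the case belongs to the fraud set


@dataclass(frozen=True)
class OutcomeSpec:
    outcome: str           # the outcome, in words
    bounds: str            # the autonomy boundary from Channel 2
    golden: tuple          # the eval set
    min_match_rate: float  # acceptable failure, stated as a rate (assumed value)


def grade(spec: OutcomeSpec, agent_actions: dict) -> dict:
    """Score one batch of agent actions against the golden set."""
    matches, tolerated, never_ship = 0, 0, []
    for case in spec.golden:
        action = agent_actions[case.case_id]
        if case.fraud_set and action == "approve":
            never_ship.append(case.case_id)  # a single occurrence fails the eval
        elif action == case.endorsed:
            matches += 1
        elif action == "escalate":
            tolerated += 1                   # wrong escalation: tolerable, still counted
    rate = matches / len(spec.golden)
    return {
        "match_rate": rate,
        "tolerated_escalations": tolerated,
        "never_ship": never_ship,
        "passed": not never_ship and rate >= spec.min_match_rate,
    }


golden = (
    Case("g1", "approve"),
    Case("g2", "deny"),
    Case("g3", "escalate"),
    Case("g4", "escalate", fraud_set=True),
)
spec = OutcomeSpec(
    outcome="Resolve refunds as a reasonable manager would endorse on review",
    bounds="Auto-approve up to a set amount; otherwise route to a human",
    golden=golden,
    min_match_rate=0.75,
)
good = grade(spec, {"g1": "approve", "g2": "escalate", "g3": "escalate", "g4": "escalate"})
bad = grade(spec, {"g1": "approve", "g2": "deny", "g3": "escalate", "g4": "approve"})
print(good["passed"], bad["passed"])  # True False
```

Both runs match three of four cases. The first passes with one tolerated escalation; the second fails on a single auto-approval from the fraud set, which is the asymmetry a pass/fail acceptance criterion has no way to express.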

Read the two side by side and the shift is concrete. The story has one field, the happy path, and verifies it with a test that passes. The spec has four, outcome, bounds, eval set, and the definition of acceptable failure, and it is graded by a distribution over real judged cases. Engineering can build to the second and cannot build to the first, because the first never said what the agent was allowed to get wrong. This is the work-unit shift made literal: not a better-written story, a different primitive. When you find yourself writing an acceptance criterion for an agent’s judgment and it reads as pass/fail, that is the signal that you are using the deterministic tool on the probabilistic core, and the case you are not writing down is the one that will reach the affected person.

Where the tracks reconverge, and what is owned by whom

The two tracks come back together at the release decision, and the decision is no longer binary, ship or do not ship. It is a question of how much autonomy the agent is allowed to launch with, the rung on the autonomy ladder it has earned, and the gate has two halves that must both pass. Channel 1 readiness: the agent does the job, validated with users, evals clearing the bar you set. Channel 2 readiness: the supervision is live, the boundaries enforced, the kill switch reachable, the observation instruments running, the audit recording. An agent is not ready because Channel 1 works. It is ready when both halves clear, and a launch review where the Channel 2 half has no named owner is a review that has found its own gap.
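
The two-half gate is simple enough to sketch as a checklist that refuses to pass when a check is unowned. The structure below is hypothetical, with invented check names and owners, and it covers only the readiness test, not the choice of autonomy rung.

```python
from dataclasses import dataclass, field


@dataclass
class Check:
    name: str
    owner: str    # an empty string means no one has been named
    passed: bool


@dataclass
class ReleaseGate:
    channel_1: list = field(default_factory=list)
    channel_2: list = field(default_factory=list)

    def gaps(self) -> list:
        """List every reason the agent is not ready to launch."""
        problems = []
        halves = (("Channel 1", self.channel_1), ("Channel 2", self.channel_2))
        for label, checks in halves:
            if not checks:
                problems.append(f"{label}: no checks defined")
            for c in checks:
                if not c.owner:
                    problems.append(f"{label}: '{c.name}' has no named owner")
                if not c.passed:
                    problems.append(f"{label}: '{c.name}' has not passed")
        return problems

    def ready(self) -> bool:
        return not self.gaps()


gate = ReleaseGate(
    channel_1=[Check("evals clear the bar", "pm", True)],
    channel_2=[
        Check("kill switch reachable", "", True),
        Check("audit recording", "platform", True),
    ],
)
print(gate.ready())
print(gate.gaps())
```

In this example every check has passed and the gate still reports not ready, because one Channel 2 check has no owner. That is the review finding its own gap.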

That owner question is the one the operating model makes unavoidable. Channel 1’s deliverables have obvious homes; Channel 2’s do not, which is why they get skipped. Someone owns the supervisory interface, which is design. Someone owns the runtime controls and the kill switch, which is engineering or platform. Someone owns whether the agent is still correct in production, which is the role the field is beginning to call the agent supervisor. These roles do not have stable titles in most organizations yet, but names are appearing (agent supervisor, agent QA lead, AI operations manager), and the chapter on the agent as a team member returns to who they are and why no one has hired them. And the synthesis, whether the whole adds up to a product that should exist and behave the way it does, is yours. The operating model does not remove work from your plate. It names the work that was falling on the floor.

After launch, the model does not end, which is the last way it breaks from the lifecycle you are used to. The classic process tapered into maintenance; this one’s center of gravity is after the agent ships, because that is where the supervisory track does most of its work and where the agent drifts. The governance itself runs as three streams on three clocks: the pre-deployment work that has to be done before the agent first acts, the runtime controls that ship with it, and the post-deployment work, audit, drift detection, the decision to retrain or retire, that starts at launch and never stops. Treating governance as one task on the backlog hides that it has three different timelines, and the one that never ends is the one teams forget to staff.

Pull back far enough and three things have changed at the level of the job, not the task. You no longer run a linear chain from market doc to spec to epic; you manage a living system of intent, evidence, prototype, controls, evals, and supervision, all of it in motion at once. The central artifact is no longer the spec that gets handed off; it is the executable brief that everything else is generated from. And the core of the work is no longer only deciding what the product should do. It is deciding what the agent is allowed to do, when it must stop, when it must ask, and how the organization watches it once it is doing it. That last sentence is the whole job in one line, and none of it was on the old job description.

So map your own next agent against this shape. Where does Channel 2 open the work, and have you set the boundary before designing the agent, or after? Is your brief one document trying to be two? Is the agent’s behavior written as user stories with pass/fail criteria it cannot actually meet, or as outcome specs graded by evals? And after launch, who owns the track that does not end? The shape is the answer to what your Monday looks like, and the places it does not fit your current process are the places your current process was built for a kind of software the agent is not.
