Part II · The Work Reshapes  ·  Chapter 5

Who Wrote This Code, and Who Answers for It?

A staff engineer I worked with had a habit I did not understand until much later. When a junior handed him a pull request, he would not start by reading the code. He would ask the junior to walk him through it out loud, and he would watch their face while they did it. He was not checking the diff. He was checking whether the person understood what they had written, because he knew that a change you cannot explain is a change that will fail in a way you cannot predict. The reading of the code was almost secondary. The real review was the conversation.

That engineer’s instinct is the whole subject of this chapter, because the thing he was doing, holding the system in his head and testing whether a change fit it, is exactly the thing that does not scale, and it is exactly the thing the team now needs more of than it has ever needed before. The agent writes the code. It writes a great deal of it, fast, and most of it looks right. What it cannot do is the thing the staff engineer was doing across the desk: know whether this change fits a system that has to keep its guarantees while it absorbs the change. Somebody still has to do that. The question this chapter asks is who, and the uncomfortable answer most teams have arrived at without deciding to is: nobody, in particular, and everybody, too late.

The number that should stop the meeting

Start with the data, because the story the data tells is not the story the vendors tell, and the gap between them is the chapter.

In April 2026 Faros AI published telemetry from twenty-two thousand developers across four thousand teams, comparing low and high adoption of AI coding tools. The headline the industry wanted was there: throughput rose. More pull requests, larger ones, more changes shipped per developer. That is the number that ends up in the board deck. The numbers underneath it are the ones that should have stopped the meeting. As adoption deepened, code churn, the share of recently written code that gets ripped out and rewritten within weeks, rose by eight hundred and sixty-one percent. Bugs per developer rose fifty-four percent, on top of an already elevated rise the year before, and the curve was steepening, not flattening. Production incidents per pull request rose two hundred and forty-two percent. A rise of that size means they more than tripled. And the mechanism behind all of it was sitting in one statistic: thirty-one percent of code was entering production with no review at all, because the reviewers could not keep pace.

That last number is the whole problem in miniature. The agent made writing cheap. It did nothing to make reviewing cheap, and reviewing is now the binding constraint on whether any of the cheap code is safe to ship. The same study measured what happens to the people who do the reviewing. Median time a senior engineer spends in review rose four hundred and forty-one percent. The deepest system knowledge on the team, the staff engineer who can tell whether a change fits, is precisely the resource that got buried under the volume. Call it the senior engineer tax: the more the team leans on the agent, the more of the senior’s week is spent reading machine output, and the less of it is spent on the architectural judgment that was the reason to keep a senior in the first place.

A fair reader should note that Faros sells engineering-measurement tooling, so a story about quality quietly degrading is not a neutral finding from a disinterested party, and a single vendor’s telemetry is correlational, not causal. The reason to trust the direction anyway is that it does not stand alone: the independent studies cited later in this chapter, the security-flaw audits, the randomized developer-slowdown trial, the trust-and-verification surveys, all point the same way, and the convergence of a vendor’s telemetry with disinterested research is what makes the pattern hard to wave off. None of this says the speedup is fake. The speedup is real. What the data says is that it landed entirely on one side of the pipe. As one engineering writer put it, the team optimized the input side and never expanded the output. You can generate in an afternoon what used to take a sprint, and then the afternoon’s work sits in a queue behind a human who reads at human speed, and the queue is where the bugs and the incidents and the un-reviewed thirty-one percent come from. The bottleneck did not disappear. It moved downstream, to verification, and verification is harder to scale than generation because it is the part that requires understanding the whole.
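The queue argument is simple enough to sketch in a few lines. What follows is a toy model, not a measurement: the daily rates are invented for illustration, and the function name review_backlog is hypothetical. It shows only that once pull requests arrive faster than they can be read, the unreviewed pile grows without bound, however good the individual reviewers are.

```python
def review_backlog(opened_per_day: float, reviewed_per_day: float, days: int) -> list[float]:
    """Return the unreviewed backlog at the end of each day.

    Toy model: a fixed number of pull requests arrive each day and a fixed
    number get reviewed. The backlog can never drop below zero.
    """
    backlog = 0.0
    history = []
    for _ in range(days):
        backlog = max(0.0, backlog + opened_per_day - reviewed_per_day)
        history.append(backlog)
    return history


# Hypothetical rates, chosen only to illustrate the shape of the problem.
before_agent = review_backlog(opened_per_day=4, reviewed_per_day=5, days=20)
after_agent = review_backlog(opened_per_day=10, reviewed_per_day=5, days=20)

print("backlog after 20 days, before the agent:", before_agent[-1])
print("backlog after 20 days, with the agent:  ", after_agent[-1])
```

In the first case review capacity exceeds arrivals and the queue stays empty. In the second, generation has more than doubled while review capacity has not moved, and the backlog grows by the difference every day. The only ways out are to review faster, review less, or ship unreviewed.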

The job inverted, and the new job is the hard half

Jensen Huang said the quiet part out loud at the start of 2026, telling engineers they should stop writing code and start directing and supervising the code the machine writes. Whatever you think of the source, the description is accurate, and you can see it in how the loop changed. The old developer loop was think, write, compile, debug, a loop where the human produced the artifact and the machine checked it. The new loop is frame, delegate, evaluate, constrain, integrate, a loop where the machine produces the artifact and the human checks it. The human moved from author to reviewer. That is not a small change in tooling. It is an inversion of which half of the work is the human’s.

And here is the part the inversion stories leave out: the half that stayed human is the hard half. Writing code was the part we knew how to teach, measure, and speed up. Reviewing code, really reviewing it, holding the system in your head and judging whether a change preserves its guarantees, was always the senior, scarce, slow part. The inversion did not hand the human the easy work. It handed the human nothing but the hard work, at a volume the hard work was never able to sustain, and then a chart showed throughput going up.

There is a sociological wrinkle here that makes it worse, and it is worth naming because it is invisible in the metrics. When the AI reviewer and the human reviewer disagree, developers tend to trust the human, even when the AI is objectively right. The deference runs toward the person. That is healthy when the person has the deeper context and corrosive when the person is a buried senior skimming their fortieth diff of the day, because now the one safeguard, human judgment, is being applied by a human who has been made into a throughput device.

What it looks like on one team

Make it concrete. Picture a payments team at a mid-sized company, six engineers, two of them senior, shipping the service that moves customer money. They adopt a coding agent, and for the first month it is wonderful. Features that sat in the backlog for quarters ship in days. The agent is fluent, it writes plausible tests, it even writes the boilerplate nobody wanted to write. Velocity, the number the team reports upward, doubles.

By the third month the two seniors are not building anything. They are reviewing, full time, and falling behind. The agent opens pull requests faster than they can read them, each one larger than the human PRs used to be, each one looking finished. To keep the queue moving, the team starts letting the small ones through on a glance, then on a green CI check alone. One of those, a refactor of how the service retries a failed charge, is locally correct, the function does exactly what its name says, and globally wrong, because it does not know that this system must never retry a charge that might have already succeeded on the processor’s side. The agent could not have known that. It is not in the code. It is in the head of the senior who was too buried to read this particular diff closely. The change ships. Three weeks later a subset of customers get double-charged during a processor timeout, and now the senior who did not have two hours to review the PR has two weeks to run the incident, the remediation, and the refunds.

Nothing in that story is exotic. It is the eight-hundred-percent churn and the tripled incident rate, lived by six people. And the most important fact about it is that no one on that team was negligent. The system was arranged so that the failure was the default outcome, and the failure had no owner until it became an incident, at which point it acquired the most expensive owner on the team.

What the payments team built has a shape, and the shape has a name that is also a warning. They built a house of cards. Each layer went up fast and looked sound, the feature on top of the refactor on top of the helper the agent wrote last week, and each rested on layers below it that nobody had fully checked. A house of cards is not unstable because any one card is weak. Every card is fine on its own; the retry refactor did exactly what its name said. It is unstable because the structure has no member that bears the load of the whole, no one holding the system in their head and asking whether the new card can take the weight of the cards that will sit on it. Velocity is the rate at which you add cards. Verification is the only thing that checks whether the structure underneath can hold them. When you accelerate the first and leave the second at human speed, the house gets taller and more impressive and more precarious at the same time, and it stays standing right up until one card low in the stack shifts, which is what a processor timeout is. The double-charge was not a card failing. It was the floor moving under a tower built faster than anyone could check the foundation. The foundations of this book made this point about a single product: do not ship the prototype you built to decide with. At the team level it is structural: high velocity on an unverified base is not progress, it is a taller fall.

Vibe to production is a cliff, and nobody owns the edge

Part of what happened to the payments team is a confusion the whole industry is still making, between two activities that use the same tools and produce the same-looking output and could not be more different in what they demand.

Andrej Karpathy named “vibe coding” at the start of 2025, building by conversing with an agent and accepting what comes back. A year later he reframed the professional version as “agentic engineering,” the same speed but “with more oversight and scrutiny,” and the reframing matters because it marks the line teams keep stepping over without noticing. Vibe prototyping is a way to learn fast: you build a disposable artifact to find out whether an idea is worth anything, and the artifact is supposed to die. Production engineering is the human-intensive work of hardening something until it can be trusted with real money and real data. The failure is treating the first as if it were the second, shipping the prototype because it ran in the demo.

The evidence that the prototype is not production is not a matter of taste. Across eight studies between 2024 and 2026, AI-generated code carries security flaws at rates that have not improved across model generations. One large analysis of code from over a hundred models found roughly forty-five percent contained a critical vulnerability. A Georgetown review found the majority of generated code failed basic defenses against common attacks. A review of four hundred and seventy real pull requests found AI changes carried roughly 1.7 times the issues of human changes, with logic and correctness errors up seventy-five percent and error-handling gaps roughly doubled. The pattern underneath all of it has a name, the eighty percent problem: the agent nails the happy path and misses the twenty percent that is the actual engineering, the error handling, the edge cases, the security validation, the observability, the way this function interacts with permissions and sessions and data isolation three systems over. The agent works at the level of the function. The missing work lives at the level of the system, which is exactly the level the agent cannot see and the buried senior no longer has time to watch.

The sharpest thing in the research is what is not there. There is no widely adopted, named process for the handoff from prototype to production. There is no job title for the person who hardens vibe-coded output into something shippable; the work falls to whichever senior, reviewer, code-owner, or security gate happens to be nearby. The hardening is encoded in pull-request templates and test logs and code-scanning rules, not in anyone’s job description. And that absence is not a gap in the literature. It is the failure. When the most consequential step in the pipeline, turning a plausible draft into a trustworthy system, belongs to no one by name, it gets done late, partially, by someone who was not given the time for it, or it does not get done until an incident does it for you.

The choke point is review, so stop trying to prompt your way out of it

Teams confronted with this reach first for the wrong fix, because the wrong fix is the comfortable one: prompt better, spec harder, get the agent to produce cleaner code so there is less to review. It helps a little and it misses the structure of the problem. The problem is not that the input is dirty. The problem is that the output side, verification, cannot scale the way generation just did, and no amount of better prompting changes that. The fix is not “prompt better.” It is “rebuild the verification layer,” and that is a different and less glamorous project.

What the teams further along do is narrower and more disciplined than the vendor pitch. They bound the agent to a small, auditable loop with explicit completion criteria, the most useful practical construct in the whole space being a four-field delegation: goal, context, constraints, and “done when.” The agent is not set loose on the architecture. It is handed a scoped task with a stated definition of done, on an ephemeral branch that starts from a clean environment so the blast radius of a bad run is contained and every run is auditable. The agent cannot push to the protected branch or merge its own work; that gate stays human. I want to be careful here, because the first wave of writing on this oversold it: there is no strong evidence that teams enforce numeric limits on diff size, or run formal statistical sampling of agent output. The real, observed pattern is plainer than that. Bound the task, contain the environment, and put layered automated review in front of the human so the human sees a smaller, pre-filtered surface.
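The four-field delegation is easy to make concrete. The sketch below is one possible shape for it, not a standard: the class name Delegation, the problems check, the branch name, and the example task are all hypothetical, and a real team would likely keep this in whatever format its agent tooling already reads. The point is only that each of the four fields is required and that the merge gate is stated as a rule rather than left as a habit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delegation:
    """A scoped task for an agent: goal, context, constraints, done-when."""
    goal: str
    context: str
    constraints: tuple[str, ...]
    done_when: tuple[str, ...]

    def problems(self) -> list[str]:
        """List the missing fields; an empty list means the task is delegable."""
        issues = []
        if not self.goal.strip():
            issues.append("goal is empty")
        if not self.context.strip():
            issues.append("context is empty")
        if not self.constraints:
            issues.append("no constraints stated")
        if not self.done_when:
            issues.append("no completion criteria stated")
        return issues


# Hypothetical protected branch; the agent works on ephemeral branches only.
PROTECTED_BRANCHES = frozenset({"main"})


def agent_may_push(branch: str) -> bool:
    """The agent never pushes to a protected branch; that gate stays human."""
    return branch not in PROTECTED_BRANCHES


# Illustrative task, loosely modeled on the payments example in this chapter.
task = Delegation(
    goal="Consolidate charge-retry logic into one helper",
    context="Payments service; retries currently duplicated in three handlers",
    constraints=("Never retry a charge whose outcome is unknown",),
    done_when=("Existing tests pass", "New test covers processor timeout"),
)

print(task.problems())
print(agent_may_push("main"), agent_may_push("agent/retry-helper"))
```

Note what the sketch does not do: it sets no numeric limit on diff size and samples nothing. It bounds the task and names the gate, which is the pattern actually observed.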

That layered review is AI reviewing AI, and it is now normal: a scoped review agent runs first, restricted to specific concerns it is actually good at, security, performance, test coverage, and explicitly not architecture or style, where it produces noise. Fed the team’s architecture-decision records and incident postmortems as context, and filtered by confidence score, one team cut its review agent’s false-positive rate from around thirty percent to under ten and caught a third more critical bugs while trimming review time. That is a real gain and it is not the gain the vendors advertise. The honest version: the AI reviewer is a first pass that shrinks what reaches the human. It does not replace the human, because the thing the human is for, does this change fit a system it cannot see, is the thing the reviewer is worst at. Review remains the choke point. You can widen it with automation, but you cannot remove the senior from the end of it, and pretending otherwise is how you get the thirty-one percent.
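The filtering step in that first pass can be sketched briefly. This is an illustration under stated assumptions, not the cited team’s implementation: the Finding type, the concern labels, and the 0.8 confidence threshold are all invented here. It shows the two filters the paragraph describes, restricting the review agent to concerns it is good at and dropping low-confidence findings, so that the human sees a smaller surface.

```python
from dataclasses import dataclass

# Concerns the review agent is allowed to raise. Architecture and style are
# deliberately excluded, because that is where it produces noise.
ALLOWED_CONCERNS = frozenset({"security", "performance", "test_coverage"})


@dataclass(frozen=True)
class Finding:
    concern: str       # what kind of issue the review agent is flagging
    confidence: float  # the agent's own score, between 0.0 and 1.0
    message: str


def findings_for_human(findings: list[Finding], min_confidence: float = 0.8) -> list[Finding]:
    """Keep in-scope, high-confidence findings, most confident first.

    The 0.8 default is an assumed threshold, to be tuned per team.
    """
    kept = [
        f for f in findings
        if f.concern in ALLOWED_CONCERNS and f.confidence >= min_confidence
    ]
    return sorted(kept, key=lambda f: f.confidence, reverse=True)


raw = [
    Finding("security", 0.95, "SQL built by string concatenation"),
    Finding("style", 0.99, "Prefer single quotes"),
    Finding("performance", 0.40, "Loop might be slow"),
    Finding("test_coverage", 0.85, "No test for timeout branch"),
]

for finding in findings_for_human(raw):
    print(finding.concern, finding.confidence, finding.message)
```

Four findings go in and two come out. Nothing in the filter answers whether the change fits the system, which is why the senior is still at the end of the line.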

The contract moved out of people’s heads and into the repo

One good thing happened in this transition, and it is worth holding onto because it points at where the work is going. The knowledge that used to live in a senior’s head and get transmitted by the across-the-desk conversation is moving into text the agent can read.

That text has several names depending on the vendor: CLAUDE.md, AGENTS.md, the cursor rules file. Each is a project guidance file that the agent loads automatically and treats as standing instruction. Functionally they are the same artifact, and the right way to think about it is as the onboarding document for a new employee who happens to be an agent. It holds the constraints, the conventions, the hard-won “we never do it this way because last time it caused an outage” that a human would have absorbed over three months of code review. The sharp way to put it: if you have not written your team’s agent-guidance file, your codebase is being written by an employee you never onboarded. The high-value work of the senior is shifting from writing the code to writing the instructions, policies, and checks that govern what the agent may do, converting tacit team knowledge into machine-readable constraint. That is real, durable, and it survives model upgrades and team turnover in a way that knowledge in a departing senior’s head never did.

The ticket survived too, but its role inverted. In the spec-driven approach the agent’s instruction is derived from a specification rather than being the source of one; the spec becomes the unit of delegation and the pull request remains the unit of verification. I will be honest about the limits of this, because the second pass of research was rightly skeptical: spec-driven development is a real and growing practice, but the specific tools and the “one requirement, two builders” framing are not yet industry standards, and there is no large controlled study showing that spec-gating actually reduces defects. Treat it as a promising discipline and one of this book’s own frames, not as a settled result. What is settled is the direction: the contract between human and agent is moving from heads into repo-scoped text, and the teams that write that text deliberately are the ones whose agents behave.

The time the agent gave back is a choice, not a gift

The agent did free up time. The generation that used to eat the week now takes an afternoon, and that is real. The mistake is to experience the freed time as a gift, because it is not a gift, it is a choice, and most teams are making it badly without noticing they are making it at all.

There are two places the freed time can go. It can go into more of the same: more features, more happy-path code, more cards on the house. That is the default, because it is what the velocity chart rewards and what the agent makes effortless, and it is the choice that built the payments team’s tower. Or it can go into the work that the old pace never left room for, the Channel 2 work the foundations named, the supervision and readiness half of the system that everyone always meant to get to and never did. That is the missing twenty percent made non-optional: the error handling, the edge cases, the security validation the generated code skips by default. It is the observability that lets you see a drift before it becomes an incident. It is the hardening that takes a prototype to production. It is writing the agent-guidance file properly instead of leaving it stubbed. It is the senior reading the architecturally consequential diff instead of waving it through to keep the queue moving.

This is the prescriptive heart of the chapter, and it is simpler to state than to do: spend the dividend on enterprise readiness, not on more velocity. The reason teams do not is that Channel 2 work is invisible to the chart that gets reported upward. Nobody puts “did not have an incident” in the sprint review. So the freed time gets quietly reinvested in the thing that is visible, throughput, which is the thing that was never the constraint. The teams that come through this are the ones that make the reallocation explicit and defend it against the pull of the chart: a stated share of capacity goes to verification, hardening, and drift-watching, by policy, because left to the default it goes to zero. The agent did not solve the readiness problem. It removed the excuse for not solving it, and then tempted everyone into spending the freed time on more cards instead.

Where the dev meets the PM and the designer

This reallocation is not the dev team’s decision to make alone, and that is the part that touches the rest of the team. The seam where the developer meets the product manager and the designer changed shape, and it changed in a specific way worth naming from the dev’s side.

The developer’s new high-value output, the constraints, the specification, the agent-guidance file, the definition of done, is precisely the place where the PM’s intent and the designer’s behavior model have to arrive intact, because that is what the agent will build from. When the PM’s intent is vague or the designer’s behavior policy is unstated, the agent does not stop and ask; it fills the gap with a plausible guess, and the guess becomes code that looks finished. The retry refactor that double-charged customers was, underneath, a gap in stated intent: nobody had written down, in a form the agent could read, the rule that this system must never retry a charge that might have succeeded. That rule was a product decision and a domain constraint before it was ever code. The seam between the PM who owns the intent, the designer who owns the behavior, and the developer who hardens it into a system is exactly where this class of failure is born, and it is born from intent that never made it into the shared, machine-readable contract. The collaboration, from the dev’s chair, is less about handing finished tickets back and forth and more about jointly authoring the constraints the agent cannot infer, early, before the agent has confidently built the wrong thing fast. How the seats formally relate is the subject of a later part of this book; what matters here is that the developer is now the one converting the team’s shared intent into the constraints that govern the machine, and a gap anyone leaves in that intent becomes the developer’s incident.
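Here is what it might look like to write that one rule down in a form a machine can read. The sketch is hypothetical throughout: the ChargeOutcome states and the may_retry function are invented for illustration, and a real processor integration would have more states and would usually lean on the processor’s own idempotency mechanism. What it shows is the rule moving out of a senior’s head and into something an agent can load and a test can enforce.

```python
from enum import Enum


class ChargeOutcome(Enum):
    """What we know about a charge attempt (hypothetical, simplified states)."""
    CONFIRMED_SUCCEEDED = "confirmed_succeeded"
    CONFIRMED_FAILED = "confirmed_failed"
    UNKNOWN = "unknown"  # e.g. the processor timed out before answering


def may_retry(outcome: ChargeOutcome) -> bool:
    """Domain rule: never retry a charge that might have already succeeded.

    Only a confirmed failure is safe to retry. An unknown outcome must be
    resolved with the processor first, never retried blindly.
    """
    return outcome is ChargeOutcome.CONFIRMED_FAILED


# The rule, stated as checks that can run in CI against any refactor.
assert may_retry(ChargeOutcome.CONFIRMED_FAILED)
assert not may_retry(ChargeOutcome.CONFIRMED_SUCCEEDED)
assert not may_retry(ChargeOutcome.UNKNOWN)

for outcome in ChargeOutcome:
    print(outcome.value, "-> retry allowed:", may_retry(outcome))
```

The three assertions are the product decision. Once they exist, a refactor that retries on timeout fails a check instead of passing a glance.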

The drift nobody owns

Here is the failure that is most clearly real and most clearly orphaned, and it is the reason this deserves its own chapter rather than a paragraph in someone else’s.

When an agent makes many locally correct changes over time, the codebase drifts. Each change does what it says; together they pull the system away from its own architecture, until the patterns stop being consistent and the thing becomes, in the phrase from the research, a system nobody understands, in production, waiting to fail. The eight-hundred-percent churn is the visible tremor of this drift. And there is no standard tooling that catches it as it happens, no default CI gate for “this change is locally fine and architecturally wrong,” and, most damningly, no role on most teams whose job it is to watch for it. Architecture-as-code tests flag the drift after it has landed. The agent-guidance file constrains it weakly. Senior audits catch it occasionally, when the senior has time, which we have established they do not. The detection of architectural drift is on nobody’s job description and in nobody’s pipeline.
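For readers who have not seen one, an architecture-as-code test is a small thing. The sketch below is a minimal example using Python’s standard ast module, and the layering rule in it is invented: the module names payments.api and payments.db are hypothetical, as is the rule that one may not import the other. It checks a single declared boundary and nothing else, which is exactly its limit. It can only flag drift someone thought to write a rule for, and only once the change exists.

```python
import ast

# Hypothetical layering rule: the API layer must not reach into the DB layer.
FORBIDDEN_IMPORTS = {
    "payments.api": {"payments.db"},
}


def imported_modules(source: str) -> set[str]:
    """Collect the absolute module names imported by a piece of source code."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module)  # relative imports are skipped in this sketch
    return names


def boundary_violations(module_name: str, source: str) -> list[str]:
    """Return the imports in `source` that cross a forbidden boundary."""
    banned = FORBIDDEN_IMPORTS.get(module_name, set())
    return sorted(
        imported
        for imported in imported_modules(source)
        if any(imported == b or imported.startswith(b + ".") for b in banned)
    )


# Example: a handler in the API layer that quietly imports a DB model.
handler_source = "import json\nfrom payments.db.models import Charge\n"
print(boundary_violations("payments.api", handler_source))
```

A check like this is worth having, and it is not the missing gate. No rule in that dictionary can say that a change is locally fine and architecturally wrong in a way nobody anticipated.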

There is a second drift, in the deployed agents themselves, where the behavior shifts over time, and that one is better theorized, with stability indices and similarity-based monitoring and even a working production system where a supervisor agent detects and remediates another agent’s configuration drift. Tooling is emerging there. But put the two together and the organizational picture is the same, and it is the book’s recurring shape: the tooling exists, partially; the organizational home does not. Codebase drift gets treated as a platform-and-governance problem split across developers, admins, and team leads, owned by everyone and therefore no one. Deployed-agent drift lands on an AgentOps function where one exists and on the general on-call where it does not. No single accountable function unifies “the codebase is decaying” and “the agents are drifting” under one name. The failure is not a missing tool. It is a missing seat at the table.

The apprenticeship problem, stated plainly

The last consequence is the one I am least certain about and least willing to wave away, because the honest position is an uncomfortable one.

The seniors who can do the only job that now matters, the system-level judgment, earned that judgment by doing the work the agent now does. They wrote the routine code, broke it, debugged it, sat through the across-the-desk review where someone explained not just that their approach was wrong but why a different approach fit this system. That was the transmission mechanism, and code review in particular was knowledge transfer the agent cannot replicate: the agent can tell a junior that a line is inefficient, but it cannot explain why a different design fits the particular history and constraints of this codebase, because it does not know the history. If the agent now does the routine work, the mechanism that produced the next generation of seniors is being removed at the same moment the system comes to depend entirely on seniors.

The labor data is alarming and I am going to be careful with it. The signals are concrete: a Stanford analysis of payroll records found employment for the youngest software developers, the twenty-two to twenty-five cohort, down roughly a fifth since the wide release of generative coding tools, even as employment for older engineers held or grew; separate hiring trackers put entry-level software postings down sharply over the same window. The direction is consistent across sources. But the magnitudes vary widely, the methodologies are not all equally sound, and a payroll dip concentrated in a couple of years is not the same as proof that the junior pipeline has structurally collapsed. So I will claim only what the evidence supports. The moment now rewards accumulated judgment over the ability to write a line, decisively. That is documented. Whether the industry has any answer for how the next generation acquires that judgment when the practice that built it is being automated away is not documented, because the answer does not yet exist. The most suggestive thing anyone has tried is a deliberate learning mode, where the agent hands small tasks back to a less-experienced engineer specifically to keep them in the loop, friction added on purpose to preserve learning. Whether that is enough is unknown. This is the largest open question in the chapter, and the right thing to do with a largest open question is to name it as one, not to resolve it with a confident sentence it has not earned.

The developer’s change-management job, then, is not technical. It is to make the implicit explicit: to name which decisions still route through a senior by design rather than by subtraction, to protect apprenticeship deliberately by giving juniors AI-assisted-but-not-absorbed work that keeps the learning loop alive, and to maintain the agent-guidance file as the organizational memory that outlives both the model and the team. None of that happens by default. By default the seniors get buried, the juniors get hired less, the codebase drifts, and the throughput chart keeps going up until the incident.

Who answers for it

Return to the title, because the chapter can now answer it. Who wrote this code? In the literal sense, the agent did, and the question is a category error, because authorship was never where the accountability lived. Who answers for it? On a team that has thought about this, a named human: the senior who owns the architectural judgment for this domain, given the time and the standing to exercise it, supported by the guidance file they wrote and the layered review that shrinks what reaches them. On a team that has not thought about it, the answer is the one the payments team discovered, that accountability defaults to whoever is holding the incident when it goes off, which is the most expensive and least fair way an organization can assign responsibility.

The agent changed which half of the work is the human’s, and it changed it to the hard half, and it raised the volume on the hard half at the same time. The teams that come through this are not the ones with the best agent. They are the ones that looked at the inverted pipeline and decided, on purpose, who owns the output side: who reviews, who hardens, who watches the drift, who protects the people who will be the seniors in five years. That is a staffing decision and a design decision, not a tooling decision, and like the staff engineer watching the junior’s face, it is mostly about refusing to mistake a change that runs for a change someone understands.