Chapter 12

The Personnel Infrastructure That Does Not Exist

The first eleven chapters of this book teach you how to build, evaluate, observe, and govern agentic AI products. Each chapter assumes that the institutional machinery to absorb a new actor exists, somewhere in the organization, ready to be applied. This chapter argues that it does not. Chapter 13, the field manual, follows as the practical companion: the checklists, diagnostic questions, and red flags you use when the work in this chapter has not yet been done at the institutional level.

Every other category of actor an organization adds to its workflows has an existing infrastructure. A new employee has HR. A new contractor has procurement. A new vendor has vendor management. A new system has IT operations. A new clinical device has biomed and credentialing. The AI agent fits none of these cleanly, and the category it actually belongs to has not been built yet. We treat it as a software integration because that is the only category we have. The misfit is what makes the problem invisible until it surfaces as a supervision failure.

The chapter is structured around four facets of the missing infrastructure. The first names the missing category: AI agents are structurally team members but they skipped every gate. The second names the missing assessment: we have fifty years of personality and team-fit instruments for human hires and have pointed none of them at the agents. The third names the missing supervisor: the role that performs rate-aware sampling, drift detection, and validator rotation does not exist yet. The fourth describes the individual-practitioner pattern that responds to all three when the enterprise has not built them.

None of the four facets are technical problems. The technology is settled. All four are personnel problems masquerading as workflow problems. The PMs and engineering leaders who recognize this and design for it have a real opportunity to fill institutional gaps that nobody else is looking at yet.

The Missing Category

A new medical license is sometimes called a “license to kill.” Formally authorized to do anything. Real-world experience close to nothing.

I worked as an EMT physician at a cardiac center shortly after my internship ended. The center had a team of cardiac nurses, most of them with twenty years on the job. Subscribers transmitted their ECGs over the phone. The nurses read them, made decisions, dispatched ambulances. I was the MD on shift. On paper, the supervisor. In practice, I was learning from them while pretending to supervise.

What I learned, fast, was who to trust without checking. Who to validate twice. Who I wanted on the ambulance with me. Who I could hand a code to if it came to that.

None of that was on their credentials. It was how they behaved under pressure. Whether they admitted uncertainty. Whether they said “I’m not sure, take a look” instead of presenting every reading as certain. Whether their tone in a chaotic moment got tighter or looser. Whether they communicated when things went wrong, or covered.

Twenty years of experience meant something. But on the actual shift, what mattered was the qualities HR tries to approximate in an interview and never fully can. Trust. Calibration. Composure. Honesty about what you do not know.

Now consider every AI agent your team deployed last quarter. Each one joined that ride as a new team member. Nobody vetted any of them for it. No interview. No reference check. No probationary period. No question about uncertainty.

The reframe

Some readers will call this anthropomorphism. Treating a piece of software as if it had agency, judgment, or personhood. The objection is fair, and worth naming directly. I am not claiming the agent thinks the way a nurse thinks. I am claiming the org chart does not know the difference. The supervisory infrastructure that exists for an actor exists because an actor takes action, not because an actor is conscious. Removing the consciousness does not remove the action. It just removes our excuse for skipping the infrastructure.

If an agent acts in your workflow, it is structurally a team member, regardless of what is happening or not happening behind its API. It generates output. It makes decisions inside an authority boundary. It interfaces with humans who treat its output as input to their own work. That is a team. There is no other word for it.

Some readers will push back further. AI agents are not team members, they are tools, APIs, vendor products. Fine. Call them what you actually have: persistent contractors with no statement of work, no acceptance criteria, no performance bond, no termination clause. The HR question becomes a procurement question. Your AI agents skipped that infrastructure too.

Either way, employee or contractor, the same gap. An actor in the workflow without the institutional machinery that exists precisely to make new actors safe to add.

What HR actually does

I have sat through enough hiring processes to know what they are designed to catch. References. Probationary period. Performance review. Those three alone do most of the work. The rest of the HR machinery is documentation.

The point of the machinery is not to be thorough. It is to refuse to add an actor to the workflow until somebody who is accountable has looked at them and decided.

The AI agent your team deployed last quarter went through none of it. Someone watched the vendor demo. Someone got procurement approval. Someone integrated the API. The thing you required for the receptionist, you did not require for the agent that drafts clinical documentation. The thing you required for the junior engineer, you did not require for the agent that proposes code changes.

The justification, when you press on it, is “but it’s not really a person.” True. And not the point. The supervisory infrastructure required for a person who runs your workflow is required for any actor who runs your workflow. Removing the personhood does not remove the need. It just removes the paperwork.

Culture fit, non-locally determined

One of the gaps is structurally worse than the others. Culture fit, on the standard HR list, is the evaluation that catches misalignment between an actor’s default behaviors and the organization’s norms before the actor has authority. For human hires, defaults are local. The candidate developed them through prior employers, education, and lived experience, and the hiring organization evaluates fit against its own culture during the interview. The defaults can be observed, probed, and rejected.

For AI agents, defaults are not local. They are baked in upstream by the model provider during training. When an agent encounters an ambiguous instruction, what does it default to? Conservatism or speed? Escalation or autonomy? Caveats or confidence? Those defaults are determined by a company you do not work at, by people you have never met, optimizing for goals that may or may not align with your culture. A pharmaceutical company that prizes conservative, escalation-heavy clinical decision-making has deployed agents whose defaults were tuned by an AI lab whose culture prizes speed and breadth. Nobody checked whether the cultures match. Nobody could have.

This is not a metaphor. It is a structural property of foundation-model-based agents: their cultural defaults are non-local. Culture fit, the one HR mechanism that exists specifically to catch this kind of misalignment, cannot be performed by the customer organization at all. Not because they forgot. Because the evaluation surface is not available to them.

The Missing Assessment

For years, every manager I worked with was deep red.

That is a Lumina color. Many enterprises run Lumina assessments and workshops for senior staff: a questionnaire, a personal profile in four colors, then a facilitated session where you compare your profile with colleagues and managers to surface the differences out loud. The deliverable is not the report; it is the conversation. You leave the workshop knowing what your default style is, what theirs is, and where the friction is going to come from when the calendar gets tight.

The workshop set hands you four small foam cubes, one per color, with a phrase on each side. The green one says “show me you care.” The yellow one says “involve me.” The blue one says “give me details.” The red one says “be brief, be bright, be gone.” That last phrase is one of those lines that lodges in your head because it is funny and accurate at the same time. You would have recognized the type without the assessment. Direct. Outcome-first. Allergic to setup.

I am not red. I am a storyteller. I lead with context. I land the point at the end of the paragraph, which is sometimes the end of the second paragraph.

For fifteen years that mismatch was an operating problem, not a personality problem. The mismatch did not fix itself. We fixed it deliberately. I learned to compress, get to the point faster, write the bottom-line-up-front version of every email. They learned to wait, ask one more question, let the story finish before reaching for the bullet points. The workshop did not give us the answer. It gave us shared language about what we were each defaulting to, and an explicit working contract about how we were going to meet in the middle.

That is the only reason enterprise personality assessment is worth the money. Not the report. The conversation it forces.

Now consider every AI agent your team deployed last quarter.

What we already know how to do, and what we do not

You can write a persona prompt and the model will follow it. You can construct an ENTP brainstormer that leads with the unexpected angle, defaults to “yes and,” and resists summarizing at the end. You can construct an ISTJ auditor that reads the full submission before commenting, labels findings as BLOCK / WARNING / NOTE, and refuses to soften. Both work. Multi-agent rooms with named roles can produce better output than the human workshop they replaced, and that work depends entirely on the agents having distinct, intentional, persistent personalities.

That part is solved. You can dress an agent in whatever personality you want, as long as you write the prompt.

What is not solved is what happens when the prompt runs out.

The default underneath

The persona prompt is the costume. The model is the actor.

When the instruction is clear, the costume holds. The ENTP brainstormer brainstorms. The ISTJ auditor audits. When the instruction is ambiguous, when the context window gets long enough that the persona becomes a faint signal, when the user asks the agent something the persona did not anticipate, the costume slips and the actor underneath responds.

Different models have different default personalities. Sonnet 4.6 underneath is light, fast, occasionally funny, gets to the point. Opus 4.7 underneath is serious, layered, takes longer to land. Put both in an ENTP brainstormer costume, and one of them is a more convincing ENTP than the other. Put both in an ISTJ auditor costume, and the asymmetry reverses. The costume is real. The actor is realer. The actor wins when the stakes get high or the prompt gets thin.

This is observable to anyone who has worked with multiple frontier models for more than a few weeks. It is not metaphor. The defaults under ambiguity, under pressure, are different across models, and they are different in ways that map predictably onto the personality frameworks enterprise hiring has used for decades. Lumina is one. DISC, MBTI, Five-Factor, Hogan do similar work, by different names. They were built to make explicit what otherwise stays implicit: people have defaults, the defaults survive coaching, and team performance depends on whether the defaults are legible to the rest of the team.

We have those instruments. We have not pointed them at the agents.

Confidence calibration and reading the room

One default deserves to be named specifically. Confidence calibration. In a high-stakes team, the most valuable colleague is often the one who says “I’m not sure, take a look.” Not the one who presents every reading as certain. That is what the nurses at the cardiac center taught me, and it is what every trauma team, every flight crew, every well-run code knows in its bones. The team’s safety depends on the willingness of its members to declare uncertainty out loud.

Agents do not declare uncertainty out loud. They present. Some softly, some confidently, but almost none of them have a built-in equivalent of the nurse who pauses, says “this one is borderline,” and asks for a second look. The architecture does not produce that move. The training does not reward it. The product surface rarely surfaces it to the user even when the model internally has some probability estimate that would justify pausing.

This is one of the least solved problems in production AI, and it is one of the most consequential. The entire downstream supervisory structure is built on the assumption that the supervisor can tell the difference between confident outputs and uncertain ones. When the agent presents everything with the same register, the supervisor has nothing to allocate attention against.

There is a related skill, harder to name and harder to teach, that the cardiac-center nurses had and the agents do not. Reading the room. A trauma nurse who responds to every question with a full differential is dangerous. A junior resident who answers “fine” when asked about a crashing patient is dangerous. The skill is knowing which situation you are in, and changing the register accordingly. Thoroughness when the stakes call for it. Brevity when the stakes call for that. Pushback when the team needs a check. Compliance when the team needs speed.

In healthcare especially, this skill is the job. Talking to a healthy twenty-year-old at an annual visit is not the same as talking to an elderly woman who just learned she has cancer. Talking to a patient who arrived alone is not the same as talking to one whose family is in the room. Same diagnosis, same clinical facts, completely different register required.

Agents do not read the room. They have a register they default to, and they stay there.

Every model upgrade is a personnel change

The enterprise has spent fifty years running personality assessments on humans because the resume is not the colleague. The output is not the answer; the output is a shared vocabulary that lets a team adjust around predictable differences.

An equivalent assessment for an AI agent is not science fiction. It is a test suite. You construct a battery of ambiguous scenarios, each one designed to surface the agent’s default when the persona prompt is silent or contested. You observe what it reaches for under stress, what it hedges, what it asserts confidently, when it volunteers caveats, when it suppresses them. You score the responses across the same kind of dimensions a Lumina profile uses, or you invent dimensions specific to the agent role.

You run this once per model version, before deployment. You publish the profile to the team that will use the agent. The team learns what to expect under ambiguity. When the next model version ships, you run the same battery, compare the profiles, and tell the team explicitly what changed.
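The comparison step can be sketched in a few lines of Python. Treat this as an illustration under stated assumptions, not a finished instrument: the dimension names, the 0-to-1 scores, and the 0.15 threshold are all hypothetical, and the sketch assumes each response in the battery has already been scored by a human rater or a rubric.

```python
from statistics import mean

# Hypothetical scored responses: one dict per ambiguous scenario,
# each dimension rated 0.0 to 1.0 by a human rater or a rubric.
# The dimension names are illustrative, not a standard instrument.
Scores = list[dict[str, float]]


def build_profile(scored: Scores) -> dict[str, float]:
    """Average each dimension across the battery to get one profile."""
    dims = scored[0].keys()
    return {d: round(mean(s[d] for s in scored), 3) for d in dims}


def profile_diff(old: dict[str, float], new: dict[str, float],
                 threshold: float = 0.15) -> dict[str, float]:
    """Return the dimensions whose average moved by more than `threshold`."""
    return {
        d: round(new[d] - old[d], 3)
        for d in old
        if abs(new[d] - old[d]) > threshold
    }


# Same two scenarios, run against the old and the new model version.
old_version = build_profile([
    {"escalates": 0.8, "volunteers_caveats": 0.7, "brevity": 0.4},
    {"escalates": 0.6, "volunteers_caveats": 0.9, "brevity": 0.2},
])
new_version = build_profile([
    {"escalates": 0.3, "volunteers_caveats": 0.7, "brevity": 0.5},
    {"escalates": 0.5, "volunteers_caveats": 0.8, "brevity": 0.3},
])

changes = profile_diff(old_version, new_version)
print(changes)
```

With these made-up numbers the only dimension that crosses the threshold is `escalates`, which drops by 0.3. That single line is the thing you publish to the team: the new version escalates less under ambiguity than the one they learned to work with.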

None of this is invented. The methodology has been running on human hires for half a century. The test infrastructure is comparable in difficulty to the eval suites the AI field already builds, and considerably easier than the alignment evals it builds for safety. The reason it does not exist is not technical. It is that nobody has named the category.

Every model upgrade is a personnel change nobody approved. Even from the same vendor, even with the same brand name on the model, the upgrade replaces a team member who has earned a working contract with their colleagues. The team has to renegotiate the working contract. For most enterprises running internal copilots backed by frontier models, that renegotiation is happening invisibly, every time the underlying model updates, to thousands of users at once, without warning.

Concept

Every Model Upgrade Is a Personnel Change

When the underlying model updates, even silently, even under the same brand name, you have swapped a team member. Its defaults under ambiguity shifted, its register shifted, the working contract your people built with it is now void, and nobody got to interview the replacement. Treat a model version bump the way you would treat a key hire leaving and a new one starting: re-profile, re-onboard, and tell the team what changed.

If your engineering org’s productivity drops three to six weeks after a quiet model upgrade, you are not seeing feature regression. You are seeing a workforce that just had a team member replaced and did not get to interview the replacement.

The Missing Supervisor

It is Thursday, 5:47 p.m. The senior engineer has nineteen pull requests left in the queue and a meeting at six.

The first PR opens. The diff is two hundred and forty lines, generated by Claude Code in eleven seconds. She reads the title, scans the file changes, checks that the tests are green, glances at the function names. Eight seconds. Approve. Next.

By PR fourteen she has stopped reading the diffs in full. She is looking at the test results, the CI status, and the descriptive PR title. If those three things are clean, the PR moves. The thing she would have done four years ago, which is to read the actual code and ask whether the implementation matches the intent, would take her forty minutes per PR if she did it properly. There were nineteen PRs in the queue and thirteen minutes to clear them.

This is the supervisory channel. This is what it actually looks like in the most engineering-rigorous sector that exists.

She is not negligent. She is doing what the cognitive math tells her to do. The agent produced this code in eleven seconds. Reviewing it with the depth she would give human-written code would take her forty minutes. The math says: skim.

By the time she gets home, she will have approved nineteen pull requests. Her probationary period for each agent that wrote them was less than a minute.

The channel collapses twice

The first collapse is happening now, through speed asymmetry. The human cannot validate at the speed of generation. So the human starts heuristic-validating, and heuristic-validation is not validation.

The second collapse is happening through skill atrophy, and it has an eighteen-month time horizon. Or sometimes much faster than that.

This is not a guess. Medicine has already measured it. Experienced specialists were given an AI tool that flagged what to look for during a procedure. Measured a few months later working without it, their unaided detection rate had dropped sharply: a meaningful decline, after only months with the tool, in a skill those physicians had spent years building. The deeper lesson is the timeline. The erosion was fast, it was measurable, and the people losing the skill were exactly the experts everyone assumed were safe. The mechanism is not clinical. It is the same one closing on your senior engineer.

The engineering version is the prediction. A senior engineer in 2026 can still validate AI-generated code because she grew up writing code by hand. Her pattern recognition was built on years of human-authored work. Ask the same question about a senior engineer in 2028. She will have spent two years primarily validating AI output instead of writing original code. Her validation capacity will be weaker, not because she is a worse engineer, but because she has less recent practice with the underlying skill that validation depends on. The validator atrophies as the validated work grows.

The channel collapses twice. Once now, through speed asymmetry. Again in eighteen months, through skill atrophy. Then the question is no longer “can the validator keep up?” It is “who validates the validator?”

The codebase the senior engineer no longer recognizes

There is a layer underneath the validator-atrophy problem that is even less visible, and worth naming because it changes what the senior engineer is actually losing.

The agent does not, by default, write code in the dev manager’s style. It writes in its own. Different naming conventions. Different comment density. Opus in particular over-comments. Different patterns for error handling. The default Claude Code output and the default Codex output do not match each other, and neither of them matches what the human team was writing eighteen months ago.

This is configurable. You can give the agent a style guide. You can point it at the existing codebase and instruct it to match. You can write a `CLAUDE.md` or a set of cursor rules with the team’s conventions. The agent will follow them.

Here is the part that has not been named correctly. Those configuration files are the new-hire onboarding process. That is what they are. The list of code conventions, the architectural decisions encoded as “here is how we do it on this team,” the explicit rules about what good looks like. Every well-run engineering team already has these documents. They used to live in a Confluence page that new hires read in their first week, and then they lived in the code review feedback the new hire got for their first three months.

For the agent, none of that gradual absorption happens. The agent reads the rules each time, follows them within the limits of the prompt context, and starts again the next session. The onboarding has to be explicit, written down, and continuously enforced because there is no “first three months” during which the agent organically learns the team’s style.

If you have not written your team’s `CLAUDE.md`, your codebase is being written by an employee you did not onboard.

What HR figured out fifty years ago

The hardest problem in AI supervision is well-traveled HR territory. It has been worked on for half a century. The AI field is reinventing the answer from scratch, not because anyone is ignoring HR, but because HR was not in the room when the norms were being formed.

The problem HR solved: how does a manager supervise direct reports whose capability exceeds their own? Every engineering manager supervising senior engineers has lived this. So has every CMO with domain specialists, every hospital administrator with physicians. The manager is not the smartest person in the room. The manager is the accountable person in the room.

The answer is structural. You stop supervising the process and start supervising the outcome. You build trust thresholds that loosen as the report demonstrates reliability and tighten when the report demonstrates drift. You construct peer review structures that catch what the manager cannot. You build a culture of declared confidence so the report says “I am sure” or “I am guessing” and the manager treats those differently.

If those phrases sound familiar, you are describing AI supervision. You are also describing what every engineering manager learns in their first year. The vocabulary is different. The structure is identical.

The human analogy does break in one important place. With human reports, the manager has an escape valve: the manager’s value is judgment, context, accountability; the report’s value is technical depth. The roles are partitioned, and the manager knows where they add value versus where they would just add latency. The AI version collapses the partition. The agent is superior across most of the task surface. The supervisor cannot easily locate where their judgment is the thing being asked for. That is the part the field has not yet named.

What we need to build

The supervisory structure for AI agents looks like the HR structure for human direct reports who outpace their manager, plus three additions specific to the speed-and-volume problem.

The first is rate-aware sampling. If the agent produces at ten times the human rate, the sampling cadence has to account for that. Not “review one in ten.” Review the cases that statistically matter: the high-consequence outliers, the cases where confidence is borderline, the cases that look like prior failures. Sampling becomes a triage problem, and the triage logic has to be designed deliberately.
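A minimal Python sketch of that triage logic follows. Everything in it is an assumption made for illustration: the `AgentOutput` fields, the equal weighting of the three signals, and the idea that a usable confidence estimate exists at all, which the earlier section argues is often not the case.

```python
from dataclasses import dataclass


@dataclass
class AgentOutput:
    """One unit of agent work awaiting review. All fields are hypothetical."""
    id: str
    consequence: float            # 0..1, estimated cost if this output is wrong
    confidence: float             # 0..1, declared or estimated confidence
    resembles_past_failure: bool  # matched a known failure pattern


def review_budget(reviewer_minutes: int, minutes_per_review: int) -> int:
    """How many deep reviews the human can actually afford in this window."""
    return reviewer_minutes // minutes_per_review


def triage_score(o: AgentOutput) -> float:
    """Higher means more deserving of a deep review."""
    borderline = 1.0 - 2.0 * abs(o.confidence - 0.5)  # peaks at confidence 0.5
    prior = 1.0 if o.resembles_past_failure else 0.0
    return o.consequence + borderline + prior


def select_for_review(outputs: list[AgentOutput], budget: int) -> list[AgentOutput]:
    """Spend the whole budget on the highest-scoring outputs."""
    return sorted(outputs, key=triage_score, reverse=True)[:budget]


queue = [
    AgentOutput("pr-1", consequence=0.9, confidence=0.55, resembles_past_failure=False),
    AgentOutput("pr-2", consequence=0.1, confidence=0.95, resembles_past_failure=False),
    AgentOutput("pr-3", consequence=0.3, confidence=0.90, resembles_past_failure=True),
    AgentOutput("pr-4", consequence=0.2, confidence=0.99, resembles_past_failure=False),
]

budget = review_budget(reviewer_minutes=80, minutes_per_review=40)  # 2 deep reviews
picked = [o.id for o in select_for_review(queue, budget)]
print(picked)
```

Here the two deep reviews go to `pr-1` (high consequence, borderline confidence) and `pr-3` (looks like a prior failure). Note what `review_budget(13, 40)` returns for the Thursday scene that opened this section: zero. The budget has to be set before the queue fills, not discovered at 5:47 p.m.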

The second is continuous drift detection that does not depend on human attention. The senior engineer at 5:47 p.m. is not going to notice that the agent’s code style has shifted slightly over six months. The system has to notice. Drift detection is the supervisory work the human cannot do in real time.
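One way to let the system notice, sketched in Python. The metric (comment density per PR), the window sizes, and the threshold of three standard errors are assumptions chosen for the example; any per-output measurement the team cares about could take the metric's place.

```python
from statistics import mean, stdev


def drift_alert(baseline: list[float], recent: list[float],
                z_threshold: float = 3.0) -> bool:
    """Flag drift when the recent mean sits more than `z_threshold`
    standard errors away from the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any movement at all counts as drift.
        return mean(recent) != mu
    standard_error = sigma / len(recent) ** 0.5
    return abs(mean(recent) - mu) / standard_error > z_threshold


# Hypothetical metric: comment lines divided by total lines, one value per PR.
baseline = [0.10, 0.12, 0.11, 0.09, 0.10, 0.11, 0.12, 0.10]
steady = [0.10, 0.11, 0.12, 0.10]
shifted = [0.15, 0.16, 0.14, 0.17]

print(drift_alert(baseline, steady))
print(drift_alert(baseline, shifted))
```

The steady window does not trip the alert and the shifted one does. The check runs on every merge and needs nobody's attention until it fires, which is the property that matters.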

The third is the rotation of validators back through original work. If the validator atrophies because she stops writing code, the structural answer is to ensure she does not stop. She writes some code every week. She reviews some agent PRs every week. The mix is deliberate, because the validation skill depends on the writing skill, and the writing skill depends on practice.
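Making the mix deliberate means measuring it. A small Python sketch, assuming a hypothetical weekly activity log and an arbitrary 25 percent floor on hands-on authoring; the right floor is a judgment call for each team, not something this chapter can supply.

```python
from collections import Counter


def authoring_share(week_log: list[tuple[str, float]]) -> float:
    """Fraction of a validator's logged hours spent writing original code."""
    hours = Counter()
    for activity, spent in week_log:
        hours[activity] += spent
    total = hours["authoring"] + hours["reviewing_agent_prs"]
    return hours["authoring"] / total if total else 0.0


def needs_rotation(week_log: list[tuple[str, float]],
                   min_share: float = 0.25) -> bool:
    """True when authoring time has fallen below the agreed floor."""
    return authoring_share(week_log) < min_share


# Hypothetical log entries: (activity, hours).
review_heavy_week = [
    ("reviewing_agent_prs", 6.0),
    ("authoring", 1.0),
    ("reviewing_agent_prs", 5.0),
]
balanced_week = [("authoring", 4.0), ("reviewing_agent_prs", 8.0)]

print(needs_rotation(review_heavy_week))
print(needs_rotation(balanced_week))
```

The review-heavy week, at one hour of authoring in twelve, falls under the floor and flags the validator for rotation. The balanced week does not.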

None of this is technically hard. None of it is conceptually new. The infrastructure for human supervision-of-smarter-reports has been refined for fifty years. The additional pieces (sampling, drift detection, validator rotation) are well within the engineering capability of the teams shipping these agents.

The question is whether the institutional category exists to do the work. Right now it does not. The senior engineer is doing the job of three roles at once: writer, reviewer, accountable supervisor. The supervisor role is the one she is failing at, not because she is bad at it but because the role was designed for a different time scale.

What One Person Can Build Without Waiting

I had undiagnosed ADHD for most of my life.

For decades, sitting in a frontal classroom was the wrong delivery mechanism for the way my attention actually operates. I skipped classes as an undergraduate and went to the library. I ran my own research, wrote down the ideas I was tracking, photocopied the pages from the books that mattered, built a personal binder of references that bore no resemblance to the official syllabus and that I returned to constantly.

Medical school worked the same way. The lecture hall was a stretch. I mostly studied from copies of other students’ notes, which I treated as raw material rather than authority. I extracted, reorganized, annotated, cross-referenced. The notebook I built became the thing I studied from. The lectures were the source. The notebook was the curriculum.

That pattern (library work, active synthesis, building my own curriculum from primary sources) was the only way I could learn at any real depth. It worked. It is also slow. The depth I could reach in a week was limited by how many books I could carry from the stacks, how many photocopies the library would let me run, how much I could read between everything else demanding my attention. I knew the method. I did not have the scale.

Recently, the constraint disappeared.

The choreography

The first three sections of this chapter describe what is missing structurally inside enterprises. This section describes what one practitioner can build without waiting for the enterprise to catch up.

The argument and the spine are always mine. The lived experience is mine. The editing and the final calls are mine. Everything else is a deliberate orchestration of five different AI systems, each doing the thing it does best.

I write a one- to three-page spine document by hand: thesis, argument, personal experience, three or four claims that need evidence, gaps I need filled. Then I open three deep-research sessions in parallel (Gemini, ChatGPT, Perplexity), feed the same questions to each, and read the three reports side by side. The agreement is usually the canonical evidence. The disagreement is usually where the interesting argument lives. The omissions are what I learn to chase next. Triangulation by design.

Claude becomes the writing partner. I hand it the spine plus the synthesized research and ask for a draft. We iterate, sometimes ten or fifteen passes, until the piece sounds like me and makes the argument cleanly. Before publication, I run every factual claim, every citation, every dated study through Perplexity in validator mode. Fact-checking by a different model from the one that drafted.

The role nobody is writing about yet is the sixth: voice calibration across models. After the draft is technically correct, it still has to sound like me. Claude’s drafts read as Claude unless you actively pull them toward your own voice. Different versions bleed differently. The version of this step that works for me is cross-model coaching. I hand a draft to a different model that has months of accumulated context on my voice. It reads the draft, names the tells, and returns a structured list of voice issues. Five concrete rules come back from one pass. I give them to the drafting model. The next draft is materially closer to my voice.

That is the choreography. Six roles. I am the supervisor at the top of the loop, doing the work that none of them can do: the thesis, the personal experience, the final calls about what is true and what ships.

The catalog as the curriculum

I started keeping a catalog of every article I wrote. The first entries were just titles and dates. Within a month, the catalog had grown to include the core argument of each piece, the key evidence I cited, the frameworks I introduced or referenced, and the cross-links between articles. By the time I noticed what was happening, the catalog had become a working knowledge base.

The catalog is not the body of work. The articles are. The catalog is the index, the synthesis layer that lets me write the next article without losing track of the previous hundred. It is also what most education programs never give you. A curriculum that is yours, that reflects how you actually think about the field, that compounds with every piece you add.

The catalog also became the working memory for the LLMs and, if I am honest, for me. When I start a new article, I hand the catalog to the model so it can find the prior framework I want to extend, the example I already used, the article I do not want to repeat. The catalog remembers things on my behalf that I could not hold in working memory while doing the job. It also surfaces the gaps. A topic that has no entry is sometimes the most useful information the catalog produces.

I built it because I needed it. The byproduct is that I now have a textbook on the field I work in, written from my own perspective, organized in a way that matches my own working memory, in a field that has no textbook yet. I had to build the course in order to take it.

What this is and what this is not

Done this way, the loop is not a productivity tool. It is an active-learning protocol. Every article is a synthesis exercise that forces me to defend a thesis against three different research traces, work through ten or fifteen drafts of the argument, and check every factual claim against an independent validator. The cognitive work that happens during a single article is, in my experience, equivalent to what I would have learned in a full graduate seminar on the same topic, compressed into a week.

This is the practice the first three sections of this chapter imply but do not name. The validator-atrophy prediction is correct as a default. It is the default because most users are not running a deliberate loop. They are using AI as a content faucet. The faucet produces output, the user signs off on it, the loop has no validator role except by accident. The atrophy follows.

What I am describing here is the same problem from the inside. I am the supervisor in my own loop. The role is not implicit; it is the entire reason the loop exists. The AI does the volume work, the parallel research, the drafting, the fact-checking. I do the thinking that determines what the volume is for.

This is not a substitute for the institutional infrastructure the first three sections described. The active-learning loop solves the individual version of the validator-atrophy problem. It does not solve the enterprise version. An organization deploying agents to a hundred engineers cannot ask each engineer to run a personal five-LLM orchestration. The structural answer (rate-aware sampling, drift detection, validator rotation, the missing supervisor role) still has to be built at the institutional level.

What I would say to a PM asking whether they could do something similar: the specific stack does not matter. The roles matter. Find the version of the spine document that fits your work. Run the research step in parallel, not because more sources are better but because triangulation surfaces disagreement and disagreement is where you learn. Use one model to draft and another to validate. Keep a catalog from the first piece, before you think you need one. Treat yourself as the supervisor at the top of the loop, not as the user of a tool.

The library is still the right model. The library is just bigger now.

What this chapter changes

If you accept the four arguments, three things follow for your work as a PM building agentic AI products.

First, the question to ask before any AI deployment is not “what workflow does this fit into?” It is “what would HR require for a human in this role, and where is our AI equivalent?” If the answer to the second is “we do not have one,” you do not have a deployment plan. You have an unsupervised contractor with API access.

Second, the supervision-of-smarter-reports literature is the right place to look for vocabulary and structure. Not the AI safety literature, mostly written by people who have never managed a team. Not the governance literature, mostly written by people who manage committees rather than reports. The HR literature on delegation, trust calibration, and outcome-based management is the closest thing the field has to a working model. Use it.

Third, the personality and team-fit dimension that enterprise hiring has measured for decades deserves an AI equivalent, and it can be built with infrastructure that already exists. A personality eval is distinct from a capability eval. Capability eval asks “can the model do the task.” Personality eval asks “how does the model do the task when the task has slack.” Both are needed.

We are not missing the technology. We are missing the category, the institutional machinery that exists for every other kind of actor an organization adds to its workflows. We are missing it not because we forgot, but because nobody built the category yet. That is what needs to change.

Agents are structurally team members who skipped every HR gate: no interview, no reference check, no probationary period.
Every model upgrade is a personnel change: a version bump swaps that team member for one nobody interviewed, voiding the working contract.
The missing supervisor performs rate-aware sampling, drift detection, and validator rotation; no current org chart owns this role.
Configuration files are new-hire onboarding; the validator atrophies as validated volume grows, so rotate her back to original work.