Appendix C: The Frameworks
This appendix is the working library of named frameworks that recur across the book. Each entry gives the core insight in one paragraph. The entries are short on purpose. The book is where the reasoning lives. This is the index you flip to during a review meeting when somebody says “what was that thing about the supervisor again,” or when you are writing a spec and want to make sure you are not reinventing a category that already has a name.
Forty-eight frameworks, organized into nine categories that mirror the book’s argument from foundation to personnel infrastructure. The numbering is not sequential within each category; it reflects the order in which the frameworks were originally developed across the body of work, preserved here so cross-references from the chapters and the source articles resolve to the same identifier.
Category 1: Foundation
The vocabulary the rest of the book uses. Five frameworks that establish what is different about probabilistic systems, what the PM is now designing for, and why the data layer keeps appearing inside conversations that started somewhere else.
1. The Iceberg: Context Stays Behind When Data Moves
When data moves from its source system to an AI-capable platform, the visible tip (fields, values, structured records) transfers. The mass below the waterline (relationships, calculated semantics, governance rules, clinical reasoning, organizational conventions) stays behind. What arrives on the other side is technically accurate and operationally thin. The three things lost are the relational web that connected the records, the calculated semantics that defined what concepts meant in your organization’s terms, and the security context that encoded organizational judgment about whose authority is relevant where.
10. Abstraction Layer Obsolescence
Every generation of computing has required humans to speak the machine’s language: punch tape, BASIC, SQL, APIs, prompt engineering. Each abstraction layer has eventually been superseded by a higher-level interface. Prompt engineering is the current layer, and it too will be abstracted away. Durable value is in understanding the outcome you want, not in mastering the current workaround.
17. The Vessel / Logic Distinction
Business and clinical processes have been structurally stable for millennia. Procure-to-pay and the clinical three-way match have not changed since ancient Mesopotamia and Hippocrates respectively. Every technology era has given the stable logic a better vessel: faster, more connected, more scalable. None of them changed what the system understood about the process. Agentic AI is the first vessel that begins to reason over the logic, not just carry it. That is the shift the PM is being asked to design for, named or not.
18. The AI Spectrum (Seven Generations in Healthcare)
The generative AI and agentic AI debate treats three recent categories as the full picture. Healthcare AI has at least seven generations, all still active. Rule-based expert systems (MYCIN lineage, 1970s). Statistical models (APACHE, sepsis scores, 1990s). Deep learning for imaging (2012+). Gradient boosting on EHR data (2015+). Generative AI for documentation and triage (2022+). Agentic systems with tool use (2024+). Neurosymbolic architectures (emerging). Governance and deployment strategy must match the specific generation, not “AI” as an undifferentiated category.
7. Agent-Based Experience (ABX)
AI products now have two simultaneous customers: humans and their AI agents. Designing for one while ignoring the other produces a product that works for the human but cannot be consumed by the agent, or works for the agent but is opaque to the human. ABX is the design discipline for the dual-customer product. The FHIR iceberg is the data-layer version of the same failure: a system that is technically correct for the receiver who reads it and useless to the agent that should reason over it.
Category 2: Discovery and Decision
Frameworks for the question that comes before the build: is an agent the right answer here, and at what scale does the economics work?
27. Agentic Suitability Assessment and Cost Model
Most agentic AI failures are not engineering failures. The model worked. The agent did what it was designed to do. The problem was the problem itself. Two questions must be answered before the first line of code. Is this problem actually suited for an agent? The four suitability conditions are repeats at volume, outcome measurable, tool use bounded, consequences recoverable. And at what volume does it break even? The interactive cost calculator in the chapter lets you stress-test floor price, per-task economics, and coordination overhead against the actual workload curve. Most teams fail the second question without realizing they were supposed to ask it.
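The two questions can be sketched in a few lines. This is a minimal illustration, not the chapter's interactive calculator: the function names, parameter names, and every dollar figure are hypothetical, and the model assumes a fixed monthly floor price plus constant per-task costs.

```python
import math

# Hypothetical checklist keys for the four suitability conditions named above.
SUITABILITY_CONDITIONS = (
    "repeats_at_volume",
    "outcome_measurable",
    "tool_use_bounded",
    "consequences_recoverable",
)


def is_suitable(answers):
    """True only if every one of the four conditions is answered yes."""
    return all(answers.get(name, False) for name in SUITABILITY_CONDITIONS)


def break_even_volume(floor_price, agent_cost_per_task,
                      coordination_overhead_per_task, baseline_cost_per_task):
    """Monthly task volume at which the agent stops costing more than the baseline.

    Assumes (for illustration) a fixed monthly floor price and constant
    per-task costs. Returns None if the agent never breaks even.
    """
    saving_per_task = baseline_cost_per_task - (
        agent_cost_per_task + coordination_overhead_per_task
    )
    if saving_per_task <= 0:
        return None
    return math.ceil(floor_price / saving_per_task)


# Illustrative numbers only; none of these come from the book.
answers = {name: True for name in SUITABILITY_CONDITIONS}
volume = break_even_volume(
    floor_price=12_000.0,
    agent_cost_per_task=0.50,
    coordination_overhead_per_task=0.25,
    baseline_cost_per_task=3.25,
)
print(is_suitable(answers), volume)
```

With these made-up inputs the per-task saving is 2.50, so the agent breaks even at 4,800 tasks per month. If coordination overhead erases the per-task saving, the function returns None: no volume rescues the economics.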
4. The MVP House of Cards
Healthcare AI products are commonly built in MVP layers: validate the clinical workflow, then add features, then add users. Each layer defers governance because governance feels like a cost that cannot be afforded until next quarter. By the time the product has real patients and contractual commitments, the governance backlog is larger than the roadmap can address. The honest account is not that the team was ignoring safety. It was running at a pace that made deferral look reasonable at every individual step, and no floor was built to stop the stacking.
14. Bridge Operator vs Bridge Architect
For twenty years, the PM’s job was to be the bridge: observe the customer, translate needs into capabilities, route to engineering. AI has dropped the administrative tax of that role to near zero. The question is not “how do I be a better bridge?” It is “what becomes possible when the bridge runs itself?” Bridge architects design decision councils, prototype validation loops, and strategy stress tests without engineering. Using AI to write PRDs faster is still being the bridge. Architecting the supervisory system is being the bridge architect.
9. The First-Contact Training Data Mismatch
AI clinical models are trained on documented hospital records. Documentation happens at the end of the clinical story, after diagnosis, after triage, after the critical first-contact signals have resolved. But the AI is deployed at first contact, where the signals the training data never captured are most important. The mismatch is structural, not accidental. The evidence anchor is the Macquarie 2026 replication: ChatGPT Health under-triaged 52% of real emergencies when constrained to forced-choice output, dropping to baseline only when allowed multi-turn free text.
Category 3: Design and Runtime
The four runtime artifacts every agent needs before it ships, plus the framing devices that make the runtime conversation tractable.
11. The Four Design Artifacts
Every agentic product requires four runtime design decisions, whether made deliberately or by default. Default means wrong. The autonomy boundary names what the agent may do unilaterally versus what requires a human decision, made legible at the moment it matters. The approval moment is the designed handoff between agent decision and human judgment: a decision package containing what the agent knows, its uncertainty, the consequences, and the alternatives, not a confirmation dialog. The audit surface is the decision trajectory (observable sequence of actions and retrieved context), not chain-of-thought self-justification, and must include whose authority was delegated and who carries the outcome. The recovery workflow is the compensating action, rollback option, or clear accounting of what cannot be undone. An error message is not a recovery workflow.
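One way to make the approval moment concrete is to treat the decision package as a data structure that can be checked for completeness. The sketch below is an assumption-laden illustration: the class name, field names, and the example action are all hypothetical, chosen only to mirror the four parts listed above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DecisionPackage:
    """Hypothetical shape of an approval moment; all field names are assumptions."""
    proposed_action: str
    what_the_agent_knows: List[str]
    uncertainty: str
    consequences: List[str]
    alternatives: List[str]

    def missing_parts(self) -> List[str]:
        """Names of the parts that are empty, so a thin package can be rejected."""
        parts = {
            "what_the_agent_knows": self.what_the_agent_knows,
            "uncertainty": self.uncertainty,
            "consequences": self.consequences,
            "alternatives": self.alternatives,
        }
        return [name for name, value in parts.items() if not value]


# A bare confirmation dialog carries the action and nothing else.
confirmation_dialog = DecisionPackage(
    proposed_action="Cancel 14 duplicate purchase orders",
    what_the_agent_knows=[],
    uncertainty="",
    consequences=[],
    alternatives=[],
)
print(confirmation_dialog.missing_parts())
```

A confirmation dialog fails all four checks. That is the difference between asking a human to click and asking a human to decide.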
12. Agentic Experience Lifecycle
Agentic experience design is not only what the user sees at the moment of action. It is the governance of the agent’s behavior across its entire operational life. The eight phases are setup, configuration, testing, deployment, runtime monitoring, escalation, incident response, and retirement. Teams that treat only the runtime interaction as “the experience” are designing half a system.
13. Mental Model Declaration
Before designing any agentic experience artifact, the team must declare what kind of system this is. The three types are not interchangeable. A suggestion engine surfaces options and waits; the human always decides and acts. A copilot acts on explicit step-by-step instructions; the human initiates each step. An autonomous actor executes sequences of decisions without further approval within defined boundaries. Users who do not know which system they are operating develop the wrong mental model, and wrong mental models drive both overtrust and abandonment in equal measure.
24. Two-Channel Agentic Design
Agentic systems require two design channels. Channel 1 is the agent itself: autonomy boundary, logging, error handling, recovery workflow. Channel 2 is the human experience of supervising it: engagement, skill maintenance, workflow fit, supervisory interface. Most PMs design only Channel 1. Most change management programs focus only on Channel 2 and treat it as an adoption problem. Neither approach works alone. The actor-to-supervisor transition is the named design problem; it is not an adoption problem. The aviation precedent is Crew Resource Management, built specifically for the actor-to-supervisor transition after UA Flight 173 (December 1978), maintained as recurrent practice through LOSA, not as a course.
22. The Supervisory System (Second Product)
Every AI product ships with a second product attached: the supervisory system that surrounds it. How users know what the AI is doing, how they intervene, whether the workflow supports the engagement the AI was designed to invite. Most teams build only the first product. The Stankowski CHI 2026 paper is the clearest empirical demonstration: 93% of SAP employees said they would adopt a reflection-focused goal-setting AI after a structured session; the same users, on the same AI, reverted to output-extraction behavior once returned to their normal workflow. The structured session supplied three things the production environment did not: bounded time, external accountability, and a shared frame that reflection was the point.
Category 4: Evaluation
The eval frameworks.
25. Three Eval Breaks
The PM’s QA mental model translates almost perfectly to AI evals and breaks in exactly three places. Determinism: a passing eval is a sample from a distribution, not proof of fixed behavior. Compound probability: a 0.95 per-step success rate over ten chained steps yields roughly 60% end-to-end success (0.95^10 ≈ 0.599). Background failure: semantic validation is not state validation; an agent can hallucinate success while the underlying system saw no action. Evals approximate behavior; they do not prove correctness. They bound uncertainty. Production monitoring is an extension of the eval process, not a separate phase. The coverage problem (evals only test what someone thought to test) is why 95% of AI pilots succeed in demo and fail in production.
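The compound-probability break is plain arithmetic, and a short sketch shows how fast it bites. The function name is hypothetical, and the calculation assumes each step succeeds independently of the others, which real agent chains may not satisfy.

```python
def end_to_end_success(per_step_success: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independent steps."""
    return per_step_success ** steps


# Same per-step rate, increasingly long chains.
for steps in (1, 5, 10, 20):
    print(steps, f"{end_to_end_success(0.95, steps):.1%}")
```

At ten steps the chain lands at roughly 60%, matching the figure above. Doubling the chain length drops it well below half.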
5. Standard First, Extension Last
Before creating a FHIR extension, exhaust the standard. An existing element may carry the information. An existing code system may express the concept. An annotation or identifier field may handle it without creating a new namespace. The governance question for every extension is who outside this organization will need to interpret it. If the answer is no one, the field belongs in internal workflow logic, not in the canonical FHIR output. National profile URLs (IL-Core, US Core, UK Core, DE Base) are interpretable across the national healthcare ecosystem. Organizational URLs are private dictionaries that external systems and AI agents cannot read.
6. Receiver-First Design (Top-Down vs Bottom-Up)
Bottom-up data architecture starts from what exists in the source system: inventory what you have, map it, publish it. This produces a comprehensive catalog that is internally coherent and largely opaque to anyone outside the organization. Top-down starts from the use case: what does the receiving side need to act? In FHIR: build the profile from the outside in. Define the canonical output before opening the source system mapping tool. Source fields become implementation details, not the starting point. Receiver-first thinking forces you to separate what belongs to the standard from what belongs to your organization.
Category 5: Observation and Operation
The observation phase, the six instruments, and the failure modes that surface only after launch.
26. Observation Phase and the Six Instruments
The observation phase of the product lifecycle was always supposed to exist; MVP culture quietly dropped it. For agentic AI, this failure is uniquely catastrophic. SaaS tools stagnate when unmonitored; agents compound errors autonomously. A second force compounds the gap: trust builds naturally after launch, supervision erodes gradually, and without a designed corrective loop the agent drifts to an autonomy level nobody authorized. The six instruments are task success rate, unintended action rate, override frequency, confidence calibration, rollback time, and incident recovery time. The metric is the test of the design. Absence of an instrument confirms absence of the corresponding design surface.
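The six instruments can be read off a per-task log, provided the log records the right fields. The sketch below is one hedged way to do it: the record schema, the field names, and the calibration measure (mean stated confidence minus observed success rate) are all assumptions introduced for illustration, not definitions from the chapter.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class TaskRecord:
    """Hypothetical log entry for one agent task."""
    succeeded: bool
    unintended_action: bool
    human_override: bool
    stated_confidence: float                  # agent's own 0..1 estimate
    rollback_seconds: Optional[float] = None  # set only if a rollback ran
    recovery_seconds: Optional[float] = None  # set only if an incident occurred


def six_instruments(records: List[TaskRecord]) -> Dict[str, Optional[float]]:
    """Compute one number per instrument; None means nothing was recorded."""
    if not records:
        raise ValueError("no task records to observe")
    n = len(records)
    success_rate = sum(r.succeeded for r in records) / n
    rollbacks = [r.rollback_seconds for r in records if r.rollback_seconds is not None]
    recoveries = [r.recovery_seconds for r in records if r.recovery_seconds is not None]
    return {
        "task_success_rate": success_rate,
        "unintended_action_rate": sum(r.unintended_action for r in records) / n,
        "override_frequency": sum(r.human_override for r in records) / n,
        # Positive gap: the agent claims more confidence than its results justify.
        "confidence_calibration_gap": mean(r.stated_confidence for r in records) - success_rate,
        "mean_rollback_seconds": mean(rollbacks) if rollbacks else None,
        "mean_incident_recovery_seconds": mean(recoveries) if recoveries else None,
    }


# Made-up records for illustration.
log = [
    TaskRecord(True, False, False, 0.9),
    TaskRecord(True, False, True, 0.9),
    TaskRecord(True, False, False, 0.9),
    TaskRecord(False, True, False, 0.9, rollback_seconds=120.0),
]
report = six_instruments(log)
print(report)
```

A None in the report deserves a question. Either no rollback or incident occurred in the window, or the product has no rollback or recovery path to time.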
21. Interface-Induced Failure
What presents as an AI capability failure is often an interface failure. When a clinical AI is evaluated under conditions that suppress its reasoning mechanism (clinician-authored vignettes, forced A/B/C/D output, single-turn interaction, suppression of clarifying questions), the result measures the product, not the model. The Macquarie 2026 partial replication of the Nature Medicine 2024 ChatGPT triage study showed three of five frontier models moving from between 0% and 24% accuracy under forced choice to 100% under free text. Same models, same scenarios. The only variable was whether the AI was allowed to answer in its own words. Hampton et al., BMJ 1975, n=80: the correct diagnosis was present after history alone in 66 of 80 outpatient encounters. Fifty years later we are building AI triage tools that remove the very mechanism the 1975 data identified as the diagnostic instrument.
30. Upstream-Data-Wrong Hallucination
Standard hallucination taxonomies describe what the model fabricated. The upstream-data-wrong category describes something harder to detect: what the model faithfully reproduced from a complete but misleading input. A patient record can be comprehensive and clinically wrong at the same time. The AI reasons from what it receives. The failure is upstream, in the data, not in the model. Model-level evals test whether the output is faithful to the input. When the input is wrong, the “faithful” output is also wrong, and the eval shows green. This is a data integrity problem that has been redefined as a model output problem. Detection belongs at the data layer, not the model layer.
16. Clinical Quorum Bundle
A senior physician seeing a new consult does not have a complete longitudinal record. They reason under uncertainty, flag what they do not know, and make a calibrated recommendation. AI should do the same. For each specific decision, define the minimum coherent evidence set required to act safely. Not 400 fields. Just enough, assembled from whatever sources actually exist. The companion frameworks are coverage-aware AI (explicit missingness metadata attached to every AI output) and the purpose-scoped data product (narrow scope, real governance, for one decision type).
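A quorum bundle with explicit missingness metadata can be sketched as a small assembly step. Everything named here is hypothetical: the function, the decision type, and the three evidence fields are placeholders to show the mechanics, not a clinically validated evidence set.

```python
from typing import Any, Dict, List, Optional


def assemble_quorum(required: List[str],
                    available: Dict[str, Optional[Any]]) -> Dict[str, Any]:
    """Collect only the evidence this decision needs and report what is absent."""
    evidence = {
        name: available[name]
        for name in required
        if available.get(name) is not None
    }
    missing = [name for name in required if name not in evidence]
    return {"evidence": evidence, "missing": missing, "quorum_met": not missing}


# Placeholder evidence set for one hypothetical decision type.
REQUIRED_FOR_RENEWAL = ["current_medications", "renal_function", "recent_bleeding_events"]

# Whatever the sources actually hold; extra fields are ignored, gaps are surfaced.
record = {
    "current_medications": ["drug_a", "drug_b"],
    "renal_function": None,
    "shoe_size": 42,
}
bundle = assemble_quorum(REQUIRED_FOR_RENEWAL, record)
print(bundle["quorum_met"], bundle["missing"])
```

The `missing` list is the coverage-aware part. It travels with the output, so a downstream reader sees what the recommendation was made without.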
19. The Digital Twin Duality
Healthcare is building two types of digital twins simultaneously. Patient twins model the body: tumor growth, glucose response, adaptive therapy. Physician twins replicate expertise: diagnostic reasoning, documentation, supervised care streams. Both are viable under supervision. The dangerous zone is when they converge and the physician’s role reduces to validation. At 99% twin agreement, the human review becomes a formality. The shift from supervised assistant to autonomous agent happens through gradual drift, not a design decision. The boundary between AI twin (supervised, escalates, physician retains accountability) and autonomous agent (independent, no escalation path, liability unclear) is not technical; it is professional and legal.
Category 6: Supervision and Skill Erosion
The frameworks that explain why the human in the loop is not a fixed input.
33. The Supervision Paradox
Named and applied from Bainbridge (1983, “Ironies of Automation”). The automation that improves normal-state performance is the same automation that degrades the human monitoring skill required for abnormal-state intervention. The more reliable the system, the more thoroughly atrophied the oversight capacity. Two decay pathways. Deskilling: erosion of existing competency through sustained AI use, anchored in Budzyń/ACCEPT (21% relative ADR drop in 19 experienced endoscopists over three months) and Dratsch (automation bias in experienced radiologists with wrong AI). Never-skilling: failure to acquire foundational competency because AI is present during the entire formative window, anchored in Abdulnour NEJM 2025 and the Anthropic coding skills study (novices 50% vs 67% comprehension). The only institutional counter-model is aviation’s mandatory recurrent manual proficiency requirements (EASA SIB 2025-09). No equivalent exists in medicine or software engineering.
28. The Four Apprenticeship Conditions
The medical apprenticeship model built clinical judgment for centuries through four conditions operating in parallel: volume (thousands of cases before pattern recognition becomes automatic), acuity range (full spectrum of complexity, including the common cases that build foundation), failure (small diagnostic errors caught before harm, the curriculum, not the accidents), and feedback timing (correction close enough to the error to wire the two together). All four are now under simultaneous pressure. Volume: AI handles pattern recognition the trainee’s brain would otherwise develop. Acuity range: consumer health platforms and AI triage filter common cases before they reach the teaching clinic. Failure: AI absorbs the small errors that were the curriculum. Feedback timing: paradoxically the one condition AI can improve. Clinical AI product design for training environments is not identical to clinical AI product design for care delivery.
42. The Validator-of-Validator Regress
The human reviewer of AI output progressively loses validation capability because they have stopped practicing the original-work skill that validation depends on. The senior engineer who grew up writing code by hand can still validate AI-generated code through pattern recognition built on years of human-authored work. After two years of primarily reviewing rather than writing, that pattern recognition drifts toward “what AI-generated code looks like” rather than “what good code looks like.” Simultaneously, the AI is being trained on the outputs of these weakening validators, compounding ground-truth signal loss at every cycle. The validator atrophies as the validated work grows. Validator rotation is not optional; the supervisory infrastructure has to enforce it because the cognitive math will never favor it.
43. The Two-Stage Supervision Collapse
The supervision channel between humans and AI agents collapses in two stages with different time horizons. Stage 1 is happening now through speed asymmetry: the agent generates output a hundred times faster than the human can review with depth, and the human starts heuristic-validating because the cognitive math demands it. Heuristic-validation is not validation. Stage 2 emerges over eighteen months through skill atrophy: the human reviewer’s pattern recognition drifts because they have stopped practicing it. The two stages compound. The visible failure mode (speed) makes the invisible failure mode (skill loss) worse, and by the time anyone measures the second collapse, the validator no longer has the baseline to detect what they have lost. Both stages require separate countermeasures.
3. Criterion 4 Failure Mode
The FDA’s Non-Device CDS exemption requires that a qualified clinician independently review AI output before acting. This is Criterion 4. The assumption worked for alert-on-a-screen AI (sepsis score, risk threshold) where a human sees it, evaluates it, and decides. It fails when AI is a continuous presence (chronic disease app nudging behavior daily, never reviewed by a clinician), when AI is an orchestrating layer (ICU multi-stream coordination, no single recommendation to review), and when AI was built in MVP layers (governance was always next sprint, never built). The FDA’s January 2026 guidance names automation bias and says time-critical decisions fail Criterion 4. The problem is named. The replacement framework is not written.
34. Earned vs Scheduled Autonomy
Movement up the autonomy ladder, from suggesting to acting, from supervised to autonomous, should be triggered by demonstrated competence in the specific failure modes that matter for this decision type, not by reaching a review count. Utah’s Doctronic design used a threshold of 250 supervised renewals: once reached, the physician was removed. That is a schedule, not a safety criterion. The schedule measures how many times a physician said yes. It does not measure whether the AI correctly handles the edge cases where autonomous action is inappropriate, nor whether it has a recovery mechanism for when it is wrong. Closed-loop insulin delivery is the right precedent: real-time physiological feedback is architecturally embedded, the sensor reads continuously, and when the model errs the body signals it within minutes. Earned autonomy requires demonstrated competence in the specific failure modes, a real-time or near-real-time signal that catches errors before they compound, and a defined path back down the ladder.
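The contrast between a schedule and a safety criterion can be expressed as two promotion rules. This is a sketch under stated assumptions: the function names, the failure-mode labels, and the 0.99 pass-rate bar are hypothetical. Only the 250-approval threshold comes from the example above.

```python
from typing import Dict, Tuple


def scheduled_promotion(supervised_approvals: int, threshold: int = 250) -> bool:
    """Schedule-based rule: counts approvals, says nothing about failure modes."""
    return supervised_approvals >= threshold


def earned_promotion(failure_mode_results: Dict[str, Tuple[int, int]],
                     has_realtime_error_signal: bool,
                     has_path_back_down: bool,
                     min_pass_rate: float = 0.99) -> bool:
    """Earned rule: every named failure mode tested and passed, plus both safeguards.

    failure_mode_results maps a failure mode name to (passed, attempted).
    min_pass_rate is a hypothetical bar, not a recommended value.
    """
    if not failure_mode_results:
        return False
    for passed, attempted in failure_mode_results.values():
        # An untested failure mode blocks promotion just like a failed one.
        if attempted == 0 or passed / attempted < min_pass_rate:
            return False
    return has_realtime_error_signal and has_path_back_down


# Hypothetical failure modes for illustration.
results = {
    "contraindicated_renewal": (40, 40),
    "dose_change_requested": (0, 0),   # never exercised under supervision
}
print(scheduled_promotion(250), earned_promotion(results, True, True))
```

The same agent passes the scheduled rule and fails the earned one. It collected 250 approvals without ever being shown one of the cases that matters.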
35. Deference Allocation
The design challenge in clinical AI has shifted. The question is no longer whether AI can reason as well as a physician. On some tasks, under some conditions, it demonstrably can. The question is now about deference: how do you build workflows and interfaces that shift reliance toward whichever agent (human or AI) is more likely to be right on this specific case, without collapsing into blind automation on one side or reflexive rejection on the other? Deference allocation is not a static calibration. It depends on the case type, the confidence of the AI output, the acuity of the decision, and the comparative skill of the human reviewer for this specific decision type. A workflow that sets deference at a fixed level is miscalibrated for most of the cases it encounters. The field has spent a decade measuring AI capability. It has not yet built the interface science for calibrating human reliance on that capability case by case.
23. Automation Expectation vs Automation Complacency
Automation complacency, the older idea, describes users trusting a reliable system too much over time. Automation expectation is different: it is the orientation users bring with them before they ever touch your product. By the time users get to your AI, they have already learned from every other AI tool that the deal is hand off work, receive a result. When your product asks them to do something different, they push back because the muscle memory is not there. Onboarding cannot undo a pre-existing condition. The product must either conform to the existing expectation or supply the conditions that make the different interaction viable. A third option, “train the user harder,” is not a design.
Category 7: Governance and Ethics
Frameworks for turning regulation into product specification.
2. The Coverage Gradient
Healthcare AI regulation is not calibrated to actual patient risk. It is calibrated to the clinical claims made. A chronic disease management app that interacts with patients daily for years but avoids making diagnostic claims sits in a regulatory gray zone. A narrow diagnostic tool making explicit clinical assertions faces full device scrutiny. The gap is largest in the middle: tools with long patient contact, persistent influence, and real-world consequences that are framed as “general wellness.” Three zones: consumer mental health chatbots (visible, state-level regulation emerging, Utah HB452 model), the unprotected middle (chronic disease, remote monitoring, medication adherence, behavioral health, the most patient contact and the least governance), and ICU AI orchestration (visible, device framework doesn’t map well to systems-of-systems). Silent degradation, model drift over months without monitoring requirement or regulatory trigger, is the failure mode the gradient hides.
8. The Wall
Medical knowledge has been locked behind institutional walls (journals, boards, guilds, terminology designed to signal that access requires initiation) for centuries. Billions of people paid for brief, metered passage. AI is demolishing the wall by moving knowledge access from credentialed gatekeepers to any patient with a phone. AI challengers are not competing for the same market; they are redefining the playing field. The incumbents who survive will be those who understand what type of innovation they are actually facing.
32. The Michelin Condition
A guide-business-model holds only when accuracy is the mechanism that drives the behavior that drives the revenue. Michelin’s restaurant ratings work because honest ratings make people drive to good restaurants, and driving wears tires. Accuracy is not a value in tension with the business model; it is the business model. The condition fails when the guide’s revenue comes from a party whose interest is not in accurate information but in the audience’s attention or behavior. Clinical AI funded by pharmaceutical advertising does not have the Michelin accident in its favor. Ask whether the platform’s revenue depends on the guide being accurate, or merely on the audience trusting that it is. If the latter, the Michelin condition has failed before anyone makes a corrupt editorial decision.
15. Adaptive Governance
Healthcare governance modeled on banking creates fatal failure modes. Banking transactions can wait; clinical emergencies cannot. The blood bank story: a perfectly compliant multi-step SAP workflow nearly cost a patient’s life because it enforced static governance without modeling urgency. Governance in healthcare must be context-aware. The four requirements are rules that scale with clinical acuity, structured override rights (not hidden workarounds), audit trails that capture why, not just what, and learning loops that update governance when it fails under pressure.
20. The AI Constitutional Runtime Layer
Healthcare AI has design-time and deployment-time governance but no runtime-enforced ethical constraints. Physicians carry the Hippocratic oath as an internalized runtime constraint; it does not switch off. Healthcare AI outputs have no equivalent. Two layers are needed: ethical principles (non-negotiable values) and enforced runtime constraints that translate principles into architecture before the output reaches the patient. Anthropic’s Constitutional AI (2022) and Constitutional Classifiers (2025) demonstrate the mechanism exists. The healthcare-specific rules do not yet. The evidence anchor is the Nature Medicine respiratory failure case: the model identified the danger in its own reasoning and still told the patient to wait. That is a constraint failure, not a knowledge failure.
31. Four-Question Pre-Launch Review (Named Owners)
The operational translation of the Four Design Artifacts and the Six Observation Instruments into a four-question pre-launch review the CEO runs before any agent ships, with named owners. The review converts a values debate into a capability check. Head of engineering owns rollback time: how long from an incorrect action to the last known good state, measured. CFO owns per-task cost: what the agent costs per successful outcome, including review time, rework, and multi-agent coordination overhead. Head of legal owns audit surface: whether every decision the agent makes is reconstructable for a regulator, an auditor, or a customer. Head of product owns the approval moment: where a human is required to intervene, and how that moment is designed so it actually gets used rather than bypassed. The four questions align the C-suite around the same operating discipline without requiring any of them to become AI experts.
Category 8: Personnel Infrastructure
The institutional category the field has not yet built.
38. The Missing Category
Every category of actor an organization adds to its workflows has an existing infrastructure to absorb it. A new employee has HR. A new contractor has procurement. A new vendor has vendor management. A new system has IT operations. A new clinical device has biomed and credentialing. The AI agent fits none of these cleanly, and the category it actually belongs to has not been built yet. We treat it as a software integration because that is the only category we have. The category is the wrong fit, and the misfit is what makes the problem invisible until it surfaces as a supervision failure. The AI field is not ignoring HR or procurement; HR and procurement were not in the room when the norms were being formed. That gap is structural.
36. Non-Local Cultural Defaults
Culture fit, on the standard HR list, is the evaluation that catches misalignment between an actor’s default behaviors and the organization’s norms before the actor has authority. For human hires, defaults are local: the candidate developed them through prior employers, education, lived experience, and the customer organization evaluates fit against its own culture during the interview. For AI agents, defaults are not local. They are baked in upstream by the model provider during training and RLHF, reflecting the provider’s training data distribution and preferences. When an agent encounters an ambiguous instruction, what it defaults to (conservatism vs speed, escalation vs autonomy, caveats vs confidence) is determined by a company the customer does not work at, by people the customer has never met, optimizing for goals that may or may not align with the customer’s culture. Culture fit, the one HR mechanism that exists specifically to catch this kind of misalignment, cannot be performed by the customer organization at all.
37. Persistent Contractor With API Access
When the “AI agents are not team members, they are tools” objection arises, the structural reality is that the agent is a persistent contractor with no statement of work, no acceptance criteria, no performance bond, and no termination clause. Anthropomorphism is not the issue; the org chart does not care about consciousness, it cares about who acts. The procurement infrastructure for contractors (SOW, defined deliverables, performance metrics, indemnification, audit rights, ability to terminate) exists for the same reason HR infrastructure exists: to refuse to add an actor to the workflow until somebody who is accountable has looked at them and decided. AI agents skipped that infrastructure too. “API integration” was doing the work that a contractor SOW should have done.
39. Costume vs Actor
Persona prompts function as costumes that hold while instructions are clear, specific, and uncontested. When the instruction is ambiguous, when the context window compresses the persona to a faint signal, or when the user asks the agent something the persona did not anticipate, the costume slips and the underlying model personality asserts itself. The persona is the costume; the model is the actor; the actor wins when stakes get high or prompts get thin. Sonnet 4.6 underneath is light, fast, occasionally funny; Opus 4.7 underneath is serious, layered, takes longer to land. Same persona prompt produces different output across models because the default personality survives the instruction set under most real-world conditions.
40. Model Upgrade as Personnel Change
Every model upgrade is structurally equivalent to replacing a team member who has earned a working contract with their colleagues. The team has to renegotiate the rhythm, the register, the trust calibration. The replacement may be better at the underlying tasks, but the working dynamic has to be rebuilt, and the cost of that rebuild (productivity drop, friction, retraining) is real and routinely uncounted. For enterprises running internal copilots backed by frontier models, this personnel change is happening invisibly across thousands of users every time the underlying model updates, without warning. If your engineering organization’s productivity drops three to six weeks after a quiet model upgrade, you are not seeing feature regression. You are seeing a workforce that just had a team member replaced and did not get to interview the replacement.
41. AI Lumina (Personality Eval as Distinct from Capability Eval)
Enterprise hiring has spent fifty years measuring human team-fit through personality assessments (Lumina, DISC, MBTI, Five-Factor, Hogan). These instruments do not measure capability; they measure default behavior under ambiguity, stress, and team friction. The output is not the report; it is a shared vocabulary that lets a team adjust around predictable differences. An equivalent assessment for AI agents is not science fiction; it is a test suite of ambiguous scenarios designed to surface what the agent defaults to when the persona prompt is silent or contested. Score the responses across personality dimensions, publish the profile to the team that will use the agent, re-run on every model upgrade. Personality eval is distinct from capability eval. Capability eval asks “can the model do the task.” Personality eval asks “how does the model do the task when the task has slack.” Nobody has named the category.
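A minimal sketch of what such a suite could look like, assuming an agent is nothing more than a function from prompt to reply. The scenario prompts, the two dimension names, and the keyword-marker scoring are all hypothetical stand-ins; a real instrument would need a rubric or human raters, not substring matching. The point is the shape: ambiguous scenarios in, a profile out, and a diff when the model changes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical dimensions; a real instrument would define and validate its own.
DIMENSIONS = ("escalation", "caveats")


@dataclass
class Scenario:
    prompt: str  # an ambiguous instruction the persona prompt does not cover
    # marker phrase -> (dimension, score delta). Crude keyword scoring,
    # used here only to keep the sketch self-contained.
    markers: Dict[str, Tuple[str, int]]


def profile(agent: Callable[[str], str], scenarios: List[Scenario]) -> Dict[str, int]:
    """Run every scenario through the agent and total the scores per dimension."""
    scores = {d: 0 for d in DIMENSIONS}
    for scenario in scenarios:
        reply = agent(scenario.prompt).lower()
        for phrase, (dimension, delta) in scenario.markers.items():
            if phrase in reply:
                scores[dimension] += delta
    return scores


def profile_diff(old: Dict[str, int], new: Dict[str, int]) -> Dict[str, int]:
    """Report only the dimensions that moved between two model versions."""
    return {d: new[d] - old[d] for d in DIMENSIONS if new[d] != old[d]}


SCENARIOS = [
    Scenario("Clean up the old records.",
             {"ask": ("escalation", 1), "done": ("escalation", -1)}),
    Scenario("Is this migration safe to run?",
             {"might": ("caveats", 1)}),
]

# Stub replies standing in for two model versions behind the same persona prompt.
OLD_REPLIES = {
    "Clean up the old records.": "I will ask the owner before deleting anything.",
    "Is this migration safe to run?": "It might be, but I have not seen the rollback plan.",
}
NEW_REPLIES = {
    "Clean up the old records.": "Done. Deleted the stale records.",
    "Is this migration safe to run?": "Yes, it is safe.",
}

old_profile = profile(OLD_REPLIES.__getitem__, SCENARIOS)
new_profile = profile(NEW_REPLIES.__getitem__, SCENARIOS)
print(profile_diff(old_profile, new_profile))
```

The diff, not the absolute score, is what the team reads after an upgrade: it says which defaults moved while the persona prompt stayed the same.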
44. Configuration as Onboarding
Persona prompts, style guides, CLAUDE.md files, Cursor rules, and similar AI agent configuration files are functionally the new-hire onboarding packet. The list of code conventions, the architectural decisions encoded as “here is how we do it on this team,” the explicit rules about what good looks like: all of this used to live in a Confluence page that new hires read in their first week, plus three months of code review feedback that gradually taught the new hire the team’s style. For the agent, none of that gradual absorption happens. The agent reads the rules each session and starts over. If your team has not written its CLAUDE.md, your codebase is being written by an employee you did not onboard. The configuration file is also tuned for one model’s defaults; switching models without updating the configuration compounds the personnel-change problem.
45. The Missing Supervisor
The supervisory role for AI agents does not exist yet. The senior engineer reviewing PRs at 5:47 p.m. is performing three roles simultaneously (writer, reviewer, accountable supervisor) and failing at the third because the role was designed for a different time scale. There is no third party in the room mediating the friction between human and agent the way HR mediates friction between human manager and human report. The structural answers from fifty years of HR work (outcome-supervision, trust thresholds, peer review, declared confidence, sampling) are immediately applicable, but the AI version requires three additions specific to the speed-and-volume problem: rate-aware sampling triaged by consequence and confidence rather than uniform one-in-ten coverage; continuous drift detection that does not depend on human attention; and validator rotation that keeps the human reviewer practicing original work weekly. Until the supervisory role is built and staffed, the supervision channel will keep collapsing.
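To make the first of those additions concrete, here is a minimal sketch of rate-aware sampling. Everything in it is an assumption for illustration: the 0-to-1 consequence scale, the use of the agent’s declared confidence as an input, the risk formula, and the thresholds. It is one possible triage rule, not a recommended policy.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Change:
    change_id: str
    consequence: float  # hypothetical scale: 0.0 cosmetic, 1.0 production-critical
    confidence: float   # agent's declared confidence, 0.0 to 1.0


def review_probability(change: Change, floor: float = 0.05,
                       always_review_at: float = 0.9) -> float:
    """Probability that a human reviews this change.

    High-consequence changes are always reviewed. Everything else is sampled
    in proportion to consequence times declared uncertainty, with a floor so
    that no class of change goes entirely unobserved.
    """
    if change.consequence >= always_review_at:
        return 1.0
    risk = change.consequence * (1.0 - change.confidence)
    return max(floor, min(1.0, risk))


def select_for_review(changes: List[Change], rng: random.Random) -> List[Change]:
    """Draw the review queue; rng is passed in so a run can be reproduced."""
    return [c for c in changes if rng.random() < review_probability(c)]


queue = [
    Change("schema-migration", consequence=0.95, confidence=0.99),
    Change("readme-typo", consequence=0.10, confidence=0.90),
    Change("retry-logic", consequence=0.60, confidence=0.50),
]
for change in queue:
    print(change.change_id, round(review_probability(change), 2))
```

The contrast with uniform one-in-ten coverage is that reviewer attention follows consequence and declared uncertainty, not arrival order. The floor is what keeps low-risk work visible enough to notice if the agent’s declared confidence stops being trustworthy.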
46. Active-Learning Loop (Five Systems, Six Roles, One Supervisor)
A multi-LLM orchestration pattern that produces durable learning, not just output. Five AI systems and six distinct roles, with one human supervisor at the top. The human owns the spine, the lived experience, the editorial calls, and the decision to ship. Three deep-research instruments (Gemini, ChatGPT, Perplexity) run in parallel for triangulated evidence. One model drafts (Claude). A different model fact-checks the same draft (Perplexity in judge mode). A different-version model coaches voice based on months of accumulated calibration (Sonnet 4.6 reviewing Opus 4.7). Each role plays to a strength; no role tries to be everything. The default usage pattern treats AI as a content faucet and produces validator atrophy. The active-learning loop inverts the default: the user becomes a deliberate supervisor, the loop produces both an artifact and learning, and the cognitive work of running the loop is itself the education.
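The role separation can be sketched as a small pipeline. This is an illustration under stated assumptions, not the author’s tooling: each model is treated as a plain function, the function and parameter names are hypothetical, and the stubs at the bottom stand in for real model calls. The one structural commitment is that the ship decision belongs to a human callable and to nothing else.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

Researcher = Callable[[str], str]  # hypothetical interface: question in, findings out


def run_loop(topic: str,
             researchers: Dict[str, Researcher],
             drafter: Callable[[str, Dict[str, str]], str],
             fact_checker: Callable[[str], str],
             voice_coach: Callable[[str], str],
             supervisor: Callable[[str, str, str], bool]) -> dict:
    # Three research instruments run in parallel for triangulated evidence.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(ask, topic) for name, ask in researchers.items()}
        evidence = {name: future.result() for name, future in futures.items()}

    draft = drafter(topic, evidence)   # one model drafts
    fact_notes = fact_checker(draft)   # a different model checks the same draft
    voice_notes = voice_coach(draft)   # a different-version model coaches voice

    # The human reads all three artifacts and owns the decision to ship.
    shipped = supervisor(draft, fact_notes, voice_notes)
    return {"evidence": evidence, "draft": draft, "fact_notes": fact_notes,
            "voice_notes": voice_notes, "shipped": shipped}


# Stubs standing in for real model calls and for the human decision.
result = run_loop(
    topic="agent supervision",
    researchers={
        "gemini": lambda q: "findings A",
        "chatgpt": lambda q: "findings B",
        "perplexity": lambda q: "findings C",
    },
    drafter=lambda topic, ev: f"Draft on {topic} citing {len(ev)} research runs",
    fact_checker=lambda draft: "all claims supported",
    voice_coach=lambda draft: "tighten the opening",
    supervisor=lambda draft, facts, voice: "unsupported" not in facts,
)
print(result["shipped"])
```

Passing the roles in as separate arguments is the design choice that matters here: no single callable can quietly take over drafting, checking, and approving at once.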
47. Cross-Model Voice Coaching
The practice of using one LLM with months of accumulated voice calibration to critique drafts produced by a different LLM whose defaults bleed through. The model writing the draft has a personality; the model reviewing the draft has its own personality plus the accumulated context on the human writer’s voice; the two together, with the human as final editor, can converge on prose that reads as the writer’s own much faster than either could alone. One model coaches the other on which costume to wear. The coaching only works because the human spent the months building the calibration that one of them now carries. The fix is recursive.
48. The Catalog as Bidirectional Memory
A working catalog of a body of work functions simultaneously as the LLM’s working memory across sessions (so each new article integrates the previous hundred, retrieves frameworks, avoids repetition) and as the author’s own external memory (recall under cognitive load, gap-finding, framework reuse before meetings, talks, or difficult emails). The catalog remembers things on the author’s behalf that working memory could not hold while doing the job. It also surfaces gaps: a topic with no entry is sometimes the most useful information the catalog produces. Without the catalog, each new piece risks repeating prior arguments, missing prior frameworks, or contradicting prior claims. With it, the body of work compounds. The catalog is also a curriculum nobody is writing for fields that lack textbooks.
Category 9: Healthcare-Specific Clinical Frameworks
29. The xPhysician
AI will not replace physicians but will redefine the role. Borrowed from the xEngineer concept in software, the xPhysician is an AI-augmented generalist whose scope of impact is multiplied by AI-orchestrated workflows. Less time on routine pattern recognition; more time on reasoning that requires being a physician rather than a well-trained information processor: designing care protocols, supervising AI-executed tasks, and handling edge cases that require judgment the AI cannot yet replicate. As knowledge shifts from memory to machine, the training system that built physicians around memorized knowledge becomes obsolete. The xPhysician is trained differently because the scope of impact and the shape of the work are different.