Chapter 8: The Proficiency Check
Medical training installs a game that nobody remembers being taught.
The ECG machine prints its own interpretation in the top left corner of the strip. It has been doing this since 1982, which means clinicians have been working next to an automated interpreter for longer than most product managers have been alive. The discipline is simple: you do not look. You cover the corner with your hand or fold the strip, and you read the trace yourself. Rate, rhythm, axis, intervals, morphology. You commit, out loud if a senior is standing behind you. Then you flip the card and read what the machine said.
The same game runs at the lab results. Blood work arrives today as a row of little scales, a dot on a colored bar for each value. Most people reading their own results check only whether the dot sits in the green zone. They do not read the number, and often not even what the dot measures. That is the flip-card game played in reverse: interpretation first, evidence never. The clinical discipline is the opposite. You read the numbers first and build the story they tell. What is off, what pattern connects the abnormalities, what you would check next. Then, and only then, you look at where the dots landed.
In radiology residency the game stops being a game and becomes the entire first month. You spend your opening weeks describing studies the attending has already read. Every film in front of you has an answer key sitting in the system. You describe what you see, fully, committed, and then you pull the report and grade yourself. First alone, where the only cost of being wrong is private. Later in front of the senior staff, where it is not.
Nobody called this pedagogy. It was simply how you learned to see. But look at the design. The ground truth is always present and always sequenced after your commitment, never before it. The interpreter you are checked against is usually right, and the discipline exists anyway, precisely because it is usually right; an interpreter that was usually wrong would pose no danger to your ability to look. And the game converts ordinary working artifacts, a strip, a printout, a worklist, into calibration reps, every shift, at zero added cost.
Medicine built this because it learned, earlier than any other profession, what happens to a reader who sees the interpretation first. You will find what the interpretation found. The trace stops being a trace and becomes an illustration of the machine’s conclusion, and the finding that sits outside that conclusion stays invisible, not because you lack the skill to see it but because you surrendered the act of looking before you began.
Clinical training installs the flip-card game in the first year, mostly without ever naming it. Almost no product manager working with AI plays it at all. The interpretation arrives first, every time, in the chat window, and there is no corner to cover.
This chapter is the game, rebuilt for the work you do now. The previous chapters gave you a practice: reps across real tasks, the active-learning loop, reading the model’s defaults, configuration as self-knowledge. This chapter is about something less pleasant. It is about proving, on a schedule, with a record, that the practice is working. Not feeling that it is working. Proving it.
The assumption nobody tests
The argument of this book so far compresses to three sentences. Judgment is the part of the job that cannot be handed to the machine, which makes it the binding constraint on everything else. Judgment is built and maintained through practice. And the tools that make you productive are quietly removing the practice.
The third sentence is the one people resist, so it is worth restating what the evidence actually shows. In the Wharton cognitive surrender experiments, access to AI raised participants’ confidence by nearly twelve percentage points whether the AI was right or wrong. When the AI was wrong, participants followed it on roughly four out of five trials, and their accuracy fell fifteen points below the no-AI baseline. The nineteen endoscopists in the Lancet study lost a fifth of their independent detection ability in three months, and not one of them reported feeling it happen. The MIT writing study found the cognitive work was not being redirected. It was being skipped, and the participants could not tell.
Hold those findings together and a trap closes. The decay is invisible to the performer. Confidence rises while the skill falls. Which means the sentence “my judgment is fine” carries no information. It is what a well-calibrated practitioner says, and it is also exactly what a deteriorating one says, and from the inside the two are indistinguishable. The Shaw and Nave moderator data adds the cruelest detail: the people most confident they are not at risk are the ones most at risk.
So invert the question. Stop asking whether your judgment is fine. Ask what evidence would prove it, to someone who had no reason to take your word.
One profession has answered that question seriously, and it answered with a check ride.
In 2025 the European Aviation Safety Agency reissued its safety bulletin on manual flight, warning that continuous use of automated systems does not contribute to maintaining manual flying skills and can degrade the ability to handle the aircraft when automation drops out. Long-haul captains on heavily automated aircraft have been estimated to accrue under one hour of true manual flying per year. Aviation’s response to this arithmetic was not a poster about vigilance. It was mandatory recurrent proficiency: pilots demonstrate, in the simulator, on a schedule, under off-normal conditions, that they can fly without the automation. The check is binding. The record is kept. The requirement does not care how experienced the captain is or how confident he feels, because the entire premise of the system is that confidence is not evidence.
No other knowledge profession has built the equivalent. Not medicine, which is further along in admitting the problem than in solving it. Not software engineering. Certainly not product management, a profession that has never had a licensing exam, a recertification cycle, or any institutional answer to the question of whether a practitioner can still do the thing they are paid to do.
Nobody is going to build it for you. On a well-run team the engineering manager owns the engineers’ skill maintenance, and that assignment is right. But no one on any org chart owns yours. The proficiency check for the agentic PM is a regime you run on yourself, and this chapter is its design.
Why willpower is not a design
Before the regime itself, one constraint that shapes all of it.
Picture the senior engineer from earlier in this book at 5:47 on a Thursday evening, nineteen pull requests in the queue, thirteen minutes before her next meeting. She approves at eight seconds per PR not because she is negligent but because the cognitive math demands it. The agent produced the work in seconds; reviewing it properly takes forty minutes; the queue is winning. Every individual act of deference is locally rational. That is precisely what makes the aggregate corrosive.
Your version of her queue is your calendar. On any given Tuesday, letting the model draft the prioritization rationale, accepting its synthesis of the customer calls, and reading its competitive summary instead of the filings is the right call, every single time, judged one decision at a time. The practice this chapter prescribes will lose the argument with your calendar every week it is asked to win one. The cognitive math will never favor maintenance, for the same reason nobody flosses because of an argument they won with themselves that morning.
There is also a pattern worth wiring into the design, one anybody who has run a workshop has watched happen. Put people in a structured session with a reflection-focused tool and most of them will say, sincerely, that they intend to keep using it; back in their normal workflow, the same people on the same tool revert within days. The structured session supplied three things the production environment did not: bounded time, external accountability, and a shared frame that the harder behavior was the point.
Those three conditions are the engineering requirements for everything that follows. Each practice below is built to be scheduled rather than willed, witnessed rather than private, and named rather than smuggled. A proficiency regime that depends on discipline is a plan to skip it.
What this regime is, and is not
Honesty before mechanics, because this chapter is where a careful reader should test the book’s own epistemics.
The threat in Part I is evidenced. The regime you are about to read is not. It is a designed practice, argued from mechanism and from precedent: aviation’s recurrent checks, medicine’s reading order, the calibration logic the eval discipline already uses on machines. Nobody has measured whether these five practices hold a product manager’s judgment stable, because nobody has measured product-manager judgment at all. There is no baseline, no cohort, no trial. You are the trial, and that is not a weakness smuggled into a footnote; it is the design. The regime’s first product is the record, and the record is what converts your own run from anecdote into evidence, which is the only evidence on this subject that will exist for a while.
Scope the claim too, because a fair objection is coming. Ninety minutes a week does not replace an apprenticeship, and this chapter never asks it to. Building judgment that does not yet exist is the slow work of all of Part II, and even that is the rebuilt version of years. A check does something different and smaller: it keeps judgment that exists from quietly leaving, and it tells you when it starts to. Aviation’s simulators never taught anyone to fly. They catch the captain who is forgetting, which is a different job, and it is the only job this chapter takes.
The regime
Five practices. Two are weekly, one is monthly, one is continuous and nearly free, and one is a standing rule that activates when the others produce a bad number. Together they cost roughly ninety minutes a week plus one hour a month. That is the maintenance fee on the only asset this book has argued cannot be replaced.
Practice one: the flip card
This is the game from the opening, applied to your own work.
Once a week, pick a live judgment that is yours to make. An effort estimate. A build-or-kill read on a feature. The go or no-go on a suitability call. A review of a spec. Before you let the agent touch it, write your own answer. Three sentences, timestamped, somewhere you cannot quietly edit later. Then run the agent and compare.
Three outcomes are possible and all three are useful. You agree, which is calibration data. You disagree and the resolution eventually proves you right, which tells you where your judgment still outruns the model and where deference would have cost you. You disagree and the resolution proves the model right, which is the most valuable outcome of all, because it maps the specific territory where your intuition has gone stale, and it does so while the stakes are one decision rather than a year of them.
The point is not to beat the model. Some weeks you will not, and the frequency of those weeks is information you can act on. The point is that the comparison is impossible to run if you look at the corner first. The moment the agent’s answer enters your head, your own answer is gone; what you produce afterward is not your judgment but your reaction to the machine’s, anchored to its framing the way a reader who opens the report first traces the prior radiologist’s path through the film. Anthropic’s own research on chain-of-thought made a version of this point about models narrating their reasoning after the fact. It applies with equal force to a PM who reads the recommendation and then reconstructs what they would have said.
Cover the corner. Commit. Then flip the card. Every week, on real work, forever.
Practice two: the manual leg
Once a week, produce one artifact end to end without AI. A one-page brief. A competitive read built from the actual filings. An analysis memo from the raw numbers. Pick the artifact that week’s work actually needs, so the practice produces value rather than homework.
The justification is not that the manual artifact will be better. It will frequently be worse, and it will always be slower, and if you measure the practice by the artifact you will rationally abandon it within a month. The justification is the validator regress from earlier in this book: the ability to evaluate work depends on the ability to produce it, and the production skill is the one that decays first because it is the one the tools absorb first. The senior engineer who still catches the subtle defect in an agent’s diff catches it with pattern recognition built from years of writing code by hand, and that asset depreciates from the moment she stops.
Aviation pays for simulators and takes revenue-generating pilots offline to fly them. The hour your manual brief costs you is the same expenditure with the same justification. It is not lost productivity. It is the premium on the only insurance available for the skill you sell.
Practice three: the planted defect
Monthly, and this is the one that requires a partner.
Find one other person running this regime. A fellow PM is ideal; the pairing matters more than the seniority. Once a month, each of you takes a real agent output the other would plausibly review in the course of work, a draft spec, an analysis, a recommendation package, and plants two defects in it. One factual: a number changed, a citation pointing at a source that says something else. One judgment-level: a recommendation subtly misaligned with a constraint or a strategy the reviewer should know. Then you each review the other’s doctored artifact cold, at normal working speed, and report what you caught.
Your catch rate is your personal eval. It does for you what the golden dataset, the agent’s graded bank of test cases, does for the agent: it converts an unmeasurable virtue, vigilance, into a number that moves. An LLM judge earns trust only after it is calibrated against human-labeled failures. You are the human judge in your own loop, and you are currently uncalibrated. Expect the first result to sting. If it does not, inspect the defects before you celebrate; plants that are too easy to find are how this exercise flatters its way into being abandoned. The Wharton subjects were not unsettled by their own performance either, and that was the problem.
Track the number by quarter. There is no industry baseline to compare against; nobody has measured PM catch rates anywhere, and you are your own control. That is exactly why the trend matters more than the level. A catch rate that holds while your AI leverage grows is the evidence “my judgment is fine” pretends to be.
Practice four: the disagreement ledger
Continuous, and nearly free.
Every time you override the model on something that matters, log it. One line: date, what it said, what you did instead. Every time you defer on something that matters against an instinct that said otherwise, log that too. When ground truth eventually arrives, and in product work it usually does, close the entry with who was right.
Monthly, read the ledger and compute two numbers. Override precision: when you disagreed with the model, how often were you right? Deference regret: when you suppressed your instinct, how often should you not have? The first number falling means your domain edge is eroding or the models have improved past you in that territory; either way, your deference policy needs updating. The second number rising means you are sliding from cognitive offloading, where you delegate the task and keep the judgment, into surrender, where the judgment goes with it.
This is the sixth chapter’s divergence habit upgraded into an instrument. When you run a question across multiple models and they disagree, you are forced to adjudicate, and an adjudication is gradeable the moment the truth lands. An agent’s dashboard carries six instruments, and confidence calibration sits fourth on the list. The ledger is that instrument pointed at the only system in your stack that ships without telemetry.
Practice five: the demotion rule
An agent’s autonomy must be earned through demonstrated competence in the failure modes that matter, never scheduled by a review count or a go-live date, and every ladder needs a defined path back down. Utah’s pharmacy system removed its physicians after 250 supervised approvals and called the threshold safety. It was a schedule wearing safety’s clothes.
Apply the same standard to yourself, because your reliance on AI is an autonomy ladder too, and most practitioners climb it on a schedule called getting comfortable.
Before the quarter starts, write down your thresholds. If the planted-defect catch rate drops below the floor you set, or override precision falls for two consecutive months, you demote your own reliance one rung in the affected territory. More flip-card reps in that domain. An extra manual leg. Agent output in that lane gets read at draft depth, not summary depth, until the numbers recover. Write the response down with the threshold, in advance, because a threshold negotiated after the bad number arrives will lose, every time, to the same cognitive math that produced the bad number.
The specific values matter less than their existence and their timestamps. Earned, not scheduled, is the standard an agent’s autonomy is held to. The practitioner does not get a softer standard than the machine.
One week, concretely
Here is what the regime looks like inside an ordinary week, with none of the practices padded into something they are not.
Monday, 9:40. The steering meeting at noon needs your call on whether the returns-triage agent moves up a rung, from acting with approval to acting with oversight. The agent’s own recommendation memo is sitting in your inbox. Before opening it you write three sentences in the log: “No. Override rate is still above eight percent, and both August incidents were boundary cases, not bugs. Revisit two cycles after the next model update settles.” Timestamped. Then you open the memo. The agent recommends promotion, citing ninety-four percent task success over sixty days. You now know three things you could not have known in the other order: where you and the system disagree, exactly why, and that its case rests on a metric your own three sentences never mentioned. The meeting gets a sharper conversation than either answer alone would have produced. The disagreement goes in the ledger, resolution pending.
Wednesday, blocked 4:00 to 5:30. The manual leg. A competitor shipped an agent SDK last week and the analysis has to exist by Friday. You write the one-pager yourself, from the announcement, the docs, and the pricing page, with the assistant closed. It takes the full ninety minutes and the prose is rougher than the model’s would have been. It also contains a sentence no model would have written, because it comes from a pricing structure you have seen kill a partnership before. Saturday you run the same task through the model out of curiosity and find its version cleaner, faster, and missing that sentence. Both facts go in your notes.
Thursday, 8:15. Flip card on something small: the effort estimate for the audit-surface work. You write “three sprints, the blocker is legal review, not engineering” before asking. The model says two sprints and does not mention legal. Ledger.
Last Friday of the month, 3:00. Your partner sends back your own draft rollout spec, doctored. You review it cold at working speed. You catch the factual plant in four minutes, a retention number that contradicts the dashboard you check weekly. You miss the judgment plant entirely: a rollout sequence that quietly routes EU pilot data through the wrong region. Catch rate this month: one of two. The miss is more useful than the catch; it tells you which class of defect your review pattern is blind to, and next month you read for sequence and constraint before you read for numbers.
Quarter end, one hour. The ledger holds eleven closed entries. Override precision overall is fine, but in one territory, integration scoping, you have been wrong in three of four disagreements. The demotion rule you wrote in January fires in reverse: in that lane the model has earned more of your deference, not less, and your instinct there is the thing that needs retraining. The flip card keeps running in that territory; the override does not, until the ledger says otherwise.
Nothing in that week required heroics. It required ninety minutes, a partner, and a calendar that did not renegotiate itself one Tuesday at a time. And notice what the quarter-end hour produced: not a feeling about your judgment, a map of it, with one territory marked for repair. That map is the thing no amount of daily AI use generates on its own.
Making it survive the calendar
The regime on paper is ninety minutes a week and an hour a month. The regime in practice is a fight with your own rationality, so build the three conditions in from the start.
Bounded time: the flip card and the manual leg go on the calendar as recurring blocks, named for what they are. Unscheduled practice is cancelled practice; the only variable is the date.
External accountability: the planted-defect partner is the load-bearing element of the entire structure, and not only for practice three. The pair reviews each other’s quarterly numbers. A private regime quietly dissolves; the senior engineer at 5:47 p.m. had no one expecting her numbers either. If you can find no partner, a second model can plant the defects, but a model cannot be disappointed in you, and disappointment is doing more work in human proficiency systems than anyone likes to admit.
A shared frame: tell your team what the blocks are. Not for permission, for protection. An hour of manual work by a PM with AI on tap looks like inefficiency to anyone who has not read this chapter, and the regime will not survive being mistaken for a productivity problem. Aviation never asks whether simulator time is a waste of a captain’s day. The frame is settled there. You have to settle it yourself, out loud, once.
The objections, in advance
This is overhead. It is, in exactly the sense that the simulator is overhead for the airline and recertification is overhead for the hospital. The alternative has a face: the Cigna reviewer spending 1.2 seconds per denial, a human in the loop in title, performing review-shaped motions over decisions no one was reviewing. That reviewer was not hired to be a rubber stamp. The role eroded under them, at queue speed, one rational deferral at a time. Ninety minutes a week is what it costs to be in the loop in fact rather than in title. The going rate for the title alone is zero, and so is its value.
I would notice if my judgment were slipping. The evidence says otherwise, and it says it specifically about people who say this. The decay is invisible to the performer by mechanism, not by carelessness; noticing it from the inside is the one instrument the condition disables first. That is not an insult to your self-awareness. It is the entire reason proficiency systems exist in every field that has them. The captain with twenty thousand hours flies the check ride anyway. The check is not an accusation. It is how the profession knows, rather than hopes, that the skill is still there.
The models will keep improving, so why maintain a skill the loop will need less of? Because every discipline in this book ends at a moment the model does not own. The go or no-go is yours. The boundary decision is yours. The demotion rule, for the agent and for yourself, only works if there is a functioning practitioner to demote back to; a fallback that cannot actually take over is a fallback in title. And the better the model gets, the fewer natural occasions you have to discover your own state, which is the supervision paradox doing exactly what it does: reliability consuming the evidence you would need to notice what reliability is costing you. Improving models make the check more necessary, not less, for the same reason aviation tightened proficiency requirements as the automation got better, not worse. Your week-to-week judgment may defer more and more, and should, where the ledger says so. The capacity to judge is what the regime maintains, because the gate does not transfer.
Why the order of operations is the whole discipline
Back to the strip with the covered corner.
Medicine did not respond to the automated interpreter by banning it, and it did not respond by trusting it. It responded with a reading order. The machine’s interpretation stayed on the strip, useful, usually right, one glance away, and the profession quietly arranged itself so that the human reading happened first anyway. Forty years later the game survives every generation of better machines, because the game was never about the machine’s accuracy. It was about preserving the reader. The ability to see independently is destroyed not by lack of talent but by the silent substitution of someone else’s looking for your own, and no profession has found a way to measure that ability by asking the practitioner about it. You can only preserve it by controlling the order of operations, every strip, every printout, every film, until the discipline is no longer a rule you follow but the way you see.
Say it once, where it lands hardest: oversight without understanding is not safety, it is sign-off. Everything in this book, the boundary, the approval moment, the gate you own, the gradient you sit on, assumes a practitioner on the human side of the loop whose judgment is live. Not credentialed. Not confident. Live, currently maintained, and demonstrable on demand.
Aviation does not ask its captains to promise they can still fly. It asks them to show it, on a schedule, and it writes the result down. No regulator, no manager, and no framework is going to ask this of you. You are the only party to the question with standing to ask it, and the only one who suffers if it goes unasked.
Cover the corner. Commit. Flip the card. Keep the record. The check is the difference between being the judgment in the room and being the person who used to be.