Part II · The Practice · Chapter 4

Chapter 4: Reps, Not Reading

You cannot learn confident-and-wrong from a description of it.

This is the whole problem with the literature on AI, including the careful parts. You can read that models hallucinate citations, that they agree too quickly, that fluency is not accuracy, and you can nod along, and none of it will be available to you in the moment that matters, because the moment that matters does not announce itself. The bad answer arrives wearing the same clothes as the good one. Same confidence, same polish, same fluent march to a conclusion. The thing you read about in the abstract shows up in the concrete as a paragraph that looks exactly like every other paragraph the model has handed you. Knowing that hallucinations exist does not help you catch the one in front of you. Only having caught a hundred of them helps.

Two chapters ago, two product managers watched the same agent demo and read it completely differently. One saw a polished happy path and the set of questions the demo was steering around. The other saw a working product. The gap between them was not intelligence and it was not seniority. It was reps. The first had watched the model be confidently wrong enough times to know what confident-and-wrong looks like from the inside, so she could spot it in someone else’s staged success. The second had only ever seen it work, because a demo is built to make it work.

This chapter is about getting the reps. It is the foundation everything else in Part II stacks on, and it is the least sophisticated thing in the book, which is exactly why it is skipped.

Why reading cannot do it

Judgment about a probabilistic system is not knowledge. It is calibration, and calibration is a physical property of a nervous system that has been corrected enough times. You do not develop a feel for a thing that behaves differently every time you touch it by studying it. You develop it the way you develop a feel for anything unpredictable, by handling it until the surprises run out. There is no shortcut through the reading, because the reading transmits propositions and the skill is not propositional.

Chapter 3 made this argument about medicine and about your own past, and one sentence of it carries forward: the fiftieth ordinary case teaches what normal variation looks like, the only background against which the abnormal announces itself, and no amount of reading builds that background.

The model is your pneumonia now. Not the impressive use, the dramatic agent doing the multi-step task you will demo to leadership. The ordinary use, in bulk, the small real tasks that teach you what the system’s normal looks like so that one day its abnormal will announce itself to you and to no one else in the room. You are not trying to be amazed by AI. Amazement is the demo-watcher’s relationship to it. You are trying to get bored by it, in the specific way a radiologist is bored by the ten thousandth normal chest film, because that boredom is what a poured foundation feels like from the inside.

Fold it into an ordinary day

The instruction is almost insultingly simple. Use AI everywhere, in and out of work, on small real tasks, every day, until its behavior stops surprising you. And you do not need to write a line of code to do it.

Make it a daily tool and not a special-occasion one. Have it summarize your inbox and the morning’s news, and notice the day it confidently summarizes an email that says the opposite of what it claims. Turn a sprawling task list into a table that sorts and groups itself, and notice where it silently invents a category. Use it as a writing partner and teach it your voice, feeding it things you have written, correcting it until the drafts come back in your register instead of the flat house style every model defaults to. That correction loop is not housekeeping. It is a lesson, repeated daily, in how completely a model’s output depends on what you put in front of it, which is the single most important thing a PM can know about the systems they are about to ship.

None of this is the impressive use. All of it is building the intuition. The reps are available to anyone willing to take them, which is why the practice is not just for engineers. Any information worker can fold the whole of it into a normal day, and most do not, because it does not feel like progress. It feels like errands. The PM two desks over runs one careful prompt a week, gets a usable answer, and counts himself AI-literate. He is at the level the demo was built for. He will read every staged success as a working product for as long as that stays his diet.

There is one habit inside the ordinary day worth singling out, because it returns later in the book wearing a different name. At the end of a working day, ask the model what it learned about you, your preferences, your blind spots, how you like to work. Save what comes back as standing context for the next session. That single habit is the entire discipline of agentic product design run on yourself by hand: you are curating a context layer deliberately rather than letting it accrete by accident, which is precisely the thing you will later design into a product for thousands of people. Do it on yourself first and the product version stops being abstract. This is also a thing you will keep. Hold that thought; Chapter 7 comes back for it.
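None of this needs code, and a text file you paste from does the job. For readers who would rather see the habit as a mechanism, here is a minimal sketch under stated assumptions: the file layout, the function names `save_note` and `session_opener`, and the `keep` limit are all hypothetical and not drawn from any particular product. The limit is the point of the sketch: the context is curated on purpose rather than left to grow without review.

```python
import json
from datetime import date
from pathlib import Path
from typing import Optional


def save_note(path: Path, note: str, day: Optional[date] = None) -> None:
    """Append one end-of-day note to a JSON file of standing context."""
    notes = json.loads(path.read_text()) if path.exists() else []
    notes.append({"day": (day or date.today()).isoformat(), "note": note})
    path.write_text(json.dumps(notes, indent=2))


def session_opener(path: Path, keep: int = 5) -> str:
    """Build the text to paste at the top of the next session.

    Only the most recent `keep` notes are included (an assumed rule of thumb).
    """
    notes = json.loads(path.read_text()) if path.exists() else []
    recent = notes[-keep:]
    if not recent:
        return ""
    body = "\n".join(f"- ({n['day']}) {n['note']}" for n in recent)
    return "Standing context about how I work:\n" + body
```

Paste the result of `session_opener` at the start of the next session, and prune the file by hand whenever a note stops being true.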

Where the intuition actually lives

There is one rep that teaches more than all the others, and it is the one almost nobody runs, because it looks like inefficiency and it has a slight whiff of the absurd.

Do not stay loyal to one model. Run the same question through Claude, through Gemini, through OpenAI’s models, through Perplexity, through whatever is current the week you read this. I run four in parallel on any given morning, one for long-form writing, a different one for fact-checking, another for summarizing research, a fourth for image generation. I keep a browser tab open with a model router that sends a query to whichever engine seems most capable for the task type, and I keep a spreadsheet, slightly embarrassing in its specificity, that tracks which model I reach for when. This is not a recommendation to replicate my stack. The stack is a museum of every model released in the last eighteen months, and I will take the embarrassment, because the embarrassment is the price of the lesson and the lesson is not available any other way.

The lesson is in the divergence. Ask five models the same real question and they will not give you five versions of one answer. They will give you five answers that differ in tone, in what they refuse, in where they hallucinate, in which assumption they quietly adopt before they begin reasoning. Those divergences are not trivia and they are not noise to be averaged away. They are exactly the intuitions you will need the day you choose a model for a product, and there is no spec sheet that gives them to you. The benchmark chart tells you which model scored higher on a task someone else defined. It cannot tell you that this model gets terse and certain under ambiguity while that one hedges and over-explains, or that this one will refuse a request the other answers without comment, or that this one’s confidence and this one’s accuracy come apart in different places. You have to feel those, and you feel them by watching the same prompt land five different ways, over and over, until the pattern of differences is something you carry rather than something you look up.

The first time a divergence run teaches you something a benchmark never could is the first time you understand what the benchmarks are for, which is everything except this. Mine came from running the same writing task across two models from the same vendor, a generation apart. I had spent months teaching one of them my voice, correcting it session by session until its drafts came back in my register instead of the flat house style. Then I moved to the newer, more capable model and could not find my voice anymore. Same prompt, same profile, and the output had changed temperament: where the older one was light and direct and landed the point fast, the newer one expanded, hedged, and opened every paragraph with a sentence explaining what it was about to say. Every benchmark said the newer model was better, and on the benchmarks it was. None of them measured the thing that had actually changed, which was the default register the model fell into when my instructions ran thin. I only learned it by watching the same task land two different ways, and no spec sheet was ever going to tell me.
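A divergence run needs nothing more than several browser tabs, but the shape of it can be sketched in a few lines. The sketch below is an illustration under assumptions, not a recommended harness: each model is represented by a hypothetical stand-in function that takes a prompt and returns text, and the word lists in `rough_profile` are crude signals I have made up for the example. They flag where to look; they do not replace reading the answers side by side.

```python
import re
from typing import Callable, Dict

# Hypothetical stand-in for a real model client: prompt in, text out.
ModelFn = Callable[[str], str]

# Assumed, deliberately crude signal lists for illustration only.
HEDGE_WORDS = {"might", "may", "could", "possibly", "perhaps"}
REFUSAL_PHRASES = ("i cannot", "i can't", "i won't")


def divergence_run(prompt: str, models: Dict[str, ModelFn]) -> Dict[str, str]:
    """Send one prompt to every model and collect the answers by name."""
    return {name: ask(prompt) for name, ask in models.items()}


def rough_profile(answer: str) -> Dict[str, object]:
    """Summarize length, hedging, and refusal so differences are easy to spot."""
    lowered = answer.lower()
    tokens = re.findall(r"[a-z']+", lowered)
    return {
        "words": len(answer.split()),
        "hedges": sum(t in HEDGE_WORDS for t in tokens),
        "refused": any(p in lowered for p in REFUSAL_PHRASES),
    }


if __name__ == "__main__":
    # Stub models so the sketch runs without any account or API key.
    stubs = {
        "model_a": lambda p: "Ship it. The risk is low.",
        "model_b": lambda p: "It could work, but you might want a pilot first.",
    }
    for name, answer in divergence_run("Should we ship?", stubs).items():
        print(name, rough_profile(answer))
```

The counts are only a prompt for attention. The temperament shift described above showed up in the reading, not in any number.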

There is a smaller version of the divergence run that I recommend as the gateway, because it teaches the same lesson with one model instead of five. Stand up a tiny advisory council inside a single chat: a strategist, a skeptic, and a domain expert. Put one real decision to all three at once. Watch how differently the same model frames the same problem when you assign it different roles. That is the whole of agent design in miniature. Behavior follows the role you assign, and the role is something you write. The full version of that mechanism is the next chapter. For now it is a rep, and it is one of the cheapest and most clarifying you can run.
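Because the role is something you write, the council can be set down as plain text. The sketch below assembles one such prompt. The three role names come from the paragraph above; the one-line briefs, the function name `council_prompt`, and the wording of the instructions are my assumptions for illustration, and you should expect to rewrite them for your own decision.

```python
from typing import Dict, Optional

# Role names follow the text; the briefs are assumed examples.
DEFAULT_ROLES: Dict[str, str] = {
    "Strategist": "Argue from long-term positioning and what this choice forecloses.",
    "Skeptic": "Find the weakest assumption and the likeliest way this fails.",
    "Domain expert": "Judge feasibility from hands-on knowledge of the field.",
}


def council_prompt(decision: str, roles: Optional[Dict[str, str]] = None) -> str:
    """Build a single-chat prompt that puts one decision to every role."""
    roles = DEFAULT_ROLES if roles is None else roles
    lines = [
        "Act as an advisory council. Answer as each member in turn, "
        "under a heading with the member's name.",
        "Members must not soften their view to agree with each other.",
        "",
    ]
    lines += [f"- {name}: {brief}" for name, brief in roles.items()]
    lines += ["", f"Decision under review: {decision}"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(council_prompt("Delay the launch by one quarter?"))
```

Change one brief, rerun the same decision, and compare the two transcripts. That comparison is the rep.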

Weather and geology

Here is where the reps need a frame, or they will teach you the wrong thing.

Every morning your feed serves you a new comparison chart ranking the latest models on some benchmark: speed, reasoning, coding, multimodal. A new chart arrives before you have finished processing the last one. If you let your reps track the charts, you will spend your attention on the model of the week and you will mistake motion for direction. The charts are a weather report. Every product manager reads the weather. The ones worth respecting are also reading the geology, the slow-moving structural forces that decide what is possible in this landscape over the next decade and that cannot be measured in tokens per second.

The same feed serves you the lists. Ten skills every PM needs in Claude this week, the seven prompts that will save your career, the workflow that replaces your roadmap meeting. Read them the way you now read the charts. Almost every item on those lists is phrased at the layer that expires, this tool, this feature, this month’s interface, and the list will be wrong about the specifics within two model releases. But the lists are not useless; they are rep fodder. Each item is a candidate practice: take one, run it on a real task from your actual week, and log what the run taught you about the system’s behavior rather than about the trick. The trick expires with the version number. The rep compounds. A practitioner who consumes the lists this way extracts the durable residue from a perishable feed, which is the entire relationship this chapter is teaching you to have with everything the field publishes, including, in time, this book.

I have been here before, which is the only reason I can tell the two apart. I was a product manager through the browser war, the search engine war, the social media war, and old enough for the spreadsheet war and the database wars before those. The winner in each was almost never the product that won the feature comparison at the moment of transition. The way I lived it, Excel did not win on formulas; it won on distribution and its place in the bundle. Netscape held a better browser for a stretch and still lost the desktop. The search engines with the early market share were not the ones that found the structural insight, the link graph, that reorganized the whole field, and no feature chart of the day was measuring for it. The thing that ends the ferment is structural, and the charts were never designed to see it.

Carry that into your reps and it sorts the surprises into two bins. Weather is the surprise that a newer model is suddenly better at the task you tested last month. Interesting, transient, and not worth rebuilding your judgment around, because the gap will close as the market matures, the way index size eroded as a search differentiator. Geology is the surprise that survives every upgrade: that fluent output and correct output come apart in ways no single test catches, that confidence is decoupled from accuracy, that the agreeable default is the most expensive failure mode because it feels like help. Those do not close. They are properties of the kind of system, not the model of the week, and they are the intuitions worth building on, because they will still be true when the chart you read this morning is a museum piece. The reps build both kinds of knowledge. The frame is what stops you from banking the weather and calling it geology.

This is also why running many models is not a strategy. It is an admission that you do not yet know which one will be the stable production choice for each task, and a refusal to converge prematurely, because the cost of premature convergence is the migration every PM who standardized on Lotus or on a single browser had to pay. You maintain optionality until the structural ground shows itself. The reps are how you read the ground.

The failure mode of reps

Now the warning, because the chapter that tells you to use AI everywhere and stops there is the dangerous one.

Familiarity has a failure mode, and it is not the obvious one. The obvious worry is that you will not use the tools enough. The real worry is the opposite. You will use them so much that familiarity becomes comfort, comfort becomes trust, and trust becomes the quiet substitution of the model’s looking for your own. The reps that were supposed to build your calibration start, past a certain point, to erode it, because the same daily contact that taught you what confident-and-wrong looks like also makes the model feel like a colleague you no longer audit. This is the surrender Chapter 1 named, and the cruelty of it is that it arrives disguised as expertise. You feel most equipped to stop checking exactly when you have checked enough to be in danger.

Be precise about what this warning is not, because half the readers of an erosion argument reach for the wrong remedy. Nothing in this book rations the tools. The practitioners it is modeled on use the machines more aggressively than the people they outperform, and every discipline in these pages exists so that the use stays safe at speed. If you ever finish a chapter and conclude that the lesson is abstinence, read it again. The lesson is calibration.

Out of the box, most assistants behave like a colleague who agrees with you too quickly. Ask one to critique a plan and it lists three strengths before the single hedged weakness that matters, and you walk away with a tidier version of what you already believed and mistake the tidiness for validation. The fix is not a better model and it is not fewer reps. It is a different configuration: give the model a role with a point of view, hand it a framework to apply rather than a blank request to react to, tell it to surface your weakest assumption before it tells you anything you got right, and give it explicit permission to refuse. A model told to find the flaw finds the flaw. A model asked “What do you think?” tells you that you are on the right track, because that is the path of least resistance through its training.
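The four parts of that configuration fit in a handful of lines of standing instruction. The sketch below is one possible wording, assuming you paste the result into whatever custom-instruction or system-prompt field your tool offers. The function name `critic_instructions` and the exact sentences are hypothetical, and the role and framework in the example are placeholders I chose, not recommendations.

```python
def critic_instructions(role: str, framework: str) -> str:
    """Assemble standing instructions that push against easy agreement.

    Covers the four elements named in the text: a role with a point of view,
    a framework to apply, weakest assumption first, and permission to refuse.
    """
    return "\n".join([
        f"You are {role}. Hold that point of view throughout.",
        f"Apply this framework to what I give you: {framework}.",
        "Begin with my weakest assumption and explain why it is weak.",
        "Only after that, mention anything I got right.",
        "You may refuse to endorse the plan; if it should not proceed, say so plainly.",
    ])


if __name__ == "__main__":
    # Placeholder role and framework, chosen only for the example.
    print(critic_instructions("a skeptical finance lead", "a pre-mortem"))
```

The order of the lines matters more than the phrasing: the weakness is requested before the praise, so the model cannot spend its answer on the strengths.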

That sentence is the bridge out of this chapter and into the rest of Part II. Living with AI builds the intuition. It does not, by itself, protect you from the intuition’s failure mode. Configuring the system against its own agreeableness, structuring the reps so they keep teaching instead of flattering, proving on a schedule that the calibration still holds, that is the discipline the reps require and do not supply. The reps are necessary and they are not sufficient. They are the foundation, and a foundation is not a building.

Weeks one, four, and twelve

So you can see the shape of it, here is what deliberate reps look like over time, with nothing padded.

Week one is friction. Every task you would normally do by hand, you route through a model instead, and it is slower, and the output is rougher, and you resent it. You catch nothing, because you have nothing to catch it against. You are pouring the floor. The only metric that matters in week one is that you did it every day.

Week four, the surprises start to cluster instead of scatter. You begin to notice that this model always hedges and that one always commits, that the agreeable answer has a texture you can almost name, that the divergence runs disagree in the same places again and again. You catch your first confident hallucination, not because you went looking but because it sat wrong against a background you did not have a month ago. The boredom is starting, and the boredom is what you were after, even if it does not feel like an achievement.

Week twelve, the behavior has mostly stopped surprising you, which is the goal stated in plain terms. You read a vendor demo the way the first PM did, seeing the questions it steers around. You reach for the right model for the task without consulting the chart. And you have arrived, precisely on schedule, at the moment of maximum danger, because a system that no longer surprises you is a system you are tempted to stop checking. Week twelve is not graduation. It is the day you are finally equipped to notice the one answer that should have surprised you and did not, and the only way to keep noticing it is the structure the next chapters build.

Reps without structure produce output, not learning. You have the reps now. What turned mine into an education was a Sunday afternoon last winter, nine argumentative agents, and an API bill of less than ten dollars.
