Reference

Appendix E: The Gate Owner’s Checklist

Carry it to every gate; ask the five questions the runs are supposed to answer. Companion to Chapter 10.

You do not take the eval owner’s job. She runs the mechanics; you own the gate. But you cannot own a decision you cannot interrogate, so you ask five questions. You did not memorize them; you earned each one from a moment in the walkthrough.

1. The target

Question: Which failures are tolerable and which are never-ship, and is the agent failing in the tolerable direction? A good answer contains: a stated tolerable direction with a target rate, a never-ship direction, and where the current failures land. Red flag: a single aggregate pass rate offered as the whole story. Earned from: the two high-value auto-approvals the ninety-one-percent average hid.

2. Coverage

Question: What does the suite test, what does it not, and how does the error rate distribute across the people it touches? A good answer contains: an explicit coverage statement and the breakdown of the failing percentage by segment. Red flag: a high pass rate with no account of what was not tested. Earned from: the injection technique that was not in your set, and the nine percent you had to go looking for.

3. Judge calibration

Question: How was the judge calibrated, against what human-labeled failures, and what is its false-pass rate on the failures that matter? A good answer contains: a named human-labeled set and a check that the judge fails the cases the senior lead failed. Red flag: no answer; then the green bar is the lenient judge’s opinion, not a measurement. Earned from: the bar climbing to near-perfect when you loosened the rubric and changed nothing else.

4. The end-to-end number

Question: Not the component pass rates and not a single run, but the system-level number across repeated runs, the P50 and the P10? A good answer contains: a distribution, median and bad run, not a single green checkmark. Red flag: a single pass presented as proof the agent passes. Earned from: the goodwill case that disagreed with itself four times in ten.

5. The version policy

Question: What happens to all of the above when the model updates next month? A good answer contains: a re-run plan, because a model update is a deployment, not an infrastructure event. Red flag: treating last month’s green suite as proof about a new model. Earned from: realizing every number in the walkthrough was tied to one model on one day.

The rule

Read the document, not the dot.

D. The Two Briefs F. The Steady-State Page