Build / review preflight

The P1/P2 failure modes to root out before a PR exists — distilled from a retrospective over ~350 external code-review findings (≈68 P1, 115 P2, 167 P3) across several build and spec waves, in a workflow where one model builds and a second, independent model reviews every PR. Examples are sanitized to the abstract pattern; the CHECKs are the evergreen part.

The reviewer is token-gated to roughly one round per multi-hour window, so a P1/P2 that survives to review — or a new one a fix introduces in round 2 — costs a whole window. The goal is first-pass-clean.

~40% of all P1s are the same mistake: the spec/build asserts something about existing source — an API, a field, a line number, an env-var, a flag, a sibling spec, a precedent, an invariant — that it never opened the file to verify. Pattern-matching a plausible API from memory is the single biggest defect source.

The checklist (at a glance)

Reviewers run the full list against the diff. Builders carry only mode 1 — the one failure that is build-time-irreducible. For a validator / gate / linter / parser / config PR, weight the adversarial lens on mode 10 + the SHALLOW-CHECK trap.

  1. API-anchor drift — grep or open every symbol, field, API, env-var, flag, and sibling-spec before you cite it; if grep finds nothing, it does not exist. (~40% of all P1s.)
  2. Posture leaks — for a descriptive tool, ship no scalar or label that is one hand-edit from a threshold or that names the inference target; fail closed.
  3. Capability-boundary overclaim — don't call a path pure / additive / lightweight unless import proves it and CI actually runs it.
  4. Stale registration and count conventions — drop-in fragments over editing a shared file; no ==N count literals anywhere.
  5. Untestable or undelivered claims — every AC, and every "guarantee / invariant / bounded" word, needs a test that would FAIL if it were violated.
  6. Math and data-structure errors — bounds hold on saturated / empty / tie input; immutables aren't "added to"; consumed iterables are materialized.
  7. Process discipline — honor the PR-split; complete the paper trail; re-count every prose count against the diff.
  8. Provenance and identity rebinding — imported data carries its own recorded provenance or null, never the current run's; no stamp without backing records.
  9. Measurement validity — confirm the control isolates the named confound and is not correlated with (or identical to) the signal you are measuring.
  10. Override and gate hygiene — a bogus id silences nothing; classify on parsed structure, not substrings; cover the format's full spec; the gate obeys its own rule; one matcher (a state machine), adopted everywhere.

Plus the round-2 discipline:


The modes in full

1. API-anchor drift

DOMINANT (~40% of P1s). Phantom APIs, wrong signatures, non-existent env-vars / tiers / flags, mis-cited precedents, dangling sibling-spec citations. Patterns that reach review:

CHECK: for EVERY symbol / field / line / env-var / tier / flag / sibling-spec / precedent you cite, grep or open it. If grep returns nothing, it does not exist — do not assert it. Verify the actual signature (args, kwargs, return shape) AND the actual return value (including truncation / caps). Never describe a precedent file you have not opened this session.

2. Posture leaks

A thresholdable scalar or verdict-adjacent label. For any tool meant to be descriptive rather than a classifier, the danger is shipping an artifact that quietly becomes a decision input.

CHECK: does the result carry ANY scalar thresholdable into a selection / verdict? Does any label name the inference TARGET vs the measured property? Is the fail-direction closed? Walk the output and confirm no field is one hand-edit from a back door.

3. Capability-boundary overclaim

Claiming a component is dependency-free / pure / additive / lightweight when it isn't — or when it literally doesn't run.

CHECK: prove the claimed-pure path is import-clean (does import pull the heavy dependency?). Does every claimed path actually RUN in CI? If the core needs the dependency, label it that way — don't claim the lighter tier.

4. Stale registration and count conventions

Hand-maintained registries and counts that drift.

CHECK: prefer drop-in fragments over editing a shared file plus a count literal — a per-id fragment, NO ==N literal anywhere, the manifest fragment matching a real sibling (open one). git add the fragment explicitly; run the drop-in + drift + docs-freshness gates.

5. Untestable or undelivered claims

Two faces of one defect: a test that asserts something false, and a docstring / spec that promises a runtime property the code doesn't deliver.

CHECK: every AC must be runnable against REAL behavior (frozen-fixture test for "byte-identical"; open line X before "copied from line X"). Every "guarantees / invariant / bounded / validated / N-min" word in a docstring OR spec must be backed by code that delivers it AND state that can support the check — if you can't write a test that would FAIL when the property is violated, you haven't delivered it.

6. Math and data-structure errors

CHECK: bounds hold on adversarial input (saturation, ties, empties); immutable structures aren't "added to"; once-consumed iterables are materialized. Run the edge cases, not just the happy path.

7. Process discipline

CHECK: honor the spec's PR-split; complete the paper trail even where CI doesn't gate it; re-count every prose count against the diff (don't copy the spec's number).

8. Provenance and identity rebinding

RECURRING. Imported or foreign data gets re-stamped with the current run's identity, falsely certifying it. A copied anti-pattern that recurs across files.

CHECK: a result READ from a manifest / import carries its own recorded provenance (fingerprint, source, author list) or null — never the current run's. "Produced here" and "read from elsewhere" must stamp differently. Any "MIX / derived / attributed / verified-under-X" claim requires the backing records to actually exist (≥ the asserted count); if they don't, refuse — don't fabricate the stamp.

9. Measurement validity

The rarest and subtlest: the code runs, is internally consistent, and is conceptually wrong — it measures the wrong thing or defeats its own stated purpose.

CHECK: does the control / normalization variable actually isolate the named confound, or is it correlated with — or identical to — the signal? Stratifying or normalizing on a dimension REMOVES that dimension from the result; confirm it is a confound you want gone, not the property you are trying to measure.

10. Override and gate hygiene

The dominant class in validator / gate builds. A check, its OVERRIDE, and any meta-gate are themselves code that fails the same ways the checked artifact does. Whole review rounds land here — and almost none is a live crash; they are robustness / scope / claim defects a happy-path run and a naive diff-review both pass.

CHECK: for every override / exemption / gate — does a bogus id silence nothing? is scope a real anchor, not a token? do prose / code-span / suffix / cross-ref decoys fail in EVERY form (single + multi-backtick, fenced + tilde)? does the matcher catch every spelling (string prefixes, zero / extra whitespace, digit ids)? does the tool satisfy its own rule? is the class gated EVERYWHERE (grep, don't trust "I fixed the named site")? do the primary and fallback implementations accept the identical input set — confirmed on the MIN runtime, not just the dev shell? And: have I now patched this format-matcher more than once? If so, replace it with a state machine / parser before the next round — don't ship a third alternation.


Round-2 prevention

The fix loop

A fix responding to a review finding is a build — give it the same pre-flight. After folding: (a) confirm the fix FULLY resolves the finding; (b) self-review the fix against modes 1–10 (a fix that adds a field can introduce a mode-1/2/4 defect; a docstring claim a mode-5/8; a control a mode-9; a matcher / override a mode-10); (c) re-run the full suite; THEN push. Round 2 should be empty because you, not the reviewer, caught the regression.

Verify on the minimum runtime

A green dev-shell run is NOT a green min-runtime run, and the gap is a free review round. Shell gates commonly run under set -euo pipefail on macOS /bin/bash 3.2 in a UTF-8 locale — where an unbraced $VAR adjacent to a multibyte character is parsed as one (unbound) name and aborts under set -u, a break that newer bash and Linux CI both hide. Before "ready," run the shell aggregate on the oldest supported shell under LC_ALL=…UTF-8, and the interpreted languages on their pinned versions. Brace expansions adjacent to non-word characters; test the floor, not the ceiling.

The SHALLOW-CHECK trap

The most common round-2 P1 is a fix that adds the right kind of check but stops short, so the next review round walks straight around it. The guard exists; it just doesn't cover the whole surface.

CHECK: ask "what's the SMALLEST input that satisfies my new guard but still breaks downstream?" and test THAT. Enumerate the input shapes (empty / whitespace / wrong-type-but-truthy / non-iterable / duplicate-across-records / forged-but-well-shaped) and cover the axis, not one point on it. And before pushing a fix, read the existing self-tests / regressions — a too-aggressive guard that breaks a legitimate pinned case is itself a round-2 regression.