thepragmaticquant.com

Green is not evidence

TL;DR — Three verification gate designs can read green while verifying nothing: an empirical validation that measures the wrong quantity against the wrong envelope, a canary that returns the same exit code whether the gate is alive or dead, and a reviewer-facing audit command that produces no evidence and exits 0. Each failure mode became a mechanical rule — the table at the end carries all three.

An empirical validation that streams a diagonal sum-of-squares and checks it against the raw kernel’s envelope — the wrong quantity, against the wrong bound — will pass. Of course it will pass: an experiment that validates the wrong thing looks exactly like one that validates the right thing. Both produce a table of cells under a curve and a green check mark. The centred paper’s validation instead runs the genuine off-diagonal cross-covariance grid whose numbers the previous article reports — every cell contained, against the centred envelope that actually governs it. That corrected design is the good news. This article is about the failure mode: how the wrong design passes, and two other gate designs that fail for equally hollow reasons.

Consistency is not faithfulness

Formal gates — proof checkers, CI assertions — check that conclusions follow from premises. None of them check that the premises are true of the deployed code. The wrong-envelope experiment above is internally consistent: the harness runs, the comparison executes, every cell lands under the curve. The flaw sits upstream of everything the gates can see — the quantity being streamed is not the quantity the deployed kernel streams, and the envelope it is compared against is not the theorem that covers it.

This is the same failure class as the misformalization the previous article describes, one layer down: there, analyzing the textbook kernel while production runs a different one; here, validating a different quantity while believing you validated the real one. A claim can ship false while every formal gate is green. Consistency and faithfulness need different machinery.

The uncomfortable property of this failure class is why it attracts no scrutiny. A passing test looks fine: the harness is careful, the comparison is correct given its inputs, and reviewer instinct spends attention on red cells, of which there are none. Nothing in the output reveals that the quantity being measured is the wrong one. The wrongness lives in the binding between the experiment and the thing it claims to be about — a place no amount of staring at output can reach. So the binding is what a machine must test, deliberately, in every build.

Strict equality or nothing

The fix was a rebuild, not a patch: a different streamed quantity, a different envelope, the whole grid rerun from scratch. The machinery that keeps it fixed: one symbolic oracle as the single source of truth for what the deployed recurrence computes, fuzzed against the actual pip-installed kernel on integer-valued inputs where floating point is exact — so the comparison is strict ==. Any tolerance parameter is rejected outright, because tolerance creep is the named decay path: every atol starts as “just to absorb harmless noise” and ends as the hole the next wrong-quantity bug walks through.

python
# the shape of the rule, not project source
assert oracle_value == deployed_value          # accepted
assert abs(oracle_value - deployed_value) < 1e-9   # rejected at review, always

And because the oracle itself is now the single point of failure, it gets the same treatment: a deliberately poisoned mutant oracle is swapped in on every run, and the harness must catch it or the build fails. The gate demonstrates it can go red in every build that goes green.

The canary that couldn’t sing

That last rule has its own postmortem. One gate’s self-test — its canary — ran the gate against deliberately broken input and asserted a non-zero exit. The canary returned exit 1 in the healthy branch and in the broken branch. A dead gate read as PASS, indistinguishable from a live one, for the entire lifetime of the canary. An adversarial review pass — run by an AI agent — is the class of check that catches it. The fix is to give every outcome a distinct signature: poison caught, poison missed, clean input passed. That fix, when applied to the first canary, immediately exposed the identical defect in the suite’s one other, older canary: two canaries, one shared blind spot.

The canary truth table, before and after. Before the fix, both the healthy and the broken gate produced the same exit code on poisoned input. A sketch of the documented bug mechanism, not project source.

Days later, an audit found that the planted poisons themselves proved less than they appeared to. Both canaries’ poisons were true statements whose only falsehood was a placeholder proof. They fired — but one well-meaning edit completing the placeholder would have disarmed them silently, leaving a canary that reports healthy and tests nothing. Every poison now states a proposition that is actually false.

Exit-0 theater

The third postmortem is the simplest. A reviewer-facing audit command is supposed to print, for each theorem in the certificate, its full assumption set — the artifact a referee checks. In this failure class it prints section headers and no evidence, for dozens of theorems, and exits 0. The cause: the prover’s batch mode silently ignores input on stdin and reports success having checked nothing. The verification command verifies nothing, cleanly, every time — and no measurement can catch it without an independent content assertion.

sh
# the trap: success is inferred from the absence of error
check_all | tee audit.txt ; echo "exit=$?"   # exit=0, audit.txt: headers only, zero axiom sets

# the fix: assert on content, not on exit code
for thm in $THEOREMS; do
  out=$(probe "$thm")
  [ -n "$out" ] || { echo "FATAL: $thm produced no evidence"; exit 1; }
done

The rebuilt command runs one compile probe per theorem and fails loudly if any probe does not produce its evidence. Exit 0 is a claim about plumbing — the process ran and nothing crashed. It is not a claim about the world.

The rules

what read greenwhat was actually truethe rule it became
Validation passed, all cells under the envelopeWrong quantity, wrong envelope — the real kernel was never measuredOne oracle as source of truth; strict equality at trust boundaries; no tolerance parameters, ever
Canary asserted non-zero exit, got itSame exit code from a live gate and a dead oneDistinct failure signatures per outcome; mutation-test the test
Planted poisons fired and the canaries reported healthyThe poisons were true statements with placeholder proofs — one edit would have disarmed them silentlyEvery poison states an actually false proposition, and demonstrably fires
Audit command exited 0Headers and no evidence — the prover’s batch mode checked nothingAssert on evidence produced, never on absence of error

The generalization is one the AI crowd already lives with: these are evals. A benchmark harness that passes while measuring the wrong thing is indistinguishable from one that works — until you plant a failure that must fail and watch whether it does. If a planted failure doesn’t fail CI, your gate is decoration, and green is a color, not evidence. The gates that survived are the ones where judgment lives in an artifact a small kernel re-checks rather than in an exit code — the next article is about those.