The accumulator that never moved
TL;DR — There is an input on which round-to-nearest discards every increment, in the same direction, forever: the computed accumulator never moves while the true sum climbs. That stall proves the window-magnitude factor in the opening article’s envelope is necessary, and it proves the hypothesis behind the natural probabilistic refinement false in the worst case, which is why that refinement ships as a conjecture. A separate matching lower bound (products pre-rounded, accumulator exact) shows the linear error growth itself is intrinsic: no summation algorithm escapes it. The escape is a different rounding mode: stochastic rounding gives an unconditional √n envelope — the model-derived fp32 deep-n slope is 0.5014, a direct 20,000-seed emulator run to n = 3×10⁷ measures 0.4978, and both confidence intervals exclude the deterministic rate 1.
Park an accumulator at an exactly representable value. Feed it increments that are each just under half an ulp of that value. Round-to-nearest evaluates the addition, finds the result closer to where it started, and snaps back. Every step. The same direction. The computed sum is stationary while the true sum climbs monotonically — and the discarded residuals stack into an error of order p(w−p)·u·T★, quadratic in the window parameters when the carry length sits near half the window. The round-back induction is machine-checked in Coq/Flocq (what that means, and what it does not, is a later article’s job).
The drift floor no algorithm escapes
The opening article gave an upper bound that grows linearly in the carry length p. The reflexive objection is that the linearity is analysis slack — surely a better accumulator beats it. My matching lower bound says no: there is an adversarial input forcing drift of at least c·p·u·T★ even if the accumulator performs every summation step in exact arithmetic. This is a different construction from the stall above, its mirror image: here the products are rounded before they ever reach the accumulator, the adversary places each one a hair past a rounding midpoint so every error lands with the same sign, and the chain has no internal cancellation to spend. In the stall, the products are benign and the accumulator’s own rounding carries the error. (The model excludes fused multiply-add; that is a different machine.)
That construction is why the upper bound is tight. Kahan, Ogita–Rump, pairwise — pick any summation scheme you like. The error floor is set by the rounding mode, before the algorithm gets a vote.
A conjecture, not a theorem
The natural next move is probabilistic: model the rounding errors as mean-zero given the past, run a martingale argument, and the linear rate sharpens to √(p+w). That result appears in my paper — as a conjecture, not a theorem. Its load-bearing hypothesis, mean-independence of the rounding errors under round-to-nearest (the paper calls it L1), is what my own stall construction falsifies, and I did not have to hunt for the counterexample: the lower-bound witness and the counterexample are the same object. The input I built to prove the magnitude factor necessary is also an input on which every error is discarded in the same direction, so conditioned on the past the errors are deterministically same-signed; mean-zero fails outright. The hypothesis is provably false in the worst case. Whether it still holds for generic zero-mean data is open — the same question flagged in the recent numerical-analysis literature (research problem 16.1 of Mary’s 2025 habilitation). Demoting it had a price: the deterministic √(p+w) rate stays unclaimed, and the unconditional stochastic-rounding theorem below had to carry the paper’s headline instead.
Flip a coin
Stochastic rounding rounds up or down at random, with probability proportional to proximity. That randomness removes the one property the stall relies on: every error landing the same way. Under stochastic rounding the per-step error splits into a mean-zero random part and a small deterministic residual, and the mean-zero part is mean-zero by construction, for every input. No L1 hypothesis, no distributional assumption. My main theorem turns that into an unconditional high-probability bound at the √n rate, parametric in three quantities: the storage precision, the accumulator precision, and the hardware random-bit budget r. The residual deterministic bias scales as 2−r, so a finite random-bit budget leaves a finite bias.
20,000 seeds say √n
A probabilistic envelope is a claim about a distribution, so I validated it with an ensemble: 20,000 independent stochastic-rounding seeds, run on the coherent regime where round-to-nearest is at its worst. The rate is the claim that matters. At fp32 the in-binade walk is provably a binomial process, and that model puts the deep-n slope at 0.5014 with a confidence interval [0.4990, 0.5039]; a direct emulator run out to n = 3×10⁷ measures 0.4978 [0.4898, 0.5058]. Both intervals exclude the deterministic slope 1, and fp16 (0.5027) and bf16 (0.5144) read the same way. The mean-zero claim checks out too: a 95% interval brackets zero at every precision tier. Envelope coverage is 1.000 at every tested failure probability (λ = 0.1, 0.05, 0.01), with Wilson 99.99% lower bounds at 0.999 — and the bound is not slack for all that coverage, because the empirical-to-theoretical constant ratio sits between 0.56 and 0.72, order one, which is the tightness check.
data table
| tier | coverage (λ=0.1 / 0.05 / 0.01) | slope [CI] | const. ratio |
|---|---|---|---|
| fp32 | 1.000 [0.999] at all three | 0.5014 [0.4990, 0.5039] (model) / 0.4978 [0.4898, 0.5058] (measured) | 0.72 |
| fp16 | 1.000 [0.999] at all three | 0.5027 [0.4885, 0.5168] | 0.65 |
| bf16 | 1.000 [0.999] at all three | 0.5144 [0.4544, 0.5745] | 0.56 |
| demo (u=2⁻¹³) | 1.000 [0.999] at all three | 0.4964 [0.4877, 0.5051] | 0.69 |
This is the same mechanism deep-learning accelerators already ship, here with a proven envelope and the random-bit budget as an explicit parameter. The budget is documented on one vendor’s hardware-SR path (the Graphcore IPU, r ≈ 23), inferable for NVIDIA Hopper, and undisclosed for Blackwell’s down-conversion path. So the bound reads as a hardware contract: the number of random bits a device must supply to sustain a streaming join without coherent drift.
All of this covers the textbook inner-product kernel. The one production actually ships is structurally different — and it had no analysis at all.