Thirty-one papers, zero error analyses
TL;DR — The streaming sliding-window inner-product kernel — the primitive deployed as the matrix profile — updates by subtracting one product and adding one, reusing yesterday’s value forever. Classical analysis says its rounding error grows like O(N·u) in stream length, and across the entire MP-I..MP-XXXI literature no paper had ever bounded that drift. Measured against a 128-bit oracle on the sweep’s cleanest corpus: naive f64 drift climbs 3.8 decades over 16.7M ticks; at f32, even Kahan–Babuška–Neumaier (KBN) and Ogita–Rump–Oishi compensated summation grow ~3,009× and ~412× on the same data where they are essentially exact at f64. The fix is one line of topology — reseed the carry chain every k ticks — and it collapses to a closed-form tuning rule you set in a config file: fp64-safe at the default, but silently 43× off at fp32. All 18 anchored cells hold the bound with zero violations, and I keep the row where the fix loses. The math is in a public preprint (doi:10.5281/zenodo.20599315).
Nobody can tell you whether the world’s most-deployed time-series similarity kernel survives a trillion ticks at fp32, on hardware with no fp64 unit — the question has no published answer. The primitive underneath is a sliding-window inner product carried forward one product at a time; it ships as the matrix profile, the engine behind motif discovery, discord detection, and a long line of anomaly-mining systems, across thirty-one papers in its canonical series, MP-I through MP-XXXI. None of them carries a forward-error analysis of the streaming recurrence the deployed kernels actually run. So I did the math, and I started by reading the shipped code.
The kernel everyone ships
The streaming kernel maintains a sliding-window inner product per row with a rank-one subtract–add update: one operand pair leaves the window, one enters, and the cached value from the previous tick is updated in place.
# the streaming update, schematically
qt = qt_prev - leaving_a * leaving_b + entering_a * entering_bThat reuse is the whole story. Every tick commits a rounding error into the accumulator, and because tomorrow’s value is computed from today’s, the error never leaves. The carry chain grows with the stream.
I audited the deployed implementations at their public release tags to see who actually carries this chain. stumpy@v1.13.0 carries the recurrence indefinitely with only a single per-tick head refresh. go-matrixprofile@v0.2.0 goes to the other extreme and recomputes a full FFT every tick. SCAMP@v4.0.3 and matrixprofile@v1.1.10 are batch-only — no streaming carry at all. Every implementation sits at an extreme; none occupies the middle ground of a periodic full-state reseed, which is the regime the analysis ends up recommending.
data table
| implementation | release tag | kernel class |
|---|---|---|
| stumpy | v1.13.0 | streaming carry, kept indefinitely |
| SCAMP | v4.0.3 | batch, per-tile, no carry |
| matrixprofile | v1.1.10 | batch, diagonal sampling |
| go-matrixprofile | v0.2.0 | full FFT recompute per tick |
What linear drift looks like
I ran the recurrence as a single unbroken carry chain across stream lengths N = 2¹³ to 2²⁴ — past 16.7 million ticks — against a 128-bit MPFR reference oracle (~38 decimal digits), over seven synthetic corpora, 30 seeds each, three accumulator strategies, two precisions. On the cleanest corpus in the sweep — a deterministic linear-congruential walk, which adds no rounding of its own above the accumulator — the naive f64 accumulator climbs roughly 3.8 decades, from 1.75×10⁻¹⁰ to 1.08×10⁻⁶, tracing the predicted O(N·u) envelope. Compensated Kahan–Babuška–Neumaier (KBN) at f64 holds the same chain near the oracle floor — never above 7.5×10⁻⁹. Ogita–Rump–Oishi at f64 is correctly rounded on this corpus: drift identically zero, off the log axis.
Then the f32 twist. On the same corpus, the Ogita correctly-rounded condition is first crossed around n = 2²⁰ — the exact crossing point depends on a small constant in the condition — and both compensated paths grow near-linearly from there: KBN-f32 grows ~3,009× to 9.27, Ogita-f32 ~412× to 1.27, and naive f32 reaches 1.71×10³. The algorithms did not get worse — the precision tier moved the condition boundary under them. Compensated summation alone cannot arrest the growth once the correctly-rounded condition is crossed, and dropping to f32 crossed it on this data.
data table
| precision | accumulator | drift min (LCG) | drift max (LCG) |
|---|---|---|---|
| f64 | naive | 1.75e-10 | 1.08e-6 |
| f64 | KBN | 5.82e-11 | 7.45e-9 |
| f64 | Ogita | 0 | 0 |
| f32 | naive | 1.25e-2 | 1.71e3 |
| f32 | KBN | 3.08e-3 | 9.27 |
| f32 | Ogita | 3.08e-3 | 1.27 |
Why not just use fp64
Because deployment already left it behind. High-throughput GPU similarity joins are bounded by memory bandwidth, not arithmetic: holding the stream in a 64-bit storage format halves effective cache and bus throughput against a 32-bit one. Power- and area-constrained edge processors frequently ship no double-precision unit at all. This is the same economics that put bf16 and fp8 into deep-learning training — reduced precision is the operating point, not a tuning choice. What was missing is a forward-error bound parametric in the working precision, stating the conditions under which the stream stays inside tolerance.
One caveat up front, because it is the structural difference the bound has to price in: a streaming carry chain compounds across ticks. That is strictly worse than one-shot GEMM accumulation, where each output element’s error closes out when the reduction ends. Here nothing ever closes out — and the production target is the trillions-of-ticks regime, several decades beyond where the naive recurrence already fails.
Cut the carry chain
The fix is one line of topology. Reseed the recurrence from a full dot product every k ticks, and the error bound stops depending on how long you stream — the stream length N drops out of the formula exactly. The streaming kernel’s whole problem is that a rounding error committed at tick t rides inside the accumulator for every remaining tick. The re-anchored topology breaks that ride: at every tick, two anchor rows are reseeded from a full length-w inner product, and a Periodic{k} policy reseeds the whole window every k ticks. A carry chain that reaches any cell now has length p ≤ k, never N.
The resulting deterministic envelope, quoted from the preprint (doi:10.5281/zenodo.20599315, Theorem 4.1):
|computed QT − exact QT| ≤ γ_{4(p+w)}(u_acc) · (2p + w) · T★The Higham envelope γn(u) = (1+u)n − 1 ≈ nu counts compounded roundings; 4(p+w) over-counts the roundings along the longest surviving chain — at most four ops per carry step (4p) plus the length-w reseed’s 2w − 1 roundings, together ≤ 4(p+w); (2p+w) counts the operands whose magnitudes can feed the error; T★ bounds the magnitude of any single pairwise product of stream samples (T★ = M² for a sample magnitude bound M); uacc is the accumulator’s unit roundoff. Nothing in the formula grows with N. Stream for a trillion ticks and the bound is the same number it was at tick k.
The theorem becomes a knob
A bound is only useful if someone can act on it. Because the envelope is strictly monotone in the carry length p, it inverts into a tuning rule: given an error tolerance τ, the largest safe reseed horizon is
k★ ≈ sqrt( τ / (8 · u_acc · T★) )— the 8 is just the operation-count factor 4 times the operand factor 2. Deployment is that one division and a square root. The theorem compiles down to one integer:
# streaming similarity join — retention policy
anchor:
policy: periodic
k: 4096 # construction default: fp64-safe, ~4000x headroom
# k: 628 # the fp32-safe bisection root of the envelope constraint (closed form ~676)The numbers, from the preprint’s retention-tuning section (§SM10.1): at fp64 the safe root is around 1.7 × 107, so the construction default k = 4096 sits under it with roughly 4,000x headroom. At fp32 the safe root is k* = 628. The same default sits ~4,000x under the fp64 root and 6.5x over the fp32 one — ship fp32 at k = 4096 and you silently accept a (4096/628)² ≈ 43x tolerance excess. Nothing in the config changed — only the precision — and the theorem says the safe value moved by four orders of magnitude. That is what a bound in a config file buys you: the failure is computable before you ship it.
The row where anchoring loses
The measured validation, run on the self-anchored sliding sum-of-squares kernel that carries the same recurrence topology (Table 2 of the preprint): 18 (precision, accumulator, k) cells across N = 213…224, seven corpora, 30 seeds. Every cell holds its measured drift below the envelope — zero violations. At f64 the anchored cells keep drift at or below 4.9 × 10−8 against uniform-in-N envelopes of 7.6 × 10−4 (k = 676) and 2.5 × 10−2 (k = 4096), while the unanchored baseline’s envelope grows to 4.1 × 105 by N = 224.
data table
| precision | accumulator | k | drift max | envelope max |
|---|---|---|---|---|
| f64 | naive | 0 | 1.07e-06 | 4.08e+05 |
| f64 | naive | 676 | 1.49e-08 | 7.60e-04 |
| f64 | naive | 4096 | 4.84e-08 | 2.49e-02 |
| f64 | kbn | 0 | 0 | 4.08e+05 |
| f64 | kbn | 676 | 2.24e-08 | 7.60e-04 |
| f64 | kbn | 4096 | 1.49e-08 | 2.49e-02 |
| f64 | ogita | 0 | 7.45e-09 | 4.08e+05 |
| f64 | ogita | 676 | 2.24e-08 | 7.60e-04 |
| f64 | ogita | 4096 | 1.49e-08 | 2.49e-02 |
| f32 | naive | 0 | 1.71e+03 | 3.07e+12 |
| f32 | naive | 676 | 1.73e+01 | 4.08e+05 |
| f32 | naive | 4096 | 3.16e+01 | 1.34e+07 |
| f32 | kbn | 0 | 9.27e+00 | 3.07e+12 |
| f32 | kbn | 676 | 6.89e+00 | 4.08e+05 |
| f32 | kbn | 4096 | 5.27e+00 | 1.34e+07 |
| f32 | ogita | 0 | 1.27e+00 | 3.07e+12 |
| f32 | ogita | 676 | 6.89e+00 | 4.08e+05 |
| f32 | ogita | 4096 | 5.27e+00 | 1.34e+07 |
Now the row that argues against the fix. The unanchored Ogita–Rump accumulator at f32 reaches a worst-case drift of 1.27 — below every anchored f32 cell (those run 5.27 to 31.6). The reason is mechanical: the reseed itself rounds a length-w sum every k ticks, a cost an unbroken compensated chain never pays. Where the accumulator was already strong enough to tame the full chain, anchoring adds rounding events. Where it was designed to help, it helps unambiguously: naive at every tier (1.7 × 103 down to ~30 at f32), KBN at f32 (9.27 down to 5.27). And for every accumulator it wins in the worst case, where only bounded retention keeps the envelope from growing with N.
The regime boundary
The envelope is a theorem with a boundary. It says something only in the convergence regime 4(p+w)·uacc < 1. At the w = 64 deployment with an fp32 accumulator, 4w·uacc ≈ 1.5 × 10−5 — comfortable. Swap in a bf16 accumulator (uacc = 2−8) and 4w·uacc ≈ 1.0 already at p = 0: the envelope is vacuous before the first reseed. This is why the bound keeps storage precision and accumulator precision as separate parameters — bf16 storage is fine; bf16 accumulation is outside the theorem (preprint, Table SM2).
driftbound: frequently asked questions
- What is driftbound?
- driftbound is a forward-error analysis of the streaming sliding-window inner product, the kernel deployed as the matrix profile for motif discovery, discord detection, and anomaly mining. It bounds how far the streaming result can drift from the exact value, and reduces the fix to a single integer you set in a config file. The bound is machine-checked in Lean and Coq.
- What problem does driftbound solve?
- The streaming kernel updates by subtracting one product and adding one. It reuses the previous tick's value indefinitely, so each tick's rounding error stays in the accumulator and grows linearly with stream length. Across thirty-one matrix-profile papers, none carried a forward-error analysis of this drift. At 32-bit precision, even compensated summation grows without bound past roughly a million ticks.
- What is the fix?
- Reseed the carry chain from a full dot product every
kticks. Stream length then drops out of the error bound, so the bound holds the same at a trillion ticks as at the first reseed. The largest safekfollows a closed-form tuning rule that reduces to one integer in a config file. At 32-bit precision the safe value is about 43 times tighter than at 64-bit, so a mis-set config is computable before you ship. - Is the bound proven and validated?
- Yes. The deterministic envelope is proven in a public preprint (doi:10.5281/zenodo.20599315) and machine-checked. Across 18 precision, accumulator, and
kconfigurations over seven corpora and stream lengths to 16.7 million ticks, every cell held its measured drift below the bound with zero violations. This includes the one disclosed row where periodic reseeding loses to compensated summation. Two papers are under review.
Where the series goes
That is the deterministic result: an envelope uniform in stream length, and a knob to set it with. What it cannot cover is the worst case itself — this bound has a matching adversary no summation algorithm escapes, and a stall input that makes round-to-nearest discard every increment in the same direction. After that: the kernel production actually runs, how the proofs were made checkable, and a closing scoreboard where every empirical leg of the program reports its violation count. Next: the adversary and the coin flip that breaks it.