The numbers and the trust trail: benchmarking waitbus honestly

TL;DR — There are two things you have to be able to trust before you install waitbus: the speed numbers and the artifact itself. The numbers in the deep-dive only hold up if the method behind them does — my first benchmarks were a lie until I corrected for Coordinated Omission, the same byte-identical code runs ~2.5x slower on a cloud host the spec sheet swears is identical, and the daemon’s real costs (idle memory, CPU under load) are published here as losses, not hidden. And six things make the build trustable to install: SLSA build provenance, sigstore-keyless attestations, a CycloneDX SBOM, an osv-scanner gate on publish, byte-reproducible builds, and a swap from keyring to systemd-creds that cut ten transitive packages from the secret-read path — plus an explicit list of the gaps that remain.

The companion deep-dive clocks waitbus at single-digit milliseconds, 100 to 400x faster than polling. A number like that is only as good as the method that produced it, and a daemon is only as safe as your ability to check the bytes you install. This piece sets out to earn trust in both: the speed numbers first, then the artifact.

The speed numbers, and the method behind them

Most published benchmarks understate the tail

Published broker benchmarks are mostly fiction, and the mechanism has a name: Coordinated Omission. The standard closed-loop pattern records t_response - t_actual_dispatch. When one iteration stalls, the next simply waits for it, so the requests that should have gone out during the stall are dispatched late and measured from their late dispatch. The stall shows up as one slow sample instead of many, and the tail is silently truncated.

The fix is an open-loop scheduler: pre-compute every sample’s intended dispatch time and record t_response - t_intended. If an iteration is slow, the lateness lands in the distribution where it belongs. My first benchmarks were a lie until I made this change; it is the spine of every number in this series.

Coordinated Omission, drawn: the closed-loop scheduler waits a stall out and records nothing tall; the open-loop scheduler puts the lateness where it belongs. An illustrative schematic of the mechanism, not measured data.
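The two recording rules can be compared in a small deterministic simulation. This is a minimal sketch, not the waitbus benchmark harness: the service model (1 ms per request, with a single 100 ms stall), the 2 ms send interval, and every function name are assumptions chosen for illustration.

```python
# Deterministic simulation with a fake clock, so the result is repeatable.
# Hypothetical service: every request takes 1 ms except request 10, which stalls 100 ms.

def service_time_ms(i: int) -> float:
    return 100.0 if i == 10 else 1.0


def closed_loop(n: int) -> list[float]:
    """Send the next request only after the previous one returns.

    Records t_response - t_actual_dispatch, so requests delayed by the
    stall are measured from their (late) dispatch and look fast.
    """
    now = 0.0
    latencies = []
    for i in range(n):
        dispatched = now
        now += service_time_ms(i)
        latencies.append(now - dispatched)
    return latencies


def open_loop(n: int, interval_ms: float) -> list[float]:
    """Pre-compute each intended dispatch time.

    Records t_response - t_intended, so lateness caused by the stall
    lands in every sample that was held up behind it.
    """
    now = 0.0
    latencies = []
    for i in range(n):
        intended = i * interval_ms
        start = max(now, intended)  # the single worker cannot start before it is free
        now = start + service_time_ms(i)
        latencies.append(now - intended)
    return latencies


def count_over(latencies: list[float], threshold_ms: float) -> int:
    return sum(1 for x in latencies if x > threshold_ms)


if __name__ == "__main__":
    print("closed-loop samples over 10 ms:", count_over(closed_loop(100), 10.0))
    print("open-loop samples over 10 ms:  ", count_over(open_loop(100, 2.0), 10.0))
```

Both loops see the same stall and the same worst case. The difference is how many samples carry it: the closed loop reports one slow sample, while the open loop charges the backlog to every request that was queued behind the stall.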

Idle memory, measured and published

nats-server idles ~14x lighter. waitbus pays the Python-interpreter tax; that buys the per-source plumbing nats does not have.

Memory is one cost; CPU is the other. Idle, the daemon is almost free — but put it under real load and it does real, measurable work.

Under 50 producers at 200 Hz the daemon does real work: user CPU and scheduler time both climb off the idle floor, and the gap is not noise (Mann-Whitney p ≈ 3.5e-18 on the scheduler-runtime arm).

metric	idle (ms/s)	loaded (ms/s)
user CPU	0.00	62
scheduler run	0.16	106
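A rate in ms/s is the change in a cumulative CPU-time counter divided by the wall-clock interval. The sketch below assumes the scheduler figure comes from the first field of Linux's /proc/<pid>/schedstat (cumulative on-CPU time in nanoseconds); that source is a guess about the method, and the function names are hypothetical.

```python
from pathlib import Path


def parse_schedstat(text: str) -> int:
    """Return cumulative on-CPU time in nanoseconds.

    /proc/<pid>/schedstat holds three fields: on-CPU time (ns),
    run-queue wait time (ns), and the number of timeslices run.
    """
    return int(text.split()[0])


def read_on_cpu_ns(pid: int) -> int:
    # Linux only; requires a kernel built with scheduler statistics.
    return parse_schedstat(Path(f"/proc/{pid}/schedstat").read_text())


def rate_ms_per_s(before_ns: int, after_ns: int, wall_seconds: float) -> float:
    """Convert two cumulative readings into milliseconds of CPU per wall second."""
    return (after_ns - before_ns) / 1e6 / wall_seconds
```

Sampling read_on_cpu_ns twice, a known number of seconds apart, and passing both readings to rate_ms_per_s yields a figure in the same unit as the table.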

The same code is not the same speed on every host

Here is the caveat the tight confidence intervals hide. I re-ran the byte-identical github benchmark on eight freshly-provisioned dedicated-vCPU cloud hosts. The p99 did not cluster around one number — it split in two.

Same code, 8 different hosts. The p99 is bimodal — a fast cluster near 5 ms and a slow one near 13 ms, with nothing between.

cluster	p99 (ms)	hosts
fast	~5.0	3
slow	~13.3	5

My first guess was “different CPU generations.” Wrong — every host reported the identical CPU model (AMD EPYC-Milan) and NUMA layout. I probed /proc/cpuinfo for a clock difference and found one — but backwards: the fast-responding hosts read a slightly lower clock (~2197 MHz) than the slow ones (~2400 MHz), the opposite of what a clock-speed story would predict. So clock is a red herring, not the cause. The conclusion is uncomfortable: the cloud’s “dedicated vCPU” SKU is served on physically heterogeneous hosts, and which one you happen to draw sets your tail — for a reason the spec sheet hides and I could not isolate from inside the guest.
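The /proc/cpuinfo probe amounts to collecting the model name and reported clock for each logical CPU. A minimal sketch, assuming the standard "model name" and "cpu MHz" keys; the function name and the sample text in the usage note are illustrative, not captured output.

```python
def cpu_summary(cpuinfo_text: str) -> tuple[set[str], list[float]]:
    """Return the distinct CPU model names and the per-CPU clock readings in MHz."""
    models: set[str] = set()
    clocks: list[float] = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between CPU entries
        key, value = key.strip(), value.strip()
        if key == "model name":
            models.add(value)
        elif key == "cpu MHz":
            clocks.append(float(value))
    return models, clocks


if __name__ == "__main__":
    # On a Linux guest: read the real file and print what the host reports.
    with open("/proc/cpuinfo") as fh:
        print(cpu_summary(fh.read()))
```

A single model name across every host, combined with different clock readings, is the pattern described above. The probe only shows what the guest is told, which is why it could rule a cause out but not confirm one.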

The lesson generalizes past my benchmark: cloud “dedicated” does not mean “homogeneous,” a single capture cannot reveal between-host variance, and the only honest number for an absolute latency is a range measured across hosts. Which is why the claim I actually stand behind is the ratio — waitbus beats polling by two to three orders of magnitude, and that is robust to whichever box you drew.
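Reporting a range across hosts, and checking whether the hosts fall into two groups, takes only a few lines. This is a sketch under stated assumptions: splitting at the largest gap between sorted values is a simple heuristic picked for illustration, not necessarily how the repo's analysis works, and the numbers in the demo are placeholders shaped like the two clusters, not the measured per-host data.

```python
def split_at_largest_gap(values: list[float]) -> tuple[list[float], list[float]]:
    """Sort the values and cut them where consecutive values are furthest apart."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two hosts to look for a split")
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    _, cut = max(gaps)
    return ordered[: cut + 1], ordered[cut + 1 :]


def summarize(p99_ms: list[float]) -> dict:
    fast, slow = split_at_largest_gap(p99_ms)
    return {
        "range_ms": (min(p99_ms), max(p99_ms)),  # the honest absolute number
        "fast": fast,
        "slow": slow,
    }


if __name__ == "__main__":
    # PLACEHOLDER values for illustration only; substitute your own per-host p99s.
    placeholder_p99 = [4.8, 5.0, 5.2, 13.0, 13.2, 13.3, 13.5, 13.6]
    print(summarize(placeholder_p99))
```

With a single capture, the range collapses to one point and the split cannot be seen at all, which is the argument for measuring across hosts.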

The benchmark methodology, the per-host data, and the record of the causes I ruled out are all committed in the repo under benchmarks/baselines/. Run ./scripts/capture_baselines.sh on a fresh instance and you will get your own draw from the distribution.

The artifact, and its chain of custody

The speed numbers above hold up. But a benchmark only tells you what some bytes did on some host — it says nothing about whether the bytes you install are the bytes I built. That takes a different kind of evidence, with its own trail.

In October 2021, the maintainer of ua-parser-js (about eight million weekly downloads) discovered his npm account had been hijacked and the package compromised. The malicious versions were live for about four hours, installing a cryptominer and a credential harvester. A supply-chain attack does not announce itself. waitbus is a small workstation daemon, but the threat class is real regardless of scale, and getting the plumbing right on a small project is easier than retrofitting it on a large one.

The chain of custody

Source to install, each step attested. The osv-scanner gate blocks publish on any known CVE in the lockfile.

Source, pinned. Every GitHub Action in the build is pinned to a full commit SHA, not a moving tag. The lone exception is the SLSA reusable generator workflow, which SLSA’s own design requires be referenced by release tag (the boundary this creates is dissected below). The input to the chain is a fixed, auditable artifact, not “whatever @v4 resolved to today”.
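The pinning rule is easy to check mechanically. The sketch below scans workflow text for `uses:` references whose ref is not a 40-character hex commit SHA. It is an illustrative line-based check, not a tool waitbus is known to use; the function name is hypothetical and the SHA in the demo is a made-up placeholder.

```python
import re

USES = re.compile(r"^\s*(?:-\s*)?uses:\s*(\S+)")
FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return every `uses:` target that is not pinned to a full commit SHA."""
    found = []
    for line in workflow_text.splitlines():
        match = USES.match(line)
        if not match:
            continue
        target = match.group(1).strip("'\"")
        if target.startswith("./") or target.startswith("docker://"):
            continue  # local actions and container images carry no git ref
        _, _, ref = target.partition("@")
        if not FULL_SHA.match(ref):
            found.append(target)
    return found


if __name__ == "__main__":
    sample = """
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-python@0123456789abcdef0123456789abcdef01234567  # placeholder SHA
"""
    print(unpinned_actions(sample))
```

A check like this would also flag the tag-referenced SLSA generator, so that one deliberate exception needs an explicit allowlist entry.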

Build and provenance. A reproducible build emits SLSA provenance: a signed record of exactly which workflow, at which ref, produced these bytes. Run it again, get the same hash.
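"Run it again, get the same hash" can be checked by comparing SHA-256 digests of two independently built artifacts. A minimal sketch using only the standard library; the function names and the file paths in the usage note are hypothetical.

```python
import hashlib


def sha256_of(path: str) -> str:
    """Hash a file in 1 MiB chunks so large artifacts are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(first_build: str, second_build: str) -> bool:
    """True when two build outputs are byte-identical."""
    return sha256_of(first_build) == sha256_of(second_build)
```

For example, same_bytes("build-a/pkg.whl", "build-b/pkg.whl") on the outputs of two clean builds from the same commit should return True if the build is reproducible.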

Sign and log. A sigstore/Fulcio certificate signs the artifact and the signature lands in Rekor, the public transparency log — so a forged signature is detectable, not silent.

Gate, then verify. An osv-scanner gate blocks publish on any known CVE in the lockfile; PyPI gets a PEP 740 attestation, and install-time gh attestation verify checks the whole trail end to end.
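The install-time check can be scripted around the GitHub CLI. This sketch wraps `gh attestation verify` with its `--repo` flag; the artifact path and the OWNER/REPO slug are placeholders to replace with real values, the helper names are hypothetical, and gh must be installed for the call to run.

```python
import subprocess


def verify_command(artifact_path: str, repo: str) -> list[str]:
    """Build the argv for verifying an artifact's attestation against a repository."""
    return ["gh", "attestation", "verify", artifact_path, "--repo", repo]


def verify(artifact_path: str, repo: str) -> bool:
    """Return True only when gh exits 0, meaning the attestation verified."""
    result = subprocess.run(
        verify_command(artifact_path, repo),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

A wrapper script could refuse to install when verify(...) returns False, making the check a gate rather than an optional step.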

The boundary that matters. The SLSA provenance records the upstream generator workflow’s identity, pinned at a tag — not the caller’s. A contributor with merge access can change what source goes into the build; they cannot change what the pinned generator does.

The dependency cut

waitbus originally read its HMAC webhook secret from GNOME Keyring. An audit measured the real cost: importing keyring pulled in secretstorage, cryptography (Rust), cffi (C) — ten transitive packages — and cost +21.6 MiB RSS to read a 64-byte string.

keyring (10 transitive deps, +21.6 MiB) versus systemd-creds (0 deps, 2 lines). Every native extension removed is daemon attack surface removed.

systemd decrypts the credential into a per-unit tmpfs before ExecStart runs, keyed to TPM2 or a root-only file. A lifted disk image cannot decrypt it on another machine.
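On the daemon side, reading such a credential is a file read. systemd exports the directory holding a unit's decrypted credentials in the CREDENTIALS_DIRECTORY environment variable. The sketch below assumes a credential named webhook-secret; that name is hypothetical and has to match whatever the unit file declares.

```python
import os
from pathlib import Path


def read_credential(name: str) -> bytes:
    """Read a credential that systemd placed in this unit's credentials directory."""
    # Raises KeyError when not started by systemd with credentials configured.
    return Path(os.environ["CREDENTIALS_DIRECTORY"], name).read_bytes()
```

Calling read_credential("webhook-secret") needs nothing beyond the standard library, so no native extension is loaded on the secret-read path.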

The lesson: dependencies are surface. A library that pulls in native Rust and C to solve a problem you could solve with two lines of standard library is not a neutral choice.

What is not yet there

The attestation trail that exists is real and verifiable. It does not yet close four gaps: there is no Rekor monitor for unauthorized attestations under the waitbus identity; provenance is single-signer (no multi-party signing); there is no hardware-attested build environment (full SLSA L3); and third-party source plugins run in-process as untrusted code with full daemon privileges, so operators must vet them.

That is the whole bargain. The speed numbers are a range you can reproduce on your own host, and the artifact is a chain of custody you can verify before it ever runs. Neither one asks you to take my word for it.