← Writing
Trust · 2026

Reproduce it yourself: benchmarks you can run on your own hardware

The most useful thing a vendor can hand a skeptical engineer is not a chart. It is the script that made the chart, written to run against a fresh random problem on their machine, with their key.

A benchmark you cannot run is a marketing claim

Every headline number on this site comes from a controlled benchmark. That is necessary and not sufficient. A number you cannot reproduce is something you have to take on trust, and deep-tech buyers are right not to. So the load-bearing results ship as self-contained scripts: paste, set your key, run, watch the same shape appear on a problem we never saw.

What you can run today

The reproduce page walks through two results end-to-end with the licensed client. The first is the regulation measurement-efficiency result: tracking quality stays flat as you scale the channel count, on one measurement per round where a finite-difference gradient would need n+1. The second is the deadline crossover: under a fixed wall-clock, a sequential annealer wins at small sizes, the runtime overtakes it around five hundred variables, and the gap widens from there. Both scripts print a table you can read in one glance, and both run in minutes.

The result we will not fake a script for

There is a third result — the sub-quadratic measurement-scaling of the score-weighted estimator — that a clean one-shot script understates rather than demonstrates. A fair scaling exponent needs a fixed hard quality bar, fresh non-overlapping seeds, and a log-log fit with bootstrap confidence intervals. Hand someone an easy bar and both estimators hit it in the first round and look identical; the result vanishes into the setup. So rather than ship a script that cannot reproduce its own claim, we publish the protocol and the measured exponents and let you run the full thing properly. Shipping a flaky demo would cost us exactly the credibility the page exists to build.

What we want back

If a number does not hold up on your hardware, that is something we want to know about — not something to paper over. The premises are stated in the open: the deadline result rests on configuration measurement being parallel and constant-cost in n, and that is the hardware step we are still validating. Run the scripts, push on the premises, and tell us where they bend. A claim that survives that is worth more than one that was never tested.

Run it on your workload →See pricing