Summary
After the 0.5.0 crate release, investigate the synthetic bench_rank deltas against the committed README baseline. This is curious and worth understanding, but it is not a 0.5.0 release blocker.
A 10-run aggregate on the current 0.5.0-pre branch shows several large speedups that are plausible given recent kernel and two-stage optimization work, plus a few slower rows that should be explained before refreshing public synthetic claims.
Evidence
Command:
cargo run --release --example bench_rank
Method used for this note:
- baseline: committed README / benchmarks/rank_modes_results.txt synthetic table
- current: 10 consecutive runs on the current release-prep branch
- machine: same local x86_64 AVX-512 desktop class as the existing synthetic fixture
- quality columns are deterministic for this seeded synthetic corpus; latency and throughput are wall-clock measurements
Headline current 10-run p50 aggregate versus README baseline:
| Mode | Current p50 mean +/- sd (ms) | Current CV | README p50 (ms) | Delta vs README | Current R@10 |
|---|---|---|---|---|---|
| Rank asym | 3.6400 +/- 0.0748 | 2.1% | 3.7118 | -1.9% | 0.8330 |
| RankQuant b=4 asym | 0.3258 +/- 0.0071 | 2.2% | 0.3132 | +4.0% | 0.8095 |
| RankQuant b=2 asym | 0.2640 +/- 0.0082 | 3.1% | 0.2382 | +10.8% | 0.5785 |
| RankQuant b=2 FastScan | 0.1086 +/- 0.0024 | 2.2% | 0.0901 | +20.6% | 0.5845 |
| RankQuant b=2 byte-LUT | 0.6212 +/- 0.0168 | 2.7% | 0.7542 | -17.6% | 0.5785 |
| RankQuant b=4 byte-LUT | 1.2246 +/- 0.0314 | 2.6% | 1.6437 | -25.5% | 0.8095 |
| Bitmap n_top=64 | 0.0634 +/- 0.0022 | 3.5% | 0.0812 | -21.9% | 0.2495 |
| SignBitmap probe | 0.0498 +/- 0.0015 | 3.0% | 0.0912 | -45.4% | 0.2745 |
| TwoStage b=2 M=100 | 0.0538 +/- 0.0012 | 2.3% | 0.0977 | -44.9% | 0.5795 |
| TwoStage b=2 M=500 | 0.0675 +/- 0.0024 | 3.5% | 0.1089 | -38.0% | 0.5785 |
| TwoStage b=2 M=1000 | 0.0804 +/- 0.0029 | 3.6% | 0.1225 | -34.4% | 0.5785 |
| TwoStage b=2 M=5000 | 0.1884 +/- 0.0073 | 3.8% | 0.2398 | -21.4% | 0.5785 |
| SignTwoStage b=2 M=500 | 0.0676 +/- 0.0022 | 3.2% | 0.1057 | -36.0% | 0.5785 |
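The derived columns in the table (mean, sd, CV, delta) can be reproduced from per-run p50 values along these lines. This is a minimal sketch, not the harness's actual code: it assumes a sample standard deviation (n - 1 denominator) and a delta computed from the multi-run mean, and `RowStats` and `summarize` are hypothetical names that do not exist in the crate.

```rust
/// Summary statistics for one benchmark row (latencies in milliseconds).
/// Hypothetical helper type for illustration only.
#[derive(Debug, Clone, Copy)]
struct RowStats {
    mean: f64,
    sd: f64,
    /// Coefficient of variation, in percent: 100 * sd / mean.
    cv_pct: f64,
    /// Relative change versus the baseline, in percent (negative = faster).
    delta_pct: f64,
}

/// Aggregate per-run p50 latencies and compare them against a baseline p50.
/// Assumption: sample standard deviation with an (n - 1) denominator.
/// Returns None when there are fewer than two samples or the inputs are
/// not strictly positive, since CV and delta would be meaningless.
fn summarize(p50_ms: &[f64], baseline_ms: f64) -> Option<RowStats> {
    let n = p50_ms.len();
    if n < 2 || baseline_ms <= 0.0 {
        return None;
    }
    let mean = p50_ms.iter().sum::<f64>() / n as f64;
    if mean <= 0.0 {
        return None;
    }
    let var = p50_ms.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n as f64 - 1.0);
    let sd = var.sqrt();
    Some(RowStats {
        mean,
        sd,
        cv_pct: 100.0 * sd / mean,
        delta_pct: 100.0 * (mean - baseline_ms) / baseline_ms,
    })
}
```

Calling `summarize(&runs, readme_p50)` with the ten per-run p50 values for a mode and its README p50 would yield one table row. Small last-digit differences against the table are expected if the published figures were rounded after aggregation.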
Initial read
The large speedups in the bitmap/sign/two-stage and byte-LUT rows are likely explained by recent kernel and caller-owned/two-stage optimization work. The RankQuant b=2 asym (+10.8%) and b=2 FastScan (+20.6%) slowdowns are several times the observed run-to-run CV (roughly 2% to 4%), so they deserve a focused follow-up rather than being dismissed as jitter.
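One way to make the "larger than run-to-run CV" judgment mechanical is a simple threshold on the delta. The sketch below is an illustrative heuristic, not an established part of the benchmark: `Verdict`, `classify`, and the multiplier `k` are assumptions, and the check ignores any variance in the README baseline itself.

```rust
/// Outcome of comparing a row's delta against its run-to-run noise.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Verdict {
    Faster,
    Slower,
    WithinNoise,
}

/// Classify a delta (percent, negative = faster) against the current CV
/// (percent). A delta counts as real only if its magnitude exceeds
/// k * CV; k is an assumed safety factor, e.g. 3.0.
fn classify(delta_pct: f64, cv_pct: f64, k: f64) -> Verdict {
    if delta_pct.abs() <= k * cv_pct {
        Verdict::WithinNoise
    } else if delta_pct < 0.0 {
        Verdict::Faster
    } else {
        Verdict::Slower
    }
}
```

With `k = 3.0` and the table's values, `classify(10.8, 3.1, 3.0)` and `classify(20.6, 2.2, 3.0)` both return `Slower`, while the RankQuant b=4 asym row (`classify(4.0, 2.2, 3.0)`) falls within noise.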
Suggested investigation after 0.5.0
- Make the built-in synthetic benchmark optionally run and emit N-sample aggregates, probably with a flag such as --samples 10.
- Preserve single-run output for quick local smoke runs.
- Compare current main against the README baseline and, where useful, against nearby historical commits around the kernel/two-stage changes.
- Attribute deltas to specific changes where possible: byte-LUT scoring, bitmap/sign probe kernels, two-stage candidate generation, exact rerank, FastScan layout/scoring, and RNG/corpus/ground-truth changes.
- Refresh benchmarks/rank_modes_results.txt and README synthetic rows only after the investigation separates expected kernel wins from benchmark harness or corpus drift.
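The proposed sample-count flag could be parsed roughly as follows. This is a sketch under assumptions: the `--samples` flag does not exist yet, `parse_samples` is a hypothetical helper, and the example uses only the standard library rather than whatever argument handling bench_rank already has. The default of 1 keeps single-run output for quick smoke runs.

```rust
/// Parse an optional `--samples N` (or `--samples=N`) flag from an
/// argument list. Hypothetical helper; the flag is only a proposal.
/// Returns 1 when the flag is absent so single-run behavior is preserved.
fn parse_samples<I>(args: I) -> Result<usize, String>
where
    I: IntoIterator<Item = String>,
{
    let mut args = args.into_iter();
    let mut samples = 1;
    while let Some(arg) = args.next() {
        // Accept both the space-separated and the `=` form of the flag.
        let value = if arg == "--samples" {
            args.next()
                .ok_or_else(|| "--samples requires a value".to_string())?
        } else if let Some(v) = arg.strip_prefix("--samples=") {
            v.to_string()
        } else {
            continue; // Unrelated argument: leave it to other parsing code.
        };
        samples = value
            .parse::<usize>()
            .map_err(|e| format!("invalid --samples value {value:?}: {e}"))?;
        if samples == 0 {
            return Err("--samples must be at least 1".to_string());
        }
    }
    Ok(samples)
}
```

In the example binary this would be called as `parse_samples(std::env::args().skip(1))`, and the flag would be passed after cargo's `--` separator: `cargo run --release --example bench_rank -- --samples 10`.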
Non-goals
- Do not block the 0.5.0 release on this.
- Do not update public performance claims from one-off local timings.
- Do not change BEIR/real-corpus claims as part of this issue; use the BEIR harness for those.