bench: all-Rust BEIR benchmark + README above-the-fold#237
Conversation
Adds a reproducible BEIR benchmark per the repo spec, evaluating nDCG@10 vs BEIR
qrels with the same Harrier embeddings across OrdVec, FAISS FlatIP, and HNSW.
Python prepares public BEIR data + Harrier embeddings (cached normalized f32
.npy) and evaluates qrels; Rust owns every OrdVec hot path; FAISS/HNSW are native
baselines. No 'import ordvec' in the harness (the hot path is the Rust crate).
- benchmarks/beir/: common.py (shared contract), beir_prepare.py (BEIR download +
Harrier encode, sentence-transformers/CUDA + Ollama/GGUF lanes, validation),
beir_baselines.py (FAISS FlatIP + hnswlib), beir_eval.py (nDCG@10/MAP/Recall/MRR/
Precision vs qrels + paired bootstrap CIs vs FAISS), beir_report.py (tables),
README, requirements.txt.
- examples/beir_ordvec.rs: loads cached .npy, runs rq2/rq4 (batched
search_asymmetric) and bitmap-rq2/sign-rq2 (batched CSR candidate-gen + pooled
SubsetScratch allocation-free rerank); emits top-k JSONL (real rank-ordered
scores) + summary JSON (bytes/vec, build/latency p50/p95/p99, simd_detected).
No BEIR metrics computed in Rust.
- root Makefile: benchmark-beir / -smoke / -bm25 + bench-beir-{setup,prepare,
prepare-ollama,ordvec,baselines,eval,guardrail,clean,clean-cache}.
- guardrail target greps for 'import ordvec' in benchmarks/beir and fails.
Headline discipline: nDCG@10 vs qrels is the metric; FAISS FlatIP is a full-float
dense baseline, NOT ground truth; ANN-recall-vs-FAISS is an optional diagnostic.
Verified: cargo build --release --example beir_ordvec (+fmt), python3 -m
py_compile all scripts, make -n benchmark-beir-smoke.
Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
…fold Replace the split Python-baselines + single-method-example harness with a single all-Rust comparison binary, and lead the README with the reproducible results. Why: the previous design measured ordvec (Rust) against FAISS/hnswlib (Python), so the latency comparison crossed a language/FFI boundary and used a single-query baseline — not apples-to-apples, and the headline multiplier depended on un-stated batch/thread choices. benchmarks/beir-bench (new workspace member, publish=false, isolated so its hnsw_rs/matrixmultiply deps never touch the `-p ordvec` deps gate or the published crate): - flat: exact inner product (== FAISS IndexFlatIP retrieval) via a pure-Rust SIMD GEMM (matrixmultiply) — a competitive baseline, not a scalar strawman. - hnsw: pure-Rust HNSW (hnsw_rs, M=32/ef=128) — the portable stand-in for C++ hnswlib (no maintained Rust binding to the latter exists). - ordvec rq2/rq4 + bitmap-rq2/sign-rq2 via the batched/pooled/SIMD fast paths. - One process, matched batch; `--threads N` pins query latency to a rayon pool (build still uses all cores), `--max-docs M` sub-samples the corpus for the scaling sweep. Emits per-(method,n,threads) timing.jsonl + full-corpus topk/summary for offline nDCG. Harness/Python: - beir_prepare.py: canonical llamacpp lane (GGUF Q8 Harrier via llama-cpp-python, CUDA) + skip-if-cached guard so multi-target runs don't re-embed. - beir_plot.py (new): renders the three README figures (scaling curve + single-thread/threaded bars) from timing.jsonl. - beir_eval/beir_report: classify the flat/hnsw slugs; baseline-relative deltas use the actual baseline name. - requirements.txt: drop the unbuildable `beir`/`pytrec_eval`; vendored BEIR loader + pytrec-eval-terrier + huggingface-hub + matplotlib; llama-cpp-python is built with CUDA flags by `make bench-beir-setup`. - Makefile: benchmark-beir = guardrail + quality(nDCG) + scaling + graphics; .NOTPARALLEL so an inherited MAKEFLAGS=-jN can't race the cache. - Delete beir_baselines.py (Python FAISS/hnswlib) and examples/beir_ordvec.rs — both superseded by beir-bench. README: drop the private-arXiv real-embedding block; add a "Benchmark at a glance" hero (scaling curve + one-command reproduce) and a BEIR section with the nDCG@10 table, the three latency views (single-query ~100x, batched, threaded), and an explicit ordvec-vs-HNSW tradeoff table (HNSW edges threaded latency; ordvec wins build [training-free vs 51s], memory [8-16x], and single-query). benchmarks/ excluded from the published crate; figures referenced by absolute raw URL. Verified: cargo fmt --all --check, clippy -p ordvec / -p beir-bench (0 warnings), cargo build --locked, deps gate (`cargo tree -p ordvec` clean), py_compile all scripts, and a full `make benchmark-beir` on scifact + trec-covid (171,332 docs). Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
Code Review by Qodo
1.
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Code Review
This pull request introduces a reproducible BEIR benchmark harness, adding a new Rust workspace member beir-bench and Python scripts to measure and plot ordvec's retrieval latency and quality against exact flat and HNSW baselines. The review feedback is highly constructive, identifying several robustness and performance improvements. These include replacing fragile external shell commands for SHA-256 hashing with the pure-Rust sha2 crate, replacing a custom JSON parser with serde_json to correctly handle unicode escapes, vectorizing the Python bootstrap loop using NumPy for a significant speedup, and adding a defensive check in local_topk to prevent a potential underflow panic when top_k is zero.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
PR Summary by QodoAdd all-Rust BEIR benchmark harness and surface results in README WalkthroughsDescription• Add reproducible BEIR benchmark pipeline: embed → run all-Rust retrieval → score qrels → plot figures. • Introduce beir-bench Rust binary to benchmark ordvec vs exact flat and pure-Rust HNSW in-process. • Update README/CHANGELOG to lead with BEIR quality + latency results and reproduction instructions. Diagramgraph TD
A["Makefile: benchmark-beir"] --> B["Python: beir_prepare.py"] --> C[("Embedding cache")]
C --> D["Rust: beir-bench"] --> E[("results/beir")]
E --> F["Python: beir_eval.py + beir_report.py"] --> E
E --> G["Python: beir_plot.py"] --> H["README figures + tables"]
subgraph Legend
direction LR
_doc["Document/command"] ~~~ _proc["Process"] ~~~ _db[("Data files")]
end
High-Level AssessmentThe following are alternative approaches to this PR: 1. Use existing Rust crates for .npy/JSON parsing (serde_json, ndarray-npy)
2. Keep baselines in Python (FAISS/hnswlib) and benchmark ordvec via FFI boundary
Recommendation: The PR’s approach (single-process Rust benchmarking for all retrieval methods, with Python only for embedding/eval/plotting) is the best fit for fair latency comparisons and reproducibility. If maintenance burden becomes a concern, consider swapping the hand-rolled File ChangesEnhancement (2)
Documentation (4)
Other (8)
|
Remediate the gemini/qodo/Codex review on #237 and the failing release-publish-invariants gate. beir-bench (gemini HIGH/MED + qodo Bug): - sha256_file: pure-Rust `sha2` instead of shelling out to sha256sum/shasum/ openssl — portable (Windows / minimal containers) and byte-identical to the Python hashlib digest. (gemini, main.rs) - load_json_string_array + the topk/timing/summary writers: use `serde_json` for both reading and writing, so document/query IDs containing quotes, backslashes, or unicode escapes can no longer produce invalid JSON that breaks downstream `json.loads` (qodo Correctness bug) or be mis-read (gemini unicode-escape finding). Adds sha2 + serde_json to beir-bench deps (both already in the workspace lock). - local_topk: guard `k > 0` before `select_nth_unstable_by(k - 1, ..)` so a zero top_k can never underflow to usize::MAX. (gemini) beir_eval.py (gemini perf): vectorize the paired bootstrap — draw all (n_iters x n) resample indices at once and reduce along the query axis, instead of an n_iters Python loop. Same paired resampling, NumPy-internal speed. Release-publish invariant (CI fail): the crate `exclude` was over-broad (`benchmarks/`), which dropped `benchmarks/rank_modes_results.txt` — a README-linked file the publish invariant requires in the package. Narrow the exclude to `benchmarks/beir/` + `benchmarks/beir-bench/` (the dev-only BEIR harness + figures + bench crate); the synthetic-bench results files stay packaged. Verified with `cargo package --list`. README (Codex + review nits): - Drop the false "nothing here is hand-entered / every figure regenerated" claim: the harness writes the figures + nDCG/timing summaries, the tables transcribe them, and you can regenerate/verify everything (latencies vary by hardware/batch). Clarify the default run covers scifact + trec-covid (nfcorpus /fiqa supported), not all four. - Add a balanced above-the-fold one-liner (quality-at-compression + no-build + single-query latency; HNSW wins threaded graph serving). - Remove a duplicated `### Synthetic stress test` heading. - The relative link to the now-excluded benchmarks/beir becomes an absolute GitHub URL so it doesn't dangle in the published crate. benchmarks/beir/README.md: refresh the stale page to the merged design — GGUF Q8 llama-cpp-python canonical lane (st/ollama optional), vendored BEIR loader, flat/hnsw/ordvec method table, timing.jsonl + figures, and a corrected `import ordvec` rule (external driver; the hot path is the Rust beir-bench binary; the Python ordvec package is intentionally not imported and the wheel is not required). Verified: fmt --all --check, clippy -p ordvec / -p beir-bench (0 warnings), build --locked, cargo package --list (rank_modes kept, beir excluded), py_compile, serde_json output parses + sha2 == system/Python digest, vectorized bootstrap matches. Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
|
Review round 1 remediated in
Plus the failing release-publish-invariants gate: the crate Verified locally: fmt, clippy (0/0), build --locked, package list, serde_json output parses, vectorized bootstrap matches. |
`_download_beir` called `ZipFile.extractall()` on a remote archive without validating member paths — a malicious/tampered zip could path-traverse (`../`, absolute paths) and overwrite arbitrary files under the running user. Validate every member before extracting: resolve `raw_dir / member` and reject any whose resolved path isn't `raw_dir` itself or under it, raising a clear ValueError. Verified the guard rejects `../`, `../../etc/...`, `/etc/passwd`, and `a/../../escape` while allowing legitimate `dataset/...` members. Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
flat and the ordvec rows are deterministic (byte-identical run to run, verified); the hnsw row is approximate — hnsw_rs builds the graph with a parallel insert, so its nDCG and latency vary slightly between runs (≈±0.003 nDCG, within the same bootstrap-noise band). Note this in the quality section so the "regenerate every number" claim stays honest; the story (hnsw ≈ flat within noise; ordvec within noise at 8–16× smaller) is unchanged. Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
) OpenSSF Scorecard / OSV flagged ~20 advisories on main after the BEIR benchmark landed (#237). ALL are dev/benchmark tooling — none reach the published `ordvec` crate or the `ordvec` PyPI wheel. Python (benchmarks/beir/requirements.txt): the deps were UNPINNED, so OSV flagged each against its entire historical CVE list (an unconstrained version cannot be ruled non-vulnerable). The actual resolved-latest versions are already patched. Lower-bound-pin every package at its first patched release — clears the flags (OSV excludes a `>=fixed` range) while `>=` keeps installs on the latest compatible wheel, incl. recent CPython: - requests>=2.32.4 (GHSA-9hjg-9r4m-mvj7 .netrc leak + all older requests CVEs) - hnswlib>=0.8.0 (GHSA-xwc8-rf6m-xr86 double free) - numpy>=1.26.0 (symlink-write + incorrect-comparison CVEs) - safe floors for scipy/pandas/tqdm/tabulate/huggingface-hub/faiss-cpu/ pytrec-eval-terrier/matplotlib. Verified the local cp314 venv satisfies all. Rust (RUSTSEC-2025-0141): bincode 1.x is UNMAINTAINED (informational advisory, not a vulnerability), pulled only transitively via hnsw_rs in the dev-only benchmarks/beir-bench harness. `cargo tree -p ordvec` is clean of bincode, so it does not reach the shipped crate. Add a documented deny.toml ignore so cargo-deny (configured to error on unmaintained crates) stays green; revisit if a maintained HNSW crate that does not pull bincode 1.x is adopted. Verified: `cargo tree -p ordvec` clean of bincode; `cargo deny check advisories` ok; benchmark venv versions satisfy the new floors. Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
Reworks the BEIR benchmark into a single all-Rust comparison harness and leads the README with the reproducible results. Replaces the earlier split design (ordvec in Rust vs FAISS/hnswlib in Python), which crossed a language/FFI boundary and used an un-stated batch/thread baseline — not apples-to-apples.
What changed
benchmarks/beir-bench— new workspace member (publish = false), isolated so itshnsw_rs/matrixmultiplydeps never touch the-p ordvecdeps gate or the published crate. All latency is measured in one process, matched batch:flat— exact inner product (identical retrieval to FAISSIndexFlatIP) via a pure-Rust SIMD GEMM (matrixmultiply) — a competitive baseline, not a scalar strawman.hnsw— pure-Rust HNSW (hnsw_rs, M=32/ef=128) — the portable stand-in for C++ hnswlib.rq2/rq4+bitmap-rq2/sign-rq2via the batched/pooled/SIMD fast paths.--threads Npins query latency to a rayon pool (build still uses all cores);--max-docs Msub-samples the corpus for the scaling sweep.Python stays out of the hot path — it embeds (Harrier-Q8 GGUF via
llama-cpp-python, CUDA, with a skip-if-cached guard), scores nDCG@10 vs qrels, and renders figures (beir_plot.py).requirements.txtdrops the unbuildablebeir/pytrec_eval(vendored loader +pytrec-eval-terrier).make benchmark-beir= guardrail → quality → scaling → graphics,.NOTPARALLELso an inheritedMAKEFLAGS=-jNcan't race the cache. Deletedbeir_baselines.py+examples/beir_ordvec.rs(superseded).README — dropped the private-arXiv block; added a "Benchmark at a glance" hero + a BEIR section with the nDCG table, three latency views, and an explicit ordvec-vs-HNSW tradeoff table.
Headline results (trec-covid, 171,332 docs, Harrier-Q8 1024-d)
Single-query latency, exact
flatvs ordvec — and the speedup grows with corpus size:nDCG@10 vs qrels (within bootstrap noise of exact, at 8–16× smaller): scifact rq4 = 0.7549 vs flat 0.7551; trec-covid ordvec rows 0.7613–0.7638 vs flat 0.7574.
Honest framing (in the README): ordvec's huge latency win is single-query/low-batch and grows with
n; under large-batch throughput a batched exact GEMM is strong and HNSW threads best. Where HNSW edges threaded latency it pays 8–16× the memory and a 51 s graph build (ordvec: ~0.3 s, training-free), and ordvec still wins single-query (~3×) and ties quality.flatis a comparison reference, not ground truth.Test plan
cargo fmt --all --checkcargo clippy -p ordvec --all-targets --all-features+-p beir-bench— 0 warningscargo build --lockedcargo tree -p ordvec --all-features --edges normal,build,devclean (no blas/ndarray/faer/statrs — member is isolated)python3 -m py_compileall harness scriptsmake benchmark-beiron scifact + trec-covid (171K docs) — figures + tables regenerated end-to-enddefault-members)Figures in the README reference absolute
mainraw URLs and render after merge.