bench: all-Rust BEIR benchmark + README above-the-fold #237

Merged: Navi Bot (project-navi-bot) merged 5 commits into main from feat/beir-benchmark on Jun 15, 2026

@Fieldnote-Echo (Member)

Reworks the BEIR benchmark into a single all-Rust comparison harness and leads the README with the reproducible results. Replaces the earlier split design (ordvec in Rust vs FAISS/hnswlib in Python), which crossed a language/FFI boundary and used an unstated batch/thread baseline, so the comparison was not apples-to-apples.

What changed

benchmarks/beir-bench — new workspace member (publish = false), isolated so its hnsw_rs/matrixmultiply deps never touch the -p ordvec deps gate or the published crate. All latency is measured in one process, matched batch:

  • flat — exact inner product (identical retrieval to FAISS IndexFlatIP) via a pure-Rust SIMD GEMM (matrixmultiply) — a competitive baseline, not a scalar strawman.
  • hnsw — pure-Rust HNSW (hnsw_rs, M=32/ef=128) — the portable stand-in for C++ hnswlib.
  • ordvec rq2/rq4 + bitmap-rq2/sign-rq2 via the batched/pooled/SIMD fast paths.
  • --threads N pins query latency to a rayon pool (build still uses all cores); --max-docs M sub-samples the corpus for the scaling sweep.

Python stays out of the hot path: it embeds (Harrier-Q8 GGUF via llama-cpp-python, CUDA, with a skip-if-cached guard), scores nDCG@10 vs qrels, and renders figures (beir_plot.py). requirements.txt drops the unbuildable beir/pytrec_eval in favor of a vendored loader + pytrec-eval-terrier. make benchmark-beir runs guardrail → quality → scaling → graphics, with .NOTPARALLEL so an inherited MAKEFLAGS=-jN can't race the cache. Deleted beir_baselines.py + examples/beir_ordvec.rs (both superseded by beir-bench).

README — dropped the private-arXiv block; added a "Benchmark at a glance" hero + a BEIR section with the nDCG table, three latency views, and an explicit ordvec-vs-HNSW tradeoff table.

Headline results (trec-covid, 171,332 docs, Harrier-Q8 1024-d)

Single-query latency, exact flat vs ordvec — and the speedup grows with corpus size:

[Figure: scaling curve]

| regime | flat | ordvec sign→rq2 | speedup |
| --- | --- | --- | --- |
| single query (batch 1, 1 thread) | 56.2 ms | 0.53 ms | ~106× |
| batched (batch 32, 1 thread) | 4.03 ms | 0.50 ms | ~8× |
| threaded (batch 32, 32 threads) | 1.08 ms | 0.52 ms | ~2× (HNSW leads at 4.8×) |
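
As a quick sanity check on the speedup column, the ratios can be recomputed from the two latency columns. This is only arithmetic on the numbers quoted above; the dictionary layout and names are illustrative, not part of the harness.

```python
# Latencies in ms copied from the table: (flat, ordvec sign->rq2).
regimes = {
    "single query (batch 1, 1 thread)": (56.2, 0.53),
    "batched (batch 32, 1 thread)": (4.03, 0.50),
    "threaded (batch 32, 32 threads)": (1.08, 0.52),
}


def speedup(flat_ms, ordvec_ms):
    """Ratio of exact-flat latency to ordvec latency."""
    return flat_ms / ordvec_ms


speedups = {name: speedup(*ms) for name, ms in regimes.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```

The rounded ratios agree with the ~106×, ~8×, and ~2× figures in the table.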

nDCG@10 vs qrels (within bootstrap noise of exact, at 8–16× smaller): scifact rq4 = 0.7549 vs flat 0.7551; trec-covid ordvec rows 0.7613–0.7638 vs flat 0.7574.
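
For readers unfamiliar with the metric, here is a minimal sketch of nDCG@10 using the textbook formula (linear gain, log2 rank discount). The harness scores with pytrec-eval-terrier, so treat this as an illustration under that assumption rather than the scoring code itself; the function names and the toy qrels are hypothetical.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k gains (rank 0 is the top hit)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query; qrels maps doc_id -> graded relevance."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


# Toy example: two judged documents, one ranking in ideal order, one swapped.
toy_qrels = {"d1": 2, "d2": 1}
perfect = ndcg_at_k(["d1", "d2", "d3"], toy_qrels)
swapped = ndcg_at_k(["d2", "d1", "d3"], toy_qrels)
print(perfect, swapped)
```

A per-dataset score such as the ones quoted above would be the mean of this value over all queries.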

Honest framing (in the README): ordvec's huge latency win is single-query/low-batch and grows with n; under large-batch throughput a batched exact GEMM is strong and HNSW threads best. Where HNSW edges threaded latency it pays 8–16× the memory and a 51 s graph build (ordvec: ~0.3 s, training-free), and ordvec still wins single-query (~3×) and ties quality. flat is a comparison reference, not ground truth.

Test plan

  • cargo fmt --all --check
  • cargo clippy -p ordvec --all-targets --all-features + -p beir-bench — 0 warnings
  • cargo build --locked
  • deps gate: cargo tree -p ordvec --all-features --edges normal,build,dev clean (no blas/ndarray/faer/statrs — member is isolated)
  • python3 -m py_compile all harness scripts
  • full make benchmark-beir on scifact + trec-covid (171K docs) — figures + tables regenerated end-to-end
  • CI green (lint / no-default / experimental / MSRV 1.89 / deps + publish dry-run — core gates unaffected; member is not in default-members)

Figures in the README reference absolute main raw URLs and render after merge.

Adds a reproducible BEIR benchmark per the repo spec, evaluating nDCG@10 vs BEIR
qrels with the same Harrier embeddings across OrdVec, FAISS FlatIP, and HNSW.
Python prepares public BEIR data + Harrier embeddings (cached normalized f32
.npy) and evaluates qrels; Rust owns every OrdVec hot path; FAISS/HNSW are native
baselines. No 'import ordvec' in the harness (the hot path is the Rust crate).

- benchmarks/beir/: common.py (shared contract), beir_prepare.py (BEIR download +
  Harrier encode, sentence-transformers/CUDA + Ollama/GGUF lanes, validation),
  beir_baselines.py (FAISS FlatIP + hnswlib), beir_eval.py (nDCG@10/MAP/Recall/MRR/
  Precision vs qrels + paired bootstrap CIs vs FAISS), beir_report.py (tables),
  README, requirements.txt.
- examples/beir_ordvec.rs: loads cached .npy, runs rq2/rq4 (batched
  search_asymmetric) and bitmap-rq2/sign-rq2 (batched CSR candidate-gen + pooled
  SubsetScratch allocation-free rerank); emits top-k JSONL (real rank-ordered
  scores) + summary JSON (bytes/vec, build/latency p50/p95/p99, simd_detected).
  No BEIR metrics computed in Rust.
- root Makefile: benchmark-beir / -smoke / -bm25 + bench-beir-{setup,prepare,
  prepare-ollama,ordvec,baselines,eval,guardrail,clean,clean-cache}.
- guardrail target greps for 'import ordvec' in benchmarks/beir and fails if it finds a match.
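
The guardrail itself is a Make target built on grep. As a rough Python equivalent of the same check, the sketch below scans a directory for scripts that import `ordvec`. The function name is hypothetical, and matching `from ordvec ...` as well is an assumption that goes slightly beyond the literal grep pattern.

```python
import re
from pathlib import Path

# Matches "import ordvec", "import ordvec.sub", and (as an assumed extension)
# "from ordvec import ..."; it does not match names like "ordvec_utils".
_ORDVEC_IMPORT = re.compile(
    r"^\s*(?:import\s+ordvec\b|from\s+ordvec\b)", re.MULTILINE
)


def find_ordvec_imports(root):
    """Return the .py files under root that import the ordvec package."""
    offenders = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if _ORDVEC_IMPORT.search(text):
            offenders.append(path)
    return offenders
```

A caller would exit non-zero when the returned list is non-empty, mirroring the fail-on-match behaviour of the Make target.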

Headline discipline: nDCG@10 vs qrels is the metric; FAISS FlatIP is a full-float
dense baseline, NOT ground truth; ANN-recall-vs-FAISS is an optional diagnostic.

Verified: cargo build --release --example beir_ordvec (+fmt), python3 -m
py_compile all scripts, make -n benchmark-beir-smoke.

Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
…fold

Replace the split Python-baselines + single-method-example harness with a single
all-Rust comparison binary, and lead the README with the reproducible results.

Why: the previous design measured ordvec (Rust) against FAISS/hnswlib (Python),
so the latency comparison crossed a language/FFI boundary and used a single-query
baseline — not apples-to-apples, and the headline multiplier depended on
un-stated batch/thread choices.

benchmarks/beir-bench (new workspace member, publish=false, isolated so its
hnsw_rs/matrixmultiply deps never touch the `-p ordvec` deps gate or the
published crate):
- flat: exact inner product (== FAISS IndexFlatIP retrieval) via a pure-Rust
  SIMD GEMM (matrixmultiply) — a competitive baseline, not a scalar strawman.
- hnsw: pure-Rust HNSW (hnsw_rs, M=32/ef=128) — the portable stand-in for C++
  hnswlib (no maintained Rust binding to the latter exists).
- ordvec rq2/rq4 + bitmap-rq2/sign-rq2 via the batched/pooled/SIMD fast paths.
- One process, matched batch; `--threads N` pins query latency to a rayon pool
  (build still uses all cores), `--max-docs M` sub-samples the corpus for the
  scaling sweep. Emits per-(method,n,threads) timing.jsonl + full-corpus
  topk/summary for offline nDCG.
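
Because timing.jsonl accumulates one record per (method, n, threads) run, a consumer has to decide which record wins when a configuration is re-run. The sketch below keeps the last record per key, which is how the file-change summary later on this page describes beir_plot.py ("dedupes by last-run"). The field names `method`, `n`, `threads`, and `p50_ms` are assumptions for illustration; the real schema is not shown here.

```python
import json


def dedupe_last_run(lines, key_fields=("method", "n", "threads")):
    """Keep only the most recent record for each key; later lines win."""
    latest = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in an append-only file
        record = json.loads(line)
        key = tuple(record.get(field) for field in key_fields)
        latest[key] = record
    return list(latest.values())


sample = [
    '{"method": "flat", "n": 1000, "threads": 1, "p50_ms": 5.0}',
    "",
    '{"method": "hnsw", "n": 1000, "threads": 1, "p50_ms": 1.5}',
    '{"method": "flat", "n": 1000, "threads": 1, "p50_ms": 4.0}',
]
deduped = dedupe_last_run(sample)
print(deduped)
```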

Harness/Python:
- beir_prepare.py: canonical llamacpp lane (GGUF Q8 Harrier via llama-cpp-python,
  CUDA) + skip-if-cached guard so multi-target runs don't re-embed.
- beir_plot.py (new): renders the three README figures (scaling curve +
  single-thread/threaded bars) from timing.jsonl.
- beir_eval/beir_report: classify the flat/hnsw slugs; baseline-relative deltas
  use the actual baseline name.
- requirements.txt: drop the unbuildable `beir`/`pytrec_eval`; vendored BEIR
  loader + pytrec-eval-terrier + huggingface-hub + matplotlib; llama-cpp-python
  is built with CUDA flags by `make bench-beir-setup`.
- Makefile: benchmark-beir = guardrail + quality(nDCG) + scaling + graphics;
  .NOTPARALLEL so an inherited MAKEFLAGS=-jN can't race the cache.
- Delete beir_baselines.py (Python FAISS/hnswlib) and examples/beir_ordvec.rs —
  both superseded by beir-bench.

README: drop the private-arXiv real-embedding block; add a "Benchmark at a
glance" hero (scaling curve + one-command reproduce) and a BEIR section with the
nDCG@10 table, the three latency views (single-query ~100x, batched, threaded),
and an explicit ordvec-vs-HNSW tradeoff table (HNSW edges threaded latency;
ordvec wins build [training-free vs 51s], memory [8-16x], and single-query).
benchmarks/ excluded from the published crate; figures referenced by absolute
raw URL.

Verified: cargo fmt --all --check, clippy -p ordvec / -p beir-bench (0 warnings),
cargo build --locked, deps gate (`cargo tree -p ordvec` clean), py_compile all
scripts, and a full `make benchmark-beir` on scifact + trec-covid (171,332 docs).

Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@qodo-code-review

qodo-code-review Bot commented Jun 15, 2026

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX issues (0) 🔗 Cross-repo conflicts (0)

Action required

1. Invalid JSON serialization ✓ Resolved 🐞 Bug ≡ Correctness
Description
beir-bench writes *.topk.jsonl and timing.jsonl by interpolating string fields (e.g., qid,
doc_id, dataset) into JSON without escaping, so quotes/backslashes/newlines in IDs will produce
invalid JSON. Downstream scripts parse these files with json.loads(...), so the benchmark pipeline
can fail at eval/plot time.
Code

benchmarks/beir-bench/src/main.rs[R401-425]

+            doc_idxs_str.push_str(&di_usize.to_string());
+            let doc_id = if di_usize < n_corpus {
+                corpus_ids[di_usize].as_str()
+            } else {
+                ""
+            };
+            doc_ids_str.push('"');
+            doc_ids_str.push_str(doc_id);
+            doc_ids_str.push('"');
+            let sc = scores.get(qi * k + j).copied().unwrap_or(0.0);
+            if sc.is_finite() {
+                scores_str.push_str(&sc.to_string());
+            } else {
+                scores_str.push_str("0.0");
+            }
+        }
+        doc_idxs_str.push(']');
+        doc_ids_str.push(']');
+        scores_str.push(']');
+
+        writeln!(
+            writer,
+            r#"{{"dataset":"{dataset}","split":"{split}","method":"{method}","qid_idx":{qi},"qid":"{qid}","k":{k},"doc_idxs":{doc_idxs_str},"doc_ids":{doc_ids_str},"scores":{scores_str}}}"#,
+            qid = query_ids[qi],
+        )
Evidence
write_topk_jsonl and write_record_json insert raw string values directly into JSON (no
escaping), while the Python harness reads these files using json.loads(...), which will fail on
invalid JSON.

benchmarks/beir-bench/src/main.rs[371-426]
benchmarks/beir-bench/src/main.rs[454-486]
benchmarks/beir/common.py[280-289]
benchmarks/beir/beir_plot.py[50-57]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`benchmarks/beir-bench/src/main.rs` manually constructs JSON strings for `*.topk.jsonl`, `*.summary.json`, and `timing.jsonl` by concatenating/interpolating unescaped string values (doc IDs, query IDs, dataset names, etc.). This can emit invalid JSON when any field contains characters that must be escaped in JSON, and it breaks consumers that use strict `json.loads` parsing.

## Issue Context
Python consumers (`benchmarks/beir/common.py`, `benchmarks/beir/beir_eval.py`, `benchmarks/beir/beir_plot.py`) read these outputs via `json.loads(line)`, so malformed JSON stops the benchmark workflow.

## Fix Focus Areas
- benchmarks/beir-bench/src/main.rs[371-486]

## Implementation notes
- Add `serde` + `serde_json` as dependencies **in `benchmarks/beir-bench` only**.
- Define small serializable structs (e.g., `TopkRow`, `Record`) and write them via `serde_json::to_writer` / `to_string`.
- Ensure arrays like `simd_detected` are serialized as JSON arrays, not via manual string building.
- Optionally replace `load_json_string_array` with `serde_json::from_reader` for robustness once `serde_json` is available.
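
The failure mode is easy to reproduce from the consumer side. The sketch below is a Python analogue, not the Rust code under review: it interpolates an ID containing a quote and a backslash into a JSON template, shows that `json.loads` rejects the result, and then shows that a real serializer round-trips the same ID. The sample ID is made up.

```python
import json

# A hypothetical document ID containing a quote and a backslash.
doc_id = 'doc"42\\a'

# Naive interpolation: the embedded quote ends the JSON string early.
naive = '{"qid":"q1","doc_ids":["%s"]}' % doc_id

try:
    json.loads(naive)
    naive_ok = True
except json.JSONDecodeError:
    naive_ok = False

# A serializer escapes the value, so the consumer gets the original ID back.
safe = json.dumps({"qid": "q1", "doc_ids": [doc_id]})
round_tripped = json.loads(safe)["doc_ids"][0]

print(naive_ok, round_tripped == doc_id)
```

The same reasoning motivates the suggested move to `serde_json` on the Rust side: escaping is handled by the serializer instead of by every call site.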

Remediation recommended

2. Unsafe zip extraction ✓ Resolved 🐞 Bug ⛨ Security
Description
beir_prepare._download_beir downloads a zip from a remote host and calls
ZipFile.extractall(raw_dir) without validating member paths, which is vulnerable to Zip Slip (path
traversal) if the archive is malicious or tampered with. This can overwrite arbitrary files on the
machine running the benchmark, under that user's permissions.
Code

benchmarks/beir/beir_prepare.py[R294-296]

+    with zipfile.ZipFile(zip_path) as zf:
+        zf.extractall(raw_dir)
+    zip_path.unlink(missing_ok=True)
Evidence
The downloader fetches a remote zip and extracts it with extractall directly into raw_dir
without any path validation, which is the classic Zip Slip pattern.

benchmarks/beir/beir_prepare.py[261-296]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`benchmarks/beir/beir_prepare.py` extracts a downloaded zip via `ZipFile.extractall(raw_dir)` without checking archive member paths. A crafted archive containing `../` or absolute paths could write outside `raw_dir` (Zip Slip).

## Issue Context
Even though the URL host is fixed, the code is still extracting a remote archive and should fail-closed by validating that every extracted path remains within the intended directory.

## Fix Focus Areas
- benchmarks/beir/beir_prepare.py[261-301]

## Implementation notes
- Replace `extractall` with a safe extraction routine:
 - Iterate over `zf.infolist()`.
 - For each member, compute the destination path as `(raw_dir / member.filename)` using POSIX semantics for zip paths.
 - Resolve/normalize and verify it is within `raw_dir` (e.g., `dest.resolve().is_relative_to(raw_dir.resolve())` on py>=3.9, or manual prefix check).
 - Reject absolute paths and any member with `..` components.
- Consider validating `dataset` against a conservative regex (e.g., `^[A-Za-z0-9_-]+$`) before using it in filenames/paths.
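
A minimal sketch of the validation described in these notes, assuming a helper named `safe_extract` (hypothetical; the actual fix in beir_prepare.py may be structured differently). It checks every member before anything is written, so an unsafe archive extracts nothing.

```python
import zipfile
from pathlib import Path


def safe_extract(zip_path, raw_dir):
    """Extract zip_path into raw_dir, refusing members that would escape it."""
    root = Path(raw_dir).resolve()
    with zipfile.ZipFile(zip_path) as zf:
        # Validate first: fail closed before a single file is written.
        for member in zf.infolist():
            # Joining an absolute member name replaces root entirely, and
            # resolve() collapses any ".." components, so both escape routes
            # end up outside root and are rejected below.
            dest = (root / member.filename).resolve()
            if dest != root and root not in dest.parents:
                raise ValueError(f"unsafe zip member: {member.filename!r}")
        zf.extractall(root)
```

This sketch does not inspect symlink entries inside the archive; a stricter routine could reject those as well.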

@codecov

codecov Bot commented Jun 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a reproducible BEIR benchmark harness, adding a new Rust workspace member beir-bench and Python scripts to measure and plot ordvec's retrieval latency and quality against exact flat and HNSW baselines. The review feedback is highly constructive, identifying several robustness and performance improvements. These include replacing fragile external shell commands for SHA-256 hashing with the pure-Rust sha2 crate, replacing a custom JSON parser with serde_json to correctly handle unicode escapes, vectorizing the Python bootstrap loop using NumPy for a significant speedup, and adding a defensive check in local_topk to prevent a potential underflow panic when top_k is zero.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread benchmarks/beir-bench/src/main.rs
Comment thread benchmarks/beir-bench/src/main.rs
Comment thread benchmarks/beir-bench/Cargo.toml
Comment thread benchmarks/beir/beir_eval.py Outdated
Comment thread benchmarks/beir-bench/src/main.rs Outdated
@qodo-code-review

PR Summary by Qodo

Add all-Rust BEIR benchmark harness and surface results in README
✨ Enhancement 📝 Documentation ⚙️ Configuration changes 🕐 40+ Minutes

Walkthroughs

Description
• Add reproducible BEIR benchmark pipeline: embed → run all-Rust retrieval → score qrels → plot
  figures.
• Introduce beir-bench Rust binary to benchmark ordvec vs exact flat and pure-Rust HNSW
  in-process.
• Update README/CHANGELOG to lead with BEIR quality + latency results and reproduction instructions.
Diagram
graph TD
  A["Makefile: benchmark-beir"] --> B["Python: beir_prepare.py"] --> C[("Embedding cache")]
  C --> D["Rust: beir-bench"] --> E[("results/beir")]
  E --> F["Python: beir_eval.py + beir_report.py"] --> E
  E --> G["Python: beir_plot.py"] --> H["README figures + tables"]
  subgraph Legend
    direction LR
    _doc["Document/command"] ~~~ _proc["Process"] ~~~ _db[("Data files")]
  end
High-Level Assessment

The following are alternative approaches to this PR:

1. Use existing Rust crates for .npy/JSON parsing (serde_json, ndarray-npy)
  • ➕ Much less custom parsing code in beir-bench (lower maintenance risk)
  • ➕ Better format coverage and error handling
  • ➕ Easier future extension (e.g., other dtypes/shapes, richer JSON schemas)
  • ➖ Adds dependencies (may conflict with the repo's dependency-gating goals)
  • ➖ Potentially larger compile times for a dev-only harness
  • ➖ May require extra care to keep benchmark member deps isolated from ordvec publish surface
2. Keep baselines in Python (FAISS/hnswlib) and benchmark ordvec via FFI boundary
  • ➕ Baseline implementations match commonly-used reference libraries
  • ➕ Less Rust code for baseline kernels
  • ➖ Cross-language boundary complicates apples-to-apples latency comparisons
  • ➖ Harder to ensure matched batching/threading and consistent timing methodology
  • ➖ More moving parts (ABI/FFI, packaging) for reproducibility

Recommendation: The PR’s approach (single-process Rust benchmarking for all retrieval methods, with Python only for embedding/eval/plotting) is the best fit for fair latency comparisons and reproducibility. If maintenance burden becomes a concern, consider swapping the hand-rolled .npy/JSON readers in beir-bench for lightweight parsing crates while keeping the benchmark crate isolated from the published ordvec dependency gate.

File Changes

Enhancement (2)
Cargo.toml Create dev-only 'beir-bench' Rust binary crate with isolated dependencies +28/-0

• Adds a new workspace member binary ('publish = false') that depends on 'ordvec', 'hnsw_rs', 'rayon', and 'matrixmultiply'. The crate is designed to keep these benchmark-only deps from affecting the published 'ordvec' crate dependency gate.

benchmarks/beir-bench/Cargo.toml


main.rs Implement all-Rust retrieval benchmark harness (flat/HNSW/ordvec) with pinned threading +1327/-0

• Implements the 'beir-bench' binary to load cached embeddings, validate normalization, and benchmark multiple retrieval methods in a single process with matched batching and a configurable rayon pool ('--threads'). Produces append-only timing records plus full-corpus top-k JSONL and per-method summary JSON for downstream qrels evaluation and plotting.

benchmarks/beir-bench/src/main.rs


Documentation (4)
CHANGELOG.md Document new reproducible BEIR benchmark harness and README refresh +15/-0

• Adds an Unreleased changelog entry describing the new end-to-end BEIR benchmark pipeline, its in-process latency measurement design, and the README’s shift to public, reproducible results.

CHANGELOG.md


README.md Lead README with BEIR benchmark results and reproduction instructions +126/-41

• Adds an above-the-fold "Benchmark at a glance" section with scaling/latency figures and explicit reproduction commands. Replaces the previous private dataset section with a detailed BEIR benchmark section including nDCG@10 table, three latency regimes, and an explicit ordvec-vs-HNSW tradeoff framing.

README.md


README.md Add harness documentation, claims discipline, and cache/results contracts +158/-0

• Documents the BEIR harness goals, claims policy, dataset suite, encoder configuration, quickstart commands, and the cache/results file layouts. Explicitly states and enforces the "no 'import ordvec'" rule for Python harness scripts.

benchmarks/beir/README.md


beir_report.py Generate markdown report tables and embed required claims text +470/-0

• Adds report rendering to produce comparison matrices and per-dataset/rollup tables from evaluation summaries, always including encoder provider metadata. Includes verbatim required-claims paragraphs to keep benchmark reporting policy consistent.

benchmarks/beir/beir_report.py


Other (8)
Cargo.toml Add 'benchmarks/' to exclude list and register 'beir-bench' workspace member +2/-1

• Updates workspace configuration to include 'benchmarks/beir-bench' as a member while keeping benchmarks excluded from the published crate surface. Keeps 'default-members' unchanged so CI/publish behavior remains focused on the core crate.

Cargo.toml


Makefile Add 'benchmark-beir' orchestration targets with guardrails and serial execution +194/-0

• Introduces a complete Make-driven BEIR benchmarking pipeline (setup, build, guardrail, quality, scaling, plot, cleanup). Enforces '.NOTPARALLEL' to avoid cache races and adds a guardrail that fails if benchmark Python code imports the 'ordvec' Python package directly.

Makefile


beir_eval.py Add qrels-based evaluation + paired bootstrap deltas and artifact generation +789/-0

• Adds a BEIR run evaluator that loads '.topk.jsonl' outputs, computes nDCG/MAP/Recall/MRR/Precision via pytrec_eval semantics, and performs paired bootstrap deltas vs a baseline method. Writes machine-readable summaries and triggers report rendering for README-friendly markdown outputs.

benchmarks/beir/beir_eval.py


beir_plot.py Render scaling curve and latency bar charts from Rust timing records +234/-0

• Adds a headless matplotlib plotting script that reads 'timing.jsonl', dedupes by last-run, and generates the README’s scaling curve and latency bars for single-thread and multi-thread regimes. Outputs both PNG and SVG variants.

benchmarks/beir/beir_plot.py


beir_prepare.py Download BEIR datasets and produce cached Harrier embeddings with provenance +800/-0

• Implements dataset download/loading (vendored BEIR reader) and embedding production for multiple providers (canonical llama-cpp GGUF lane plus optional ST/Ollama lanes). Writes normalized float32 '.npy' embeddings, ids/qrels, manifests, and checksums, with skip-if-cached behavior for expensive corpora.

benchmarks/beir/beir_prepare.py


common.py Add shared path/slug/manifest utilities and embedding validation contract +289/-0

• Introduces shared utilities for encoder slugging, cache discovery, manifest/qrels/id loading, '.npy' loading/validation, and hashing. Provides the common contract used by prepare/eval/report scripts.

benchmarks/beir/common.py


requirements.txt Define Python dependencies for embedding, baselines, evaluation, and plotting +39/-0

• Adds a dedicated requirements file for the BEIR harness, explicitly avoiding the 'beir' package due to 'pytrec_eval' build issues and using 'pytrec-eval-terrier' instead. Includes baseline deps (faiss-cpu, hnswlib), plotting (matplotlib), and download helpers.

benchmarks/beir/requirements.txt


.gitkeep Ensure 'results/beir' directory is tracked +0/-0

• Adds a placeholder to keep the BEIR results directory present in a clean checkout, matching the harness’s default output path.

results/beir/.gitkeep


Comment thread benchmarks/beir-bench/src/main.rs Outdated
Remediate the gemini/qodo/Codex review on #237 and the failing
release-publish-invariants gate.

beir-bench (gemini HIGH/MED + qodo Bug):
- sha256_file: pure-Rust `sha2` instead of shelling out to sha256sum/shasum/
  openssl — portable (Windows / minimal containers) and byte-identical to the
  Python hashlib digest. (gemini, main.rs)
- load_json_string_array + the topk/timing/summary writers: use `serde_json`
  for both reading and writing, so document/query IDs containing quotes,
  backslashes, or unicode escapes can no longer produce invalid JSON that
  breaks downstream `json.loads` (qodo Correctness bug) or be mis-read
  (gemini unicode-escape finding). Adds sha2 + serde_json to beir-bench deps
  (both already in the workspace lock).
- local_topk: guard `k > 0` before `select_nth_unstable_by(k - 1, ..)` so a
  zero top_k can never underflow to usize::MAX. (gemini)
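
For reference, the Python side of the "byte-identical to the Python hashlib digest" claim can be sketched as a chunked file hash. The function name mirrors the Rust `sha256_file` for readability, but this is an illustrative stand-in, not the harness's own hashing helper; the chunk size is an arbitrary choice.

```python
import hashlib


def sha256_file(path, chunk_size=1 << 20):
    """Hex SHA-256 of a file, read in chunks so large .npy files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Any correct SHA-256 implementation yields the same hex string for the same bytes, which is why swapping the shell-out for the `sha2` crate leaves the recorded checksums unchanged.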

beir_eval.py (gemini perf): vectorize the paired bootstrap — draw all
(n_iters x n) resample indices at once and reduce along the query axis, instead
of an n_iters Python loop. Same paired resampling, NumPy-internal speed.
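
A minimal sketch of what such a vectorized paired bootstrap can look like, assuming per-query metric arrays for a method and a baseline. The function name, defaults, and the percentile-interval choice are assumptions for illustration; the actual beir_eval.py code is not shown on this page.

```python
import numpy as np


def paired_bootstrap_ci(method_scores, baseline_scores,
                        n_iters=1000, alpha=0.05, seed=0):
    """Mean paired delta plus a percentile bootstrap interval."""
    method = np.asarray(method_scores, dtype=np.float64)
    baseline = np.asarray(baseline_scores, dtype=np.float64)
    if method.ndim != 1 or method.shape != baseline.shape or method.size == 0:
        raise ValueError("expected two non-empty 1-D arrays of equal length")

    deltas = method - baseline                      # per-query paired difference
    n = deltas.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, n, size=(n_iters, n))     # all resample indices at once
    boot_means = deltas[idx].mean(axis=1)           # reduce along the query axis
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(deltas.mean()), float(lo), float(hi)
```

The trade-off is memory: the index matrix holds n_iters × n integers, which is fine for query sets of BEIR size but worth chunking for much larger ones.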

Release-publish invariant (CI fail): the crate `exclude` was over-broad
(`benchmarks/`), which dropped `benchmarks/rank_modes_results.txt` — a
README-linked file the publish invariant requires in the package. Narrow the
exclude to `benchmarks/beir/` + `benchmarks/beir-bench/` (the dev-only BEIR
harness + figures + bench crate); the synthetic-bench results files stay
packaged. Verified with `cargo package --list`.

README (Codex + review nits):
- Drop the false "nothing here is hand-entered / every figure regenerated"
  claim: the harness writes the figures + nDCG/timing summaries, the tables
  transcribe them, and you can regenerate/verify everything (latencies vary by
  hardware/batch). Clarify the default run covers scifact + trec-covid (nfcorpus
  /fiqa supported), not all four.
- Add a balanced above-the-fold one-liner (quality-at-compression + no-build +
  single-query latency; HNSW wins threaded graph serving).
- Remove a duplicated `### Synthetic stress test` heading.
- The relative link to the now-excluded benchmarks/beir becomes an absolute
  GitHub URL so it doesn't dangle in the published crate.

benchmarks/beir/README.md: refresh the stale page to the merged design — GGUF
Q8 llama-cpp-python canonical lane (st/ollama optional), vendored BEIR loader,
flat/hnsw/ordvec method table, timing.jsonl + figures, and a corrected
`import ordvec` rule (external driver; the hot path is the Rust beir-bench
binary; the Python ordvec package is intentionally not imported and the wheel
is not required).

Verified: fmt --all --check, clippy -p ordvec / -p beir-bench (0 warnings),
build --locked, cargo package --list (rank_modes kept, beir excluded),
py_compile, serde_json output parses + sha2 == system/Python digest, vectorized
bootstrap matches.

Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
@Fieldnote-Echo (Member, Author)

Review round 1 remediated in 3004a41:

  • sha256_file (gemini HIGH) → pure-Rust sha2 (portable; byte-identical to the Python hashlib digest, verified).
  • load_json_string_array + all JSON writers (gemini unicode + qodo invalid-JSON Bug) → serde_json for read and write, so IDs with quotes/backslashes/unicode can't produce invalid JSON.
  • beir-bench deps (gemini) → added sha2/serde_json (already in the workspace lock).
  • local_topk (gemini) → k > 0 guard before select_nth_unstable_by(k-1).
  • paired bootstrap (gemini perf) → vectorized over (n_iters × n) in NumPy.

Plus the failing release-publish-invariants gate: the crate exclude was over-broad and dropped benchmarks/rank_modes_results.txt (a required packaged file) — narrowed to benchmarks/beir/ + benchmarks/beir-bench/ only (verified with cargo package --list). README claims tightened (no "nothing hand-entered"; default run = scifact+trec-covid), duplicate heading removed, and benchmarks/beir/README.md refreshed to the all-Rust design.

Verified locally: fmt, clippy (0/0), build --locked, package list, serde_json output parses, vectorized bootstrap matches.

`_download_beir` called `ZipFile.extractall()` on a remote archive without
validating member paths — a malicious/tampered zip could path-traverse
(`../`, absolute paths) and overwrite arbitrary files under the running user.

Validate every member before extracting: resolve `raw_dir / member` and reject
any whose resolved path isn't `raw_dir` itself or under it, raising a clear
ValueError. Verified the guard rejects `../`, `../../etc/...`, `/etc/passwd`,
and `a/../../escape` while allowing legitimate `dataset/...` members.

Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
flat and the ordvec rows are deterministic (byte-identical run to run, verified);
the hnsw row is approximate — hnsw_rs builds the graph with a parallel insert, so
its nDCG and latency vary slightly between runs (≈±0.003 nDCG, within the same
bootstrap-noise band). Note this in the quality section so the "regenerate every
number" claim stays honest; the story (hnsw ≈ flat within noise; ordvec within
noise at 8–16× smaller) is unchanged.

Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
@project-navi-bot Navi Bot (project-navi-bot) merged commit b5551f0 into main Jun 15, 2026
38 checks passed
@project-navi-bot Navi Bot (project-navi-bot) deleted the feat/beir-benchmark branch June 15, 2026 14:53
Navi Bot (project-navi-bot) pushed a commit that referenced this pull request Jun 15, 2026
)

OpenSSF Scorecard / OSV flagged ~20 advisories on main after the BEIR benchmark
landed (#237). ALL are dev/benchmark tooling — none reach the published `ordvec`
crate or the `ordvec` PyPI wheel.

Python (benchmarks/beir/requirements.txt): the deps were UNPINNED, so OSV flagged
each against its entire historical CVE list (an unconstrained version cannot be
ruled non-vulnerable). The actual resolved-latest versions are already patched.
Lower-bound-pin every package at its first patched release — clears the flags
(OSV excludes a `>=fixed` range) while `>=` keeps installs on the latest
compatible wheel, incl. recent CPython:
  - requests>=2.32.4   (GHSA-9hjg-9r4m-mvj7 .netrc leak + all older requests CVEs)
  - hnswlib>=0.8.0     (GHSA-xwc8-rf6m-xr86 double free)
  - numpy>=1.26.0      (symlink-write + incorrect-comparison CVEs)
  - safe floors for scipy/pandas/tqdm/tabulate/huggingface-hub/faiss-cpu/
    pytrec-eval-terrier/matplotlib. Verified the local cp314 venv satisfies all.

Rust (RUSTSEC-2025-0141): bincode 1.x is UNMAINTAINED (informational advisory,
not a vulnerability), pulled only transitively via hnsw_rs in the dev-only
benchmarks/beir-bench harness. `cargo tree -p ordvec` is clean of bincode, so it
does not reach the shipped crate. Add a documented deny.toml ignore so cargo-deny
(configured to error on unmaintained crates) stays green; revisit if a maintained
HNSW crate that does not pull bincode 1.x is adopted.

Verified: `cargo tree -p ordvec` clean of bincode; `cargo deny check advisories`
ok; benchmark venv versions satisfy the new floors.

Signed-off-by: Nelson Spence <nelson@projectnavi.ai>