bench: all-Rust BEIR benchmark + README above-the-fold by Fieldnote-Echo · Pull Request #237 · Project-Navi/ordvec

Fieldnote-Echo · 2026-06-15T14:18:35Z

Reworks the BEIR benchmark into a single all-Rust comparison harness and leads the README with the reproducible results. Replaces the earlier split design (ordvec in Rust vs FAISS/hnswlib in Python), which crossed a language/FFI boundary and used an un-stated batch/thread baseline — not apples-to-apples.

What changed

benchmarks/beir-bench — new workspace member (publish = false), isolated so its hnsw_rs/matrixmultiply deps never touch the -p ordvec deps gate or the published crate. All latency is measured in one process, matched batch:

- flat: exact inner product (identical retrieval to FAISS IndexFlatIP) via a pure-Rust SIMD GEMM (matrixmultiply); a competitive baseline, not a scalar strawman.
- hnsw: pure-Rust HNSW (hnsw_rs, M=32/ef=128), the portable stand-in for C++ hnswlib.
- ordvec rq2/rq4 + bitmap-rq2/sign-rq2 via the batched/pooled/SIMD fast paths.
- --threads N pins query latency to a rayon pool (build still uses all cores); --max-docs M sub-samples the corpus for the scaling sweep.
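
The retrieval semantics of the flat baseline (score every document by inner product, keep the k best) can be sketched in a few lines. This is a stdlib-only Python illustration, not the harness code: the real path is the Rust beir-bench binary with a SIMD GEMM, and `flat_topk` and its arguments are hypothetical names chosen for this example.

```python
import heapq
from typing import List, Sequence, Tuple


def flat_topk(query: Sequence[float],
              corpus: Sequence[Sequence[float]],
              k: int) -> List[Tuple[int, float]]:
    """Exact inner-product search: score every row, return the k best as (doc_idx, score)."""
    if k <= 0:
        # Nothing requested; return early instead of indexing with k - 1.
        return []
    scored = (
        (idx, sum(q * d for q, d in zip(query, doc)))
        for idx, doc in enumerate(corpus)
    )
    # Highest score first; ties broken by lower doc index so results are deterministic.
    return heapq.nsmallest(k, scored, key=lambda pair: (-pair[1], pair[0]))


corpus = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
hits = flat_topk([0.6, 0.8], corpus, k=2)
print(hits)
```

Because every document is scored, the result is exact; that is why flat serves as the comparison reference for the approximate methods.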

Python stays out of the hot path — it embeds (Harrier-Q8 GGUF via llama-cpp-python, CUDA, with a skip-if-cached guard), scores nDCG@10 vs qrels, and renders figures (beir_plot.py). requirements.txt drops the unbuildable beir/pytrec_eval (vendored loader + pytrec-eval-terrier). make benchmark-beir = guardrail → quality → scaling → graphics, .NOTPARALLEL so an inherited MAKEFLAGS=-jN can't race the cache. Deleted beir_baselines.py + examples/beir_ordvec.rs (superseded).

README — dropped the private-arXiv block; added a "Benchmark at a glance" hero + a BEIR section with the nDCG table, three latency views, and an explicit ordvec-vs-HNSW tradeoff table.

Headline results (trec-covid, 171,332 docs, Harrier-Q8 1024-d)

Single-query latency, exact flat vs ordvec — and the speedup grows with corpus size:

| regime | flat | ordvec sign→rq2 | speedup |
| --- | --- | --- | --- |
| single query (batch 1, 1 thread) | 56.2 ms | 0.53 ms | ~106× |
| batched (batch 32, 1 thread) | 4.03 ms | 0.50 ms | ~8× |
| threaded (batch 32, 32 threads) | 1.08 ms | 0.52 ms | ~2× (HNSW leads at 4.8×) |
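
The speedup column is the ratio of the flat latency to the ordvec latency. A small check of that arithmetic, using only the numbers reported above (the dictionary layout is just for illustration):

```python
# (flat ms, ordvec sign->rq2 ms) per regime, copied from the reported results.
latencies_ms = {
    "single query (batch 1, 1 thread)": (56.2, 0.53),
    "batched (batch 32, 1 thread)": (4.03, 0.50),
    "threaded (batch 32, 32 threads)": (1.08, 0.52),
}

# Speedup = flat latency / ordvec latency.
speedups = {regime: flat / ordvec for regime, (flat, ordvec) in latencies_ms.items()}

for regime, ratio in speedups.items():
    print(f"{regime}: {ratio:.1f}x")
```

The ratios round to roughly 106×, 8×, and 2×, matching the table.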

nDCG@10 vs qrels (within bootstrap noise of exact, at 8–16× smaller): scifact rq4 = 0.7549 vs flat 0.7551; trec-covid ordvec rows 0.7613–0.7638 vs flat 0.7574.
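
For readers unfamiliar with the metric, here is a minimal sketch of nDCG@10 for a single query. It assumes the common formulation with linear gain and a log2(rank + 1) discount; the harness itself scores with pytrec-eval-terrier, so treat this as an illustration of the idea rather than the exact evaluator. `ndcg_at_k` is a hypothetical helper name.

```python
import math
from typing import Dict, List


def ndcg_at_k(ranked_doc_ids: List[str], qrels: Dict[str, int], k: int = 10) -> float:
    """nDCG@k with linear gain and a log2(rank + 1) discount (assumed formulation)."""

    def dcg(gains: List[int]) -> float:
        # enumerate() is 0-based, so 1-based rank r = i + 1 and the discount is log2(r + 1).
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    # Ideal ranking: all judged-relevant documents sorted by relevance, truncated to k.
    ideal = sorted((g for g in qrels.values() if g > 0), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


score = ndcg_at_k(["d7", "d1"], {"d1": 1})
print(round(score, 4))
```

The per-query values are then averaged over all queries to give the dataset-level number reported above.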

Honest framing (in the README): ordvec's huge latency win is single-query/low-batch and grows with n; under large-batch throughput a batched exact GEMM is strong and HNSW threads best. Where HNSW edges threaded latency it pays 8–16× the memory and a 51 s graph build (ordvec: ~0.3 s, training-free), and ordvec still wins single-query (~3×) and ties quality. flat is a comparison reference, not ground truth.

Test plan

- cargo fmt --all --check
- cargo clippy -p ordvec --all-targets --all-features + -p beir-bench: 0 warnings
- cargo build --locked
- deps gate: cargo tree -p ordvec --all-features --edges normal,build,dev clean (no blas/ndarray/faer/statrs; member is isolated)
- python3 -m py_compile all harness scripts
- full make benchmark-beir on scifact + trec-covid (171K docs): figures + tables regenerated end-to-end
- CI green (lint / no-default / experimental / MSRV 1.89 / deps + publish dry-run; core gates unaffected, member is not in default-members)

Figures in the README are referenced by absolute raw URLs on main, so they render once this PR is merged.

Adds a reproducible BEIR benchmark per the repo spec, evaluating nDCG@10 vs BEIR qrels with the same Harrier embeddings across OrdVec, FAISS FlatIP, and HNSW. Python prepares public BEIR data + Harrier embeddings (cached normalized f32 .npy) and evaluates qrels; Rust owns every OrdVec hot path; FAISS/HNSW are native baselines. No 'import ordvec' in the harness (the hot path is the Rust crate).

- benchmarks/beir/: common.py (shared contract), beir_prepare.py (BEIR download + Harrier encode, sentence-transformers/CUDA + Ollama/GGUF lanes, validation), beir_baselines.py (FAISS FlatIP + hnswlib), beir_eval.py (nDCG@10/MAP/Recall/MRR/Precision vs qrels + paired bootstrap CIs vs FAISS), beir_report.py (tables), README, requirements.txt.
- examples/beir_ordvec.rs: loads cached .npy, runs rq2/rq4 (batched search_asymmetric) and bitmap-rq2/sign-rq2 (batched CSR candidate-gen + pooled SubsetScratch allocation-free rerank); emits top-k JSONL (real rank-ordered scores) + summary JSON (bytes/vec, build/latency p50/p95/p99, simd_detected). No BEIR metrics computed in Rust.
- root Makefile: benchmark-beir / -smoke / -bm25 + bench-beir-{setup,prepare,prepare-ollama,ordvec,baselines,eval,guardrail,clean,clean-cache}.
- guardrail target greps for 'import ordvec' in benchmarks/beir and fails.

Headline discipline: nDCG@10 vs qrels is the metric; FAISS FlatIP is a full-float dense baseline, NOT ground truth; ANN-recall-vs-FAISS is an optional diagnostic.

Verified: cargo build --release --example beir_ordvec (+fmt), python3 -m py_compile all scripts, make -n benchmark-beir-smoke.

Signed-off-by: Nelson Spence <nelson@projectnavi.ai>

…fold

Replace the split Python-baselines + single-method-example harness with a single all-Rust comparison binary, and lead the README with the reproducible results.

Why: the previous design measured ordvec (Rust) against FAISS/hnswlib (Python), so the latency comparison crossed a language/FFI boundary and used a single-query baseline. That is not apples-to-apples, and the headline multiplier depended on un-stated batch/thread choices.

benchmarks/beir-bench (new workspace member, publish=false, isolated so its hnsw_rs/matrixmultiply deps never touch the `-p ordvec` deps gate or the published crate):

- flat: exact inner product (== FAISS IndexFlatIP retrieval) via a pure-Rust SIMD GEMM (matrixmultiply); a competitive baseline, not a scalar strawman.
- hnsw: pure-Rust HNSW (hnsw_rs, M=32/ef=128), the portable stand-in for C++ hnswlib (no maintained Rust binding to the latter exists).
- ordvec rq2/rq4 + bitmap-rq2/sign-rq2 via the batched/pooled/SIMD fast paths.
- One process, matched batch; `--threads N` pins query latency to a rayon pool (build still uses all cores), `--max-docs M` sub-samples the corpus for the scaling sweep. Emits per-(method,n,threads) timing.jsonl + full-corpus topk/summary for offline nDCG.

Harness/Python:

- beir_prepare.py: canonical llamacpp lane (GGUF Q8 Harrier via llama-cpp-python, CUDA) + skip-if-cached guard so multi-target runs don't re-embed.
- beir_plot.py (new): renders the three README figures (scaling curve + single-thread/threaded bars) from timing.jsonl.
- beir_eval/beir_report: classify the flat/hnsw slugs; baseline-relative deltas use the actual baseline name.
- requirements.txt: drop the unbuildable `beir`/`pytrec_eval`; vendored BEIR loader + pytrec-eval-terrier + huggingface-hub + matplotlib; llama-cpp-python is built with CUDA flags by `make bench-beir-setup`.
- Makefile: benchmark-beir = guardrail + quality(nDCG) + scaling + graphics; .NOTPARALLEL so an inherited MAKEFLAGS=-jN can't race the cache.
- Delete beir_baselines.py (Python FAISS/hnswlib) and examples/beir_ordvec.rs, both superseded by beir-bench.

README: drop the private-arXiv real-embedding block; add a "Benchmark at a glance" hero (scaling curve + one-command reproduce) and a BEIR section with the nDCG@10 table, the three latency views (single-query ~100x, batched, threaded), and an explicit ordvec-vs-HNSW tradeoff table (HNSW edges threaded latency; ordvec wins build [training-free vs 51s], memory [8-16x], and single-query).

benchmarks/ excluded from the published crate; figures referenced by absolute raw URL.

Verified: cargo fmt --all --check, clippy -p ordvec / -p beir-bench (0 warnings), cargo build --locked, deps gate (`cargo tree -p ordvec` clean), py_compile all scripts, and a full `make benchmark-beir` on scifact + trec-covid (171,332 docs).

Signed-off-by: Nelson Spence <nelson@projectnavi.ai>

chatgpt-codex-connector · 2026-06-15T14:18:41Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

qodo-code-review · 2026-06-15T14:18:44Z

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX issues (0) 🔗 Cross-repo conflicts (0)

1. ~~Invalid JSON serialization~~ ✓ Resolved 🐞 Bug ≡ Correctness

Description

beir-bench writes *.topk.jsonl and timing.jsonl by interpolating string fields (e.g., qid,
doc_id, dataset) into JSON without escaping, so quotes/backslashes/newlines in IDs will produce
invalid JSON. Downstream scripts parse these files with json.loads(...), so the benchmark pipeline
can fail at eval/plot time.

Code

benchmarks/beir-bench/src/main.rs[R401-425]

+            doc_idxs_str.push_str(&di_usize.to_string());
+            let doc_id = if di_usize < n_corpus {
+                corpus_ids[di_usize].as_str()
+            } else {
+                ""
+            };
+            doc_ids_str.push('"');
+            doc_ids_str.push_str(doc_id);
+            doc_ids_str.push('"');
+            let sc = scores.get(qi * k + j).copied().unwrap_or(0.0);
+            if sc.is_finite() {
+                scores_str.push_str(&sc.to_string());
+            } else {
+                scores_str.push_str("0.0");
+            }
+        }
+        doc_idxs_str.push(']');
+        doc_ids_str.push(']');
+        scores_str.push(']');
+
+        writeln!(
+            writer,
+            r#"{{"dataset":"{dataset}","split":"{split}","method":"{method}","qid_idx":{qi},"qid":"{qid}","k":{k},"doc_idxs":{doc_idxs_str},"doc_ids":{doc_ids_str},"scores":{scores_str}}}"#,
+            qid = query_ids[qi],
+        )

Evidence
write_topk_jsonl and write_record_json insert raw string values directly into JSON (no
escaping), while the Python harness reads these files using json.loads(...), which will fail on
invalid JSON.
benchmarks/beir-bench/src/main.rs[371-426]
benchmarks/beir-bench/src/main.rs[454-486]
benchmarks/beir/common.py[280-289]
benchmarks/beir/beir_plot.py[50-57]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`benchmarks/beir-bench/src/main.rs` manually constructs JSON strings for `*.topk.jsonl`, `*.summary.json`, and `timing.jsonl` by concatenating/interpolating unescaped string values (doc IDs, query IDs, dataset names, etc.). This can emit invalid JSON when any field contains characters that must be escaped in JSON, and it breaks consumers that use strict `json.loads` parsing.

## Issue Context
Python consumers (`benchmarks/beir/common.py`, `benchmarks/beir/beir_eval.py`, `benchmarks/beir/beir_plot.py`) read these outputs via `json.loads(line)`, so malformed JSON stops the benchmark workflow.

## Fix Focus Areas
- benchmarks/beir-bench/src/main.rs[371-486]

## Implementation notes
- Add `serde` + `serde_json` as dependencies **in `benchmarks/beir-bench` only**.
- Define small serializable structs (e.g., `TopkRow`, `Record`) and write them via `serde_json::to_writer` / `to_string`.
- Ensure arrays like `simd_detected` are serialized as JSON arrays, not via manual string building.
- Optionally replace `load_json_string_array` with `serde_json::from_reader` for robustness once `serde_json` is available.
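
To make the failure mode concrete from the consumer side, the sketch below shows how a Python reader using `json.loads` reacts to a line built by naive interpolation, compared with a line produced by a real serializer. The ID value is made up for illustration, and Python's `json.dumps` stands in for the serializer-based writer recommended above.

```python
import json

# A hypothetical ID containing a quote, a backslash, and a newline.
doc_id = 'doc "42"\\part\n2'

# Naive interpolation, analogous to the manual string building flagged above.
naive_line = '{"doc_id":"' + doc_id + '"}'
try:
    json.loads(naive_line)
    naive_ok = True
except json.JSONDecodeError:
    naive_ok = False

# A serializer escapes the same ID, so the line parses and round-trips exactly.
safe_line = json.dumps({"doc_id": doc_id})
round_tripped = json.loads(safe_line)["doc_id"]

print(naive_ok, round_tripped == doc_id)
```

The naive line is rejected by the strict parser, while the serialized line round-trips to the original ID.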


2. ~~Unsafe zip extraction~~ ✓ Resolved 🐞 Bug ⛨ Security

Description

beir_prepare._download_beir downloads a zip from a remote host and calls
ZipFile.extractall(raw_dir) without validating member paths, which is vulnerable to Zip Slip (path
traversal) if the archive is malicious or tampered with. This can overwrite arbitrary files on the
machine running the benchmark, under that user's permissions.

Code

benchmarks/beir/beir_prepare.py[R294-296]

+    with zipfile.ZipFile(zip_path) as zf:
+        zf.extractall(raw_dir)
+    zip_path.unlink(missing_ok=True)

Evidence

The downloader fetches a remote zip and extracts it with extractall directly into raw_dir
without any path validation, which is the classic Zip Slip pattern.

benchmarks/beir/beir_prepare.py[261-296]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`benchmarks/beir/beir_prepare.py` extracts a downloaded zip via `ZipFile.extractall(raw_dir)` without checking archive member paths. A crafted archive containing `../` or absolute paths could write outside `raw_dir` (Zip Slip).

## Issue Context
Even though the URL host is fixed, the code is still extracting a remote archive and should fail-closed by validating that every extracted path remains within the intended directory.

## Fix Focus Areas
- benchmarks/beir/beir_prepare.py[261-301]

## Implementation notes
- Replace `extractall` with a safe extraction routine:
  - Iterate over `zf.infolist()`.
  - For each member, compute the destination path as `(raw_dir / member.filename)` using POSIX semantics for zip paths.
  - Resolve/normalize and verify it is within `raw_dir` (e.g., `dest.resolve().is_relative_to(raw_dir.resolve())` on py>=3.9, or manual prefix check).
  - Reject absolute paths and any member with `..` components.
- Consider validating `dataset` against a conservative regex (e.g., `^[A-Za-z0-9_-]+$`) before using it in filenames/paths.
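
The extraction routine described in these notes could look roughly like the following. This is a minimal sketch under stated assumptions, not the code that landed in `beir_prepare.py`: `safe_extract` is a hypothetical helper name, and it validates every member before extracting anything so that a bad archive fails closed.

```python
import zipfile
from pathlib import Path


def safe_extract(zip_path: Path, raw_dir: Path) -> None:
    """Extract zip_path into raw_dir, refusing members that would land outside it."""
    root = raw_dir.resolve()
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.infolist():
            # Joining with an absolute member name discards root, so absolute
            # paths fail the containment check just like ../ traversal does.
            dest = (root / member.filename).resolve()
            if dest != root and root not in dest.parents:
                raise ValueError(f"unsafe zip member: {member.filename!r}")
        # Extract only after every member has passed validation.
        zf.extractall(root)
```

Validating the whole archive first means a single malicious member aborts the run before any file is written.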


codecov · 2026-06-15T14:19:55Z

Codecov Report

✅ All modified and coverable lines are covered by tests.


gemini-code-assist

Code Review

This pull request introduces a reproducible BEIR benchmark harness, adding a new Rust workspace member beir-bench and Python scripts to measure and plot ordvec's retrieval latency and quality against exact flat and HNSW baselines. The review feedback is highly constructive, identifying several robustness and performance improvements. These include replacing fragile external shell commands for SHA-256 hashing with the pure-Rust sha2 crate, replacing a custom JSON parser with serde_json to correctly handle unicode escapes, vectorizing the Python bootstrap loop using NumPy for a significant speedup, and adding a defensive check in local_topk to prevent a potential underflow panic when top_k is zero.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

qodo-code-review · 2026-06-15T14:20:49Z

PR Summary by Qodo

Add all-Rust BEIR benchmark harness and surface results in README
✨ Enhancement 📝 Documentation ⚙️ Configuration changes 🕐 40+ Minutes

Walkthroughs

Description

• Add reproducible BEIR benchmark pipeline: embed → run all-Rust retrieval → score qrels → plot
  figures.
• Introduce beir-bench Rust binary to benchmark ordvec vs exact flat and pure-Rust HNSW
  in-process.
• Update README/CHANGELOG to lead with BEIR quality + latency results and reproduction instructions.

Diagram

graph TD
  A["Makefile: benchmark-beir"] --> B["Python: beir_prepare.py"] --> C[("Embedding cache")]
  C --> D["Rust: beir-bench"] --> E[("results/beir")]
  E --> F["Python: beir_eval.py + beir_report.py"] --> E
  E --> G["Python: beir_plot.py"] --> H["README figures + tables"]
  subgraph Legend
    direction LR
    _doc["Document/command"] ~~~ _proc["Process"] ~~~ _db[("Data files")]
  end

High-Level Assessment

The following are alternative approaches to this PR:

1. Use existing Rust crates for .npy/JSON parsing (serde_json, ndarray-npy)

➕ Much less custom parsing code in beir-bench (lower maintenance risk)
➕ Better format coverage and error handling
➕ Easier future extension (e.g., other dtypes/shapes, richer JSON schemas)
➖ Adds dependencies (may conflict with the repo's dependency-gating goals)
➖ Potentially larger compile times for a dev-only harness
➖ May require extra care to keep benchmark member deps isolated from ordvec publish surface

2. Keep baselines in Python (FAISS/hnswlib) and benchmark ordvec via FFI boundary

➕ Baseline implementations match commonly-used reference libraries
➕ Less Rust code for baseline kernels
➖ Cross-language boundary complicates apples-to-apples latency comparisons
➖ Harder to ensure matched batching/threading and consistent timing methodology
➖ More moving parts (ABI/FFI, packaging) for reproducibility

Recommendation: The PR’s approach (single-process Rust benchmarking for all retrieval methods, with Python only for embedding/eval/plotting) is the best fit for fair latency comparisons and reproducibility. If maintenance burden becomes a concern, consider swapping the hand-rolled .npy/JSON readers in beir-bench for lightweight parsing crates while keeping the benchmark crate isolated from the published ordvec dependency gate.

File Changes

Enhancement (2)

Cargo.toml Create dev-only 'beir-bench' Rust binary crate with isolated dependencies +28/-0
• Adds a new workspace member binary ('publish = false') that depends on 'ordvec', 'hnsw_rs', 'rayon', and 'matrixmultiply'. The crate is designed to keep these benchmark-only deps from affecting the published 'ordvec' crate dependency gate.
benchmarks/beir-bench/Cargo.toml

main.rs Implement all-Rust retrieval benchmark harness (flat/HNSW/ordvec) with pinned threading +1327/-0
• Implements the 'beir-bench' binary to load cached embeddings, validate normalization, and benchmark multiple retrieval methods in a single process with matched batching and a configurable rayon pool ('--threads'). Produces append-only timing records plus full-corpus top-k JSONL and per-method summary JSON for downstream qrels evaluation and plotting.
benchmarks/beir-bench/src/main.rs

Documentation (4)

CHANGELOG.md Document new reproducible BEIR benchmark harness and README refresh +15/-0
• Adds an Unreleased changelog entry describing the new end-to-end BEIR benchmark pipeline, its in-process latency measurement design, and the README’s shift to public, reproducible results.
CHANGELOG.md

README.md Lead README with BEIR benchmark results and reproduction instructions +126/-41
• Adds an above-the-fold "Benchmark at a glance" section with scaling/latency figures and explicit reproduction commands. Replaces the previous private dataset section with a detailed BEIR benchmark section including nDCG@10 table, three latency regimes, and an explicit ordvec-vs-HNSW tradeoff framing.
README.md

README.md Add harness documentation, claims discipline, and cache/results contracts +158/-0
• Documents the BEIR harness goals, claims policy, dataset suite, encoder configuration, quickstart commands, and the cache/results file layouts. Explicitly states and enforces the "no 'import ordvec'" rule for Python harness scripts.
benchmarks/beir/README.md

beir_report.py Generate markdown report tables and embed required claims text +470/-0
• Adds report rendering to produce comparison matrices and per-dataset/rollup tables from evaluation summaries, always including encoder provider metadata. Includes verbatim required-claims paragraphs to keep benchmark reporting policy consistent.
benchmarks/beir/beir_report.py

Other (8)

Cargo.toml Add 'benchmarks/' to exclude list and register 'beir-bench' workspace member +2/-1
• Updates workspace configuration to include 'benchmarks/beir-bench' as a member while keeping benchmarks excluded from the published crate surface. Keeps 'default-members' unchanged so CI/publish behavior remains focused on the core crate.
Cargo.toml

Makefile Add 'benchmark-beir' orchestration targets with guardrails and serial execution +194/-0
• Introduces a complete Make-driven BEIR benchmarking pipeline (setup, build, guardrail, quality, scaling, plot, cleanup). Enforces '.NOTPARALLEL' to avoid cache races and adds a guardrail that fails if benchmark Python code imports the 'ordvec' Python package directly.
Makefile

beir_eval.py Add qrels-based evaluation + paired bootstrap deltas and artifact generation +789/-0
• Adds a BEIR run evaluator that loads '.topk.jsonl' outputs, computes nDCG/MAP/Recall/MRR/Precision via pytrec_eval semantics, and performs paired bootstrap deltas vs a baseline method. Writes machine-readable summaries and triggers report rendering for README-friendly markdown outputs.
benchmarks/beir/beir_eval.py

beir_plot.py Render scaling curve and latency bar charts from Rust timing records +234/-0
• Adds a headless matplotlib plotting script that reads 'timing.jsonl', dedupes by last-run, and generates the README’s scaling curve and latency bars for single-thread and multi-thread regimes. Outputs both PNG and SVG variants.
benchmarks/beir/beir_plot.py

beir_prepare.py Download BEIR datasets and produce cached Harrier embeddings with provenance +800/-0
• Implements dataset download/loading (vendored BEIR reader) and embedding production for multiple providers (canonical llama-cpp GGUF lane plus optional ST/Ollama lanes). Writes normalized float32 '.npy' embeddings, ids/qrels, manifests, and checksums, with skip-if-cached behavior for expensive corpora.
benchmarks/beir/beir_prepare.py

common.py Add shared path/slug/manifest utilities and embedding validation contract +289/-0
• Introduces shared utilities for encoder slugging, cache discovery, manifest/qrels/id loading, '.npy' loading/validation, and hashing. Provides the common contract used by prepare/eval/report scripts.
benchmarks/beir/common.py

requirements.txt Define Python dependencies for embedding, baselines, evaluation, and plotting +39/-0
• Adds a dedicated requirements file for the BEIR harness, explicitly avoiding the 'beir' package due to 'pytrec_eval' build issues and using 'pytrec-eval-terrier' instead. Includes baseline deps (faiss-cpu, hnswlib), plotting (matplotlib), and download helpers.
benchmarks/beir/requirements.txt

.gitkeep Ensure 'results/beir' directory is tracked +0/-0
• Adds a placeholder to keep the BEIR results directory present in a clean checkout, matching the harness’s default output path.
results/beir/.gitkeep
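
The beir_plot.py entry above mentions reading the append-only `timing.jsonl` and de-duplicating by last run. A minimal sketch of that idea follows; the record field names (`method`, `n`, `threads`, `p50_ms`) are assumptions made for illustration and may not match the real schema, and `dedupe_last_run` is a hypothetical helper.

```python
import json
from typing import Dict, Iterable, Tuple

Key = Tuple[str, int, int]


def dedupe_last_run(lines: Iterable[str]) -> Dict[Key, dict]:
    """Keep only the most recent record per (method, n, threads) from append-only JSONL."""
    latest: Dict[Key, dict] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        rec = json.loads(line)
        # Assumed field names; a later line overwrites an earlier one with the same key.
        latest[(rec["method"], rec["n"], rec["threads"])] = rec
    return latest


sample = [
    '{"method": "flat", "n": 1000, "threads": 1, "p50_ms": 5.0}',
    '{"method": "flat", "n": 1000, "threads": 1, "p50_ms": 4.0}',
    '{"method": "hnsw", "n": 1000, "threads": 1, "p50_ms": 1.0}',
]
latest = dedupe_last_run(sample)
print(len(latest))
```

Keying on the last occurrence lets repeated benchmark runs append to the same file without skewing the plots.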

Remediate the gemini/qodo/Codex review on #237 and the failing release-publish-invariants gate.

beir-bench (gemini HIGH/MED + qodo Bug):

- sha256_file: pure-Rust `sha2` instead of shelling out to sha256sum/shasum/openssl; portable (Windows / minimal containers) and byte-identical to the Python hashlib digest. (gemini, main.rs)
- load_json_string_array + the topk/timing/summary writers: use `serde_json` for both reading and writing, so document/query IDs containing quotes, backslashes, or unicode escapes can no longer produce invalid JSON that breaks downstream `json.loads` (qodo Correctness bug) or be mis-read (gemini unicode-escape finding). Adds sha2 + serde_json to beir-bench deps (both already in the workspace lock).
- local_topk: guard `k > 0` before `select_nth_unstable_by(k - 1, ..)` so a zero top_k can never underflow to usize::MAX. (gemini)

beir_eval.py (gemini perf): vectorize the paired bootstrap by drawing all (n_iters x n) resample indices at once and reducing along the query axis, instead of an n_iters Python loop. Same paired resampling, NumPy-internal speed.

Release-publish invariant (CI fail): the crate `exclude` was over-broad (`benchmarks/`), which dropped `benchmarks/rank_modes_results.txt`, a README-linked file the publish invariant requires in the package. Narrow the exclude to `benchmarks/beir/` + `benchmarks/beir-bench/` (the dev-only BEIR harness + figures + bench crate); the synthetic-bench results files stay packaged. Verified with `cargo package --list`.

README (Codex + review nits):

- Drop the false "nothing here is hand-entered / every figure regenerated" claim: the harness writes the figures + nDCG/timing summaries, the tables transcribe them, and you can regenerate/verify everything (latencies vary by hardware/batch). Clarify the default run covers scifact + trec-covid (nfcorpus/fiqa supported), not all four.
- Add a balanced above-the-fold one-liner (quality-at-compression + no-build + single-query latency; HNSW wins threaded graph serving).
- Remove a duplicated `### Synthetic stress test` heading.
- The relative link to the now-excluded benchmarks/beir becomes an absolute GitHub URL so it doesn't dangle in the published crate.

benchmarks/beir/README.md: refresh the stale page to the merged design: GGUF Q8 llama-cpp-python canonical lane (st/ollama optional), vendored BEIR loader, flat/hnsw/ordvec method table, timing.jsonl + figures, and a corrected `import ordvec` rule (external driver; the hot path is the Rust beir-bench binary; the Python ordvec package is intentionally not imported and the wheel is not required).

Verified: fmt --all --check, clippy -p ordvec / -p beir-bench (0 warnings), build --locked, cargo package --list (rank_modes kept, beir excluded), py_compile, serde_json output parses + sha2 == system/Python digest, vectorized bootstrap matches.

Signed-off-by: Nelson Spence <nelson@projectnavi.ai>

Fieldnote-Echo · 2026-06-15T14:37:10Z

Review round 1 remediated in 3004a41:

sha256_file (gemini HIGH) → pure-Rust sha2 (portable; byte-identical to the Python hashlib digest, verified).
load_json_string_array + all JSON writers (gemini unicode + qodo invalid-JSON Bug) → serde_json for read and write, so IDs with quotes/backslashes/unicode can't produce invalid JSON.
beir-bench deps (gemini) → added sha2/serde_json (already in the workspace lock).
local_topk (gemini) → k > 0 guard before select_nth_unstable_by(k-1).
paired bootstrap (gemini perf) → vectorized over (n_iters × n) in NumPy.

Plus the failing release-publish-invariants gate: the crate exclude was over-broad and dropped benchmarks/rank_modes_results.txt (a required packaged file) — narrowed to benchmarks/beir/ + benchmarks/beir-bench/ only (verified with cargo package --list). README claims tightened (no "nothing hand-entered"; default run = scifact+trec-covid), duplicate heading removed, and benchmarks/beir/README.md refreshed to the all-Rust design.

Verified locally: fmt, clippy (0/0), build --locked, package list, serde_json output parses, vectorized bootstrap matches.
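
The vectorized paired bootstrap mentioned above can be sketched as follows. This is an illustrative version under stated assumptions, not the code in `beir_eval.py`: the function name, the percentile-based 95% interval, and the default iteration count are all hypothetical choices.

```python
import numpy as np


def paired_bootstrap_delta(method_scores, baseline_scores, n_iters=1000, seed=0):
    """Mean per-query delta plus a 95% percentile interval from a paired bootstrap."""
    a = np.asarray(method_scores, dtype=np.float64)
    b = np.asarray(baseline_scores, dtype=np.float64)
    if a.ndim != 1 or a.shape != b.shape or a.size == 0:
        raise ValueError("expected two equal-length, non-empty 1-D score arrays")
    diff = a - b  # pairing: both methods are resampled on the same queries
    rng = np.random.default_rng(seed)
    # Draw every resample index at once: shape (n_iters, n_queries).
    idx = rng.integers(0, diff.size, size=(n_iters, diff.size))
    # Reduce along the query axis to get one mean delta per bootstrap iteration.
    means = diff[idx].mean(axis=1)
    lo, hi = np.percentile(means, [2.5, 97.5])
    return float(diff.mean()), float(lo), float(hi)


delta, lo, hi = paired_bootstrap_delta([0.5, 0.7, 0.9], [0.4, 0.6, 0.8], n_iters=200)
print(delta, lo, hi)
```

Indexing with a 2-D integer array replaces the per-iteration Python loop with a single NumPy gather and reduction, which is where the speedup comes from.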

`_download_beir` called `ZipFile.extractall()` on a remote archive without validating member paths, so a malicious/tampered zip could path-traverse (`../`, absolute paths) and overwrite arbitrary files under the running user. Validate every member before extracting: resolve `raw_dir / member` and reject any whose resolved path isn't `raw_dir` itself or under it, raising a clear ValueError.

Verified the guard rejects `../`, `../../etc/...`, `/etc/passwd`, and `a/../../escape` while allowing legitimate `dataset/...` members.

Signed-off-by: Nelson Spence <nelson@projectnavi.ai>

flat and the ordvec rows are deterministic (byte-identical run to run, verified); the hnsw row is approximate: hnsw_rs builds the graph with a parallel insert, so its nDCG and latency vary slightly between runs (≈±0.003 nDCG, within the same bootstrap-noise band). Note this in the quality section so the "regenerate every number" claim stays honest; the story (hnsw ≈ flat within noise; ordvec within noise at 8–16× smaller) is unchanged.

Signed-off-by: Nelson Spence <nelson@projectnavi.ai>

)

OpenSSF Scorecard / OSV flagged ~20 advisories on main after the BEIR benchmark landed (#237). ALL are dev/benchmark tooling; none reach the published `ordvec` crate or the `ordvec` PyPI wheel.

Python (benchmarks/beir/requirements.txt): the deps were UNPINNED, so OSV flagged each against its entire historical CVE list (an unconstrained version cannot be ruled non-vulnerable). The actual resolved-latest versions are already patched. Lower-bound-pin every package at its first patched release. This clears the flags (OSV excludes a `>=fixed` range) while `>=` keeps installs on the latest compatible wheel, incl. recent CPython:

- requests>=2.32.4 (GHSA-9hjg-9r4m-mvj7 .netrc leak + all older requests CVEs)
- hnswlib>=0.8.0 (GHSA-xwc8-rf6m-xr86 double free)
- numpy>=1.26.0 (symlink-write + incorrect-comparison CVEs)
- safe floors for scipy/pandas/tqdm/tabulate/huggingface-hub/faiss-cpu/pytrec-eval-terrier/matplotlib.

Verified the local cp314 venv satisfies all.

Rust (RUSTSEC-2025-0141): bincode 1.x is UNMAINTAINED (informational advisory, not a vulnerability), pulled only transitively via hnsw_rs in the dev-only benchmarks/beir-bench harness. `cargo tree -p ordvec` is clean of bincode, so it does not reach the shipped crate. Add a documented deny.toml ignore so cargo-deny (configured to error on unmaintained crates) stays green; revisit if a maintained HNSW crate that does not pull bincode 1.x is adopted.

Verified: `cargo tree -p ordvec` clean of bincode; `cargo deny check advisories` ok; benchmark venv versions satisfy the new floors.

Signed-off-by: Nelson Spence <nelson@projectnavi.ai>

Nelson Spence (Fieldnote-Echo) added 2 commits June 14, 2026 22:29

Nelson Spence (Fieldnote-Echo) requested a review from Navi Bot (project-navi-bot) as a code owner June 15, 2026 14:18

gemini-code-assist Bot reviewed Jun 15, 2026


Comment thread benchmarks/beir-bench/src/main.rs

Comment thread benchmarks/beir-bench/src/main.rs

Comment thread benchmarks/beir-bench/Cargo.toml

Comment thread benchmarks/beir/beir_eval.py Outdated

Comment thread benchmarks/beir-bench/src/main.rs Outdated

qodo-code-review Bot reviewed Jun 15, 2026


Comment thread benchmarks/beir-bench/src/main.rs Outdated

Nelson Spence (Fieldnote-Echo) added 2 commits June 15, 2026 09:42

project-navi-bot approved these changes Jun 15, 2026


Nelson Spence (Fieldnote-Echo) mentioned this pull request Jun 15, 2026

bench(native): add parallel sign-rq2 row matching OrdinalDB serving #238

Open

Navi Bot (project-navi-bot) merged commit b5551f0 into main Jun 15, 2026
38 checks passed

Navi Bot (project-navi-bot) deleted the feat/beir-benchmark branch June 15, 2026 14:53

Nelson Spence (Fieldnote-Echo) mentioned this pull request Jun 15, 2026

fix(security): clear OSV/Scorecard advisories on dev-only benchmark deps #240

Merged

4 tasks

This was referenced Jun 15, 2026

feat: promote RankQuantFastscan to public API + .ovfs persistence #233

Merged

feat: ordvec on-disk format (.ov* magics) with full back-compat for legacy .tv* #230

Merged

This was referenced Jun 19, 2026

[codex] Add params-only Debug and SearchResults serde #248

Merged

fix(beir-bench): streaming row-bounded npy loader (--max-docs honored, ~2x less peak RAM) #258

Merged

Navi Bot (project-navi-bot) merged 5 commits into main from feat/beir-benchmark