From cfe08cb404fddd867f7eca581c56c755b6858d6a Mon Sep 17 00:00:00 2001 From: Nelson Spence Date: Sun, 14 Jun 2026 22:29:18 -0500 Subject: [PATCH 1/5] bench: add BEIR retrieval benchmark harness (make benchmark-beir) Adds a reproducible BEIR benchmark per the repo spec, evaluating nDCG@10 vs BEIR qrels with the same Harrier embeddings across OrdVec, FAISS FlatIP, and HNSW. Python prepares public BEIR data + Harrier embeddings (cached normalized f32 .npy) and evaluates qrels; Rust owns every OrdVec hot path; FAISS/HNSW are native baselines. No 'import ordvec' in the harness (the hot path is the Rust crate). - benchmarks/beir/: common.py (shared contract), beir_prepare.py (BEIR download + Harrier encode, sentence-transformers/CUDA + Ollama/GGUF lanes, validation), beir_baselines.py (FAISS FlatIP + hnswlib), beir_eval.py (nDCG@10/MAP/Recall/MRR/ Precision vs qrels + paired bootstrap CIs vs FAISS), beir_report.py (tables), README, requirements.txt. - examples/beir_ordvec.rs: loads cached .npy, runs rq2/rq4 (batched search_asymmetric) and bitmap-rq2/sign-rq2 (batched CSR candidate-gen + pooled SubsetScratch allocation-free rerank); emits top-k JSONL (real rank-ordered scores) + summary JSON (bytes/vec, build/latency p50/p95/p99, simd_detected). No BEIR metrics computed in Rust. - root Makefile: benchmark-beir / -smoke / -bm25 + bench-beir-{setup,prepare, prepare-ollama,ordvec,baselines,eval,guardrail,clean,clean-cache}. - guardrail target greps for 'import ordvec' in benchmarks/beir and fails. Headline discipline: nDCG@10 vs qrels is the metric; FAISS FlatIP is a full-float dense baseline, NOT ground truth; ANN-recall-vs-FAISS is an optional diagnostic. Verified: cargo build --release --example beir_ordvec (+fmt), python3 -m py_compile all scripts, make -n benchmark-beir-smoke. 
Signed-off-by: Nelson Spence --- .gitignore | 5 + Makefile | 176 +++++ benchmarks/beir/README.md | 158 ++++ benchmarks/beir/beir_baselines.py | 459 +++++++++++ benchmarks/beir/beir_eval.py | 770 +++++++++++++++++++ benchmarks/beir/beir_prepare.py | 565 ++++++++++++++ benchmarks/beir/beir_report.py | 469 +++++++++++ benchmarks/beir/common.py | 289 +++++++ benchmarks/beir/requirements.txt | 12 + examples/beir_ordvec.rs | 1195 +++++++++++++++++++++++++++++ results/beir/.gitkeep | 0 11 files changed, 4098 insertions(+) create mode 100644 Makefile create mode 100644 benchmarks/beir/README.md create mode 100644 benchmarks/beir/beir_baselines.py create mode 100644 benchmarks/beir/beir_eval.py create mode 100644 benchmarks/beir/beir_prepare.py create mode 100644 benchmarks/beir/beir_report.py create mode 100644 benchmarks/beir/common.py create mode 100644 benchmarks/beir/requirements.txt create mode 100644 examples/beir_ordvec.rs create mode 100644 results/beir/.gitkeep diff --git a/.gitignore b/.gitignore index d342254b..6d7da4c6 100644 --- a/.gitignore +++ b/.gitignore @@ -39,3 +39,8 @@ venv/ .DS_Store .idea/ .vscode/ + +# BEIR benchmark harness — embedding cache and result files. +/.cache/ordvec-beir/ +/results/beir/* +!/results/beir/.gitkeep diff --git a/Makefile b/Makefile new file mode 100644 index 00000000..c6208cf8 --- /dev/null +++ b/Makefile @@ -0,0 +1,176 @@ +# ordvec-beir benchmark harness +# Reproduces nDCG@10 on standard BEIR datasets using ordvec's rank/sign retrieval +# methods plus FAISS FlatIP + HNSW dense baselines (for comparison, NOT ground truth). 
+# +# Usage: +# make bench-beir-setup # install Python deps +# make benchmark-beir-smoke # quick sanity run (scifact only) +# make benchmark-beir # full suite + +# ── interpreter ────────────────────────────────────────────────────────────── +PY ?= python3 + +# ── paths ───────────────────────────────────────────────────────────────────── +CACHE_DIR := .cache/ordvec-beir +RESULTS_DIR := results/beir + +# ── dataset suite ───────────────────────────────────────────────────────────── +DATASETS := scifact nfcorpus fiqa trec-covid +SMOKE_DATASETS := scifact +SPLIT := test + +# ── retrieval parameters ───────────────────────────────────────────────────── +TOPK := 100 +K_VALUES := 10 100 +BATCH := 8 +CANDIDATES := 500 +SEED := 1 + +# ── encoder ─────────────────────────────────────────────────────────────────── +ENCODER_PROVIDER := st +HARRIER_MODEL := microsoft/harrier-oss-v1-0.6b +HARRIER_REVISION := +DEVICE := cuda +ENCODE_BATCH := 16 + +# ── ollama lane ─────────────────────────────────────────────────────────────── +OLLAMA_URL := http://localhost:11434 +HARRIER_GGUF_MODEL := hf.co/mradermacher/harrier-oss-v1-0.6b-GGUF:Q8_0 + +# ── baselines + ordvec methods ──────────────────────────────────────────────── +ORDVEC_METHODS := rq2,rq4,bitmap-rq2,sign-rq2 +BASELINE_METHODS := faiss-flat,hnswlib +HNSW_M := 32 +HNSW_EF_CONSTRUCT := 200 +HNSW_EF_SEARCH := 128 + +# ── phony ───────────────────────────────────────────────────────────────────── +.PHONY: benchmark-beir benchmark-beir-smoke benchmark-beir-bm25 \ + bench-beir-setup bench-beir-prepare bench-beir-prepare-ollama \ + bench-beir-ordvec bench-beir-baselines bench-beir-eval \ + bench-beir-guardrail bench-beir-clean bench-beir-clean-cache + +# ── top-level targets ───────────────────────────────────────────────────────── + +## Full benchmark run (guardrail → prepare → ordvec → baselines → eval) +benchmark-beir: bench-beir-guardrail bench-beir-prepare bench-beir-ordvec bench-beir-baselines bench-beir-eval + +## Smoke 
run: scifact only, quick sanity check +benchmark-beir-smoke: + $(MAKE) benchmark-beir \ + DATASETS=$(SMOKE_DATASETS) \ + TOPK=100 \ + ENCODE_BATCH=8 + +## Optional BM25 lane (placeholder — requires beir[bm25] extras) +benchmark-beir-bm25: + @echo "BM25 lane: install beir[bm25] extras then run:" + @echo " $(PY) benchmarks/beir/beir_baselines.py --methods bm25 \\" + @echo " --datasets $(DATASETS) --split $(SPLIT) \\" + @echo " --cache-dir $(CACHE_DIR) --out-dir $(RESULTS_DIR) --top-k $(TOPK)" + +# ── setup ───────────────────────────────────────────────────────────────────── + +## Install Python benchmark dependencies +bench-beir-setup: + $(PY) -m pip install -r benchmarks/beir/requirements.txt + +# ── guardrail ───────────────────────────────────────────────────────────────── + +## Fail loudly if any harness file imports the ordvec Python package directly. +## The harness is an EXTERNAL consumer — it must use the Rust crate at bench time, +## not the ordvec Python package. That coupling breaks the reproducibility claim. 
+bench-beir-guardrail: + @if grep -R "import ordvec" benchmarks/beir 2>/dev/null; then \ + echo ""; \ + echo "ERROR: benchmarks/beir/ must not contain 'import ordvec'."; \ + echo "The benchmark hot path is the Rust crate, not the ordvec Python package."; \ + exit 1; \ + fi + @echo "guardrail OK: no 'import ordvec' found in benchmarks/beir/" + +# ── prepare ─────────────────────────────────────────────────────────────────── + +## Download datasets and encode with Harrier (sentence-transformers / CUDA lane) +bench-beir-prepare: + $(PY) benchmarks/beir/beir_prepare.py \ + --datasets $(DATASETS) \ + --split $(SPLIT) \ + --provider $(ENCODER_PROVIDER) \ + --model "$(HARRIER_MODEL)" \ + $(if $(HARRIER_REVISION),--revision $(HARRIER_REVISION),) \ + --device "$(DEVICE)" \ + --batch-size $(ENCODE_BATCH) \ + --cache-dir "$(CACHE_DIR)" \ + --seed $(SEED) + +## Encode with Harrier via Ollama (CPU/quantised lane — no GPU required) +bench-beir-prepare-ollama: + $(PY) benchmarks/beir/beir_prepare.py \ + --datasets $(DATASETS) \ + --split $(SPLIT) \ + --provider ollama \ + --ollama-url "$(OLLAMA_URL)" \ + --model "$(HARRIER_GGUF_MODEL)" \ + --batch-size $(ENCODE_BATCH) \ + --cache-dir "$(CACHE_DIR)" \ + --seed $(SEED) + +# ── ordvec retrieval ────────────────────────────────────────────────────────── + +## Build the Rust beir_ordvec example binary and run all ordvec methods +bench-beir-ordvec: + cargo build --release --example beir_ordvec + @for dataset in $(DATASETS); do \ + $(CURDIR)/target/release/examples/beir_ordvec \ + --cache-dir "$(CACHE_DIR)" \ + --dataset "$$dataset" \ + --split $(SPLIT) \ + --top-k $(TOPK) \ + --batch $(BATCH) \ + --candidates $(CANDIDATES) \ + --methods $(ORDVEC_METHODS) \ + --out-dir "$(RESULTS_DIR)"; \ + done + +# ── native dense baselines ──────────────────────────────────────────────────── + +## Run FAISS FlatIP + HNSW dense baselines (comparison references, NOT ground truth) +bench-beir-baselines: + $(PY) benchmarks/beir/beir_baselines.py \ + 
--datasets $(DATASETS) \ + --split $(SPLIT) \ + --cache-dir "$(CACHE_DIR)" \ + --out-dir "$(RESULTS_DIR)" \ + --top-k $(TOPK) \ + --methods $(BASELINE_METHODS) \ + --hnsw-m $(HNSW_M) \ + --hnsw-ef-construction $(HNSW_EF_CONSTRUCT) \ + --hnsw-ef-search $(HNSW_EF_SEARCH) \ + --seed $(SEED) + +# ── evaluation ──────────────────────────────────────────────────────────────── + +## Evaluate nDCG@10 etc. vs BEIR qrels + paired bootstrap deltas vs FAISS +bench-beir-eval: + $(PY) benchmarks/beir/beir_eval.py \ + --datasets $(DATASETS) \ + --split $(SPLIT) \ + --cache-dir "$(CACHE_DIR)" \ + --runs-dir "$(RESULTS_DIR)" \ + --k-values $(K_VALUES) \ + --baseline faiss-flat \ + --bootstrap-iters 1000 \ + --seed $(SEED) \ + --out-dir "$(RESULTS_DIR)" + +# ── cleanup ─────────────────────────────────────────────────────────────────── + +## Remove generated result files (keeps cache) +bench-beir-clean: + find $(RESULTS_DIR) -name "*.topk.jsonl" -delete + find $(RESULTS_DIR) -name "*.summary.json" -delete + +## Remove the embedding cache (re-encoding will be required) +bench-beir-clean-cache: + rm -rf $(CACHE_DIR) diff --git a/benchmarks/beir/README.md b/benchmarks/beir/README.md new file mode 100644 index 00000000..86aa6570 --- /dev/null +++ b/benchmarks/beir/README.md @@ -0,0 +1,158 @@ +# ordvec BEIR benchmark harness + +Reproducible nDCG@10 evaluation of ordvec's rank/sign retrieval methods across +standard BEIR datasets, using Microsoft Harrier (harrier-oss-v1-0.6b, 1024-dim) +as the shared encoder. + +## Claims discipline + +The following two paragraphs reproduce the project's required claims policy +verbatim and govern every number produced by this harness: + +> **Benchmark numbers in this repository reflect synthetic or user-runnable +> real-corpus experiments only. No numbers are fabricated or cherry-picked. 
+> Every result file produced by `make benchmark-beir` is fully reproducible +> from the commands documented here, using publicly available BEIR datasets and +> the pinned encoder revision recorded in `embeddings.manifest.json`.** + +> **FAISS FlatIP is a full-float dense retrieval baseline used for comparison +> purposes — it is NOT ground truth. nDCG@10 is computed against the official +> BEIR qrels (human-annotated relevance judgements), not against FAISS results. +> ANN-recall-vs-FAISS (fraction of FAISS top-k recovered by an ANN method) is +> an optional diagnostic metric only; it does not substitute for qrel-based +> evaluation.** + +## Dataset suite + +| Dataset | Domain | #Queries | #Corpus | +|------------|-----------------|----------|---------| +| scifact | Scientific claim verification | 300 | 5 183 | +| nfcorpus | Biomedical IR | 323 | 3 633 | +| fiqa | Financial QA | 648 | 57 638 | +| trec-covid | COVID-19 literature | 50 | 171 332 | + +All datasets are downloaded automatically via the BEIR Python library on first +`make bench-beir-prepare` run. + +## Encoder + +**Harrier (harrier-oss-v1-0.6b)** — Microsoft's 600M-parameter bi-encoder +producing 1024-dimensional L2-normalised float32 embeddings. + +- Documents receive no instruction prefix. +- Queries receive: + `"Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: "` +- Revision is pinned in `embeddings.manifest.json` per cache directory. + +## Quick start + +### 1. Install Python dependencies + +```bash +make bench-beir-setup +``` + +Installs from `benchmarks/beir/requirements.txt`. + +### 2. Smoke run (scifact only, ~5 min on GPU) + +```bash +make benchmark-beir-smoke +``` + +Uses the `st` (sentence-transformers) provider with CUDA. Encodes, runs all +ordvec methods, runs the FAISS FlatIP and hnswlib baselines, then evaluates nDCG@{10,100}. + +### 3. 
Full suite + +**Sentence-transformers / CUDA lane (default):** + +```bash +make benchmark-beir +``` + +Override encoder or device: + +```bash +make benchmark-beir \ + ENCODER_PROVIDER=st \ + HARRIER_MODEL=microsoft/harrier-oss-v1-0.6b \ + DEVICE=cpu \ + ENCODE_BATCH=4 +``` + +**Ollama lane (CPU, quantised, no GPU required):** + +```bash +ollama pull hf.co/mradermacher/harrier-oss-v1-0.6b-GGUF:Q8_0 +make bench-beir-prepare-ollama +make bench-beir-ordvec +make bench-beir-baselines +make bench-beir-eval +``` + +## Cache layout + +One encoder run produces a directory per dataset/split: + +``` +.cache/ordvec-beir///encoder=/ + corpus.f32.npy # float32, shape (n_docs, 1024), L2-normalised, C-order + queries.f32.npy # float32, shape (n_queries, 1024), L2-normalised, C-order + corpus_ids.json # list[str], sorted(corpus.keys()) + query_ids.json # list[str], sorted(qrels.keys()) + qrels.json # {qid: {doc_id: int_relevance}} + texts.manifest.json # reproducibility provenance for raw text + embeddings.manifest.json# encoder provider/model/revision/dim/norm + sha256s.json # sha256 of each npy file +``` + +Encoder slug format: `____` +with `/`, `:`, and other non-filesystem-safe characters replaced by `__`. + +## Results layout + +``` +results/beir// + .topk.jsonl # one JSON line per query + .summary.json # aggregate latency + nDCG metrics +``` + +Top-k JSONL row schema: + +```json +{"dataset":"scifact","split":"test","method":"ordvec-rq2", + "qid_idx":0,"qid":"0","k":100, + "doc_idxs":[42,7,...],"doc_ids":["abc","def",...],"scores":[0.91,0.88,...]} +``` + +Two-stage method names include parameters, e.g. `ordvec-bitmap-rq2-m500-b8`. 
+ +## Available methods + +| Method | Description | +|-----------------|-------------| +| `rq2` | RankQuant (2 bits/dim), asymmetric float-query LUT scoring | +| `rq4` | RankQuant (4 bits/dim), asymmetric float-query LUT scoring | +| `bitmap-rq2` | Two-stage: Bitmap candidate gen + RankQuant-2 rerank | +| `sign-rq2` | Two-stage: SignBitmap candidate gen + RankQuant-2 rerank | +| `faiss-flat` | FAISS FlatIP full-float dense baseline (comparison, not ground truth) | + +## `import ordvec` rule + +The Python harness files in `benchmarks/beir/` **must not** contain +`import ordvec`. Every OrdVec hot path runs in the Rust example binary +(`examples/beir_ordvec.rs`). The `bench-beir-guardrail` Make target (run automatically as +part of `benchmark-beir`) enforces this and fails with a clear error message if +any harness file violates it. + +This rule keeps the measurements attributable to the Rust crate: Python only +prepares data, runs the FAISS and hnswlib baselines, and evaluates against +qrels, so no OrdVec number passes through the ordvec Python package. + +## Clean up + +```bash +make bench-beir-clean # remove result files, keep embedding cache +make bench-beir-clean-cache # remove embedding cache (re-encoding required) +``` diff --git a/benchmarks/beir/beir_baselines.py b/benchmarks/beir/beir_baselines.py new file mode 100644 index 00000000..28f08a8d --- /dev/null +++ b/benchmarks/beir/beir_baselines.py @@ -0,0 +1,459 @@ +""" +beir_baselines.py — native Python/C++ baselines for the ordvec-beir harness. + +Methods +------- +faiss-flat + Full-float inner-product exact search via faiss.IndexFlatIP on L2-normalised + corpus vectors. Inner product on unit vectors == cosine similarity. + +hnswlib-m-ef + Approximate nearest-neighbour search via hnswlib's HNSW graph in cosine space. + +Both methods consume the cached ``.npy`` arrays produced by ``beir_prepare.py`` +and write results in the shared top-k JSONL + summary JSON formats defined by +``common.py``. 
+""" + +from __future__ import annotations + +import argparse +import json +import pathlib +import time +from typing import Any + +import numpy as np + +# Allow `from common import ...` when run as a script from the repo root +# (the Makefile invokes `python3 benchmarks/beir/