From 92efc663065ef172df1957bf24c6fc9ec15967c7 Mon Sep 17 00:00:00 2001
From: Todd Baur <todd@baursoftware.com>
Date: Fri, 19 Jun 2026 18:37:13 -0700
Subject: [PATCH] README: state public scale evidence and retract near-flat
 framing

Tighten README benchmark framing to the evidence available in this public branch.

The README now describes the public Harrier-Q8 BEIR harness and the corpus sizes covered by those artifacts, while treating larger-corpus and alternate-encoder results as active research until public artifacts land.

It also removes near-flat per-query cost language and clarifies the O(n) compressed-code scan story, small/medium corpus latency claims, and footprint wording. The reported two-stage path is sign/bitmap candidate generation followed by RankQuant rerank; it does not retain original floats for library-side reranking.

Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
---
 README.md                         | 87 ++++++++++++++++++++-----------
 benchmarks/beir-bench/src/main.rs |  4 +-
 benchmarks/beir/beir_plot.py      |  4 +-
 3 files changed, 62 insertions(+), 33 deletions(-)

diff --git a/README.md b/README.md
index 23dc709..76910e5 100644
--- a/README.md
+++ b/README.md
@@ -32,26 +32,49 @@ append-friendly, and graph-optional.
 
 ## Benchmark at a glance
 
-> **ordvec matches dense retrieval quality within BEIR qrel noise at 8–16× smaller
-> vector storage — with no training and no graph build — and sub-millisecond
-> single-query retrieval on 171K Harrier embeddings. In the committed
-> default-method run, a threaded HNSW graph still wins highly-parallel batched
-> serving; ordvec wins the lightweight compressed-substrate lane.**
+> **On the public Harrier-Q8 BEIR harness, ordvec matches exact dense `nDCG@10`
+> within bootstrap noise at 8–16× smaller vector storage — with no training and
+> no graph build. The committed evidence is the reproducible scifact +
+> trec-covid run below; the harness also supports nfcorpus and fiqa. ordvec wins
+> single-query latency against exact `flat` on the committed 171K-doc run and on
+> operability (no build, no tuning, append-only); in the committed default-method
+> threaded view, HNSW still wins highly-parallel batched serving. Larger-corpus
+> and alternate-encoder studies are active research, not public release claims
+> until their artifacts land in this repository.**
+
+**Public evidence snapshot.** The load-bearing result in this README is narrower
+than the research backlog: Harrier-Q8 embeddings on public BEIR data, scored
+against official qrels. The default `make benchmark-beir` reproduces scifact
+(quality) and trec-covid (quality + latency/scaling); the harness also supports
+nfcorpus and fiqa via `QUALITY_DATASETS=...`. Do not read this page as a public
+larger-corpus or alternate-encoder claim; those need checked-in run artifacts
+before they become release evidence.
 
 On **trec-covid** (171,332 documents, the public [BEIR](https://github.com/beir-cellar/beir)
-benchmark) with **Harrier-Q8** 1024-d embeddings, ordvec's two-stage retrieval
-keeps a near-flat per-query cost as the corpus grows, while exact brute-force
-(`flat`, identical math to FAISS `IndexFlatIP`) is O(n) — so the speedup
-*widens* with scale:
+benchmark) with **Harrier-Q8** 1024-d embeddings, ordvec's two-stage retrieval has a
+far smaller per-query cost than exact brute-force (`flat`, identical math to FAISS
+`IndexFlatIP`). Both are O(n) scans, but the public ordvec rows stream compressed
+codes instead of the 4096-byte float vectors, so their lines sit well below `flat`
+and the gap widens over the committed subsampling sweep:
 
 ![ordvec speedup over exact search grows with corpus size](https://raw.githubusercontent.com/Project-Navi/ordvec/main/benchmarks/beir/figures/scaling_curve.png)
 
-- **~100× faster, single query.** At 171K docs, single-query latency: exact
-  `flat` 56 ms vs ordvec `Sign→rq2` **0.53 ms** — and the gap grows with the
-  corpus (it is ~5× at 1K docs).
-- **8–16× smaller.** 256–384 bytes/vector vs 4096 for full float, at
-  **nDCG@10 within bootstrap noise of exact** (on trec-covid the ordinal rows
-  even edge ahead; see [Benchmarks](#benchmarks)).
+- **~100× faster than exact `flat`, single query, at 171K docs.** Single-query
+  latency: exact `flat` 56 ms vs ordvec `Sign→rq2` **0.53 ms** — the gap over `flat`
+  grows with the corpus (it is ~5× at 1K docs).
+- **8–16× smaller for the reported qrel rows.** The b=2 rank code is 256 B/vector
+  (16× smaller than 4096 B floats), b=4 is 512 B (8×), and the reported two-stage
+  `sign→rq2` row accounts for both stage-1 sign codes and the RankQuant reranker
+  (384 B/vector). These rows do **not** retain or rescore against the original
+  float corpus; the public two-stage path is sign/bitmap candidate generation
+  followed by RankQuant b=2 rerank. At **nDCG@10 within bootstrap noise of exact**
+  (on trec-covid the ordinal rows even edge ahead; see [Benchmarks](#benchmarks)).
+- **vs HNSW (the honest public scale story).** On the committed trec-covid run,
+  ordvec wins single-query latency while HNSW wins the highly-parallel threaded
+  view. That is the public comparison here. At larger corpora, graph or shard
+  layers are the right comparison target; this README does not claim public
+  million-scale HNSW crossover or GPU bandwidth numbers until the underlying run
+  artifacts are committed.
 - **Reproducible on your machine, one command:**
 
   ```sh
@@ -103,8 +126,8 @@ structure of each vector on its own:
   not persistable to `.ovrq`; each prepared asymmetric query owns a
   `dim * 256` `f32` LUT, about 64 MiB at the maximum dimension.)
 - **Two-stage retrieval, built in.** A cheap bitmap / sign-popcount
-  prefilter feeds an exact rerank — the coarse→fine pipeline ships as
-  library primitives. The coarse-scan→exact-rerank pattern, and the
+  prefilter feeds a RankQuant rerank — the coarse→fine pipeline ships as
+  library primitives. The coarse-scan→fine-rerank pattern, and the
   `RankQuantFastscan` block-32 4-bit LUT path, follow the FAISS FastScan and
   binary-quantization-plus-rescore lineage; ordvec ships them
   batteries-included and dependency-free, not as new techniques.
@@ -127,7 +150,7 @@ large-scale serving rather than competing with one.
 - **`Bitmap`** — a top-bucket bitmap per document (one bit per
   coordinate); scoring is `popcount(Q AND D)`, a coarsened rank overlap.
 - **`SignBitmap`** — a sign bitmap per document for sign-cosine
-  candidate generation, feeding an exact rerank stage.
+  candidate generation, feeding a caller-owned second-stage rerank.
 
 Two further paths, for callers who need them:
 
@@ -377,9 +400,10 @@ storage; on trec-covid the ordinal codes even edge slightly ahead.
 
 #### Latency — three honest views
 
-ordvec never touches the float corpus, so its per-query cost is tiny and grows
-slowly with `n`; `flat`'s cost is dominated by streaming the 4096-byte vectors,
-which is O(n) and **memory-bandwidth-bound**. That single fact explains all three
+The reported ordvec rows do not stream the float corpus, so their per-query cost
+has a much smaller constant than `flat`; they are still O(n) compressed-code
+scans. `flat`'s cost is dominated by streaming the 4096-byte vectors, which is
+also O(n) and **memory-bandwidth-bound**. That single fact explains all three
 views (trec-covid, 171,332 docs, 1024-d):
 
 **1. Single query (batch = 1, 1 thread)** — latency-sensitive serving, where
@@ -389,7 +413,8 @@ views (trec-covid, 171,332 docs, 1024-d):
 
 `flat` 56 ms → ordvec `sign→rq2` **0.53 ms (≈106×)**, `bitmap→rq2` 0.62 ms (≈91×),
 `hnsw` 1.5 ms (37×). The scaling curve [above](#benchmark-at-a-glance) is this
-view swept over corpus size — the speedup *grows* with `n`.
+view swept over the committed subsamples — the speedup over `flat` grows across
+that public sweep.
 
 **2. Batched throughput (batch = 32, 1 thread)** — when many queries arrive at
 once, `flat`'s GEMM amortizes the corpus stream across the batch (56→4 ms),
@@ -422,13 +447,17 @@ ties or edges quality. And the two aren't mutually exclusive: ordvec's codes are
 index-agnostic, so they compose *under* an HNSW/sharding layer (see
 [Scope](#scope)) rather than replacing it.
 
-**Read it honestly:** ordvec's huge latency win is a single-query / low-batch
-phenomenon (and grows with corpus size); under large-batch throughput a batched
-exact GEMM is a strong baseline and HNSW threads very well. The durable wins are
-**compression at iso-quality** and **single-query latency that stays flat as the
-corpus grows**. `flat` is a comparison reference, not ground truth; nDCG@10 is
-the qrel-based metric. Numbers vary with encoder, dataset, hardware, and batch —
-the point is that you can regenerate all of them with `make benchmark-beir`.
+**Read it honestly:** ordvec's latency win over exact `flat` is a single-query /
+low-batch phenomenon in the committed BEIR run; under large-batch throughput a
+batched exact GEMM is a strong baseline, and HNSW wins the committed threaded
+default-method view. ordvec's scan is O(n) — the same class as `flat`, but over an
+8–16× smaller reported working set, so its constant is far lower. Its per-query
+latency is **not** flat in `n`; an earlier "near-flat" claim was an artifact of
+small (≤57K) corpora and is retracted. The durable wins are **compression at
+iso-quality** and **operability** (training-free, no graph build, no tuning,
+append-only). `flat` is a comparison reference, not ground truth; nDCG@10 is the
+qrel-based metric. Numbers vary with encoder, dataset, hardware, and batch —
+regenerate them with `make benchmark-beir`.
 
 ### Synthetic stress test
 
diff --git a/benchmarks/beir-bench/src/main.rs b/benchmarks/beir-bench/src/main.rs
index a12acc7..1daa448 100644
--- a/benchmarks/beir-bench/src/main.rs
+++ b/benchmarks/beir-bench/src/main.rs
@@ -13,8 +13,8 @@
 //! story; N>1 the throughput story. Batch is matched across every method.
 //!
 //! `--max-docs M`: truncate the corpus to its first M vectors. Sweeping M
-//! produces the speedup-vs-corpus-size curve (brute force is O(n); ordvec
-//! sign/rank candidate-gen is near-flat in n).
+//! produces the speedup-vs-corpus-size curve (both brute force and ordvec scans
+//! are O(n), but ordvec streams many fewer bytes per document).
 //!
 //! Output: `<out>/<dataset>/timing.jsonl` gets one record per
 //! (method, n_docs, threads) run, appended every invocation — the plotter
diff --git a/benchmarks/beir/beir_plot.py b/benchmarks/beir/beir_plot.py
index f83f801..b587310 100644
--- a/benchmarks/beir/beir_plot.py
+++ b/benchmarks/beir/beir_plot.py
@@ -8,8 +8,8 @@
 
 Outputs (PNG + SVG) to `<out-dir>`:
   1. `scaling_curve.{png,svg}`   speedup-vs-`flat` as the corpus grows — the
-     bands climb because exact brute force is O(n) while ordvec sign/rank
-     candidate-gen is near-flat in n.
+     bands climb because both paths are O(n), but `flat` streams 4096-byte
+     vectors while ordvec streams much smaller sign/rank codes.
   2. `bars_single_thread.{png,svg}`  per-method query latency at 1 thread,
      full corpus — the controlled apples-to-apples bar.
   3. `bars_threaded.{png,svg}`   the same at N threads (matched batch).