Project-Navi · Navi Bot (project-navi-bot) · Jun 15, 2026 · Jun 14, 2026
@@ -185,6 +185,8 @@ jobs:
         run: cargo test
       - name: cargo test (experimental)
         run: cargo test --features experimental
+      - name: cargo test (test-utils)
+        run: cargo test --features test-utils
       - name: cargo test (no default features)
         run: cargo test --no-default-features
       - name: cargo build --release --example bench_rank

@@ -29,6 +29,27 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   VPOPCNTDQ scan kernels are active on the current CPU. The scan dispatch reads
   only this predicate (no per-dimension gate).
 
+### Changed
+
+- **Release-hardened the caller-owned serial two-stage primitives** (no API
+  change; added in 0.5.0). The trust model is now explicit and tested:
+  - Rejection-path regression tests for the full CSR/query/buffer validation set
+    on the rerank entry points — overlong row (the guard that bounds the unsafe
+    gather), non-monotonic / wrong-final / non-zero-first offsets, non-finite and
+    ragged queries, and wrong output-buffer length — so a malformed-but-accepted
+    input can never reach the SIMD scan.
+  - A counting-allocator test proving `search_asymmetric_subset_batched_serial_into`
+    performs **zero heap allocations** in steady state (warmed `SubsetScratch`,
+    reused caller buffers) **on the AVX-512/AVX2 rerank path** — the strong form of
+    the prior capacity-stability proxy. (The scalar fallback, e.g. aarch64,
+    allocates a per-query scoring LUT; the test skips the strict check there.)
+  - A focused `two_stage_bench` example decomposing stage-1 candidate-gen /
+    single-query rerank loop / batched `_into` / full two-stage at the
+    Harrier-1024 shape, with a committed reference capture
+    (`benchmarks/two_stage_caller_owned_dim1024.txt`, SYNTHETIC corpus).
+  - User-facing docs for the caller-owned / no-rayon / allocation-free contract
+    (README + rustdoc examples on the `_into` hot path and the CSR candidate-gen).
+
 ### Fixed
 
 - **`ordvec-manifest` crate and wheel now ship license text.** Both declared
@@ -53,8 +74,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   - `RankQuant::search_asymmetric_subset_batched_serial(..) -> SearchResults` and
     `..._serial_into(.., &mut SubsetScratch, &mut out_scores, &mut out_indices)`
     — serial batched subset rerank; the `_into` form is allocation-free after
-    scratch warmup (the integration contract for runtimes that own their own
-    thread pool / GIL release).
+    scratch warmup on the AVX-512/AVX2 rerank path (the integration contract for
+    runtimes that own their own thread pool / GIL release).
   - New public types `CandidateBatch` (CSR candidate carrier) and `SubsetScratch`
     (reusable rerank scratch).
 - These primitives never enter rayon; the caller owns parallelism. No bundled

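The rejection-path bullet in the changelog hunk above lists the CSR shape checks that guard the rerank entry points. A minimal std-only sketch of what such validation could look like follows. `CsrError`, `validate_csr`, and the `max_row_len` bound are hypothetical names chosen for illustration; they are not ordvec's API, and the crate may order or express these checks differently.

```rust
/// Reasons a CSR candidate batch is rejected (hypothetical type, for illustration only).
#[derive(Debug, PartialEq, Eq)]
pub enum CsrError {
    WrongOffsetsLen { expected: usize, got: usize },
    NonZeroFirst(usize),
    NonMonotonic { row: usize },
    RowTooLong { row: usize, len: usize, max: usize },
    WrongFinal { expected: usize, got: usize },
}

/// Validates CSR offsets for `nq` query rows over `candidates_len` candidates.
/// `max_row_len` is an assumed per-row bound standing in for whatever limit
/// protects the downstream gather.
pub fn validate_csr(
    offsets: &[usize],
    candidates_len: usize,
    nq: usize,
    max_row_len: usize,
) -> Result<(), CsrError> {
    // One offset per row boundary: nq rows need nq + 1 offsets.
    if offsets.len() != nq + 1 {
        return Err(CsrError::WrongOffsetsLen { expected: nq + 1, got: offsets.len() });
    }
    // The first row must start at the beginning of `candidates`.
    if offsets[0] != 0 {
        return Err(CsrError::NonZeroFirst(offsets[0]));
    }
    // Each adjacent pair delimits one row: it must not go backwards or be overlong.
    for (row, w) in offsets.windows(2).enumerate() {
        if w[1] < w[0] {
            return Err(CsrError::NonMonotonic { row });
        }
        let len = w[1] - w[0];
        if len > max_row_len {
            return Err(CsrError::RowTooLong { row, len, max: max_row_len });
        }
    }
    // The last offset must cover exactly the whole candidate array.
    if offsets[nq] != candidates_len {
        return Err(CsrError::WrongFinal { expected: candidates_len, got: offsets[nq] });
    }
    Ok(())
}

/// Rejects ragged (length not a multiple of `dim`) or non-finite query batches.
pub fn queries_are_valid(queries: &[f32], dim: usize) -> bool {
    dim > 0 && queries.len() % dim == 0 && queries.iter().all(|x| x.is_finite())
}
```

Running all checks before any row is touched means a malformed batch is rejected as a whole, which is the property the regression tests described above are meant to pin down.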
@@ -73,6 +73,10 @@ rand_chacha = "0.10"
 # target takes the scalar fallback. `experimental` exposes MultiBucketBitmap
 # (research scaffold), kept off the stable surface.
 experimental = []
+# `test-utils` exposes internal dispatch probes used by the crate's own integration
+# tests (e.g. the allocation-free guarantee check). Gated off the default surface
+# because these helpers are not part of the public API and carry no semver guarantee.
+test-utils = []
 
 [profile.release]
 lto = true

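The `test-utils` comment above mentions the allocation-free guarantee check. As a rough illustration of how a counting-allocator test can establish "zero heap allocations in steady state", here is a std-only sketch. `CountingAlloc`, `allocs_during`, and `score_into` are hypothetical; `score_into` is only a stand-in for a hot path that reuses a warmed scratch buffer, not the crate's rerank routine, and the actual test in the repository may be structured differently.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::cell::Cell;

thread_local! {
    // Per-thread counter, const-initialized so touching it never allocates.
    static ALLOCS: Cell<usize> = const { Cell::new(0) };
}

/// Forwards to the system allocator and counts allocations on the calling thread.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // `try_with` avoids a panic if the thread-local is already torn down.
        let _ = ALLOCS.try_with(|c| c.set(c.get() + 1));
        unsafe { System.alloc(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Returns how many allocations the current thread performed while running `f`.
fn allocs_during(f: impl FnOnce()) -> usize {
    let before = ALLOCS.with(|c| c.get());
    f();
    ALLOCS.with(|c| c.get()) - before
}

/// Stand-in hot path: copies `input` into a reusable scratch, then writes `out`.
/// The scratch only grows when its capacity is too small for `input`.
fn score_into(scratch: &mut Vec<f32>, input: &[f32], out: &mut [f32]) {
    scratch.clear();
    scratch.extend_from_slice(input);
    for (o, s) in out.iter_mut().zip(scratch.iter()) {
        *o = *s * 2.0;
    }
}
```

Measuring a first (cold) call and then a second (warm) call of the same shape with `allocs_during` separates the one-time scratch growth from the steady state; the warm call is the one expected to report zero. Counting per thread keeps unrelated allocations on other test threads out of the measurement.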
@@ -141,6 +141,49 @@ For the two-stage compressed-scan path (`Bitmap` / `SignBitmap` candidate
 generation → `RankQuant` rerank) and the full mode comparison, see
 [`docs/RANK_MODES.md`](docs/RANK_MODES.md).
 
+### Caller-owned serial two-stage (DB / runtime integration)
+
+For runtimes that own their own parallelism — an embedded vector DB driving a
+bounded thread pool, or a binding releasing the GIL — ordvec exposes a
+**no-rayon** serial two-stage path so the *caller* schedules the work, with an
+**allocation-free rerank step** (`_into`, on the AVX-512/AVX2 path) for the
+steady-state hot loop:
+
+```rust
+use ordvec::{RankQuant, SignBitmap, SubsetScratch};
+// Shape sketch (not standalone): `rq: RankQuant` and `sign: SignBitmap` are
+// built and `add`-ed as in the Quickstart above; `queries` is your flat
+// `dim * nq` f32 batch, `m` the shortlist size, `k` the top-k.
+// Stage 1 — serial CSR candidate generation (never enters rayon):
+let cb = sign.top_m_candidates_batched_serial_csr(&queries, m); // CandidateBatch { offsets, candidates }
+// Stage 2 — rerank into CALLER-OWNED buffers with a reusable scratch:
+let nq = queries.len() / dim;
+let out_k = k.min(rq.len());
+let mut scratch = SubsetScratch::new();               // reuse across batches
+let mut out_scores = vec![f32::NEG_INFINITY; nq * out_k];
+let mut out_indices = vec![-1i64; nq * out_k];
+rq.search_asymmetric_subset_batched_serial_into(
+    &queries, &cb.offsets, &cb.candidates, k,
+    &mut scratch, &mut out_scores, &mut out_indices,
+);
+```
+
+Contract: candidates are **CSR** (`offsets.len() == nq + 1`; row `qi` is
+`candidates[offsets[qi]..offsets[qi+1]]`; rows need **not** be sorted). Output is
+**rectangular** `nq * out_k` and **sentinel-padded** (`-1` / `NEG_INFINITY`) for
+underfull rows — size both buffers to `nq * k.min(index.len())`. Scores, row ids,
+and the deterministic tie policy (`score desc, global row-id asc`) match the
+single-query `search_asymmetric_subset`. **Only the `_into` rerank step is
+allocation-free** — on the **AVX-512 / AVX2** SIMD path, and only on repeated
+calls of the *same* batch shape — reusing the warmed `SubsetScratch` and your
+output buffers (no per-row alloc, no whole-buffer preclear). The scalar fallback
+(no AVX2, e.g. aarch64) allocates a per-query scoring LUT. Stage 1
+(`top_m_candidates_batched_serial_csr`) also allocates a fresh `CandidateBatch`
+each call. Neither primitive enters rayon: partition the query batch and call
+`_into` once per worker range from your own pool. A focused decomposition
+benchmark lives in
+[`examples/two_stage_bench.rs`](examples/two_stage_bench.rs).
+
 ### Python
 
 The same `Rank` / `RankQuant` / `Bitmap` / `SignBitmap` API is available from

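The README contract above describes rectangular, sentinel-padded output with a deterministic tie policy (`score desc, global row-id asc`). A small sketch of that output convention for a single row follows. `write_top_k_row` is a hypothetical helper, and the full sort is used only for clarity; it is not claimed to be how ordvec selects its top-k. Scores are assumed finite.

```rust
/// Writes the best `out_scores.len()` entries of one query row into the
/// caller-owned output slices, padding unused slots with sentinels.
/// `scored` holds (score, global row id) pairs for the row's candidates.
pub fn write_top_k_row(
    scored: &mut [(f32, i64)],
    out_scores: &mut [f32],
    out_indices: &mut [i64],
) {
    assert_eq!(out_scores.len(), out_indices.len());
    // Tie policy: higher score first, then smaller row id first.
    // `sort_unstable_by` sorts in place without a temporary buffer.
    scored.sort_unstable_by(|a, b| b.0.total_cmp(&a.0).then(a.1.cmp(&b.1)));
    for slot in 0..out_scores.len() {
        match scored.get(slot) {
            Some(&(score, id)) => {
                out_scores[slot] = score;
                out_indices[slot] = id;
            }
            // Underfull row: fewer candidates than output slots.
            None => {
                out_scores[slot] = f32::NEG_INFINITY;
                out_indices[slot] = -1;
            }
        }
    }
}
```

For a batch, the same helper would be applied to each `out_k`-sized chunk of the `nq * out_k` buffers (for example via `chunks_mut(out_k)`), so every row owns a fixed slot range regardless of how many candidates it had.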
@@ -0,0 +1,21 @@
+Caller-owned serial two-stage decomposition — Harrier-1024 shape (SYNTHETIC corpus)
+Reproduce:
+  cargo run --release --example two_stage_bench -- --dim 1024 --n 50000 --queries 200 --m 256 --k 10 --reps 15
+Host: AMD Ryzen 9 9950X (Zen5), AVX-512 VPOPCNTDQ, single core (taskset -c 12), single-thread.
+
+  dim=1024 n=50000 queries=200 m=256 k=10 bits=2 out_k=10 candidates=51200 reps=15
+  1. stage-1 candidate gen (CSR)        31.920 ms      6265.59 q/s      159.60 us/query
+  2. single-query rerank loop            2.086 ms     95858.02 q/s       10.43 us/query
+  3. batched rerank _into                2.031 ms     98463.67 q/s       10.16 us/query
+  4. full two-stage (1+3)               34.485 ms      5799.70 q/s      172.42 us/query
+  rerank speedup (batched _into vs single-query loop): 1.03x
+
+Interpretation (no-fiction): at dim=1024 the rerank stage (~10 us/query) is a
+small slice of a two-stage cost dominated by stage 1 (~160 us/query); the
+batched _into form is on par with the single-query loop SINGLE-THREADED
+(~1.03x). The caller-owned serial primitives are NOT a single-thread speedup;
+their value is (a) allocation-free steady state (tests/alloc_free.rs proves 0
+heap allocations on a warmed _into call) and (b) caller-owned parallelism: no
+internal rayon, so a DB/runtime can drive the _into form across its own bounded
+pool (GIL released), one query-range per worker. This dim=1024 result is its own
+mechanism; it is NOT explained by the SignBitmap AVX-tail dim=768 result.
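The interpretation above says a runtime can drive the serial form across its own pool, one query range per worker. A std-only sketch of that partitioning pattern follows. `rerank_range_into` is a hypothetical stand-in for a serial per-range call (it just writes the sum of each query's components), and scoped threads stand in for a real bounded pool; neither is ordvec code.

```rust
use std::thread;

/// Stand-in for a serial rerank over one contiguous query range (hypothetical).
/// Writes `out_k` slots per query: the query's component sum and a -1 sentinel id.
fn rerank_range_into(
    queries: &[f32],
    dim: usize,
    out_k: usize,
    out_scores: &mut [f32],
    out_indices: &mut [i64],
) {
    for (qi, q) in queries.chunks_exact(dim).enumerate() {
        let sum: f32 = q.iter().sum();
        let base = qi * out_k;
        out_scores[base..base + out_k].fill(sum);
        out_indices[base..base + out_k].fill(-1);
    }
}

/// Splits the batch into contiguous query ranges and gives each worker a
/// disjoint slice of the caller-owned output buffers.
pub fn rerank_partitioned(
    queries: &[f32],
    dim: usize,
    out_k: usize,
    workers: usize,
    out_scores: &mut [f32],
    out_indices: &mut [i64],
) {
    assert!(dim > 0 && out_k > 0 && workers > 0);
    assert_eq!(queries.len() % dim, 0);
    let nq = queries.len() / dim;
    assert_eq!(out_scores.len(), nq * out_k);
    assert_eq!(out_indices.len(), nq * out_k);
    // Queries per worker, rounded up; at least 1 so chunk sizes are non-zero.
    let per = ((nq + workers - 1) / workers).max(1);
    thread::scope(|s| {
        let q_chunks = queries.chunks(per * dim);
        let s_chunks = out_scores.chunks_mut(per * out_k);
        let i_chunks = out_indices.chunks_mut(per * out_k);
        for ((q, sc), ix) in q_chunks.zip(s_chunks).zip(i_chunks) {
            // Each worker touches only its own query range and output range.
            s.spawn(move || rerank_range_into(q, dim, out_k, sc, ix));
        }
    });
}
```

Because the output is rectangular, splitting it with `chunks_mut` yields non-overlapping mutable ranges with no locking. A real integration would also hand each worker its slice of the CSR candidates; since the changelog lists non-zero-first offsets among the rejected inputs, each range's offsets would presumably need rebasing to start at zero, which this sketch omits.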
@@ -0,0 +1,177 @@
+//! Focused benchmark + integration example for the **caller-owned serial**
+//! two-stage path (the integration contract for DBs / runtimes that own their
+//! own parallelism). SYNTHETIC corpus — these numbers are a
+//! relative decomposition of the serial path on random data, NOT a retrieval-
+//! quality or real-corpus claim, and the dim=1024 result is its own mechanism
+//! (do not conflate it with the SignBitmap AVX-tail dim=768 result).
+//!
+//! It decomposes the cost into four separately-timed phases at the Harrier-1024
+//! shape and prints a headline "batched `_into` vs single-query loop" rerank
+//! ratio (about 1.0x single-threaded in the committed reference capture):
+//!   1. stage-1 candidate generation  (top_m_candidates_batched_serial_csr)
+//!   2. single-query subset rerank loop  (search_asymmetric_subset, baseline)
+//!   3. batched rerank `_into`  (warmed SubsetScratch, caller-owned buffers)
+//!   4. full two-stage serial  (1 + 3 end to end)
+//!
+//!   cargo run --release --example two_stage_bench -- [--dim N] [--n N]
+//!       [--queries N] [--m N] [--k N] [--bits {1,2,4}] [--reps N]
+
+use ordvec::{RankQuant, SignBitmap, SubsetScratch};
+use rand::{RngExt, SeedableRng};
+use rand_chacha::ChaCha8Rng;
+use std::time::Instant;
+
+fn median(mut v: Vec<f64>) -> f64 {
+    v.sort_by(f64::total_cmp); // total order, so no unwrap on a partial comparison
+    v[v.len() / 2] // upper median when the length is even
+}
+
+fn main() {
+    // Harrier-1024 defaults; all overridable.
+    let mut dim = 1024usize;
+    let mut n = 50_000usize;
+    let mut nq = 200usize;
+    let mut m = 256usize;
+    let mut k = 10usize;
+    let mut bits = 2u8;
+    let mut reps = 20usize;
+    let mut args = std::env::args().skip(1);
+    while let Some(flag) = args.next() {
+        let mut val = || args.next().expect("flag needs a value").parse().unwrap();
+        match flag.as_str() {
+            "--dim" => dim = val(),
+            "--n" => n = val(),
+            "--queries" => nq = val(),
+            "--m" => m = val(),
+            "--k" => k = val(),
+            "--bits" => bits = args.next().expect("flag needs a value").parse().unwrap(),
+            "--reps" => reps = val(),
+            other => {
+                eprintln!("unknown arg: {other}");
+                std::process::exit(2);
+            }
+        }
+    }
+    assert!(nq > 0 && n > 0 && reps > 0, "n, queries, reps must be > 0");
+
+    let mut rng = ChaCha8Rng::seed_from_u64(7);
+    let corpus: Vec<f32> = (0..n * dim).map(|_| rng.random_range(-1.0..1.0)).collect();
+    let mut sign = SignBitmap::new(dim);
+    sign.add(&corpus);
+    let mut rq = RankQuant::new(dim, bits);
+    rq.add(&corpus);
+    let queries: Vec<f32> = (0..nq * dim).map(|_| rng.random_range(-1.0..1.0)).collect();
+    drop(corpus);
+
+    let out_k = k.min(rq.len());
+    // Caller-owned output buffers, allocated ONCE and reused across batches —
+    // rectangular nq*out_k, sentinel-padded for underfull rows.
+    let mut out_scores = vec![f32::NEG_INFINITY; nq * out_k];
+    let mut out_indices = vec![-1i64; nq * out_k];
+    let mut scratch = SubsetScratch::new();
+
+    // Warm: build the candidate batch once and warm the scratch to this shape.
+    let cb = sign.top_m_candidates_batched_serial_csr(&queries, m);
+    rq.search_asymmetric_subset_batched_serial_into(
+        &queries,
+        &cb.offsets,
+        &cb.candidates,
+        k,
+        &mut scratch,
+        &mut out_scores,
+        &mut out_indices,
+    );
+    let total_candidates = cb.candidates.len();
+
+    // Phase 1 — stage-1 candidate generation (serial CSR).
+    let p1 = median(
+        (0..reps)
+            .map(|_| {
+                let t = Instant::now();
+                let c = sign.top_m_candidates_batched_serial_csr(&queries, m);
+                std::hint::black_box(&c);
+                t.elapsed().as_secs_f64()
+            })
+            .collect(),
+    );
+
+    // Phase 2 — single-query subset rerank loop (the per-query baseline).
+    let p2 = median(
+        (0..reps)
+            .map(|_| {
+                let t = Instant::now();
+                for qi in 0..nq {
+                    let row = &cb.candidates[cb.offsets[qi]..cb.offsets[qi + 1]];
+                    let r = rq.search_asymmetric_subset(&queries[qi * dim..(qi + 1) * dim], row, k);
+                    std::hint::black_box(&r);
+                }
+                t.elapsed().as_secs_f64()
+            })
+            .collect(),
+    );
+
+    // Phase 3 — batched `_into` (warmed scratch + reused caller buffers).
+    let p3 = median(
+        (0..reps)
+            .map(|_| {
+                let t = Instant::now();
+                rq.search_asymmetric_subset_batched_serial_into(
+                    &queries,
+                    &cb.offsets,
+                    &cb.candidates,
+                    k,
+                    &mut scratch,
+                    &mut out_scores,
+                    &mut out_indices,
+                );
+                t.elapsed().as_secs_f64()
+            })
+            .collect(),
+    );
+
+    // Phase 4 — full two-stage serial (stage-1 gen + batched rerank).
+    let p4 = median(
+        (0..reps)
+            .map(|_| {
+                let t = Instant::now();
+                let c = sign.top_m_candidates_batched_serial_csr(&queries, m);
+                rq.search_asymmetric_subset_batched_serial_into(
+                    &queries,
+                    &c.offsets,
+                    &c.candidates,
+                    k,
+                    &mut scratch,
+                    &mut out_scores,
+                    &mut out_indices,
+                );
+                t.elapsed().as_secs_f64()
+            })
+            .collect(),
+    );
+
+    let row = |label: &str, secs: f64| {
+        println!(
+            "  {label:<34} {:>9.3} ms   {:>10.2} q/s   {:>9.2} us/query",
+            secs * 1e3,
+            nq as f64 / secs,
+            secs / nq as f64 * 1e6,
+        );
+    };
+    println!("caller-owned serial two-stage (SYNTHETIC corpus)");
+    println!(
+        "  dim={dim} n={n} queries={nq} m={m} k={k} bits={bits} out_k={out_k} \
+         candidates={total_candidates} reps={reps}"
+    );
+    println!(
+        "  (dim % 64 == {}: AVX-512 tier eligible when supported)",
+        dim % 64
+    );
+    row("1. stage-1 candidate gen (CSR)", p1);
+    row("2. single-query rerank loop", p2);
+    row("3. batched rerank _into", p3);
+    row("4. full two-stage (1+3)", p4);
+    println!(
+        "  rerank speedup (batched _into vs single-query loop): {:.2}x",
+        p2 / p3
+    );
+}
@@ -87,6 +87,13 @@ pub use sign_bitmap::SignBitmap;
 #[doc(hidden)]
 pub use quant::search_asymmetric_byte_lut;
 
+// `subset_rerank_uses_simd` is a test-only dispatch probe used by the crate's
+// own SIMD-parity tests. Gated behind the non-default `test-utils` feature and
+// excluded from semver guarantees — not a supported downstream API.
+#[cfg(feature = "test-utils")]
+#[doc(hidden)]
+pub use quant::subset_rerank_uses_simd;
+
 // `MultiBucketBitmap` underwrites the bilinear bucket-overlap
 // decomposition but is not the constant-weight top-bucket theorem surface and
 // is not stable public API. It is reachable only with the `experimental`