test+bench+docs: release-harden caller-owned batched two-stage primitives#215
Conversation
…ives The caller-owned serial two-stage primitives (added in 0.5.0: top_m_candidates_batched_serial_csr, search_asymmetric_subset_batched_serial[_into], CandidateBatch, SubsetScratch) are present, shape-conformant, and — per an adversarial trust/memory-safety audit — SOUND: every unsafe SIMD load in the rerank gather is bounded by validate_csr_batch / the query+buffer asserts that run before it, with no malformed-but-accepted input reachable. No source/guard change is needed. This lands the missing release-readiness half: proof-of-value + the explicit, tested memory-safety net. Tests (tests/index/two_stage.rs, +11): - Rejection-path net on the rerank entry points — overlong row (the guard that bounds the unsafe gather), non-monotonic / wrong-final / non-zero-first offsets, non-finite + ragged queries (both _into and the CSR gen and the allocating wrapper), and wrong out_scores/out_indices length. These pin that a malformed input panics before the SIMD scan. - batched_into_pads_mixed_full_and_underfull_rows: full + underfull + empty rows in one batch; filled slots match the single-query reference, trailing slots are -1 / NEG_INFINITY. tests/alloc_free.rs (new): a counting #[global_allocator] proves search_asymmetric_subset_batched_serial_into performs ZERO heap allocations on a warmed steady-state call — the strong form of the capacity-stability proxy. examples/two_stage_bench.rs (new) + benchmarks/two_stage_caller_owned_dim1024.txt: focused decomposition at the Harrier-1024 shape — stage-1 CSR gen / single-query rerank loop / batched _into / full two-stage. SYNTHETIC corpus. Measured single-thread, the batched _into is ~1.03x the single-query loop at m=256 (stage-1 dominates: ~160 us/q vs ~10 us/q rerank); the primitives' value is allocation-free steady state + caller-owned parallelism (no internal rayon), NOT a single-thread speedup. This is NOT explained by the SignBitmap AVX-tail dim=768 result. Docs: README "Caller-owned serial two-stage (DB / runtime integration)" section (CSR shape, sentinel padding, scratch reuse, no-rayon contract); a no_run rustdoc example on the alloc-free _into hot path + a buffer-sizing porting callout; an example on top_m_candidates_batched_serial_csr; CHANGELOG. No public API change. Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
Code Review by Qodo
1.
|
PR Summary by QodoRelease-harden caller-owned two-stage API with tests, docs, and benchmark example WalkthroughsDescription• Add rejection-path regression tests ensuring invalid CSR/query/buffers panic before SIMD rerank. • Prove batched _into rerank is allocation-free after warmup via counting allocator test. • Document and benchmark caller-owned serial two-stage pipeline (README/rustdoc/example + reference capture). Diagramgraph TD
A(["Caller runtime/DB"]) --> B["SignBitmap CSR"] --> C[("CandidateBatch (CSR)")] --> D["RankQuant rerank _into"] --> E[("Caller out buffers")]
D --> F[("SubsetScratch")]
G["Tests"] --> B --> D
H["two_stage_bench example"] --> B --> D
subgraph Legend
direction LR
_actor(["Caller"]) ~~~ _mod["Module/API"] ~~~ _data[("Data/buffers")]
end
High-Level AssessmentThe following are alternative approaches to this PR: 1. Add property-based/fuzz coverage for CSR edge cases
2. Use Criterion for the decomposition benchmark
Recommendation: Keep the PR’s approach: targeted rejection-path tests + an allocator-counting test directly validate the safety/contract claims without changing core code, and the example-based benchmark doubles as an integration guide. Consider adding a dedicated fuzz target later as continuous coverage, but it’s additive rather than a replacement for these explicit regressions. File ChangesEnhancement (1)
Tests (2)
Documentation (5)
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Code Review
This pull request release-hardens the caller-owned serial two-stage primitives by adding extensive documentation, benchmarks, and tests. Specifically, it introduces a new benchmark example (two_stage_bench.rs), a counting-allocator test (alloc_free.rs) verifying zero heap allocations in steady state, and a comprehensive suite of rejection-path regression tests to ensure robust validation of inputs before SIMD scanning. Additionally, user-facing documentation and rustdoc examples have been added to clarify the integration contract for the allocation-free serial path. No review comments were provided, so there is no feedback to address.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
Codex stop-gate review: the README "Caller-owned serial two-stage" section called the whole two-stage path "allocation-free", overstating it. Only the `_into` rerank step is allocation-free (and only on repeated calls of the same batch shape, reusing the warmed SubsetScratch + caller buffers); stage 1 `top_m_candidates_batched_serial_csr` allocates a fresh CandidateBatch per call. Reword the header + contract to scope the claim precisely. Docs only. Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
CI caught that `tests/alloc_free.rs` failed on aarch64 / macOS ("steady-state
_into allocated 8 time(s)") while passing on x86_64. Root cause: the rerank's
scalar fallback `scan_via_lut_scalar` allocates a per-query scoring LUT
(`vec![0.0f32; dim * n_buckets]`), one per query row — so the allocation-free
guarantee holds only on the AVX-512/AVX2 SIMD path, not the scalar fallback
(which runs on aarch64, or x86 without AVX2).
- Gate the strict zero-alloc assertion to hosts with AVX2 (skip with a note
otherwise) so the test is correct cross-platform.
- Scope the "allocation-free" claim to the SIMD path in the rustdoc, README, and
CHANGELOG (the scalar fallback allocates a per-query LUT).
Follow-up (separate, lib change): reuse the LUT in `SubsetScratch` to make the
scalar/NEON path allocation-free too — valuable for the ARM/Pi edge target.
Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
… guess
Codex stop-gate + PR review: the alloc-free test's skip-gate
(`is_x86_feature_detected!("avx2")`) did not match the actual rerank dispatch,
which routes via `select_simd_tier(dim, bits)` — AVX-512 needs avx512f+avx512dq
and dim%64; AVX2 needs avx2+FMA and the per-width divisibility; else the scalar
LUT path (which allocates per query). The avx2-only guess omitted FMA / avx512dq
/ the divisibility, so it could mis-gate.
- Expose `#[doc(hidden)] pub fn subset_rerank_uses_simd(dim, bits) -> bool`
wrapping the same `select_simd_tier` the dispatch reads, so the test gate
cannot drift from the real dispatch. Gate the strict zero-alloc check on it
(skips correctly on aarch64 / non-SIMD x86, fixing the aarch64 ASAN + macOS
ci failures).
- Label the README caller-owned snippet as a shape sketch (it references
`sign`/`rq`/`queries` built as in the Quickstart) rather than implying it is
standalone-runnable.
Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
|
/agentic_review |
|
Code review by qodo was updated up to the latest commit 1967c9a |
qodo flagged that the #[doc(hidden)] dispatch probe reported SIMD eligibility from select_simd_tier alone, so it could return true for (dim, bits) that RankQuant::new rejects — e.g. bits=4 with dim a multiple of 8 (the AVX2 b=4 lane invariant) but not of 2^bits=16 (the bucket rule). The probe is test-only and its lone caller passes valid (128, 2), but the loose domain was misleading. Gate on RankQuant::validate_params first so it answers 'the rerank takes a SIMD kernel' only for buildable params, with a regression test asserting the probe is false for constructor-invalid fixtures and never claims SIMD where new() rejects. Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
|
Triage of the qodo review (head Bug 2 — SIMD probe ignores validation ✅ Fixed ( Bug 1 — hidden API surface — Working as intended. |
|
/agentic_review |
|
Code review by qodo was updated up to the latest commit 013cbc6 |
…feature `subset_rerank_uses_simd` was exported as a `#[doc(hidden)] pub` symbol at the crate root, silently committing the crate to long-term semver stability for a test-only diagnostic. Introduce a non-default `test-utils` Cargo feature and gate the function definition in `src/quant.rs`, its re-export in `src/lib.rs`, and every test that calls it (`tests/alloc_free.rs`, `tests/index/two_stage.rs::subset_rerank_uses_simd_is_false_for_constructor_invalid_params`) behind `#[cfg(feature = "test-utils")]`. The probe and its tests are still compiled and run — opt-in via `--features test-utils` — but the symbol is absent from the default public surface, so downstream code cannot accidentally depend on it. Add `cargo test --features test-utils` to the CI `test` job so the gated path is exercised on every push/PR (same lane pattern as `--features experimental`). Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
…erank-harden Signed-off-by: Nelson Spence <nelson@projectnavi.ai> # Conflicts: # CHANGELOG.md
…tream name) Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
…erank-harden Signed-off-by: Nelson Spence <nelson@projectnavi.ai> # Conflicts: # CHANGELOG.md
Summary
The caller-owned serial two-stage primitives added in 0.5.0 —
SignBitmap::top_m_candidates_batched_serial_csr→CandidateBatch→RankQuant::search_asymmetric_subset_batched_serial[_into]with a reusableSubsetScratch— are present, shape-conformant, and memory-safe. A downstream(OrdinalDB) reviewer flagged the original PR as "only half the work"; an
adversarial trust/memory-safety audit confirms that gap is release-readiness +
proof-of-value, not a functional or safety defect. No source/guard change.
This PR lands the missing half: the explicit, tested memory-safety net + the
perf evidence + the integration docs.
Memory-safety verdict (the reason this was gated)
SOUND — no out-of-bounds reachable. The audit traced every unsafe SIMD
load/store from the two public rerank entries through
validate_csr_batchintoall four AVX kernels + the scalar fallback and could not construct a
malformed-but-accepted CSR/query that reaches an OOB gather. Order is:
queries % dim→assert_all_finite→validate_csr_batch(offsets len/start/final/monotonic/row-len≤n_vectors/candidate-id<n_vectors) → out-buffer lengths,
all before the per-row gather. Heap sizing uses
checked_mul. The guards arecorrect and complete; the residual risk was purely untested — which this PR
closes.
API status
Present + exact shape match; no API change. Design constraints verified in
code: no internal rayon,
_intoallocation-free after warmup, exactsingle-query semantics + tie policy (
score desc, global row-id asc),-1/NEG_INFINITYsentinels, unsorted-rows-OK + deterministic.Tests (
tests/index/two_stage.rs, +11;tests/alloc_free.rs, new)bounding the unsafe gather), non-monotonic / wrong-final / non-zero-first
offsets, non-finite + ragged queries (
_into, the CSR gen, and the allocatingwrapper), wrong
out_scores/out_indiceslength — each#[should_panic],pinning that a bad input panics before the SIMD scan.
slots equal the single-query reference, trailing slots are
-1/NEG_INFINITY.#[global_allocator]):search_asymmetric_subset_batched_serial_intodoes 0 heap allocations on awarmed steady-state call — strong form of the prior capacity proxy.
batch, k=0, duplicates, unsorted order-independence, cross-tier scalar
reference, offsets-len + OOB-candidate rejection.)
Benchmark — Harrier-1024 (
examples/two_stage_bench.rs, SYNTHETIC)Reproduce:
cargo run --release --example two_stage_bench -- --dim 1024 --n 50000 --queries 200 --m 256 --k 10(committed capture:
benchmarks/two_stage_caller_owned_dim1024.txt; Zen5 / AVX-512, single-thread):_into_intovs single-query loopHonest framing (no-fiction): at dim=1024 the rerank stage is a small slice of
an already stage-1-dominated cost, and the batched
_intois on par with thesingle-query loop single-threaded. The primitives' value is (a) allocation-free
steady state (proven: 0 allocs) and (b) caller-owned parallelism (no internal
rayon → a DB drives
_intoacross its own pool, GIL released, one query-range perworker) — not a single-thread speedup. This result is not explained by the
SignBitmap AVX-tail (dim=768) win — different shape, different mechanism.
Docs
README "Caller-owned serial two-stage (DB / runtime integration)" section (CSR
shape, sentinel padding, scratch reuse, no-rayon); a no_run rustdoc example on the
alloc-free
_intohot path + a buffer-sizing porting callout; an example on theCSR candidate-gen; CHANGELOG.
Validation (local, Zen5 / AVX-512)
cargo fmt --all --check✅cargo clippy --workspace --all-targets --locked -- -D warnings✅ (incl. ordvec-python)--locked✅ (index 85→96; alloc_free; 4 doctests; FFI C-link)cargo +1.89.0 build(MSRV) ✅ ·cargo build --locked✅examples/two_stage_benchruns at--dim 1024✅Note:
cargo test --workspacecannot linkordvec-python(pyo3extension-modulehas no libpython for a standalone
cargo testbinary) — a pre-existing property ofthis workspace; the binding is gated via
maturin develop+ pytest inpython.ymland is untouched here. The FFI C-link test needs an absolute target dir (it
resolves
CARGO_TARGET_DIRagainst the package CWD); green under CI's defaulttarget/.Follow-up (not in this PR)
A
csr_batched_twostagecargo-fuzz target over the CSR offset arithmetic + gather(the audit's one remaining "optional" residual-risk item) — best reviewed with its
own
fuzz.ymlwiring. The guards are proven sound and now rejection-tested, so thisis additive continuous coverage.