Merged
2 changes: 2 additions & 0 deletions .github/workflows/ci.yml
@@ -185,6 +185,8 @@ jobs:
        run: cargo test
      - name: cargo test (experimental)
        run: cargo test --features experimental
      - name: cargo test (test-utils)
        run: cargo test --features test-utils
      - name: cargo test (no default features)
        run: cargo test --no-default-features
      - name: cargo build --release --example bench_rank
25 changes: 23 additions & 2 deletions CHANGELOG.md
@@ -29,6 +29,27 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
VPOPCNTDQ scan kernels are active on the current CPU. The scan dispatch reads
only this predicate (no per-dimension gate).

### Changed

- **Release-hardened the caller-owned serial two-stage primitives** (no API
  change; added in 0.5.0). The trust model is now explicit and tested:
  - Rejection-path regression tests for the full CSR/query/buffer validation set
    on the rerank entry points: overlong row (the guard that bounds the unsafe
    gather), non-monotonic / wrong-final / non-zero-first offsets, non-finite and
    ragged queries, and wrong output-buffer length, so that a malformed-but-accepted
    input can never reach the SIMD scan.
  - A counting-allocator test proving `search_asymmetric_subset_batched_serial_into`
    performs **zero heap allocations** in steady state (warmed `SubsetScratch`,
    reused caller buffers) **on the AVX-512/AVX2 rerank path**, the strong form of
    the prior capacity-stability proxy. (The scalar fallback, e.g. aarch64,
    allocates a per-query scoring LUT; the test skips the strict check there.)
  - A focused `two_stage_bench` example decomposing stage-1 candidate-gen /
    single-query rerank loop / batched `_into` / full two-stage at the
    Harrier-1024 shape, with a committed reference capture
    (`benchmarks/two_stage_caller_owned_dim1024.txt`, SYNTHETIC corpus).
  - User-facing docs for the caller-owned / no-rayon / allocation-free contract
    (README + rustdoc examples on the `_into` hot path and the CSR candidate-gen).
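
The counting-allocator check described in the list above can be approximated with a thread-local allocation counter behind a `GlobalAlloc` wrapper. The sketch below is illustrative only and is not the contents of `tests/alloc_free.rs`: `CountingAlloc`, `count_allocs`, and the stand-in kernel `fill_into` are hypothetical names, and the kernel only mimics an `_into`-style call that reuses a warmed scratch and a caller-owned output slice.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::cell::Cell;

thread_local! {
    // Per-thread allocation counter. It is const-initialized and has no
    // destructor, so touching it from inside the allocator does not allocate.
    static ALLOCS: Cell<usize> = const { Cell::new(0) };
}

/// Hypothetical counting allocator: forwards to the system allocator and
/// counts every `alloc` / `realloc` made on the current thread.
pub struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let _ = ALLOCS.try_with(|c| c.set(c.get() + 1));
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }

    unsafe fn realloc(&self, ptr: *mut u8, layout: Layout, new_size: usize) -> *mut u8 {
        let _ = ALLOCS.try_with(|c| c.set(c.get() + 1));
        unsafe { System.realloc(ptr, layout, new_size) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Number of heap allocations performed by `f` on the calling thread.
pub fn count_allocs<F: FnOnce()>(f: F) -> usize {
    let before = ALLOCS.with(|c| c.get());
    f();
    ALLOCS.with(|c| c.get()) - before
}

/// Stand-in for an `_into`-style hot path (an assumption, not ordvec code):
/// it reuses `scratch` and writes results into the caller-owned `out`.
pub fn fill_into(scratch: &mut Vec<f32>, out: &mut [f32]) {
    scratch.clear(); // keeps the existing capacity
    scratch.extend((0..out.len()).map(|i| i as f32)); // no alloc once warmed
    for (o, s) in out.iter_mut().zip(scratch.iter()) {
        *o = *s * 2.0;
    }
}
```

A test built on this would call the kernel once to warm the scratch, then assert that `count_allocs` returns zero for a second call with the same shape. Counting per thread keeps the measurement from being disturbed by allocations on other threads.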

### Fixed

- **`ordvec-manifest` crate and wheel now ship license text.** Both declared
@@ -53,8 +74,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `RankQuant::search_asymmetric_subset_batched_serial(..) -> SearchResults` and
`..._serial_into(.., &mut SubsetScratch, &mut out_scores, &mut out_indices)`
— serial batched subset rerank; the `_into` form is allocation-free after
-  scratch warmup (the integration contract for runtimes that own their own
-  thread pool / GIL release).
+  scratch warmup on the AVX-512/AVX2 rerank path (the integration contract for
+  runtimes that own their own thread pool / GIL release).
- New public types `CandidateBatch` (CSR candidate carrier) and `SubsetScratch`
(reusable rerank scratch).
- These primitives never enter rayon; the caller owns parallelism. No bundled
4 changes: 4 additions & 0 deletions Cargo.toml
@@ -73,6 +73,10 @@ rand_chacha = "0.10"
# target takes the scalar fallback. `experimental` exposes MultiBucketBitmap
# (research scaffold), kept off the stable surface.
experimental = []
# `test-utils` exposes internal dispatch probes used by the crate's own integration
# tests (e.g. the allocation-free guarantee check). Gated off the default surface
# because these helpers are not part of the public API and carry no semver guarantee.
test-utils = []

[profile.release]
lto = true
43 changes: 43 additions & 0 deletions README.md
@@ -141,6 +141,49 @@ For the two-stage compressed-scan path (`Bitmap` / `SignBitmap` candidate
generation → `RankQuant` rerank) and the full mode comparison, see
[`docs/RANK_MODES.md`](docs/RANK_MODES.md).

### Caller-owned serial two-stage (DB / runtime integration)

For runtimes that own their own parallelism — an embedded vector DB driving a
bounded thread pool, or a binding releasing the GIL — ordvec exposes a
**no-rayon** serial two-stage path so the *caller* schedules the work, with an
**allocation-free rerank step** (`_into`, on the AVX-512/AVX2 path) for the
steady-state hot loop:

```rust
use ordvec::{RankQuant, SignBitmap, SubsetScratch};
// Shape sketch (not standalone): `rq: RankQuant` and `sign: SignBitmap` are
// built and `add`-ed as in the Quickstart above; `queries` is your flat
// `dim * nq` f32 batch, `m` the shortlist size, `k` the top-k.
// Stage 1 — serial CSR candidate generation (never enters rayon):
let cb = sign.top_m_candidates_batched_serial_csr(&queries, m); // CandidateBatch { offsets, candidates }
// Stage 2 — rerank into CALLER-OWNED buffers with a reusable scratch:
let nq = queries.len() / dim;
let out_k = k.min(rq.len());
let mut scratch = SubsetScratch::new(); // reuse across batches
let mut out_scores = vec![f32::NEG_INFINITY; nq * out_k];
let mut out_indices = vec![-1i64; nq * out_k];
rq.search_asymmetric_subset_batched_serial_into(
    &queries, &cb.offsets, &cb.candidates, k,
    &mut scratch, &mut out_scores, &mut out_indices,
);
```

Contract: candidates are **CSR** (`offsets.len() == nq + 1`; row `qi` is
`candidates[offsets[qi]..offsets[qi+1]]`; rows need **not** be sorted). Output is
**rectangular** `nq * out_k` and **sentinel-padded** (`-1` / `NEG_INFINITY`) for
underfull rows — size both buffers to `nq * k.min(index.len())`. Scores, row ids,
and the deterministic tie policy (`score desc, global row-id asc`) match the
single-query `search_asymmetric_subset`. **Only the `_into` rerank step is
allocation-free**, and only on the **AVX-512 / AVX2** SIMD path for repeated
calls of the *same* batch shape, where it reuses the warmed `SubsetScratch` and
your output buffers (no per-row alloc, no whole-buffer preclear). The scalar
fallback (no AVX2, e.g. aarch64) allocates a per-query scoring LUT, and stage 1
(`top_m_candidates_batched_serial_csr`) allocates a fresh `CandidateBatch` on
each call. Neither primitive enters rayon: partition the query batch and call
`_into` once per worker range from your own pool. A focused decomposition
benchmark lives in
[`examples/two_stage_bench.rs`](examples/two_stage_bench.rs).
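
To make the CSR and output-layout contract above concrete, here is a small standalone sketch. It is not ordvec's validator or rerank code: `validate_csr`, `csr_row`, and `write_row` are hypothetical helpers, and the candidate element type is left generic because the README does not state it. The sketch only illustrates the stated invariants (offsets of length `nq + 1`, starting at zero, non-decreasing, ending at the candidate count) and the sentinel-padded, tie-broken output row.

```rust
/// Check the CSR invariants for `nq` query rows (hypothetical helper).
pub fn validate_csr(offsets: &[usize], n_candidates: usize, nq: usize) -> Result<(), String> {
    if offsets.len() != nq + 1 {
        return Err(format!("offsets.len() = {}, expected {}", offsets.len(), nq + 1));
    }
    if offsets[0] != 0 {
        return Err("first offset must be 0".to_string());
    }
    if offsets.windows(2).any(|w| w[0] > w[1]) {
        return Err("offsets must be non-decreasing".to_string());
    }
    if offsets[nq] != n_candidates {
        return Err(format!("final offset {} != candidate count {}", offsets[nq], n_candidates));
    }
    Ok(())
}

/// Row `qi` of an already-validated CSR batch.
pub fn csr_row<'a, T>(offsets: &[usize], candidates: &'a [T], qi: usize) -> &'a [T] {
    &candidates[offsets[qi]..offsets[qi + 1]]
}

/// Write one query's scored candidates into its output row, ordered by score
/// descending then row id ascending, and pad any unused slots with the
/// sentinels (`NEG_INFINITY` / `-1`).
pub fn write_row(scored: &mut [(f32, i64)], out_scores: &mut [f32], out_indices: &mut [i64]) {
    debug_assert_eq!(out_scores.len(), out_indices.len());
    scored.sort_by(|a, b| b.0.total_cmp(&a.0).then(a.1.cmp(&b.1)));
    for slot in 0..out_scores.len() {
        match scored.get(slot) {
            Some(&(score, id)) => {
                out_scores[slot] = score;
                out_indices[slot] = id;
            }
            None => {
                out_scores[slot] = f32::NEG_INFINITY;
                out_indices[slot] = -1;
            }
        }
    }
}
```

A consumer of the rectangular output can therefore treat an index of `-1` as "no result in this slot" without tracking per-row lengths.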

### Python

The same `Rank` / `RankQuant` / `Bitmap` / `SignBitmap` API is available from
21 changes: 21 additions & 0 deletions benchmarks/two_stage_caller_owned_dim1024.txt
@@ -0,0 +1,21 @@
Caller-owned serial two-stage decomposition — Harrier-1024 shape (SYNTHETIC corpus)
Reproduce:
cargo run --release --example two_stage_bench -- --dim 1024 --n 50000 --queries 200 --m 256 --k 10 --reps 15
Host: AMD Ryzen 9 9950X (Zen5), AVX-512 VPOPCNTDQ, single core (taskset -c 12), single-thread.

dim=1024 n=50000 queries=200 m=256 k=10 bits=2 out_k=10 candidates=51200 reps=15
1. stage-1 candidate gen (CSR) 31.920 ms 6265.59 q/s 159.60 us/query
2. single-query rerank loop 2.086 ms 95858.02 q/s 10.43 us/query
3. batched rerank _into 2.031 ms 98463.67 q/s 10.16 us/query
4. full two-stage (1+3) 34.485 ms 5799.70 q/s 172.42 us/query
rerank speedup (batched _into vs single-query loop): 1.03x

Interpretation (no-fiction): at dim=1024 the rerank stage is a small slice
(~10 us/query) of a two-stage cost dominated by stage 1 (~160 us/query); the
batched _into form is on par with the single-query loop when single-threaded
(~1.03x). The caller-owned serial primitives are NOT a single-thread speedup.
Their value is (a) an allocation-free steady state (tests/alloc_free.rs proves 0
heap allocations on a warmed _into call on the AVX-512/AVX2 rerank path) and (b)
caller-owned parallelism: with no internal rayon, a DB/runtime can drive the _into
form across its own bounded pool (GIL released), one query range per worker. The
dim=1024 result has its own mechanism; the SignBitmap AVX-tail dim=768 result does NOT explain it.
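
The "one query range per worker" pattern from point (b) can be sketched with scoped threads from the standard library. This is a hedged illustration, not ordvec code: `rerank_range_into` is a hypothetical stand-in for a serial `_into`-style kernel (it just writes each query's component sum), and plain `std::thread::scope` threads stand in for the caller's own bounded pool. In a real integration each worker would also own its own scratch, since the scratch is borrowed mutably.

```rust
use std::thread;

/// Stand-in serial kernel (hypothetical): for each query in the flat
/// `queries` slice (`dim` floats per query), fill that query's `out_k`-wide
/// output row with the sum of its components.
pub fn rerank_range_into(queries: &[f32], dim: usize, out_k: usize, out: &mut [f32]) {
    for (q, row) in queries.chunks_exact(dim).zip(out.chunks_exact_mut(out_k)) {
        let score: f32 = q.iter().sum();
        row.fill(score);
    }
}

/// Split the query batch into contiguous ranges, one per worker, and run the
/// serial kernel on each range from caller-owned threads. Each worker gets a
/// disjoint slice of the rectangular output, so no locking is needed.
pub fn rerank_partitioned(
    queries: &[f32],
    dim: usize,
    out_k: usize,
    workers: usize,
    out: &mut [f32],
) {
    assert!(dim > 0 && out_k > 0, "dim and out_k must be > 0");
    assert_eq!(queries.len() % dim, 0, "queries must be a flat dim * nq batch");
    let nq = queries.len() / dim;
    assert_eq!(out.len(), nq * out_k, "output must be rectangular nq * out_k");

    // Queries per worker, rounded up so every query is covered.
    let per = nq.div_ceil(workers.max(1)).max(1);
    thread::scope(|s| {
        let q_ranges = queries.chunks(per * dim);
        let o_ranges = out.chunks_mut(per * out_k);
        for (q_chunk, o_chunk) in q_ranges.zip(o_ranges) {
            s.spawn(move || rerank_range_into(q_chunk, dim, out_k, o_chunk));
        }
    });
}
```

Because the ranges are contiguous and the output is rectangular, the partitioned result is laid out exactly like a single serial call over the whole batch.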
177 changes: 177 additions & 0 deletions examples/two_stage_bench.rs
@@ -0,0 +1,177 @@
//! Focused benchmark + integration example for the **caller-owned serial**
//! two-stage path (the integration contract for DBs / runtimes that own their
//! own parallelism). SYNTHETIC corpus — these numbers are a
//! relative decomposition of the serial path on random data, NOT a retrieval-
//! quality or real-corpus claim, and the dim=1024 result is its own mechanism
//! (do not conflate it with the SignBitmap AVX-tail dim=768 result).
//!
//! It decomposes the cost into four separately-timed phases at the Harrier-1024
//! shape and prints a headline "batched `_into` vs single-query loop" rerank
//! ratio (~1.03x in the committed dim=1024 capture: parity, not a speedup):
//! 1. stage-1 candidate generation (top_m_candidates_batched_serial_csr)
//! 2. single-query subset rerank loop (search_asymmetric_subset, baseline)
//! 3. batched rerank `_into` (warmed SubsetScratch, caller-owned buffers)
//! 4. full two-stage serial (1 + 3 end to end)
//!
//! cargo run --release --example two_stage_bench -- [--dim N] [--n N]
//! [--queries N] [--m N] [--k N] [--bits {1,2,4}] [--reps N]

use ordvec::{RankQuant, SignBitmap, SubsetScratch};
use rand::{RngExt, SeedableRng};
use rand_chacha::ChaCha8Rng;
use std::time::Instant;

fn median(mut v: Vec<f64>) -> f64 {
    v.sort_by(|a, b| a.partial_cmp(b).unwrap());
    v[v.len() / 2]
}

fn main() {
    // Harrier-1024 defaults; all overridable.
    let mut dim = 1024usize;
    let mut n = 50_000usize;
    let mut nq = 200usize;
    let mut m = 256usize;
    let mut k = 10usize;
    let mut bits = 2u8;
    let mut reps = 20usize;
    let mut args = std::env::args().skip(1);
    while let Some(flag) = args.next() {
        let mut val = || args.next().expect("flag needs a value").parse().unwrap();
        match flag.as_str() {
            "--dim" => dim = val(),
            "--n" => n = val(),
            "--queries" => nq = val(),
            "--m" => m = val(),
            "--k" => k = val(),
            "--bits" => bits = args.next().unwrap().parse().unwrap(),
            "--reps" => reps = val(),
            other => {
                eprintln!("unknown arg: {other}");
                std::process::exit(2);
            }
        }
    }
    assert!(nq > 0 && n > 0 && reps > 0, "n, queries, reps must be > 0");

    let mut rng = ChaCha8Rng::seed_from_u64(7);
    let corpus: Vec<f32> = (0..n * dim).map(|_| rng.random_range(-1.0..1.0)).collect();
    let mut sign = SignBitmap::new(dim);
    sign.add(&corpus);
    let mut rq = RankQuant::new(dim, bits);
    rq.add(&corpus);
    let queries: Vec<f32> = (0..nq * dim).map(|_| rng.random_range(-1.0..1.0)).collect();
    drop(corpus);

    let out_k = k.min(rq.len());
    // Caller-owned output buffers, allocated ONCE and reused across batches:
    // rectangular nq*out_k, sentinel-padded for underfull rows.
    let mut out_scores = vec![f32::NEG_INFINITY; nq * out_k];
    let mut out_indices = vec![-1i64; nq * out_k];
    let mut scratch = SubsetScratch::new();

    // Warm: build the candidate batch once and warm the scratch to this shape.
    let cb = sign.top_m_candidates_batched_serial_csr(&queries, m);
    rq.search_asymmetric_subset_batched_serial_into(
        &queries,
        &cb.offsets,
        &cb.candidates,
        k,
        &mut scratch,
        &mut out_scores,
        &mut out_indices,
    );
    let total_candidates = cb.candidates.len();

    // Phase 1: stage-1 candidate generation (serial CSR).
    let p1 = median(
        (0..reps)
            .map(|_| {
                let t = Instant::now();
                let c = sign.top_m_candidates_batched_serial_csr(&queries, m);
                std::hint::black_box(&c);
                t.elapsed().as_secs_f64()
            })
            .collect(),
    );

    // Phase 2: single-query subset rerank loop (the per-query baseline).
    let p2 = median(
        (0..reps)
            .map(|_| {
                let t = Instant::now();
                for qi in 0..nq {
                    let row = &cb.candidates[cb.offsets[qi]..cb.offsets[qi + 1]];
                    let r = rq.search_asymmetric_subset(&queries[qi * dim..(qi + 1) * dim], row, k);
                    std::hint::black_box(&r);
                }
                t.elapsed().as_secs_f64()
            })
            .collect(),
    );

    // Phase 3: batched `_into` (warmed scratch + reused caller buffers).
    let p3 = median(
        (0..reps)
            .map(|_| {
                let t = Instant::now();
                rq.search_asymmetric_subset_batched_serial_into(
                    &queries,
                    &cb.offsets,
                    &cb.candidates,
                    k,
                    &mut scratch,
                    &mut out_scores,
                    &mut out_indices,
                );
                t.elapsed().as_secs_f64()
            })
            .collect(),
    );

    // Phase 4: full two-stage serial (stage-1 gen + batched rerank).
    let p4 = median(
        (0..reps)
            .map(|_| {
                let t = Instant::now();
                let c = sign.top_m_candidates_batched_serial_csr(&queries, m);
                rq.search_asymmetric_subset_batched_serial_into(
                    &queries,
                    &c.offsets,
                    &c.candidates,
                    k,
                    &mut scratch,
                    &mut out_scores,
                    &mut out_indices,
                );
                t.elapsed().as_secs_f64()
            })
            .collect(),
    );

    let row = |label: &str, secs: f64| {
        println!(
            " {label:<34} {:>9.3} ms {:>10.2} q/s {:>9.2} us/query",
            secs * 1e3,
            nq as f64 / secs,
            secs / nq as f64 * 1e6,
        );
    };
    println!("caller-owned serial two-stage (SYNTHETIC corpus)");
    println!(
        " dim={dim} n={n} queries={nq} m={m} k={k} bits={bits} out_k={out_k} \
         candidates={total_candidates} reps={reps}"
    );
    println!(
        " (dim % 64 == {}: AVX-512 tier eligible when supported)",
        dim % 64
    );
    row("1. stage-1 candidate gen (CSR)", p1);
    row("2. single-query rerank loop", p2);
    row("3. batched rerank _into", p3);
    row("4. full two-stage (1+3)", p4);
    println!(
        " rerank speedup (batched _into vs single-query loop): {:.2}x",
        p2 / p3
    );
}
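
As a worked check of the arithmetic in the `row` closure above, the standalone sketch below recomputes the derived columns from a median time in seconds. `PhaseMetrics`, `phase_metrics`, and this `Option`-returning `median` are hypothetical names introduced for illustration and are not part of the example file. Feeding in the rounded stage-1 figure from the committed capture (31.920 ms over 200 queries) gives 159.60 us/query and roughly 6265.7 q/s, which agrees with the captured 6265.59 q/s up to the rounding of the printed median.

```rust
/// Derived throughput figures for one timed phase (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct PhaseMetrics {
    pub ms: f64,
    pub qps: f64,
    pub us_per_query: f64,
}

/// Convert a median wall time in seconds over `nq` queries into the three
/// columns printed by the benchmark: milliseconds, queries/s, and us/query.
pub fn phase_metrics(secs: f64, nq: usize) -> PhaseMetrics {
    PhaseMetrics {
        ms: secs * 1e3,
        qps: nq as f64 / secs,
        us_per_query: secs / nq as f64 * 1e6,
    }
}

/// Median using a total order on f64. For an even number of samples this
/// returns the upper of the two middle values; it returns `None` when empty.
pub fn median(mut samples: Vec<f64>) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    samples.sort_by(f64::total_cmp);
    Some(samples[samples.len() / 2])
}
```

Using `f64::total_cmp` avoids the `unwrap` on `partial_cmp`, and returning `Option` makes the empty-input case explicit instead of panicking on an index.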
7 changes: 7 additions & 0 deletions src/lib.rs
@@ -87,6 +87,13 @@ pub use sign_bitmap::SignBitmap;
#[doc(hidden)]
pub use quant::search_asymmetric_byte_lut;

// `subset_rerank_uses_simd` is a test-only dispatch probe used by the crate's
// own SIMD-parity tests. Gated behind the non-default `test-utils` feature and
// excluded from semver guarantees — not a supported downstream API.
#[cfg(feature = "test-utils")]
#[doc(hidden)]
pub use quant::subset_rerank_uses_simd;

// `MultiBucketBitmap` underwrites the bilinear bucket-overlap
// decomposition but is not the constant-weight top-bucket theorem surface and
// is not stable public API. It is reachable only with the `experimental`