Context
Downstream DB integrations now use the caller-owned two-stage path:
- SignBitmap::top_m_candidates_batched_serial_csr() for stage-1 sign probing
- RankQuant::search_asymmetric_subset_batched_serial_into() for rerank
The rerank half has SubsetScratch and caller-provided output buffers. The sign-probe half still routes through top_m_candidates(), which allocates per query:
- query bitmap
- scores: Vec<u32>(n_vectors)
- idx: Vec<u32>(n_vectors)
- head: Vec<u32>(m)
Now that the sign scan kernels are fast, these allocations and index materialization are a visible part of the integration overhead.
Proposed API
Add caller-owned scratch and _into APIs for sign candidate generation, something along these lines:
    pub struct SignProbeScratch { ... }

    impl SignBitmap {
        pub fn top_m_candidates_into(
            &self,
            query: &[f32],
            m: usize,
            scratch: &mut SignProbeScratch,
            out: &mut Vec<u32>,
        );

        pub fn top_m_candidates_batched_serial_csr_into(
            &self,
            queries: &[f32],
            m: usize,
            scratch: &mut SignProbeScratch,
            offsets: &mut Vec<usize>,
            candidates: &mut Vec<u32>,
        );
    }
Exact shape can change; the important contract is caller-owned capacity reuse across repeated calls.
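
To make the reuse contract concrete, here is a minimal, self-contained sketch of how the two signatures could behave. It is not the library's implementation: ToySignBitmap, the scratch field names, the pack_signs and probe_append helpers, the "bit set when component >= 0.0" convention, Hamming distance as the score, and ties going to the lower index are all assumptions made for illustration.

```rust
/// Toy stand-in for the real index (hypothetical): one sign bit per
/// dimension, packed into u64 words, stored row-major.
pub struct ToySignBitmap {
    dim: usize,
    words: usize, // u64 words per stored vector
    bits: Vec<u64>,
}

/// Caller-owned buffers; the field names are assumptions for illustration.
#[derive(Default)]
pub struct SignProbeScratch {
    query_bits: Vec<u64>,
    scores: Vec<u32>,
    idx: Vec<u32>,
}

/// Pack sign bits into `out` (assumed convention: bit set when x >= 0.0).
fn pack_signs(v: &[f32], out: &mut Vec<u64>) {
    out.clear();
    out.resize((v.len() + 63) / 64, 0);
    for (i, &x) in v.iter().enumerate() {
        if x >= 0.0 {
            out[i / 64] |= 1u64 << (i % 64);
        }
    }
}

impl ToySignBitmap {
    pub fn new(dim: usize) -> Self {
        assert!(dim > 0, "dim must be non-zero");
        Self { dim, words: (dim + 63) / 64, bits: Vec::new() }
    }

    pub fn add(&mut self, v: &[f32]) {
        assert_eq!(v.len(), self.dim);
        let mut packed = Vec::new();
        pack_signs(v, &mut packed);
        self.bits.extend_from_slice(&packed);
    }

    /// Appends up to `m` ids to `out`, ordered by (Hamming distance, id).
    /// clear() keeps capacity, so warmed scratch buffers are reused.
    fn probe_append(&self, query: &[f32], m: usize, s: &mut SignProbeScratch, out: &mut Vec<u32>) {
        assert_eq!(query.len(), self.dim);
        pack_signs(query, &mut s.query_bits);
        s.scores.clear();
        s.idx.clear();
        for (i, row) in self.bits.chunks_exact(self.words).enumerate() {
            let d: u32 = row.iter().zip(&s.query_bits).map(|(a, b)| (a ^ b).count_ones()).sum();
            s.scores.push(d);
            s.idx.push(i as u32);
        }
        let m = m.min(s.idx.len());
        if m == 0 {
            return;
        }
        let scores = &s.scores;
        // Partition the m best to the front, then order only that head.
        s.idx.select_nth_unstable_by_key(m - 1, |&i| (scores[i as usize], i));
        s.idx[..m].sort_unstable_by_key(|&i| (scores[i as usize], i));
        out.extend_from_slice(&s.idx[..m]);
    }

    pub fn top_m_candidates_into(
        &self,
        query: &[f32],
        m: usize,
        scratch: &mut SignProbeScratch,
        out: &mut Vec<u32>,
    ) {
        out.clear();
        self.probe_append(query, m, scratch, out);
    }

    /// CSR output: candidates for query q are
    /// `candidates[offsets[q]..offsets[q + 1]]`.
    pub fn top_m_candidates_batched_serial_csr_into(
        &self,
        queries: &[f32],
        m: usize,
        scratch: &mut SignProbeScratch,
        offsets: &mut Vec<usize>,
        candidates: &mut Vec<u32>,
    ) {
        assert_eq!(queries.len() % self.dim, 0);
        offsets.clear();
        candidates.clear();
        offsets.push(0);
        for q in queries.chunks_exact(self.dim) {
            self.probe_append(q, m, scratch, candidates);
            offsets.push(candidates.len());
        }
    }
}
```

In this sketch the keys (score, id) are unique, so the unstable selection and sort still give a deterministic order. A caller would create one SignProbeScratch::default() per worker and pass the same scratch, offsets, and candidates buffers to every call.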
Acceptance criteria
- Same candidate order and deterministic tie policy as current top_m_candidates().
- No heap allocation on warmed repeated calls for fixed (n_vectors, dim, m, nq) on the SIMD path.
- Reuses score/index/query-bitmap buffers in scratch.
- Keeps current allocating APIs as convenience wrappers.
- Tests compare _into vs existing APIs across empty, small, tied, and normal cases.
- Include a focused microbench at Harrier-1024 and BGE-768 shapes.
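
The wrapper and equivalence-test criteria above could take roughly the following shape. This is a sketch over precomputed scores rather than the real sign kernels: top_m_into, top_m, and into_matches_wrapper are hypothetical names, and "lower score wins, ties go to the lower index" is an assumed tie policy. The capacity comparison is only a cheap proxy for the no-allocation criterion, not a proof of it.

```rust
/// Hypothetical `_into` kernel over precomputed scores: lower score is
/// better, ties go to the lower index (an assumed tie policy).
pub fn top_m_into(scores: &[u32], m: usize, idx: &mut Vec<u32>, out: &mut Vec<u32>) {
    out.clear();
    idx.clear();
    idx.extend(0..scores.len() as u32);
    let m = m.min(idx.len());
    if m == 0 {
        return;
    }
    idx.select_nth_unstable_by_key(m - 1, |&i| (scores[i as usize], i));
    idx[..m].sort_unstable_by_key(|&i| (scores[i as usize], i));
    out.extend_from_slice(&idx[..m]);
}

/// Allocating convenience wrapper: same result, fresh buffers per call.
pub fn top_m(scores: &[u32], m: usize) -> Vec<u32> {
    let mut idx: Vec<u32> = Vec::new();
    let mut out: Vec<u32> = Vec::new();
    top_m_into(scores, m, &mut idx, &mut out);
    out
}

/// Runs every case through both paths while sharing one pair of buffers.
/// Returns true when all results agree and the pre-sized buffers never grew.
pub fn into_matches_wrapper(cases: &[(&[u32], usize)]) -> bool {
    let max_n = cases.iter().map(|(s, _)| s.len()).max().unwrap_or(0);
    let mut idx: Vec<u32> = Vec::with_capacity(max_n);
    let mut out: Vec<u32> = Vec::with_capacity(max_n);
    let (idx_cap, out_cap) = (idx.capacity(), out.capacity());
    for &(scores, m) in cases {
        top_m_into(scores, m, &mut idx, &mut out);
        if out != top_m(scores, m) {
            return false;
        }
    }
    idx.capacity() == idx_cap && out.capacity() == out_cap
}
```

Routing the allocating wrapper through the _into function keeps a single code path, so the two cannot drift apart in ordering or tie handling.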
Motivation
OrdinalDB can already use the new caller-owned rerank path, but stage-1 still has allocation overhead. This should be the next practical performance API for DB integrations after the 0.5.0 batched subset rerank work.