Context
SignBitmap::top_m_candidates_batched() can scan multiple queries, but it materializes a full batch * n_vectors score matrix. In OrdinalDB testing, switching from the caller-owned CSR loop to this full-batch path performed worse at the FiQA / Harrier-1024 shape, because the cost of the full score matrix plus selection overhead outweighed any batching benefit.
The useful target is not the current full-matrix batched path. It is a tiled batched sign probe that avoids materializing batch * n_vectors while still getting better locality / lower overhead than independent per-query candidate generation.
Proposed direction
Explore a tiled batched sign-probe API:
- build query bitmaps for a small query tile
- scan doc blocks against that tile
- maintain per-query top-m state incrementally
- emit CSR candidates directly
- reuse caller-owned scratch/output buffers
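The steps above can be sketched roughly as follows. This is a minimal scalar illustration under stated assumptions, not the library's implementation: `pack_signs`, `tiled_top_m_csr`, and the `query_tile` / `doc_block` parameters are hypothetical names, a bit is assumed to be set when a component is positive, candidates are ranked by ascending Hamming distance with ties broken by lower doc id, and `offsets` is assumed to hold one more entry than there are queries. The per-tile bitmaps and heaps are locals here; in the proposed API they would live in the caller-owned scratch.

```rust
use std::collections::BinaryHeap;

/// Hypothetical helper: append the packed sign bits of `row` to `out`
/// (assumption: a bit is set when the component is > 0.0).
pub fn pack_signs(row: &[f32], out: &mut Vec<u64>) {
    let words = (row.len() + 63) / 64;
    let start = out.len();
    out.resize(start + words, 0);
    for (i, &x) in row.iter().enumerate() {
        if x > 0.0 {
            out[start + i / 64] |= 1u64 << (i % 64);
        }
    }
}

/// Hypothetical tiled probe. `docs` holds packed sign bitmaps back to back,
/// `queries` is a flat row-major `nq * dim` slice.
pub fn tiled_top_m_csr(
    docs: &[u64],
    queries: &[f32],
    dim: usize,
    m: usize,
    query_tile: usize,
    doc_block: usize,
    offsets: &mut Vec<usize>,
    candidates: &mut Vec<u32>,
) {
    assert!(dim > 0, "dim must be non-zero");
    let words = (dim + 63) / 64;
    let n_docs = docs.len() / words;
    let (query_tile, doc_block) = (query_tile.max(1), doc_block.max(1));

    offsets.clear();
    candidates.clear();
    offsets.push(0);

    // Per-tile state; never sized by n_docs, so no batch * n_vectors matrix.
    let mut qbits: Vec<u64> = Vec::new();
    let mut heaps: Vec<BinaryHeap<(u32, u32)>> = Vec::new();

    for tile in queries.chunks(query_tile * dim) {
        let nq = tile.len() / dim;
        // 1. Build query bitmaps for this small tile.
        qbits.clear();
        for q in tile.chunks_exact(dim) {
            pack_signs(q, &mut qbits);
        }
        heaps.resize_with(nq, BinaryHeap::new);

        // 2. Scan doc blocks against the tile, so each block is reused
        //    across every query in the tile.
        for block_start in (0..n_docs).step_by(doc_block) {
            let block_end = (block_start + doc_block).min(n_docs);
            for qi in 0..nq {
                let q = &qbits[qi * words..(qi + 1) * words];
                let heap = &mut heaps[qi];
                for d in block_start..block_end {
                    let dw = &docs[d * words..(d + 1) * words];
                    let dist: u32 = q.iter().zip(dw).map(|(a, b)| (a ^ b).count_ones()).sum();
                    // 3. Maintain top-m incrementally: max-heap of (dist, id)
                    //    whose root is the current worst kept candidate.
                    let key = (dist, d as u32);
                    if heap.len() < m {
                        heap.push(key);
                    } else if heap.peek().map_or(false, |&worst| key < worst) {
                        heap.pop();
                        heap.push(key);
                    }
                }
            }
        }

        // 4. Emit CSR rows directly, in deterministic (dist, id) order.
        for heap in heaps.iter_mut() {
            let mut best: Vec<(u32, u32)> = heap.drain().collect();
            best.sort_unstable();
            candidates.extend(best.iter().map(|&(_, id)| id));
            offsets.push(candidates.len());
        }
    }
}
```

Peak working memory in this sketch is on the order of query_tile * m heap entries plus the tile's bitmaps, independent of n_vectors. Whether the tile/block loop order shown here is actually the faster one is exactly what the benchmarks would need to establish.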
A possible API shape, building on the SignProbeScratch issue:
pub fn top_m_candidates_tiled_csr_into(
    &self,
    queries: &[f32],
    m: usize,
    options: SignProbeTileOptions,
    scratch: &mut SignProbeScratch,
    offsets: &mut Vec<usize>,
    candidates: &mut Vec<u32>,
);
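For reference, a caller would typically read the CSR output one query at a time. The helper below is a hypothetical illustration, not part of the proposed API; it assumes the conventional CSR layout in which `offsets` has one entry per query plus a trailing end offset, and that `queries` is a flat row-major slice whose query count follows from the index dimension.

```rust
/// Hypothetical helper: the candidate ids for query `q` in a CSR pair.
/// Assumes `offsets.len() == nq + 1` and `offsets[0] == 0`.
pub fn candidates_for_query<'a>(
    offsets: &[usize],
    candidates: &'a [u32],
    q: usize,
) -> &'a [u32] {
    &candidates[offsets[q]..offsets[q + 1]]
}

/// Hypothetical helper: number of queries described by a CSR offsets array.
pub fn csr_query_count(offsets: &[usize]) -> usize {
    offsets.len().saturating_sub(1)
}
```

Because both output vectors are caller-owned, they can be cleared and refilled across calls, so the steady-state path does not need to allocate once capacity has grown to fit.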
Acceptance criteria
- Deterministic candidate order matches current top_m_candidates() semantics.
- Does not allocate a full batch * n_vectors score matrix.
- Benchmarks cover at least:
  - Harrier-1024, n around 50k-100k
  - BGE-768, n around 100k-400k
  - small nq and larger nq cases
- Clearly document when callers should use independent CSR vs tiled batched probe.
- No regression to existing convenience APIs.
Motivation
The AVX-512 sign scan kernels are now fast enough that selection/allocation/batching shape matters. A tiled batched probe is the likely next stage-1 candidate-generation improvement without changing the retrieval semantics.