perf(sign): explore tiled batched sign probe without full score matrix #236

@Fieldnote-Echo

Description

Context

SignBitmap::top_m_candidates_batched() can scan multiple queries, but it materializes a full batch * n_vectors score matrix. In OrdinalDB testing, switching from the caller-owned CSR loop to this full-batch path was a net loss at the FiQA / Harrier-1024 shape: the cost of the full score matrix plus selection overhead outweighed any batching benefit.

The useful target is not the current full-matrix batched path. It is a tiled batched sign probe that avoids materializing batch * n_vectors while still getting better locality / lower overhead than independent per-query candidate generation.
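To make the memory argument concrete, here is a rough sizing sketch. The function names, the 4-byte score width, and the 8-byte (score, id) entry are assumptions for illustration only, not measurements of the current implementation.

```rust
/// Bytes needed for a dense `batch x n_vectors` score matrix.
/// `score_bytes` is a hypothetical per-score width.
pub fn full_matrix_bytes(batch: usize, n_vectors: usize, score_bytes: usize) -> usize {
    batch * n_vectors * score_bytes
}

/// Bytes needed for incremental per-query top-m state over one query tile.
/// `entry_bytes` is a hypothetical size for one (score, doc id) entry.
pub fn tiled_state_bytes(tile: usize, m: usize, entry_bytes: usize) -> usize {
    tile * m * entry_bytes
}
```

With illustrative inputs (batch = 256, n_vectors = 100_000, 4-byte scores) the dense matrix is 102,400,000 bytes, while a tile of 8 queries with m = 1000 and 8-byte entries needs 64,000 bytes of top-m state. The real ratio depends on the actual score type and tile size.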

Proposed direction

Explore a tiled batched sign-probe API:

  • build query bitmaps for a small query tile
  • scan doc blocks against that tile
  • maintain per-query top-m state incrementally
  • emit CSR candidates directly
  • reuse caller-owned scratch/output buffers
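The steps above can be sketched roughly as follows. This is a minimal scalar illustration, not the library's kernel: `sign_bits`, `tiled_top_m_csr`, the flat `docs` layout, Hamming-distance scoring, and the tie-break on lower doc id are all assumptions made for this sketch. It uses plain `count_ones` instead of the AVX-512 kernels, and it allocates heaps per tile where a real implementation would reuse caller-owned scratch.

```rust
use std::collections::BinaryHeap;

/// Pack the sign bits of `v` into u64 words (bit set when the component is >= 0.0).
pub fn sign_bits(v: &[f32]) -> Vec<u64> {
    let mut words = vec![0u64; (v.len() + 63) / 64];
    for (i, x) in v.iter().enumerate() {
        if *x >= 0.0 {
            words[i / 64] |= 1u64 << (i % 64);
        }
    }
    words
}

/// Hamming distance between two bitmaps (lower means more sign agreement).
fn hamming(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Hypothetical tiled probe. `docs` holds ceil(dim / 64) words per document,
/// `queries` holds `dim` floats per query. Output is CSR: query `q` owns
/// `candidates[offsets[q]..offsets[q + 1]]`.
pub fn tiled_top_m_csr(
    docs: &[u64],
    queries: &[f32],
    dim: usize,
    m: usize,
    tile: usize,
    block: usize,
    offsets: &mut Vec<usize>,
    candidates: &mut Vec<u32>,
) {
    assert!(dim > 0 && tile > 0 && block > 0);
    assert!(queries.len() % dim == 0);
    let words = (dim + 63) / 64;
    let n_docs = docs.len() / words;
    offsets.clear();
    candidates.clear();
    offsets.push(0);
    for q_tile in queries.chunks(tile * dim) {
        // Build query bitmaps for this tile only.
        let q_bits: Vec<Vec<u64>> = q_tile.chunks(dim).map(sign_bits).collect();
        // One bounded max-heap per query: the worst kept (distance, id) is on top.
        let mut heaps: Vec<BinaryHeap<(u32, u32)>> =
            (0..q_bits.len()).map(|_| BinaryHeap::with_capacity(m)).collect();
        // Scan doc blocks; each block is revisited by every query in the tile.
        for start in (0..n_docs).step_by(block) {
            let end = (start + block).min(n_docs);
            for (q, heap) in q_bits.iter().zip(heaps.iter_mut()) {
                for d in start..end {
                    let key = (hamming(q, &docs[d * words..(d + 1) * words]), d as u32);
                    if heap.len() < m {
                        heap.push(key);
                    } else if let Some(mut top) = heap.peek_mut() {
                        // Replace the current worst entry if this one is better.
                        if key < *top {
                            *top = key;
                        }
                    }
                }
            }
        }
        // Emit CSR rows: ascending distance, ties broken by lower doc id.
        for heap in heaps {
            candidates.extend(heap.into_sorted_vec().into_iter().map(|(_, d)| d));
            offsets.push(candidates.len());
        }
    }
}
```

Peak working state here is `tile * m` heap entries, never `batch * n_vectors` scores. Because the (distance, id) key gives a total order, the output does not depend on the tile or block size. Whether that order matches the current top_m_candidates() semantics would still need to be checked against the existing implementation.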

A possible API shape, building on the SignProbeScratch issue:

pub fn top_m_candidates_tiled_csr_into(
    &self,
    queries: &[f32],
    m: usize,
    options: SignProbeTileOptions,
    scratch: &mut SignProbeScratch,
    offsets: &mut Vec<usize>,
    candidates: &mut Vec<u32>,
);
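For reading the output back, a small helper along these lines could be used. It assumes the usual CSR convention that `offsets` has one entry per query plus a trailing end offset. The signature above does not state this, and `candidates_for` is a hypothetical name.

```rust
/// Return query `q`'s candidate slice from CSR output.
/// Assumes `offsets.len() == n_queries + 1` and `offsets[0] == 0`.
pub fn candidates_for<'a>(offsets: &[usize], candidates: &'a [u32], q: usize) -> &'a [u32] {
    &candidates[offsets[q]..offsets[q + 1]]
}
```

Under that convention a query that produced no candidates is an empty slice, and the caller can keep both vectors alive across calls to avoid reallocation.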

Acceptance criteria

  • Deterministic candidate order matches current top_m_candidates() semantics.
  • Does not allocate a full batch * n_vectors score matrix.
  • Benchmarks cover at least:
    • Harrier-1024, n around 50k-100k
    • BGE-768, n around 100k-400k
    • small nq and larger nq cases
  • Clearly document when callers should use independent CSR vs tiled batched probe.
  • No regression to existing convenience APIs.

Motivation

The AVX-512 sign scan kernels are now fast enough that selection, allocation, and batching shape matter. A tiled batched probe is the likely next stage-1 candidate-generation improvement that does not change the retrieval semantics.


Labels

  • core-api - Core search/index public API surface (pre-1.0)
  • perf - Performance-relevant: scan/SIMD/alloc/memory/parallelism
  • rust - Pull requests that update rust code
