perf(simd): AVX-512 masked-tail scan for non-512-bit-multiple dims (BGE-768 ~4x stage-1)#214
Conversation
The SignBitmap and Bitmap AVX-512 VPOPCNTDQ scan kernels dispatched to the vectorized path only when the per-vector 64-bit word count was a multiple of 8 (dim a multiple of 512), silently falling back to the scalar loop otherwise. Common embedding widths — 768 (BGE), 384 (bge-small / MiniLM) — therefore ran the entire stage-1 candidate scan scalar. Add a masked-tail epilogue (`_mm512_maskz_loadu_epi64` over the trailing `(dim / 64) % 8` words) to all six scan kernels (SignBitmap single + batched; Bitmap single / collect / batched / subset) and drop the `qpv % 8` dispatch gate. Any supported dim (a multiple of 64) now uses VPOPCNTDQ; dims whose word count is a multiple of 8 are unchanged, others pay one extra masked chunk (768 ≈ 1024). Dispatch now reads one shared predicate, `avx512vpop_supported()`, with no per-dimension gate. Measured ~4x faster stage-1 scan at dim=768 (609 -> 153 us/query, n=100k, batch=256, single-thread, Zen5 / AVX-512; see examples/bge_kernel_bench); 1024/1536 unchanged. Byte-identical to scalar: parity tests cover qpv tail residues 0..7 plus 384/512/768/1024/1536 across all six kernels, an unchanged-at-512-bit-multiples test, and a dispatch diagnostic. Stage-1 scan-kernel throughput only — not a whole-pipeline figure. Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
Code Review by Qodo
1.
|
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
PR Summary by QodoAVX-512 masked-tail scan so all 64-bit-multiple dims use VPOPCNTDQ WalkthroughsDescription• Remove the qpv % 8 dispatch gate so AVX-512 scan runs for any dim % 64 == 0. • Add a masked-load tail epilogue to SignBitmap/Bitmap VPOPCNTDQ scan kernels. • Add parity tests + a repro bench example; document the new behavior and speedup. Diagramgraph TD
A["SignBitmap / Bitmap APIs"] --> B{"avx512vpop_supported()?"} -->|yes| C["AVX-512 scan kernels"] --> D["loadu 8×u64 groups"] --> E["maskz tail load"] --> F["popcnt reduce + scores"]
B -->|no| G["Scalar scan"]
H[("Bitmaps u64 rows")] --> C
subgraph Legend
direction LR
_api["API/module"] ~~~ _dec{"Dispatch"} ~~~ _db[("Data buffer")]
end
High-Level AssessmentThe following are alternative approaches to this PR: 1. Pad bitmap rows to `qpv` multiple-of-8 at build time
2. Scalar tail loop after vectorized chunks
Recommendation: The masked-load epilogue is the best trade-off: it preserves the existing storage format and removes the scalar performance cliff for non-512-bit-multiple dims while keeping previously-fast dims unchanged. Padding could marginally improve the 768≈1024 tail cost, but it introduces memory/layout complexity that likely outweighs the gain. File ChangesEnhancement (4)
Documentation (1)
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Code Review
This pull request extends the AVX-512 VPOPCNTDQ scan kernels in Bitmap and SignBitmap to support any dimension that is a multiple of 64, rather than restricting them to multiples of 512 bits. This is achieved by processing trailing words with a masked load (_mm512_maskz_loadu_epi64), which significantly improves performance for common embedding widths like 384 and 768. Additionally, a new benchmark and comprehensive parity tests have been added to ensure correctness. There are no review comments, so I have no feedback to provide.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
…heck PR review: `scan_dispatch_is_dimension_independent` asserted `avx512vpop_supported() == avx512vpop_supported()` — it cannot catch a dispatch regression. Replace it with `masked_tail_kernel_matches_scalar_when_avx512_present`, which on an AVX-512 VPOPCNTDQ host (or the Intel SDE CI job) builds qpv % 8 != 0 dims (384/768/832 — which force the masked tail), runs the real dispatched path, and asserts byte-identity to an independent scalar overlap reference. On a non-AVX-512 host it SKIPS with a notice rather than silently passing on the scalar path, so a green run there is not mistaken for tail-kernel coverage. The cross-platform scalar parity remains in `avx512_path_matches_scalar_across_residues_and_common_dims`. Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
|
/agentic_review |
|
Code review by qodo was updated up to the latest commit 255a892 |
|
/agentic_review |
|
Code review by qodo was updated up to the latest commit 255a892 |
Add a require_avx512_or_skip helper to the bitmap and sign_bitmap test modules. When ORDVEC_REQUIRE_AVX512 is set to '1' or 'true' and the host lacks AVX-512 VPOPCNTDQ, the helper panics loudly instead of silently skipping. When the env var is not set, it emits a visible eprintln! skip notice to stderr and returns false so the caller bails. Apply the helper to all AVX-512-named tests in both modules (avx512_path_matches_scalar_across_residues_and_common_dims, avx512_path_matches_scalar_at_production_dim, masked_tail_kernel_matches_scalar_when_avx512_present), replacing the ad-hoc eprintln!+return in the masked-tail test. Wire ORDVEC_REQUIRE_AVX512=1 into the Intel SDE CI job so the SDE lane genuinely enforces the kernels rather than silently treating a skipped test as green coverage. Addresses qodo findings: 'AVX512 tests not enforced' (Reliability) and 'Skip notice not visible' (Observability). Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
…ling Signed-off-by: Nelson Spence <nelson@projectnavi.ai> # Conflicts: # CHANGELOG.md
Summary
The
SignBitmapandBitmapAVX-512 VPOPCNTDQ scan kernels only took the vectorized path when the per-vector 64-bit word count was a multiple of 8 — i.e.dima multiple of 512 bits. Any otherdim(still a valid multiple of 64) silently fell back to the scalar loop. So the most common open-embedding widths — 768 (BGE/bge-base), 384 (bge-small, all-MiniLM) — ran the entire stage-1 candidate scan scalar, while 1024 (Harrier) / 1536 hit the kernel.This adds the missing SIMD epilogue: full 8×u64 groups via
loadu, then the trailing(dim / 64) % 8words via a single masked_mm512_maskz_loadu_epi64(fault-suppressed; masked lanes contribute 0). Theqpv % 8dispatch gate is removed across all kernels, so any supporteddimnow uses VPOPCNTDQ.Why it matters
768 and 384 are two of the most common embedding dimensions in the wild; bringing your own BGE/MiniLM vectors meant the stage-1 scan — ~98% of two-stage e2e — ran with no SIMD. This is a scalar cliff, not a slope: the whole vector dropped to scalar, not just the tail.
Bench — stage-1 scan (
score_all_batched_flat)Hardware: AMD Ryzen 9 9950X (Zen5),
avx512f+avx512vpopcntdq. Single-thread (RAYON_NUM_THREADS=1),taskset -c 12, 40 reps median, batch=256, same seeded inputs. Reproduce:cargo run --release --example bge_kernel_bench -- <dim> <n>.main)Kernels changed (6)
SignBitmap:sign_scan_collect_avx512vpop(single),sign_scan_collect_batched_avx512vpop(batched).Bitmap:bitmap_scan_avx512vpop(TopK/search),bitmap_scan_collect_avx512vpop(top_m_candidates),bitmap_scan_collect_batched_avx512vpop(top_m_candidates_batched),body_overlap_scores_subset_avx512vpop(subset).Dispatch is unified behind one
#[doc(hidden)] pub fn avx512vpop_supported()— it takes no dimension, so no dim can be re-gated to scalar.Tests (byte-identical to scalar)
lanes==0all-tail case):sign_bitmap::tests::avx512_path_matches_scalar_across_residues_and_common_dims,bitmap::tests::avx512_path_matches_scalar_across_residues_and_common_dims.unchanged_at_512bit_multiple_dims— pins no behavior change at 1024/1536.scan_dispatch_is_dimension_independent— the dispatch predicate is dim-free.Out of scope / deferred
Local gate
cargo fmt --check·cargo clippy --all-targets --all-features -D warnings·cargo test(incl. new parity) ·cargo test --no-default-features·cargo +1.89.0 build(MSRV — masked-load intrinsics available) ·cargo build --locked— all green on the Zen5/AVX-512 host.