perf(simd): AVX-512 masked-tail scan for non-512-bit-multiple dims (BGE-768 ~4x stage-1)#214

Merged
Navi Bot (project-navi-bot) merged 4 commits into main from perf/avx512-tail-handling
Jun 15, 2026
@Fieldnote-Echo

Member

Summary

The SignBitmap and Bitmap AVX-512 VPOPCNTDQ scan kernels only took the vectorized path when the per-vector 64-bit word count (qpv) was a multiple of 8, i.e. when dim was a multiple of 512 bits. Any other dim (still a valid multiple of 64) silently fell back to the scalar loop. As a result, the most common open-embedding widths, 768 (BGE/bge-base) and 384 (bge-small, all-MiniLM), ran the entire stage-1 candidate scan on the scalar path, while 1024 (Harrier) and 1536 hit the kernel.

This adds the missing SIMD epilogue: full 8×u64 groups via loadu, then the trailing (dim / 64) % 8 words via a single masked _mm512_maskz_loadu_epi64 (fault-suppressed; masked lanes contribute 0). The qpv % 8 dispatch gate is removed across all kernels, so any supported dim now uses VPOPCNTDQ.

Why it matters

768 and 384 are two of the most common embedding dimensions in the wild; bringing your own BGE/MiniLM vectors meant the stage-1 scan — ~98% of two-stage e2e — ran with no SIMD. This is a scalar cliff, not a slope: the whole vector dropped to scalar, not just the tail.

Bench — stage-1 scan (score_all_batched_flat)

Hardware: AMD Ryzen 9 9950X (Zen5), avx512f + avx512vpopcntdq. Single-thread (RAYON_NUM_THREADS=1), taskset -c 12, 40 reps median, batch=256, same seeded inputs. Reproduce: cargo run --release --example bge_kernel_bench -- <dim> <n>.

| dim | corpus | OLD (scalar, main) | NEW (AVX-512 + tail) | speedup |
|---|---|---|---|---|
| 768 (BGE) | n=100k | 609.1 µs/q | 152.6 µs/q | 3.99× |
| 768 (BGE) | n=400k | 2439.5 µs/q | 694.6 µs/q | 3.51× |
| 384 (bge-small/MiniLM) | n=100k | 345.9 µs/q | 129.3 µs/q | 2.68× |
| 1024 (Harrier) | n=100k | 147.3 µs/q | 148.9 µs/q | 0.99× (unchanged) |

Scope: this is stage-1 scan-kernel throughput, not a whole-pipeline figure. End-to-end two-stage speedup for a BGE deployment is large but under 4× once top-k select (parallelized in production) and RankQuant rerank are included. The 1024 row confirms no regression on the already-vectorized path. Note that 768 ≈ 1024 in the NEW column (152.6 vs 148.9 µs/q at n=100k): at 768 the 4-word tail costs one masked 512-bit chunk, roughly what 1024's second full chunk costs. Trimming that is a small future micro-opt, intentionally not chased here.
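The 768 ≈ 1024 observation follows from how each dim splits into 512-bit chunks. A small sketch of that arithmetic, with `ChunkPlan`, `chunk_plan`, and `chunks_touched` as hypothetical names that are not part of the crate:

```rust
/// How one row of `dim` bits splits into 512-bit loads (illustrative only).
#[derive(Debug, PartialEq, Eq)]
struct ChunkPlan {
    full_groups: usize, // full 8xu64 loads
    tail_words: usize,  // trailing words handled by the masked load
    tail_mask: u8,      // low `tail_words` bits set
}

fn chunk_plan(dim: usize) -> ChunkPlan {
    assert!(dim % 64 == 0, "dim must be a multiple of 64");
    let qpv = dim / 64; // 64-bit words per vector
    let tail_words = qpv % 8;
    ChunkPlan {
        full_groups: qpv / 8,
        tail_words,
        tail_mask: (1u8 << tail_words) - 1,
    }
}

/// Total 512-bit loads per row: full groups plus one masked load if a tail exists.
fn chunks_touched(dim: usize) -> usize {
    let plan = chunk_plan(dim);
    plan.full_groups + usize::from(plan.tail_words != 0)
}
```

For dim=768 this gives one full group plus a 4-word masked tail (mask `0b1111`), i.e. two loads per row, the same count as the two full groups at dim=1024.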

Kernels changed (6)

SignBitmap: sign_scan_collect_avx512vpop (single), sign_scan_collect_batched_avx512vpop (batched).
Bitmap: bitmap_scan_avx512vpop (TopK/search), bitmap_scan_collect_avx512vpop (top_m_candidates), bitmap_scan_collect_batched_avx512vpop (top_m_candidates_batched), body_overlap_scores_subset_avx512vpop (subset).

Dispatch is unified behind one #[doc(hidden)] pub fn avx512vpop_supported() — it takes no dimension, so no dim can be re-gated to scalar.
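To make the change in dispatch concrete, here is a hedged sketch of a dimension-free predicate next to the gate it replaces. The function bodies are assumptions: `avx512vpop_supported_sketch`, `old_gate`, and `new_gate` are illustrative names, and the crate's real `avx512vpop_supported()` may differ in detail.

```rust
/// Capability check that takes no dimension (a sketch, not the crate's code).
pub fn avx512vpop_supported_sketch() -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512f") && is_x86_feature_detected!("avx512vpopcntdq") {
            return true;
        }
    }
    false
}

/// Old behaviour as described in the PR: SIMD only when qpv is a multiple of 8.
fn old_gate(supported: bool, qpv: usize) -> bool {
    supported && qpv % 8 == 0
}

/// New behaviour: the word count plays no part in the decision.
fn new_gate(supported: bool, _qpv: usize) -> bool {
    supported
}
```

With `supported == true`, `old_gate` rejects qpv=6 (dim 384) and qpv=12 (dim 768) but accepts qpv=16 and qpv=24, while `new_gate` accepts all four.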

Tests (byte-identical to scalar)

  • Parity vs scalar for every affected kernel, across qpv tail residues 0..7 and the common dims 384 / 512 / 768 / 1024 / 1536 (plus 64/448 for the lanes==0 all-tail case): sign_bitmap::tests::avx512_path_matches_scalar_across_residues_and_common_dims, bitmap::tests::avx512_path_matches_scalar_across_residues_and_common_dims.
  • unchanged_at_512bit_multiple_dims — pins no behavior change at 1024/1536.
  • scan_dispatch_is_dimension_independent — the dispatch predicate is dim-free.
  • Reproducible bench example proving 768 is no longer scalar (the ~4× is impossible for the scalar path).

Out of scope / deferred

  • Dense-write fusion is intentionally cut, per the stage-1 profiling verdict (conditional ~8.5% only at dim=1024/n≳262k, a net loss elsewhere). The real lever was this kernel gate, fixed here.
  • No 768 masked-tail micro-opt (768 ≈ 1024 is acceptable).

Local gate

cargo fmt --check · cargo clippy --all-targets --all-features -- -D warnings · cargo test (incl. new parity) · cargo test --no-default-features · cargo +1.89.0 build (MSRV; masked-load intrinsics available) · cargo build --locked: all green on the Zen5/AVX-512 host.

The SignBitmap and Bitmap AVX-512 VPOPCNTDQ scan kernels dispatched to the
vectorized path only when the per-vector 64-bit word count was a multiple of 8
(dim a multiple of 512), silently falling back to the scalar loop otherwise.
Common embedding widths — 768 (BGE), 384 (bge-small / MiniLM) — therefore ran
the entire stage-1 candidate scan scalar.

Add a masked-tail epilogue (`_mm512_maskz_loadu_epi64` over the trailing
`(dim / 64) % 8` words) to all six scan kernels (SignBitmap single + batched;
Bitmap single / collect / batched / subset) and drop the `qpv % 8` dispatch
gate. Any supported dim (a multiple of 64) now uses VPOPCNTDQ; dims whose word
count is a multiple of 8 are unchanged, others pay one extra masked chunk
(768 ≈ 1024). Dispatch now reads one shared predicate, `avx512vpop_supported()`,
with no per-dimension gate.

Measured ~4x faster stage-1 scan at dim=768 (609 -> 153 us/query, n=100k,
batch=256, single-thread, Zen5 / AVX-512; see examples/bge_kernel_bench);
1024/1536 unchanged. Byte-identical to scalar: parity tests cover qpv tail
residues 0..7 plus 384/512/768/1024/1536 across all six kernels, an
unchanged-at-512-bit-multiples test, and a dispatch diagnostic.

Stage-1 scan-kernel throughput only — not a whole-pipeline figure.

Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
@qodo-code-review

qodo-code-review Bot commented Jun 14, 2026


Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX issues (0) 🔗 Cross-repo conflicts (0)

Remediation recommended

1. AVX512 tests not enforced ✓ Resolved 🐞 Bug ☼ Reliability
Description
The new avx512_path_matches_scalar_across_residues_and_common_dims tests call only public APIs
that internally dispatch based on avx512vpop_supported(). On machines where that predicate is
false, the tests still pass while only exercising the scalar path, so the new masked-tail AVX-512
kernels may remain unexecuted/unvalidated in many test environments.
Code

src/bitmap.rs[R999-1065]

+    #[test]
+    fn avx512_path_matches_scalar_across_residues_and_common_dims() {
+        for &dim in &PARITY_DIMS {
+            let n = 300usize;
+            let n_top = (dim / 4).max(1);
+            let m = 32usize;
+            let nq = 4usize;
+            let mut rng = ChaCha8Rng::seed_from_u64(9000 + dim as u64);
+            let corpus: Vec<f32> = (0..n * dim).map(|_| rng.random_range(-1.0..1.0)).collect();
+            let mut idx = Bitmap::new(dim, n_top);
+            idx.add(&corpus);
+            let qpv = idx.qwords_per_vec;
+            let queries: Vec<f32> = (0..nq * dim).map(|_| rng.random_range(-1.0..1.0)).collect();
+
+            let batched = idx.top_m_candidates_batched(&queries, m);
+            for qi in 0..nq {
+                let q = &queries[qi * dim..(qi + 1) * dim];
+                let qbm = idx.build_query_bitmap_fp32(q);
+
+                // (1) body_overlap_scores_subset kernel: exact overlap for ALL
+                //     ids vs an independent scalar over the stored bitmaps.
+                let all_ids: Vec<u32> = (0..n as u32).collect();
+                let mut out = vec![0u32; n];
+                idx.body_overlap_scores_subset(&qbm, &all_ids, &mut out);
+                let mut ref_pairs: Vec<(u32, u32)> = Vec::with_capacity(n);
+                #[allow(clippy::needless_range_loop)]
+                for di in 0..n {
+                    let off = di * qpv;
+                    let ov = scalar_overlap(&idx.bitmaps[off..off + qpv], &qbm);
+                    assert_eq!(out[di], ov, "body_overlap dim={dim} qi={qi} di={di}");
+                    ref_pairs.push((ov, di as u32));
+                }
+                // Reference top-m under the library's (overlap desc, id asc) key.
+                ref_pairs.sort_by(|a, b| b.0.cmp(&a.0).then_with(|| a.1.cmp(&b.1)));
+                let reference: Vec<u32> = ref_pairs.iter().take(m).map(|&(_, d)| d).collect();
+
+                // (2) bitmap_scan_collect kernel.
+                assert_eq!(
+                    idx.top_m_candidates(q, m),
+                    reference,
+                    "top_m dim={dim} qi={qi}"
+                );
+                // (3) bitmap_scan_collect_batched kernel.
+                assert_eq!(batched[qi], reference, "batched dim={dim} qi={qi}");
+
+                // (4) bitmap_scan (TopK) kernel via search: the returned m docs
+                //     must be a valid top-m by overlap (tie-policy-independent;
+                //     a wrong scan score would admit an out-of-top-m doc).
+                let res = idx.search(q, m);
+                let got = res.indices_for_query(0);
+                assert_eq!(got.len(), m, "search len dim={dim} qi={qi}");
+                let got_set: std::collections::HashSet<i64> = got.iter().copied().collect();
+                let ov_of =
+                    |di: usize| scalar_overlap(&idx.bitmaps[di * qpv..(di + 1) * qpv], &qbm);
+                let min_in = got.iter().map(|&id| ov_of(id as usize)).min().unwrap();
+                let max_out = (0..n)
+                    .filter(|di| !got_set.contains(&(*di as i64)))
+                    .map(ov_of)
+                    .max()
+                    .unwrap_or(0);
+                assert!(
+                    min_in >= max_out,
+                    "search not a valid top-m: dim={dim} qi={qi} min_in={min_in} max_out={max_out}"
+                );
+            }
+        }
+    }
Evidence
The new tests use only high-level APIs; those APIs select AVX-512 kernels only when
avx512vpop_supported() is true, otherwise they fall back to scalar. Since avx512vpop_supported()
returns false on non-x86_64 (and on x86_64 without AVX-512 VPOPCNTDQ), these tests can pass without
ever executing the new masked-tail AVX-512 code paths.

src/bitmap.rs[999-1065]
src/bitmap.rs[565-578]
src/sign_bitmap.rs[1053-1104]
src/lib.rs[97-117]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The tests named `avx512_path_matches_scalar_…` don’t ensure the AVX-512 path actually ran; they rely on runtime dispatch, so on non-AVX512 hosts they validate only the scalar implementation.

## Issue Context
The dispatch is guarded by `crate::avx512vpop_supported()`; when it is false (non-x86_64 or x86_64 without the features), the AVX-512 functions are never invoked.

## Fix Focus Areas
- src/bitmap.rs[565-578]
- src/bitmap.rs[999-1065]
- src/sign_bitmap.rs[1053-1104]
- src/lib.rs[97-117]

## Suggested fix
Adjust the parity tests so that when `avx512vpop_supported()` is true, they directly call the `*_avx512vpop` functions (guarded by the predicate to avoid illegal instructions) and compare against an explicit scalar reference. When the predicate is false, either rename the tests to avoid claiming AVX-512 coverage, or early-return/skip with a clear message so the test suite doesn’t imply SIMD validation occurred.



Informational

2. Skip notice not visible ✓ Resolved 🐞 Bug ◔ Observability
Description
masked_tail_kernel_matches_scalar_when_avx512_present returns early when AVX-512 VPOPCNTDQ isn’t
detected and relies on eprintln! as a “notice”, but CI runs cargo test without -- --nocapture,
so that message is typically not surfaced and the test appears to pass even when it didn’t exercise
the masked-tail kernel.
Code

src/bitmap.rs[R1106-1112]

+        if !crate::avx512vpop_supported() {
+            eprintln!(
+                "masked_tail test skipped: AVX-512 VPOPCNTDQ not present on this host \
+                 (the tail kernel is exercised by the Intel SDE CI job)"
+            );
+            return;
+        }
Evidence
The test’s skip path is an early return guarded only by an eprintln!, and the CI AVX-512 job
invokes cargo test without -- --nocapture, so the intended “notice” is not reliably visible in
CI logs for passing tests.

src/bitmap.rs[1097-1112]
.github/workflows/ci.yml[432-440]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The AVX-512 masked-tail test currently “skips with a notice” by printing to stderr and returning early when AVX-512 VPOPCNTDQ isn’t available. In standard `cargo test` runs (including CI), output from passing tests is usually captured/suppressed, so the skip is not clearly visible and can be misread as coverage.

## Issue Context
This test exists specifically to provide confidence that the masked-tail AVX-512 kernel is exercised when available; its current skip reporting mechanism is not reliably observable.

## Fix Focus Areas
- src/bitmap.rs[1097-1112]
- .github/workflows/ci.yml[432-440]

## Suggested fix
Choose one of:
1) Add an env-var “require” mode:
  - In the test: if `ORDVEC_REQUIRE_AVX512_VPOPCNTDQ=1` (or similar) and `avx512vpop_supported()` is false, `panic!` instead of returning.
  - In the AVX-512 SDE CI job: set that env var so the job fails if the test can’t actually run.
2) Make CI always show output for this job (e.g., `cargo test -- --nocapture`), so the skip notice is visible in logs.
3) If you intentionally want silent skip, remove/adjust the comment that claims the skip is "with a notice" (so it doesn’t imply visibility you don’t actually have).


3. No-op dispatch test ✓ Resolved 🐞 Bug ⚙ Maintainability
Description
scan_dispatch_is_dimension_independent compares avx512vpop_supported() to itself, so it will
always pass and cannot detect regressions in scan dispatch behavior. This test currently provides
false confidence without asserting any property of the dispatch logic.
Code

src/bitmap.rs[R1096-1109]

+    #[test]
+    fn scan_dispatch_is_dimension_independent() {
+        // The qpv % 8 gate is gone: the SignBitmap/Bitmap scan dispatch reads
+        // only `avx512vpop_supported()`, which takes no dimension. So on a
+        // VPOPCNTDQ host, 384 (qpv=6) and 768 (qpv=12) — previously routed to
+        // the scalar fallback — take the SAME kernel as 1024/1536. Bit-identity
+        // at those dims is proven by the parity test above; the ~4x speedup is
+        // shown by `examples/bge_kernel_bench`. No dimension can be special-cased
+        // back to scalar because the predicate is dim-free.
+        assert_eq!(
+            crate::avx512vpop_supported(),
+            crate::avx512vpop_supported(),
+            "dispatch predicate must be pure"
+        );
Evidence
The test asserts equality of the same function call, so it cannot fail regardless of future changes
to dispatch logic.

src/bitmap.rs[1096-1109]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`scan_dispatch_is_dimension_independent` is tautological: it asserts `avx512vpop_supported() == avx512vpop_supported()`, which is always true.

## Issue Context
The comment claims it verifies the removal of the `qpv % 8` gate, but the assertion does not test dimension-independence (or any observable dispatch behavior).

## Fix Focus Areas
- src/bitmap.rs[1096-1109]

## Suggested fix
Either remove this test, or replace it with a test that checks an observable property (e.g., under `if avx512vpop_supported()` directly invoke the AVX-512 kernel with a non-multiple-of-8 `qpv` and compare to a scalar reference, or add a `#[cfg(test)]` hook/flag to prove the scalar fallback was not taken).

@qodo-code-review


PR Summary by Qodo

AVX-512 masked-tail scan so all 64-bit-multiple dims use VPOPCNTDQ
✨ Enhancement 🧪 Tests 📝 Documentation 🕐 40+ Minutes

Walkthroughs

Description
• Remove the qpv % 8 dispatch gate so AVX-512 scan runs for any dim % 64 == 0.
• Add a masked-load tail epilogue to SignBitmap/Bitmap VPOPCNTDQ scan kernels.
• Add parity tests + a repro bench example; document the new behavior and speedup.
Diagram
graph TD
  A["SignBitmap / Bitmap APIs"] --> B{"avx512vpop_supported()?"} -->|yes| C["AVX-512 scan kernels"] --> D["loadu 8×u64 groups"] --> E["maskz tail load"] --> F["popcnt reduce + scores"]
  B -->|no| G["Scalar scan"]
  H[("Bitmaps u64 rows")] --> C

  subgraph Legend
    direction LR
    _api["API/module"] ~~~ _dec{"Dispatch"} ~~~ _db[("Data buffer")]
  end
High-Level Assessment

The following are alternative approaches to this PR:

1. Pad bitmap rows to `qpv` multiple-of-8 at build time
  • ➕ Eliminates masked-tail branch and tail-mask bookkeeping inside hot kernels
  • ➕ Keeps kernel inner loops uniform across dimensions
  • ➖ Increases memory footprint for common dims (e.g., 384/768) and may impact cache behavior
  • ➖ Requires changing/guaranteeing storage layout invariants (potentially a breaking change)
  • ➖ Doesn't help when scanning externally-provided slices that aren’t padded
2. Scalar tail loop after vectorized chunks
  • ➕ Simpler to reason about than masked fault-suppressed loads
  • ➕ Avoids relying on masked-load semantics for bounds safety
  • ➖ Tail becomes branchy and potentially slower than one masked vector op
  • ➖ More instructions and code paths to test; less consistent with the AVX-512 style in this crate

Recommendation: The masked-load epilogue is the best trade-off: it preserves the existing storage format and removes the scalar performance cliff for non-512-bit-multiple dims while keeping previously-fast dims unchanged. Padding could marginally improve the 768≈1024 tail cost, but it introduces memory/layout complexity that likely outweighs the gain.
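The memory cost of the padding alternative can be put in numbers with a short sketch. The helper names `padded_qpv` and `padding_extra_bytes` are hypothetical, and the calculation assumes rows are stored as contiguous u64 words with nothing else per row:

```rust
/// Words per row after rounding qpv up to the next multiple of 8.
fn padded_qpv(dim: usize) -> usize {
    assert!(dim % 64 == 0, "dim must be a multiple of 64");
    let qpv = dim / 64;
    (qpv + 7) / 8 * 8
}

/// Extra bytes that padding would add for a corpus of `n` rows.
fn padding_extra_bytes(dim: usize, n: usize) -> usize {
    let qpv = dim / 64;
    (padded_qpv(dim) - qpv) * std::mem::size_of::<u64>() * n
}
```

Under these assumptions, dim=768 grows from 12 to 16 words per row and dim=384 from 6 to 8, a one-third increase in both cases, while dim=1024 is unaffected. For n=100k rows at dim=768 that is 3.2 MB of padding on top of 9.6 MB of bitmap data.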

File Changes

Enhancement (4)
bge_kernel_bench.rs Add reproducible stage-1 scan microbench for BGE-style dims +56/-0

• Introduces an example that times 'SignBitmap::score_all_batched_flat' with seeded random inputs and reports per-query microseconds. Intended to demonstrate that dim=768 no longer falls back to scalar and to support A/B perf verification.

examples/bge_kernel_bench.rs
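The measurement scheme the bench setup describes (a median over repeated runs, reported per query) can be sketched as below. This is not the contents of `examples/bge_kernel_bench.rs`; the helper names `median`, `time_median`, and `micros_per_query` are assumptions for illustration.

```rust
use std::time::{Duration, Instant};

/// Median of the samples (upper median when the count is even).
fn median(mut samples: Vec<Duration>) -> Duration {
    assert!(!samples.is_empty(), "need at least one sample");
    samples.sort();
    samples[samples.len() / 2]
}

/// Runs `work` `reps` times and returns the median wall-clock time of one run.
fn time_median<F: FnMut()>(reps: usize, mut work: F) -> Duration {
    let samples: Vec<Duration> = (0..reps)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed()
        })
        .collect();
    median(samples)
}

/// Converts the time for one batch into microseconds per query.
fn micros_per_query(batch_time: Duration, batch: usize) -> f64 {
    batch_time.as_secs_f64() * 1e6 / batch as f64
}
```

A caller would wrap one batched scan in the closure, for example `time_median(40, || { /* scan one batch of 256 queries */ })`, and pass the result to `micros_per_query` with the batch size.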


bitmap.rs Add masked-tail AVX-512 epilogues + parity tests for Bitmap scans +253/-57

• Removes the 'qpv.is_multiple_of(8)' dispatch gate in favor of 'crate::avx512vpop_supported()'. Updates AVX-512 kernels ('bitmap_scan*' and 'body_overlap_scores_subset') to process full 8×u64 chunks plus a masked-load tail, tightening safety comments and asserts accordingly. Adds comprehensive tests validating AVX-512 vs scalar parity across tail residues and common embedding dims, plus invariance checks for previously-vectorized dims.

src/bitmap.rs


lib.rs Centralize AVX-512 VPOPCNTDQ availability check +22/-0

• Adds '#[doc(hidden)] pub fn avx512vpop_supported() -> bool' to encapsulate runtime feature detection ('avx512f' + 'avx512vpopcntdq'). Establishes a dimension-independent dispatch contract for all scan kernels.

src/lib.rs


sign_bitmap.rs Add masked-tail AVX-512 epilogues + parity tests for SignBitmap scans +136/-29

• Switches SignBitmap scan dispatch to 'avx512vpop_supported()' (removing the 'qpv % 8' gate). Updates single and batched AVX-512 VPOPCNTDQ kernels to handle a 'qpv % 8' remainder via '_mm512_maskz_loadu_epi64' for both query and doc tails, and clarifies safety reasoning. Adds parity tests ensuring byte-identical results to scalar references across all tail residues and common dims.

src/sign_bitmap.rs


Documentation (1)
CHANGELOG.md Document AVX-512 tail handling and dim-independent dispatch +22/-0

• Adds an Unreleased performance note explaining the prior 'dim'-multiple-of-512 limitation and the new masked-tail AVX-512 behavior for any 'dim % 64 == 0'. Mentions measured speedup on 768-dim workloads and introduces 'avx512vpop_supported()' as the sole dispatch predicate.

CHANGELOG.md


@codecov

codecov Bot commented Jun 14, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request extends the AVX-512 VPOPCNTDQ scan kernels in Bitmap and SignBitmap to support any dimension that is a multiple of 64, rather than restricting them to multiples of 512 bits. This is achieved by processing trailing words with a masked load (_mm512_maskz_loadu_epi64), which significantly improves performance for common embedding widths like 384 and 768. Additionally, a new benchmark and comprehensive parity tests have been added to ensure correctness. There are no review comments, so I have no feedback to provide.

…heck

PR review: `scan_dispatch_is_dimension_independent` asserted
`avx512vpop_supported() == avx512vpop_supported()` — it cannot catch a dispatch
regression. Replace it with `masked_tail_kernel_matches_scalar_when_avx512_present`,
which on an AVX-512 VPOPCNTDQ host (or the Intel SDE CI job) builds qpv % 8 != 0
dims (384/768/832 — which force the masked tail), runs the real dispatched path,
and asserts byte-identity to an independent scalar overlap reference. On a
non-AVX-512 host it SKIPS with a notice rather than silently passing on the
scalar path, so a green run there is not mistaken for tail-kernel coverage. The
cross-platform scalar parity remains in
`avx512_path_matches_scalar_across_residues_and_common_dims`.

Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
@Fieldnote-Echo

Member Author

/agentic_review

@qodo-code-review

qodo-code-review Bot commented Jun 14, 2026


Code review by qodo was updated up to the latest commit 255a892

@Fieldnote-Echo

Member Author

/agentic_review

@qodo-code-review

qodo-code-review Bot commented Jun 14, 2026


Code review by qodo was updated up to the latest commit 255a892

Add a require_avx512_or_skip helper to the bitmap and sign_bitmap test
modules. When ORDVEC_REQUIRE_AVX512 is set to '1' or 'true' and the
host lacks AVX-512 VPOPCNTDQ, the helper panics loudly instead of
silently skipping. When the env var is not set, it emits a visible
eprintln! skip notice to stderr and returns false so the caller bails.

Apply the helper to all AVX-512-named tests in both modules
(avx512_path_matches_scalar_across_residues_and_common_dims,
avx512_path_matches_scalar_at_production_dim,
masked_tail_kernel_matches_scalar_when_avx512_present), replacing the
ad-hoc eprintln!+return in the masked-tail test.

Wire ORDVEC_REQUIRE_AVX512=1 into the Intel SDE CI job so the SDE
lane genuinely enforces the kernels rather than silently treating a
skipped test as green coverage.

Addresses qodo findings: 'AVX512 tests not enforced' (Reliability)
and 'Skip notice not visible' (Observability).

Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
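
The skip-or-require behaviour in the commit message above could look roughly like this. It is a sketch under assumptions: only the helper name `require_avx512_or_skip` and the `ORDVEC_REQUIRE_AVX512` variable come from the commit message; the `supported` parameter, the split into a pure `avx512_test_decision` function, and the message texts are illustrative choices.

```rust
use std::env;

/// Pure decision logic, kept separate so it can be tested without touching
/// the process environment.
/// Ok(true) = run the test, Ok(false) = skip, Err = fail loudly.
fn avx512_test_decision(require: Option<&str>, supported: bool) -> Result<bool, String> {
    if supported {
        return Ok(true);
    }
    match require {
        Some("1") | Some("true") => Err(
            "ORDVEC_REQUIRE_AVX512 is set but AVX-512 VPOPCNTDQ is not available".to_string(),
        ),
        _ => Ok(false),
    }
}

/// Returns true when the caller should run its AVX-512 assertions.
/// `supported` is passed in here; a real helper would presumably query the
/// crate's dispatch predicate itself.
fn require_avx512_or_skip(supported: bool) -> bool {
    let require = env::var("ORDVEC_REQUIRE_AVX512").ok();
    match avx512_test_decision(require.as_deref(), supported) {
        Ok(true) => true,
        Ok(false) => {
            eprintln!("AVX-512 test skipped: VPOPCNTDQ not present on this host");
            false
        }
        Err(msg) => panic!("{}", msg),
    }
}
```

A test would start with `if !require_avx512_or_skip(supported) { return; }`. With the variable set in the SDE lane, a host that cannot run the kernels fails the job instead of passing silently.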
…ling

Signed-off-by: Nelson Spence <nelson@projectnavi.ai>

# Conflicts:
#	CHANGELOG.md
@project-navi-bot Navi Bot (project-navi-bot) merged commit 81fcfe4 into main Jun 15, 2026
38 checks passed
@project-navi-bot Navi Bot (project-navi-bot) deleted the perf/avx512-tail-handling branch June 15, 2026 00:06