Skip to content

[codex] Reuse scalar subset rerank LUT scratch#250

Merged
Navi Bot (project-navi-bot) merged 2 commits into
mainfrom
codex/prerelease-scalar-subset-scratch
Jun 19, 2026
Merged

[codex] Reuse scalar subset rerank LUT scratch#250
Navi Bot (project-navi-bot) merged 2 commits into
mainfrom
codex/prerelease-scalar-subset-scratch

Conversation

@Fieldnote-Echo

Copy link
Copy Markdown
Member

Summary

  • add scratch-owned scalar LUT storage for caller-owned subset rerank
  • route scalar and b=8 subset rerank through reusable LUT buffers
  • remove the scalar fallback carve-out from the allocation-free rerank contract

Closes #235.

Validation

  • cargo fmt --check
  • cargo test --features test-utils --test alloc_free -- --nocapture
  • cargo test --features test-utils --test index batched_subset_rerank_matches_scalar_reference_across_tiers -- --nocapture
  • cargo test --features test-utils --test index batched_into_is_allocation_free_after_warmup -- --nocapture
  • cargo test --test index quant_b8 -- --nocapture
  • cargo check
  • git diff --check

@codecov

codecov Bot commented Jun 19, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request makes the scalar fallback path for subset reranking allocation-free by reusing a pre-allocated LUT buffer in SubsetScratch, matching the behavior of the SIMD path. Documentation and tests are updated to reflect that both paths are now allocation-free after warmup. The review feedback suggests using chunks_exact_mut in build_asym_lut_into and build_b8_asym_lut_into to avoid manual indexing and eliminate bounds checks, making the code more idiomatic and potentially more performant.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/quant_kernels.rs
Comment thread src/quant_kernels.rs
@Fieldnote-Echo Nelson Spence (Fieldnote-Echo) force-pushed the codex/prerelease-scalar-subset-scratch branch from d916ca8 to f1cd40e Compare June 19, 2026 16:00
Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
@chatgpt-codex-connector

Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@qodo-code-review

Copy link
Copy Markdown

PR Summary by Qodo

Reuse scalar subset rerank LUT scratch to make rerank allocation-free
✨ Enhancement 📝 Documentation 🧪 Tests 🕐 40+ Minutes

Grey Divider

Description

• Add scratch-owned scalar LUT storage to reuse buffers across subset rerank calls.
• Route scalar and b=8 asymmetric rerank through caller-provided LUT buffers.
• Update allocation-free rerank contract and tests to remove scalar fallback carve-out.
Diagram

graph TD
  A["Caller"] --> B["RankQuant _into"] --> C["SubsetScratch"] --> D["scalar_lut Vec<f32>"]
  B --> E["b=8 rerank"] --> F["scan_b8_asym_with_lut"] --> G["TopK results"]
  B --> H["scalar rerank"] --> I["scan_via_lut_scalar_with_lut"] --> G

  subgraph Legend
    direction LR
    _api["API/entrypoint"] ~~~ _buf["Reusable buffer"] ~~~ _ker["Kernel"]
  end
Loading
High-Level Assessment

The following are alternative approaches to this PR:

1. Cache LUT inside RankQuant instance
  • ➕ No need to expand SubsetScratch API surface
  • ➕ Simplifies scratch ownership for single-threaded callers
  • ➖ Harder to use safely with external parallelism (would need per-thread/per-worker storage)
  • ➖ Blurs separation between immutable index state and per-call scratch buffers
2. Use thread-local LUT cache
  • ➕ Avoids passing/owning LUT in scratch explicitly
  • ➕ Works well when caller uses OS threads with stable lifetimes
  • ➖ Less predictable memory behavior for embedders (TLS lifetime and teardown)
  • ➖ Complicates testing and allocation-free guarantees across runtimes (e.g., WASM, custom thread pools)
3. Specialize scalar-only scratch type
  • ➕ Keeps scalar concerns out of the main SubsetScratch used by SIMD paths
  • ➖ Adds API and type complexity (more generics/branching in call sites)
  • ➖ Callers would need additional dispatch logic to choose scratch type

Recommendation: Current approach (adding a reusable LUT buffer to SubsetScratch and threading it through scalar and b=8 scans) is the best fit for the stated contract: allocation-free rerank after warmup in caller-managed parallel runtimes. It keeps ownership explicit, works cross-platform (including scalar-only targets), and avoids implicit TLS or shared mutable state concerns.

Files changed (5) +88 / -48

Enhancement (1) +14 / -8
quant.rsAdd scalar LUT storage to SubsetScratch and route rerank through LUT-reuse kernels +14/-8

Add scalar LUT storage to SubsetScratch and route rerank through LUT-reuse kernels

• Extends SubsetScratch with a 'scalar_lut: Vec<f32>' buffer and exposes its capacity in the test helper. Updates subset rerank dispatch to call the new '*_with_lut' kernels for b=8 and scalar paths, and updates doc comments to reflect allocation-free scalar rerank after warmup.

src/quant.rs

Refactor (1) +64 / -16
quant_kernels.rsIntroduce LUT-into builders and with-LUT scan entrypoints for scalar and b=8 +64/-16

Introduce LUT-into builders and with-LUT scan entrypoints for scalar and b=8

• Refactors LUT construction to support caller-provided buffers ('build_asym_lut_into', 'build_b8_asym_lut_into') and adds 'scan_via_lut_scalar_with_lut' and 'scan_b8_asym_with_lut' to reuse capacity after warmup. Keeps the original entrypoints as thin wrappers allocating a fresh LUT for convenience, and updates AVX-512 gather tests to use the new LUT builder.

src/quant_kernels.rs

Tests (2) +4 / -16
alloc_free.rsRemove SIMD-only gating from allocation-free rerank test +0/-15

Remove SIMD-only gating from allocation-free rerank test

• Deletes the conditional skip that previously exempted scalar fallback from the strict zero-allocation assertion. The test now enforces the allocation-free contract uniformly after warmup.

tests/alloc_free.rs

quant_b8.rsMake panic payload extraction explicit and robust +4/-1

Make panic payload extraction explicit and robust

• Replaces 'res.err().expect(...)' with an explicit 'match' to make the test logic clearer and avoid relying on chained unwraps when extracting the panic payload string.

tests/index/quant_b8.rs

Documentation (1) +6 / -8
README.mdUpdate allocation-free rerank contract to include scalar path +6/-8

Update allocation-free rerank contract to include scalar path

• Removes documentation statements claiming scalar fallback allocates a per-query LUT. Clarifies that '_into' rerank is allocation-free after warmup for both SIMD and scalar by reusing SubsetScratch and output buffers.

README.md

@qodo-code-review

qodo-code-review Bot commented Jun 19, 2026

Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0) 📎 Requirement gaps (0) 📜 Skill insights (0)

Context used

Grey Divider


Action required

1. capacities_for_test() return changed ✓ Resolved 📎 Requirement gap ⚙ Maintainability
Description
SubsetScratch::capacities_for_test changes its public return type from a 4-tuple to a 5-tuple,
which is a breaking API change for any downstream code using this (even if #[doc(hidden)]). This
violates the requirement to keep public APIs unchanged for callers.
Code

src/quant.rs[R87-94]

+    pub fn capacities_for_test(&self) -> (usize, usize, usize, usize, usize) {
        (
            self.q_unit.capacity(),
            self.sub_packed.capacity(),
+            self.scalar_lut.capacity(),
            self.local_indices.capacity(),
            self.final_order.capacity(),
        )
Relevance

⭐⭐⭐ High

Team previously accepted limiting doc(hidden) test APIs to avoid semver commitments; breaking pub
return likely blocked (PR215).

PR-#215

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
PR Compliance ID 2 forbids breaking public API changes. The diff shows capacities_for_test is
pub and its return type is modified to add an additional element for scalar_lut capacity.

Public APIs and external behavior remain unchanged
src/quant.rs[82-95]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`SubsetScratch::capacities_for_test` is `pub` and its return type changed from `(usize, usize, usize, usize)` to `(usize, usize, usize, usize, usize)`, which is a breaking change.

## Issue Context
Although it is marked `#[doc(hidden)]`, it is still part of the crate's public API surface and can be used by downstream crates (including those enabling `test-utils`). The compliance checklist requires no breaking/user-visible API changes.

## Fix Focus Areas
- src/quant.rs[82-95]

## Suggested approach
- Keep the existing `capacities_for_test()` signature unchanged (return the original 4-tuple), and add a new method (e.g. `capacities_for_test_v2()` or `capacities_for_test_with_scalar_lut()`) that returns the 5-tuple including `scalar_lut` capacity.
- Update internal/integration tests to call the new method while preserving backward compatibility.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

2. b8 LUT shape unchecked ✓ Resolved 🐞 Bug ☼ Reliability
Description
scan_b8_asym_with_lut builds the LUT sized from q_unit.len() but then uses the separate dim
parameter to drive both the scalar scan and the AVX-512 gather path. If a future internal caller
passes mismatched dim/q_unit, the AVX-512 gather kernel can read past the LUT in release builds
because it only debug_asserts the lut.len() == dim * 256 invariant.
Code

src/quant_kernels.rs[R604-616]

+pub(crate) fn scan_b8_asym_with_lut(
+    packed: &[u8],
+    n: usize,
+    dim: usize,
+    q_unit: &[f32],
+    scale: f32,
+    top: &mut TopK,
+    lut: &mut Vec<f32>,
+) {
+    build_b8_asym_lut_into(lut, q_unit);
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512f")
Relevance

⭐⭐⭐ High

Team previously accepted replacing debug-only preconditions to prevent release-build OOB/UB in SIMD
paths (PR #19, #228).

PR-#19
PR-#228

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The new reusable-LUT entry point builds the LUT from q_unit length, not dim, and then calls the
AVX-512 gather kernel which relies on lut.len() == dim * 256 but only checks it with
debug_assert_eq! before using lut.as_ptr() for gathers.

src/quant_kernels.rs[604-633]
src/quant_kernels.rs[169-177]
src/quant_kernels.rs[666-699]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`scan_b8_asym_with_lut` does not validate that `q_unit.len()` matches the `dim` parameter, even though it forwards `dim` to the scan routines and the AVX-512 gather kernel’s safety relies on `lut.len() == dim * 256`.

## Issue Context
- The LUT builder sizes from `q_unit.len()`.
- The AVX-512 gather kernel uses `lut.as_ptr()` and only `debug_assert`s the length invariant, so a mismatch could become an OOB read in release.

## Fix Focus Areas
- Add a hard invariant check early in `scan_b8_asym_with_lut`, e.g. `assert_eq!(q_unit.len(), dim, "...")`, before building/using the LUT.
- (Optional) After building, you may also `debug_assert_eq!(lut.len(), dim * 256)` for clarity.

- src/quant_kernels.rs[604-633]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. Weak LUT invariant check ✓ Resolved 🐞 Bug ≡ Correctness
Description
build_asym_lut_into now only debug_asserts q_unit.len() == dim and then uses zip, so in
release builds a length mismatch will silently leave part of the LUT zeroed (or ignore extra
q_unit), producing incorrect scores without failing fast. This turns a would-be panic into a
silent mis-scoring failure mode that is harder to detect during future refactors.
Code

src/quant_kernels.rs[R46-50]

+    debug_assert_eq!(q_unit.len(), dim);
+    lut.resize(dim * n_buckets, 0.0);
+    for (&qd, row) in q_unit.iter().zip(lut.chunks_exact_mut(n_buckets)) {
+        for (b, slot) in row.iter_mut().enumerate() {
+            *slot = qd * bucket_centre(b as u8, bits);
Relevance

⭐⭐⭐ High

Repo favors fail-loud release checks over debug_assert for invariants (e.g., made length
preconditions unconditional in PR #19).

PR-#19
PR-#26

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The code shows a debug_assert (compiled out in release) and a zip() iteration, which will not
panic on length mismatch and will instead stop early, leaving trailing LUT entries at their resized
default 0.0 values.

src/quant_kernels.rs[39-53]
src/quant_kernels.rs[55-75]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`build_asym_lut_into()` relies on `debug_assert_eq!(q_unit.len(), dim)` and then iterates via `q_unit.iter().zip(lut.chunks_exact_mut(n_buckets))`. In release builds, `debug_assert` is disabled and `zip()` truncates to the shorter iterator, so a mismatched `(dim, q_unit.len())` can silently produce a partially-zero LUT and wrong scores.

## Issue Context
This function takes both `dim` and `q_unit` as independent inputs, so a future call site or refactor could desynchronize them. Previously, indexing-style loops would naturally panic on a short slice; the new iterator form removes that fail-fast behavior in release.

## Fix Focus Areas
- src/quant_kernels.rs[39-53]

## Suggested fix
- Replace `debug_assert_eq!(q_unit.len(), dim)` with `assert_eq!(q_unit.len(), dim)` (or otherwise return an error if you prefer non-panicking internal APIs).
- Alternatively, remove the `dim` parameter and derive `dim` from `q_unit.len()` at the boundary, so the function cannot be called with inconsistent shapes.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Informational

4. Test probe leaks into API 🐞 Bug ⚙ Maintainability
Description
SubsetScratch::capacities_with_scalar_lut_for_test is a new pub method compiled into normal
builds (only #[doc(hidden)]), expanding the public API surface with a test-only diagnostic. This
increases the chance downstream code accidentally relies on it and constrains future refactors of
scratch internals.
Code

src/quant.rs[R95-98]

+
+    /// Test-only capacity probe that includes the scalar LUT scratch buffer.
+    #[doc(hidden)]
+    pub fn capacities_with_scalar_lut_for_test(&self) -> (usize, usize, usize, usize, usize) {
Relevance

⭐⭐⭐ High

Team previously accepted feature-gating doc-hidden diagnostic APIs to avoid public surface (PR
#215).

PR-#215

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The new capacity probe is declared pub without a feature gate, while a similar test-only dispatch
probe in the same file is explicitly gated behind feature = "test-utils", indicating the intended
pattern for keeping diagnostics out of the stable surface.

src/quant.rs[70-106]
src/quant.rs[329-354]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`capacities_with_scalar_lut_for_test()` is a test-only diagnostic but is currently a `pub` method available in all builds (it is only hidden from docs). This unnecessarily expands the public API surface.

## Issue Context
The crate already uses `#[cfg(feature = "test-utils")]` to keep other test/diagnostic surfaces (e.g. `subset_rerank_uses_simd`) out of normal builds.

## Fix Focus Areas
- src/quant.rs[96-106]

## Suggested fix
- Add `#[cfg(feature = "test-utils")]` to `capacities_with_scalar_lut_for_test` (and likely to `capacities_for_test` for consistency), and ensure integration tests that call it are run with `--features test-utils` (as they appear to be already).
- If you want the probe callable without features, consider moving it behind a dedicated, clearly non-stable module/trait to reduce accidental downstream use.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


5. Scalar path not forced in tests ✓ Resolved 📎 Requirement gap ☼ Reliability
Description
The allocation-free test change removes conditional skipping but still does not force the scalar
rerank path, so CI on SIMD-capable x86 hosts may never exercise scalar fallback allocation behavior.
This fails the requirement to isolate/force scalar allocation testing independent of host CPU SIMD
capabilities.
Code

tests/alloc_free.rs[L61-74]

-    // The zero-allocation guarantee holds only when the rerank takes a SIMD
-    // kernel: the scalar LUT fallback (`scan_via_lut_scalar`) allocates a
-    // per-query LUT. Gate on the SAME dispatch decision the rerank reads — via
-    // `subset_rerank_uses_simd`, so the gate cannot drift from the actual
-    // dispatch — and skip the strict check on hosts that fall to scalar
-    // (aarch64, or x86 without AVX2+FMA / AVX-512).
-    if !ordvec::subset_rerank_uses_simd(dim, bits) {
-        eprintln!(
-            "alloc_free: rerank uses the scalar LUT fallback for \
-             (dim={dim}, bits={bits}) — it allocates a per-query LUT; \
-             skipping strict zero-alloc check"
-        );
-        return;
-    }
Relevance

⭐ Low

Alloc-free test previously gated on SIMD because scalar allocated; contract changed to alloc-free
scalar too, so no need to force scalar (PR215).

PR-#215

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
PR Compliance ID 4 requires tests to force or isolate scalar fallback allocation behavior
independent of host SIMD. The diff shows the test’s SIMD-dispatch-dependent skip logic was removed,
but no mechanism was added to force scalar fallback execution on SIMD-capable hosts.

Tests isolate and cover scalar allocation behavior independent of host CPU SIMD capabilities
tests/alloc_free.rs[53-105]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The test suite does not reliably force the scalar fallback rerank path; allocation-free assertions can still run exclusively on SIMD paths depending on the host CPU.

## Issue Context
The compliance requirement is to validate scalar fallback allocation behavior in an automated, host-independent way. Simply removing the skip/gate does not guarantee scalar coverage on AVX2/AVX-512 machines.

## Fix Focus Areas
- tests/alloc_free.rs[53-105]

## Suggested approach
- Choose a `(dim, bits)` that is constructor-valid but intentionally routes to scalar on all x86 tiers (e.g. `bits=2` with `dim` divisible by 4 but NOT by 16, such as `dim=132`).
- Add an assertion like `assert!(!ordvec::subset_rerank_uses_simd(dim, bits));` (behind `feature = "test-utils"`) to ensure the test is truly exercising scalar fallback.
- Keep the warmup + allocation-counter window assertion as-is to validate steady-state zero allocations on the scalar path.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Previous review results

Review updated until commit 5fd5c7e

Results up to commit a909ad5


🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX issues (0) 🔗 Cross-repo conflicts (0) 📜 Skill insights (0)


Action required
1. capacities_for_test() return changed ✓ Resolved 📎 Requirement gap ⚙ Maintainability
Description
SubsetScratch::capacities_for_test changes its public return type from a 4-tuple to a 5-tuple,
which is a breaking API change for any downstream code using this (even if #[doc(hidden)]). This
violates the requirement to keep public APIs unchanged for callers.
Code

src/quant.rs[R87-94]

+    pub fn capacities_for_test(&self) -> (usize, usize, usize, usize, usize) {
        (
            self.q_unit.capacity(),
            self.sub_packed.capacity(),
+            self.scalar_lut.capacity(),
            self.local_indices.capacity(),
            self.final_order.capacity(),
        )
Relevance

⭐⭐⭐ High

Team previously accepted limiting doc(hidden) test APIs to avoid semver commitments; breaking pub
return likely blocked (PR215).

PR-#215

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
PR Compliance ID 2 forbids breaking public API changes. The diff shows capacities_for_test is
pub and its return type is modified to add an additional element for scalar_lut capacity.

Public APIs and external behavior remain unchanged
src/quant.rs[82-95]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`SubsetScratch::capacities_for_test` is `pub` and its return type changed from `(usize, usize, usize, usize)` to `(usize, usize, usize, usize, usize)`, which is a breaking change.

## Issue Context
Although it is marked `#[doc(hidden)]`, it is still part of the crate's public API surface and can be used by downstream crates (including those enabling `test-utils`). The compliance checklist requires no breaking/user-visible API changes.

## Fix Focus Areas
- src/quant.rs[82-95]

## Suggested approach
- Keep the existing `capacities_for_test()` signature unchanged (return the original 4-tuple), and add a new method (e.g. `capacities_for_test_v2()` or `capacities_for_test_with_scalar_lut()`) that returns the 5-tuple including `scalar_lut` capacity.
- Update internal/integration tests to call the new method while preserving backward compatibility.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Informational
2. Scalar path not forced in tests ✓ Resolved 📎 Requirement gap ☼ Reliability
Description
The allocation-free test change removes conditional skipping but still does not force the scalar
rerank path, so CI on SIMD-capable x86 hosts may never exercise scalar fallback allocation behavior.
This fails the requirement to isolate/force scalar allocation testing independent of host CPU SIMD
capabilities.
Code

tests/alloc_free.rs[L61-74]

-    // The zero-allocation guarantee holds only when the rerank takes a SIMD
-    // kernel: the scalar LUT fallback (`scan_via_lut_scalar`) allocates a
-    // per-query LUT. Gate on the SAME dispatch decision the rerank reads — via
-    // `subset_rerank_uses_simd`, so the gate cannot drift from the actual
-    // dispatch — and skip the strict check on hosts that fall to scalar
-    // (aarch64, or x86 without AVX2+FMA / AVX-512).
-    if !ordvec::subset_rerank_uses_simd(dim, bits) {
-        eprintln!(
-            "alloc_free: rerank uses the scalar LUT fallback for \
-             (dim={dim}, bits={bits}) — it allocates a per-query LUT; \
-             skipping strict zero-alloc check"
-        );
-        return;
-    }
Relevance

⭐ Low

Alloc-free test previously gated on SIMD because scalar allocated; contract changed to alloc-free
scalar too, so no need to force scalar (PR215).

PR-#215

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
PR Compliance ID 4 requires tests to force or isolate scalar fallback allocation behavior
independent of host SIMD. The diff shows the test’s SIMD-dispatch-dependent skip logic was removed,
but no mechanism was added to force scalar fallback execution on SIMD-capable hosts.

Tests isolate and cover scalar allocation behavior independent of host CPU SIMD capabilities
tests/alloc_free.rs[53-105]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The test suite does not reliably force the scalar fallback rerank path; allocation-free assertions can still run exclusively on SIMD paths depending on the host CPU.

## Issue Context
The compliance requirement is to validate scalar fallback allocation behavior in an automated, host-independent way. Simply removing the skip/gate does not guarantee scalar coverage on AVX2/AVX-512 machines.

## Fix Focus Areas
- tests/alloc_free.rs[53-105]

## Suggested approach
- Choose a `(dim, bits)` that is constructor-valid but intentionally routes to scalar on all x86 tiers (e.g. `bits=2` with `dim` divisible by 4 but NOT by 16, such as `dim=132`).
- Add an assertion like `assert!(!ordvec::subset_rerank_uses_simd(dim, bits));` (behind `feature = "test-utils"`) to ensure the test is truly exercising scalar fallback.
- Keep the warmup + allocation-counter window assertion as-is to validate steady-state zero allocations on the scalar path.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Results up to commit 020882f


🐞 Bugs (1) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX issues (0) 🔗 Cross-repo conflicts (0) 📜 Skill insights (0)


Remediation recommended
1. Weak LUT invariant check ✓ Resolved 🐞 Bug ≡ Correctness
Description
build_asym_lut_into now only debug_asserts q_unit.len() == dim and then uses zip, so in
release builds a length mismatch will silently leave part of the LUT zeroed (or ignore extra
q_unit), producing incorrect scores without failing fast. This turns a would-be panic into a
silent mis-scoring failure mode that is harder to detect during future refactors.
Code

src/quant_kernels.rs[R46-50]

+    debug_assert_eq!(q_unit.len(), dim);
+    lut.resize(dim * n_buckets, 0.0);
+    for (&qd, row) in q_unit.iter().zip(lut.chunks_exact_mut(n_buckets)) {
+        for (b, slot) in row.iter_mut().enumerate() {
+            *slot = qd * bucket_centre(b as u8, bits);
Relevance

⭐⭐⭐ High

Repo favors fail-loud release checks over debug_assert for invariants (e.g., made length
preconditions unconditional in PR #19).

PR-#19
PR-#26

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The code shows a debug_assert (compiled out in release) and a zip() iteration, which will not
panic on length mismatch and will instead stop early, leaving trailing LUT entries at their resized
default 0.0 values.

src/quant_kernels.rs[39-53]
src/quant_kernels.rs[55-75]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`build_asym_lut_into()` relies on `debug_assert_eq!(q_unit.len(), dim)` and then iterates via `q_unit.iter().zip(lut.chunks_exact_mut(n_buckets))`. In release builds, `debug_assert` is disabled and `zip()` truncates to the shorter iterator, so a mismatched `(dim, q_unit.len())` can silently produce a partially-zero LUT and wrong scores.

## Issue Context
This function takes both `dim` and `q_unit` as independent inputs, so a future call site or refactor could desynchronize them. Previously, indexing-style loops would naturally panic on a short slice; the new iterator form removes that fail-fast behavior in release.

## Fix Focus Areas
- src/quant_kernels.rs[39-53]

## Suggested fix
- Replace `debug_assert_eq!(q_unit.len(), dim)` with `assert_eq!(q_unit.len(), dim)` (or otherwise return an error if you prefer non-panicking internal APIs).
- Alternatively, remove the `dim` parameter and derive `dim` from `q_unit.len()` at the boundary, so the function cannot be called with inconsistent shapes.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Informational
2. Test probe leaks into API 🐞 Bug ⚙ Maintainability
Description
SubsetScratch::capacities_with_scalar_lut_for_test is a new pub method compiled into normal
builds (only #[doc(hidden)]), expanding the public API surface with a test-only diagnostic. This
increases the chance downstream code accidentally relies on it and constrains future refactors of
scratch internals.
Code

src/quant.rs[R95-98]

+
+    /// Test-only capacity probe that includes the scalar LUT scratch buffer.
+    #[doc(hidden)]
+    pub fn capacities_with_scalar_lut_for_test(&self) -> (usize, usize, usize, usize, usize) {
Relevance

⭐⭐⭐ High

Team previously accepted feature-gating doc-hidden diagnostic APIs to avoid public surface (PR
#215).

PR-#215

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The new capacity probe is declared pub without a feature gate, while a similar test-only dispatch
probe in the same file is explicitly gated behind feature = "test-utils", indicating the intended
pattern for keeping diagnostics out of the stable surface.

src/quant.rs[70-106]
src/quant.rs[329-354]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`capacities_with_scalar_lut_for_test()` is a test-only diagnostic but is currently a `pub` method available in all builds (it is only hidden from docs). This unnecessarily expands the public API surface.

## Issue Context
The crate already uses `#[cfg(feature = "test-utils")]` to keep other test/diagnostic surfaces (e.g. `subset_rerank_uses_simd`) out of normal builds.

## Fix Focus Areas
- src/quant.rs[96-106]

## Suggested fix
- Add `#[cfg(feature = "test-utils")]` to `capacities_with_scalar_lut_for_test` (and likely to `capacities_for_test` for consistency), and ensure integration tests that call it are run with `--features test-utils` (as they appear to be already).
- If you want the probe callable without features, consider moving it behind a dedicated, clearly non-stable module/trait to reduce accidental downstream use.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Results up to commit 7d492ea


🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX issues (0) 🔗 Cross-repo conflicts (0) 📜 Skill insights (0)


Remediation recommended
1. b8 LUT shape unchecked ✓ Resolved 🐞 Bug ☼ Reliability
Description
scan_b8_asym_with_lut builds the LUT sized from q_unit.len() but then uses the separate dim
parameter to drive both the scalar scan and the AVX-512 gather path. If a future internal caller
passes mismatched dim/q_unit, the AVX-512 gather kernel can read past the LUT in release builds
because it only debug_asserts the lut.len() == dim * 256 invariant.
Code

src/quant_kernels.rs[R604-616]

+pub(crate) fn scan_b8_asym_with_lut(
+    packed: &[u8],
+    n: usize,
+    dim: usize,
+    q_unit: &[f32],
+    scale: f32,
+    top: &mut TopK,
+    lut: &mut Vec<f32>,
+) {
+    build_b8_asym_lut_into(lut, q_unit);
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512f")
Relevance

⭐⭐⭐ High

Team previously accepted replacing debug-only preconditions to prevent release-build OOB/UB in SIMD
paths (PR #19, #228).

PR-#19
PR-#228

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The new reusable-LUT entry point builds the LUT from q_unit length, not dim, and then calls the
AVX-512 gather kernel which relies on lut.len() == dim * 256 but only checks it with
debug_assert_eq! before using lut.as_ptr() for gathers.

src/quant_kernels.rs[604-633]
src/quant_kernels.rs[169-177]
src/quant_kernels.rs[666-699]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`scan_b8_asym_with_lut` does not validate that `q_unit.len()` matches the `dim` parameter, even though it forwards `dim` to the scan routines and the AVX-512 gather kernel’s safety relies on `lut.len() == dim * 256`.

## Issue Context
- The LUT builder sizes from `q_unit.len()`.
- The AVX-512 gather kernel uses `lut.as_ptr()` and only `debug_assert`s the length invariant, so a mismatch could become an OOB read in release.

## Fix Focus Areas
- Add a hard invariant check early in `scan_b8_asym_with_lut`, e.g. `assert_eq!(q_unit.len(), dim, "...")`, before building/using the LUT.
- (Optional) After building, you may also `debug_assert_eq!(lut.len(), dim * 256)` for clarity.

- src/quant_kernels.rs[604-633]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Qodo Logo

Comment thread src/quant.rs Outdated

Copy link
Copy Markdown
Member Author

/agentic_review

@Fieldnote-Echo Nelson Spence (Fieldnote-Echo) force-pushed the codex/prerelease-scalar-subset-scratch branch from 1e74798 to 020882f Compare June 19, 2026 16:35

Copy link
Copy Markdown
Member Author

/agentic_review

@qodo-code-review

Copy link
Copy Markdown

Code review by qodo was updated up to the latest commit 020882f

@Fieldnote-Echo Nelson Spence (Fieldnote-Echo) force-pushed the codex/prerelease-scalar-subset-scratch branch from 020882f to 7d492ea Compare June 19, 2026 16:47
@Fieldnote-Echo

Copy link
Copy Markdown
Member Author

/agentic_review

@qodo-code-review

Copy link
Copy Markdown

Code review by qodo was updated up to the latest commit 7d492ea

Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
@Fieldnote-Echo Nelson Spence (Fieldnote-Echo) force-pushed the codex/prerelease-scalar-subset-scratch branch from 7d492ea to 5fd5c7e Compare June 19, 2026 16:57
@Fieldnote-Echo

Copy link
Copy Markdown
Member Author

/agentic_review

@qodo-code-review

Copy link
Copy Markdown

Code review by qodo was updated up to the latest commit 5fd5c7e

@project-navi-bot Navi Bot (project-navi-bot) merged commit 9b90197 into main Jun 19, 2026
38 checks passed
@project-navi-bot Navi Bot (project-navi-bot) deleted the codex/prerelease-scalar-subset-scratch branch June 19, 2026 17:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf(runtime): reuse scalar subset-rerank LUTs through SubsetScratch

2 participants