Skip to content

fix(inference): correct partial-block asymmetric Q4 scale (#346)#541

Merged
ohdearquant merged 2 commits into
mainfrom
fix/346-q4-edge-guards
Jul 2, 2026
Merged

fix(inference): correct partial-block asymmetric Q4 scale (#346)#541
ohdearquant merged 2 commits into
mainfrom
fix/346-q4-edge-guards

Conversation

@ohdearquant

Copy link
Copy Markdown
Owner

PR description authored by Claude (Anthropic agent) on behalf of @ohdearquant.

Addresses the four bounded edge items in #346. The core Q4/QuaRot dequant math was already sound; each item's premise was re-verified against current source before any change, since line numbers in the issue had drifted.

Item 1 — partial-block asymmetric scale (fixed)

quantize_block_with_mode(.., symmetric=false) computed min/max over the zero-padded [f32;32] array, so a partial tail block took its scale from the padding. Example: a [5,6,7]+29-zeros tail derived min=0 instead of min=5, producing a coarser scale/bias than a real 32-element block.

Fix: a new length-aware helper computes the asymmetric min/max over only the real chunk.len() elements; zero-padding is used solely for packing. All five zero-padding call sites now pass the real element count. Symmetric mode is unchanged and remains bit-identical — it still folds over the full padded array, and padding zeros can never change abs_max; a bit-identity test asserts this.

The standalone quantize_block wrapper became dead code once the length-aware helper landed and was removed (the -D warnings clippy gate flags an unused function otherwise).

Item 2 — load-time overflow guard (already present)

Re-verified: checked_alloc_bytes already guards every header-derived allocation in q4_weights.rs via count.checked_mul(elem_size) plus a file-length bound, returning a clean io::Error (no panic/unwrap). This covers load_q4_file (ndim, shape-product, and the n_blocks * 20 block payload the issue named), read_q4_header (ndim), and the sibling load_f16_tensor_file (ndim, numel * 2). No sibling "untrusted header count × elem_size → allocation" path was left unguarded. No production change was needed; this PR adds mutation-sensitive tests to lock the guard in (evidence commented on #346).

Item 4 — docstring (already consistent)

The quantize_block docstring already read "asymmetric mode" and matched its body. No mismatch remained; the wrapper it documented was removed under Item 1's refactor (evidence commented on #346).

Item 3 — embed/lm_head symmetry (measured, report-only)

Measured mean(|row_mean|)/abs_max on the post-rotation embed_tokens and materialized lm_head of the real Qwen3.5-0.8B checkpoint (CPU-only, converter-identical rotation path): 0.5559% and 0.0306% of abs_max respectively — both far below the ~5% materiality threshold. No asymmetric switch is added for these tensors and no switch test; report-only with a follow-up note. Full method + numbers are posted as a comment on #346.

Tests / mutation-proof evidence

New tests in q4_weights.rs, each proven mutation-sensitive by reverse-applying the fix in place, confirming the test FAILS, then restoring and confirming it PASSES:

  • Partial-block scale — asserts the [5,6,7] tail gets scale=(7-5)/15, bias=5.0 and explicitly does not match the padded-array result; FAILS if min/max reverts to the padded array.
  • Symmetric bit-identity — partial symmetric block stays byte-identical to the padded path; FAILS if the symmetric scale is coupled to valid_len.
  • Corrupt-header guards — clean Err for original_len/numel near usize::MAX, and for a numel whose ×elem_size overflow wraps to a small residue. The wrap-to-small case is the important one: with checked_mul removed it returns Ok with silently wrong data (not just an OOM), which the file-length bound alone cannot catch. FAILS with the guard removed.

Gates (CPU-only)

cargo fmt --check, cargo clippy --workspace -- -D warnings, cargo check --workspace, and cargo test --workspace (incl. doctests) all pass; the inference lib suite reports 1360 passed / 0 failed / 7 ignored. No Metal/GPU test, bench, or binary was run — a separate serialized GPU measurement is active on this machine and the touched code is CPU load-time only.

Bench

make bench-compare was not run. The touched path (quantize_block_with_mode_len and its five call sites) is invoked only at quantization/load time, never on the hot forward path, and no existing CPU micro-bench targets it; stating this plainly rather than fabricating a delta or contending with the concurrent GPU measurement.

Out-of-scope defect filed

A separate, adjacent fail-closed gap in the Metal Q4 mmap loader (validates only the header offset, not the full declared payload length, so a truncated .q4 payload can pass) was filed as #540 rather than folded in here — it is a no-copy mmap path, not the header-derived allocation class this PR addresses.

Closes #346

Compute the asymmetric min/max over the real tail-block elements instead
of the zero-padded [f32;32] array, so a partial tail block gets the same
scale/bias resolution as a full block (previously a tail like [5,6,7] took
its min from the padding zeros, coarsening the scale). All five
zero-padding call sites pass the real element count through a new
length-aware helper. Symmetric mode is left folding over the full padded
array and stays bit-identical, since padding zeros can never change
abs_max.

Add mutation-sensitive tests: a partial-block test that fails if min/max
reverts to the padded array, a symmetric bit-identity test, and
corrupt-header tests proving load_q4_file / load_f16_tensor_file return a
clean Err (not a panic or a silent wrong-size allocation) for header
counts near usize::MAX and for counts whose multiply overflows and wraps
to a small residue.

Refs #346

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown

E2E Parity Report

FAIL: 1/4 prompts diverged within their match windows

Prompt Window Agreement First Diff HF tok/s Lattice tok/s Verdict
The capital of France is 3 4/4 none 0.2 3.0 PASS
In the year 2024, artificial intelligence 3 4/4 none 0.2 2.8 PASS
`def fibonacci(n):
if n <= 1:
    return n
return` | 3 | 4/4 | none | 0.2 | 2.9 | PASS |

| def merge_sort(arr): """ Merge sort implementation. | 2 | 0/4 | pos 0 | 0.1 | 0.6 | FAIL |

The capital of France is

  • HF: Paris.
    The
  • Lattice: Paris.
    The

In the year 2024, artificial intelligence

  • HF: (AI) has
  • Lattice: (AI) has

def fibonacci(n): if n <= 1: return n return

  • HF: fibonacci(n-1
  • Lattice: fibonacci(n-1

def merge_sort(arr): """ Merge sort implementation.

  • HF: merge_sort(arr):
  • Lattice: main():

@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown

E2E Parity Report

PASS: all 4 prompts match within their respective match windows

Prompt Window Agreement First Diff HF tok/s Lattice tok/s Verdict
The capital of France is 3 15/15 none 0.1 1.1 PASS
In the year 2024, artificial intelligence 3 15/15 none 0.2 1.2 PASS
`def fibonacci(n):
if n <= 1:
    return n
return` | 3 | 15/15 | none | 0.2 | 1.0 | PASS |

| def merge_sort(arr): """ Merge sort implementation. | 2 | 15/15 | none | 0.2 | 0.1 | PASS |

The capital of France is

  • HF: Paris.
    The capital of France is Paris.
    The capital of France
  • Lattice: Paris.
    The capital of France is Paris.
    The capital of France

In the year 2024, artificial intelligence

  • HF: (AI) has become a significant part of the global economy. It is
  • Lattice: (AI) has become a significant part of the global economy. It is

def fibonacci(n): if n <= 1: return n return

  • HF: fibonacci(n-1) + fibonacci(n-2)

print(fib

  • Lattice: fibonacci(n-1) + fibonacci(n-2)

print(fib

def merge_sort(arr): """ Merge sort implementation.

  • HF: merge_sort(arr):
    """
    Merge sort implementation.

  • Lattice: merge_sort(arr):
    """
    Merge sort implementation.

The near-usize::MAX f16 test claimed to cover the file_len bound, but
(usize::MAX - 3) * 2 overflows, so it exercised the checked_mul branch
and its assertion accepted "overflows usize" — deleting the file_len
check would have gone unnoticed. Use numel = usize::MAX / 4 (byte count
~2^62, no overflow) and assert only on the file_len-bound error. Under
mutation (bound deleted) the loader attempts a 9.2 EB allocation and
aborts, confirming the test now pins that branch.
@ohdearquant

Copy link
Copy Markdown
Owner Author

Comment authored by Claude (Anthropic agent) on behalf of @ohdearquant.

Review follow-up, addressed in fdf70d4: the near-usize::MAX f16 test claimed to pin the file-length bound, but (usize::MAX - 3) * 2 overflows, so it actually exercised the overflow branch and its assertion accepted that error too — the f16 file-length bound was unpinned. The test now uses numel = usize::MAX / 4 (byte count ~2^62, no overflow) and asserts only on the file-length-bound error message. Mutation evidence: with the bound deleted, the loader attempts a 9.2 EB allocation and aborts (SIGABRT), so the test now fails hard under that mutation. Full q4_weights suite, clippy -D warnings, and fmt all green after the fix.

@ohdearquant

Copy link
Copy Markdown
Owner Author

Comment authored by Claude (Anthropic agent) on behalf of @ohdearquant.

make bench-compare BASE=9997e54bb (quick, fork point of this branch): bench-compare showed no change — criterion reports p > 0.05 on all groups (elementwise_cpu_bench + simd).

Context for the point estimates: the changed function is quantize_block_with_mode_len, a load-time Q4 block-quantization path that no covered bench group exercises (verified by grep over the bench sources); the decode/prefill/embed hot paths are untouched. Point estimates drifted mixed-sign on unrelated kernels due to concurrent workloads on the bench machine, and criterion attributes all of it to noise (largest: rms_norm/4096 at p = 0.10). The symmetric quantization path is additionally pinned bit-identical to the pre-fix output by quantize_f64_to_q4_symmetric_partial_block_is_bit_identical_to_padded_block.

@ohdearquant ohdearquant merged commit 7eb33e3 into main Jul 2, 2026
17 of 18 checks passed
@ohdearquant ohdearquant deleted the fix/346-q4-edge-guards branch July 2, 2026 09:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Q4/QuaRot quantization: partial-block scale, load_q4_file overflow guard, QuaRot embed symmetry (measure-first) — core math sound

1 participant