fix(inference): correct partial-block asymmetric Q4 scale (#346) by ohdearquant · Pull Request #541 · ohdearquant/lattice

ohdearquant · 2026-07-02T05:18:53Z

PR description authored by Claude (Anthropic agent) on behalf of @ohdearquant.

Addresses the four bounded edge items in #346. The core Q4/QuaRot dequant math was already sound; each item's premise was re-verified against current source before any change, since line numbers in the issue had drifted.

Item 1 — partial-block asymmetric scale (fixed)

quantize_block_with_mode(.., symmetric=false) computed min/max over the zero-padded [f32;32] array, so a partial tail block took its scale from the padding. Example: a [5,6,7]+29-zeros tail derived min=0 instead of min=5, producing a coarser scale/bias than a real 32-element block.

Fix: a new length-aware helper computes the asymmetric min/max over only the real chunk.len() elements; zero-padding is used solely for packing. All five zero-padding call sites now pass the real element count. Symmetric mode is unchanged and remains bit-identical — it still folds over the full padded array, and padding zeros can never change abs_max; a bit-identity test asserts this.

The standalone quantize_block wrapper became dead code once the length-aware helper landed and was removed (the -D warnings clippy gate flags an unused function otherwise).

Item 2 — load-time overflow guard (already present)

Re-verified: checked_alloc_bytes already guards every header-derived allocation in q4_weights.rs via count.checked_mul(elem_size) plus a file-length bound, returning a clean io::Error (no panic/unwrap). This covers load_q4_file (ndim, shape-product, and the n_blocks * 20 block payload the issue named), read_q4_header (ndim), and the sibling load_f16_tensor_file (ndim, numel * 2). No sibling "untrusted header count × elem_size → allocation" path was left unguarded. No production change was needed; this PR adds mutation-sensitive tests to lock the guard in (evidence commented on #346).

Item 4 — docstring (already consistent)

The quantize_block docstring already read "asymmetric mode" and matched its body. No mismatch remained; the wrapper it documented was removed under Item 1's refactor (evidence commented on #346).

Item 3 — embed/lm_head symmetry (measured, report-only)

Measured mean(|row_mean|)/abs_max on the post-rotation embed_tokens and materialized lm_head of the real Qwen3.5-0.8B checkpoint (CPU-only, converter-identical rotation path): 0.5559% and 0.0306% of abs_max respectively — both far below the ~5% materiality threshold. No asymmetric switch is added for these tensors and no switch test; report-only with a follow-up note. Full method + numbers are posted as a comment on #346.

Tests / mutation-proof evidence

New tests in q4_weights.rs, each proven mutation-sensitive by reverse-applying the fix in place, confirming the test FAILS, then restoring and confirming it PASSES:

Partial-block scale — asserts the [5,6,7] tail gets scale=(7-5)/15, bias=5.0 and explicitly does not match the padded-array result; FAILS if min/max reverts to the padded array.
Symmetric bit-identity — partial symmetric block stays byte-identical to the padded path; FAILS if the symmetric scale is coupled to valid_len.
Corrupt-header guards — clean Err for original_len/numel near usize::MAX, and for a numel whose ×elem_size overflow wraps to a small residue. The wrap-to-small case is the important one: with checked_mul removed it returns Ok with silently wrong data (not just an OOM), which the file-length bound alone cannot catch. FAILS with the guard removed.

Gates (CPU-only)

cargo fmt --check, cargo clippy --workspace -- -D warnings, cargo check --workspace, and cargo test --workspace (incl. doctests) all pass; the inference lib suite reports 1360 passed / 0 failed / 7 ignored. No Metal/GPU test, bench, or binary was run — a separate serialized GPU measurement is active on this machine and the touched code is CPU load-time only.

Bench

make bench-compare was not run. The touched path (quantize_block_with_mode_len and its five call sites) is invoked only at quantization/load time, never on the hot forward path, and no existing CPU micro-bench targets it; stating this plainly rather than fabricating a delta or contending with the concurrent GPU measurement.

Out-of-scope defect filed

A separate, adjacent fail-closed gap in the Metal Q4 mmap loader (validates only the header offset, not the full declared payload length, so a truncated .q4 payload can pass) was filed as #540 rather than folded in here — it is a no-copy mmap path, not the header-derived allocation class this PR addresses.

Closes #346

Compute the asymmetric min/max over the real tail-block elements instead of the zero-padded [f32;32] array, so a partial tail block gets the same scale/bias resolution as a full block (previously a tail like [5,6,7] took its min from the padding zeros, coarsening the scale). All five zero-padding call sites pass the real element count through a new length-aware helper. Symmetric mode is left folding over the full padded array and stays bit-identical, since padding zeros can never change abs_max. Add mutation-sensitive tests: a partial-block test that fails if min/max reverts to the padded array, a symmetric bit-identity test, and corrupt-header tests proving load_q4_file / load_f16_tensor_file return a clean Err (not a panic or a silent wrong-size allocation) for header counts near usize::MAX and for counts whose multiply overflows and wraps to a small residue. Refs #346 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions · 2026-07-02T05:28:29Z

E2E Parity Report

FAIL: 1/4 prompts diverged within their match windows

Prompt	Window	Agreement	First Diff	HF tok/s	Lattice tok/s	Verdict
`The capital of France is`	3	4/4	none	0.2	3.0	PASS
`In the year 2024, artificial intelligence`	3	4/4	none	0.2	2.8	PASS
`def fibonacci(n):

if n <= 1:
    return n
return` | 3 | 4/4 | none | 0.2 | 2.9 | PASS |

| def merge_sort(arr): """ Merge sort implementation. | 2 | 0/4 | pos 0 | 0.1 | 0.6 | FAIL |

The capital of France is

HF: Paris.
The
Lattice: Paris.
The

In the year 2024, artificial intelligence

HF: (AI) has
Lattice: (AI) has

def fibonacci(n): if n <= 1: return n return

HF: fibonacci(n-1
Lattice: fibonacci(n-1

def merge_sort(arr): """ Merge sort implementation.

HF: merge_sort(arr):
Lattice: main():

github-actions · 2026-07-02T05:28:32Z

E2E Parity Report

PASS: all 4 prompts match within their respective match windows

Prompt	Window	Agreement	First Diff	HF tok/s	Lattice tok/s	Verdict
`The capital of France is`	3	15/15	none	0.1	1.1	PASS
`In the year 2024, artificial intelligence`	3	15/15	none	0.2	1.2	PASS
`def fibonacci(n):

if n <= 1:
    return n
return` | 3 | 15/15 | none | 0.2 | 1.0 | PASS |

| def merge_sort(arr): """ Merge sort implementation. | 2 | 15/15 | none | 0.2 | 0.1 | PASS |

The capital of France is

HF: Paris.
The capital of France is Paris.
The capital of France
Lattice: Paris.
The capital of France is Paris.
The capital of France

In the year 2024, artificial intelligence

HF: (AI) has become a significant part of the global economy. It is
Lattice: (AI) has become a significant part of the global economy. It is

def fibonacci(n): if n <= 1: return n return

HF: fibonacci(n-1) + fibonacci(n-2)

print(fib

Lattice: fibonacci(n-1) + fibonacci(n-2)

print(fib

def merge_sort(arr): """ Merge sort implementation.

HF: merge_sort(arr):
"""
Merge sort implementation.
Lattice: merge_sort(arr):
"""
Merge sort implementation.

The near-usize::MAX f16 test claimed to cover the file_len bound, but (usize::MAX - 3) * 2 overflows, so it exercised the checked_mul branch and its assertion accepted "overflows usize" — deleting the file_len check would have gone unnoticed. Use numel = usize::MAX / 4 (byte count ~2^62, no overflow) and assert only on the file_len-bound error. Under mutation (bound deleted) the loader attempts a 9.2 EB allocation and aborts, confirming the test now pins that branch.

ohdearquant · 2026-07-02T05:30:28Z

Comment authored by Claude (Anthropic agent) on behalf of @ohdearquant.

Review follow-up, addressed in fdf70d4: the near-usize::MAX f16 test claimed to pin the file-length bound, but (usize::MAX - 3) * 2 overflows, so it actually exercised the overflow branch and its assertion accepted that error too — the f16 file-length bound was unpinned. The test now uses numel = usize::MAX / 4 (byte count ~2^62, no overflow) and asserts only on the file-length-bound error message. Mutation evidence: with the bound deleted, the loader attempts a 9.2 EB allocation and aborts (SIGABRT), so the test now fails hard under that mutation. Full q4_weights suite, clippy -D warnings, and fmt all green after the fix.

ohdearquant · 2026-07-02T09:28:53Z

Comment authored by Claude (Anthropic agent) on behalf of @ohdearquant.

make bench-compare BASE=9997e54bb (quick, fork point of this branch): bench-compare showed no change — criterion reports p > 0.05 on all groups (elementwise_cpu_bench + simd).

Context for the point estimates: the changed function is quantize_block_with_mode_len, a load-time Q4 block-quantization path that no covered bench group exercises (verified by grep over the bench sources); the decode/prefill/embed hot paths are untouched. Point estimates drifted mixed-sign on unrelated kernels due to concurrent workloads on the bench machine, and criterion attributes all of it to noise (largest: rms_norm/4096 at p = 0.10). The symmetric quantization path is additionally pinned bit-identical to the pre-fix output by quantize_f64_to_q4_symmetric_partial_block_is_bit_identical_to_padded_block.

ohdearquant merged commit 7eb33e3 into main Jul 2, 2026
17 of 18 checks passed

ohdearquant deleted the fix/346-q4-edge-guards branch July 2, 2026 09:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(inference): correct partial-block asymmetric Q4 scale (#346)#541

fix(inference): correct partial-block asymmetric Q4 scale (#346)#541
ohdearquant merged 2 commits into
mainfrom
fix/346-q4-edge-guards

ohdearquant commented Jul 2, 2026

Uh oh!

github-actions Bot commented Jul 2, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jul 2, 2026 •

edited

Loading

Uh oh!

ohdearquant commented Jul 2, 2026

Uh oh!

ohdearquant commented Jul 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ohdearquant commented Jul 2, 2026

Item 1 — partial-block asymmetric scale (fixed)

Item 2 — load-time overflow guard (already present)

Item 4 — docstring (already consistent)

Item 3 — embed/lm_head symmetry (measured, report-only)

Tests / mutation-proof evidence

Gates (CPU-only)

Bench

Out-of-scope defect filed

Uh oh!

github-actions Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Parity Report

Uh oh!

github-actions Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Parity Report

Uh oh!

ohdearquant commented Jul 2, 2026

Uh oh!

ohdearquant commented Jul 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions Bot commented Jul 2, 2026 •

edited

Loading

github-actions Bot commented Jul 2, 2026 •

edited

Loading