fix(inference): correct partial-block asymmetric Q4 scale (#346)#541
Conversation
Compute the asymmetric min/max over the real tail-block elements instead of the zero-padded [f32;32] array, so a partial tail block gets the same scale/bias resolution as a full block (previously a tail like [5,6,7] took its min from the padding zeros, coarsening the scale). All five zero-padding call sites pass the real element count through a new length-aware helper. Symmetric mode is left folding over the full padded array and stays bit-identical, since padding zeros can never change abs_max. Add mutation-sensitive tests: a partial-block test that fails if min/max reverts to the padded array, a symmetric bit-identity test, and corrupt-header tests proving load_q4_file / load_f16_tensor_file return a clean Err (not a panic or a silent wrong-size allocation) for header counts near usize::MAX and for counts whose multiply overflows and wraps to a small residue. Refs #346 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
E2E Parity ReportFAIL: 1/4 prompts diverged within their match windows
|
|
E2E Parity ReportPASS: all 4 prompts match within their respective match windows
|
print(fib
print(fib
|
The near-usize::MAX f16 test claimed to cover the file_len bound, but (usize::MAX - 3) * 2 overflows, so it exercised the checked_mul branch and its assertion accepted "overflows usize" — deleting the file_len check would have gone unnoticed. Use numel = usize::MAX / 4 (byte count ~2^62, no overflow) and assert only on the file_len-bound error. Under mutation (bound deleted) the loader attempts a 9.2 EB allocation and aborts, confirming the test now pins that branch.
Review follow-up, addressed in fdf70d4: the near-usize::MAX f16 test claimed to pin the file-length bound, but |
Context for the point estimates: the changed function is |
Addresses the four bounded edge items in #346. The core Q4/QuaRot dequant math was already sound; each item's premise was re-verified against current source before any change, since line numbers in the issue had drifted.
Item 1 — partial-block asymmetric scale (fixed)
quantize_block_with_mode(.., symmetric=false)computed min/max over the zero-padded[f32;32]array, so a partial tail block took its scale from the padding. Example: a[5,6,7]+29-zeros tail derivedmin=0instead ofmin=5, producing a coarser scale/bias than a real 32-element block.Fix: a new length-aware helper computes the asymmetric min/max over only the real
chunk.len()elements; zero-padding is used solely for packing. All five zero-padding call sites now pass the real element count. Symmetric mode is unchanged and remains bit-identical — it still folds over the full padded array, and padding zeros can never changeabs_max; a bit-identity test asserts this.The standalone
quantize_blockwrapper became dead code once the length-aware helper landed and was removed (the-D warningsclippy gate flags an unused function otherwise).Item 2 — load-time overflow guard (already present)
Re-verified:
checked_alloc_bytesalready guards every header-derived allocation inq4_weights.rsviacount.checked_mul(elem_size)plus a file-length bound, returning a cleanio::Error(no panic/unwrap). This coversload_q4_file(ndim, shape-product, and then_blocks * 20block payload the issue named),read_q4_header(ndim), and the siblingload_f16_tensor_file(ndim,numel * 2). No sibling "untrusted header count × elem_size → allocation" path was left unguarded. No production change was needed; this PR adds mutation-sensitive tests to lock the guard in (evidence commented on #346).Item 4 — docstring (already consistent)
The
quantize_blockdocstring already read "asymmetric mode" and matched its body. No mismatch remained; the wrapper it documented was removed under Item 1's refactor (evidence commented on #346).Item 3 — embed/lm_head symmetry (measured, report-only)
Measured
mean(|row_mean|)/abs_maxon the post-rotationembed_tokensand materializedlm_headof the real Qwen3.5-0.8B checkpoint (CPU-only, converter-identical rotation path): 0.5559% and 0.0306% ofabs_maxrespectively — both far below the ~5% materiality threshold. No asymmetric switch is added for these tensors and no switch test; report-only with a follow-up note. Full method + numbers are posted as a comment on #346.Tests / mutation-proof evidence
New tests in
q4_weights.rs, each proven mutation-sensitive by reverse-applying the fix in place, confirming the test FAILS, then restoring and confirming it PASSES:[5,6,7]tail getsscale=(7-5)/15, bias=5.0and explicitly does not match the padded-array result; FAILS if min/max reverts to the padded array.valid_len.Errfororiginal_len/numelnearusize::MAX, and for anumelwhose×elem_sizeoverflow wraps to a small residue. The wrap-to-small case is the important one: withchecked_mulremoved it returnsOkwith silently wrong data (not just an OOM), which the file-length bound alone cannot catch. FAILS with the guard removed.Gates (CPU-only)
cargo fmt --check,cargo clippy --workspace -- -D warnings,cargo check --workspace, andcargo test --workspace(incl. doctests) all pass; the inference lib suite reports 1360 passed / 0 failed / 7 ignored. No Metal/GPU test, bench, or binary was run — a separate serialized GPU measurement is active on this machine and the touched code is CPU load-time only.Bench
make bench-comparewas not run. The touched path (quantize_block_with_mode_lenand its five call sites) is invoked only at quantization/load time, never on the hot forward path, and no existing CPU micro-bench targets it; stating this plainly rather than fabricating a delta or contending with the concurrent GPU measurement.Out-of-scope defect filed
A separate, adjacent fail-closed gap in the Metal Q4 mmap loader (validates only the header offset, not the full declared payload length, so a truncated
.q4payload can pass) was filed as #540 rather than folded in here — it is a no-copy mmap path, not the header-derived allocation class this PR addresses.Closes #346