feat(inference): W3 3-bit MLP-only weight path (#420) [FOUNDER-GATED: silent-quality-loss review] #515
Draft
ohdearquant wants to merge 3 commits into
CPU-side W3 (3-bit) asymmetric weight quantization for dense MLP projections: 32-weight groups, f16 scale+bias, 12 packed bytes per 16-byte block (KHW3 v1). 25% payload / 20% block byte reduction vs Q4, targeting decode weight-bandwidth. is_w3_mlp_tensor_name fails closed on MoE/attention/GDN/embed/lm_head. Library APIs return Result; no unwrap/assert in public shape validation. Metal kernel, converter, and loader are DESIGNED but NOT implemented (review surface, founder-gated). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…420) 11 independent integration tests for weights::w3_weights: happy-path half-step error bound, group-boundary isolation on non-multiple-of-32 shapes, exact vs one-past-multiple sizes, degenerate/zero-range blocks, empty/single-element edges, real temp-file roundtrip, and a quantization-level mutation guard. Confirmed mutation-sensitive by fault injection. 46 W3 tests total (35 inline + 11 integration), 0 workspace regressions. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ace (#420) Design doc: 3-bit format (packing/group/scale-bias), MLP-only layer scope, Metal kernel approach, measured f16/Q4 PPL + decode A/B, and an explicit SILENT QUALITY LOSS RISK section quantifying the unmeasured W3 PPL cost. Founder-gated review surface. W3 quality is UNVERIFIED (no runnable W3 quality-measurement path yet) and INCOMPLETE per the measure-or-declare rule. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
E2E Parity Report
FAIL: 1/4 prompts diverged within their match windows
W3 3-bit MLP-only weight path — REVIEW SURFACE, FOUNDER-GATED
⛔ DO NOT MERGE. This is a review surface (draft PR + evidence) for a new
sub-4-bit weight format with unmeasured silent-quality-loss risk. It delivers
a tested CPU W3 format component plus a complete Metal-path design, not a
runnable W3 inference path.
Issue #420. Branch off main @ 36d9bf68c. Full design: .khive/reports/w3_mlp_420_design.md.
What W3 is
The cheapest sub-4-bit decode byte-read reduction. Decode is
weight-bandwidth-bound and dense MLP weights dominate per-step bytes-read.
Quantize only mlp.{gate_proj,up_proj,down_proj} to 3-bit; keep attention, GDN, embeddings, norm, and lm_head at Q4/f16.
12 packed bytes → 16-byte block (KHW3 v1). −25% payload / −20% block bytes vs Q4.
is_w3_mlp_tensor_name fails closed on MoE/attention/GDN/embed/lm_head.
gemv_w3_decode mirrors gemv_q4_decode (3 packed bytes/lane vs 4, same bias·Σx optimization); naive gemm_w3 for prefill/verify. Mixed loader from_w3_mlp_dir rejects MoE, missing .w3, and stray MLP .q4 (no silent fallback). DESIGNED, not implemented this pass.
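The format and kernel notes above can be made concrete with a minimal CPU sketch. It is not the PR's `weights::w3_weights` code: the function names, the little-endian bit order, and the use of f32 (rather than f16) for scale and bias are assumptions made for illustration, and the Q4 sizes assume Q4 uses the same 32-weight group with an f16 scale and bias. It shows the 12-byte payload arithmetic, an asymmetric 3-bit quantize/dequantize round trip, and the bias·Σx factoring used in a per-group dot product.

```rust
/// Weights per quantization group (from the PR description).
const GROUP: usize = 32;
/// Packed 3-bit payload: 32 weights * 3 bits / 8 = 12 bytes.
const W3_PAYLOAD: usize = GROUP * 3 / 8;
/// Payload plus f16 scale and f16 bias (2 bytes each) = 16 bytes.
const W3_BLOCK: usize = W3_PAYLOAD + 4;
/// ASSUMPTION: Q4 uses the same group size and the same 4-byte scale+bias header.
const Q4_PAYLOAD: usize = GROUP * 4 / 8;
const Q4_BLOCK: usize = Q4_PAYLOAD + 4;

/// Fractional byte savings of W3 vs Q4: (payload, whole block).
fn byte_savings() -> (f64, f64) {
    (
        1.0 - W3_PAYLOAD as f64 / Q4_PAYLOAD as f64,
        1.0 - W3_BLOCK as f64 / Q4_BLOCK as f64,
    )
}

/// Read the 3-bit code at index i from a little-endian bit stream
/// (ASSUMED layout: code i occupies bits [3i, 3i+3)).
fn code(packed: &[u8; W3_PAYLOAD], i: usize) -> u32 {
    let (byte, shift) = (i * 3 / 8, i * 3 % 8);
    let mut v = (packed[byte] as u32) >> shift;
    if shift > 5 {
        // The code straddles a byte boundary; pull the high bits from the next byte.
        v |= (packed[byte + 1] as u32) << (8 - shift);
    }
    v & 7
}

/// Asymmetric quantization of one group to codes 0..=7.
/// Returns (scale, bias, packed); scale and bias stay f32 in this sketch.
fn quantize_group(w: &[f32; GROUP]) -> (f32, f32, [u8; W3_PAYLOAD]) {
    let min = w.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = w.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let range = max - min;
    // Zero-range group: every code is 0, so any nonzero scale round-trips exactly.
    let scale = if range > 0.0 { range / 7.0 } else { 1.0 };
    let mut packed = [0u8; W3_PAYLOAD];
    for (i, &x) in w.iter().enumerate() {
        let q = ((x - min) / scale).round().clamp(0.0, 7.0) as u32;
        let (byte, shift) = (i * 3 / 8, i * 3 % 8);
        packed[byte] |= (q << shift) as u8;
        if shift > 5 {
            packed[byte + 1] |= (q >> (8 - shift)) as u8;
        }
    }
    (scale, min, packed)
}

/// Reconstruct weights as bias + scale * code.
fn dequantize_group(scale: f32, bias: f32, packed: &[u8; W3_PAYLOAD]) -> [f32; GROUP] {
    let mut out = [0.0f32; GROUP];
    for (i, o) in out.iter_mut().enumerate() {
        *o = bias + scale * code(packed, i) as f32;
    }
    out
}

/// Per-group dot product using the bias·Σx factoring:
/// Σ (bias + scale*q_i) * x_i = scale * Σ q_i*x_i + bias * Σ x_i.
fn dot_group(scale: f32, bias: f32, packed: &[u8; W3_PAYLOAD], x: &[f32; GROUP]) -> f32 {
    let (mut qx, mut sx) = (0.0f32, 0.0f32);
    for (i, &xi) in x.iter().enumerate() {
        qx += code(packed, i) as f32 * xi;
        sx += xi;
    }
    scale * qx + bias * sx
}
```

With 8 levels the step is range/7, so the round-trip error per weight is at most half a step (range/14), compared with range/30 for a 16-level format over the same range. That is the mechanism behind the quality cost discussed under SILENT QUALITY LOSS RISK.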
Full-corpus measurement (2026-07-02) — method-disclosed
Dedicated serialized run on an idle GPU, one process per leg, build
11158ab112b2 (the follow-up work that made the W3 path runnable; refs #420, #530, #531). All
three legs use the same harness and scoring method:
eval_perplexity decode-loop (M=1 teacher-forced NLL via
forward_step; NOT standard prefill PPL), full
docs/bench_results/wiki.test.raw corpus (310,034 tokens → 310,033 scored, 1,211 windows), window 512 / stride 256, max cache len 4096. Runtime differs by
leg and is disclosed per row: the f16 reference runs the CPU safetensors path;
Q4 and W3 run Metal. Deltas are method-consistent A/B within this harness and
are not comparable to published prefill-PPL numbers.
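The reported counts (310,034 tokens, 310,033 scored, 1,211 windows at window 512 / stride 256) are consistent with a common strided-window scheme. The sketch below shows that scheme; it is an assumption about how the counts arise, not the `eval_perplexity` implementation, and `window_plan` is a hypothetical name. Each window scores only the tokens not already scored by the previous window, and the very first token has no context, so it is never scored.

```rust
/// Count windows and scored tokens for a strided perplexity pass.
/// ASSUMED scheme: windows start every `stride` tokens, span up to `window`
/// tokens, and the pass stops once a window reaches the end of the corpus.
fn window_plan(n_tokens: usize, window: usize, stride: usize) -> (usize, usize) {
    assert!(window > 0 && stride > 0, "window and stride must be positive");
    let (mut windows, mut scored) = (0usize, 0usize);
    let (mut prev_end, mut begin) = (0usize, 0usize);
    while begin < n_tokens {
        let end = (begin + window).min(n_tokens);
        // Tokens not covered by any earlier window.
        let fresh = end - prev_end;
        // The first token of the corpus has no preceding context to predict it from.
        scored += if begin == 0 { fresh.saturating_sub(1) } else { fresh };
        windows += 1;
        prev_end = end;
        if end == n_tokens {
            break;
        }
        begin += stride;
    }
    (windows, scored)
}
```

Under these assumptions, `window_plan(310_034, 512, 256)` returns `(1211, 310033)`: 511 tokens from the first window, 256 from each of the next 1,209, and 18 from the final partial window.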
Reading: W3's MLP-only 3-bit quantization costs +2.37 PPL over Q4 on the same
harness (+2.90 over f16, roughly 5.4× Q4's own +0.53 delta vs f16) in exchange
for −25% MLP payload bytes. This is the quantified silent-quality-loss number the
gate below asked for.
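The ratio in the reading can be checked from the two deltas this description reports (+2.366 PPL vs Q4, +2.899 PPL vs f16). The sketch assumes the deltas are absolute PPL differences on the same harness, so Q4's own delta vs f16 is their difference; `delta_ratios` is a hypothetical helper, not part of the PR.

```rust
/// From the two reported W3 deltas, derive Q4's delta vs f16 and both ratios.
/// Returns (q4_vs_f16, w3_vs_q4 / q4_vs_f16, w3_vs_f16 / q4_vs_f16).
/// ASSUMPTION: deltas are absolute PPL differences, so they add across legs.
fn delta_ratios(w3_vs_q4: f64, w3_vs_f16: f64) -> (f64, f64, f64) {
    let q4_vs_f16 = w3_vs_f16 - w3_vs_q4;
    (q4_vs_f16, w3_vs_q4 / q4_vs_f16, w3_vs_f16 / q4_vs_f16)
}
```

`delta_ratios(2.366, 2.899)` gives a Q4-vs-f16 delta of about 0.533, a W3-vs-Q4 ratio of about 4.4, and a W3-vs-f16 ratio of about 5.4. On these numbers the 5.4× figure matches W3's total delta vs f16 rather than its increment over Q4.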
Still unmeasured (disclosed hole): W3 decode A/B. The #420 acceptance also
calls for generation-path decode tok/s of W3 vs f16/Q4; that has not been run.
The wall clocks above are teacher-forced scoring throughput and must not be
quoted as a decode speedup.
Raw log archived locally at
.khive/reports/w3_420_full_corpus_ppl.log (structured
@@lattice {"ev":"perplexity",...} events per leg).
Earlier smoke measurement: SUPERSEDED by the full-corpus table above
Local qwen3.5-0.8b, wiki corpus, window=128 stride=64. SMOKE numbers on a
capped corpus, not publication-grade.
Existing Q4-vs-Q8_0 gap is +35.75%, confirming byte-read reduction → real
decode speedup (the mechanism W3 exploits). W3's own speedup is not measured.
SILENT QUALITY LOSS RISK
This is the founder gate. (Update 2026-07-02: the PPL magnitude is now
measured — see the full-corpus table above: +2.366 PPL vs Q4, +2.899 vs f16. The
gate is now a founder judgment on that measured trade plus the still-missing
decode A/B, no longer on an unmeasured unknown.) A 3-bit MLP model still decodes
fluent-looking tokens even when degraded, so the loss is silent — it shows up
only as higher perplexity, never as a crash.
quantization error is strictly larger than Q4's on those tensors. Expected
W3-vs-Q4 PPL delta is positive and larger than the MLP's share of Q4's
+0.1–0.3 budget — exact magnitude not measured on this branch.
to +1.41), proving the capped harness is not comparability-grade. W3 must be
evaluated on a full/comparable corpus before any acceptance.
To close the gate (none exist yet): converter → .w3 artifacts → from_w3_mlp_dir → gemv_w3_decode; --w3-mlp-dir wired for a full-corpus run; delta_w3_vs_{f16,q4} + delta_q4_vs_f16 on the same corpus; a decode tok/s A/B.
Additional silent surfaces (each produces plausible tokens while wrong; each
needs a test): missing .w3 → silent .q4 fallback; wrong gate_byte_size in
fused gate||up; measuring only tok/s; treating smoke PPL as a quality claim.
Verdict: W3 quality is NOT verified. INCOMPLETE per Π_TBV. Do not merge on the
byte-reduction argument alone.
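The first silent surface (a missing .w3 quietly falling back to .q4) is the kind of thing a fail-closed resolver can turn into a hard error. The sketch below is illustrative only. It does not reproduce `from_w3_mlp_dir` or `is_w3_mlp_tensor_name`; the tensor-name patterns, the `.experts.` MoE marker, and the `<tensor>.w3` / `<tensor>.q4` file naming are all assumptions.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum Source {
    W3,
    /// Tensor stays on its existing Q4/f16 path.
    Other,
}

#[derive(Debug, PartialEq)]
enum LoadError {
    MoeRejected(String),
    MissingW3(String),
    StrayQ4(String),
}

/// Hypothetical stand-in for the PR's classifier: true only for dense MLP
/// projection weights; every other name falls through to false.
fn is_dense_mlp(name: &str) -> bool {
    ["mlp.gate_proj.weight", "mlp.up_proj.weight", "mlp.down_proj.weight"]
        .iter()
        .any(|suffix| name.ends_with(suffix))
}

/// Decide where a tensor loads from, returning an error instead of falling back.
fn resolve(name: &str, files: &HashSet<String>) -> Result<Source, LoadError> {
    // ASSUMED marker for MoE expert tensors.
    if name.contains(".experts.") {
        return Err(LoadError::MoeRejected(name.to_string()));
    }
    if !is_dense_mlp(name) {
        return Ok(Source::Other);
    }
    if !files.contains(&format!("{}.w3", name)) {
        // No silent .q4 fallback: a missing .w3 is a hard error.
        return Err(LoadError::MissingW3(name.to_string()));
    }
    if files.contains(&format!("{}.q4", name)) {
        // A leftover .q4 next to the .w3 is ambiguous, so reject it too.
        return Err(LoadError::StrayQ4(name.to_string()));
    }
    Ok(Source::W3)
}
```

Returning a `Result` with a distinct error per case makes each surface directly testable: a test can assert that removing the .w3 file yields `MissingW3` rather than a successful load from a different artifact.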
DONE vs DESIGNED
DONE (tested): weights::w3_weights CPU pack/dequant/.w3 I/O/classification (~660 LOC); 46 W3 tests (35 inline + 11 integration), mutation-sensitivity
confirmed by fault injection, 0 workspace regressions (1364 pre-existing).
DESIGNED, NOT built: converter binary, mixed W3/Q4 Metal loader,
gemv_w3_decode/gemm_w3 kernels, CLI/bench/--w3-mlp-dir wiring, and the W3 PPL/decode measurement.
Gates (actual)
🤖 Generated with Claude Code