feat(inference): W3 3-bit MLP-only weight path (#420) [FOUNDER-GATED: silent-quality-loss review] by ohdearquant · Pull Request #515 · ohdearquant/lattice

ohdearquant · 2026-07-01T19:52:35Z

PR description authored by Claude (Anthropic agent) on behalf of @ohdearquant.

W3 3-bit MLP-only weight path — REVIEW SURFACE, FOUNDER-GATED

⛔ DO NOT MERGE. This is a review surface (draft PR + evidence) for a new
sub-4-bit weight format with unmeasured silent-quality-loss risk. It delivers
a tested CPU W3 format component plus a complete Metal-path design, not a
runnable W3 inference path.

Issue #420. Branch off main @ 36d9bf68c.
Full design: .khive/reports/w3_mlp_420_design.md.

What W3 is

The cheapest sub-4-bit decode byte-read reduction. Decode is
weight-bandwidth-bound and dense MLP weights dominate per-step bytes-read.
Quantize only mlp.{gate_proj,up_proj,down_proj} to 3-bit; keep attention,
GDN, embeddings, norm, and lm_head at Q4/f16.

Format: 32-weight groups, unsigned 3-bit asymmetric min/max, f16 scale+bias,
12 packed bytes → 16-byte block (KHW3 v1). −25% payload / −20% block
bytes vs Q4.
Layer scope: MLP-only; is_w3_mlp_tensor_name fails closed on
MoE/attention/GDN/embed/lm_head.
Metal: gemv_w3_decode mirrors gemv_q4_decode (3 packed bytes/lane vs 4,
same bias·Σx optimization); naive gemm_w3 for prefill/verify. Mixed loader
from_w3_mlp_dir rejects MoE, missing .w3, and stray MLP .q4 (no silent
fallback). DESIGNED, not implemented this pass.
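A minimal CPU-side sketch of the block arithmetic described above. It assumes an LSB-first bit order and keeps scale and bias as f32 for readability; the PR does not spell out the KHW3 v1 bit layout, and the real format stores scale and bias as f16. The names `quantize_group`, `dequant`, `pack3`, and `unpack3` are hypothetical and are not the `weights::w3_weights` API. The 20-byte Q4 block is inferred from the quoted −20% figure.

```rust
// Hypothetical sketch; not the KHW3 v1 layout or the weights::w3_weights API.
const GROUP: usize = 32;               // weights per group (from the PR)
const PACKED: usize = GROUP * 3 / 8;   // 12 payload bytes per group
const W3_BLOCK: usize = PACKED + 4;    // + 2-byte scale + 2-byte bias = 16 bytes
const Q4_BLOCK: usize = GROUP / 2 + 4; // 16 payload bytes + 4 = 20 (inferred, see lead-in)

/// Asymmetric min/max quantization of one group to codes in 0..=7.
/// Returns (scale, bias, codes). Scale and bias stay f32 here (assumption).
fn quantize_group(w: &[f32; GROUP]) -> (f32, f32, [u8; GROUP]) {
    let min = w.iter().copied().fold(f32::INFINITY, f32::min);
    let max = w.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let range = max - min;
    // A zero-range group maps every weight to code 0; any nonzero scale works.
    let scale = if range > 0.0 { range / 7.0 } else { 1.0 };
    let mut codes = [0u8; GROUP];
    for (c, &x) in codes.iter_mut().zip(w.iter()) {
        *c = ((x - min) / scale).round().clamp(0.0, 7.0) as u8;
    }
    (scale, min, codes)
}

/// Reconstruct one weight from its 3-bit code.
fn dequant(scale: f32, bias: f32, code: u8) -> f32 {
    bias + scale * code as f32
}

/// Pack 32 three-bit codes into 12 bytes, LSB first (assumed bit order).
fn pack3(codes: &[u8; GROUP]) -> [u8; PACKED] {
    let mut out = [0u8; PACKED];
    for (i, &c) in codes.iter().enumerate() {
        let bit = i * 3;
        let (byte, shift) = (bit / 8, bit % 8);
        let wide = ((c & 0x7) as u16) << shift;
        out[byte] |= (wide & 0xFF) as u8;
        if shift > 5 {
            // The code straddles a byte boundary; spill the high bits.
            out[byte + 1] |= (wide >> 8) as u8;
        }
    }
    out
}

/// Inverse of `pack3`.
fn unpack3(packed: &[u8; PACKED]) -> [u8; GROUP] {
    let mut out = [0u8; GROUP];
    for (i, o) in out.iter_mut().enumerate() {
        let bit = i * 3;
        let (byte, shift) = (bit / 8, bit % 8);
        let lo = packed[byte] as u16;
        let hi = if byte + 1 < PACKED { packed[byte + 1] as u16 } else { 0 };
        *o = ((((hi << 8) | lo) >> shift) & 0x7) as u8;
    }
    out
}
```

With round-to-nearest codes, the reconstruction error of each weight is at most half a quantization step (`scale / 2`), which is the kind of bound the integration tests listed further down describe as the half-step error bound.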

Full-corpus measurement (2026-07-02) — method-disclosed

Dedicated serialized run on an idle GPU, one process per leg, build 11158ab112b2
(the follow-up work that made the W3 path runnable; refs #420, #530, #531). All
three legs use the same harness and scoring method: eval_perplexity
decode-loop (M=1 teacher-forced NLL via forward_step; NOT standard prefill PPL),
full docs/bench_results/wiki.test.raw corpus (310,034 tokens → 310,033 scored,
1,211 windows), window 512 / stride 256, max cache len 4096. Runtime differs by
leg and is disclosed per row: the f16 reference runs the CPU safetensors path;
Q4 and W3 run Metal. Deltas are method-consistent A/B within this harness and
are not comparable to published prefill-PPL numbers.
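The window and scored-token counts quoted above can be cross-checked with a little arithmetic. The sketch below assumes a sliding window that advances by the stride and ends with one partial window, and that every token except the first has a prediction target. `window_count` and `scored_tokens` are hypothetical helpers, not part of eval_perplexity, whose exact windowing rule the PR does not state.

```rust
/// Number of sliding windows over `tokens` tokens (assumed convention:
/// full windows advancing by `stride`, plus a final partial window).
fn window_count(tokens: usize, window: usize, stride: usize) -> usize {
    if tokens == 0 {
        return 0;
    }
    if tokens <= window {
        return 1;
    }
    let rest = tokens - window;
    // Ceiling division: each extra stride (or part of one) adds a window.
    1 + (rest + stride - 1) / stride
}

/// Tokens that receive an NLL score: all but the first (assumption).
fn scored_tokens(tokens: usize) -> usize {
    tokens.saturating_sub(1)
}
```

Under these assumptions, 310,034 tokens with window 512 and stride 256 give 1,211 windows and 310,033 scored tokens, which agrees with the figures reported for the run.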

Weights	Runtime	PPL	Mean NLL (nats)	Δ PPL vs f16	Δ PPL vs Q4	Wall
f16 (safetensors)	CPU	15.1977	2.7211	—	—	6,697s
Q4 (KHQ4)	Metal	15.7307	2.7556	+0.533 (+3.5%)	—	5,994s
W3 MLP-only	Metal	18.0971	2.8958	+2.899 (+19.1%)	+2.366 (+15.0%)	5,949s

Reading: W3's MLP-only 3-bit quantization costs +2.37 PPL over Q4 on the same
harness, roughly 4.4× Q4's own +0.53 delta vs f16 (W3's total delta vs f16,
+2.90, is roughly 5.4× Q4's), in exchange for −25% MLP payload bytes. This is
the quantified silent-quality-loss number the gate below asked for.
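The delta columns follow directly from the PPL column, and each PPL is the exponential of the mean NLL in nats. A small sketch that recomputes them from the table's own numbers; `ppl_from_nll` and `delta` are illustrative helpers, not harness functions.

```rust
/// Perplexity is exp(mean negative log-likelihood in nats).
fn ppl_from_nll(mean_nll_nats: f64) -> f64 {
    mean_nll_nats.exp()
}

/// Absolute and percentage PPL increase of `ppl` over `baseline`.
fn delta(ppl: f64, baseline: f64) -> (f64, f64) {
    let abs = ppl - baseline;
    (abs, abs / baseline * 100.0)
}

/// Print the three pairwise deltas for the reported PPL values.
fn print_deltas() {
    let (f16, q4, w3) = (15.1977, 15.7307, 18.0971);
    for (label, a, b) in [("Q4 vs f16", q4, f16), ("W3 vs f16", w3, f16), ("W3 vs Q4", w3, q4)] {
        let (abs, pct) = delta(a, b);
        println!("{label}: {abs:+.3} ({pct:+.1}%)");
    }
}
```

Calling `print_deltas()` recomputes the three comparisons, and `ppl_from_nll(2.7211)` lands within rounding of the f16 row's 15.1977, so the PPL and mean-NLL columns are mutually consistent.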

Still unmeasured (disclosed hole): W3 decode A/B. The #420 acceptance also
calls for generation-path decode tok/s of W3 vs f16/Q4; that has not been run.
The wall clocks above are teacher-forced scoring throughput and must not be
quoted as a decode speedup.

Raw log archived locally at .khive/reports/w3_420_full_corpus_ppl.log
(structured @@lattice {"ev":"perplexity",...} events per leg).

Earlier smoke measurement — SUPERSEDED by the full-corpus table above

Local qwen3.5-0.8b, wiki corpus, window=128 stride=64. SMOKE numbers on a
capped corpus — not publication-grade.

Path	Runtime	Max tok	PPL	Δ vs f16	Bound
f16/BF16	CPU	1024	17.299129	baseline	—
Q4	Metal	1024	18.709340	+1.410211	outside known Q4 +0.1–0.3
W3 MLP-only	N/A	N/A	N/A	N/A	INCOMPLETE — no runnable W3 path

Decode path	tok/s	Status
Q4 directory (Metal Q4_0)	158.42	measured
safetensors-direct (Metal Q8_0)	116.70	measured
W3 MLP-only	N/A	INCOMPLETE — no W3 decode path

Existing Q4-vs-Q8_0 gap is +35.75%, confirming byte-read reduction → real
decode speedup (the mechanism W3 exploits). W3's own speedup is not measured.
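For orientation only, the gap above and an idealized ceiling on what a byte-read reduction could buy can be written down in a few lines. The second function assumes decode time is exactly proportional to bytes read and that only the MLP share of those bytes shrinks; both are simplifications, `mlp_byte_fraction` is an unknown input rather than a measured value, and neither function is a substitute for the missing W3 decode A/B.

```rust
/// Relative throughput gap of `fast` over `slow`, in percent.
fn relative_gap_pct(fast_tok_s: f64, slow_tok_s: f64) -> f64 {
    (fast_tok_s / slow_tok_s - 1.0) * 100.0
}

/// Idealized upper bound on speedup when a fraction of the bytes read per
/// step shrinks by `byte_reduction` (0.20 for the quoted W3-vs-Q4 block
/// bytes) and everything else is unchanged. Purely bandwidth-bound model.
fn ideal_speedup(mlp_byte_fraction: f64, byte_reduction: f64) -> f64 {
    1.0 / (1.0 - mlp_byte_fraction * byte_reduction)
}
```

`relative_gap_pct(158.42, 116.70)` reproduces the +35.75% figure. `ideal_speedup(1.0, 0.20)` is 1.25×, the ceiling if MLP weights were all of the bytes read; any real MLP share gives less, and kernel overheads can reduce it further.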

SILENT QUALITY LOSS RISK

This is the founder gate. (Update 2026-07-02: the PPL magnitude is now
measured — see the full-corpus table above: +2.366 PPL vs Q4, +2.899 vs f16. The
gate is now a founder judgment on that measured trade plus the still-missing
decode A/B, no longer on an unmeasured unknown.) A 3-bit MLP model still decodes
fluent-looking tokens even when degraded, so the loss is silent — it shows up
only as higher perplexity, never as a crash.

W3 uses 3 bits vs Q4's 4 on the MLP (8 levels vs 16), so W3's MLP
quantization error is strictly larger than Q4's on those tensors. Expected
W3-vs-Q4 PPL delta is positive and larger than the MLP's share of Q4's
+0.1–0.3 budget — exact magnitude not measured on this branch.
The Q4 smoke delta (+1.15 to +1.41) already landed outside the known full-eval
bound of +0.1–0.3, proving the capped harness is not comparability-grade. W3 must be
evaluated on a full/comparable corpus before any acceptance.

To close the gate (none exist yet): converter → .w3 artifacts →
from_w3_mlp_dir → gemv_w3_decode; --w3-mlp-dir wired for a full-corpus run;
delta_w3_vs_{f16,q4} + delta_q4_vs_f16 on the same corpus; a decode tok/s A/B.

Additional silent surfaces (each produces plausible tokens while wrong; each
needs a test): missing .w3 → silent .q4 fallback; wrong gate_byte_size in
fused gate||up; measuring only tok/s; treating smoke PPL as a quality claim.
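The first of these surfaces, a missing .w3 silently falling back to .q4, is the kind of failure a fail-closed check is meant to rule out. The sketch below shows one way such a check could look using only std. The tensor-name pattern (`...layers.<n>.mlp.{gate,up,down}_proj.weight`), the `<tensor>.w3` / `<tensor>.q4` file naming, and the names `is_dense_mlp_proj`, `check_w3_mlp_files`, and `LoadError` are all assumptions for illustration; the PR's `is_w3_mlp_tensor_name` and `from_w3_mlp_dir` may apply different rules.

```rust
use std::collections::HashSet;

/// Hypothetical classifier: accept only `...layers.<digits>.mlp.<proj>.weight`.
/// Anything else (MoE experts, attention, embeddings, lm_head) is rejected.
fn is_dense_mlp_proj(name: &str) -> bool {
    let parts: Vec<&str> = name.split('.').collect();
    let n = parts.len();
    if n < 5 {
        return false;
    }
    let proj_ok = matches!(parts[n - 2], "gate_proj" | "up_proj" | "down_proj");
    let layer = parts[n - 4];
    let layer_ok = !layer.is_empty() && layer.bytes().all(|b| b.is_ascii_digit());
    parts[n - 1] == "weight"
        && proj_ok
        && parts[n - 3] == "mlp"
        && layer_ok
        && parts[n - 5] == "layers"
}

#[derive(Debug, PartialEq)]
enum LoadError {
    MoeUnsupported(String),
    MissingW3(String),
    StrayQ4(String),
}

/// Hypothetical pre-load check: every dense MLP projection must have a .w3
/// file and must not also have a .q4 file; MoE tensors abort the load.
fn check_w3_mlp_files(tensors: &[&str], files: &HashSet<String>) -> Result<(), LoadError> {
    for &t in tensors {
        if t.contains(".experts.") {
            return Err(LoadError::MoeUnsupported(t.to_string()));
        }
        if !is_dense_mlp_proj(t) {
            continue; // non-MLP tensors keep their existing format
        }
        if !files.contains(&format!("{t}.w3")) {
            return Err(LoadError::MissingW3(t.to_string()));
        }
        if files.contains(&format!("{t}.q4")) {
            return Err(LoadError::StrayQ4(t.to_string()));
        }
    }
    Ok(())
}
```

Returning a distinct error per case, rather than substituting another file, keeps a misconfigured directory from producing plausible tokens from the wrong weights.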

Verdict: W3 quality is NOT verified. INCOMPLETE per Π_TBV. Do not merge on the
byte-reduction argument alone.

DONE vs DESIGNED

DONE (tested): weights::w3_weights CPU pack/dequant/.w3 I/O/classification
(~660 LOC); 46 W3 tests (35 inline + 11 integration), mutation-sensitivity
confirmed by fault injection, 0 workspace regressions (1364 pre-existing).

DESIGNED, NOT built: converter binary, mixed W3/Q4 Metal loader,
gemv_w3_decode/gemm_w3 kernels, CLI/bench/--w3-mlp-dir wiring, and the W3
PPL/decode measurement.

Gates (actual)

cargo fmt --check                                              → clean (exit 0)
cargo clippy --workspace -- -D warnings                       → 0 warnings (exit 0)
cargo clippy -p lattice-inference --features metal-gpu -- -D warnings → 0 warnings (exit 0)
cargo test -p lattice-inference --lib w3_weights              → 35 passed; 0 failed
cargo test -p lattice-inference --test w3_weights_integration → 11 passed; 0 failed

🤖 Generated with Claude Code

CPU-side W3 (3-bit) asymmetric weight quantization for dense MLP projections: 32-weight groups, f16 scale+bias, 12 packed bytes per 16-byte block (KHW3 v1). 25% payload / 20% block byte reduction vs Q4, targeting decode weight-bandwidth. is_w3_mlp_tensor_name fails closed on MoE/attention/GDN/embed/lm_head. Library APIs return Result; no unwrap/assert in public shape validation. Metal kernel, converter, and loader are DESIGNED but NOT implemented (review surface, founder-gated). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…420) 11 independent integration tests for weights::w3_weights: happy-path half-step error bound, group-boundary isolation on non-multiple-of-32 shapes, exact vs one-past-multiple sizes, degenerate/zero-range blocks, empty/single-element edges, real temp-file roundtrip, and a quantization-level mutation guard. Confirmed mutation-sensitive by fault injection. 46 W3 tests total (35 inline + 11 integration), 0 workspace regressions. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ace (#420) Design doc: 3-bit format (packing/group/scale-bias), MLP-only layer scope, Metal kernel approach, measured f16/Q4 PPL + decode A/B, and an explicit SILENT QUALITY LOSS RISK section quantifying the unmeasured W3 PPL cost. Founder-gated review surface. W3 quality is UNVERIFIED (no runnable W3 quality-measurement path yet) and INCOMPLETE per the measure-or-declare rule. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions · 2026-07-01T20:04:19Z

E2E Parity Report

FAIL: 1/4 prompts diverged within their match windows

Prompt	Window	Agreement	First Diff	HF tok/s	Lattice tok/s	Verdict
`The capital of France is`	3	3/15	pos 3	0.3	1.9	PASS
`In the year 2024, artificial intelligence`	3	10/15	pos 9	0.3	2.2	PASS
`def fibonacci(n): if n <= 1: return n return`	3	15/15	none	0.3	0.7	PASS
`def merge_sort(arr): """ Merge sort implementation.`	2	3/15	pos 0	0.2	0.1	FAIL
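One plausible reading of the Agreement, First Diff, and Verdict columns is sketched below: agreement counts position-wise equal tokens, the first diff is the earliest mismatching position, and a prompt passes when that position falls at or beyond the match window. This rule is inferred from the rows (window 3 with a first diff at pos 3 passes, window 2 with a first diff at pos 0 fails) and is an assumption; `Parity` and `compare` are hypothetical names, not the CI job's code.

```rust
/// Hypothetical summary of one prompt's token-level comparison.
#[derive(Debug, PartialEq)]
struct Parity {
    agreement: usize,          // positions where both outputs match
    compared: usize,           // positions compared
    first_diff: Option<usize>, // earliest mismatch, if any
    pass: bool,                // no mismatch inside the match window
}

/// Compare two generated token sequences position by position.
fn compare(reference: &[u32], candidate: &[u32], window: usize) -> Parity {
    let compared = reference.len().min(candidate.len());
    let same = |i: &usize| reference[*i] == candidate[*i];
    let agreement = (0..compared).filter(|i| same(i)).count();
    let first_diff = (0..compared).find(|i| !same(i));
    // Assumed rule: pass when the first mismatch is at or past the window.
    let pass = first_diff.map_or(true, |p| p >= window);
    Parity { agreement, compared, first_diff, pass }
}
```

Under this reading a run can pass with low overall agreement, as in the first row, because only the tokens inside the window are gated.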

The capital of France is

HF: Paris.
The capital of France is Paris.
The capital of France
Lattice: Paris.
A: Yes, the capital of France is Paris.

In the year 2024, artificial intelligence

HF: (AI) has become a significant part of the global economy. It is
Lattice: (AI) has become a significant part of our daily lives. From personal

def fibonacci(n): if n <= 1: return n return

HF: fibonacci(n-1) + fibonacci(n-2)

print(fib

Lattice: fibonacci(n-1) + fibonacci(n-2)

print(fib

def merge_sort(arr): """ Merge sort implementation.

HF: merge_sort(arr):
"""
Merge sort implementation.
Lattice: main():

Test cases

test_cases = [

ohdearquant and others added 3 commits July 1, 2026 15:50

ohdearquant wants to merge 3 commits into main from feat/w3-mlp-420
