feat(inference): W3 3-bit MLP-only weight path (#420) [FOUNDER-GATED: silent-quality-loss review] #515

Draft
ohdearquant wants to merge 3 commits into main from feat/w3-mlp-420

@ohdearquant ohdearquant commented Jul 1, 2026

PR description authored by Claude (Anthropic agent) on behalf of @ohdearquant.

W3 3-bit MLP-only weight path — REVIEW SURFACE, FOUNDER-GATED

⛔ DO NOT MERGE. This is a review surface (draft PR + evidence) for a new
sub-4-bit weight format with unmeasured silent-quality-loss risk. It delivers
a tested CPU W3 format component plus a complete Metal-path design, not a
runnable W3 inference path.

Issue #420. Branch off main @ 36d9bf68c.
Full design: .khive/reports/w3_mlp_420_design.md.


What W3 is

The cheapest sub-4-bit decode byte-read reduction. Decode is
weight-bandwidth-bound and dense MLP weights dominate per-step bytes-read.
Quantize only mlp.{gate_proj,up_proj,down_proj} to 3-bit; keep attention,
GDN, embeddings, norm, and lm_head at Q4/f16.
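
The layer scope above amounts to an allow-list over tensor names. A minimal sketch of that idea follows, assuming Hugging Face style names such as `model.layers.0.mlp.gate_proj.weight` and MoE tensors marked by `.experts.` or `shared_expert`. The function name `is_dense_mlp_weight` is hypothetical; it is not the PR's `is_w3_mlp_tensor_name`, whose exact rules are not shown here.

```rust
/// Hypothetical allow-list classifier: returns true only for dense MLP
/// projection weights. Any other name, including unknown ones, is rejected,
/// so new tensor kinds default to "not W3" (fail closed).
pub fn is_dense_mlp_weight(name: &str) -> bool {
    // Defensive MoE check (assumed markers): expert tensors often live under
    // an `mlp.` prefix too, so reject them explicitly before the allow-list.
    if name.contains(".experts.") || name.contains("shared_expert") {
        return false;
    }
    // Only these three suffixes are eligible for 3-bit quantization.
    const ALLOWED: [&str; 3] = [
        ".mlp.gate_proj.weight",
        ".mlp.up_proj.weight",
        ".mlp.down_proj.weight",
    ];
    ALLOWED.iter().any(|suffix| name.ends_with(suffix))
}
```

An allow-list is used rather than a deny-list so that a tensor the classifier has never seen stays on the Q4/f16 path by default.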

  • Format: 32-weight groups, unsigned 3-bit asymmetric min/max, f16 scale+bias,
    12 packed bytes → 16-byte block (KHW3 v1). −25% payload / −20% block
    bytes vs Q4.
  • Layer scope: MLP-only; is_w3_mlp_tensor_name fails closed on
    MoE/attention/GDN/embed/lm_head.
  • Metal: gemv_w3_decode mirrors gemv_q4_decode (3 packed bytes/lane vs 4,
    same bias·Σx optimization); naive gemm_w3 for prefill/verify. Mixed loader
    from_w3_mlp_dir rejects MoE, missing .w3, and stray MLP .q4 (no silent
    fallback). DESIGNED, not implemented this pass.
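
The bias·Σx optimization mentioned for `gemv_w3_decode` follows from the asymmetric form w_i = bias + scale·q_i: a group's dot product splits into scale·Σ(q_i·x_i) + bias·Σx_i, and Σx_i depends only on the input vector. The CPU sketch below illustrates that identity only; the function names are hypothetical and it says nothing about how the designed Metal kernel is laid out.

```rust
/// Naive path: dequantize each weight (bias + scale * q), then multiply.
pub fn dot_dequant(scale: f32, bias: f32, codes: &[u8], x: &[f32]) -> f32 {
    codes
        .iter()
        .zip(x)
        .map(|(&q, &xi)| (bias + scale * f32::from(q)) * xi)
        .sum()
}

/// Factored path: scale * sum(q_i * x_i) + bias * sum(x_i).
/// The bias is applied once per group instead of once per weight.
pub fn dot_factored(scale: f32, bias: f32, codes: &[u8], x: &[f32]) -> f32 {
    let qx: f32 = codes.iter().zip(x).map(|(&q, &xi)| f32::from(q) * xi).sum();
    let sum_x: f32 = x.iter().sum();
    scale * qx + bias * sum_x
}
```

Both functions agree up to floating-point rounding. In a matrix-vector product the per-group Σx_i could be computed once and reused across output rows, since it does not depend on the weights.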

Full-corpus measurement (2026-07-02) — method-disclosed

Dedicated serialized run on an idle GPU, one process per leg, build 11158ab112b2
(the follow-up work that made the W3 path runnable; refs #420, #530, #531). All
three legs use the same harness and scoring method: eval_perplexity
decode-loop (M=1 teacher-forced NLL via forward_step; NOT standard prefill PPL),
full docs/bench_results/wiki.test.raw corpus (310,034 tokens → 310,033 scored,
1,211 windows), window 512 / stride 256, max cache len 4096. Runtime differs by
leg and is disclosed per row: the f16 reference runs the CPU safetensors path;
Q4 and W3 run Metal. Deltas are method-consistent A/B within this harness and
are not comparable to published prefill-PPL numbers.

| Weights | Runtime | PPL | Mean NLL (nats) | Δ PPL vs f16 | Δ PPL vs Q4 | Wall |
|---|---|---|---|---|---|---|
| f16 (safetensors) | CPU | 15.1977 | 2.7211 | | | 6,697s |
| Q4 (KHQ4) | Metal | 15.7307 | 2.7556 | +0.533 (+3.5%) | | 5,994s |
| W3 MLP-only | Metal | 18.0971 | 2.8958 | +2.899 (+19.1%) | +2.366 (+15.0%) | 5,949s |

Reading: W3's MLP-only 3-bit quantization costs +2.37 PPL over Q4 on the same
harness, roughly 4.4× Q4's own delta vs f16 (W3's total delta vs f16, +2.90, is
roughly 5.4× Q4's), in exchange for −25% MLP payload bytes. This is the
quantified silent-quality-loss number the gate below asked for.
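
The PPL and Δ columns can be re-derived from the reported values: perplexity is the exponential of the mean NLL in nats, and each delta is a plain difference of PPLs. The helpers below are a sketch for checking that arithmetic; their names are hypothetical and they are not part of `eval_perplexity`.

```rust
/// Perplexity is the exponential of the mean per-token NLL (in nats).
pub fn ppl_from_mean_nll(mean_nll: f64) -> f64 {
    mean_nll.exp()
}

/// Absolute and relative (percent) PPL delta of `candidate` vs `reference`.
pub fn ppl_delta(candidate: f64, reference: f64) -> (f64, f64) {
    let abs = candidate - reference;
    (abs, 100.0 * abs / reference)
}
```

With the table's numbers, `ppl_delta(15.7307, 15.1977)` gives about (+0.533, +3.5%) and `ppl_delta(18.0971, 15.7307)` gives about (+2.366, +15.0%). The W3-vs-Q4 delta is therefore about 4.4 times the Q4-vs-f16 delta, and the W3-vs-f16 delta about 5.4 times.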

Still unmeasured (disclosed hole): W3 decode A/B. The #420 acceptance also
calls for generation-path decode tok/s of W3 vs f16/Q4; that has not been run.
The wall clocks above are teacher-forced scoring throughput and must not be
quoted as a decode speedup.

Raw log archived locally at .khive/reports/w3_420_full_corpus_ppl.log
(structured @@lattice {"ev":"perplexity",...} events per leg).


Earlier smoke measurement — SUPERSEDED by the full-corpus table above

Local qwen3.5-0.8b, wiki corpus, window=128 stride=64. SMOKE numbers on a
capped corpus — not publication-grade.

| Path | Runtime | Max tok | PPL | Δ vs f16 | Bound |
|---|---|---|---|---|---|
| f16/BF16 | CPU | 1024 | 17.299129 | baseline | |
| Q4 | Metal | 1024 | 18.709340 | +1.410211 | outside known Q4 +0.1–0.3 |
| W3 MLP-only | N/A | N/A | N/A | N/A | INCOMPLETE: no runnable W3 path |

| Decode path | tok/s | Status |
|---|---|---|
| Q4 directory (Metal Q4_0) | 158.42 | measured |
| safetensors-direct (Metal Q8_0) | 116.70 | measured |
| W3 MLP-only | N/A | INCOMPLETE: no W3 decode path |

Existing Q4-vs-Q8_0 gap is +35.75%, confirming byte-read reduction → real
decode speedup (the mechanism W3 exploits). W3's own speedup is not measured.
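
The +35.75% figure is the ratio of the two measured tok/s values. The sketch below reproduces it and adds a purely hypothetical bandwidth-bound ceiling for a format that shrinks only the MLP bytes. The MLP share of per-step bytes is an assumed input that this PR does not state, so the ceiling is an illustration of the mechanism, not a W3 measurement or prediction.

```rust
/// Relative speedup in percent of `fast` over `slow` (both in tok/s).
pub fn speedup_pct(fast_tok_s: f64, slow_tok_s: f64) -> f64 {
    100.0 * (fast_tok_s / slow_tok_s - 1.0)
}

/// Hypothetical ceiling under a pure weight-bandwidth model: if a fraction
/// `mlp_byte_fraction` of per-step bytes shrinks by `mlp_byte_reduction`,
/// step time scales with the remaining bytes. Both inputs are assumptions.
pub fn bandwidth_bound_speedup(mlp_byte_fraction: f64, mlp_byte_reduction: f64) -> f64 {
    1.0 / (1.0 - mlp_byte_fraction * mlp_byte_reduction)
}
```

`speedup_pct(158.42, 116.70)` evaluates to about 35.75. The ceiling ignores compute, dequantization cost, and cache effects, so a real decode A/B is still required.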


SILENT QUALITY LOSS RISK

This is the founder gate. (Update 2026-07-02: the PPL magnitude is now
measured; see the full-corpus table above: +2.366 PPL vs Q4, +2.899 vs f16. The
gate is now a founder judgment on that measured trade plus the still-missing
decode A/B, no longer on an unmeasured unknown.)

A 3-bit MLP model still decodes fluent-looking tokens even when degraded, so
the loss is silent: it shows up only as higher perplexity, never as a crash.

  • W3 uses 3 bits vs Q4's 4 on the MLP (8 levels vs 16), so W3's MLP
    quantization error is strictly larger than Q4's on those tensors. Expected
    W3-vs-Q4 PPL delta is positive and larger than the MLP's share of Q4's
    +0.1–0.3 budget; exact magnitude not measured on this branch.
  • The Q4 smoke delta already landed outside the known full-eval bound (+1.15
    to +1.41), proving the capped harness is not comparability-grade. W3 must be
    evaluated on a full/comparable corpus before any acceptance.

To close the gate (none exist yet): converter → .w3 artifacts →
from_w3_mlp_dir → gemv_w3_decode; --w3-mlp-dir wired for a full-corpus run;
delta_w3_vs_{f16,q4} + delta_q4_vs_f16 on the same corpus; a decode tok/s A/B.

Additional silent surfaces (each produces plausible tokens while wrong; each
needs a test): missing .w3 → silent .q4 fallback; wrong gate_byte_size in
fused gate||up; measuring only tok/s; treating smoke PPL as a quality claim.
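
The first surface (a missing .w3 silently falling back to .q4) can be guarded by a directory check that returns an error instead of choosing a fallback. The sketch below assumes a hypothetical one-file-per-tensor layout named `<tensor>.w3` / `<tensor>.q4`; the real `from_w3_mlp_dir` layout and error types are not shown in this PR, and `check_w3_dir` and `LoadError` are illustrative names.

```rust
use std::collections::HashSet;

/// Hypothetical error type for the fail-closed directory check.
#[derive(Debug, PartialEq)]
pub enum LoadError {
    MissingW3(String),
    StrayQ4(String),
}

/// `mlp_tensors`: tensor names that must be present as W3.
/// `files`: file names found in the directory (assumed naming scheme).
pub fn check_w3_dir(mlp_tensors: &[&str], files: &HashSet<String>) -> Result<(), LoadError> {
    for &name in mlp_tensors {
        // A leftover Q4 copy of an MLP tensor is rejected rather than ignored,
        // so there is never a second format the loader could silently pick.
        if files.contains(&format!("{}.q4", name)) {
            return Err(LoadError::StrayQ4(name.to_string()));
        }
        // A missing W3 artifact is an error, never a fallback.
        if !files.contains(&format!("{}.w3", name)) {
            return Err(LoadError::MissingW3(name.to_string()));
        }
    }
    Ok(())
}
```

A test for this surface would assert on the error variant, since a fallback would otherwise still produce plausible tokens.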

Verdict: W3 quality is NOT verified. INCOMPLETE per Π_TBV. Do not merge on the
byte-reduction argument alone.


DONE vs DESIGNED

DONE (tested): weights::w3_weights CPU pack/dequant/.w3 I/O/classification
(~660 LOC); 46 W3 tests (35 inline + 11 integration), mutation-sensitivity
confirmed by fault injection, 0 workspace regressions (1364 pre-existing).
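
For orientation, the pack/dequant step for one 32-weight group can be sketched as below. This is not the `weights::w3_weights` code: the LSB-first bit order is an assumption, and scale and bias are kept as f32 for brevity where KHW3 v1 stores them as f16 (2 + 2 bytes next to the 12 packed bytes, giving the 16-byte block).

```rust
pub const GROUP: usize = 32;
pub const PACKED_BYTES: usize = GROUP * 3 / 8; // 32 codes * 3 bits = 12 bytes

/// Hypothetical in-memory group; the on-disk block uses f16 scale and bias.
pub struct W3Group {
    pub scale: f32,
    pub bias: f32,
    pub packed: [u8; PACKED_BYTES],
}

/// Asymmetric min/max quantization to unsigned 3-bit codes (0..=7).
pub fn quantize_group(w: &[f32; GROUP]) -> W3Group {
    let min = w.iter().copied().fold(f32::INFINITY, f32::min);
    let max = w.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let range = max - min;
    // Zero-range groups get scale 1.0 so every code is 0 and dequant yields `min`.
    let scale = if range > 0.0 { range / 7.0 } else { 1.0 };
    let mut packed = [0u8; PACKED_BYTES];
    for (i, &x) in w.iter().enumerate() {
        let q = ((x - min) / scale).round().clamp(0.0, 7.0) as u8;
        let bit = i * 3;
        let (byte, shift) = (bit / 8, bit % 8);
        packed[byte] |= q << shift; // bits shifted past bit 7 are dropped here
        if shift > 5 {
            // The code straddles a byte boundary: spill the high bits.
            packed[byte + 1] |= q >> (8 - shift);
        }
    }
    W3Group { scale, bias: min, packed }
}

/// Unpack the 3-bit codes and reconstruct bias + scale * q.
pub fn dequantize_group(g: &W3Group) -> [f32; GROUP] {
    let mut out = [0.0f32; GROUP];
    for (i, o) in out.iter_mut().enumerate() {
        let bit = i * 3;
        let (byte, shift) = (bit / 8, bit % 8);
        let mut q = g.packed[byte] >> shift;
        if shift > 5 {
            q |= g.packed[byte + 1] << (8 - shift);
        }
        *o = g.bias + g.scale * f32::from(q & 0x7);
    }
    out
}
```

With f32 scale and bias, round-to-nearest keeps each reconstructed weight within half a quantization step (`scale / 2`) of the original; storing the parameters as f16 adds their own rounding on top of that.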

DESIGNED, NOT built: converter binary, mixed W3/Q4 Metal loader,
gemv_w3_decode/gemm_w3 kernels, CLI/bench/--w3-mlp-dir wiring, and the W3
PPL/decode measurement.

Gates (actual)

cargo fmt --check                                              → clean (exit 0)
cargo clippy --workspace -- -D warnings                       → 0 warnings (exit 0)
cargo clippy -p lattice-inference --features metal-gpu -D warnings → 0 warnings (exit 0)
cargo test -p lattice-inference --lib w3_weights              → 35 passed; 0 failed
cargo test -p lattice-inference --test w3_weights_integration → 11 passed; 0 failed

🤖 Generated with Claude Code

ohdearquant and others added 3 commits July 1, 2026 15:50
CPU-side W3 (3-bit) asymmetric weight quantization for dense MLP projections: 32-weight groups, f16 scale+bias, 12 packed bytes per 16-byte block (KHW3 v1). 25% payload / 20% block byte reduction vs Q4, targeting decode weight-bandwidth. is_w3_mlp_tensor_name fails closed on MoE/attention/GDN/embed/lm_head. Library APIs return Result; no unwrap/assert in public shape validation. Metal kernel, converter, and loader are DESIGNED but NOT implemented (review surface, founder-gated).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…420)

11 independent integration tests for weights::w3_weights: happy-path half-step error bound, group-boundary isolation on non-multiple-of-32 shapes, exact vs one-past-multiple sizes, degenerate/zero-range blocks, empty/single-element edges, real temp-file roundtrip, and a quantization-level mutation guard. Confirmed mutation-sensitive by fault injection. 46 W3 tests total (35 inline + 11 integration), 0 workspace regressions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ace (#420)

Design doc: 3-bit format (packing/group/scale-bias), MLP-only layer scope, Metal kernel approach, measured f16/Q4 PPL + decode A/B, and an explicit SILENT QUALITY LOSS RISK section quantifying the unmeasured W3 PPL cost. Founder-gated review surface. W3 quality is UNVERIFIED (no runnable W3 quality-measurement path yet) and INCOMPLETE per the measure-or-declare rule.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions Bot commented Jul 1, 2026

E2E Parity Report

FAIL: 1/4 prompts diverged within their match windows

| Prompt | Window | Agreement | First Diff | HF tok/s | Lattice tok/s | Verdict |
|---|---|---|---|---|---|---|
| The capital of France is | 3 | 3/15 | pos 3 | 0.3 | 1.9 | PASS |
| In the year 2024, artificial intelligence | 3 | 10/15 | pos 9 | 0.3 | 2.2 | PASS |
| `def fibonacci(n): if n <= 1: return n return` | 3 | 15/15 | none | 0.3 | 0.7 | PASS |
| def merge_sort(arr): """ Merge sort implementation. | 2 | 3/15 | pos 0 | 0.2 | 0.1 | FAIL |
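
The table's columns are consistent with a simple token-level comparison. The sketch below is an inference from the rows shown (window 3 with first diff at pos 3 passes; window 2 with first diff at pos 0 fails), not the bot's actual implementation, and all function names are hypothetical.

```rust
/// Position of the first mismatching token, or None if all compared tokens agree.
pub fn first_diff(a: &[u32], b: &[u32]) -> Option<usize> {
    a.iter().zip(b).position(|(x, y)| x != y)
}

/// Number of positions where both sequences emit the same token.
pub fn agreement(a: &[u32], b: &[u32]) -> usize {
    a.iter().zip(b).filter(|(x, y)| x == y).count()
}

/// Assumed verdict rule: pass when no mismatch falls inside the first
/// `window` tokens (a later divergence is tolerated).
pub fn passes(a: &[u32], b: &[u32], window: usize) -> bool {
    first_diff(a, b).map_or(true, |pos| pos >= window)
}
```

Under this reading, Agreement counts matching positions anywhere in the 15 generated tokens, which is why a row can show 10/15 with a first diff at pos 9.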

The capital of France is

  • HF: Paris.
    The capital of France is Paris.
    The capital of France
  • Lattice: Paris.
    A: Yes, the capital of France is Paris.

In the year 2024, artificial intelligence

  • HF: (AI) has become a significant part of the global economy. It is
  • Lattice: (AI) has become a significant part of our daily lives. From personal

def fibonacci(n): if n <= 1: return n return

  • HF: fibonacci(n-1) + fibonacci(n-2)
    print(fib
  • Lattice: fibonacci(n-1) + fibonacci(n-2)
    print(fib

def merge_sort(arr): """ Merge sort implementation.

  • HF: merge_sort(arr):
    """
    Merge sort implementation.

  • Lattice: main():
    Test cases
    test_cases = [
