
fix(vindex): write + validate Gemma-4 PLE sidecars on --quant none (#49) #91

Open

deem0n wants to merge 2 commits into chrishayuk:main from deem0n:gemma4-ple-noquant-fix

Conversation

deem0n (Contributor) commented May 13, 2026

Closes #49.

Summary

The Q4_K writer captured Gemma-4 Per-Layer Embeddings (PLE) tensors correctly; the float writer didn't. `larql extract <model>` against `gemma-4-E4B-it` and other PLE-active variants therefore produced a vindex with zero PLE entries on the default `--quant none` / `--quant f16` path. The forward path's `precompute_per_layer_inputs` / `apply_per_layer_embedding` silently fell through (`return Vec::new()` / `return h.clone()`) when those tensors were missing — INFER returned near-random tokens at 5–9% confidence with no warning. (See #49 for the full evidence chain.)
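For context on the failure mode, here is a minimal sketch of that silent fall-through (the weight-store type, key names, and function bodies are illustrative assumptions, not the actual larql source):

```rust
use std::collections::HashMap;

// Hypothetical simplification of the pre-fix forward/ple.rs behavior.
fn precompute_per_layer_inputs(weights: &HashMap<String, Vec<f32>>) -> Vec<f32> {
    match weights.get("per_layer_model_projection.weight") {
        Some(proj) => proj.clone(), // ...the real code projects per-layer inputs here
        None => Vec::new(),         // tensor missing: PLE silently skipped
    }
}

fn apply_per_layer_embedding(weights: &HashMap<String, Vec<f32>>, h: &[f32]) -> Vec<f32> {
    match weights.get("per_layer_projection.weight") {
        Some(_proj) => h.to_vec(),  // ...the real code mixes the PLE contribution in
        None => h.to_vec(),         // tensor missing: hidden state passes through unchanged
    }
}
```

Both arms type-check to a valid return value, so nothing forced the missing-tensor case to surface.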

What changed

Fix in five parts, mirroring the Q4_K path so future quant modes inherit correctness for free.

1. Shared `weights::ple_sidecar` module (refactor)
Extracted `write_q4k::ple::write_ple_weights` into `weights/ple_sidecar.rs`. Both writers now call it. Same bytes on disk, same `kind::TENSOR_F16` manifest entries — pure refactor, no behavior change in the Q4_K path. The point is to prevent the bug from reappearing: dedup means "anything fixed in one writer is automatically applied to the other."
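A minimal sketch of the shape of that dedup (module path from the PR; the signature and byte handling are illustrative assumptions):

```rust
// weights/ple_sidecar.rs — hypothetical sketch of the shared helper.
// Both write_f32 and write_q4k delegate here, so the sidecar bytes and
// manifest entries are identical regardless of quant mode.
use std::io::{self, Write};

pub fn write_ple_weights(out: &mut impl Write, ple_tensors: &[(String, Vec<u8>)]) -> io::Result<u64> {
    let mut written = 0u64;
    for (_key, f16_bytes) in ple_tensors {
        out.write_all(f16_bytes)?; // TENSOR_F16 payload, unchanged by quant mode
        written += f16_bytes.len() as u64;
    }
    Ok(written)
}
```

The design point: any future quant writer gets PLE correctness by construction, simply by calling the same helper.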

2. `write_f32`: call the shared helper + capture Gemma-4 norms
After the existing per-layer norms block, the float writer now:

  • Adds `arch.layer_scalar_key(layer)` to the norms loop (the Gemma-4 per-layer multiplier — `None` on every other arch, so this is a true superset; see the sketch below).
  • Adds `arch.post_per_layer_input_norm_key(layer)` for PLE-active archs.
  • Writes the global `per_layer_projection_norm.weight` after the final `norm.weight`.
  • Calls `super::ple_sidecar::write_ple_weights` to emit `ple_weights.bin`.

The Q4_K norms writer already did all of this — the float writer dropping them was the same family of bug.
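A minimal sketch of the extended norms loop (the accessors follow the names above; the `Arch` struct, key strings, and base norm key are illustrative stand-ins, not the real larql types):

```rust
// Hypothetical sketch: the per-layer norms loop as a strict key superset.
struct Arch { has_ple: bool }

impl Arch {
    // Gemma-4 per-layer multiplier; None on every other arch.
    fn layer_scalar_key(&self, layer: usize) -> Option<String> {
        self.has_ple.then(|| format!("model.layers.{layer}.layer_scalar"))
    }
    // Post-PLE input norm; also None on non-PLE archs.
    fn post_per_layer_input_norm_key(&self, layer: usize) -> Option<String> {
        self.has_ple.then(|| format!("model.layers.{layer}.post_per_layer_input_norm.weight"))
    }
}

fn norm_keys_for_layer(arch: &Arch, layer: usize) -> Vec<String> {
    let mut keys = vec![format!("model.layers.{layer}.norm.weight")]; // placeholder base key
    // Both helpers yield None on non-PLE archs, so this is a true superset
    // of the old loop — no behavior change for non-Gemma models.
    keys.extend(arch.layer_scalar_key(layer));
    keys.extend(arch.post_per_layer_input_norm_key(layer));
    keys
}
```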

3. `load/f32.rs`: decode `TENSOR_F16` + validate at load time
Extended the manifest-decode match arm to handle `kind::TENSOR_F16` (mirrors `load/q4k.rs`). After the entries loop, when `arch.has_per_layer_embeddings()` is true, the loader requires every PLE tensor + vector + per-layer `layer_scalar` to be present. Missing entries fail with:

    vindex is missing Gemma-4 PLE sidecar tensors (<N> entries, e.g. <keys>).
    Rebuild with current larql — older extracts dropped these on --quant none / f16
    and silently produced garbage INFER (chrishayuk/larql#49).

Stale (pre-fix) vindexes now fail loud with an actionable rebuild hint instead of silently producing garbage INFER.
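A minimal sketch of that load-time check (the error plumbing and required-key list are stand-ins; only the message text comes from the PR):

```rust
use std::collections::HashSet;

// Hypothetical sketch of the post-load PLE completeness check.
fn validate_ple_sidecars(loaded: &HashSet<String>, required: &[String]) -> Result<(), String> {
    let missing: Vec<&str> = required
        .iter()
        .filter(|k| !loaded.contains(k.as_str()))
        .map(String::as_str)
        .collect();
    if missing.is_empty() {
        return Ok(());
    }
    // Fail loud with an actionable rebuild hint instead of deferring to
    // the forward path, where the old code silently fell through.
    Err(format!(
        "vindex is missing Gemma-4 PLE sidecar tensors ({} entries, e.g. {}). \
         Rebuild with current larql — older extracts dropped these on --quant none / f16 \
         and silently produced garbage INFER (chrishayuk/larql#49).",
        missing.len(),
        missing.iter().take(3).copied().collect::<Vec<_>>().join(", "),
    ))
}
```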

4. `forward/ple.rs`: replace silent fall-throughs with panics
Replaced the three silent fall-through paths (`return Vec::new()` for the model projection, `return h.clone()` for the gate/projection tensors) with `panic!()` calls that name the missing key + reference #49. The loader enforces the invariant up front, so a missing tensor here is now a load-path bug, not a fall-through case. The legitimate `!arch.has_per_layer_embeddings()` early-exit and the `per_layer_input == None` opt-out stay (those are real architectural states).
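A minimal sketch of the replacement shape (the helper name and weight-store type are illustrative assumptions):

```rust
use std::collections::HashMap;

// Hypothetical sketch: a missing PLE tensor is now an invariant violation,
// not a quiet no-op. The loader has already validated completeness.
fn require_ple_tensor<'a>(weights: &'a HashMap<String, Vec<f32>>, key: &str) -> &'a [f32] {
    weights.get(key).map(Vec::as_slice).unwrap_or_else(|| {
        panic!(
            "PLE tensor `{key}` missing at forward time despite load-time \
             validation — load-path bug (see chrishayuk/larql#49)"
        )
    })
}
```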

5. Test scaffolding

  • Factored the synthetic Gemma-4 PLE fixture into a `write_gemma4_ple_fixture` helper.
  • Extended it to include the per-layer `layer_scalar` 1-element vector — previously uncovered, even on the Q4_K side.
  • Added `streaming_extract_noquant_carries_ple_tensors`, the direct regression test for #49 (Gemma-4 PLE tensors silently dropped on `--quant none` extracts → garbage INFER). It builds the fixture, runs the real extract with `--quant none` / `ExtractLevel::Inference`, asserts `ple_weights.bin` exists, asserts the manifest has every `tensor_f16` PLE entry + every vector PLE norm + every `layer_scalar`, then loads via `load_model_weights` and asserts every tensor / vector key + shape.
  • Extended `streaming_extract_q4k_carries_ple_tensors` with a parallel `layer_scalar` assertion.

Out of scope

Issue #49 calls out three adjacent bugs ("A. WalkFfn forward_moe_full_layer", "B. predict_with_ffn_attention shared KV", "C. LQL INFER chat-wrap"). Each needs its own change set; this PR doesn't touch them.

Test plan

  • `cargo test -p larql-vindex --test test_vindex -- streaming_extract_q4k_carries_ple_tensors streaming_extract_noquant_carries_ple_tensors` — 2/2 OK
  • `cargo test -p larql-vindex -p larql-inference --no-fail-fast` — 2342 passed, 1 failed (pre-existing macOS-local `vector_extractor_ffn_down_byte_identical` golden drift, unrelated)
  • `cargo test --workspace --no-default-features --no-fail-fast` — 4880 passed (+1 vs the 4879 baseline from the windows-fix run, due to the new test), same single pre-existing failure, 40 ignored
  • `cargo fmt -p larql-vindex -p larql-inference -- --check` — OK
  • `cargo clippy -p larql-vindex --tests --no-deps -- -D warnings` — OK
  • `cargo clippy -p larql-inference --lib --tests --benches --no-deps -- -D warnings` — OK
  • Cannot validate locally: a real gemma-4-E4B-it end-to-end extract → INFER. The synthetic regression test exercises the writer + loader + manifest paths the issue's evidence chain proved were broken; the on-disk byte layout matches the Q4_K path the issue's mitigation already confirmed correct. Someone with the model weights checked out should re-run the issue's reproducer (`larql extract google/gemma-4-E4B-it -o e4b.vindex; larql lql 'USE "e4b.vindex"; INFER "The capital of France is" TOP 5;'`) and confirm "Paris" appears at high confidence.

🤖 Generated with Claude Code

Commit 7d50997 — test(vindex): cover the PLE-arch missing-sidecar load rejection

This PR's first round left load/f32.rs's line coverage at 73.42%, just under the 75% policy floor for that file. The fix's new `if arch.has_per_layer_embeddings() { … }` validation block added ~30 lines that weren't exercised by either of the round-trip tests (both deliberately produce a complete vindex, so the missing-tensor branches stay cold).

Add `load_model_weights_rejects_ple_arch_with_missing_sidecars`: build a complete Gemma-4 PLE fixture, run the extract, then rewrite `weight_manifest.json` to drop every PLE entry — exactly what the pre-fix writer did — and call `load_model_weights`. Asserts the returned `VindexError::Parse` names "Gemma-4 PLE sidecar" and points at chrishayuk#49. Exercises every branch of the new validation block end-to-end.
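A minimal sketch of the manifest rewrite at the heart of that test (assumes `serde_json`; the entry layout and key substrings are illustrative stand-ins):

```rust
use serde_json::Value;

// Hypothetical sketch: strip every PLE entry from weight_manifest.json,
// reproducing what the pre-fix float writer emitted.
fn drop_ple_entries(manifest_json: &str) -> serde_json::Result<String> {
    let mut doc: Value = serde_json::from_str(manifest_json)?;
    if let Some(entries) = doc.get_mut("entries").and_then(Value::as_array_mut) {
        entries.retain(|entry| {
            let key = entry.get("key").and_then(Value::as_str).unwrap_or("");
            // Keep everything except PLE tensors, PLE norms, and layer_scalar.
            !(key.contains("per_layer") || key.contains("layer_scalar"))
        });
    }
    serde_json::to_string_pretty(&doc)
}
```

The test then points `load_model_weights` at the rewritten vindex and asserts the error is the loud rejection from part 3 rather than a silent load.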

cargo test -p larql-vindex --test test_vindex --
  load_model_weights_rejects_ple_arch_with_missing_sidecars
  streaming_extract_q4k_carries_ple_tensors
  streaming_extract_noquant_carries_ple_tensors
  3/3 OK

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
deem0n (Contributor, Author) commented May 14, 2026

Seems I fixed the bug, but found another one ;-)

Real-model validation on google/gemma-4-E4B-it

Ran the issue's exact reproducer on the unpacked model (16 GB safetensors, PLE active per `hidden_size_per_layer_input: 256`; 42 layers / 2560 hidden / 10240 intermediate / 262144 vocab / softcap 30.0). Extracted with this PR's build at the default `--quant none` (`--summary-features-per-expert 64` to keep down_meta tractable — orthogonal to the fix). The vindex's `weight_manifest.json` lists every PLE sidecar entry; `ple_weights.bin` is 5.8 GB; `norms.bin` includes `layer_scalar` and the post-PLE norms.

larql lql 'USE "/tmp/e4b.vindex"; INFER "The capital of France is" TOP 5;'

  #   Pre-fix (per #49)      Post-fix (this PR)
  1   mutable    8.92%       France   99.68%
  2   ッケ        7.39%       French    0.07%
  3   ceral      7.17%       is        0.05%
  4              4.77%       The       0.03%
  5   sby        4.76%       the       0.02%

Random low-confidence garbage → coherent high-confidence English. The PLE contribution is being applied correctly across all 42 layers (the writer–loader–forward chain is intact). The top token is France rather than Paris because the IT model completes "The capital of France is France" without a chat template — that's adjacent issue C in #49 (LQL INFER doesn't chat-wrap), out of scope here.

Same prompt with explicit Gemma-4 chat wrap

larql lql $'USE "/tmp/e4b.vindex"; INFER "<bos><|turn>user\nThe capital of France is<turn|>\n<|turn>model\n" TOP 10;'

  #   Token    Confidence
  1   The        99.69%    (starts "The capital is Paris.")
  2   Paris       0.16%
  3   **          0.15%
  9   Berlin      0.0001%

Paris at rank 2, Berlin in the top 10 — orders of magnitude above the noise floor. The PLE-driven geography knowledge is fully restored. The 99.69% on The reflects the IT model's preference for emitting a wrapping sentence rather than the bare answer.

Out-of-scope observation

Multi-token decode via `larql run /tmp/e4b.vindex "The capital of France is" --max-tokens 5` still produces garbage (ッケッケTobchal的存在). The first generated token via `--max-tokens 1` is coherent — only continued decode goes wrong. That's a different bug (decode-time PLE bookkeeping in the KV-cached path) and doesn't share a root cause with #49. Worth a follow-up issue; not blocking this PR.

Coverage gate

The earlier coverage drop on `crates/larql-vindex/src/format/weights/load/f32.rs` is now back over the 75% floor — see commit 7d50997 (test(vindex): cover the PLE-arch missing-sidecar load rejection), which exercises every branch of the new validation block end-to-end.
