
fix(vindex): write + validate Gemma-4 PLE sidecars on --quant none (#49) #91

Open

deem0n wants to merge 2 commits into chrishayuk:main from deem0n:gemma4-ple-noquant-fix

Conversation

deem0n (Contributor) commented May 13, 2026

Closes #49.

Summary

The Q4_K writer captured Gemma-4 Per-Layer Embeddings (PLE) tensors correctly; the float writer didn't. `larql extract <model>` against `gemma-4-E4B-it` and other PLE-active variants therefore produced a vindex with zero PLE entries on the default `--quant none` / `--quant f16` path. The forward path's `precompute_per_layer_inputs` / `apply_per_layer_embedding` silently fell through (`return Vec::new()` / `return h.clone()`) when those tensors were missing — INFER returned near-random tokens at 5–9% confidence with no warning. (See #49 for the full evidence chain.)
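For context on the failure mode, here is a minimal sketch of that silent fall-through (the weight-store type, key names, and function bodies are illustrative assumptions, not the actual larql source):

```rust
use std::collections::HashMap;

// Hypothetical simplification of the pre-fix forward/ple.rs behavior.
fn precompute_per_layer_inputs(weights: &HashMap<String, Vec<f32>>) -> Vec<f32> {
    match weights.get("per_layer_model_projection.weight") {
        Some(proj) => proj.clone(), // ...the real code projects per-layer inputs here
        None => Vec::new(),         // tensor missing: PLE silently skipped
    }
}

fn apply_per_layer_embedding(weights: &HashMap<String, Vec<f32>>, h: &[f32]) -> Vec<f32> {
    match weights.get("per_layer_projection.weight") {
        Some(_proj) => h.to_vec(),  // ...the real code mixes the PLE contribution in
        None => h.to_vec(),         // tensor missing: hidden state passes through unchanged
    }
}
```

Both arms type-check to a valid return value, so nothing forced the missing-tensor case to surface.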

What changed

Fix in five parts, mirroring the Q4_K path so future quant modes inherit correctness for free.

1. Shared `weights::ple_sidecar` module (refactor)
Extracted `write_q4k::ple::write_ple_weights` into `weights/ple_sidecar.rs`. Both writers now call it. Same bytes on disk, same `kind::TENSOR_F16` manifest entries — pure refactor, no behavior change in the Q4_K path. The point is to prevent the bug from reappearing: dedup means "anything fixed in one writer is automatically applied to the other."
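A minimal sketch of the shape of that dedup (module path from the PR; the signature and byte handling are illustrative assumptions):

```rust
// weights/ple_sidecar.rs — hypothetical sketch of the shared helper.
// Both write_f32 and write_q4k delegate here, so the sidecar bytes and
// manifest entries are identical regardless of quant mode.
use std::io::{self, Write};

pub fn write_ple_weights(out: &mut impl Write, ple_tensors: &[(String, Vec<u8>)]) -> io::Result<u64> {
    let mut written = 0u64;
    for (_key, f16_bytes) in ple_tensors {
        out.write_all(f16_bytes)?; // TENSOR_F16 payload, unchanged by quant mode
        written += f16_bytes.len() as u64;
    }
    Ok(written)
}
```

The design point: any future quant writer gets PLE correctness by construction, simply by calling the same helper.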

2. `write_f32`: call the shared helper + capture Gemma-4 norms
After the existing per-layer norms block, the float writer now:

  • Adds `arch.layer_scalar_key(layer)` to the norms loop (the Gemma-4 per-layer multiplier — `None` on every other arch, so this is a true superset; see the sketch below).
  • Adds `arch.post_per_layer_input_norm_key(layer)` for PLE-active archs.
  • Writes the global `per_layer_projection_norm.weight` after the final `norm.weight`.
  • Calls `super::ple_sidecar::write_ple_weights` to emit `ple_weights.bin`.

The Q4_K norms writer already did all of this — the float writer dropping them was the same family of bug.
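A minimal sketch of the extended norms loop (the accessors follow the names above; the `Arch` struct, key strings, and base norm key are illustrative stand-ins, not the real larql types):

```rust
// Hypothetical sketch: the per-layer norms loop as a strict key superset.
struct Arch { has_ple: bool }

impl Arch {
    // Gemma-4 per-layer multiplier; None on every other arch.
    fn layer_scalar_key(&self, layer: usize) -> Option<String> {
        self.has_ple.then(|| format!("model.layers.{layer}.layer_scalar"))
    }
    // Post-PLE input norm; also None on non-PLE archs.
    fn post_per_layer_input_norm_key(&self, layer: usize) -> Option<String> {
        self.has_ple.then(|| format!("model.layers.{layer}.post_per_layer_input_norm.weight"))
    }
}

fn norm_keys_for_layer(arch: &Arch, layer: usize) -> Vec<String> {
    let mut keys = vec![format!("model.layers.{layer}.norm.weight")]; // placeholder base key
    // Both helpers yield None on non-PLE archs, so this is a true superset
    // of the old loop — no behavior change for non-Gemma models.
    keys.extend(arch.layer_scalar_key(layer));
    keys.extend(arch.post_per_layer_input_norm_key(layer));
    keys
}
```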

3. `load/f32.rs`: decode `TENSOR_F16` + validate at load time
Extended the manifest-decode match arm to handle `kind::TENSOR_F16` (mirrors `load/q4k.rs`). After the entries loop, when `arch.has_per_layer_embeddings()` is true, the loader requires every PLE tensor + vector + per-layer `layer_scalar` to be present. Missing entries fail with:

    vindex is missing Gemma-4 PLE sidecar tensors (<N> entries, e.g. <keys>).
    Rebuild with current larql — older extracts dropped these on --quant none / f16
    and silently produced garbage INFER (chrishayuk/larql#49).

Stale (pre-fix) vindexes now fail loud with an actionable rebuild hint instead of silently producing garbage INFER.
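A minimal sketch of that load-time check (the error plumbing and required-key list are stand-ins; only the message text comes from the PR):

```rust
use std::collections::HashSet;

// Hypothetical sketch of the post-load PLE completeness check.
fn validate_ple_sidecars(loaded: &HashSet<String>, required: &[String]) -> Result<(), String> {
    let missing: Vec<&str> = required
        .iter()
        .filter(|k| !loaded.contains(k.as_str()))
        .map(String::as_str)
        .collect();
    if missing.is_empty() {
        return Ok(());
    }
    // Fail loud with an actionable rebuild hint instead of deferring to
    // the forward path, where the old code silently fell through.
    Err(format!(
        "vindex is missing Gemma-4 PLE sidecar tensors ({} entries, e.g. {}). \
         Rebuild with current larql — older extracts dropped these on --quant none / f16 \
         and silently produced garbage INFER (chrishayuk/larql#49).",
        missing.len(),
        missing.iter().take(3).copied().collect::<Vec<_>>().join(", "),
    ))
}
```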

4. `forward/ple.rs`: replace silent fall-throughs with panics
Replaced the three silent fall-through paths (`return Vec::new()` for the model projection, `return h.clone()` for the gate/projection tensors) with `panic!()` calls that name the missing key + reference #49. The loader enforces the invariant up front, so a missing tensor here is now a load-path bug, not a fall-through case. The legitimate `!arch.has_per_layer_embeddings()` early-exit and the `per_layer_input == None` opt-out stay (those are real architectural states).
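A minimal sketch of the replacement shape (the helper name and weight-store type are illustrative assumptions):

```rust
use std::collections::HashMap;

// Hypothetical sketch: a missing PLE tensor is now an invariant violation,
// not a quiet no-op. The loader has already validated completeness.
fn require_ple_tensor<'a>(weights: &'a HashMap<String, Vec<f32>>, key: &str) -> &'a [f32] {
    weights.get(key).map(Vec::as_slice).unwrap_or_else(|| {
        panic!(
            "PLE tensor `{key}` missing at forward time despite load-time \
             validation — load-path bug (see chrishayuk/larql#49)"
        )
    })
}
```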

5. Test scaffolding

  • Factored the synthetic Gemma-4 PLE fixture into a `write_gemma4_ple_fixture` helper.
  • Extended it to include the per-layer `layer_scalar` 1-element vector — previously uncovered, even on the Q4_K side.
  • Added `streaming_extract_noquant_carries_ple_tensors`, the direct regression test for #49 (Gemma-4 PLE tensors silently dropped on `--quant none` extracts → garbage INFER). It builds the fixture, runs the real extract with `--quant none` / `ExtractLevel::Inference`, asserts `ple_weights.bin` exists, asserts the manifest has every `tensor_f16` PLE entry + every vector PLE norm + every `layer_scalar`, then loads via `load_model_weights` and asserts every tensor / vector key + shape.
  • Extended `streaming_extract_q4k_carries_ple_tensors` with a parallel `layer_scalar` assertion.

Out of scope

Issue #49 calls out three adjacent bugs ("A. WalkFfn forward_moe_full_layer", "B. predict_with_ffn_attention shared KV", "C. LQL INFER chat-wrap"). Each needs its own change set; this PR doesn't touch them.

Test plan

  • `cargo test -p larql-vindex --test test_vindex -- streaming_extract_q4k_carries_ple_tensors streaming_extract_noquant_carries_ple_tensors` — 2/2 OK
  • `cargo test -p larql-vindex -p larql-inference --no-fail-fast` — 2342 passed, 1 failed (pre-existing macOS-local `vector_extractor_ffn_down_byte_identical` golden drift, unrelated)
  • `cargo test --workspace --no-default-features --no-fail-fast` — 4880 passed (+1 vs the 4879 baseline from the windows-fix run, due to the new test), same single pre-existing failure, 40 ignored
  • `cargo fmt -p larql-vindex -p larql-inference -- --check` — OK
  • `cargo clippy -p larql-vindex --tests --no-deps -- -D warnings` — OK
  • `cargo clippy -p larql-inference --lib --tests --benches --no-deps -- -D warnings` — OK
  • Cannot validate locally: a real gemma-4-E4B-it end-to-end extract → INFER. The synthetic regression test exercises the writer + loader + manifest paths the issue's evidence chain proved were broken; the on-disk byte layout matches the Q4_K path the issue's mitigation already confirmed correct. Someone with the model weights checked out should re-run the issue's reproducer (`larql extract google/gemma-4-E4B-it -o e4b.vindex; larql lql 'USE "e4b.vindex"; INFER "The capital of France is" TOP 5;'`) and confirm "Paris" appears at high confidence.

🤖 Generated with Claude Code

Commit 7d50997 — test(vindex): cover the PLE-arch missing-sidecar load rejection

This PR's first round left load/f32.rs's line coverage at 73.42%, just under the 75% policy floor for that file. The fix's new `if arch.has_per_layer_embeddings() { … }` validation block added ~30 lines that weren't exercised by either of the round-trip tests (both deliberately produce a complete vindex, so the missing-tensor branches stay cold).

Add `load_model_weights_rejects_ple_arch_with_missing_sidecars`: build a complete Gemma-4 PLE fixture, run the extract, then rewrite `weight_manifest.json` to drop every PLE entry — exactly what the pre-fix writer did — and call `load_model_weights`. Asserts the returned `VindexError::Parse` names "Gemma-4 PLE sidecar" and points at chrishayuk#49. Exercises every branch of the new validation block end-to-end.
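A minimal sketch of the manifest rewrite at the heart of that test (assumes `serde_json`; the entry layout and key substrings are illustrative stand-ins):

```rust
use serde_json::Value;

// Hypothetical sketch: strip every PLE entry from weight_manifest.json,
// reproducing what the pre-fix float writer emitted.
fn drop_ple_entries(manifest_json: &str) -> serde_json::Result<String> {
    let mut doc: Value = serde_json::from_str(manifest_json)?;
    if let Some(entries) = doc.get_mut("entries").and_then(Value::as_array_mut) {
        entries.retain(|entry| {
            let key = entry.get("key").and_then(Value::as_str).unwrap_or("");
            // Keep everything except PLE tensors, PLE norms, and layer_scalar.
            !(key.contains("per_layer") || key.contains("layer_scalar"))
        });
    }
    serde_json::to_string_pretty(&doc)
}
```

The test then points `load_model_weights` at the rewritten vindex and asserts the error is the loud rejection from part 3 rather than a silent load.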

cargo test -p larql-vindex --test test_vindex --
  load_model_weights_rejects_ple_arch_with_missing_sidecars
  streaming_extract_q4k_carries_ple_tensors
  streaming_extract_noquant_carries_ple_tensors
  3/3 OK

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
deem0n (Contributor, Author) commented May 14, 2026

Seems I fixed the bug, but found another one ;-)

Real-model validation on google/gemma-4-E4B-it

Ran the issue's exact reproducer on the unpacked model (16 GB safetensors, PLE active per `hidden_size_per_layer_input: 256`; 42 layers / 2560 hidden / 10240 intermediate / 262144 vocab / softcap 30.0). Extracted with this PR's build at the default `--quant none` (`--summary-features-per-expert 64` to keep down_meta tractable — orthogonal to the fix). The vindex's `weight_manifest.json` lists every PLE sidecar entry; `ple_weights.bin` is 5.8 GB; `norms.bin` includes `layer_scalar` and the post-PLE norms.

larql lql 'USE "/tmp/e4b.vindex"; INFER "The capital of France is" TOP 5;'

  #   Pre-fix (per #49)      Post-fix (this PR)
  1   mutable    8.92%       France   99.68%
  2   ッケ        7.39%       French    0.07%
  3   ceral      7.17%       is        0.05%
  4              4.77%       The       0.03%
  5   sby        4.76%       the       0.02%

Random low-confidence garbage → coherent high-confidence English. The PLE contribution is being applied correctly across all 42 layers (the writer–loader–forward chain is intact). The top token is France rather than Paris because the IT model completes "The capital of France is France" without a chat template — that's adjacent issue C in #49 (LQL INFER doesn't chat-wrap), out of scope here.

Same prompt with explicit Gemma-4 chat wrap

larql lql $'USE "/tmp/e4b.vindex"; INFER "<bos><|turn>user\nThe capital of France is<turn|>\n<|turn>model\n" TOP 10;'

  #   Token    Confidence
  1   The        99.69%    (starts "The capital is Paris.")
  2   Paris       0.16%
  3   **          0.15%
  9   Berlin      0.0001%

Paris at rank 2, Berlin in the top 10 — orders of magnitude above the noise floor. The PLE-driven geography knowledge is fully restored. The 99.69% on The reflects the IT model's preference for emitting a wrapping sentence rather than the bare answer.

Out-of-scope observation

Multi-token decode via `larql run /tmp/e4b.vindex "The capital of France is" --max-tokens 5` still produces garbage (ッケッケTobchal的存在). The first generated token via `--max-tokens 1` is coherent — only continued decode goes wrong. That's a different bug (decode-time PLE bookkeeping in the KV-cached path) and doesn't share a root cause with #49. Worth a follow-up issue; not blocking this PR.

Coverage gate

The earlier coverage drop on `crates/larql-vindex/src/format/weights/load/f32.rs` is now back over the 75% floor — see commit 7d50997 (test(vindex): cover the PLE-arch missing-sidecar load rejection), which exercises every branch of the new validation block end-to-end.
