fix(vindex): write + validate Gemma-4 PLE sidecars on --quant none (#49) #91
deem0n wants to merge 2 commits
Conversation
Closes chrishayuk#49.

The Q4_K writer captured Gemma-4 Per-Layer Embeddings (PLE) tensors correctly; the float writer didn't. `larql extract <model>` against `gemma-4-E4B-it` and other PLE-active variants therefore produced a vindex with zero PLE entries on the default `--quant none` / `--quant f16` path, and the forward path's `precompute_per_layer_inputs` / `apply_per_layer_embedding` silently fell through (`return Vec::new()` / `return h.clone()`) when those tensors were missing — INFER returned near-random tokens at 5–9% confidence with no warning.

Fix in five parts, mirroring the Q4_K path so future quant modes inherit correctness for free:

1. Extract the existing `write_q4k::ple::write_ple_weights` into a shared `weights::ple_sidecar` module. Same bytes on disk, same `kind::TENSOR_F16` manifest entries — pure refactor.
2. `write_f32`: call the shared helper after the existing norms block. Also extend the per-layer norms loop with `arch.layer_scalar_key(layer)` (the Gemma-4 per-layer multiplier, `None` on non-Gemma archs) and `arch.post_per_layer_input_norm_key`, plus the global `per_layer_projection_norm.weight`. The Q4_K norms writer already wrote all of these — the float writer dropping them was the same family of bug.
3. `load/f32.rs`: extend the manifest-decode match to handle `kind::TENSOR_F16` (mirrors what the Q4_K loader already does) and validate post-load that every Gemma-4 PLE tensor + vector + `layer_scalar` is hydrated. Stale (pre-fix) vindexes now fail loud with an actionable rebuild hint instead of silently producing garbage INFER.
4. `forward/ple.rs`: replace the three silent fall-through paths with `panic!()` calls. The loader now enforces the invariant up front, so missing tensors here is a bug, not a fall-through case. The legitimate `!arch.has_per_layer_embeddings()` early-exit and the `per_layer_input == None` opt-out stay.
5. Test scaffolding: factor the Gemma-4 PLE fixture in `tests/test_vindex.rs` into a `write_gemma4_ple_fixture` helper, extend it to include the per-layer `layer_scalar` 1-element vector that previously wasn't covered, and add `streaming_extract_noquant_carries_ple_tensors` — the direct regression test for the bug. The existing Q4_K test gains a parallel `layer_scalar` assertion.

Adjacent issues called out in chrishayuk#49 but out of scope here (#A WalkFfn forward_moe_full_layer, #B predict_with_ffn_attention shared KV, #C LQL INFER chat-wrap) need their own changes.

Validation:
- `cargo test -p larql-vindex --test test_vindex -- streaming_extract_q4k_carries_ple_tensors streaming_extract_noquant_carries_ple_tensors` — 2/2 OK
- `cargo test -p larql-vindex -p larql-inference --no-fail-fast` — 2342 passed, 1 failed (pre-existing macOS-local golden `vector_extractor_ffn_down_byte_identical` drift — unrelated)
- `cargo test --workspace --no-default-features --no-fail-fast` — 4880 passed (+1 vs baseline 4879 from the new test), 1 failed (same pre-existing drift), 40 ignored
- `cargo fmt -p larql-vindex -p larql-inference -- --check` — OK
- `cargo clippy -p larql-vindex --tests --no-deps -- -D warnings` — OK
- `cargo clippy -p larql-inference --lib --tests --benches --no-deps -- -D warnings` — OK

Can't validate against a real `gemma-4-E4B-it` extract locally (no weights checked out). The synthetic regression test exercises the writer + loader + manifest paths the issue's evidence chain proved were broken.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR chrishayuk#91's first round bumped `load/f32.rs`'s line coverage to 73.42%, just under the 75% policy floor for that file. The fix's new `if arch.has_per_layer_embeddings() { … }` validation block added ~30 lines that weren't exercised by either of the round-trip tests (both deliberately produce a complete vindex, so the missing-tensor branches stay cold).

Add `load_model_weights_rejects_ple_arch_with_missing_sidecars`: build a complete Gemma-4 PLE fixture, run the extract, then rewrite `weight_manifest.json` to drop every PLE entry — exactly what the pre-fix writer did — and call `load_model_weights`. Asserts the returned `VindexError::Parse` names "Gemma-4 PLE sidecar" and points at chrishayuk#49. Exercises every branch of the new validation block end-to-end.

`cargo test -p larql-vindex --test test_vindex -- load_model_weights_rejects_ple_arch_with_missing_sidecars streaming_extract_q4k_carries_ple_tensors streaming_extract_noquant_carries_ple_tensors` — 3/3 OK

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Seems I fixed the bug, but found another one ;-) Real-model validation on `gemma-4-E4B-it`, re-running the issue's reproducer (`INFER "The capital of France is" TOP 5`):
| # | Pre-fix (per #49) | Post-fix (this PR) |
|---|---|---|
| 1 | mutable 8.92% | France 99.68% |
| 2 | ッケ 7.39% | French 0.07% |
| 3 | ceral 7.17% | is 0.05% |
| 4 | ఉ 4.77% | The 0.03% |
| 5 | sby 4.76% | the 0.02% |
Random low-confidence garbage → coherent high-confidence English. The PLE contribution is being applied correctly across all 42
layers (the writer-loader-forward chain is intact). Top token is France rather than Paris because the IT model fits "The capital
of France is France" without a chat template — that's adjacent issue C in #49 (LQL INFER doesn't chat-wrap), out of scope here.
Same prompt with an explicit Gemma-4 chat wrap:

```
larql lql $'USE "/tmp/e4b.vindex"; INFER "<bos><|turn>user\nThe capital of France is<turn|>\n<|turn>model\n" TOP 10;'
```
| # | Token | Confidence |
|---|---|---|
| 1 | The | 99.69% (starts "The capital is Paris.") |
| 2 | Paris | 0.16% |
| 3 | `**` | 0.15% |
| 9 | Berlin | 0.0001% |
Paris at rank 2, Berlin in top-10 — orders of magnitude above the noise floor. The PLE-driven geography knowledge is fully
restored. The 99.69% on The is the IT model's preference to emit a wrapping sentence rather than the bare answer.
Out-of-scope observation
Multi-token decode via `larql run /tmp/e4b.vindex "The capital of France is" --max-tokens 5` still produces garbage (ッケッケTobchal的存在). The first generated token via `--max-tokens 1` is coherent — only continued decode goes wrong. That's a different bug (decode-time PLE bookkeeping in the KV-cached path) and doesn't share a root cause with #49. Worth a follow-up issue; not blocking this PR.
Coverage gate
The earlier coverage drop on `crates/larql-vindex/src/format/weights/load/f32.rs` is now back over the 75% floor — see commit 7d50997 (test(vindex): cover the PLE-arch missing-sidecar load rejection), which exercises every branch of the new validation block end-to-end.
Closes #49.
Summary
The Q4_K writer captured Gemma-4 Per-Layer Embeddings (PLE) tensors correctly; the float writer didn't.
`larql extract <model>` against `gemma-4-E4B-it` and other PLE-active variants therefore produced a vindex with zero PLE entries on the default `--quant none` / `--quant f16` path. The forward path's `precompute_per_layer_inputs` / `apply_per_layer_embedding` silently fell through (`return Vec::new()` / `return h.clone()`) when those tensors were missing — INFER returned near-random tokens at 5–9% confidence with no warning. (See #49 for the full evidence chain.)

What changed
Fix in five parts, mirroring the Q4_K path so future quant modes inherit correctness for free.
1. Shared `weights::ple_sidecar` module (refactor)

Extracted `write_q4k::ple::write_ple_weights` into `weights/ple_sidecar.rs`. Both writers now call it. Same bytes on disk, same `kind::TENSOR_F16` manifest entries — pure refactor, no behavior change in the Q4_K path. The point is to prevent the bug from re-appearing: dedup means "anything fixed in one writer is automatically applied to the other."
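As a rough sketch of the module boundary — the `Arch` / `TensorSource` / `Manifest` types and every helper except `write_ple_weights` itself are assumptions; only the function name, the `ple_weights.bin` sidecar, and the `kind::TENSOR_F16` manifest kind come from the PR:

```rust
// weights/ple_sidecar.rs — hypothetical shape; the real signature may differ.
use std::fs::File;
use std::io::{self, Seek, Write};
use std::path::Path;

/// Shared by write_f32 and write_q4k so PLE handling can never diverge again.
pub fn write_ple_weights(
    arch: &Arch,               // assumed architecture-descriptor type
    source: &mut TensorSource, // assumed tensor-reader type
    out_dir: &Path,
    manifest: &mut Manifest,   // assumed manifest-builder type
) -> io::Result<()> {
    if !arch.has_per_layer_embeddings() {
        return Ok(()); // non-PLE archs emit no sidecar
    }
    let mut sidecar = File::create(out_dir.join("ple_weights.bin"))?;
    for key in arch.ple_tensor_keys() {      // assumed key-enumeration helper
        let tensor = source.read_f16(&key)?; // assumed f16 read helper
        let offset = sidecar.stream_position()?;
        sidecar.write_all(tensor.bytes())?;
        // Same manifest kind regardless of which writer called us.
        manifest.push_entry(&key, kind::TENSOR_F16, offset, tensor.byte_len());
    }
    Ok(())
}
```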
2. `write_f32`: call the shared helper + capture Gemma-4 norms

After the existing per-layer norms block, the float writer now:

- Adds `arch.layer_scalar_key(layer)` to the norms loop (Gemma-4 per-layer multiplier — `None` on every other arch, so this is a true superset).
- Adds `arch.post_per_layer_input_norm_key(layer)` for PLE-active archs.
- Writes `per_layer_projection_norm.weight` after the final `norm.weight`.
- Calls `super::ple_sidecar::write_ple_weights` to emit `ple_weights.bin`.

The Q4_K norms writer already did all of this — the float writer dropping them was the same family of bug; the extended pass is sketched below.
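A minimal sketch of that pass, assuming a `write_vector` helper and this loop shape — only the `arch` key-provider methods, the global norm key, and the shared-sidecar call are from the PR:

```rust
// Illustrative only — the real write_f32 internals may be organized differently.
fn write_norms_and_ple_f32(
    arch: &Arch,
    source: &mut TensorSource,
    out: &mut File,
    out_dir: &Path,
    manifest: &mut Manifest,
) -> io::Result<()> {
    for layer in 0..arch.num_layers() {
        for key in arch.norm_keys(layer) { // pre-existing norms (assumed helper)
            write_vector(out, manifest, source, &key)?;
        }
        // New: Gemma-4 per-layer multiplier; None on every other arch,
        // so the loop stays a strict superset of the old behavior.
        if let Some(key) = arch.layer_scalar_key(layer) {
            write_vector(out, manifest, source, &key)?;
        }
        // New: post-PLE-input norm, present only on PLE-active archs.
        if let Some(key) = arch.post_per_layer_input_norm_key(layer) {
            write_vector(out, manifest, source, &key)?;
        }
    }
    if arch.has_per_layer_embeddings() {
        // New: global projection norm, written after the final norm.weight.
        write_vector(out, manifest, source, "per_layer_projection_norm.weight")?;
        // New: delegate PLE tensor emission to the shared sidecar writer.
        super::ple_sidecar::write_ple_weights(arch, source, out_dir, manifest)?;
    }
    Ok(())
}
```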
3. `load/f32.rs`: decode `TENSOR_F16` + validate at load time

Extended the manifest-decode match arm to handle `kind::TENSOR_F16` (mirrors `load/q4k.rs`). After the entries loop, when `arch.has_per_layer_embeddings()` is true, the loader requires every PLE tensor + vector + per-layer `layer_scalar` to be present. Missing entries fail with a `VindexError::Parse` that names the Gemma-4 PLE sidecar and points at #49. Stale (pre-fix) vindexes now fail loud with an actionable rebuild hint instead of silently producing garbage INFER.
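A sketch of what that validation block might look like — the `weights` container fields and the key-enumeration helpers are assumptions; the error variant, the tensor/vector/`layer_scalar` checklist, and the rebuild hint are from this PR:

```rust
// load/f32.rs, after the manifest-entries loop — illustrative sketch.
if arch.has_per_layer_embeddings() {
    let mut missing: Vec<String> = Vec::new();
    for key in arch.ple_tensor_keys() { // assumed helper
        if !weights.tensors.contains_key(&key) {
            missing.push(key);
        }
    }
    for key in arch.ple_vector_keys() { // assumed helper
        if !weights.vectors.contains_key(&key) {
            missing.push(key);
        }
    }
    for layer in 0..arch.num_layers() {
        if let Some(key) = arch.layer_scalar_key(layer) {
            if !weights.vectors.contains_key(&key) {
                missing.push(key);
            }
        }
    }
    if !missing.is_empty() {
        // Paraphrased error text; the real message also references #49.
        return Err(VindexError::Parse(format!(
            "Gemma-4 PLE sidecar incomplete: {} entries missing (first: {}). \
             This vindex predates the #49 fix — re-run `larql extract` to rebuild.",
            missing.len(),
            missing[0]
        )));
    }
}
```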
4. `forward/ple.rs`: replace silent fall-throughs with panics

Replaced the three silent fall-through paths (`return Vec::new()` for the model projection, `return h.clone()` for the gate/projection tensors) with `panic!()` calls that name the missing key + reference #49. The loader enforces the invariant up front, so missing tensors here is now a load-path bug, not a fall-through case. The legitimate `!arch.has_per_layer_embeddings()` early-exit and the `per_layer_input == None` opt-out stay (those are real architectural states).
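For illustration, one of the three paths before and after — everything here except `precompute_per_layer_inputs`, the early-exit, and the `return Vec::new()` fall-through is an assumption about the surrounding code:

```rust
// forward/ple.rs — sketched; field and helper names are hypothetical.
pub fn precompute_per_layer_inputs(weights: &Weights, arch: &Arch, h: &[f32]) -> Vec<Vec<f32>> {
    if !arch.has_per_layer_embeddings() {
        return Vec::new(); // legitimate early-exit: this arch has no PLE at all
    }
    let key = arch.ple_projection_key(); // assumed key helper
    let Some(proj) = weights.tensors.get(&key) else {
        // Pre-fix: `return Vec::new()` here silently degraded INFER to noise.
        // Post-fix: the loader guarantees presence, so absence is a bug.
        panic!("PLE tensor `{key}` missing at forward time despite load-time validation — see #49");
    };
    project_per_layer(proj, h) // assumed downstream projection
}
```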
5. Test scaffolding

- Factored the Gemma-4 PLE fixture into a `write_gemma4_ple_fixture` helper.
- Extended it with the per-layer `layer_scalar` 1-element vector — previously uncovered, even on the Q4_K side.
- Added `streaming_extract_noquant_carries_ple_tensors`: the direct regression test for "Gemma-4 PLE tensors silently dropped on `--quant none` extracts → garbage INFER" (#49). Builds the fixture, runs the real extract with `--quant none` / `ExtractLevel::Inference`, asserts `ple_weights.bin` exists, asserts the manifest has every `tensor_f16` PLE entry + every `vector` PLE norm + every `layer_scalar`, then loads via `load_model_weights` and asserts every tensor / vector key + shape. (A hedged sketch of its shape follows this list.)
- Extended `streaming_extract_q4k_carries_ple_tensors` with a parallel `layer_scalar` assertion.
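Sketch of the regression test's shape — the `extract` signature, manifest reader, and key-enumeration helpers are assumptions; the test name, fixture helper, flags, and asserted artifacts are from this PR:

```rust
#[test]
fn streaming_extract_noquant_carries_ple_tensors() {
    let dir = tempfile::tempdir().unwrap();
    let model = write_gemma4_ple_fixture(dir.path()); // fixture helper from this PR
    let vindex = dir.path().join("fixture.vindex");

    // The exact path that used to drop PLE: float writer, no quantization.
    extract(&model, &vindex, Quant::None, ExtractLevel::Inference).unwrap();

    // The sidecar the pre-fix float writer never produced.
    assert!(vindex.join("ple_weights.bin").exists());

    // Every PLE tensor_f16 entry, vector norm, and layer_scalar in the manifest…
    let manifest = read_manifest(&vindex.join("weight_manifest.json")); // assumed helper
    for key in gemma4_ple_keys() { // assumed key list matching the fixture
        assert!(manifest.contains_key(&key), "manifest missing {key}");
    }

    // …and every one of them hydrated after a real load.
    let weights = load_model_weights(&vindex).unwrap();
    for key in gemma4_ple_keys() {
        assert!(
            weights.tensors.contains_key(&key) || weights.vectors.contains_key(&key),
            "loaded weights missing {key}"
        );
    }
}
```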
Out of scope

Issue #49 calls out three adjacent bugs ("A. WalkFfn forward_moe_full_layer", "B. predict_with_ffn_attention shared KV", "C. LQL INFER chat-wrap"). Each needs its own change set; this PR doesn't touch them.
Test plan
- `cargo test -p larql-vindex --test test_vindex -- streaming_extract_q4k_carries_ple_tensors streaming_extract_noquant_carries_ple_tensors` — 2/2 OK
- `cargo test -p larql-vindex -p larql-inference --no-fail-fast` — 2342 passed, 1 failed (pre-existing macOS-local `vector_extractor_ffn_down_byte_identical` golden drift, unrelated)
- `cargo test --workspace --no-default-features --no-fail-fast` — 4880 passed (+1 vs the 4879 baseline from the windows-fix run, due to the new test), same single pre-existing failure, 40 ignored
- `cargo fmt -p larql-vindex -p larql-inference -- --check` — OK
- `cargo clippy -p larql-vindex --tests --no-deps -- -D warnings` — OK
- `cargo clippy -p larql-inference --lib --tests --benches --no-deps -- -D warnings` — OK

Not run: a `gemma-4-E4B-it` end-to-end extract → INFER (no weights checked out locally). The synthetic regression test exercises the writer + loader + manifest paths the issue's evidence chain proved were broken; the on-disk byte layout matches the Q4_K path the issue's mitigation already confirmed correct. Someone with the model weights checked out should re-run the issue's reproducer (`larql extract google/gemma-4-E4B-it -o e4b.vindex; larql lql 'USE \"e4b.vindex\"; INFER \"The capital of France is\" TOP 5;'`) and confirm "Paris" appears at high confidence.

🤖 Generated with Claude Code