feat: MLA absorption for DeepSeek V2/V3 — fuse low-rank Q/K/V into standard dense tensors #96
Open

mvkorobkov wants to merge 6 commits into
Conversation

mvkorobkov added 6 commits on May 14, 2026:
DS-V3 absorbed attention has `qk_head_dim=192` (nope=128 + rope=64) but `v_head_dim=128`. The existing `gqa_attention` uses a single `head_dim` for all projections, which would corrupt V slicing and the output shape. `gqa_attention_asym` accepts separate `qk_head_dim` and `v_head_dim`:

- Q/K are sliced with `qk_head_dim` (the dot product stays in the larger space)
- V is sliced, and the output written, with `v_head_dim`
- returns `(seq, num_q * v_head_dim)`

When `qk_head_dim == v_head_dim` the function is numerically identical to `gqa_attention` (verified by the `asym_sym_equivalence_when_dims_equal` test). 4 tests added: shape, finiteness, sym-equivalence, seq=1 causal.
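For reference, the core of such a kernel could look roughly like this. This is a naive sketch assuming row-major f32 slices, causal masking, and `num_q` divisible by `num_kv`; the names and layout are illustrative, not the PR's real API:

```rust
// Naive asymmetric GQA attention: scores use qk_head_dim, output uses v_head_dim.
fn gqa_attention_asym(
    q: &[f32],          // [seq, num_q * qk_head_dim]
    k: &[f32],          // [seq, num_kv * qk_head_dim]
    v: &[f32],          // [seq, num_kv * v_head_dim]
    seq: usize,
    num_q: usize,
    num_kv: usize,
    qk_head_dim: usize, // DS-V3: 192 (nope 128 + rope 64)
    v_head_dim: usize,  // DS-V3: 128
) -> Vec<f32> {
    let group = num_q / num_kv; // Q heads sharing one KV head
    let scale = 1.0 / (qk_head_dim as f32).sqrt();
    let mut out = vec![0.0f32; seq * num_q * v_head_dim];
    for t in 0..seq {
        for h in 0..num_q {
            let kv = h / group;
            // dot products in the larger qk_head_dim space, causal window 0..=t
            let mut w: Vec<f32> = (0..=t)
                .map(|s| {
                    let qo = (t * num_q + h) * qk_head_dim;
                    let ko = (s * num_kv + kv) * qk_head_dim;
                    (0..qk_head_dim).map(|d| q[qo + d] * k[ko + d]).sum::<f32>() * scale
                })
                .collect();
            // softmax over the causal window
            let m = w.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            let z: f32 = w.iter_mut().map(|x| { *x = (*x - m).exp(); *x }).sum();
            // V accumulation and output are sliced with v_head_dim, not qk_head_dim
            let oo = (t * num_q + h) * v_head_dim;
            for (s, wi) in w.iter().enumerate() {
                let vo = (s * num_kv + kv) * v_head_dim;
                for d in 0..v_head_dim {
                    out[oo + d] += (wi / z) * v[vo + d];
                }
            }
        }
    }
    out // (seq, num_q * v_head_dim)
}
```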
Three new optional fields on `ModelConfig`:

- `qk_nope_head_dim` — non-RoPE part of the Q/K head dim (DS-V3: 128)
- `qk_rope_head_dim` — RoPE-rotated part of the Q/K head dim (DS-V3: 64)
- `v_head_dim` — V projection head dim (DS-V3: 128)

Parsed from config.json (`qk_nope_head_dim` / `qk_rope_head_dim` / `v_head_dim`). Trait accessors added to `ModelArchitecture` with `None` defaults; `DeepSeekArch` overrides them to read from the config. The DS-V3 detection test is extended to verify that all three fields round-trip, and two GGUF test-only `ModelConfig` literals are updated to include `None` stubs.
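The shape of the change, as an illustrative sketch (field and trait names follow the commit message; the surrounding types are assumed):

```rust
// Three optional geometry fields; None on every non-MLA architecture.
#[derive(Debug, Default)]
pub struct ModelConfig {
    // ...existing fields...
    pub qk_nope_head_dim: Option<usize>, // DS-V3: Some(128)
    pub qk_rope_head_dim: Option<usize>, // DS-V3: Some(64)
    pub v_head_dim: Option<usize>,       // DS-V3: Some(128)
}

pub trait ModelArchitecture {
    // None defaults keep every existing architecture unaffected;
    // DeepSeekArch overrides these to read from its parsed config.
    fn qk_nope_head_dim(&self) -> Option<usize> { None }
    fn qk_rope_head_dim(&self) -> Option<usize> { None }
    fn v_head_dim(&self) -> Option<usize> { None }
}
```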
…eight matrices

Implements `mla_absorb::absorb()`, which converts the four MLA weight matrices (`kv_a`, `kv_b`, `q_a`, `q_b`) into standard dense Q/K/V tensors compatible with `gqa_attention_asym`.

Key correctness points:

- rope-K is MQA: a single row in `kv_a[kv_lora..]`, replicated `num_kv` times in absorbed K (not per-head in the input tensor)
- DS-V3's native per-head layout `[nope | rope]` → LARQL convention `[rope | nope]`, applied symmetrically to Q and K during absorption
- V: straightforward `kv_b[nope+v_hd slice] @ kv_compress`

Three tests (3 passed):

- `absorbed_forward_matches_reference`: a reference MLA forward vs the absorbed path through `gqa_attention_asym` must match within 1e-4
- `absorbed_shapes`: output tensor dimensions
- `rope_k_is_broadcast_not_zero`: the single rope-K row is correctly replicated across heads
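The K/V half of the absorption reduces to two matmuls per head plus the rope-row broadcast. A condensed sketch (naive matmul, row-major f32; helper names are hypothetical, the Q path via `q_a`/`q_b` fuses the same way and is omitted, and applying RoPE to the `[rope]` slice at runtime is out of scope here):

```rust
// C[m,n] = A[m,k] @ B[k,n], row-major.
fn matmul(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for p in 0..k {
            let av = a[i * k + p];
            for j in 0..n {
                c[i * n + j] += av * b[p * n + j];
            }
        }
    }
    c
}

/// kv_a: [kv_lora + rope_hd, hidden] — compression rows plus the shared MQA rope-K rows
/// kv_b: [num_kv * (nope_hd + v_hd), kv_lora] — per-head nope-K and V up-projections
fn absorb_kv(
    kv_a: &[f32], kv_b: &[f32],
    hidden: usize, kv_lora: usize,
    num_kv: usize, nope_hd: usize, rope_hd: usize, v_hd: usize,
) -> (Vec<f32>, Vec<f32>) {
    let kv_compress = &kv_a[..kv_lora * hidden]; // [kv_lora, hidden]
    let rope_rows = &kv_a[kv_lora * hidden..];   // [rope_hd, hidden], MQA-shared
    let head = nope_hd + v_hd;                   // kv_b rows per head: [nope | v]
    let qk_hd = rope_hd + nope_hd;               // LARQL per-head layout: [rope | nope]

    let mut k = vec![0.0f32; num_kv * qk_hd * hidden];
    let mut v = vec![0.0f32; num_kv * v_hd * hidden];
    for h in 0..num_kv {
        // nope-K: kv_b nope slice @ kv_compress -> dense rows over `hidden`
        let nope_slice = &kv_b[h * head * kv_lora..(h * head + nope_hd) * kv_lora];
        let nope = matmul(nope_slice, kv_compress, nope_hd, kv_lora, hidden);
        // rope-K: the shared rows are replicated into every head, rope part first
        let base = h * qk_hd * hidden;
        k[base..base + rope_hd * hidden].copy_from_slice(rope_rows);
        k[base + rope_hd * hidden..base + qk_hd * hidden].copy_from_slice(&nope);
        // V: kv_b v slice @ kv_compress
        let v_slice = &kv_b[(h * head + nope_hd) * kv_lora..(h + 1) * head * kv_lora];
        let vh = matmul(v_slice, kv_compress, v_hd, kv_lora, hidden);
        v[h * v_hd * hidden..(h + 1) * v_hd * hidden].copy_from_slice(&vh);
    }
    (k, v) // K: [num_kv * (rope_hd + nope_hd), hidden]; V: [num_kv * v_hd, hidden]
}
```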
`write_model_weights_with_opts` now accepts DS-V3 / MLA architectures when all three geometry fields (`qk_nope_head_dim`, `qk_rope_head_dim`, `v_head_dim`) are present in config.json. When detected, it:

- skips the standard-attention guard
- per layer: fetches the `kv_a`/`kv_b`/`q_a`/`q_b` projections, calls `mla_absorb::absorb`, and writes the resulting dense Q/K/V under the standard `attn_q`/`k`/`v` key names
- passes the O projection through unchanged (no absorption needed)

The loader remains MLA-unaware: it reads standard Q/K/V tensors just as for any Llama/Mistral model. The extra storage cost (absorbed K replicates the MQA rope-K row `num_kv` times) is acceptable at DS-V3 full scale (~3.5 GB extra across 61 layers at `num_kv=128`). All 971 larql-vindex unit + integration tests pass.
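The all-or-nothing geometry gate could be as simple as the following sketch (names illustrative, reusing the trait accessor shape from the config commit above):

```rust
trait ModelArchitecture {
    fn qk_nope_head_dim(&self) -> Option<usize>;
    fn qk_rope_head_dim(&self) -> Option<usize>;
    fn v_head_dim(&self) -> Option<usize>;
}

/// Some((nope, rope, v)) only when all three fields were present in
/// config.json; otherwise the writer keeps the standard-attention guard.
fn mla_geometry(arch: &dyn ModelArchitecture) -> Option<(usize, usize, usize)> {
    Some((
        arch.qk_nope_head_dim()?,
        arch.qk_rope_head_dim()?,
        arch.v_head_dim()?,
    ))
}
```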
Implements scalar dequantize for Q3_K (110 B/block) and Q5_K (176 B/block) so that DeepSeek-R1-0528-Qwen3-8B-Q3_K_L and similar models can be converted via `larql gguf-to-vindex`.

- `q3_k.rs`: `unpack_q3k_scales` (kmask1/kmask2 per llama.cpp), two-half-block loop with an m-bitmask for the high bits, signed scale centred at 32
- `q5_k.rs`: reuses `pub(super) unpack_q4k_scales` from `q4_k`; u1/u2 mask walk for the high bits, 4 iterations of 64 elements each
- `mod.rs`: `Q3_K_BLOCK_BYTES=110`, `Q5_K_BLOCK_BYTES=176`, dispatch in `tensor_data_size()` and `dequantize()`
- `q4_k.rs`: `unpack_q4k_scales` promoted to `pub(super)` for Q5_K reuse
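For context, the kmask1/kmask2 scale unpack referenced above works like this sketch, ported from llama.cpp's Q3_K dequantizer (LARQL's `q3_k.rs` may differ in detail):

```rust
// Each Q3_K block packs 16 six-bit scales into 12 bytes: the low 4 bits of
// each scale live in the first 8 bytes, the top 2 bits in the last 4 bytes.
const KMASK1: u32 = 0x0303_0303;
const KMASK2: u32 = 0x0f0f_0f0f;

/// Expands the 12 packed scale bytes into 16 six-bit scales; the dequantizer
/// then uses `scale - 32` (signed scale centred at 32).
fn unpack_q3k_scales(scales: &[u8; 12]) -> [i8; 16] {
    let a0 = u32::from_le_bytes(scales[0..4].try_into().unwrap());
    let a1 = u32::from_le_bytes(scales[4..8].try_into().unwrap());
    let tmp = u32::from_le_bytes(scales[8..12].try_into().unwrap());
    // low nibbles from a0/a1, high 2 bits of each scale pulled out of tmp
    let aux = [
        (a0 & KMASK2) | ((tmp & KMASK1) << 4),
        (a1 & KMASK2) | (((tmp >> 2) & KMASK1) << 4),
        ((a0 >> 4) & KMASK2) | (((tmp >> 4) & KMASK1) << 4),
        ((a1 >> 4) & KMASK2) | (((tmp >> 6) & KMASK1) << 4),
    ];
    let mut out = [0i8; 16];
    for (i, &word) in aux.iter().enumerate() {
        for b in 0..4 {
            out[i * 4 + b] = ((word >> (8 * b)) & 0x3f) as i8;
        }
    }
    out
}
```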
Summary
- `gqa_attention_asym` — new attention kernel in larql-inference that handles asymmetric `qk_head_dim` / `v_head_dim` (required for absorbed MLA tensors, where Q/K use 192-dim heads but V uses 128-dim heads in DS-V3)
- `ModelConfig` — `qk_nope_head_dim`, `qk_rope_head_dim`, `v_head_dim` parsed from config.json; `DeepSeekArch` exposes them via trait methods
- `mla_absorb` — new module in larql-vindex that fuses the four DS-V2/V3 low-rank attention projections (`kv_a`, `kv_b`, `q_a`, `q_b`) into standard dense Q/K/V weight matrices
- `write_model_weights` — the F32 weight writer now accepts MLA architectures: it detects the full geometry, runs absorption per layer, and writes the absorbed Q/K/V under standard key names so the loader needs no MLA awareness

Why absorption
DS-V2/V3 stores attention as four low-rank matrices. Absorbing them into standard Q/K/V at extraction time means the rest of the pipeline stays MLA-unaware: the loader reads standard Q/K/V tensors just as for any Llama/Mistral model, and inference runs through the ordinary (asymmetric) GQA kernel with no per-token low-rank decompression.
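In matrix form, the K side looks like this sketch (tensor names follow the commits; $x$ is the hidden state, $c$ the compressed KV latent):

$$
c = W_{kv\_a}\,x,\qquad
k_h^{\mathrm{nope}} = W_{kv\_b,h}^{\mathrm{nope}}\,c
= \underbrace{\bigl(W_{kv\_b,h}^{\mathrm{nope}}\,W_{kv\_a}\bigr)}_{\text{absorbed } W_{K,h}}\,x,
\qquad
W_{V,h} = W_{kv\_b,h}^{v}\,W_{kv\_a}
$$

Associativity lets the bracketed product be computed once at extraction time instead of per token; the Q path fuses `q_a`/`q_b` the same way.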
Correctness
Key details:
- `kv_a` rope-K is MQA (one shared row for all KV heads, not per-head) — replicated `num_kv` times when building absorbed K
- `[nope | rope]` is DS-V3's native per-head layout; LARQL convention is `[rope | nope]` — absorption reorders symmetrically for both Q and K
- `absorbed_forward_matches_reference` test: a reference MLA forward pass vs the absorbed path through `gqa_attention_asym` must agree within 1e-4 (f32 precision)
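The tolerance check behind the equivalence test has roughly this shape (a minimal sketch; the helper name is hypothetical, 1e-4 is the bound the test uses):

```rust
// Element-wise comparison of the reference MLA forward against the absorbed
// path; both outputs are flat f32 buffers of identical shape.
fn assert_close(reference: &[f32], absorbed: &[f32], tol: f32) {
    assert_eq!(reference.len(), absorbed.len());
    for (i, (r, a)) in reference.iter().zip(absorbed).enumerate() {
        assert!((r - a).abs() <= tol, "mismatch at {i}: {r} vs {a}");
    }
}
// e.g. assert_close(&reference_out, &absorbed_out, 1e-4);
```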
Test plan

- `cargo test -p larql-inference -- gqa_attention_asym` — 4 tests (shape, finite, sym-equivalence, causal)
- `cargo test -p larql-vindex -- mla_absorb` — 3 tests (forward equivalence, shapes, rope broadcast)
- `cargo test -p larql-models` — existing DS-V3 detection tests extended with the new geometry accessors
- `cargo test -p larql-vindex` — 971 tests, 0 failures