
feat: MLA absorption for DeepSeek V2/V3 — fuse low-rank Q/K/V into standard dense tensors#96

Open
mvkorobkov wants to merge 6 commits into chrishayuk:main from mvkorobkov:main

Conversation

@mvkorobkov

Summary

  • gqa_attention_asym — new attention kernel in larql-inference that handles asymmetric qk_head_dim / v_head_dim (required for absorbed MLA tensors where Q/K use 192-dim heads but V uses 128-dim heads in DS-V3)
  • MLA geometry fields in ModelConfig: qk_nope_head_dim, qk_rope_head_dim, v_head_dim, parsed from config.json; DeepSeekArch exposes them via trait methods
  • mla_absorb — new module in larql-vindex that fuses the four DS-V2/V3 low-rank attention projections (kv_a, kv_b, q_a, q_b) into standard dense Q/K/V weight matrices
  • write_model_weights — F32 weight writer now accepts MLA architectures: detects full geometry, runs absorption per layer, writes absorbed Q/K/V under standard key names so the loader needs no MLA awareness

Why absorption

DS-V2/V3 stores attention as four low-rank matrices. Absorbing them into standard Q/K/V at extraction time means:

  • Inference path is uniform — no special MLA forward pass at runtime
  • Loader is unchanged — reads Q/K/V tensors as for any Llama/Mistral model
  • One-time compute cost at extraction, not at every inference step

Correctness

Key details:

  • kv_a rope-K is MQA (one shared row for all KV heads, not per-head) — replicated num_kv times when building absorbed K
  • DS-V3 native per-head layout is [nope | rope]; LARQL convention is [rope | nope] — absorption reorders symmetrically for both Q and K (see the sketch after this list)
  • Equivalence proven by absorbed_forward_matches_reference test: reference MLA forward pass vs absorbed path through gqa_attention_asym must agree within 1e-4 (f32 precision)
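
For concreteness, here is a minimal standalone sketch of the first two points: the per-head [nope | rope] to [rope | nope] reorder and the MQA rope-K broadcast. The Vec-of-rows layout and the function names are illustrative assumptions, not the PR's actual mla_absorb code.

```rust
/// Sketch only: weights are modelled as one Vec<f32> row per output feature.

/// Reorder one head's rows from the DS-V3 native [nope | rope] layout to the
/// LARQL convention [rope | nope]. Applied identically to Q and K.
fn reorder_nope_rope(head: &[Vec<f32>], nope: usize, rope: usize) -> Vec<Vec<f32>> {
    assert_eq!(head.len(), nope + rope);
    let mut out = Vec::with_capacity(nope + rope);
    out.extend_from_slice(&head[nope..nope + rope]); // rope rows first
    out.extend_from_slice(&head[..nope]);            // then nope rows
    out
}

/// rope-K is MQA: one shared block of `rope` rows. Replicate it once per
/// KV head when assembling the absorbed K weight.
fn broadcast_rope_k(shared_rope_k: &[Vec<f32>], num_kv: usize) -> Vec<Vec<Vec<f32>>> {
    (0..num_kv).map(|_| shared_rope_k.to_vec()).collect()
}

fn main() {
    // Toy dimensions: nope = 2, rope = 1, hidden = 3, num_kv = 2.
    let head = vec![
        vec![1.0, 1.0, 1.0], // nope row 0
        vec![2.0, 2.0, 2.0], // nope row 1
        vec![9.0, 9.0, 9.0], // rope row
    ];
    let reordered = reorder_nope_rope(&head, 2, 1);
    assert_eq!(reordered[0], vec![9.0, 9.0, 9.0]); // rope row now leads

    let rope_k = vec![vec![9.0, 9.0, 9.0]];
    let per_head = broadcast_rope_k(&rope_k, 2);
    assert_eq!(per_head.len(), 2); // the same rope-K rows for every KV head
}
```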

Test plan

  • cargo test -p larql-inference -- gqa_attention_asym — 4 tests (shape, finite, sym-equivalence, causal)
  • cargo test -p larql-vindex -- mla_absorb — 3 tests (forward equivalence, shapes, rope broadcast)
  • cargo test -p larql-models — existing DS-V3 detection tests extended with new geometry accessors
  • cargo test -p larql-vindex — 971 tests, 0 failures

Mykhailo Korobkov added 6 commits May 14, 2026 10:49
DS-V3 absorbed attention has qk_head_dim=192 (nope=128+rope=64) but
v_head_dim=128. The existing gqa_attention uses a single head_dim for
all projections, which would corrupt V slicing and output shape.

gqa_attention_asym accepts separate qk_head_dim and v_head_dim:
- Q/K sliced with qk_head_dim (dot-product stays in the larger space)
- V sliced and output written with v_head_dim
- Returns (seq, num_q * v_head_dim)

When qk_head_dim == v_head_dim the function is numerically identical
to gqa_attention (verified by asym_sym_equivalence_when_dims_equal test).

4 tests added: shape, finiteness, sym-equivalence, seq=1 causal.
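
A condensed sketch of what this asymmetric attention step computes (not the actual larql-inference kernel; the flat row-major buffers, the signature, and the 1/sqrt(qk_head_dim) scaling are assumptions): Q/K dot products run in qk_head_dim space, the output is written with v_head_dim, and the mask is causal over positions 0..=t.

```rust
/// q: (seq, num_q * qk_head_dim), k: (seq, num_kv * qk_head_dim),
/// v: (seq, num_kv * v_head_dim), all row-major f32.
/// Returns (seq, num_q * v_head_dim).
fn gqa_attention_asym_sketch(
    q: &[f32], k: &[f32], v: &[f32],
    seq: usize, num_q: usize, num_kv: usize,
    qk_head_dim: usize, v_head_dim: usize,
) -> Vec<f32> {
    let group = num_q / num_kv;                    // query heads per KV head
    let scale = 1.0 / (qk_head_dim as f32).sqrt(); // scale by the larger dim (assumed)
    let mut out = vec![0.0f32; seq * num_q * v_head_dim];

    for t in 0..seq {
        for h in 0..num_q {
            let kv = h / group; // map query head -> shared KV head
            let q_row = &q[t * num_q * qk_head_dim + h * qk_head_dim..][..qk_head_dim];

            // Causal scores over positions 0..=t, computed in qk_head_dim space.
            let mut scores: Vec<f32> = (0..=t)
                .map(|s| {
                    let k_row = &k[s * num_kv * qk_head_dim + kv * qk_head_dim..][..qk_head_dim];
                    q_row.iter().zip(k_row).map(|(a, b)| a * b).sum::<f32>() * scale
                })
                .collect();

            // Numerically stable softmax.
            let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            let denom: f32 = scores.iter_mut().map(|x| { *x = (*x - max).exp(); *x }).sum();

            // Weighted sum of V rows; output slice uses v_head_dim, not qk_head_dim.
            let o = &mut out[t * num_q * v_head_dim + h * v_head_dim..][..v_head_dim];
            for (s, w) in scores.iter().enumerate() {
                let v_row = &v[s * num_kv * v_head_dim + kv * v_head_dim..][..v_head_dim];
                for (oi, vi) in o.iter_mut().zip(v_row) {
                    *oi += (*w / denom) * *vi;
                }
            }
        }
    }
    out
}
```
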
Three new optional fields on ModelConfig:
  qk_nope_head_dim — non-RoPE part of Q/K head dim (DS-V3: 128)
  qk_rope_head_dim — RoPE-rotated part of Q/K head dim (DS-V3: 64)
  v_head_dim       — V projection head dim (DS-V3: 128)

Parsed from config.json (qk_nope_head_dim / qk_rope_head_dim / v_head_dim).
Trait accessors added to ModelArchitecture with None defaults.
DeepSeekArch overrides to read from config.
DS-V3 detection test extended to verify all three fields round-trip.
Two GGUF test-only ModelConfig literals updated to include None stubs.
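
As a rough illustration of the parsing side (a standalone sketch, not the actual ModelConfig; only the three field names come from config.json):

```rust
use serde::Deserialize;

/// Sketch: serde leaves Option fields as None when the key is absent,
/// which matches the "optional with None defaults" behaviour described above.
#[derive(Debug, Deserialize)]
struct MlaGeometrySketch {
    qk_nope_head_dim: Option<usize>,
    qk_rope_head_dim: Option<usize>,
    v_head_dim: Option<usize>,
}

fn main() -> Result<(), serde_json::Error> {
    // DS-V3-style values; a Llama-style config without these keys would
    // deserialize with all three fields set to None.
    let cfg: MlaGeometrySketch = serde_json::from_str(
        r#"{"qk_nope_head_dim":128,"qk_rope_head_dim":64,"v_head_dim":128}"#,
    )?;
    assert_eq!(cfg.qk_nope_head_dim, Some(128));
    assert_eq!(cfg.qk_rope_head_dim, Some(64));
    assert_eq!(cfg.v_head_dim, Some(128));
    Ok(())
}
```
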
…eight matrices

Implements `mla_absorb::absorb()` which converts the four MLA weight matrices
(kv_a, kv_b, q_a, q_b) into standard dense Q/K/V tensors compatible with
`gqa_attention_asym`. Key correctness points:

- rope-K is MQA: single row in kv_a[kv_lora..] replicated num_kv times in
  absorbed K (not per-head in the input tensor)
- DS-V3 native per-head layout [nope|rope] → LARQL convention [rope|nope]
  applied symmetrically to Q and K during absorption
- V: straightforward kv_b[nope+v_hd slice] @ kv_compress

Three tests, all passing:
- absorbed_forward_matches_reference: reference MLA forward vs absorbed path
  through gqa_attention_asym must match within 1e-4
- absorbed_shapes: output tensor dimensions
- rope_k_is_broadcast_not_zero: single rope-K correctly replicated across heads
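
A hedged sketch of the V branch ("kv_b[nope+v_hd slice] @ kv_compress"). The row-major shapes assumed here, kv_b as (num_kv * (nope + v_hd), kv_lora) and kv_a as (kv_lora + rope_dim, hidden), are an interpretation of this commit message rather than the crate's actual tensor layout.

```rust
/// Plain row-major matmul: a is (m, k), b is (k, n) -> (m, n).
fn matmul(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for p in 0..k {
            let av = a[i * k + p];
            for j in 0..n {
                out[i * n + j] += av * b[p * n + j];
            }
        }
    }
    out
}

/// Absorb V for one KV head: take the v_hd rows of this head's kv_b block
/// (they sit after the nope rows) and multiply by the compression half of
/// kv_a (its first kv_lora rows; the trailing rope_dim rows are the shared
/// rope-K and do not participate in V).
fn absorb_v_for_head(
    kv_b: &[f32], kv_a: &[f32],
    head: usize, nope: usize, v_hd: usize, kv_lora: usize, hidden: usize,
) -> Vec<f32> {
    let row0 = head * (nope + v_hd) + nope; // first V row of this head's block
    let v_rows = &kv_b[row0 * kv_lora..(row0 + v_hd) * kv_lora];
    let kv_compress = &kv_a[..kv_lora * hidden];
    matmul(v_rows, kv_compress, v_hd, kv_lora, hidden) // (v_hd, hidden)
}
```
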
write_model_weights_with_opts now accepts DS-V3 / MLA architectures when all
three geometry fields (qk_nope_head_dim, qk_rope_head_dim, v_head_dim) are
present in config.json. When detected:

- skips the standard-attention guard
- per layer: fetches kv_a/kv_b/q_a/q_b projections, calls mla_absorb::absorb,
  writes the resulting dense Q/K/V under the standard attn_q/k/v key names
- O projection is passed through unchanged (no absorption needed)

The loader remains MLA-unaware: it reads standard Q/K/V tensors just as for
any Llama/Mistral model. The extra storage cost (absorbed K replicates the
MQA rope-K row num_kv times) is acceptable at DS-V3 full scale (~3.5 GB
extra across 61 layers with num_kv=128).

All 971 larql-vindex unit + integration tests pass.
Implements scalar dequantize for Q3_K (110 B/block) and Q5_K (176 B/block)
so that DeepSeek-R1-0528-Qwen3-8B-Q3_K_L and similar models can be converted
via larql gguf-to-vindex.

- q3_k.rs: unpack_q3k_scales (kmask1/kmask2 per llama.cpp), two-half-block
  loop with m-bitmask for high bits, signed-scale centred at 32.
- q5_k.rs: reuses pub(super) unpack_q4k_scales from q4_k; u1/u2 mask walk
  for high bits, 4 iterations of 64 elements each.
- mod.rs: Q3_K_BLOCK_BYTES=110, Q5_K_BLOCK_BYTES=176, dispatch in
  tensor_data_size() and dequantize().
- q4_k.rs: unpack_q4k_scales promoted to pub(super) for Q5_K reuse.
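
For orientation, a small sketch of the size bookkeeping these constants drive. The 256-element super-block (QK_K) follows the llama.cpp k-quant convention; the enum dispatch and the div_ceil padding are illustrative, not larql's actual types.

```rust
const QK_K: usize = 256; // elements per k-quant super-block (llama.cpp convention)
const Q3_K_BLOCK_BYTES: usize = 110;
const Q5_K_BLOCK_BYTES: usize = 176;

#[derive(Clone, Copy)]
enum KQuant { Q3K, Q5K }

/// Bytes needed to store `n_elements` quantized values of the given kind.
fn tensor_data_size(kind: KQuant, n_elements: usize) -> usize {
    let block_bytes = match kind {
        KQuant::Q3K => Q3_K_BLOCK_BYTES,
        KQuant::Q5K => Q5_K_BLOCK_BYTES,
    };
    n_elements.div_ceil(QK_K) * block_bytes
}

fn main() {
    // A 4096 x 4096 weight: 16_777_216 elements = 65_536 super-blocks.
    assert_eq!(tensor_data_size(KQuant::Q3K, 4096 * 4096), 65_536 * 110);
    assert_eq!(tensor_data_size(KQuant::Q5K, 4096 * 4096), 65_536 * 176);
}
```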