
feat: MLA absorption for DeepSeek V2/V3 — fuse low-rank Q/K/V into standard dense tensors#96

Open
mvkorobkov wants to merge 6 commits into chrishayuk:main from mvkorobkov:main

Conversation

@mvkorobkov

Summary

  • gqa_attention_asym — new attention kernel in larql-inference that handles asymmetric qk_head_dim / v_head_dim (required for absorbed MLA tensors where Q/K use 192-dim heads but V uses 128-dim heads in DS-V3)
  • MLA geometry fields in ModelConfig: qk_nope_head_dim, qk_rope_head_dim, v_head_dim, parsed from config.json; DeepSeekArch exposes them via trait methods
  • mla_absorb — new module in larql-vindex that fuses the four DS-V2/V3 low-rank attention projections (kv_a, kv_b, q_a, q_b) into standard dense Q/K/V weight matrices
  • write_model_weights — F32 weight writer now accepts MLA architectures: detects full geometry, runs absorption per layer, writes absorbed Q/K/V under standard key names so the loader needs no MLA awareness

Why absorption

DS-V2/V3 stores attention as four low-rank matrices. Absorbing them into standard Q/K/V at extraction time means:

  • Inference path is uniform — no special MLA forward pass at runtime
  • Loader is unchanged — reads Q/K/V tensors as for any Llama/Mistral model
  • One-time compute cost at extraction, not at every inference step

Correctness

Key details:

  • kv_a rope-K is MQA (one shared row for all KV heads, not per-head) — replicated num_kv times when building absorbed K
  • DS-V3 native per-head layout is [nope | rope]; LARQL convention is [rope | nope] — absorption reorders symmetrically for both Q and K (see the sketch after this list)
  • Equivalence proven by absorbed_forward_matches_reference test: reference MLA forward pass vs absorbed path through gqa_attention_asym must agree within 1e-4 (f32 precision)
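
For concreteness, here is a minimal standalone sketch of the first two points: the per-head [nope | rope] to [rope | nope] reorder and the MQA rope-K broadcast. The Vec-of-rows layout and the function names are illustrative assumptions, not the PR's actual mla_absorb code.

```rust
/// Sketch only: weights are modelled as one Vec<f32> row per output feature.

/// Reorder one head's rows from the DS-V3 native [nope | rope] layout to the
/// LARQL convention [rope | nope]. Applied identically to Q and K.
fn reorder_nope_rope(head: &[Vec<f32>], nope: usize, rope: usize) -> Vec<Vec<f32>> {
    assert_eq!(head.len(), nope + rope);
    let mut out = Vec::with_capacity(nope + rope);
    out.extend_from_slice(&head[nope..nope + rope]); // rope rows first
    out.extend_from_slice(&head[..nope]);            // then nope rows
    out
}

/// rope-K is MQA: one shared block of `rope` rows. Replicate it once per
/// KV head when assembling the absorbed K weight.
fn broadcast_rope_k(shared_rope_k: &[Vec<f32>], num_kv: usize) -> Vec<Vec<Vec<f32>>> {
    (0..num_kv).map(|_| shared_rope_k.to_vec()).collect()
}

fn main() {
    // Toy dimensions: nope = 2, rope = 1, hidden = 3, num_kv = 2.
    let head = vec![
        vec![1.0, 1.0, 1.0], // nope row 0
        vec![2.0, 2.0, 2.0], // nope row 1
        vec![9.0, 9.0, 9.0], // rope row
    ];
    let reordered = reorder_nope_rope(&head, 2, 1);
    assert_eq!(reordered[0], vec![9.0, 9.0, 9.0]); // rope row now leads

    let rope_k = vec![vec![9.0, 9.0, 9.0]];
    let per_head = broadcast_rope_k(&rope_k, 2);
    assert_eq!(per_head.len(), 2); // the same rope-K rows for every KV head
}
```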

Test plan

  • cargo test -p larql-inference -- gqa_attention_asym — 4 tests (shape, finite, sym-equivalence, causal)
  • cargo test -p larql-vindex -- mla_absorb — 3 tests (forward equivalence, shapes, rope broadcast)
  • cargo test -p larql-models — existing DS-V3 detection tests extended with new geometry accessors
  • cargo test -p larql-vindex — 971 tests, 0 failures

Mykhailo Korobkov added 6 commits May 14, 2026 10:49
DS-V3 absorbed attention has qk_head_dim=192 (nope=128+rope=64) but
v_head_dim=128. The existing gqa_attention uses a single head_dim for
all projections, which would corrupt V slicing and output shape.

gqa_attention_asym accepts separate qk_head_dim and v_head_dim:
- Q/K sliced with qk_head_dim (dot-product stays in the larger space)
- V sliced and output written with v_head_dim
- Returns (seq, num_q * v_head_dim)

When qk_head_dim == v_head_dim the function is numerically identical
to gqa_attention (verified by asym_sym_equivalence_when_dims_equal test).

4 tests added: shape, finiteness, sym-equivalence, seq=1 causal.
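
A condensed sketch of what this asymmetric attention step computes (not the actual larql-inference kernel; the flat row-major buffers, the signature, and the 1/sqrt(qk_head_dim) scaling are assumptions): Q/K dot products run in qk_head_dim space, the output is written with v_head_dim, and the mask is causal over positions 0..=t.

```rust
/// q: (seq, num_q * qk_head_dim), k: (seq, num_kv * qk_head_dim),
/// v: (seq, num_kv * v_head_dim), all row-major f32.
/// Returns (seq, num_q * v_head_dim).
fn gqa_attention_asym_sketch(
    q: &[f32], k: &[f32], v: &[f32],
    seq: usize, num_q: usize, num_kv: usize,
    qk_head_dim: usize, v_head_dim: usize,
) -> Vec<f32> {
    let group = num_q / num_kv;                    // query heads per KV head
    let scale = 1.0 / (qk_head_dim as f32).sqrt(); // scale by the larger dim (assumed)
    let mut out = vec![0.0f32; seq * num_q * v_head_dim];

    for t in 0..seq {
        for h in 0..num_q {
            let kv = h / group; // map query head -> shared KV head
            let q_row = &q[t * num_q * qk_head_dim + h * qk_head_dim..][..qk_head_dim];

            // Causal scores over positions 0..=t, computed in qk_head_dim space.
            let mut scores: Vec<f32> = (0..=t)
                .map(|s| {
                    let k_row = &k[s * num_kv * qk_head_dim + kv * qk_head_dim..][..qk_head_dim];
                    q_row.iter().zip(k_row).map(|(a, b)| a * b).sum::<f32>() * scale
                })
                .collect();

            // Numerically stable softmax.
            let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            let denom: f32 = scores.iter_mut().map(|x| { *x = (*x - max).exp(); *x }).sum();

            // Weighted sum of V rows; output slice uses v_head_dim, not qk_head_dim.
            let o = &mut out[t * num_q * v_head_dim + h * v_head_dim..][..v_head_dim];
            for (s, w) in scores.iter().enumerate() {
                let v_row = &v[s * num_kv * v_head_dim + kv * v_head_dim..][..v_head_dim];
                for (oi, vi) in o.iter_mut().zip(v_row) {
                    *oi += (*w / denom) * *vi;
                }
            }
        }
    }
    out
}
```
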
Three new optional fields on ModelConfig:
  qk_nope_head_dim — non-RoPE part of Q/K head dim (DS-V3: 128)
  qk_rope_head_dim — RoPE-rotated part of Q/K head dim (DS-V3: 64)
  v_head_dim       — V projection head dim (DS-V3: 128)

Parsed from config.json (qk_nope_head_dim / qk_rope_head_dim / v_head_dim).
Trait accessors added to ModelArchitecture with None defaults.
DeepSeekArch overrides to read from config.
DS-V3 detection test extended to verify all three fields round-trip.
Two GGUF test-only ModelConfig literals updated to include None stubs.
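
As a rough illustration of the parsing side (a standalone sketch, not the actual ModelConfig; only the three field names come from config.json):

```rust
use serde::Deserialize;

/// Sketch: serde leaves Option fields as None when the key is absent,
/// which matches the "optional with None defaults" behaviour described above.
#[derive(Debug, Deserialize)]
struct MlaGeometrySketch {
    qk_nope_head_dim: Option<usize>,
    qk_rope_head_dim: Option<usize>,
    v_head_dim: Option<usize>,
}

fn main() -> Result<(), serde_json::Error> {
    // DS-V3-style values; a Llama-style config without these keys would
    // deserialize with all three fields set to None.
    let cfg: MlaGeometrySketch = serde_json::from_str(
        r#"{"qk_nope_head_dim":128,"qk_rope_head_dim":64,"v_head_dim":128}"#,
    )?;
    assert_eq!(cfg.qk_nope_head_dim, Some(128));
    assert_eq!(cfg.qk_rope_head_dim, Some(64));
    assert_eq!(cfg.v_head_dim, Some(128));
    Ok(())
}
```
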
…eight matrices

Implements `mla_absorb::absorb()` which converts the four MLA weight matrices
(kv_a, kv_b, q_a, q_b) into standard dense Q/K/V tensors compatible with
`gqa_attention_asym`. Key correctness points:

- rope-K is MQA: single row in kv_a[kv_lora..] replicated num_kv times in
  absorbed K (not per-head in the input tensor)
- DS-V3 native per-head layout [nope|rope] → LARQL convention [rope|nope]
  applied symmetrically to Q and K during absorption
- V: straightforward kv_b[nope+v_hd slice] @ kv_compress

Three tests, all passing:
- absorbed_forward_matches_reference: reference MLA forward vs absorbed path
  through gqa_attention_asym must match within 1e-4
- absorbed_shapes: output tensor dimensions
- rope_k_is_broadcast_not_zero: single rope-K correctly replicated across heads
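
A hedged sketch of the V branch ("kv_b[nope+v_hd slice] @ kv_compress"). The row-major shapes assumed here, kv_b as (num_kv * (nope + v_hd), kv_lora) and kv_a as (kv_lora + rope_dim, hidden), are an interpretation of this commit message rather than the crate's actual tensor layout.

```rust
/// Plain row-major matmul: a is (m, k), b is (k, n) -> (m, n).
fn matmul(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for p in 0..k {
            let av = a[i * k + p];
            for j in 0..n {
                out[i * n + j] += av * b[p * n + j];
            }
        }
    }
    out
}

/// Absorb V for one KV head: take the v_hd rows of this head's kv_b block
/// (they sit after the nope rows) and multiply by the compression half of
/// kv_a (its first kv_lora rows; the trailing rope_dim rows are the shared
/// rope-K and do not participate in V).
fn absorb_v_for_head(
    kv_b: &[f32], kv_a: &[f32],
    head: usize, nope: usize, v_hd: usize, kv_lora: usize, hidden: usize,
) -> Vec<f32> {
    let row0 = head * (nope + v_hd) + nope; // first V row of this head's block
    let v_rows = &kv_b[row0 * kv_lora..(row0 + v_hd) * kv_lora];
    let kv_compress = &kv_a[..kv_lora * hidden];
    matmul(v_rows, kv_compress, v_hd, kv_lora, hidden) // (v_hd, hidden)
}
```
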
write_model_weights_with_opts now accepts DS-V3 / MLA architectures when all
three geometry fields (qk_nope_head_dim, qk_rope_head_dim, v_head_dim) are
present in config.json. When detected:

- skips the standard-attention guard
- per layer: fetches kv_a/kv_b/q_a/q_b projections, calls mla_absorb::absorb,
  writes the resulting dense Q/K/V under the standard attn_q/k/v key names
- O projection is passed through unchanged (no absorption needed)

The loader remains MLA-unaware: it reads standard Q/K/V tensors just as for
any Llama/Mistral model. The extra storage cost (absorbed K replicates the
MQA rope-K row num_kv times) is acceptable at DS-V3 full scale (~3.5 GB
extra across 61 layers with num_kv=128).

All 971 larql-vindex unit + integration tests pass.
Implements scalar dequantize for Q3_K (110 B/block) and Q5_K (176 B/block)
so that DeepSeek-R1-0528-Qwen3-8B-Q3_K_L and similar models can be converted
via larql gguf-to-vindex.

- q3_k.rs: unpack_q3k_scales (kmask1/kmask2 per llama.cpp), two-half-block
  loop with m-bitmask for high bits, signed-scale centred at 32.
- q5_k.rs: reuses pub(super) unpack_q4k_scales from q4_k; u1/u2 mask walk
  for high bits, 4 iterations of 64 elements each.
- mod.rs: Q3_K_BLOCK_BYTES=110, Q5_K_BLOCK_BYTES=176, dispatch in
  tensor_data_size() and dequantize().
- q4_k.rs: unpack_q4k_scales promoted to pub(super) for Q5_K reuse.
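
For orientation, a small sketch of the size bookkeeping these constants drive. The 256-element super-block (QK_K) follows the llama.cpp k-quant convention; the enum dispatch and the div_ceil padding are illustrative, not larql's actual types.

```rust
const QK_K: usize = 256; // elements per k-quant super-block (llama.cpp convention)
const Q3_K_BLOCK_BYTES: usize = 110;
const Q5_K_BLOCK_BYTES: usize = 176;

#[derive(Clone, Copy)]
enum KQuant { Q3K, Q5K }

/// Bytes needed to store `n_elements` quantized values of the given kind.
fn tensor_data_size(kind: KQuant, n_elements: usize) -> usize {
    let block_bytes = match kind {
        KQuant::Q3K => Q3_K_BLOCK_BYTES,
        KQuant::Q5K => Q5_K_BLOCK_BYTES,
    };
    n_elements.div_ceil(QK_K) * block_bytes
}

fn main() {
    // A 4096 x 4096 weight: 16_777_216 elements = 65_536 super-blocks.
    assert_eq!(tensor_data_size(KQuant::Q3K, 4096 * 4096), 65_536 * 110);
    assert_eq!(tensor_data_size(KQuant::Q5K, 4096 * 4096), 65_536 * 176);
}
```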