
feat(ggml): add Q3_K and Q5_K dequantization (types 11 and 13) #103

Open

mvkorobkov wants to merge 2 commits into chrishayuk:main from mvkorobkov:feat/q3k-q5k-dequant

Conversation

@mvkorobkov

What

Implements scalar dequantization for two missing K-quant formats:

  Type   ID   Block size   Elements/block
  Q3_K   11   110 bytes    256
  Q5_K   13   176 bytes    256

Both tensor_data_size() and dequantize() in mod.rs are wired up for the new type ids; see the dispatch sketch below.
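
A minimal sketch of that wiring, assuming the dispatch is a plain match on the raw GGUF type id. Only tensor_data_size, the two block-size constant names, and the 110/176-byte values come from this PR; the surrounding signature and error type are illustrative:

```rust
const QK_K: usize = 256;             // elements per K-quant super-block
const Q3_K_BLOCK_BYTES: usize = 110; // hmask(32) + qs(64) + scales(12) + d(2)
const Q5_K_BLOCK_BYTES: usize = 176; // d(2) + dmin(2) + scales(12) + qh(32) + qs(128)

// Illustrative signature: byte size of a tensor's quantized data, given its
// GGUF type id and element count.
fn tensor_data_size(type_id: u32, n_elements: usize) -> Result<usize, String> {
    let blocks = n_elements / QK_K;
    match type_id {
        11 => Ok(blocks * Q3_K_BLOCK_BYTES), // Q3_K
        13 => Ok(blocks * Q5_K_BLOCK_BYTES), // Q5_K
        other => Err(format!("unsupported type id {other}")), // other arms elided
    }
}
```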

Why

Without these, larql convert gguf-to-vindex fails with unsupported type id 11/13 on any model that uses Q3_K or Q5_K tensors. This includes:

  • DeepSeek-R1-0528-Qwen3-8B-Q3_K_L — 145 Q3_K tensors + 108 Q5_K tensors + 1 Q6_K
  • DeepSeek-V4-Flash-Q3_K_M (multi-shard, separate PR) — same types
  • Any model quantised with llama.cpp Q3_K_S / Q3_K_M / Q3_K_L / Q5_K_S / Q5_K_M

Implementation

q3_k.rs: dequantize_q3_k()

  • Block layout: hmask[0..32] · qs[32..96] · scales[96..108] · d[108..110]
  • unpack_q3k_scales(): 12 bytes → 16 six-bit signed values using the kmask1=0x03030303 / kmask2=0x0F0F0F0F shuffle from dequantize_row_q3_K in llama.cpp (sketched after this list)
  • Two-half loop (128 + 128 elements); the m bitmask walks through hmask, and a clear bit subtracts 4 from the 2-bit quant value
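
A minimal sketch of the scale unpack, following the kmask1/kmask2 shuffle in llama.cpp's dequantize_row_q3_K; the exact signature in q3_k.rs may differ:

```rust
// Unpack 12 packed bytes into 16 signed 6-bit sub-block scales.
fn unpack_q3k_scales(scales: &[u8; 12]) -> [i8; 16] {
    const KMASK1: u32 = 0x0303_0303;
    const KMASK2: u32 = 0x0f0f_0f0f;

    let a0 = u32::from_le_bytes(scales[0..4].try_into().unwrap());
    let a1 = u32::from_le_bytes(scales[4..8].try_into().unwrap());
    let tmp = u32::from_le_bytes(scales[8..12].try_into().unwrap());

    // The low 4 bits of each scale sit in the first 8 bytes; the high 2 bits
    // are packed four-per-byte into the last 4 bytes and spliced back on here.
    let aux = [
        (a0 & KMASK2) | ((tmp & KMASK1) << 4),
        (a1 & KMASK2) | (((tmp >> 2) & KMASK1) << 4),
        ((a0 >> 4) & KMASK2) | (((tmp >> 4) & KMASK1) << 4),
        ((a1 >> 4) & KMASK2) | (((tmp >> 6) & KMASK1) << 4),
    ];

    let mut out = [0i8; 16];
    for (i, b) in aux.iter().flat_map(|w| w.to_le_bytes()).enumerate() {
        out[i] = (b as i8) - 32; // centre the 6-bit range at 32, per llama.cpp
    }
    out
}
```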

q5_k.rs: dequantize_q5_k()

  • Block layout: d[0..2] · dmin[2..4] · scales[4..16] · qh[16..48] · qs[48..176]
  • Reuses pub(super) unpack_q4k_scales() from q4_k.rs (same 12-byte scale format as Q4_K)
  • u1/u2 bitmask pair walks through qh; set bit → add 16 to the 4-bit nibble
  • 4 iterations × 64 elements, matching dequantize_row_q5_K in llama.cpp (see the loop sketch after this list)
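
A sketch of that loop, mirroring dequantize_row_q5_K in llama.cpp. The free-standing signature (pre-converted f16 fields, pre-unpacked scale/min pairs, fixed-size slices) is illustrative, not the PR's actual interface:

```rust
fn dequantize_q5_k_block(
    d: f32,             // super-block scale (from the f16 d field)
    dmin: f32,          // super-block min (from the f16 dmin field)
    sc: &[(u8, u8); 8], // per-sub-block (scale, min), 6 bits each
    qh: &[u8; 32],      // high bits: one bit per element, plane selected by u1/u2
    qs: &[u8; 128],     // low nibbles, two elements per byte
    out: &mut [f32; 256],
) {
    let (mut u1, mut u2) = (1u8, 2u8); // bit pair selecting this pass's plane in qh
    let mut y = 0;
    for j in 0..4 {
        let (d1, m1) = (d * sc[2 * j].0 as f32, dmin * sc[2 * j].1 as f32);
        let (d2, m2) = (d * sc[2 * j + 1].0 as f32, dmin * sc[2 * j + 1].1 as f32);
        let ql = &qs[32 * j..32 * j + 32];
        for l in 0..32 {
            // set bit in qh → add 16 on top of the low nibble
            let hi: u8 = if qh[l] & u1 != 0 { 16 } else { 0 };
            out[y] = d1 * ((ql[l] & 0x0F) + hi) as f32 - m1;
            y += 1;
        }
        for l in 0..32 {
            let hi: u8 = if qh[l] & u2 != 0 { 16 } else { 0 };
            out[y] = d2 * ((ql[l] >> 4) + hi) as f32 - m2;
            y += 1;
        }
        u1 <<= 2;
        u2 <<= 2;
    }
}
```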

q4_k.rs: unpack_q4k_scales visibility changed from fn to pub(super) fn so Q5_K can share it without duplication. A sketch of the shared helper follows.
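
For reference, a sketch of what the shared helper computes, assuming it follows llama.cpp's get_scale_min_k4 packing (eight 6-bit scale/min pairs in 12 bytes); the exact signature in q4_k.rs may differ:

```rust
// Unpack the 12-byte Q4_K/Q5_K scale block into 8 (scale, min) pairs.
pub(super) fn unpack_q4k_scales(q: &[u8; 12]) -> [(u8, u8); 8] {
    let mut out = [(0u8, 0u8); 8];
    for j in 0..8 {
        out[j] = if j < 4 {
            // First four pairs: low 6 bits of bytes 0..4 (scales) and 4..8 (mins).
            (q[j] & 63, q[j + 4] & 63)
        } else {
            // Last four pairs: low/high nibbles of bytes 8..12, plus the spare
            // top two bits of the first eight bytes as the high bits.
            (
                (q[j + 4] & 0x0F) | ((q[j - 4] >> 6) << 4),
                (q[j + 4] >> 4) | ((q[j] >> 6) << 4),
            )
        };
    }
    out
}
```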

Testing

Unit tests in each module (the q3_k zero-scale case is sketched after this list):

  • q3_k: zero-scale all-zero output, hmask-clear subtracts 4, wrong-size error
  • q5_k: zero-scale all-zero output, high-bit adds 16, wrong-size error
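
The zero-scale case is cheap to state because an all-zero block has d = 0.0, so every output must be exactly zero regardless of the quant bits. A sketch, assuming dequantize_q3_k takes a byte slice and returns a Result over the decoded values (the signature is not taken from the PR):

```rust
#[test]
fn q3k_zero_scale_gives_all_zero_output() {
    let block = [0u8; 110]; // one Q3_K block: d = 0.0, all scales/quants zero
    let out = dequantize_q3_k(&block).expect("110 bytes is a valid block size");
    assert_eq!(out.len(), 256); // one super-block decodes to 256 elements
    assert!(out.iter().all(|&v| v == 0.0));
}
```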

End-to-end: larql convert gguf-to-vindex on DeepSeek-R1-0528-Qwen3-8B-Q3_K_L.gguf completes through dequantization without errors (145 Q3_K + 108 Q5_K tensors dequantized cleanly).

All 82 existing larql-models tests continue to pass.

Mykhailo Korobkov added 2 commits May 15, 2026 11:57
Implements scalar dequantize for Q3_K (110 B/block) and Q5_K (176 B/block)
so that DeepSeek-R1-0528-Qwen3-8B-Q3_K_L and similar models can be converted
via larql gguf-to-vindex.

- q3_k.rs: unpack_q3k_scales (kmask1/kmask2 per llama.cpp), two-half-block
  loop with m-bitmask for high bits, signed-scale centred at 32.
- q5_k.rs: reuses pub(super) unpack_q4k_scales from q4_k; u1/u2 mask walk
  for high bits, 4 iterations of 64 elements each.
- mod.rs: Q3_K_BLOCK_BYTES=110, Q5_K_BLOCK_BYTES=176, dispatch in
  tensor_data_size() and dequantize().
- q4_k.rs: unpack_q4k_scales promoted to pub(super) for Q5_K reuse.