Add TurboQuantKVCache: 3-bit/4-bit KV cache compression for generation #1202
dedalien wants to merge 1 commit into ml-explore:main
Conversation
Builds on ml-explore/mlx#3026 (Dan Yeh) — the generic `quantized_scaled_dot_product_attention` API with pluggable modes. Companion PR: ml-explore/mlx#XXXX (adds turbo3/turbo4 to mlx core).

New files:
- `mlx_lm/models/turbo_cache.py`: `TurboQuantKVCache(_BaseCache)`. Two-phase: prefill stores float16; generation returns `(packed_uint32, float16_scales)` for the fused kernel. On the first generation step, prefill tokens are batch-compressed. K is WHT-rotated before quantization; V is not.
- `mlx_lm/models/turbo_metal.py`: two fused Metal kernels via `mx.fast.metal_kernel`, one thread per token, float32 registers: `turbo_encode_metal` (WHT + norm + codebook + pack) and `wht_rotate_metal` (used to pre-rotate Q before SDPA).

Modified files:
- `mlx_lm/models/base.py`: detects `TurboQuantKVCache` via `isinstance`, routes to `_turbo_scaled_dot_product_attention`. Q rotation is done in float32 (a bfloat16 butterfly shifts softmax peaks on models with large key scales).
- `mlx_lm/models/cache.py`: `make_turbo_cache()` replaces `KVCache` with `TurboQuantKVCache`; leaves `ArraysCache`/DeltaNet layers untouched. `fp16_layers=` keeps the first/last N attention layers in float16.
- `mlx_lm/generate.py`: `generate_step` gains `kv_cache_type=` and `turbo_fp16_layers=`; `--kv-cache-type` and `--turbo-fp16-layers` in the CLI.
- `tests/test_turbo_cache.py`: 22 sub-cases covering WHT isometry, encode shapes, two-phase cache, D in {64, 128, 256}, gqa_factor=6, B=2, bfloat16, the fp16_layers boundary.

Supported head dims: 64, 128, 256. Requires a Metal GPU. Tested on Qwen3.6-27B (head_dim=256, 24Q/4KV, GQA=6) on a 24 GB unified memory Mac. Enables longer-context generation on memory-constrained hardware by compressing the KV cache ~5x.
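For intuition, the rotation both kernels implement is a normalized fast Walsh-Hadamard transform. Here is a Python sketch of the butterfly; the PR runs this inside the fused Metal kernels, and the sketch assumes head_dim is a power of two and uses plain MLX ops:

```python
import math
import mlx.core as mx

def wht_rotate(x: mx.array) -> mx.array:
    """Normalized fast Walsh-Hadamard transform over the last axis.

    Sketch of the rotation applied to K at store time and to Q at
    compute time (the real implementation is the fused Metal kernels).
    Assumes the last dim (head_dim) is a power of two.
    """
    *prefix, d = x.shape
    y = x
    h = 1
    while h < d:
        # One butterfly stage: split each 2h-block into halves a, b and
        # replace them with (a + b, a - b).
        y = y.reshape(*prefix, d // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = mx.concatenate([a + b, a - b], axis=-1)
        h *= 2
    # The 1/sqrt(d) factor makes the transform orthonormal, so dot
    # products are preserved: rotating Q and K by the same matrix leaves
    # QK^T unchanged, and only the quantizer itself loses information.
    return y.reshape(*prefix, d) / math.sqrt(d)
```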
Useful for pipeline-parallel setups where the KV cache gets shipped between nodes — bandwidth becomes the bottleneck at that point, not compute. A 3-bit cache would cut transfer size significantly. Question: does the WHT rotation happen at cache store time or at attention compute time? If store-time, the compression is free during generation; if compute-time, the rotation cost offsets some of the savings.
Yo — WHT happens at compute time for Q; K is rotated and compressed at store time. It costs a bit, so maybe implement a context-length threshold to enable turbo3 compression (e.g. `turbo_kv_start=1024` in `generate_step`).
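A minimal sketch of that gating idea (`turbo_kv_start` is a hypothetical name, not an option this PR adds):

```python
def choose_kv_cache_type(context_len: int, turbo_kv_start: int = 1024) -> str:
    """Hypothetical policy: keep short contexts in float16, where the
    rotation/encode overhead outweighs the savings, and switch to the
    3-bit cache once the context is long enough to matter. The "fp16"
    value is also illustrative."""
    return "turbo3" if context_len >= turbo_kv_start else "fp16"
```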
Builds on ml-explore/mlx#3026 (Dan Yeh) — the generic `quantized_scaled_dot_product_attention` API with pluggable modes. Needs PR CC-yeh/mlx#3 to be approved and merged before #3026 can merge.
What this does
Adds `TurboQuantKVCache`, a drop-in KV cache that compresses keys and values to 3 or 4 bits during generation using TurboQuant (arXiv 2504.19874):

- Compressed codes are packed into uint32, same layout as the existing affine SDPA kernel
- Two-phase cache: prefill stores float16; on the first generation step all prefill tokens are batch-compressed. Subsequent steps compress token-by-token (see the sketch after this list).
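A sketch of the two-phase control flow. Names, signatures, and the toy quantizer are illustrative stand-ins, not the PR's actual API; the layout is assumed to be (B, n_kv_heads, T, head_dim) with axis 2 as the sequence:

```python
import mlx.core as mx

def toy_encode(x: mx.array):
    # Toy stand-in for the fused encode kernel: per-vector max-abs scale
    # plus rounded signed 3-bit codes. The real kernel also applies the
    # WHT and packs the codes into uint32.
    scale = mx.max(mx.abs(x), axis=-1, keepdims=True).astype(mx.float16) + 1e-6
    codes = mx.clip(mx.round(x / scale * 4), -4, 3).astype(mx.int8)
    return codes, scale

class TwoPhaseCacheSketch:
    def __init__(self):
        self.fp16 = None                  # phase 1: raw float16 prefill
        self.codes = self.scales = None   # phase 2: compressed buffers

    def update(self, tokens: mx.array, generating: bool):
        if not generating:
            # Prefill: store float16, concatenated along the sequence axis.
            self.fp16 = tokens if self.fp16 is None else mx.concatenate(
                [self.fp16, tokens], axis=2)
            return
        if self.fp16 is not None:
            # First generation step: batch-compress every prefill token.
            self.codes, self.scales = toy_encode(self.fp16)
            self.fp16 = None
        # Subsequent steps: compress the new token(s) and append.
        c, s = toy_encode(tokens)
        if self.codes is None:
            self.codes, self.scales = c, s
        else:
            self.codes = mx.concatenate([self.codes, c], axis=2)
            self.scales = mx.concatenate([self.scales, s], axis=2)
```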
New files
- `mlx_lm/models/turbo_cache.py`: `TurboQuantKVCache(_BaseCache)`, `make_turbo_cache()`, WHT/encode helpers
- `mlx_lm/models/turbo_metal.py`: two fused Metal kernels via `mx.fast.metal_kernel` (WHT + norm + codebook + pack, and WHT-only for Q rotation), one thread per token, float32 registers

Modified files
- `mlx_lm/models/base.py`: detects `TurboQuantKVCache` via `isinstance`, routes to `_turbo_scaled_dot_product_attention`
- `mlx_lm/models/cache.py`: `make_turbo_cache()` replaces `KVCache` with `TurboQuantKVCache`; leaves `ArraysCache`/DeltaNet layers untouched; `fp16_layers=` keeps first/last N attention layers in float16
- `mlx_lm/generate.py`: `generate_step` gains `kv_cache_type=` and `turbo_fp16_layers=`; `--kv-cache-type` and `--turbo-fp16-layers` in the CLI
- `tests/test_turbo_cache.py`: 22 sub-cases covering WHT isometry, encode shapes, two-phase cache, D in {64, 128, 256}, GQA=6, B=2, bfloat16, the fp16_layers boundary

Usage
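A Python sketch of the new options. This assumes `generate` forwards extra keyword arguments through to `generate_step`; treat the exact call as illustrative and check the PR's final signature:

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3.6-27B-4bit")

# kv_cache_type= and turbo_fp16_layers= are the generate_step options
# this PR adds; passing them via generate() assumes kwargs forwarding.
text = generate(
    model,
    tokenizer,
    prompt="Hello",
    kv_cache_type="turbo3",   # or "turbo4" for 4-bit
    turbo_fp16_layers=2,      # keep first/last 2 attention layers in float16
)
print(text)
```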
Or via CLI:
```
mlx_lm.generate --model mlx-community/Qwen3.6-27B-4bit --kv-cache-type turbo3 --prompt "Hello"
```

Supported configurations
- Bits: 3 (`turbo3`) or 4 (`turbo4`)
- Head dims: 64, 128, 256
- Requires a Metal GPU

Tested on
Qwen3.6-27B (head_dim=256, 24Q/4KV, GQA=6) on a 24 GB unified memory Mac. The model fits entirely in RAM with turbo3 vs swapping to disk with fp16, enabling practical long-context generation on memory-constrained hardware (~5x KV cache compression).
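Back-of-envelope on the ~5x number, using the head_dim=256 shape above. The one-fp16-scale-per-vector overhead is my assumption from the `(packed_uint32, float16_scales)` layout:

```python
head_dim = 256
fp16_bits = 16 * head_dim        # 4096 bits per K or V vector in float16
turbo3_bits = 3 * head_dim + 16  # 784 bits: 3-bit codes + one fp16 scale (assumed)
print(f"{fp16_bits / turbo3_bits:.1f}x")  # 5.2x, consistent with the ~5x claim
```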