Add TurboQuant KV cache + DeepSeek V4 support #1067

Open

arozanov wants to merge 52 commits into ml-explore:main from arozanov:feature/turboquant-kv-cache

Conversation

@arozanov commented Mar 28, 2026

Summary

Adds TurboQuant KV cache compression and full DeepSeek-V4-Flash inference support.

TurboQuant KV Cache

Implementation of arXiv 2504.19874 (ICLR 2026).

  • 4.6x compression at 3-bit (10 values packed per uint32)
  • 0.98x FP16 speed on Qwen2.5-32B (M4 Pro 48GB)
  • Fused Metal kernels for quantize/dequantize
  • Drop-in: generate_step(prompt, model, turbo_kv_bits=3) (usage sketch below)
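
A minimal usage sketch of the drop-in call, assuming generate_step yields (token, logprobs) pairs as in recent mlx-lm; the model name and stopping logic are illustrative, and only turbo_kv_bits is new in this PR:

```python
import mlx.core as mx
from mlx_lm import load
from mlx_lm.generate import generate_step

model, tokenizer = load("mlx-community/Qwen2.5-32B-Instruct-4bit")
prompt = mx.array(tokenizer.encode("Summarize KV cache quantization."))

# turbo_kv_bits=3 enables TurboQuant compression of the KV cache
for n, (token, _logprobs) in enumerate(generate_step(prompt, model, turbo_kv_bits=3)):
    tok = int(token)
    if tok == tokenizer.eos_token_id or n >= 256:
        break
    print(tokenizer.decode([tok]), end="", flush=True)
print()
```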

DeepSeek V4 Flash (284B MoE)

First MLX implementation of DeepSeek-V4-Flash with full architecture support:

  • Compressed Sparse Attention (CSA ratio=4) with Lightning Indexer
  • Heavily Compressed Attention (HCA ratio=128) with learned compressor
  • Hyper-Connections with Sinkhorn normalization
  • Hash routing MoE (256 experts, 6 active)
  • Grouped low-rank output projection
  • MQA with inverse RoPE (see the sketch below)
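
Since "inverse RoPE" is unusual, a hedged sketch: a later commit in this PR notes it reduces to rope(-offset) during decode. Under that assumption, with mx.fast.rope (real MLX API) and illustrative parameters:

```python
import mlx.core as mx

def inverse_rope(x: mx.array, pos: int, dims: int, base: float = 10000.0) -> mx.array:
    # For a single decode token whose cached key was rotated at absolute
    # position `pos`, applying RoPE with offset=-pos composes to the
    # inverse rotation (per the commit note "rope(-offset) for decode").
    return mx.fast.rope(x, dims, traditional=True, base=base, scale=1.0, offset=-pos)
```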

Performance optimizations:

  • Custom fused Metal kernels for MoE decode (gate+up+SwiGLU, down proj, grouped wo_a)
  • MoE layer skip (8/43 layers, quality-validated)
  • mx.compile for HC and MoE modules (illustrated after this list)
  • Sparse prefill with chunked processing
  • SparseKVCache with full state serialization
  • Disk-backed prompt cache with memory-aware saving
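
A tiny illustration of the mx.compile point above (mx.compile is standard MLX; the stand-in function below is not this PR's HC/MoE code):

```python
import mlx.core as mx

@mx.compile
def mix_experts(gates: mx.array, expert_out: mx.array) -> mx.array:
    # gates: (tokens, k); expert_out: (tokens, k, d_model).
    # Compiling fuses the weighted sum over active experts into one graph.
    return (gates[..., None] * expert_out).sum(axis=-2)
```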

Results

| Model | Quant | tok/s | RAM | Hardware |
| --- | --- | --- | --- | --- |
| DeepSeek-V4-Flash (284B) | 4-bit | 21 | 161 GB | Mac Studio M3 Ultra 512GB |
| DeepSeek-V4-Flash (284B) | 8-bit | 8.5 | 303 GB | Mac Studio M3 Ultra 512GB |
| Qwen2.5-32B | FP16 + TQ 3-bit KV | 26 | 22 GB | MacBook Pro M4 Pro 48GB |

Quick Start

```bash
pip install git+https://github.com/arozanov/mlx-lm.git@feature/turboquant-kv-cache
huggingface-cli download mlx-community/deepseek-ai-DeepSeek-V4-Flash-4bit --local-dir models/v4-4bit
mlx_lm.server --model models/v4-4bit --host 127.0.0.1 --port 8080 --prompt-cache-size 5 --no-batch
```
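
To smoke-test the server, a hedged client sketch; it assumes mlx_lm.server's OpenAI-compatible /v1/chat/completions route on the host/port above:

```python
import json
import urllib.request

req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps({
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 128,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```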

Other fixes

  • Tokenizer fallback for unrecognized model types (AutoTokenizer -> PreTrainedTokenizerFast)
  • Disk cache memory check (skip save when system RAM is low)
  • Stream threading fix (generation on main thread)
  • Multi-turn cache reuse via prefill checkpoints
  • Chunked prefill crash fix for compressed layers

Test plan

  • 4-bit server: streaming, non-streaming, multi-turn
  • 8-bit server: code generation, math, reasoning
  • Chunked prefill (2K+ token prompts)
  • Cache serialization save/restore
  • SparseKVCache trim
  • Unit tests (prefill, continuation, decode, second conversation)
  • Opus audit: all critical/important issues fixed

Implements TurboQuant (arXiv 2504.19874) KV cache compression:
- PolarQuant: randomized Hadamard rotation + Lloyd-Max codebook
- Bit-packed uint32 storage (3-bit: 10 values per word; packing sketch below)
- Fused Metal kernels for quantize and dequantize
- Incremental decode buffer for O(1) per-step cost
- Layer-adaptive mode: FP16 for first/last N layers
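
For reference, a NumPy stand-in for the 3-bit packing layout (10 three-bit codes per uint32, 2 bits unused; the PR itself does this in fused Metal kernels):

```python
import numpy as np

def pack_3bit(codes: np.ndarray) -> np.ndarray:
    # codes: 1-D array of 3-bit codebook indices (0..7), length a multiple of 10
    codes = codes.reshape(-1, 10).astype(np.uint32)
    shifts = np.arange(10, dtype=np.uint32) * 3  # bit offsets 0, 3, ..., 27
    return (codes << shifts).sum(axis=1, dtype=np.uint32)  # disjoint fields: sum == OR

def unpack_3bit(words: np.ndarray) -> np.ndarray:
    shifts = np.arange(10, dtype=np.uint32) * 3
    return ((words[:, None] >> shifts) & 0x7).reshape(-1)
```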

Usage:
  generate_step(prompt, model, turbo_kv_bits=3)

Results (Qwen2.5-32B, M4 Pro 48GB):
- 4.6x compression, 0.98x FP16 speed, identical quality
- 16K context: 4.2GB → 897MB KV cache
@kipanshi

I tried this branch on GLM-4.7-Flash-REAP-23B-A3B-mlx-nvfp4 and it outputs garbage; on the main branch it works fine.

@arozanov (Author)

> I tried this branch on GLM-4.7-Flash-REAP-23B-A3B-mlx-nvfp4 and it outputs garbage; on the main branch it works fine.

That's unexpected: this branch shouldn't change default behavior; it only adds new files and optional parameters. Are you using the default generate(), or did you pass turbo_kv_bits? If the default, there might be a formatting issue from pre-commit touching generate.py; I'll check.

@kipanshi

> That's unexpected: this branch shouldn't change default behavior; it only adds new files and optional parameters. Are you using the default generate(), or did you pass turbo_kv_bits? […]

This is the script I used:

```bash
#!/bin/bash
# Run GLM-4.7-Flash-REAP-23B-A3B with TurboQuant KV cache
# Optimized for M1 Max 32GB

MODEL_DIR="$HOME/my_docs/llms/GLM-4.7-Flash-REAP-23B-A3B-mlx-mxfp4"
MLX_LM_DIR="$HOME/opt/mlx-lm"

TURBO_KV_BITS="${TURBO_KV_BITS:-4}"       # 3-bit = 4.6x compression, 4-bit = safer quality
TURBO_FP16_LAYERS="${TURBO_FP16_LAYERS:-1}" # first/last N layers stay FP16
MAX_TOKENS="${MAX_TOKENS:-4096}"
TEMP="${TEMP:-0.7}"
TOP_P="${TOP_P:-0.9}"

PROMPT="${1:-Hello, who are you?}"

cd "$MLX_LM_DIR" || exit 1

uv run python -c "
from mlx_lm import load, stream_generate
from mlx_lm.generate import make_sampler
import sys

model, tokenizer = load('${MODEL_DIR}')

sampler = make_sampler(temp=${TEMP}, top_p=${TOP_P})

prompt = sys.argv[1]
if tokenizer.has_chat_template:
    messages = [{'role': 'user', 'content': prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True,
    )

for response in stream_generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=${MAX_TOKENS},
    sampler=sampler,
    turbo_kv_bits=${TURBO_KV_BITS},
    turbo_fp16_layers=${TURBO_FP16_LAYERS},
):
    print(response.text, end='', flush=True)
print()
" "$PROMPT"
```

@arozanov (Author)

> This is the script I used: […]

Ah, got it: you're using turbo_kv_bits=4. That's expected to have quality issues on the K tensor; I've seen the same thing. Try turbo_kv_bits=3, which actually works better (counterintuitively, the 3-bit codebook fits the post-rotation Gaussian distribution better than the 4-bit one for K). Also, for a 23B model, try increasing turbo_fp16_layers=2 or turbo_fp16_layers=4 to keep more layers in full precision.

@arozanov (Author)

> This is the script I used: […]

Found it. Your config (turbo_kv_bits=4, turbo_fp16_layers=1) should work on most models, but MoE architectures like GLM-4.7-Flash might need more FP16 layers. Try turbo_fp16_layers=4 or turbo_fp16_layers=6. In my 7B tests, both 3-bit and 4-bit with turbo_fp16_layers=1 produce clean output.

@kipanshi

> Try turbo_kv_bits=3 […] Also, for a 23B model, try increasing turbo_fp16_layers=2 or turbo_fp16_layers=4 […]

With the params you suggested, same garbage issue:

> Without TurboQuant the branch works fine; the model outputs correctly. So the TurboQuant cache itself is incompatible with the glm4_moe_lite MLA architecture. MLA stores compressed latents (kv_lora_rank=512, qk_rope_head_dim=64) in the cache, not standard key/value tensors, and TurboQuant's PolarQuant rotation likely can't handle that compressed representation correctly.

I'll try testing it with Qwen3.5 35B MoE.

@arozanov (Author)

> With the params you suggested, same garbage issue […] the TurboQuant cache itself is incompatible with the glm4_moe_lite MLA architecture. […]

Yeah, MLA is a different beast; it makes sense that it breaks. Good catch. Qwen3.5 should be fine since it uses standard attention. Let me know how it goes.

@kipanshi

> Qwen3.5 should be fine since it uses standard attention. Let me know how it goes.

Did more testing:

  • GLM-4.7-Flash (MLA): loads but produces garbage (MLA latent cache incompatible)
  • Qwen3.5-35B-A3B (hybrid SSM/attention): crashes (SSM cache not supported)
  • Standard attention models (Llama, Mistral): works correctly

@arozanov (Author)

> Did more testing: […]

Thanks for testing across architectures. The MLA and SSM failures are expected; TurboQuant only works with a standard multi-head attention KV cache. I should add a check that raises a clear error instead of silently producing garbage. Will fix.
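
For concreteness, a hypothetical sketch of such a guard; the detection heuristics and names are illustrative assumptions, not this PR's final code (MambaCache is mlx-lm's real SSM cache class):

```python
def assert_turbo_quant_supported(model):
    # Illustrative MLA check: MLA configs expose kv_lora_rank, and the cache
    # then stores compressed latents rather than standard K/V tensors.
    if getattr(getattr(model, "args", None), "kv_lora_rank", None):
        raise ValueError(
            "turbo_kv_bits is not supported for MLA models; "
            "their cache holds latents, not key/value tensors."
        )
    # Illustrative SSM/hybrid check via the model's own cache constructor.
    caches = model.make_cache() if hasattr(model, "make_cache") else []
    if any(type(c).__name__ == "MambaCache" for c in caches):
        raise ValueError("turbo_kv_bits is not supported for SSM/hybrid models.")
```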

@babhishek21

some thoughts for DX:

  1. Since this is a compression scheme, perhaps it should get the same treatment as KVCache#to_quantized(): the basic unbounded KVCache could have a to_turbo_quantized() (or equivalent) that returns a TurboQuantKVCache.
  2. Possibly a generalized convert_to_turbo_quantized() function (similar to how #1059, "[Experimental] Add TurboQuantKVCache: PolarQuant KV cache compression at 2-4 bits", does it), with supporting cache specializations progressively adopting to_turbo_quantized() where supported.
  3. make_prompt_cache should still be the entry point for cache-related feature enablement, in this case whether to enable TurboQuant compression. Just as max_kv_size switches to a bounded cache, params turboq_kv_bits and turboq_fp16_layers could switch on TurboQuant (a sketch follows this list). This would also help dedupe the logic around if hasattr(model, "make_cache"): return model.make_cache().
  4. Lib users should still be able to pass any custom prompt_cache to generate.
  5. CLI users should be able to pass TurboQuant args the same way they pass --max-kv-size.
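
A minimal sketch of point 3, under stated assumptions: the turbo_* parameters are hypothetical here, TurboQuantKVCache is the class this PR adds (import path unknown), while make_prompt_cache, KVCache, and RotatingKVCache are real mlx-lm names:

```python
from mlx_lm.models.cache import KVCache, RotatingKVCache

def make_prompt_cache(model, max_kv_size=None, turbo_kv_bits=None, turbo_fp16_layers=0):
    # Model-specific caches (sparse, SSM, ...) keep taking precedence.
    if hasattr(model, "make_cache"):
        return model.make_cache()
    n = len(model.layers)
    if turbo_kv_bits is not None:
        # Layer-adaptive: the first/last turbo_fp16_layers stay FP16.
        return [
            KVCache()
            if i < turbo_fp16_layers or i >= n - turbo_fp16_layers
            else TurboQuantKVCache(bits=turbo_kv_bits)  # the PR's new class
            for i in range(n)
        ]
    if max_kv_size is not None:
        return [RotatingKVCache(max_size=max_kv_size) for _ in range(n)]
    return [KVCache() for _ in range(n)]
```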

@arozanov (Author)

> some thoughts for DX: […]

Good points, agree on all of them. Specifically:

  1. to_turbo_quantized() on KVCache - makes sense, will add
  2. Routing through make_prompt_cache instead of a separate function - cleaner, agreed
  3. CLI args --turbo-kv-bits and --turbo-fp16-layers alongside --max-kv-size - will do

I'll rework the PR to follow the existing patterns. Thanks for the detailed review.

@babhishek21

@arozanov I think you'll need to add tests.
@awni @andresy with that, I think this PR will probably supersede #1059

arozanov pushed a commit to arozanov/vllm-mlx that referenced this pull request Mar 29, 2026
Adds --turbo-kv-bits flag (1-4) to compress stored prefix cache entries
using TurboQuant (arXiv 2504.19874). 3-bit gives 4.6x compression vs FP16,
compared to ~2x from the existing 8-bit quantization.

Integration points:
- memory_cache.py: _turbo_quantize_cache/_dequantize_cache, memory estimation,
  trim support, needs_dequantize property, config validation
- scheduler.py: turbo_kv_bits in SchedulerConfig, propagation to MemoryCacheConfig
- cli.py: --turbo-kv-bits for serve and bench commands

Requires mlx-lm with TurboQuant support (ml-explore/mlx-lm#1067).
arozanov added a commit to arozanov/vllm-mlx that referenced this pull request Mar 29, 2026
@arozanov force-pushed the feature/turboquant-kv-cache branch from a087778 to 9315fbc on March 29, 2026
@QROST commented Apr 1, 2026

#1064 #1063 #1059

@deceptech-packet-ninja

Findings from independent testing + potential contributions

I've been working on TurboQuant for MLX-LM independently and have some findings that might be useful for this PR.

1. Bug in MLX core PR #3328 (turboquant_sdpa kernel)

The kernel dispatches N = k_norms.shape(1) at line 450 of scaled_dot_product_attention.cpp, but that reads the head dimension instead of the sequence length; it should be k_norms.shape(2). After the fix, the kernel's output matches the reference exactly (cosine similarity 1.000). I commented on PR #3328 with the fix and benchmarks.

2. Value compression (4-bit alongside 3-bit keys)

The current implementation stores values as FP16. Adding 4-bit affine quantization for values (via standard mx.quantize) doubles the memory savings:

| Context | Keys only compressed | Keys + values compressed |
| --- | --- | --- |
| 50K (Llama-3-8B) | 4.3 GB KV | 2.7 GB KV |
| 200K (Llama-3-8B) | 14.9 GB KV | 10.7 GB KV |

Speed impact is negligible (0.94x vs 0.95x of FP16). Quality is identical; values tolerate 4-bit well.
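
A sketch of that value-compression idea using MLX's standard affine quantization (mx.quantize/mx.dequantize are real APIs; the shapes and group size are illustrative assumptions):

```python
import mlx.core as mx

def compress_values(v: mx.array, bits: int = 4, group_size: int = 64):
    # v: (..., head_dim) FP16 values; quantize along the last axis.
    *lead, d = v.shape
    v_q, scales, biases = mx.quantize(v.reshape(-1, d), group_size=group_size, bits=bits)
    return v_q, scales, biases, lead + [d]

def decompress_values(v_q, scales, biases, shape, bits: int = 4, group_size: int = 64):
    v = mx.dequantize(v_q, scales, biases, group_size=group_size, bits=bits)
    return v.reshape(shape)
```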

3. 200K context proof (32GB Mac)

On a 32GB machine with Llama-3-8B-Instruct-4bit:

  • FP16 at 200K context: the system enters swap death (unusable for over an hour; I had to kill the process)
  • TurboQuant at 200K: completed the prefill and generated tokens successfully; KV cache = 10.7 GB

This is concrete proof that TurboQuant enables context lengths that simply don't fit in FP16 on memory-constrained machines.

4. Quality on real prompts

Tested coding, explanation, and analysis prompts. Outputs are near-identical to standard:

```
Standard:   "quantum computers are special computers that can do some things that regular computers can't.
             Imagine you have a big box of different colored balls..."
TurboQuant: "quantum computers are special computers that can do some really cool things that regular computers can't.
             Imagine you have a big box of different colored balls..."
```

5. Separate infrastructure PRs

We have two related PRs that complement TurboQuant:

Happy to help with testing, benchmarks, or integration work on this PR. Great implementation.

Thump604 pushed a commit to arozanov/vllm-mlx that referenced this pull request Apr 11, 2026
arozanov added a commit to arozanov/vllm-mlx that referenced this pull request Apr 16, 2026
arozanov added a commit to arozanov/vllm-mlx that referenced this pull request Apr 16, 2026
Fail-fast when --turbo-kv-bits is requested without mlx-lm TurboQuant
support: config and CLI now error out with an actionable message
pointing to ml-explore/mlx-lm#1067 instead of silently no-oping.

Reject the --turbo-kv-bits + --kv-cache-quantization combination in
both argparse-fed callers and in MemoryCacheConfig.__post_init__ so
programmatic users get the same guard.

Log a one-time warning when _TurboQuantCacheWrapper.is_trimmable()
degrades to False because the upstream TurboQuantKVCache lacks copy();
prefix-cache trimming (supersequence / LCP reuse) falls back to
full-prefix matching in that case (correct, just less efficient).

Document the dual estimate_kv_cache_memory paths (wrapper vs bare) and
the copy() contract we depend on in _trim_cache_offset.
Saves KV cache at prefill completion and every 32K tokens during
long prefills. Enables prefix matching on follow-up messages in
multi-turn conversations.

Previously cache was only saved after generation with key =
prompt + generated tokens. Next turn re-tokenizes the assistant
response with template wrappers, producing different tokens and
breaking prefix match. Now saves with prompt-only key at prefill
end, so next turn matches the prompt prefix and skips it.

Tested: 46-token prompt cached, follow-up processed only 16 new
tokens instead of full 62.
When all prompt tokens match a cached entry, rest=[] causes
stream_generate to crash with ValueError (empty prompt). Fix:
trim cache by 1 token and re-process the last token.

Tested: exact same prompt sent twice, second request processes
only 1 token with 40 cached. No crash.
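
A hedged sketch of that fix; trim_prompt_cache is real mlx-lm API, while the helper and its inputs are illustrative:

```python
from mlx_lm.models.cache import trim_prompt_cache

def remaining_prompt(tokens, n_matched, cache):
    # n_matched: length of the common prefix between the prompt and the
    # cached entry. If everything matched, stream_generate would receive an
    # empty prompt, so back the cache up by one token and re-process it.
    if n_matched == len(tokens):
        trim_prompt_cache(cache, 1)
        n_matched -= 1
    return tokens[n_matched:]
```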
Disk entries were capped at 2x prompt-cache-size, causing old
caches to be evicted too aggressively. Disk is cheap, RAM is not.

Added --prompt-cache-disk-size (default 100) to control disk
entries independently from --prompt-cache-size (RAM entries).

Tested: 5 RAM entries + 10 disk entries, capped correctly.
@arozanov force-pushed the feature/turboquant-kv-cache branch from 10f2bfc to fda593e on April 30, 2026
JianweiChen2021 pushed a commit to JianweiChen2021/mlx-lm that referenced this pull request May 2, 2026
…latest main)

Resolve mlx_lm/server.py using prior merge artifact 7aeb6df (fda593e + ed1fca4).

Made-with: Cursor
arozanov added 8 commits May 4, 2026 23:10
Full implementation of DeepSeek-V4-Flash architecture:
- Compressed Sparse Attention (CSA ratio=4, HCA ratio=128)
- Lightning Indexer for top-k compressed position selection
- Learned compressor with overlap transform
- Hyper-Connections with Sinkhorn normalization
- Hash routing MoE (256 experts, 6 active)
- Grouped low-rank output projection
- MQA with inverse RoPE

Sparse attention with window + compressed KV:
- Chunked sparse prefill (256 queries/chunk)
- Sparse decode with circular window buffer
- Step-based buffer growth for compressed entries
- RotatingKVCache for pure sliding window layers
- SparseKVCache with full state serialization

Tested on DeepSeek-V4-Flash-8bit (303GB, 6.4 tok/s).
AutoTokenizer.from_pretrained crashes on model types that transformers
doesn't know (e.g. deepseek_v4) due to rope_scaling standardization.
Fall back to PreTrainedTokenizerFast when AutoTokenizer fails.

Also register deepseek_v4 config with transformers AutoConfig.
SparseKVCache.update_and_fetch didn't preserve existing data on
reallocation, causing shape mismatch on continuation prefill chunks.

Also handle continuation prefill correctly: when the server splits
a long prompt into chunks, subsequent chunks now extend the existing
sparse buffers instead of reinitializing them.
Serialization materializes all KV cache arrays, temporarily doubling
memory usage. For large models (400GB+) on machines with limited
headroom this triggers OOM. Check available memory before saving
and skip if less than 8GB free.
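
A sketch of that guard, assuming psutil for the free-RAM check (the commit doesn't say which mechanism is used; save_prompt_cache is real mlx-lm API):

```python
import psutil
from mlx_lm.models.cache import save_prompt_cache

MIN_FREE_BYTES = 8 * 1024**3  # skip the save below 8 GB free, per the commit

def maybe_save_prompt_cache(path, cache):
    # Serialization materializes every KV array, briefly doubling KV memory,
    # so refuse to save when the system is already near its limit.
    if psutil.virtual_memory().available < MIN_FREE_BYTES:
        return False
    save_prompt_cache(path, cache)
    return True
```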
Compressor decode mode expects single-token input (L=1) but
continuation prefill chunks pass L>1. Process chunk tokens
one-by-one through the compressor to maintain correct state.
- Fix Metal resource limit crash: token-by-token compressor loop in
  continuation prefill creates too many buffers. Add periodic mx.eval
  to flush every 32 tokens.
- Fix missing _win_buf on second generate call: use cache.offset==0
  to detect first prefill instead of checking _win_buf existence.
- Add safety init in sparse decode for single-token prompt edge case.
- Cap Sinkhorn iterations at 10 (from 20) for ~12% speed improvement
  with negligible quality difference.
Custom Metal kernels for MoE decode (fused_moe_kernel.py):
- Fused gate+up+SwiGLU: all experts in one dispatch (1.8x MoE speedup)
- Fused down projection: all experts in one dispatch
- Fused grouped output projection: 8 groups in one dispatch
- 4-bit inline dequantization matching MLX qmv pattern

Model optimizations (deepseek_v4.py):
- MoE layer skip on 8/43 layers during decode (quality-validated)
- mx.compile for HC pre/post and MoE modules
- Inverse RoPE simplified to rope(-offset) for decode
- Fused Q+KV projection via weight concatenation
- Step-based buffer growth for compressed KV

Bugfixes from Opus audit:
- Fix inverse RoPE for prefill (L>1) with per-position offsets
- Fix SparseKVCache state serialization alignment
- Fix continuation prefill window buffer (incremental update)
- Fix SparseKVCache.trim to invalidate sparse state
- Fix output dtype mismatch (float32 -> input dtype)
- Add K%512 and N%8 assertions for Metal kernels
- Fix id()-based cache leak in fused_grouped_wo
- Add scale handling in _apply_rope_at_positions
- Remove dead code (fused_qkv_proj)

SwitchGLU decode path (switch_layers.py):
- Sequential per-expert processing with fused Metal kernels
- Automatic fallback for non-4-bit quantization

Result: 6.5 -> 21 tok/s through server (3.2x), 161GB peak memory.
@arozanov changed the title from "Add TurboQuant KV cache compression (3-bit, 4.6x)" to "Add TurboQuant KV cache + DeepSeek V4 support" on May 8, 2026
arozanov added 4 commits May 8, 2026 11:29
- Add test_deepseek_v4.py (28 tests: model creation, prefill/decode,
  continuation, multi-turn, cache serialization, compressor, fused kernels)
- Add --turbo-kv-bits and --turbo-fp16-layers to server CLI
- Add MLA/SSM guard in make_prompt_cache (ValueError instead of garbage)
- Add SparseKVCache to prompt cache save/load allowlist
- Fix stale sparse state across conversations (reset on offset==0)
- Gate MoE layer skip on num_hidden_layers==43
- Assert matching quant params in fused QKV projection
- Accept upstream server.py refactor, re-apply turbo args on top
- Fallback for mx.new_thread_local_stream (older MLX versions)
- FP8 e4m3 block dequant from original HF checkpoints (auto-detect format)
- FP4 packed expert dequant with ue8m0 block scaling
- MTP weight filtering and HF key remapping in sanitize()
- 4-bit affine value compression (--turbo-v-bits) for 2x memory savings
- BatchSparseKVCache with merge/unmerge for server batch mode
- Per-entry sparse decode for compressor state machine compatibility
@arozanov (Author) commented May 8, 2026

Merged upstream, added tests, fixed the reported issues.

Tests: 69 total (test_deepseek_v4.py + test_turboquant.py), all pass.

New stuff:

  • FP8 e4m3 dequant from original HF checkpoints (auto-detects, mlx-community weights still work)
  • --turbo-v-bits for 4-bit value compression
  • BatchSparseKVCache - server works without --no-batch
  • Compat with older MLX (tested on 0.29.3)

Verified 22.3 tok/s on 4-bit V4 Flash, M3 Ultra 512GB.

@babhishek21 - all DX points done (to_turbo_quantized, make_prompt_cache, CLI args in generate + server)

@kipanshi - make_prompt_cache raises ValueError for MLA and SSM models now

@deceptech-packet-ninja - value compression is in as --turbo-v-bits
