
fix: skip KV cache quantization in single-node BatchGenerator mode #1990

Open
adurham wants to merge 1 commit into exo-explore:main from adurham:fix-single-node-kv-quant

Conversation

@adurham
Contributor

@adurham adurham commented Apr 26, 2026

Single-node inference crashed with:
ValueError: <class 'mlx_lm.models.cache.QuantizedKVCache'> does not yet support batching with history

mlx-lm's BatchGenerator calls _merge_caches on every step — even when there's only one prompt in flight — and that helper requires every layer's cache to implement .merge(). QuantizedKVCache has no merge implementation, so any single-node inference with EXO_KV_CACHE_BITS set crashes on the first real request.
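The failure mode can be reproduced in miniature. The sketch below uses stand-in classes and a simplified `merge_caches_sketch` helper, all hypothetical names that are not the mlx-lm implementation; the real `_merge_caches` signature and internals may differ. It only illustrates why a cache type without `merge` fails even when a single prompt is in flight.

```python
# Stand-in classes for illustration only; these are NOT the mlx-lm classes.
class KVCacheSketch:
    """Plain cache stand-in that knows how to merge with its peers."""

    def __init__(self, tokens=None):
        self.tokens = list(tokens or [])

    @classmethod
    def merge(cls, caches):
        merged = cls()
        for cache in caches:
            merged.tokens.extend(cache.tokens)
        return merged


class QuantizedKVCacheSketch:
    """Quantized cache stand-in with no merge(), mirroring the gap described above."""

    def __init__(self, tokens=None):
        self.tokens = list(tokens or [])


def merge_caches_sketch(per_prompt_caches):
    """Merge per-layer caches across prompts (hypothetical simplification).

    per_prompt_caches is a list with one entry per prompt; each entry is a
    list with one cache per layer. The merge runs even for a single prompt.
    """
    merged_layers = []
    for layer_caches in zip(*per_prompt_caches):
        cache_type = type(layer_caches[0])
        if not hasattr(cache_type, "merge"):
            raise ValueError(
                f"{cache_type} does not yet support batching with history"
            )
        merged_layers.append(cache_type.merge(layer_caches))
    return merged_layers
```

Calling `merge_caches_sketch([[QuantizedKVCacheSketch([1])]])` raises the `ValueError` even though only one prompt is present, because the check applies to the cache type and not to the batch size.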

Fix: only build QuantizedKVCache when the model is actually running in pipeline-parallel (PP) mode (detected by the PipelineFirstLayer/PipelineLastLayer wrappers). Single-node inference falls back to the vanilla KVCache and logs that EXO_KV_CACHE_BITS is being ignored, so the operator can see what's happening. Fixes #1875.
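A minimal sketch of the selection logic described above, assuming the wrappers can be detected with `isinstance` on the model's layers. The classes here are empty stand-ins so the example runs on its own, and `is_pipeline_parallel`, `make_caches`, and `kv_bits_from_env` are hypothetical helper names, not the functions changed in this PR.

```python
import logging
import os

logger = logging.getLogger(__name__)


# Stand-ins so the sketch is self-contained; the real wrappers and cache
# classes live in exo and mlx-lm and have richer constructors.
class PipelineFirstLayer:
    def __init__(self, inner):
        self.inner = inner


class PipelineLastLayer:
    def __init__(self, inner):
        self.inner = inner


class KVCache:
    pass


class QuantizedKVCache:
    def __init__(self, bits):
        self.bits = bits


def is_pipeline_parallel(layers):
    """Treat the model as PP if any layer carries a pipeline wrapper."""
    return any(
        isinstance(layer, (PipelineFirstLayer, PipelineLastLayer))
        for layer in layers
    )


def make_caches(layers, kv_bits):
    """Build one cache per layer, quantizing only in PP mode."""
    if kv_bits is None:
        return [KVCache() for _ in layers]
    if is_pipeline_parallel(layers):
        return [QuantizedKVCache(bits=kv_bits) for _ in layers]
    # Single-node: the batched path needs merge(), so fall back and say so.
    logger.warning(
        "EXO_KV_CACHE_BITS=%s ignored in single-node mode; using KVCache",
        kv_bits,
    )
    return [KVCache() for _ in layers]


def kv_bits_from_env(environ=os.environ):
    """Read EXO_KV_CACHE_BITS; returns None when unset or empty."""
    raw = environ.get("EXO_KV_CACHE_BITS")
    return int(raw) if raw else None
```

Logging a warning instead of raising keeps existing single-node deployments running with the variable set, at the cost of silently higher cache memory use if the log line goes unread.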

Single-node inference crashed with:

  ValueError: <class 'mlx_lm.models.cache.QuantizedKVCache'> does not
  yet support batching with history

mlx-lm's BatchGenerator calls _merge_caches on every step — even when
there's only one prompt in flight — and that helper requires every
layer's cache to implement .merge(). QuantizedKVCache has no merge
implementation, so any single-node inference with EXO_KV_CACHE_BITS set
crashes on the first real request.

Pipeline-parallel (PP) mode kept working because it uses a different
inference path that doesn't go through _merge_caches.

Fix: only build QuantizedKVCache when the model is actually running in
PP mode (detected by PipelineFirstLayer/PipelineLastLayer wrappers).
Single-node falls back to vanilla KVCache, and logs that EXO_KV_CACHE_BITS
is being ignored so the operator can see what's happening.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
adurham pushed a commit to adurham/exo that referenced this pull request Apr 26, 2026

Development

Successfully merging this pull request may close these issues.

BatchGenerator doesn't support KV cache quantization (kv_bits)
