fix: skip KV cache quantization in single-node BatchGenerator mode#1990
Open
adurham wants to merge 1 commit into
Conversation
Single-node inference crashed with:

ValueError: <class 'mlx_lm.models.cache.QuantizedKVCache'> does not yet support batching with history

mlx-lm's BatchGenerator calls _merge_caches on every step, even when there is only one prompt in flight, and that helper requires every layer's cache to implement .merge(). QuantizedKVCache has no merge implementation, so any single-node inference with EXO_KV_CACHE_BITS set crashes on the first real request. Pipeline-parallel (PP) mode kept working because it uses a different inference path that never goes through _merge_caches.

Fix: only build QuantizedKVCache when the model is actually running in PP mode (detected by the PipelineFirstLayer/PipelineLastLayer wrappers). Single-node falls back to vanilla KVCache and logs that EXO_KV_CACHE_BITS is being ignored, so the operator can see what's happening.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
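The selection logic described above can be sketched roughly as follows. This is a minimal illustration, not exo's actual code: the class names `KVCache`, `QuantizedKVCache`, `PipelineFirstLayer`, and `PipelineLastLayer` stand in for the real mlx-lm/exo types, and `make_caches` / `is_pipeline_parallel` are hypothetical helper names.

```python
import os

# Stand-ins for the real mlx-lm/exo classes; for illustration only.
class KVCache: ...
class QuantizedKVCache: ...
class PipelineFirstLayer: ...  # wrapper marking the first pipeline stage
class PipelineLastLayer: ...   # wrapper marking the last pipeline stage

def is_pipeline_parallel(layers) -> bool:
    # PP mode is detected by the presence of the pipeline wrappers.
    return any(isinstance(l, (PipelineFirstLayer, PipelineLastLayer))
               for l in layers)

def make_caches(layers, log=print):
    kv_bits = os.environ.get("EXO_KV_CACHE_BITS")
    if kv_bits and is_pipeline_parallel(layers):
        # The PP inference path never calls _merge_caches, so
        # quantized caches are safe here.
        return [QuantizedKVCache() for _ in layers]
    if kv_bits:
        # BatchGenerator's _merge_caches needs .merge(), which
        # QuantizedKVCache lacks; fall back and tell the operator.
        log("EXO_KV_CACHE_BITS is set but ignored in single-node mode")
    return [KVCache() for _ in layers]
```

The key design point is that the env var alone no longer decides the cache type; the detected execution mode gates it, so a single-node deployment degrades gracefully instead of crashing.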
adurham pushed a commit to adurham/exo that referenced this pull request on Apr 26, 2026:
Reflects the opening of PRs #3455, #3456, exo-explore#1989, exo-explore#1990, and exo-explore#1991.
Fixes #1875.