fix: skip KV cache quantization in single-node BatchGenerator mode#1990
Open
adurham wants to merge 1 commit into
Conversation
Single-node inference crashed with:

ValueError: <class 'mlx_lm.models.cache.QuantizedKVCache'> does not yet support batching with history

mlx-lm's BatchGenerator calls _merge_caches on every step, even when there is only one prompt in flight, and that helper requires every layer's cache to implement .merge(). QuantizedKVCache has no merge implementation, so any single-node inference with EXO_KV_CACHE_BITS set crashes on the first real request. Pipeline-parallel (PP) mode kept working because it uses a different inference path that never goes through _merge_caches.

Fix: only build QuantizedKVCache when the model is actually running in PP mode (detected by the PipelineFirstLayer/PipelineLastLayer wrappers). Single-node falls back to vanilla KVCache and logs that EXO_KV_CACHE_BITS is being ignored, so the operator can see what's happening.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
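The selection logic described above can be sketched roughly as follows. This is a minimal illustration, not exo's actual code: the class names `KVCache`, `QuantizedKVCache`, `PipelineFirstLayer`, and `PipelineLastLayer` stand in for the real mlx-lm/exo types, and `make_caches` / `is_pipeline_parallel` are hypothetical helper names.

```python
import os

# Stand-ins for the real mlx-lm/exo classes; for illustration only.
class KVCache: ...
class QuantizedKVCache: ...
class PipelineFirstLayer: ...  # wrapper marking the first pipeline stage
class PipelineLastLayer: ...   # wrapper marking the last pipeline stage

def is_pipeline_parallel(layers) -> bool:
    # PP mode is detected by the presence of the pipeline wrappers.
    return any(isinstance(l, (PipelineFirstLayer, PipelineLastLayer))
               for l in layers)

def make_caches(layers, log=print):
    kv_bits = os.environ.get("EXO_KV_CACHE_BITS")
    if kv_bits and is_pipeline_parallel(layers):
        # The PP inference path never calls _merge_caches, so
        # quantized caches are safe here.
        return [QuantizedKVCache() for _ in layers]
    if kv_bits:
        # BatchGenerator's _merge_caches needs .merge(), which
        # QuantizedKVCache lacks; fall back and tell the operator.
        log("EXO_KV_CACHE_BITS is set but ignored in single-node mode")
    return [KVCache() for _ in layers]
```

The key design point is that the env var alone no longer decides the cache type; the detected execution mode gates it, so a single-node deployment degrades gracefully instead of crashing.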
adurham pushed a commit to adurham/exo that referenced this pull request on Apr 26, 2026:
Reflects the opening of PRs #3455, #3456, exo-explore#1989, exo-explore#1990, and exo-explore#1991.
Fixes #1875.