fix(server): wire --prompt-cache-bytes CLI flag to LRUPromptCache #1267
Open
andreinknv wants to merge 1 commit into
Conversation
The `--prompt-cache-bytes` CLI flag has been declared on
`mlx_lm.server`'s argparser since it landed, but it was never forwarded
to the `LRUPromptCache` constructor: `server.py:1743` only passes
`prompt_cache_size`. As a result, even operators who explicitly set
`--prompt-cache-bytes 1GB` saw an unbounded cache (default
`max_bytes = 1 << 63`), and a long-running server could accumulate
enough KV cache to exhaust GPU memory mid-inference, producing
`kIOGPUCommandBufferCallbackErrorOutOfMemory` from Metal on the next
chat completion.
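To make the gap concrete, here is a minimal sketch of the pattern. The option names are taken from this PR; the class below is a stand-in with the defaults described here, not the real `server.py`:

```python
import argparse

class LRUPromptCache:  # stand-in with the defaults described in this PR
    def __init__(self, max_size, max_bytes=1 << 63):
        self.max_size = max_size
        self.max_bytes = max_bytes

parser = argparse.ArgumentParser()
parser.add_argument("--prompt-cache-size", type=int, default=10)  # defaults illustrative
parser.add_argument("--prompt-cache-bytes", type=str, default=None)
args = parser.parse_args(["--prompt-cache-bytes", "1GB"])

# The bug: only the sequence-count limit is forwarded, so the byte
# limit the operator asked for is silently dropped.
cache = LRUPromptCache(max_size=args.prompt_cache_size)
print(cache.max_bytes == 1 << 63)  # True: effectively unbounded despite the flag
```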
Reproduction: a 4-server pool of `mlx_lm.server --model
mlx-community/granite-4.0-1b-4bit` on an M4 Max 36 GB. After ~430
classify-style chat-completion requests, one pool member's prompt
cache reaches 10 sequences / 0.51 GB, and a new 3,731-token prompt
trips Metal's command-buffer status. Server log immediately before
the crash:
```
Prompt Cache: 10 sequences, 0.51 GB
POST /v1/chat/completions HTTP/1.1 200 -
libc++abi: terminating due to uncaught exception of type
std::runtime_error: [METAL] Command buffer execution failed:
Insufficient Memory (kIOGPUCommandBufferCallbackErrorOutOfMemory)
```
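For reference, the request pattern is easy to approximate with a plain loop. A sketch: the model name matches the setup above, while the port, payload shape, and prompt text are illustrative assumptions:

```python
import json
import urllib.request

URL = "http://localhost:8080/v1/chat/completions"  # one pool member; port assumed

payload = {
    "model": "mlx-community/granite-4.0-1b-4bit",
    "messages": [{"role": "user", "content": "Classify the following text: ..."}],
    "max_tokens": 8,
}

# The crash above was observed around request ~430 as the prompt cache grew.
for _ in range(500):
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).read()
```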
(The terminate → abort path itself is mlx-core
ml-explore/mlx#3317 / ml-explore/mlx#3519; this PR addresses the
*trigger*: an unbounded prompt cache that the operator cannot actually
bound via the documented CLI flag.)
Fix: forward `--prompt-cache-bytes` through `cli_args` into the
`LRUPromptCache` constructor (total diff: +13 / −1 LOC, one file).
When the flag is omitted, behavior is unchanged: `LRUPromptCache`'s
existing `max_bytes = 1 << 63` default preserves back-compat.
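In sketch form, continuing the stand-ins from the first example above (`_parse_size` is the helper named in this PR; the body below is an illustrative stand-in, not the real implementation):

```python
# Illustrative stand-in for the size parser named in this PR.
def _parse_size(s: str) -> int:
    units = {"KB": 10**3, "MB": 10**6, "GB": 10**9}
    for suffix, mult in units.items():
        if s.upper().endswith(suffix):
            return int(float(s[: -len(suffix)]) * mult)
    return int(s)

# The fix: forward the byte limit when the flag is given, otherwise keep
# the unbounded default so no-flag behavior is unchanged.
max_bytes = (
    _parse_size(args.prompt_cache_bytes)
    if args.prompt_cache_bytes is not None
    else 1 << 63
)
cache = LRUPromptCache(max_size=args.prompt_cache_size, max_bytes=max_bytes)
```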
Tests:
- 42 existing tests pass (`tests/test_server.py` + `tests/test_prompt_cache.py`)
- Functional check: `LRUPromptCache(max_size=5,
  max_bytes=_parse_size("100MB"))` instantiates with `max_bytes ==
  100_000_000`; the no-flag path still gets `max_bytes == 1 << 63`
  (sketched as a unit test below).
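As a unit test, that functional check could look like this. A sketch reusing the stand-in `LRUPromptCache` and `_parse_size` from the examples above, not the real `tests/test_server.py`:

```python
import unittest

class TestPromptCacheBytesFlag(unittest.TestCase):
    def test_flag_is_forwarded(self):
        # --prompt-cache-bytes 100MB should bound the cache at 100 MB.
        cache = LRUPromptCache(max_size=5, max_bytes=_parse_size("100MB"))
        self.assertEqual(cache.max_bytes, 100_000_000)

    def test_no_flag_keeps_unbounded_default(self):
        # Omitting the flag must preserve the historical unbounded default.
        cache = LRUPromptCache(max_size=5)
        self.assertEqual(cache.max_bytes, 1 << 63)

if __name__ == "__main__":
    unittest.main()
```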