fix(server): wire --prompt-cache-bytes CLI flag to LRUPromptCache #1267

Open · andreinknv wants to merge 1 commit into ml-explore:main from andreinknv:fix/wire-prompt-cache-bytes-cli

Conversation

@andreinknv

Summary

The `--prompt-cache-bytes` CLI flag has been declared on `mlx_lm.server`'s argparser since it landed, but it is never forwarded to the `LRUPromptCache` constructor; `server.py:1743` passes only `prompt_cache_size`. As a result, even operators who explicitly set `--prompt-cache-bytes 1GB` get an unbounded cache (`LRUPromptCache`'s default `max_bytes = 1 << 63`), and a long-running server can accumulate enough KV cache to exhaust GPU memory mid-inference.

Reproduction

A 4-server pool runs `mlx_lm.server --model mlx-community/granite-4.0-1b-4bit` on an M4 Max with 36 GB. After ~430 classify-style chat-completion requests, one pool member's prompt cache reaches 10 sequences / 0.51 GB, and a new 3,731-token prompt fails Metal command-buffer execution with an out-of-memory error. Server log immediately before the crash:

Prompt Cache: 10 sequences, 0.51 GB
POST /v1/chat/completions HTTP/1.1 200 -
libc++abi: terminating due to uncaught exception of type std::runtime_error:
[METAL] Command buffer execution failed: Insufficient Memory
(kIOGPUCommandBufferCallbackErrorOutOfMemory)

The terminate → abort path itself is in mlx core (ml-explore/mlx#3317 / ml-explore/mlx#3519). This PR addresses the trigger: the unbounded prompt cache that the operator cannot actually bound via the documented CLI flag.
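To illustrate why bounding the cache by bytes matters, here is a minimal sketch of a byte-bounded LRU cache. This is not the mlx_lm `LRUPromptCache` implementation; the class name, fields, and eviction policy below are illustrative assumptions that mirror the `max_size` / `max_bytes` semantics described in this PR.

```python
from collections import OrderedDict

class BoundedLRU:
    """Illustrative sketch (not the mlx_lm implementation): an LRU cache
    that evicts least-recently-used entries once either the entry-count
    bound or the byte budget is exceeded."""

    def __init__(self, max_size=10, max_bytes=1 << 63):
        self.max_size = max_size
        self.max_bytes = max_bytes
        self.total_bytes = 0
        self._entries = OrderedDict()  # key -> (value, nbytes)

    def put(self, key, value, nbytes):
        if key in self._entries:
            _, old_bytes = self._entries.pop(key)
            self.total_bytes -= old_bytes
        self._entries[key] = (value, nbytes)
        self.total_bytes += nbytes
        # Evict oldest entries until both bounds hold. With the default
        # max_bytes = 1 << 63 (the no-flag case), only max_size ever binds,
        # which is exactly the unbounded-memory behavior this PR fixes.
        while len(self._entries) > self.max_size or self.total_bytes > self.max_bytes:
            _, (_, evicted_bytes) = self._entries.popitem(last=False)
            self.total_bytes -= evicted_bytes

cache = BoundedLRU(max_size=10, max_bytes=100)
cache.put("a", "kv-a", 60)
cache.put("b", "kv-b", 60)  # exceeds the 100-byte budget: "a" is evicted
```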

Fix

Forward --prompt-cache-bytes through cli_args into the LRUPromptCache constructor. Existing behavior is preserved when the flag is omitted (LRUPromptCache's max_bytes = 1 << 63 default still applies).

cli_args = model_provider.cli_args
cache_kwargs = {"max_size": cli_args.prompt_cache_size}
# Only forward max_bytes when the operator actually set --prompt-cache-bytes,
# so the constructor's max_bytes = 1 << 63 default still applies otherwise.
if getattr(cli_args, "prompt_cache_bytes", None) is not None:
    cache_kwargs["max_bytes"] = cli_args.prompt_cache_bytes
prompt_cache = LRUPromptCache(**cache_kwargs)

Total diff: +13 / −1 LOC, one file.
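The conditional-kwargs pattern above can be sanity-checked in isolation with an `argparse.Namespace` standing in for `cli_args`. The helper name `build_cache_kwargs` is invented for this sketch; the point is only that the omitted-flag path produces no `max_bytes` key at all, leaving the constructor default intact.

```python
from argparse import Namespace

def build_cache_kwargs(cli_args):
    # Same forwarding pattern as the fix: pass max_bytes only when
    # --prompt-cache-bytes was set on the command line.
    kwargs = {"max_size": cli_args.prompt_cache_size}
    if getattr(cli_args, "prompt_cache_bytes", None) is not None:
        kwargs["max_bytes"] = cli_args.prompt_cache_bytes
    return kwargs

# Flag omitted: no max_bytes key, so the constructor default applies.
print(build_cache_kwargs(Namespace(prompt_cache_size=10)))
# Flag set: max_bytes is forwarded to the constructor.
print(build_cache_kwargs(Namespace(prompt_cache_size=10, prompt_cache_bytes=10**9)))
```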

Tests

  • ✅ 42 tests pass (tests/test_server.py + tests/test_prompt_cache.py)
  • ✅ Functional: LRUPromptCache(max_size=5, max_bytes=_parse_size("100MB")) correctly enforces max_bytes == 100_000_000
  • ✅ No-flag path: max_bytes stays 1 << 63 (back-compat preserved)
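The test above relies on a size-parsing helper, `_parse_size`, whose implementation is not shown in this PR. For reference, a hypothetical helper with the decimal semantics the test asserts (`"100MB"` → `100_000_000`) might look like the following; the name `parse_size` and the accepted unit set are assumptions, not the server's actual code.

```python
import re

# Decimal (SI) units, matching the 100MB == 100_000_000 behavior asserted above.
_UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}

def parse_size(text):
    """Parse a human-readable size like "100MB" or "1GB" into a byte count.
    A bare number is taken as bytes."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([KMGT]?B)?", text.strip().upper())
    if m is None:
        raise ValueError(f"unrecognized size: {text!r}")
    value, unit = m.groups()
    return int(float(value) * _UNITS[unit or "B"])
```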
