fix(server): wire --prompt-cache-bytes CLI flag to LRUPromptCache #1267
Open
andreinknv wants to merge 1 commit into
Conversation
The `--prompt-cache-bytes` CLI flag has been declared on
`mlx_lm.server`'s argparser since it landed, but it was never forwarded
to the `LRUPromptCache` constructor: `server.py:1743` only passes
`prompt_cache_size`. As a result, even operators who explicitly set
`--prompt-cache-bytes 1GB` saw an unbounded cache (default
`max_bytes = 1 << 63`), and a long-running server could accumulate
enough KV cache to exhaust GPU memory mid-inference, producing
`kIOGPUCommandBufferCallbackErrorOutOfMemory` from Metal on the next
chat completion.
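To make the gap concrete, here is a minimal sketch of the pattern. The option names are taken from this PR; the class below is a stand-in with the defaults described here, not the real `server.py`:

```python
import argparse

class LRUPromptCache:  # stand-in with the defaults described in this PR
    def __init__(self, max_size, max_bytes=1 << 63):
        self.max_size = max_size
        self.max_bytes = max_bytes

parser = argparse.ArgumentParser()
parser.add_argument("--prompt-cache-size", type=int, default=10)  # defaults illustrative
parser.add_argument("--prompt-cache-bytes", type=str, default=None)
args = parser.parse_args(["--prompt-cache-bytes", "1GB"])

# The bug: only the sequence-count limit is forwarded, so the byte
# limit the operator asked for is silently dropped.
cache = LRUPromptCache(max_size=args.prompt_cache_size)
print(cache.max_bytes == 1 << 63)  # True: effectively unbounded despite the flag
```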
Reproduction: a 4-server pool of `mlx_lm.server --model
mlx-community/granite-4.0-1b-4bit` on an M4 Max 36 GB. After ~430
classify-style chat-completion requests, one pool member's prompt
cache reaches 10 sequences / 0.51 GB, and a new 3,731-token prompt
trips Metal's command-buffer status. Server log immediately before
the crash:
```
Prompt Cache: 10 sequences, 0.51 GB
POST /v1/chat/completions HTTP/1.1 200 -
libc++abi: terminating due to uncaught exception of type
std::runtime_error: [METAL] Command buffer execution failed:
Insufficient Memory (kIOGPUCommandBufferCallbackErrorOutOfMemory)
```
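For reference, the request pattern is easy to approximate with a plain loop. A sketch: the model name matches the setup above, while the port, payload shape, and prompt text are illustrative assumptions:

```python
import json
import urllib.request

URL = "http://localhost:8080/v1/chat/completions"  # one pool member; port assumed

payload = {
    "model": "mlx-community/granite-4.0-1b-4bit",
    "messages": [{"role": "user", "content": "Classify the following text: ..."}],
    "max_tokens": 8,
}

# The crash above was observed around request ~430 as the prompt cache grew.
for _ in range(500):
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).read()
```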
(The terminate → abort path itself is mlx-core
ml-explore/mlx#3317 / ml-explore/mlx#3519; this PR addresses the
*trigger*: an unbounded prompt cache that the operator cannot actually
bound via the documented CLI flag.)
Fix: forward `--prompt-cache-bytes` through `cli_args` into the
`LRUPromptCache` constructor (total diff: +13 / −1 LOC, one file).
When the flag is omitted, behavior is unchanged: `LRUPromptCache`'s
existing `max_bytes = 1 << 63` default preserves back-compat.
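In sketch form, continuing the stand-ins from the first example above (`_parse_size` is the helper named in this PR; the body below is an illustrative stand-in, not the real implementation):

```python
# Illustrative stand-in for the size parser named in this PR.
def _parse_size(s: str) -> int:
    units = {"KB": 10**3, "MB": 10**6, "GB": 10**9}
    for suffix, mult in units.items():
        if s.upper().endswith(suffix):
            return int(float(s[: -len(suffix)]) * mult)
    return int(s)

# The fix: forward the byte limit when the flag is given, otherwise keep
# the unbounded default so no-flag behavior is unchanged.
max_bytes = (
    _parse_size(args.prompt_cache_bytes)
    if args.prompt_cache_bytes is not None
    else 1 << 63
)
cache = LRUPromptCache(max_size=args.prompt_cache_size, max_bytes=max_bytes)
```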
Tests:
- 42 existing tests pass (`tests/test_server.py` + `tests/test_prompt_cache.py`)
- Functional check: `LRUPromptCache(max_size=5,
  max_bytes=_parse_size("100MB"))` instantiates with `max_bytes ==
  100_000_000`; the no-flag path still gets `max_bytes == 1 << 63`
  (sketched as a unit test below).
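As a unit test, that functional check could look like this. A sketch reusing the stand-in `LRUPromptCache` and `_parse_size` from the examples above, not the real `tests/test_server.py`:

```python
import unittest

class TestPromptCacheBytesFlag(unittest.TestCase):
    def test_flag_is_forwarded(self):
        # --prompt-cache-bytes 100MB should bound the cache at 100 MB.
        cache = LRUPromptCache(max_size=5, max_bytes=_parse_size("100MB"))
        self.assertEqual(cache.max_bytes, 100_000_000)

    def test_no_flag_keeps_unbounded_default(self):
        # Omitting the flag must preserve the historical unbounded default.
        cache = LRUPromptCache(max_size=5)
        self.assertEqual(cache.max_bytes, 1 << 63)

if __name__ == "__main__":
    unittest.main()
```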