fix(metal): defer command-buffer error reporting + poison-on-failure by andreinknv · Pull Request #3519 · ml-explore/mlx

andreinknv · 2026-05-11T03:02:30Z

Summary

C++ exceptions thrown from inside Metal command-buffer completion handlers (the three addCompletedHandler callbacks in mlx/backend/metal/eval.cpp) hit std::terminate → abort() because those callbacks run on Metal-managed dispatch threads where C++ exceptions cannot be caught. Production users have hit this in #3317 (M2 Ultra, Qwen3.5-122B, ~3.5h sustained inference) and again recently on M4 Max running mlx-lm 0.31.3 4-server pool with Granite-1b — kIOGPUCommandBufferCallbackErrorOutOfMemory from prompt-cache buildup → SIGABRT, killing one pool member mid-run.

Crash signature (matches both reports):

libc++abi: terminating due to uncaught exception of type std::runtime_error:
  [METAL] Command buffer execution failed: ...
__cxa_throw → _objc_terminate → abort
mlx::core::gpu::check_error(MTL::CommandBuffer*)
[completion handler block]
-[_MTLCommandBuffer didCompleteWithStartTime:endTime:error:]

Background — why prior PR #3318 was rejected

#3318 proposed deferring the error to the next user-thread eval(). It got production validation (5 days unattended on Qwen3.5-122B by the author, overnight stress on M3 Ultra by hnshah) but was rejected by maintainers:

zcbenz on check_error() throws inside Metal completion handler, hits std::terminate #3317: "We can not simply transfer the error: MLX code is not exception-safe so when the exception is thrown the process would be in a stale state that will likely fail later in a much weirder way."
awni on when it crashed in the background, can not catch the exception #2670: "It's not likely we are going to add this feature in the near future. The reason is that we don't have guarantees on the state being in a reasonable condition if there is an exception during eval."

The author of #3318 closed it themselves later because they "narrowed active upstream work to vllm-mlx and can't maintain it." The bug remains open on main.

How this PR is different

This PR addresses the state-safety objection head-on by combining the deferred-throw with explicit stream poisoning:

The async completion handler never throws. It captures the error message into a per-StreamThread slot and sets poisoned = true.
The next user-thread eval() / finalize() / synchronize() entry on that stream calls throw_if_captured():
- First call after error: re-throws the original std::runtime_error on a thread the language runtime can handle.
- Subsequent calls: refuses with a clear message — "[METAL] Stream is in error state from a prior failure. Call mx.clear_streams() (or destroy this Stream) before queuing more work."
mx.clear_streams() now also calls scheduler::reset_all_errors() so the user has a documented path back to a working state.

This guarantees no further operations execute against a poisoned stream. Once an async error is captured, NOTHING runs on that stream until the caller explicitly resets — there's no opportunity to operate on partially-initialized encoder state, which was the maintainer's central concern.

Per-stream rather than global: concurrent streams (e.g. mlx-lm's 4-server pool, batched inference) won't cross-pollute errors. A failure on stream 2 only poisons stream 2, not 0/1/3.

Changes

mlx/scheduler.h: add captured_error + poisoned slots and capture_error / take_error / reset_error methods to StreamThread. Expose Scheduler::capture_error / take_error / reset_error / reset_all_errors (with threads_mtx_ shared lock). Add free-function aliases.
mlx/backend/metal/eval.cpp: replace check_error (still defined for the synchronous CommandEncoder::synchronize path) with capture_async_error for the three addCompletedHandler callsites. Add throw_if_captured at eval() / finalize() / synchronize() entry. clear_streams now resets all per-stream error slots.

Total diff: +148 / −6 LOC across 2 files.

Test results

✅ 709 mlx tests pass, 26 skipped, zero new failures (python -m unittest discover in python/tests/)
✅ Stress test: 8 threads × 1000 iter × (512×512) matmul on per-thread streams → 0.7 s, no errors
✅ Smoke: addition / matmul / pre-flight allocation check / post-error recovery all behave correctly
✅ Functional: mx.clear_streams() continues to work, ops resume after reset

The deferred error mechanism cannot be unit-tested without injecting a real Metal command-buffer failure (the pre-flight metal::malloc size check intercepts most synthetic OOM repros). The crash dumps in #3317 and the local Granite-1b reproduction validate the underlying fix correctness.

Closes / references

cc @zcbenz @awni @davidkoski — would appreciate your review on whether stream poisoning addresses the state-safety concerns from your earlier comments.

C++ exceptions thrown from inside Metal completion handlers (the three addCompletedHandler callbacks in eval.cpp) hit std::terminate -> abort() because the handlers run on Metal-managed dispatch threads where C++ exceptions cannot be caught. Production users have hit this in ml-explore#3317 (M2 Ultra, Qwen3.5-122B, ~3.5 h sustained inference) and again locally (M4 Max, Granite-1b 4-pool, mlx-lm 0.31.3 prompt-cache buildup, kIOGPUCommandBufferCallbackErrorOutOfMemory). Prior PR ml-explore#3318 proposed deferring the error to the next user-thread eval(). It was rejected because mlx core is not exception-safe — re- throwing later could leave the encoder in a stale state and "fail later in a much weirder way" (zcbenz). awni reiterated in ml-explore#2670 that mlx wouldn't add the feature without state guarantees. This PR addresses that concern by combining the deferred-throw with explicit STREAM POISONING: - The async completion handler captures the error message into a per-StreamThread slot and SETS poisoned=true. It never throws. - The next user-thread eval()/finalize()/synchronize() entry on that stream calls throw_if_captured(): re-throws the original error on first call, then refuses all subsequent work with a clear "stream is poisoned, mx.clear_streams() to reset" message. - mx.clear_streams() now also calls scheduler::reset_all_errors() so the user has a documented path back to a working state. This guarantees no further operations execute against a stream that just had a Metal failure. The state-safety concern is preserved because once a stream is in error, NOTHING runs on it until the caller explicitly resets — there's no opportunity to operate on half-initialized encoder state. Tests: - 709 mlx tests pass, 26 skipped (the suite already had skips), zero new failures - Custom stress: 8 threads × 1000 iter × (512x512) matmul on per- thread streams completes in 0.7 s with no errors (post-patch), same throughput as baseline build - Smoke: addition / matmul / pre-flight alloc check / post-error recovery all behave correctly Per-stream rather than global: concurrent streams (eg. mlx-lm's 4-server pool, batched inference) won't cross-pollute errors. A crash on stream 2 only poisons stream 2, not 0/1/3. Closes ml-explore#3317. References ml-explore#2670, ml-explore#3318.

zcbenz

This PR is much better than previous tryings, but the error handling is over complicated, and the timing of error throwing is bad. I'm closing this in favor of #3523.

andreinknv mentioned this pull request May 11, 2026

fix(server): wire --prompt-cache-bytes CLI flag to LRUPromptCache ml-explore/mlx-lm#1267

Open

zcbenz reviewed May 11, 2026

View reviewed changes

zcbenz closed this May 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(metal): defer command-buffer error reporting + poison-on-failure#3519

fix(metal): defer command-buffer error reporting + poison-on-failure#3519
andreinknv wants to merge 1 commit into
ml-explore:mainfrom
andreinknv:fix/metal-completion-handler-no-crash

andreinknv commented May 11, 2026

Uh oh!

zcbenz left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

andreinknv commented May 11, 2026

Summary

Background — why prior PR #3318 was rejected

How this PR is different

Changes

Test results

Closes / references

Uh oh!

zcbenz left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants