Skip to content

fix(metal): defer command-buffer error reporting + poison-on-failure#3519

Closed
andreinknv wants to merge 1 commit into
ml-explore:mainfrom
andreinknv:fix/metal-completion-handler-no-crash
Closed

fix(metal): defer command-buffer error reporting + poison-on-failure#3519
andreinknv wants to merge 1 commit into
ml-explore:mainfrom
andreinknv:fix/metal-completion-handler-no-crash

Conversation

@andreinknv
Copy link
Copy Markdown

Summary

C++ exceptions thrown from inside Metal command-buffer completion handlers (the three addCompletedHandler callbacks in mlx/backend/metal/eval.cpp) hit std::terminateabort() because those callbacks run on Metal-managed dispatch threads where C++ exceptions cannot be caught. Production users have hit this in #3317 (M2 Ultra, Qwen3.5-122B, ~3.5h sustained inference) and again recently on M4 Max running mlx-lm 0.31.3 4-server pool with Granite-1b — kIOGPUCommandBufferCallbackErrorOutOfMemory from prompt-cache buildup → SIGABRT, killing one pool member mid-run.

Crash signature (matches both reports):

libc++abi: terminating due to uncaught exception of type std::runtime_error:
  [METAL] Command buffer execution failed: ...
__cxa_throw → _objc_terminate → abort
mlx::core::gpu::check_error(MTL::CommandBuffer*)
[completion handler block]
-[_MTLCommandBuffer didCompleteWithStartTime:endTime:error:]

Background — why prior PR #3318 was rejected

#3318 proposed deferring the error to the next user-thread eval(). It got production validation (5 days unattended on Qwen3.5-122B by the author, overnight stress on M3 Ultra by hnshah) but was rejected by maintainers:

The author of #3318 closed it themselves later because they "narrowed active upstream work to vllm-mlx and can't maintain it." The bug remains open on main.

How this PR is different

This PR addresses the state-safety objection head-on by combining the deferred-throw with explicit stream poisoning:

  1. The async completion handler never throws. It captures the error message into a per-StreamThread slot and sets poisoned = true.
  2. The next user-thread eval() / finalize() / synchronize() entry on that stream calls throw_if_captured():
    • First call after error: re-throws the original std::runtime_error on a thread the language runtime can handle.
    • Subsequent calls: refuses with a clear message — "[METAL] Stream is in error state from a prior failure. Call mx.clear_streams() (or destroy this Stream) before queuing more work."
  3. mx.clear_streams() now also calls scheduler::reset_all_errors() so the user has a documented path back to a working state.

This guarantees no further operations execute against a poisoned stream. Once an async error is captured, NOTHING runs on that stream until the caller explicitly resets — there's no opportunity to operate on partially-initialized encoder state, which was the maintainer's central concern.

Per-stream rather than global: concurrent streams (e.g. mlx-lm's 4-server pool, batched inference) won't cross-pollute errors. A failure on stream 2 only poisons stream 2, not 0/1/3.

Changes

  • mlx/scheduler.h: add captured_error + poisoned slots and capture_error / take_error / reset_error methods to StreamThread. Expose Scheduler::capture_error / take_error / reset_error / reset_all_errors (with threads_mtx_ shared lock). Add free-function aliases.
  • mlx/backend/metal/eval.cpp: replace check_error (still defined for the synchronous CommandEncoder::synchronize path) with capture_async_error for the three addCompletedHandler callsites. Add throw_if_captured at eval() / finalize() / synchronize() entry. clear_streams now resets all per-stream error slots.

Total diff: +148 / −6 LOC across 2 files.

Test results

  • 709 mlx tests pass, 26 skipped, zero new failures (python -m unittest discover in python/tests/)
  • Stress test: 8 threads × 1000 iter × (512×512) matmul on per-thread streams → 0.7 s, no errors
  • Smoke: addition / matmul / pre-flight allocation check / post-error recovery all behave correctly
  • Functional: mx.clear_streams() continues to work, ops resume after reset

The deferred error mechanism cannot be unit-tested without injecting a real Metal command-buffer failure (the pre-flight metal::malloc size check intercepts most synthetic OOM repros). The crash dumps in #3317 and the local Granite-1b reproduction validate the underlying fix correctness.

Closes / references

cc @zcbenz @awni @davidkoski — would appreciate your review on whether stream poisoning addresses the state-safety concerns from your earlier comments.

C++ exceptions thrown from inside Metal completion handlers (the three
addCompletedHandler callbacks in eval.cpp) hit std::terminate ->
abort() because the handlers run on Metal-managed dispatch threads
where C++ exceptions cannot be caught. Production users have hit this
in ml-explore#3317 (M2 Ultra, Qwen3.5-122B, ~3.5 h sustained inference) and
again locally (M4 Max, Granite-1b 4-pool, mlx-lm 0.31.3 prompt-cache
buildup, kIOGPUCommandBufferCallbackErrorOutOfMemory).

Prior PR ml-explore#3318 proposed deferring the error to the next user-thread
eval(). It was rejected because mlx core is not exception-safe — re-
throwing later could leave the encoder in a stale state and "fail
later in a much weirder way" (zcbenz). awni reiterated in ml-explore#2670 that
mlx wouldn't add the feature without state guarantees.

This PR addresses that concern by combining the deferred-throw with
explicit STREAM POISONING:

  - The async completion handler captures the error message into a
    per-StreamThread slot and SETS poisoned=true. It never throws.
  - The next user-thread eval()/finalize()/synchronize() entry on
    that stream calls throw_if_captured(): re-throws the original
    error on first call, then refuses all subsequent work with a
    clear "stream is poisoned, mx.clear_streams() to reset" message.
  - mx.clear_streams() now also calls scheduler::reset_all_errors()
    so the user has a documented path back to a working state.

This guarantees no further operations execute against a stream that
just had a Metal failure. The state-safety concern is preserved
because once a stream is in error, NOTHING runs on it until the
caller explicitly resets — there's no opportunity to operate on
half-initialized encoder state.

Tests:
  - 709 mlx tests pass, 26 skipped (the suite already had skips),
    zero new failures
  - Custom stress: 8 threads × 1000 iter × (512x512) matmul on per-
    thread streams completes in 0.7 s with no errors (post-patch),
    same throughput as baseline build
  - Smoke: addition / matmul / pre-flight alloc check / post-error
    recovery all behave correctly

Per-stream rather than global: concurrent streams (eg. mlx-lm's
4-server pool, batched inference) won't cross-pollute errors. A
crash on stream 2 only poisons stream 2, not 0/1/3.

Closes ml-explore#3317. References ml-explore#2670, ml-explore#3318.
Copy link
Copy Markdown
Collaborator

@zcbenz zcbenz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is much better than previous tryings, but the error handling is over complicated, and the timing of error throwing is bad. I'm closing this in favor of #3523.

@zcbenz zcbenz closed this May 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

check_error() throws inside Metal completion handler, hits std::terminate

2 participants