fix(metal): defer command-buffer error reporting + poison-on-failure#3519
Closed
andreinknv wants to merge 1 commit into
Closed
fix(metal): defer command-buffer error reporting + poison-on-failure#3519andreinknv wants to merge 1 commit into
andreinknv wants to merge 1 commit into
Conversation
C++ exceptions thrown from inside Metal completion handlers (the three addCompletedHandler callbacks in eval.cpp) hit std::terminate -> abort() because the handlers run on Metal-managed dispatch threads where C++ exceptions cannot be caught. Production users have hit this in ml-explore#3317 (M2 Ultra, Qwen3.5-122B, ~3.5 h sustained inference) and again locally (M4 Max, Granite-1b 4-pool, mlx-lm 0.31.3 prompt-cache buildup, kIOGPUCommandBufferCallbackErrorOutOfMemory). Prior PR ml-explore#3318 proposed deferring the error to the next user-thread eval(). It was rejected because mlx core is not exception-safe — re- throwing later could leave the encoder in a stale state and "fail later in a much weirder way" (zcbenz). awni reiterated in ml-explore#2670 that mlx wouldn't add the feature without state guarantees. This PR addresses that concern by combining the deferred-throw with explicit STREAM POISONING: - The async completion handler captures the error message into a per-StreamThread slot and SETS poisoned=true. It never throws. - The next user-thread eval()/finalize()/synchronize() entry on that stream calls throw_if_captured(): re-throws the original error on first call, then refuses all subsequent work with a clear "stream is poisoned, mx.clear_streams() to reset" message. - mx.clear_streams() now also calls scheduler::reset_all_errors() so the user has a documented path back to a working state. This guarantees no further operations execute against a stream that just had a Metal failure. The state-safety concern is preserved because once a stream is in error, NOTHING runs on it until the caller explicitly resets — there's no opportunity to operate on half-initialized encoder state. Tests: - 709 mlx tests pass, 26 skipped (the suite already had skips), zero new failures - Custom stress: 8 threads × 1000 iter × (512x512) matmul on per- thread streams completes in 0.7 s with no errors (post-patch), same throughput as baseline build - Smoke: addition / matmul / pre-flight alloc check / post-error recovery all behave correctly Per-stream rather than global: concurrent streams (eg. mlx-lm's 4-server pool, batched inference) won't cross-pollute errors. A crash on stream 2 only poisons stream 2, not 0/1/3. Closes ml-explore#3317. References ml-explore#2670, ml-explore#3318.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
C++ exceptions thrown from inside Metal command-buffer completion handlers (the three
addCompletedHandlercallbacks inmlx/backend/metal/eval.cpp) hitstd::terminate→abort()because those callbacks run on Metal-managed dispatch threads where C++ exceptions cannot be caught. Production users have hit this in #3317 (M2 Ultra, Qwen3.5-122B, ~3.5h sustained inference) and again recently on M4 Max runningmlx-lm 0.31.34-server pool with Granite-1b —kIOGPUCommandBufferCallbackErrorOutOfMemoryfrom prompt-cache buildup → SIGABRT, killing one pool member mid-run.Crash signature (matches both reports):
Background — why prior PR #3318 was rejected
#3318 proposed deferring the error to the next user-thread
eval(). It got production validation (5 days unattended on Qwen3.5-122B by the author, overnight stress on M3 Ultra by hnshah) but was rejected by maintainers:The author of #3318 closed it themselves later because they "narrowed active upstream work to vllm-mlx and can't maintain it." The bug remains open on
main.How this PR is different
This PR addresses the state-safety objection head-on by combining the deferred-throw with explicit stream poisoning:
StreamThreadslot and setspoisoned = true.eval()/finalize()/synchronize()entry on that stream callsthrow_if_captured():std::runtime_erroron a thread the language runtime can handle.mx.clear_streams()now also callsscheduler::reset_all_errors()so the user has a documented path back to a working state.This guarantees no further operations execute against a poisoned stream. Once an async error is captured, NOTHING runs on that stream until the caller explicitly resets — there's no opportunity to operate on partially-initialized encoder state, which was the maintainer's central concern.
Per-stream rather than global: concurrent streams (e.g. mlx-lm's 4-server pool, batched inference) won't cross-pollute errors. A failure on stream 2 only poisons stream 2, not 0/1/3.
Changes
mlx/scheduler.h: addcaptured_error+poisonedslots andcapture_error/take_error/reset_errormethods toStreamThread. ExposeScheduler::capture_error/take_error/reset_error/reset_all_errors(withthreads_mtx_shared lock). Add free-function aliases.mlx/backend/metal/eval.cpp: replacecheck_error(still defined for the synchronousCommandEncoder::synchronizepath) withcapture_async_errorfor the threeaddCompletedHandlercallsites. Addthrow_if_capturedateval()/finalize()/synchronize()entry.clear_streamsnow resets all per-stream error slots.Total diff: +148 / −6 LOC across 2 files.
Test results
python -m unittest discoverinpython/tests/)mx.clear_streams()continues to work, ops resume after resetThe deferred error mechanism cannot be unit-tested without injecting a real Metal command-buffer failure (the pre-flight
metal::mallocsize check intercepts most synthetic OOM repros). The crash dumps in #3317 and the local Granite-1b reproduction validate the underlying fix correctness.Closes / references
cc @zcbenz @awni @davidkoski — would appreciate your review on whether stream poisoning addresses the state-safety concerns from your earlier comments.