Summary
Following the documented iOS static-shape decode export (export/ios.py static buckets + KVCacheHandler
fixed-capacity KV state + CoreAIStaticShapeEngine, with the per-step in_step write), the exported decode
core converts fine but SIGTRAPs / SIGSEGVs at the first execute on the WWDC26 betas. The crash is in
the Core AI runtime / MPSGraph backend and reproduces on Mac GPU, iPhone GPU, and iPhone ANE.
I isolated it to one thing: the slice_update that writes the new KV column uses a runtime-tensor begin
index (in_step). The same graph with a shape-symint begin index (the KVCache.update_and_fetch
dynamic path) lowers and runs. Reporting here because this breaks the repo's official on-device LLM recipe;
also filed with Apple Feedback (FB23024751) since the root cause looks like an MPSGraph lowering limit.
Environment
- macOS 27.0 (26A5353q), Apple Silicon (Mac Studio)
- Xcode 27.0 (27A5194q), iOS SDK 27.0; iPhone 17 Pro on iOS 27 beta
- coreai-torch 0.4.0, coreai-core 1.0.0b1, coreai-models @ b1cb71b
Steps to reproduce
A single attention block built from coreai_models.primitives (official KVCache write + composite
SDPA), exported three ways that differ in only the KV-write column index. Full runnable script:
https://gist.github.com/john-rocky/1fd6add76b3d5393ebc44fac52ce6b27. The decisive line:
# write one new KV column at decode position p: cache[:, :, :, p:p+1, :] = k_new (slice_update)
# (a) begin index from a SHAPE symint — the update_and_fetch path
p = position_ids.shape[-1] - query_len # symint
# (b) begin index from a RUNTIME TENSOR — the static / in_step path
p = in_step # int32 scalar input
Export each variant, load on the GPU delegate, run one forward at read-length B = 512 (each in its own
process).
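Because a crash takes the whole interpreter down, each variant has to run in its own process, and the parent only sees a return code. The sketch below is one way to drive that and decode the result. It is not the gist's harness: `describe_exit` and `run_isolated` are hypothetical helper names, and the child commands are placeholders for "run one exported variant". It relies only on the standard conventions that Python's `subprocess` reports death-by-signal as a negative signal number and that shells report it as 128 + signal number, which is why SIGTRAP (signal 5) appears as exit 133 in the table below.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Map a child's return code to 'ok', a signal name, or 'exit N'."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # subprocess convention: killed by signal N -> returncode == -N
        signum = -returncode
    elif returncode > 128:
        # shell convention: killed by signal N -> exit status 128 + N
        # (ambiguous with a child that really calls exit(128 + N))
        signum = returncode - 128
    else:
        return f"exit {returncode}"
    try:
        return signal.Signals(signum).name
    except ValueError:
        return f"exit {returncode}"


def run_isolated(argv):
    """Run one variant in a fresh process so a crash cannot affect the others."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return describe_exit(proc.returncode)


# Placeholder children; a real run would launch the repro script once per variant.
clean = run_isolated([sys.executable, "-c", "pass"])
failed = run_isolated([sys.executable, "-c", "import sys; sys.exit(3)"])
print(clean, "|", failed)
```

With this mapping, a status of 133 from a shell and a `returncode` of -5 from `subprocess` both decode to SIGTRAP, so results collected either way can be compared directly.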
Observed
| begin index | shapes | macOS 27 Mac GPU |
| --- | --- | --- |
| shape symint | dynamic | runs, finite output (exit 0) |
| runtime tensor | dynamic | SIGTRAP, exit 133 |
| runtime tensor | static | SIGTRAP, exit 133 |
- Mac GPU: EXC_BREAKPOINT (SIGTRAP, code 5); faulting-thread top frames are all CoreAIRuntime → _coreai_runtime_os.cpython-311-darwin.so (at execute).
- iPhone GPU: SIGSEGV at the first execute (loads + specializes first).
- iPhone ANE: MPSGraphExecutable.mm → optimizeOriginalModule → "MLIR pass manager failed" (SIGABRT).
Real artifact (not just the minimal block): the stock Gemma-4 E2B iOS static decode core
(set_static_shape_config, in_step write) SIGTRAPs identically on the Mac GPU.
Expected
slice_update with a runtime-tensor begin lowers and executes on MPSGraph (GPU + ANE), exactly as the
shape-symint form does. As-is, the documented fixed-shape / ANE path is unusable on the beta and only the
slower re-specializing dynamic path (recompiles per sequence length) runs.
Workaround (and it localizes the bug)
Drop the Core AI state + indexed write: keep KV as plain model I/O, append the new column with
torch.cat, and have the host write it back between steps — so there is no in-graph indexed write at
all (only cat + masked SDPA over plain inputs). Numerically identical (8/8 top-1 vs Hugging Face), and
it runs on Mac GPU, iPhone GPU (full model), and iPhone ANE (chunked). That a cat-append works while the
indexed slice_update does not points specifically at the data-indexed slice-update lowering.
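To make the equivalence concrete, here is a framework-free sketch of the two step shapes in plain Python lists. It is an illustration under assumptions, not the coreai-models code or the actual workaround graph: `masked_attention`, `step_indexed`, `step_cat`, and `vec` are hypothetical names, the cache is a single head with a tiny capacity, and the inputs are synthetic. `step_indexed` mimics the in-graph write at position `p`; `step_cat` treats the buffers as read-only inputs, concatenates the new column, masks out the unused slots, and returns the column for the host to write back.

```python
import math


def masked_attention(q, keys, values, mask):
    """Single-head scaled dot-product attention over positions where mask is True."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(s for s, m in zip(scores, mask) if m)  # for a stable softmax
    weights = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def step_indexed(k_buf, v_buf, q, k_new, v_new, p):
    """Reference path: write the new column in place at p, attend over 0..p."""
    k_buf[p], v_buf[p] = k_new, v_new
    mask = [i <= p for i in range(len(k_buf))]
    return masked_attention(q, k_buf, v_buf, mask)


def step_cat(k_buf, v_buf, q, k_new, v_new, p):
    """Workaround shape: no indexed write; the new column is concatenated."""
    keys = k_buf + [k_new]      # builds new lists, the inputs are not mutated
    values = v_buf + [v_new]
    mask = [i < p for i in range(len(k_buf))] + [True]
    return masked_attention(q, keys, values, mask), k_new, v_new


def vec(seed, dim=4):
    """Deterministic synthetic vector."""
    return [math.sin(seed * (d + 1)) for d in range(dim)]


capacity = 8
ref_k = [[0.0] * 4 for _ in range(capacity)]
ref_v = [[0.0] * 4 for _ in range(capacity)]
host_k = [[0.0] * 4 for _ in range(capacity)]
host_v = [[0.0] * 4 for _ in range(capacity)]

max_diff = 0.0
for p in range(5):
    q, k_new, v_new = vec(p + 1), vec(p + 11), vec(p + 21)
    ref = step_indexed(ref_k, ref_v, q, k_new, v_new, p)
    out, k_col, v_col = step_cat(host_k, host_v, q, k_new, v_new, p)
    host_k[p], host_v[p] = k_col, v_col  # host-side write-back between steps
    max_diff = max(max_diff, max(abs(a - b) for a, b in zip(ref, out)))

print("max |indexed - cat| over 5 steps:", max_diff)
```

The two paths attend over the same set of columns and differ only in summation order, so any difference is at floating-point rounding level. The cost of the cat form is that the host, not the graph, owns the cache update.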
Notes
- Decisive pair = symint-dyn (runs) vs tensor-dyn (crashes): identical module, identical dynamic Dim, only the begin-index source differs.
- Model-agnostic: every model shares KVCache.update_and_fetch. Confirmed the official gemma3 and qwen3 dynamic (symint) cores run + re-specialize.
- Happy to attach the crash .ips, the full repro script, and the official-model dynamic-runs counterpart.