Official iOS static-shape decode path crashes at runtime on the macOS 27 / iOS 27 beta — MPSGraph can't lower the data-indexed KV-cache slice_update

## Summary

When I follow the documented iOS static-shape decode export (`export/ios.py` static buckets + `KVCacheHandler`
fixed-capacity KV state + `CoreAIStaticShapeEngine`, with the per-step `in_step` write), the exported decode
core **converts fine but SIGTRAPs / SIGSEGVs at the first `execute`** on the WWDC26 betas. The crash is in
the Core AI runtime / MPSGraph backend and reproduces on **Mac GPU, iPhone GPU, and iPhone ANE**.

I isolated it to one thing: the `slice_update` that writes the new KV column uses a **runtime-tensor `begin`
index** (`in_step`). The *same* graph with a **shape-symint** `begin` index (the `KVCache.update_and_fetch`
dynamic path) lowers and runs. Reporting here because this breaks the repo's official on-device LLM recipe;
also filed with Apple Feedback (**FB23024751**) since the root cause looks like an MPSGraph lowering limit.
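
For readers unfamiliar with the distinction, the two index sources can be illustrated in plain eager PyTorch. This is a minimal sketch, not the `coreai_models` implementation: the function names, the 4-D `(batch, heads, capacity, head_dim)` cache layout, and the use of `index_copy_` for the tensor-indexed write are assumptions made for illustration only.

```python
import torch

def write_with_shape_index(cache, k_new, position_ids, query_len):
    # Begin index derived from a tensor *shape*: a plain Python int in eager mode,
    # a symbolic int (symint) when traced with dynamic shapes.
    p = position_ids.shape[-1] - query_len
    out = cache.clone()
    out[:, :, p:p + 1, :] = k_new
    return out

def write_with_tensor_index(cache, k_new, in_step):
    # Begin index taken from tensor *data*: the write position is only known at run time.
    index = in_step.reshape(1).to(torch.int64)
    out = cache.clone()
    out.index_copy_(2, index, k_new)
    return out

# Hypothetical toy sizes: capacity 8, write position 3.
cache = torch.zeros(1, 2, 8, 4)
k_new = torch.randn(1, 2, 1, 4)
position_ids = torch.arange(4).unsqueeze(0)   # shape (1, 4); with query_len = 1, p = 3
in_step = torch.tensor(3, dtype=torch.int32)  # int32 scalar, as in the static path

a = write_with_shape_index(cache, k_new, position_ids, query_len=1)
b = write_with_tensor_index(cache, k_new, in_step)
```

In eager mode both writes produce the same cache, which is the behaviour the report expects from the lowered graph as well; the two forms only diverge once the graph is exported and lowered.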

## Environment

- macOS 27.0 (26A5353q), Apple Silicon (Mac Studio)
- Xcode 27.0 (27A5194q), iOS SDK 27.0; iPhone 17 Pro on iOS 27 beta
- `coreai-torch` 0.4.0, `coreai-core` 1.0.0b1, `coreai-models` @ `b1cb71b`

## Steps to reproduce

A single attention block built from `coreai_models.primitives` (official `KVCache` write + composite
`SDPA`), exported three ways that differ in **only** the KV-write column index. Full runnable script:
`https://gist.github.com/john-rocky/1fd6add76b3d5393ebc44fac52ce6b27`. The decisive line:

```python
# write one new KV column at decode position p:  cache[:, :, :, p:p+1, :] = k_new   (slice_update)
# (a) begin index from a SHAPE symint   — the update_and_fetch path
p = position_ids.shape[-1] - query_len    # symint
# (b) begin index from a RUNTIME TENSOR  — the static / in_step path
p = in_step                               # int32 scalar input
```

Export each variant, load it on the GPU delegate, and run one forward pass at read-length `B = 512`, with
each variant in its own process.
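
Running each variant in its own process matters because a SIGTRAP kills the interpreter. A small helper along these lines can record the outcome per variant; it is a sketch, and `run_isolated` is a hypothetical name, not part of the gist or of any Core AI package. It also shows where an exit status of 133 comes from: on POSIX, Python's `subprocess` reports death by signal N as `-N`, while shells report `128 + N`, and SIGTRAP is signal 5.

```python
import signal
import subprocess
import sys

def run_isolated(cmd):
    """Run cmd in a child process and return (shell_style_exit_code, description)."""
    proc = subprocess.run(cmd, capture_output=True)
    rc = proc.returncode
    if rc < 0:
        # Child was killed by a signal: subprocess gives -N, shells print 128 + N.
        sig = signal.Signals(-rc)
        return 128 + (-rc), sig.name
    return rc, "exited"
```

A call such as `run_isolated([sys.executable, "repro.py", "tensor-static"])` would then run one variant at a time; the script name and argument here are placeholders for whatever the repro script actually accepts.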

## Observed

| `begin` index | shapes | macOS 27 Mac GPU |
|---|---|---|
| shape symint | dynamic | runs, finite output (exit 0) |
| runtime tensor | dynamic | **SIGTRAP, exit 133** |
| runtime tensor | static | **SIGTRAP, exit 133** |

- **Mac GPU**: `EXC_BREAKPOINT` (SIGTRAP, code 5); faulting-thread top frames are all `CoreAIRuntime` →
  `_coreai_runtime_os.cpython-311-darwin.so` (at execute).
- **iPhone GPU**: SIGSEGV at the first execute (loads + specializes first).
- **iPhone ANE**: `MPSGraphExecutable.mm` → `optimizeOriginalModule` → "MLIR pass manager failed" (SIGABRT).

Real artifact (not just the minimal block): the stock **Gemma-4 E2B** iOS static decode core
(`set_static_shape_config`, `in_step` write) SIGTRAPs identically on the Mac GPU.

## Expected

`slice_update` with a runtime-tensor `begin` lowers and executes on MPSGraph (GPU + ANE), exactly as the
shape-symint form does. As-is, the documented fixed-shape / ANE path is unusable on the beta and only the
slower re-specializing dynamic path (recompiles per sequence length) runs.

## Workaround (and it localizes the bug)

Drop the Core AI state + indexed write: keep KV as plain model **I/O**, append the new column with
`torch.cat`, and have the **host** write it back between steps, so there is no in-graph indexed write at
all (only `cat` + masked SDPA over plain inputs). Top-1 output matches the reference (**8/8 top-1 vs
Hugging Face**), and it runs on Mac GPU, iPhone GPU (full model), and iPhone ANE (chunked). That a
`cat`-append works while the indexed `slice_update` does not points specifically at the **data-indexed
slice-update lowering**.
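
The shape of this workaround can be sketched in plain PyTorch. This is an illustration under assumptions, not the code from the repro: `decode_step`, `host_write_back`, and `make_mask` are hypothetical names, the cache uses a 4-D `(batch, heads, capacity, head_dim)` layout, and the boolean mask follows the `scaled_dot_product_attention` convention where `True` means the slot takes part in attention.

```python
import torch
import torch.nn.functional as F

def decode_step(q, k_new, v_new, k_cache, v_cache, mask):
    # In-graph part: no indexed write, only cat + masked SDPA over plain inputs.
    k = torch.cat([k_cache, k_new], dim=2)   # (B, H, C + 1, D)
    v = torch.cat([v_cache, v_new], dim=2)
    out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
    return out, k_new, v_new                 # new column is returned to the host

def host_write_back(k_cache, v_cache, k_new, v_new, p):
    # Host-side part: the indexed write happens outside the exported graph.
    k_cache[:, :, p:p + 1, :] = k_new
    v_cache[:, :, p:p + 1, :] = v_new

def make_mask(p, capacity):
    # Attend to the p cache slots already filled plus the appended column.
    mask = torch.zeros(1, 1, 1, capacity + 1, dtype=torch.bool)
    mask[..., :p] = True
    mask[..., capacity] = True
    return mask

torch.manual_seed(0)
B, H, C, D = 1, 2, 8, 4                      # toy sizes for illustration
k_cache = torch.zeros(B, H, C, D)
v_cache = torch.zeros(B, H, C, D)
history = []
for p in range(3):
    q, k_new, v_new = (torch.randn(B, H, 1, D) for _ in range(3))
    out, k_col, v_col = decode_step(q, k_new, v_new, k_cache, v_cache, make_mask(p, C))
    host_write_back(k_cache, v_cache, k_col, v_col, p)
    history.append((q, out))
```

Because attention does not depend on key order when no positional bias is applied inside the block, attending over "filled slots plus appended column" gives the same result as attending over a cache that already contains the new column at position `p`.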

## Notes

- Decisive pair = `symint-dyn` (runs) vs `tensor-dyn` (crashes): identical module, identical dynamic `Dim`,
  only the begin-index *source* differs.
- Model-agnostic — every model shares `KVCache.update_and_fetch`. Confirmed the official `gemma3` and
  `qwen3` dynamic (symint) cores run + re-specialize.
- Happy to attach the crash `.ips`, the full repro script, and the official-model dynamic-runs counterpart.

