ANE Training — Backpropagation on Apple Neural Engine

Training neural networks directly on Apple's Neural Engine (ANE) via reverse-engineered private APIs. No CoreML training APIs, no Metal, no GPU: the forward and backward kernels run on the ANE, with the CPU handling the remaining steps listed under Architecture.

What This Is

A from-scratch implementation of transformer training (forward + backward pass) running on the ANE in Apple Silicon. The ANE is a 15.8 TFLOPS (M4) inference accelerator that Apple does not expose for training. This project reverse-engineers the _ANEClient / _ANECompiler private APIs and the MIL (Model Intermediate Language) format to run custom compute graphs — including backpropagation — directly on ANE hardware.

Benchmark Modes (Important)

This repo now has two benchmark regimes:

Mainline (main): 12-layer Stories110M training path in training/train_large.m (details in training/README.md)
Initial release (f213c8d): single-layer benchmark path used for the original 9.3 ms/step claim

Use ./scripts/benchmark_compare.sh to run both modes on the same machine and print a side-by-side table.

Representative results on Apple M4 (same machine, 2026-03-02):

| Regime | Ref | Workload | ms/step | ANE TFLOPS | ANE util |
|---|---|---|---|---|---|
| Mainline Stories110M | `main` | 12 layers, seq=256 | 107.6 | 0.86 | 5.5% |
| Initial single-layer | `f213c8d` | 1 layer, seq=512 | 9.5 | 1.73 | 11.0% |
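
The utilization column appears to be achieved throughput divided by the 15.8 TFLOPS peak quoted above. A minimal sketch of that arithmetic, assuming this definition (the function names are illustrative and do not come from the repository):

```c
#include <assert.h>
#include <stdio.h>

/* Peak fp16 throughput quoted in this README for the M4 ANE. */
#define ANE_PEAK_TFLOPS 15.8

/* Achieved TFLOPS from the floating-point operations in one training
 * step and the measured step time in milliseconds. */
double achieved_tflops(double flops_per_step, double ms_per_step) {
    assert(ms_per_step > 0.0);
    return flops_per_step / (ms_per_step * 1e-3) / 1e12;
}

/* Utilization as a percentage of the quoted peak (assumed definition). */
double ane_util_percent(double tflops) {
    return 100.0 * tflops / ANE_PEAK_TFLOPS;
}

/* Print one row in the style of the results table. */
void print_util_row(const char *name, double tflops) {
    printf("%-24s %.2f TFLOPS  %.1f%% util\n",
           name, tflops, ane_util_percent(tflops));
}
```

Fed the rounded TFLOPS values from the table, this gives roughly 5.4% and 10.9%, which is within rounding of the reported 5.5% and 11.0%.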

Architecture

The initial single-layer training loop uses 6 ANE kernels per step:

| Kernel | Function | Weights |
|---|---|---|
| `kFwdAttn` | RMSNorm + QKV projection + SDPA + output projection | Wq, Wk, Wv, Wo, rms1, mask |
| `kFwdFFN` | RMSNorm + SwiGLU FFN (W1, W3, SiLU, W2) | W1, W2, W3, rms2 |
| `kFFNBwd` | FFN backward (W2^T + SiLU_bwd + W1^T + W3^T) | W2^T, W1^T, W3^T |
| `kSdpaBwd1` | Wo^T + SDPA backward part 1 (dV, probs, dp) | Wo^T, mask |
| `kSdpaBwd2` | SDPA backward part 2 (softmax grad, dQ, dK) | (none) |
| `kQKVb` | QKV backward (Wq^T + Wk^T + Wv^T → dx) | Wq^T, Wk^T, Wv^T |

CPU handles: RMSNorm backward, residual connections, loss computation, dW gradient accumulation (cblas_sgemm), Adam optimizer updates.
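
The CPU-side weight gradient for a linear layer is the product of the upstream gradient with the transposed layer input. The sketch below shows that accumulation in the channel-first layout, with a plain loop standing in for the cblas_sgemm call. The function name and argument order are assumptions for illustration, not the repository's code.

```c
#include <assert.h>
#include <stddef.h>

/* Accumulate dW += dy * x^T for a linear layer y = W x.
 * x  : [c_in,  seq]  channel-first activations (row c holds channel c)
 * dy : [c_out, seq]  channel-first upstream gradient
 * dW : [c_out, c_in] gradient accumulator, row-major
 * Plain-loop stand-in for a cblas_sgemm call with beta = 1. */
void accumulate_dw(float *dW, const float *dy, const float *x,
                   size_t c_out, size_t c_in, size_t seq) {
    assert(dW && dy && x);
    for (size_t o = 0; o < c_out; o++) {
        for (size_t i = 0; i < c_in; i++) {
            float acc = 0.0f;
            for (size_t s = 0; s < seq; s++) {
                acc += dy[o * seq + s] * x[i * seq + s];
            }
            dW[o * c_in + i] += acc;   /* accumulate across calls */
        }
    }
}
```

With Accelerate, the same product could be written as `cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans, c_out, c_in, seq, 1.0f, dy, seq, x, seq, 1.0f, dW, c_in)`. Both operands are already stored with the sequence dimension contiguous, so no explicit transpose buffer is needed.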

Key optimizations:

Channel-first CPU layout: matches the ANE IOSurface [1,C,1,S] format, so no transposes are needed between CPU and ANE buffers
vDSP vectorized RMSNorm: about 10x faster than the naive loop (6.7 ms → 0.7 ms)
GCD async cblas overlap: dW gradient sgemms are dispatched to a serial GCD queue and run on the CPU in parallel with ANE evals
Deferred cblas wait: the wait on those sgemms is pushed into the next step's forward pass for maximum overlap
ANE RMSNorm fusion: RMSNorm folded into the forward kernels as MIL ops (reduce_sum + pow + mul)
Wo^T fusion: output projection backward merged into the SDPA backward kernel
Forward taps: Q, K, V, attention scores, and hidden states exposed via concat outputs, avoiding CPU recompute
exec() restart: works around the limit of ~119 ANE compiles per process
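
The exec() restart item can be sketched with POSIX calls: save enough state to resume, then replace the process image with a fresh copy of the same binary. This is a minimal illustration under assumptions. The budget constant, the checkpoint file format, and all function names are hypothetical, and a real checkpoint would also need the weights and optimizer state.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Assumed budget: stay safely below the ~119 compiles seen per process. */
#define COMPILE_BUDGET 100

/* Returns 1 when the next batch of compiles would exceed the budget. */
int should_restart(int compiles_done, int compiles_next) {
    assert(compiles_done >= 0 && compiles_next >= 0);
    return compiles_done + compiles_next > COMPILE_BUDGET;
}

/* Write the step counter to a (hypothetical) checkpoint file, then replace
 * the current process image with a fresh copy of the same program.
 * Returns -1 only if the checkpoint write or execv fails. */
int restart_with_checkpoint(char *const argv[], const char *ckpt_path, long step) {
    FILE *f = fopen(ckpt_path, "w");
    if (!f) return -1;
    fprintf(f, "%ld\n", step);
    if (fclose(f) != 0) return -1;
    execv(argv[0], argv);   /* on success this call does not return */
    return -1;
}

/* Read the step counter back after a restart; 0 if no checkpoint exists. */
long load_checkpoint_step(const char *ckpt_path) {
    long step = 0;
    FILE *f = fopen(ckpt_path, "r");
    if (!f) return 0;
    if (fscanf(f, "%ld", &step) != 1) step = 0;
    fclose(f);
    return step;
}
```

A training loop would call `should_restart(compiles_done, 6)` before recompiling the six kernels and, if it returns 1, call `restart_with_checkpoint(argv, path, step)`. `execv` expects `argv[0]` to be a usable path. If the program was found through `PATH`, `execvp` is the usual alternative.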

File Structure

├── api_exploration.m       # Initial ANE API discovery
├── inmem_basic.m           # In-memory MIL compilation proof-of-concept
├── inmem_bench.m           # ANE dispatch latency benchmarks
├── inmem_peak.m            # Peak TFLOPS measurement (2048x2048 matmul)
├── sram_bench.m            # ANE SRAM bandwidth probing
├── sram_probe.m            # SRAM size/layout exploration
├── training/
│   ├── ane_runtime.h       # ANE private API wrapper (compile, eval, IOSurface)
│   ├── ane_mil_gen.h       # MIL program generation helpers
│   ├── model.h             # Model weight initialization and blob builders
│   ├── forward.h           # Forward pass MIL generators
│   ├── backward.h          # Backward pass MIL generators
│   ├── train.m             # Minimal training loop (early prototype)
│   ├── tiny_train.m        # 2-layer tiny model training
│   ├── train_large.m       # Mainline: 12-layer Stories110M training
│   ├── test_*.m            # Unit tests for individual kernels
│   └── Makefile
└── scripts/
    └── benchmark_compare.sh # Side-by-side mainline vs initial-release benchmark

Building

Requires macOS 15+ on Apple Silicon (tested on M4).

# Build the main training program
xcrun clang -O2 -framework Foundation -framework IOSurface \
  -framework CoreML -framework Accelerate -ldl -lobjc \
  -o train_large training/train_large.m

# Run
./train_large

# Compare benchmark modes (mainline vs initial-release)
./scripts/benchmark_compare.sh --main-steps 10

No external dependencies. Uses only system frameworks + private ANE APIs resolved at runtime via objc_msgSend.

How It Works

MIL generation — Objective-C code constructs MIL program text at runtime, specifying convolutions (for linear layers), matmul (for attention), softmax, element-wise ops
In-memory compilation — _ANEInMemoryModelDescriptor compiles MIL text + weight blobs directly to ANE programs, no disk mlmodelc needed
IOSurface I/O — Input/output tensors passed via IOSurface shared memory in [1, channels, 1, spatial] format (fp16)
Weight embedding — Weights baked into ANE programs as BLOBFILE constants; recompiled each batch when weights change
Gradient flow — Forward taps expose intermediates needed for backward; backward kernels compute dx (input gradients) on ANE; dW (weight gradients) computed on CPU via cblas
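
The MIL generation step expresses linear layers as convolutions. In the [1, channels, 1, spatial] layout this works because a 1x1 convolution computes y = W x independently at every sequence position. The CPU reference below illustrates that equivalence. It is a sketch in float rather than fp16 and is not the ANE kernel itself. The function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Reference 1x1 convolution on a [1, c_in, 1, seq] tensor.
 * w : [c_out, c_in] kernel (the 1x1 spatial dims are dropped)
 * x : [c_in,  seq]  input, channel-first
 * y : [c_out, seq]  output, channel-first
 * Each output position s is y[:, s] = w * x[:, s], i.e. a linear layer
 * applied independently to every sequence position. */
void conv1x1(float *y, const float *w, const float *x,
             size_t c_out, size_t c_in, size_t seq) {
    assert(y && w && x);
    for (size_t o = 0; o < c_out; o++) {
        for (size_t s = 0; s < seq; s++) {
            float acc = 0.0f;
            for (size_t i = 0; i < c_in; i++) {
                acc += w[o * c_in + i] * x[i * seq + s];
            }
            y[o * seq + s] = acc;
        }
    }
}

/* Bytes needed for an fp16 [1, channels, 1, seq] tensor, assuming a
 * densely packed buffer (real IOSurface rows may be padded). */
size_t fp16_tensor_bytes(size_t channels, size_t seq) {
    return channels * seq * 2;
}
```

A reference like this can serve as a check for a generated kernel: run the same weights and input through both and compare the outputs within fp16 tolerance.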

Limitations

SDPA causal masking — ANE hardware ignores attn_mask in SDPA ops; causal attention is decomposed into separate Q@K^T (ANE) → mask+softmax (ANE via add+softmax) → scores@V (ANE)
~119 compile limit — ANE compiler leaks resources; worked around via exec() restart with checkpoint
Two regimes, different goals — main targets 12-layer Stories110M; f213c8d preserves a single-layer microbenchmark path for kernel-level optimization tracking
Synthetic data — Currently uses random data for benchmarking; real tokenized data support is WIP
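
The decomposed causal attention described above relies on an additive mask applied to the Q@K^T scores before softmax. The sketch below shows how such a mask could be built and applied on the CPU side. The mask value and function names are assumptions. The repository may use a different constant, and a smaller magnitude may be safer when the addition itself happens in fp16.

```c
#include <assert.h>
#include <stddef.h>

/* Most negative finite fp16 value (assumed choice; after softmax the
 * masked positions receive effectively zero probability). */
#define MASK_NEG (-65504.0f)

/* Fill a [seq, seq] additive causal mask: row i is the query position,
 * column j the key position. Future keys (j > i) are masked out. */
void build_causal_mask(float *mask, size_t seq) {
    assert(mask);
    for (size_t i = 0; i < seq; i++) {
        for (size_t j = 0; j < seq; j++) {
            mask[i * seq + j] = (j <= i) ? 0.0f : MASK_NEG;
        }
    }
}

/* scores += mask: the "add" step that precedes softmax. */
void add_mask(float *scores, const float *mask, size_t seq) {
    assert(scores && mask);
    for (size_t k = 0; k < seq * seq; k++) {
        scores[k] += mask[k];
    }
}
```

Because the mask depends only on the sequence length, it can be built once and supplied as a constant, consistent with `mask` appearing in the Weights column of the kernel table.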

Single-Layer Performance History (Initial Release)

| Optimization | ms/step | ANE util |
|---|---|---|
| Baseline (vDSP transpose) | 33.5 | 3.1% |
| Channel-first layout | 20.3 | 5.2% |
| vDSP vectorized RMSNorm | 14.2 | 7.4% |
| GCD async cblas overlap | 11.4 | 9.2% |
| ANE RMSNorm fusion | 11.4 | 9.2% |
| Wo^T fusion (7→6 kernels) | 11.4 | 9.2% |
| Deferred cblas wait | 9.3 | 11.2% |

Mainline Stories110M Performance

| Config | ms/step | ANE util |
|---|---|---|
| 12-layer, seq=256 (`main`) | 107.6 | 5.5% |

Disclaimer

This project is independent research into Apple Neural Engine architecture. It uses undocumented APIs discovered through runtime introspection for research and educational purposes under fair use and interoperability provisions (see Sega v. Accolade, 1992; DMCA §1201(f)). No Apple proprietary code or binaries are included in this repository. This project is not affiliated with or endorsed by Apple Inc. Use at your own risk.

License

MIT — see LICENSE
