From 7e40026e1ba53e8956c4fabb3c3d40626ef40623 Mon Sep 17 00:00:00 2001
From: Hyoyeol Kim <dailykim149656@gmail.com>
Date: Fri, 5 Jun 2026 12:34:02 +0900
Subject: [PATCH] Document local M2 ANE benchmark behavior

Record M2 measurements from a clean origin/main worktree so cross-generation benchmark coverage includes the base M2 Mac mini alongside existing Pro/Max/Ultra data.

Constraint: Results are documentation-only and measured on Apple M2 Mac mini 8 GB, macOS 26.5, commit d91c9845c0784dec7753048954fc6d0e8411fe29.

Rejected: Mixing benchmark tool fixes into this contribution | output quirks in inmem_peak and ane_int8_bench are noted but left for a separate tooling PR.

Confidence: medium

Scope-risk: narrow

Directive: Keep dynamic training separate from the static pipeline table unless a comparable static M2 run is submitted.

Tested: python3 -m json.tool benchmarks/community_results.json; git -c core.whitespace=blank-at-eol,blank-at-eof,space-before-tab,cr-at-eol diff --check -- benchmarks/local_m2_results.md benchmarks/community_results.json benchmarks/ANE_BENCHMARK_REPORT.md; clean-worktree benchmark runs logged under /private/tmp/ane_m2_2026-06-05_clean/.

Not-tested: Static train_large pipeline on M2; Qwen3-0.6B dynamic training on 8 GB M2.
---
 benchmarks/ANE_BENCHMARK_REPORT.md |  29 ++++--
 benchmarks/community_results.json  |  20 +++-
 benchmarks/local_m2_results.md     | 157 +++++++++++++++++++++++++++++
 3 files changed, 195 insertions(+), 11 deletions(-)
 create mode 100644 benchmarks/local_m2_results.md
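
Note for reviewers: the derived columns in the peak-style table of benchmarks/local_m2_results.md can be re-checked with a few lines of arithmetic. The sketch below is not part of the repository and its helper names are hypothetical. It assumes each layer is a 1x1 convolution with equal input and output channel counts (2 * ch * ch * sp FLOPs per layer) and FP16 weights at 2 bytes each. Those assumptions are inferred from the table values, not read from the benchmark source.

```python
# Sketch only: helper names are hypothetical, not part of the ANE repository.
# Assumes 1x1 convs with in_channels == out_channels and 2-byte FP16 weights.

M2_REFERENCE_TFLOPS = 15.8  # FP16 reference used in local_m2_results.md


def conv_chain_gflop(depth: int, channels: int, spatial: int) -> float:
    """GFLOP for one evaluation of `depth` stacked 1x1 convs (2 FLOPs per MAC)."""
    return 2 * channels * channels * spatial * depth / 1e9


def weight_mb(depth: int, channels: int, bytes_per_weight: int = 2) -> float:
    """Total weight size in MiB for the whole chain."""
    return depth * channels * channels * bytes_per_weight / 2**20


def tflops(gflop: float, ms_per_eval: float) -> float:
    """GFLOP per millisecond is numerically equal to TFLOP per second."""
    return gflop / ms_per_eval


def percent_of_reference(measured_tflops: float,
                         reference_tflops: float = M2_REFERENCE_TFLOPS) -> float:
    return 100.0 * measured_tflops / reference_tflops


if __name__ == "__main__":
    # 128x conv 512ch sp64 row, using the rounded median ms/eval from the table.
    depth, channels, spatial, ms = 128, 512, 64, 0.537
    g = conv_chain_gflop(depth, channels, spatial)
    t = tflops(g, ms)
    print(f"{g:.2f} GFLOP, {weight_mb(depth, channels):.1f} MB, "
          f"{t:.2f} TFLOPS, {percent_of_reference(t):.1f}% of reference")
```

Because the table reports per-column medians and a rounded ms/eval, recomputed TFLOPS can differ from the tabulated value in the last digit.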

diff --git a/benchmarks/ANE_BENCHMARK_REPORT.md b/benchmarks/ANE_BENCHMARK_REPORT.md
index b7095a0..9a4b123 100644
--- a/benchmarks/ANE_BENCHMARK_REPORT.md
+++ b/benchmarks/ANE_BENCHMARK_REPORT.md
@@ -40,6 +40,8 @@ M5              101-120   9.1-9.8  3.2-3.4s     0.77-0.91    4.9-5.8  @GitBubble
 
 *M3 Ultra = reference platform this project was developed on.
 
+M2 dynamic pipeline data was submitted separately from the static training table: the Stories110M dynamic weight pipeline averaged 1554.0 ms/step over 20 steps on an 8 GB M2 Mac mini (one-time compile: 1.2 s).
+
 ## Peak ANE Throughput (inmem_peak, 128x conv 512ch sp64)
 
 ```
@@ -47,6 +49,7 @@ Chip            NE Cores  FP16 TFLOPS (measured)    Rated TOPS (Apple spec*)
 ────────────────────────────────────────────────────────────────────────────
 M1 Pro          16        FAIL                      11    (MIL compat issue)
 M1 Max          16        FAIL                      11    (MIL compat issue)
+M2              16        7.99                      15.8  (Mac mini, median of 3)
 M3 Pro          16        9.98                      15.8
 M3 Ultra        32        -                         31.6  (ref platform)
 M4 Pro          16        12.57                     38
@@ -83,6 +86,7 @@ Peak ANE Throughput (TFLOPS, higher is better)
 
 M1 Pro    FAIL (MIL compat)
 M1 Max    FAIL (MIL compat)
+M2        ████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  7.99
 M3 Pro    ████████████████████░░░░░░░░░░░░░░░░░░░░░░░░  9.98
 M4 Pro    ████████████████████████████████░░░░░░░░░░░░░  12.57
 M4 Max    ██████████████████████░░░░░░░░░░░░░░░░░░░░░░  10.93
@@ -108,6 +112,13 @@ M3 Pro    ███████████████████████
 - ANE compiler handles weight blobs differently from M4+
 - Training at 148-167 ms/step, ~0.6 TFLOPS
 
+### M2
+- In-memory MIL benchmarks compile and run on macOS 26.5 with h14 ANE subtype
+- Peak-style `inmem_peak` median: 7.99 TFLOPS at 128x conv 512ch sp64 (about 50.6% of the 15.8 TFLOPS FP16 reference)
+- `inmem_bench` accepts every channel configuration tested here (256/512/1024/2048/3072/4096)
+- INT8 W8A8 ranges from rough parity with FP16 to modestly faster on the tested kernels (median ratios 1.01x-1.11x)
+- Stories110M dynamic weight pipeline works but is IO-dominated on the tested 8 GB Mac mini: 1554.0 ms/step over 20 steps
+
 ### M3 Pro
 - **Only ch=512 compiles** — 52 channel values tested (1-4096), only 512 accepted
 - Fixed 512-wide lane structure in SRAM tiling
@@ -131,13 +142,13 @@ M3 Pro    ███████████████████████
 ### Cross-Generation MIL Compatibility
 
 ```
-Feature                    M1       M3       M4       M5
-─────────────────────────────────────────────────────────
-program(1.3) / ios18       PARTIAL  YES      YES      YES
-Single-blob weights        FAIL     YES      YES      YES
-Per-matrix weight blobs    YES      YES      YES      YES
-Channel flexibility        ?        ch=512   FLEX     FLEX
-BLOBFILE offset refs       FAIL     YES      YES      YES
+Feature                    M1       M2       M3       M4       M5
+──────────────────────────────────────────────────────────────────
+program(1.3) / ios18       PARTIAL  YES      YES      YES      YES
+Single-blob weights        FAIL     YES      YES      YES      YES
+Per-matrix weight blobs    YES      YES      YES      YES      YES
+Channel flexibility        ?        FLEX     ch=512   FLEX     FLEX
+BLOBFILE offset refs       FAIL     YES      YES      YES      YES
 ```
 
 ## macOS Compatibility Issues
@@ -159,5 +170,5 @@ cd training && make train_large
 Include: chip model, macOS version, full output with JSON lines.
 
 ---
-*Report compiled 2026-03-04 from community submissions.*
-*Contributors: @SethBurkart123, @srt54558, @andyg5000, @moriwang, @D-Ogi, @GitBubble, @elijah-pelton*
+*Report compiled 2026-03-04 from community submissions; updated 2026-06-05 with local M2 results.*
+*Contributors: @SethBurkart123, @srt54558, @andyg5000, @moriwang, @D-Ogi, @GitBubble, @elijah-pelton, kimhyoyeol*
diff --git a/benchmarks/community_results.json b/benchmarks/community_results.json
index e975925..8daf0f5 100644
--- a/benchmarks/community_results.json
+++ b/benchmarks/community_results.json
@@ -1,6 +1,6 @@
 {
-  "report_date": "2026-03-04",
-  "source": "https://github.com/maderix/ANE/issues/3",
+  "report_date": "2026-06-05",
+  "source": "https://github.com/maderix/ANE/issues/3 plus benchmarks/local_m2_results.md",
   "model": "Stories110M (12-layer transformer, 109M params)",
   "config": {"dim": 768, "hidden": 2048, "heads": 12, "seq": 256, "vocab": 32000, "layers": 12},
   "training_results": [
@@ -32,6 +32,22 @@
       "notes": "Same MIL compat issue as M1 Pro.",
       "contributor": "andyg5000"
     },
+    {
+      "chip": "M2",
+      "cores": "8-core CPU (4P+4E)",
+      "ram_gb": 8,
+      "macos": "26.5",
+      "pipeline": "dynamic weight (Stories110M)",
+      "dynamic_ms_per_step": [1554.0, 1554.0],
+      "dynamic_compile_ms": 1196,
+      "dynamic_wall_s": 68.0,
+      "peak_tflops_inmem": 7.99,
+      "peak_reference_tflops": 15.8,
+      "int8_w8a8_ratio_range": [1.01, 1.11],
+      "benchmarks_pass": true,
+      "notes": "Measured on Mac mini (Mac14,3), M2, 8GB. inmem benchmarks run sequentially from a clean origin/main worktree. Static pipeline not submitted; Qwen3-0.6B not run due to expected memory pressure.",
+      "contributor": "kimhyoyeol"
+    },
     {
       "chip": "M3 Pro",
       "cores": "12-core CPU",
diff --git a/benchmarks/local_m2_results.md b/benchmarks/local_m2_results.md
new file mode 100644
index 0000000..3e53b82
--- /dev/null
+++ b/benchmarks/local_m2_results.md
@@ -0,0 +1,157 @@
+# Local Apple M2 ANE Benchmark Results
+
+Date: 2026-06-05
+
+Host:
+
+- Mac mini (Mac14,3), Apple M2, 8-core CPU (4 performance + 4 efficiency), 8 GB memory
+- macOS 26.5, build 25F71
+- Apple clang 21.0.0 (clang-2100.1.1.101)
+- ANE subtype reported by the private benchmark API: h14
+- Repository commit tested: d91c9845c0784dec7753048954fc6d0e8411fe29 (`origin/main`)
+
+Runtime notes:
+
+- Results were measured from a clean worktree at `/private/tmp/ANE-m2-clean` to avoid local code changes affecting benchmark numbers.
+- Benchmarks were run sequentially on the same machine because parallel ANE workloads contend for the accelerator and produce outliers.
+- Tables below use the median of three runs where repeated. Raw logs were kept locally under `/private/tmp/ane_m2_2026-06-05_clean/`.
+- Qwen3-0.6B dynamic training was not run on this 8 GB M2 machine; resident fp32 weights, gradients, Adam state, activations, transposed buffers, and IOSurfaces are expected to be memory-heavy.
+- Two upstream output quirks were observed but not fixed in this results-only report: `inmem_peak` prints an invalid `%peak` value, and `ane_int8_bench` labels the h14 run as `M4`. The tables below use the raw TFLOPS/TOPS and host metadata instead.
+
+## In-Memory Baseline
+
+Command:
+
+```bash
+xcrun clang -O2 -fobjc-arc -framework Foundation -framework IOSurface -ldl \
+  -o inmem_bench inmem_bench.m
+./inmem_bench
+```
+
+Median of three runs:
+
+| Config | Weight MB | ms/eval | TFLOPS |
+|---|---:|---:|---:|
+| 256ch x 64sp | 0.1 | 0.136 | 0.06 |
+| 512ch x 64sp | 0.5 | 0.140 | 0.24 |
+| 1024ch x 64sp | 2.0 | 0.204 | 0.66 |
+| 2048ch x 64sp | 8.0 | 0.358 | 1.50 |
+| 3072ch x 64sp | 18.0 | 0.552 | 2.19 |
+| 4096ch x 64sp | 32.0 | 0.909 | 2.36 |
+
+## Peak-Style Conv Chain
+
+Command:
+
+```bash
+xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML \
+  -framework IOSurface -ldl -o inmem_peak inmem_peak.m
+./inmem_peak
+```
+
+Median of three runs. `% peak` below is computed against the M2 15.8 TFLOPS FP16 reference; the current program output prints an invalid `%peak` column.
+
+| Config | Weight MB | GFLOP | ms/eval | TFLOPS | % peak |
+|---|---:|---:|---:|---:|---:|
+| 32x conv 512ch sp64 | 16.0 | 1.07 | 0.239 | 4.50 | 28.5 |
+| 48x conv 512ch sp64 | 24.0 | 1.61 | 0.280 | 5.74 | 36.3 |
+| 64x conv 512ch sp64 | 32.0 | 2.15 | 0.301 | 7.13 | 45.1 |
+| 96x conv 512ch sp64 | 48.0 | 3.22 | 0.404 | 7.98 | 50.5 |
+| 128x conv 512ch sp64 | 64.0 | 4.29 | 0.537 | 7.99 | 50.6 |
+| 64x conv 256ch sp64 | 8.0 | 0.54 | 0.160 | 3.35 | 21.2 |
+| 128x conv 256ch sp64 | 16.0 | 1.07 | 0.222 | 4.84 | 30.6 |
+| 256x conv 256ch sp64 | 32.0 | 2.15 | 0.340 | 6.32 | 40.0 |
+| 64x conv 384ch sp64 | 18.0 | 1.21 | 0.245 | 4.94 | 31.3 |
+| 128x conv 384ch sp64 | 36.0 | 2.42 | 0.345 | 7.00 | 44.3 |
+
+## INT8 W8A8
+
+Command:
+
+```bash
+xcrun clang -O2 -fobjc-arc -framework Foundation -framework IOSurface -ldl \
+  -o ane_int8_bench ane_int8_bench.m
+./ane_int8_bench
+```
+
+Median of three runs:
+
+| Config | Precision | Weight MB | GOP | ms/eval | TOPS | Ratio |
+|---|---|---:|---:|---:|---:|---:|
+| 128x conv 512ch 64x64 | FP16 | 64.0 | 274.88 | 22.847 | 12.03 | - |
+| 128x conv 512ch 64x64 | W8A8 | 32.0 | 274.88 | 23.046 | 11.93 | 1.01x |
+| 64x conv 512ch 64x64 | FP16 | 32.0 | 137.44 | 12.442 | 11.05 | - |
+| 64x conv 512ch 64x64 | W8A8 | 16.0 | 137.44 | 11.861 | 11.59 | 1.06x |
+| 256x conv 256ch 64x64 | FP16 | 32.0 | 137.44 | 12.984 | 10.59 | - |
+| 256x conv 256ch 64x64 | W8A8 | 16.0 | 137.44 | 13.272 | 10.36 | 1.05x |
+| 128x conv 256ch 64x64 | FP16 | 16.0 | 68.72 | 6.801 | 10.10 | - |
+| 128x conv 256ch 64x64 | W8A8 | 8.0 | 68.72 | 6.348 | 10.83 | 1.11x |
+| 128x conv 384ch 64x64 | FP16 | 36.0 | 154.62 | 14.220 | 10.87 | - |
+| 128x conv 384ch 64x64 | W8A8 | 18.0 | 154.62 | 13.770 | 11.23 | 1.08x |
+
+On this M2, W8A8 ranges from approximate parity with FP16 to a modest improvement on these kernels, well short of the larger M4 speedup reported upstream.
+
+## Dynamic Matmul
+
+Command:
+
+```bash
+cd training
+xcrun clang -O2 -Wall -DACCELERATE_NEW_LAPACK -fobjc-arc \
+  -o test_dynamic_matmul test_dynamic_matmul.m \
+  -framework Foundation -framework CoreML -framework IOSurface -ldl -framework Accelerate
+./test_dynamic_matmul
+```
+
+Result:
+
+- 64x64 identity correctness: PASS, max error 0.001938
+- 64x64 scale-by-2 correctness: PASS, ratio 2.000
+- 768x768x256 single dynamic matmul: 1.012 ms/eval, 298.4 GFLOP/s
+- With weight IO: 0.871 ms/eval, 346.8 GFLOP/s
+- vs `cblas_sgemm`: PASS, max error 0.014646
+
+Tiled 768x768 matmul:
+
+| tile_oc | tiles | compile ms | eval ms | GFLOP/s |
+|---:|---:|---:|---:|---:|
+| 64 | 12 | 543 | 4.318 | 69.9 |
+| 128 | 6 | 260 | 1.752 | 172.4 |
+| 256 | 3 | 110 | 1.041 | 290.1 |
+| 384 | 2 | 69 | 0.871 | 346.8 |
+| 768 | 1 | 47 | 0.652 | 463.0 |
+
+## Dynamic Training
+
+Data:
+
+```bash
+cd training
+bash download_data.sh
+```
+
+The local run used `tinystories_data00.bin`: 20,658,981 tokens, 41.3 MB.
+
+Build and run:
+
+```bash
+cd training/training_dynamic
+make MODEL=stories110m
+./train --scratch --steps 20 --accum 10 --warmup 2 --data ../tinystories_data00.bin
+```
+
+Result:
+
+- Model: Stories110M, 109.5M parameters
+- Active compact vocab: 9,205 tokens from 32,000
+- One-time compile: 1,196 ms for 10 kernels
+- Train time: 31,081 ms total
+- Average train: 1,554.0 ms/step
+- Wall time: 68.0 s
+- Step 0 loss: 9.1105
+- Step 10 loss: 8.6389
+
+Notes:
+
+- This is the dynamic weight pipeline, not the static pipeline used by the main cross-generation training table.
+- The measured dynamic training run was IO-dominated on this 8 GB M2 machine.