From 7e40026e1ba53e8956c4fabb3c3d40626ef40623 Mon Sep 17 00:00:00 2001
From: Hyoyeol Kim
Date: Fri, 5 Jun 2026 12:34:02 +0900
Subject: [PATCH] Document local M2 ANE benchmark behavior

Record M2 measurements from a clean origin/main worktree so cross-generation benchmark coverage includes the base M2 Mac mini alongside existing Pro/Max/Ultra data.

Constraint: Results are documentation-only and measured on Apple M2 Mac mini 8 GB, macOS 26.5, commit d91c9845c0784dec7753048954fc6d0e8411fe29.
Rejected: Mixing benchmark tool fixes into this contribution | output quirks in inmem_peak and ane_int8_bench are noted but left for a separate tooling PR.
Confidence: medium
Scope-risk: narrow
Directive: Keep dynamic training separate from the static pipeline table unless a comparable static M2 run is submitted.
Tested: python3 -m json.tool benchmarks/community_results.json; git -c core.whitespace=blank-at-eol,blank-at-eof,space-before-tab,cr-at-eol diff --check -- benchmarks/local_m2_results.md benchmarks/community_results.json benchmarks/ANE_BENCHMARK_REPORT.md; clean-worktree benchmark runs logged under /private/tmp/ane_m2_2026-06-05_clean/.
Not-tested: Static train_large pipeline on M2; Qwen3-0.6B dynamic training on 8 GB M2.
---
 benchmarks/ANE_BENCHMARK_REPORT.md |  29 ++++--
 benchmarks/community_results.json  |  20 +++-
 benchmarks/local_m2_results.md     | 157 +++++++++++++++++++++++++++++
 3 files changed, 195 insertions(+), 11 deletions(-)
 create mode 100644 benchmarks/local_m2_results.md

diff --git a/benchmarks/ANE_BENCHMARK_REPORT.md b/benchmarks/ANE_BENCHMARK_REPORT.md
index b7095a0..9a4b123 100644
--- a/benchmarks/ANE_BENCHMARK_REPORT.md
+++ b/benchmarks/ANE_BENCHMARK_REPORT.md
@@ -40,6 +40,8 @@ M5 101-120 9.1-9.8 3.2-3.4s 0.77-0.91 4.9-5.8 @GitBubble
 
 *M3 Ultra = reference platform this project was developed on.
 
+M2 dynamic pipeline data was submitted separately from the static training table: the Stories110M dynamic weight pipeline averaged 1554.0 ms/step over 20 steps on an 8 GB M2 Mac mini (one-time compile: 1.2 s).
+
 ## Peak ANE Throughput (inmem_peak, 128x conv 512ch sp64)
 
 ```
@@ -47,6 +49,7 @@ Chip NE Cores FP16 TFLOPS (measured) Rated TOPS (Apple spec*)
 ────────────────────────────────────────────────────────────────────────────
 M1 Pro     16        FAIL                    11 (MIL compat issue)
 M1 Max     16        FAIL                    11 (MIL compat issue)
+M2         16        7.99                    15.8 (Mac mini, median of 3)
 M3 Pro     16        9.98                    15.8
 M3 Ultra   32        -                       31.6 (ref platform)
 M4 Pro     16        12.57                   38
@@ -83,6 +86,7 @@ Peak ANE Throughput (TFLOPS, higher is better)
 
 M1 Pro  FAIL (MIL compat)
 M1 Max  FAIL (MIL compat)
+M2      ████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 7.99
 M3 Pro  ████████████████████░░░░░░░░░░░░░░░░░░░░░░░░ 9.98
 M4 Pro  ████████████████████████████████░░░░░░░░░░░░░ 12.57
 M4 Max  ██████████████████████░░░░░░░░░░░░░░░░░░░░░░ 10.93
@@ -108,6 +112,13 @@ M3 Pro ███████████████████████
 - ANE compiler handles weight blobs differently from M4+
 - Training at 148-167 ms/step, ~0.6 TFLOPS
 
+### M2
+- In-memory MIL benchmarks compile and run on macOS 26.5 with the h14 ANE subtype
+- Peak-style `inmem_peak` median: 7.99 TFLOPS at 128x conv 512ch sp64 (about 50.6% of the 15.8 TFLOPS FP16 reference)
+- `inmem_bench` accepts all six channel configurations tested here (256/512/1024/2048/3072/4096)
+- INT8 W8A8 ranges from rough parity with FP16 to modestly faster on the tested kernels (median ratios 1.01x-1.11x)
+- Stories110M dynamic weight pipeline works but is IO-dominated on the tested 8 GB Mac mini: 1554.0 ms/step over 20 steps
+
 ### M3 Pro
 - **Only ch=512 compiles** — 52 channel values tested (1-4096), only 512 accepted
 - Fixed 512-wide lane structure in SRAM tiling
@@ -131,13 +142,13 @@ M3 Pro ███████████████████████
 ### Cross-Generation MIL Compatibility
 
 ```
-Feature                  M1      M3      M4      M5
-─────────────────────────────────────────────────────────
-program(1.3) / ios18     PARTIAL YES     YES     YES
-Single-blob weights      FAIL    YES     YES     YES
-Per-matrix weight blobs  YES     YES     YES     YES
-Channel flexibility      ?       ch=512  FLEX    FLEX
-BLOBFILE offset refs     FAIL    YES     YES     YES
+Feature                  M1      M2      M3      M4      M5
+──────────────────────────────────────────────────────────────────
+program(1.3) / ios18     PARTIAL YES     YES     YES     YES
+Single-blob weights      FAIL    YES     YES     YES     YES
+Per-matrix weight blobs  YES     YES     YES     YES     YES
+Channel flexibility      ?       FLEX    ch=512  FLEX    FLEX
+BLOBFILE offset refs     FAIL    YES     YES     YES     YES
 ```
 
 ## macOS Compatibility Issues
@@ -159,5 +170,5 @@ cd training && make train_large
 Include: chip model, macOS version, full output with JSON lines.
 
 ---
-*Report compiled 2026-03-04 from community submissions.*
-*Contributors: @SethBurkart123, @srt54558, @andyg5000, @moriwang, @D-Ogi, @GitBubble, @elijah-pelton*
+*Report compiled 2026-03-04 from community submissions; updated 2026-06-05 with local M2 results.*
+*Contributors: @SethBurkart123, @srt54558, @andyg5000, @moriwang, @D-Ogi, @GitBubble, @elijah-pelton, @kimhyoyeol*
diff --git a/benchmarks/community_results.json b/benchmarks/community_results.json
index e975925..8daf0f5 100644
--- a/benchmarks/community_results.json
+++ b/benchmarks/community_results.json
@@ -1,6 +1,6 @@
 {
-  "report_date": "2026-03-04",
-  "source": "https://github.com/maderix/ANE/issues/3",
+  "report_date": "2026-06-05",
+  "source": "https://github.com/maderix/ANE/issues/3 plus benchmarks/local_m2_results.md",
   "model": "Stories110M (12-layer transformer, 109M params)",
   "config": {"dim": 768, "hidden": 2048, "heads": 12, "seq": 256, "vocab": 32000, "layers": 12},
   "training_results": [
@@ -32,6 +32,22 @@
       "notes": "Same MIL compat issue as M1 Pro.",
       "contributor": "andyg5000"
     },
+    {
+      "chip": "M2",
+      "cores": "8-core CPU (4P+4E)",
+      "ram_gb": 8,
+      "macos": "26.5",
+      "pipeline": "dynamic weight (Stories110M)",
+      "dynamic_ms_per_step": [1554.0, 1554.0],
+      "dynamic_compile_ms": 1196,
+      "dynamic_wall_s": 68.0,
+      "peak_tflops_inmem": 7.99,
+      "peak_reference_tflops": 15.8,
+      "int8_w8a8_ratio_range": [1.01, 1.11],
+      "benchmarks_pass": true,
+      "notes": "Measured on Mac mini (Mac14,3), M2, 8 GB. In-memory benchmarks were run sequentially from a clean origin/main worktree. Static pipeline not submitted; Qwen3-0.6B not run due to expected memory pressure.",
+      "contributor": "kimhyoyeol"
+    },
     {
       "chip": "M3 Pro",
       "cores": "12-core CPU",
diff --git a/benchmarks/local_m2_results.md b/benchmarks/local_m2_results.md
new file mode 100644
index 0000000..3e53b82
--- /dev/null
+++ b/benchmarks/local_m2_results.md
@@ -0,0 +1,157 @@
+# Local Apple M2 ANE Benchmark Results
+
+Date: 2026-06-05
+
+Host:
+
+- Mac mini (Mac14,3), Apple M2, 8-core CPU (4 performance + 4 efficiency), 8 GB memory
+- macOS 26.5, build 25F71
+- Apple clang 21.0.0 (clang-2100.1.1.101)
+- ANE subtype reported by the private benchmark API: h14
+- Repository commit tested: d91c9845c0784dec7753048954fc6d0e8411fe29 (`origin/main`)
+
+Runtime notes:
+
+- Results were measured from a clean worktree at `/private/tmp/ANE-m2-clean` to avoid local code changes affecting benchmark numbers.
+- Benchmarks were run sequentially on the same machine because parallel ANE workloads contend for the accelerator and produce outliers.
+- Where a benchmark was repeated, the tables below report the median of three runs. Raw logs were kept locally under `/private/tmp/ane_m2_2026-06-05_clean/`.
+- Qwen3-0.6B dynamic training was not run on this 8 GB M2 machine; resident fp32 weights, gradients, Adam state, activations, transposed buffers, and IOSurfaces are expected to put heavy pressure on memory.
+- Two upstream output quirks were observed but not fixed in this results-only report: `inmem_peak` prints an invalid `%peak` value, and `ane_int8_bench` labels the h14 run as `M4`. The tables below use the raw TFLOPS/TOPS and host metadata instead.
+
+## In-Memory Baseline
+
+Command:
+
+```bash
+xcrun clang -O2 -fobjc-arc -framework Foundation -framework IOSurface -ldl \
+  -o inmem_bench inmem_bench.m
+./inmem_bench
+```
+
+Median of three runs:
+
+| Config | Weight MB | ms/eval | TFLOPS |
+|---|---:|---:|---:|
+| 256ch x 64sp | 0.1 | 0.136 | 0.06 |
+| 512ch x 64sp | 0.5 | 0.140 | 0.24 |
+| 1024ch x 64sp | 2.0 | 0.204 | 0.66 |
+| 2048ch x 64sp | 8.0 | 0.358 | 1.50 |
+| 3072ch x 64sp | 18.0 | 0.552 | 2.19 |
+| 4096ch x 64sp | 32.0 | 0.909 | 2.36 |
+
+## Peak-Style Conv Chain
+
+Command:
+
+```bash
+xcrun clang -O2 -fobjc-arc -framework Foundation -framework CoreML \
+  -framework IOSurface -ldl -o inmem_peak inmem_peak.m
+./inmem_peak
+```
+
+Median of three runs. `% peak` below is computed against the M2 15.8 TFLOPS FP16 reference; the current program output prints an invalid `%peak` column.
+
+| Config | Weight MB | GFLOP | ms/eval | TFLOPS | % peak |
+|---|---:|---:|---:|---:|---:|
+| 32x conv 512ch sp64 | 16.0 | 1.07 | 0.239 | 4.50 | 28.5 |
+| 48x conv 512ch sp64 | 24.0 | 1.61 | 0.280 | 5.74 | 36.3 |
+| 64x conv 512ch sp64 | 32.0 | 2.15 | 0.301 | 7.13 | 45.1 |
+| 96x conv 512ch sp64 | 48.0 | 3.22 | 0.404 | 7.98 | 50.5 |
+| 128x conv 512ch sp64 | 64.0 | 4.29 | 0.537 | 7.99 | 50.6 |
+| 64x conv 256ch sp64 | 8.0 | 0.54 | 0.160 | 3.35 | 21.2 |
+| 128x conv 256ch sp64 | 16.0 | 1.07 | 0.222 | 4.84 | 30.6 |
+| 256x conv 256ch sp64 | 32.0 | 2.15 | 0.340 | 6.32 | 40.0 |
+| 64x conv 384ch sp64 | 18.0 | 1.21 | 0.245 | 4.94 | 31.3 |
+| 128x conv 384ch sp64 | 36.0 | 2.42 | 0.345 | 7.00 | 44.3 |
+
+## INT8 W8A8
+
+Command:
+
+```bash
+xcrun clang -O2 -fobjc-arc -framework Foundation -framework IOSurface -ldl \
+  -o ane_int8_bench ane_int8_bench.m
+./ane_int8_bench
+```
+
+Median of three runs:
+
+| Config | Precision | Weight MB | GOP | ms/eval | TOPS | Ratio |
+|---|---|---:|---:|---:|---:|---:|
+| 128x conv 512ch 64x64 | FP16 | 64.0 | 274.88 | 22.847 | 12.03 | - |
+| 128x conv 512ch 64x64 | W8A8 | 32.0 | 274.88 | 23.046 | 11.93 | 1.01x |
+| 64x conv 512ch 64x64 | FP16 | 32.0 | 137.44 | 12.442 | 11.05 | - |
+| 64x conv 512ch 64x64 | W8A8 | 16.0 | 137.44 | 11.861 | 11.59 | 1.06x |
+| 256x conv 256ch 64x64 | FP16 | 32.0 | 137.44 | 12.984 | 10.59 | - |
+| 256x conv 256ch 64x64 | W8A8 | 16.0 | 137.44 | 13.272 | 10.36 | 1.05x |
+| 128x conv 256ch 64x64 | FP16 | 16.0 | 68.72 | 6.801 | 10.10 | - |
+| 128x conv 256ch 64x64 | W8A8 | 8.0 | 68.72 | 6.348 | 10.83 | 1.11x |
+| 128x conv 384ch 64x64 | FP16 | 36.0 | 154.62 | 14.220 | 10.87 | - |
+| 128x conv 384ch 64x64 | W8A8 | 18.0 | 154.62 | 13.770 | 11.23 | 1.08x |
+
+On this M2, W8A8 ranges from approximate parity to a modest improvement over FP16 on these kernels; it does not show the larger M4 speedup reported upstream.
+
+## Dynamic Matmul
+
+Command:
+
+```bash
+cd training
+xcrun clang -O2 -Wall -DACCELERATE_NEW_LAPACK -fobjc-arc \
+  -o test_dynamic_matmul test_dynamic_matmul.m \
+  -framework Foundation -framework CoreML -framework IOSurface -ldl -framework Accelerate
+./test_dynamic_matmul
+```
+
+Result:
+
+- 64x64 identity correctness: PASS, max error 0.001938
+- 64x64 scale-by-2 correctness: PASS, ratio 2.000
+- 768x768x256 single dynamic matmul: 1.012 ms/eval, 298.4 GFLOP/s
+- With weight IO: 0.871 ms/eval, 346.8 GFLOP/s
+- vs `cblas_sgemm`: PASS, max error 0.014646
+
+Tiled 768x768 matmul:
+
+| tile_oc | tiles | compile ms | eval ms | GFLOP/s |
+|---:|---:|---:|---:|---:|
+| 64 | 12 | 543 | 4.318 | 69.9 |
+| 128 | 6 | 260 | 1.752 | 172.4 |
+| 256 | 3 | 110 | 1.041 | 290.1 |
+| 384 | 2 | 69 | 0.871 | 346.8 |
+| 768 | 1 | 47 | 0.652 | 463.0 |
+
+## Dynamic Training
+
+Data:
+
+```bash
+cd training
+bash download_data.sh
+```
+
+The local run used `tinystories_data00.bin`: 20,658,981 tokens, 41.3 MB.
+
+Build and run:
+
+```bash
+cd training/training_dynamic
+make MODEL=stories110m
+./train --scratch --steps 20 --accum 10 --warmup 2 --data ../tinystories_data00.bin
+```
+
+Result:
+
+- Model: Stories110M, 109.5M parameters
+- Active compact vocab: 9,205 tokens from 32,000
+- One-time compile: 1,196 ms for 10 kernels
+- Train time: 31,081 ms total
+- Average train: 1,554.0 ms/step
+- Wall time: 68.0 s
+- Step 0 loss: 9.1105
+- Step 10 loss: 8.6389
+
+Notes:
+
+- This is the dynamic weight pipeline, not the static pipeline used by the main cross-generation training table.
+- The measured dynamic training run was IO-dominated on this 8 GB M2 machine.
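
Reviewer note on the `% peak` column: `benchmarks/local_m2_results.md` states that `% peak` is computed against the 15.8 TFLOPS FP16 reference because `inmem_peak` currently prints an invalid value. The arithmetic can be sketched in a few lines of Python. This is only an illustration, assuming TFLOPS is derived as GFLOP divided by ms/eval; the function names and the constant name are hypothetical and do not exist in the repository.

```python
# Hypothetical helpers, not part of the ANE repository: they recompute the
# TFLOPS and "% peak" columns of the Peak-Style Conv Chain table from the
# GFLOP and ms/eval columns.
M2_REFERENCE_TFLOPS = 15.8  # FP16 reference figure quoted in the report


def tflops(gflop: float, ms_per_eval: float) -> float:
    """GFLOP per millisecond is numerically equal to TFLOP per second."""
    return gflop / ms_per_eval


def percent_of_peak(measured_tflops: float,
                    reference_tflops: float = M2_REFERENCE_TFLOPS) -> float:
    """Express a measured throughput as a percentage of the reference."""
    return 100.0 * measured_tflops / reference_tflops


if __name__ == "__main__":
    # Row "128x conv 512ch sp64": 4.29 GFLOP at 0.537 ms/eval.
    measured = tflops(4.29, 0.537)
    print(f"{measured:.2f} TFLOPS, {percent_of_peak(measured):.1f}% of reference")
```

Applied to the 128x conv 512ch sp64 row, this reproduces the 7.99 TFLOPS and 50.6% figures in the table, so the same check could be run over the other rows when the tooling quirk is addressed separately.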