diff --git a/bench/m5_max_128gb_report.md b/bench/m5_max_128gb_report.md
new file mode 100644
index 00000000..b3406385
--- /dev/null
+++ b/bench/m5_max_128gb_report.md
@@ -0,0 +1,114 @@
+# DS4 throughput on MacBook Pro M5 Max (128 GB): Metal, IQ2
+
+Five back-to-back `ds4-bench` sweeps with the documented long-context command,
+spaced by a passive cool-down between runs to keep thermal throttling from
+dominating any single sweep.
+
+## Setup
+
+| | |
+|---|---|
+| Host | MacBook Pro Apple M5 Max (`Mac17,7`), 18 cores, 128 GB unified memory |
+| OS | macOS 26.5 (build `25F71`), Darwin `25.5.0` |
+| Backend | Metal (default on macOS) |
+| Model | `DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf` |
+| Model SHA-256 | `31598c67c8b8744d3bcebcd19aa62253c6dc43cef3b8adf9f593656c9e86fd8c` |
+| DS4 commit | `920f987` |
+| Context buffers | 1933.10 MiB at `ctx=65665`, `prefill_chunk=2048`, `raw_kv_rows=2304`, `compressed_kv_rows=16418` |
+
+Command used for every run:
+
+```sh
+./ds4-bench -m ds4flash.gguf \
+  --prompt-file bench/promessi_sposi.txt \
+  --ctx-start 2048 --ctx-max 65536 \
+  --step-incr 2048 --gen-tokens 128
+```
+
+Five sequential sweeps, each followed by an idle cool-down before the next was
+launched, to keep tail rows comparable to head rows within a single run.
+
+## Per-run summary
+
+`prefill avg` and `gen avg` are arithmetic means (tok/s) across all 32 frontiers in the
+sweep; the `@64k` columns are the last frontier (`ctx_tokens = 65536`).
+
+| run | wall (s) | prefill avg | gen avg | prefill @64k | gen @64k |
+|----:|---------:|------------:|--------:|-------------:|---------:|
+| 1 | 491 | 215.77 | 24.68 | 166.11 | 21.53 |
+| 2 | 467 | 231.78 | 25.08 | 162.05 | 20.56 |
+| 3 | 477 | 226.16 | 24.68 | 160.92 | 19.93 |
+| 4 | 527 | 209.87 | 22.66 | 150.88 | 18.31 |
+| 5 | 479 | 227.23 | 25.01 | 167.57 | 21.15 |
+
+Run 4 is visibly the hottest pass: wall time is ~10% over the median (527 s vs 479 s), and both
+averages and both `@64k` numbers are the worst of the five. Runs 1, 2, 3 and 5
+are tightly clustered, suggesting the cool-down between runs is otherwise doing
+its job; run 4 is included as-is because the spread is part of the result.
+
+## Aggregated per-frontier throughput (n = 5)
+
+`σ` is the population standard deviation across the five runs at that frontier; throughput columns are in tok/s.
+
+| ctx | KV (MiB) | prefill mean | prefill σ | prefill min/max | gen mean | gen σ | gen min/max |
+|----:|---------:|-------------:|----------:|----------------:|---------:|------:|------------:|
+| 2048 | 49.8 | 369.72 | 1.68 | 367.41 / 371.25 | 31.46 | 0.07 | 31.37 / 31.56 |
+| 4096 | 76.6 | 316.62 | 4.07 | 308.93 / 320.03 | 30.85 | 0.17 | 30.61 / 31.02 |
+| 6144 | 103.5 | 300.43 | 8.07 | 287.41 / 308.13 | 30.46 | 0.30 | 29.88 / 30.70 |
+| 8192 | 130.4 | 293.18 | 4.23 | 287.25 / 299.70 | 30.21 | 0.40 | 29.44 / 30.58 |
+| 10240 | 157.3 | 280.01 | 5.79 | 271.40 / 287.65 | 29.58 | 0.12 | 29.45 / 29.79 |
+| 12288 | 184.2 | 272.51 | 7.61 | 262.67 / 284.56 | 29.03 | 0.47 | 28.24 / 29.61 |
+| 14336 | 211.1 | 263.58 | 8.23 | 253.18 / 276.85 | 28.88 | 0.25 | 28.55 / 29.11 |
+| 16384 | 237.9 | 261.95 | 6.76 | 253.12 / 272.18 | 28.50 | 0.80 | 27.12 / 29.31 |
+| 18432 | 264.8 | 256.54 | 9.01 | 242.75 / 268.11 | 27.04 | 0.51 | 26.40 / 27.78 |
+| 20480 | 291.7 | 253.61 | 6.66 | 244.86 / 263.49 | 26.96 | 0.47 | 26.30 / 27.36 |
+| 22528 | 318.6 | 251.17 | 6.63 | 240.68 / 259.51 | 26.57 | 0.63 | 25.41 / 27.12 |
+| 24576 | 345.5 | 247.47 | 7.40 | 237.63 / 256.34 | 26.30 | 0.67 | 25.01 / 26.88 |
+| 26624 | 372.4 | 240.08 | 8.75 | 229.31 / 250.25 | 25.73 | 0.73 | 24.30 / 26.28 |
+| 28672 | 399.2 | 234.19 | 10.30 | 221.75 / 246.17 | 25.46 | 0.82 | 23.84 / 26.08 |
+| 30720 | 426.1 | 229.06 | 11.66 | 213.45 / 241.76 | 24.78 | 0.94 | 22.95 / 25.42 |
+| 32768 | 453.0 | 222.39 | 13.50 | 202.83 / 238.00 | 24.29 | 1.27 | 21.81 / 25.22 |
+| 34816 | 479.9 | 212.49 | 16.40 | 186.43 / 230.12 | 23.50 | 1.57 | 20.45 / 24.86 |
+| 36864 | 506.8 | 203.31 | 18.22 | 172.81 / 222.65 | 22.92 | 2.13 | 18.80 / 24.67 |
+| 38912 | 533.7 | 195.13 | 19.74 | 160.96 / 214.76 | 22.17 | 2.22 | 17.89 / 24.17 |
+| 40960 | 560.5 | 189.39 | 18.41 | 157.57 / 208.13 | 21.89 | 2.00 | 18.09 / 23.75 |
+| 43008 | 587.4 | 183.36 | 14.86 | 159.23 / 200.00 | 21.36 | 1.71 | 18.16 / 23.16 |
+| 45056 | 614.3 | 178.57 | 13.04 | 156.37 / 194.05 | 21.01 | 1.63 | 18.01 / 22.90 |
+| 47104 | 641.2 | 174.81 | 12.31 | 154.00 / 188.24 | 20.68 | 1.53 | 17.81 / 22.32 |
+| 49152 | 668.1 | 171.82 | 10.44 | 153.41 / 182.24 | 20.45 | 1.34 | 17.89 / 21.78 |
+| 51200 | 695.0 | 167.86 | 8.79 | 151.50 / 175.61 | 20.20 | 1.34 | 17.61 / 21.30 |
+| 53248 | 721.8 | 165.84 | 7.91 | 150.99 / 172.05 | 20.17 | 1.14 | 17.99 / 21.05 |
+| 55296 | 748.7 | 164.62 | 6.72 | 151.74 / 170.83 | 20.07 | 1.15 | 17.90 / 20.99 |
+| 57344 | 775.6 | 164.83 | 6.44 | 152.20 / 170.15 | 20.15 | 1.03 | 18.20 / 20.90 |
+| 59392 | 802.5 | 162.03 | 6.05 | 150.97 / 168.85 | 20.14 | 1.09 | 18.12 / 21.19 |
+| 61440 | 829.4 | 161.08 | 5.72 | 151.35 / 168.03 | 20.29 | 1.09 | 18.32 / 21.43 |
+| 63488 | 856.3 | 160.05 | 6.72 | 151.03 / 168.35 | 20.12 | 1.18 | 18.22 / 21.50 |
+| 65536 | 883.1 | 161.51 | 5.86 | 150.88 / 167.57 | 20.30 | 1.13 | 18.31 / 21.53 |
+
+## Observations
+
+- **Cold context is fast.** At `ctx = 2048` Metal hits **~370 tok/s prefill /
+  ~31.5 tok/s generation** with σ < 2 tok/s and σ < 0.1 tok/s respectively;
+  the M5 Max is very repeatable when nothing is hot or memory-pressured.
+- **Decode degrades smoothly with KV size.** Generation drops from 31.5 tok/s
+  at 2k to ~20 tok/s at 64k, a ~35% fall over a 32× context expansion. KV
+  attention reads dominate at the tail.
+- **Prefill plateaus, doesn't collapse.** Prefill flattens at ~160-165 tok/s once
+  context passes ~55k. The high-context floor is stable and reproducible.
+- **Mid-context (~30k–40k) shows the highest run-to-run variance** (σ up to
+  ~20 tok/s on prefill, ~2 tok/s on generation). The same rows in run 4 lag
+  the others: those frontiers happen to land inside the thermally-loaded
+  window of a run, while the head and tail of every sweep are tighter.
+- **Best-row comparison to the README M3 Max single-run numbers**:
+  at the head, M5 Max prefill peaks at **371 t/s** (M3 Max README short-prompt
+  prefill: 58 t/s, measured a different way); the long-context generation
+  floor here is ~20 t/s vs the README's 21.47 t/s at 11.7k tokens for M3 Max.
+  These are not apples-to-apples (different prompt lengths and run shape),
+  but the M5 Max sustains its decode rate noticeably farther into the
+  context.
+
+## Reproducing
+
+The raw CSVs (one per run) and the captured `ds4-bench` stderr are in
+`bench/results/run{1..5}.csv` and `bench/results/run{1..5}.log`. The aggregate
+table above is regenerated from those CSVs.
diff --git a/bench/results/run1.csv b/bench/results/run1.csv
new file mode 100644
index 00000000..d6b43ca5
--- /dev/null
+++ b/bench/results/run1.csv
@@ -0,0 +1,33 @@
+ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
+2048,2048,367.41,128,31.37,52184460
+4096,2048,308.93,128,30.69,80373132
+6144,2048,287.41,128,29.88,108561804
+8192,2048,287.25,128,29.44,136750476
+10240,2048,271.40,128,29.50,164939148
+12288,2048,262.67,128,29.12,193127820
+14336,2048,256.75,128,28.55,221316492
+16384,2048,253.12,128,28.33,249505164
+18432,2048,242.75,128,27.18,277693836
+20480,2048,244.86,128,27.36,305882508
+22528,2048,240.68,128,26.99,334071180
+24576,2048,237.63,128,26.50,362259852
+26624,2048,229.31,128,25.87,390448524
+28672,2048,221.92,128,25.61,418637196
+30720,2048,216.64,128,24.85,446825868
+32768,2048,210.24,128,24.41,475014540
+34816,2048,200.57,128,23.65,503203212
+36864,2048,192.56,128,23.08,531391884
+38912,2048,185.42,128,22.32,559580556
+40960,2048,180.47,128,21.84,587769228
+43008,2048,173.72,128,21.34,615957900
+45056,2048,172.52,128,20.91,644146572
+47104,2048,168.20,128,20.85,672335244
+49152,2048,167.34,128,20.92,700523916
+51200,2048,165.98,128,20.88,728712588
+53248,2048,164.38,128,20.79,756901260
+55296,2048,165.11,128,20.99,785089932
+57344,2048,168.26,128,20.90,813278604
+59392,2048,165.40,128,21.19,841467276
+61440,2048,164.74,128,21.43,869655948
+63488,2048,165.02,128,21.50,897844620
+65536,2048,166.11,128,21.53,926033292
diff --git a/bench/results/run1.log b/bench/results/run1.log
new file mode 100644
index 00000000..4856b115
--- /dev/null
+++ b/bench/results/run1.log
@@ -0,0 +1,5 @@
+ds4-bench: context buffers 1933.10 MiB (ctx=65665, backend=metal, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=16418)
+ds4: Metal device Apple M5 Max, 128.00 GiB RAM
+ds4: Metal model views created in 2.305 ms, residency requested in 286.884 ms, warmup 4.134 ms (mapped 82697.67 MiB from offset 5.08 MiB)
+ds4: Metal mapped mmaped model as 2 overlapping shared buffers
+ds4: metal backend initialized for graph diagnostics
diff --git a/bench/results/run2.csv b/bench/results/run2.csv
new file mode 100644
index 00000000..3a4141c5
--- /dev/null
+++ b/bench/results/run2.csv
@@ -0,0 +1,33 @@
+ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
+2048,2048,371.11,128,31.51,52184460
+4096,2048,320.03,128,30.95,80373132
+6144,2048,308.13,128,30.70,108561804
+8192,2048,299.70,128,30.58,136750476
+10240,2048,287.65,128,29.56,164939148
+12288,2048,284.56,128,28.82,193127820
+14336,2048,276.85,128,29.11,221316492
+16384,2048,272.18,128,29.30,249505164
+18432,2048,268.11,128,26.52,277693836
+20480,2048,263.49,128,26.47,305882508
+22528,2048,259.51,128,26.39,334071180
+24576,2048,256.34,128,26.32,362259852
+26624,2048,250.25,128,26.05,390448524
+28672,2048,246.17,128,25.87,418637196
+30720,2048,241.76,128,25.41,446825868
+32768,2048,238.00,128,25.22,475014540
+34816,2048,230.12,128,24.86,503203212
+36864,2048,222.65,128,24.67,531391884
+38912,2048,214.76,128,24.17,559580556
+40960,2048,208.13,128,23.75,587769228
+43008,2048,200.00,128,23.16,615957900
+45056,2048,194.05,128,22.90,644146572
+47104,2048,188.24,128,22.32,672335244
+49152,2048,182.24,128,21.78,700523916
+51200,2048,175.61,128,21.30,728712588
+53248,2048,171.45,128,21.05,756901260
+55296,2048,168.77,128,20.74,785089932
+57344,2048,166.39,128,20.86,813278604
+59392,2048,163.61,128,20.58,841467276
+61440,2048,162.49,128,20.55,869655948
+63488,2048,162.50,128,20.42,897844620
+65536,2048,162.05,128,20.56,926033292
diff --git a/bench/results/run2.log b/bench/results/run2.log
new file mode 100644
index 00000000..7329947c
--- /dev/null
+++ b/bench/results/run2.log
@@ -0,0 +1,5 @@
+ds4-bench: context buffers 1933.10 MiB (ctx=65665, backend=metal, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=16418)
+ds4: Metal device Apple M5 Max, 128.00 GiB RAM
+ds4: Metal model views created in 2.339 ms, residency requested in 259.419 ms, warmup 3.924 ms (mapped 82697.67 MiB from offset 5.08 MiB)
+ds4: Metal mapped mmaped model as 2 overlapping shared buffers
+ds4: metal backend initialized for graph diagnostics
diff --git a/bench/results/run3.csv b/bench/results/run3.csv
new file mode 100644
index 00000000..8bc6d06c
--- /dev/null
+++ b/bench/results/run3.csv
@@ -0,0 +1,33 @@
+ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
+2048,2048,371.25,128,31.45,52184460
+4096,2048,316.06,128,30.61,80373132
+6144,2048,294.60,128,30.52,108561804
+8192,2048,294.50,128,30.37,136750476
+10240,2048,275.53,128,29.45,164939148
+12288,2048,268.76,128,29.61,193127820
+14336,2048,265.15,128,29.07,221316492
+16384,2048,262.22,128,29.31,249505164
+18432,2048,264.08,128,27.31,277693836
+20480,2048,258.86,128,27.33,305882508
+22528,2048,255.26,128,26.93,334071180
+24576,2048,252.61,128,26.77,362259852
+26624,2048,246.05,128,26.16,390448524
+28672,2048,240.65,128,25.88,418637196
+30720,2048,237.89,128,25.29,446825868
+32768,2048,232.44,128,24.88,475014540
+34816,2048,224.43,128,24.29,503203212
+36864,2048,214.85,128,24.19,531391884
+38912,2048,209.23,128,23.15,559580556
+40960,2048,202.43,128,22.78,587769228
+43008,2048,193.46,128,21.91,615957900
+45056,2048,184.53,128,21.48,644146572
+47104,2048,182.88,128,20.96,672335244
+49152,2048,178.03,128,20.57,700523916
+51200,2048,172.77,128,20.26,728712588
+53248,2048,170.35,128,20.11,756901260
+55296,2048,166.63,128,19.90,785089932
+57344,2048,167.17,128,20.00,813278604
+59392,2048,161.33,128,19.94,841467276
+61440,2048,158.77,128,20.06,869655948
+63488,2048,153.33,128,19.42,897844620
+65536,2048,160.92,128,19.93,926033292
diff --git a/bench/results/run3.log b/bench/results/run3.log
new file mode 100644
index 00000000..6a8bffbb
--- /dev/null
+++ b/bench/results/run3.log
@@ -0,0 +1,5 @@
+ds4-bench: context buffers 1933.10 MiB (ctx=65665, backend=metal, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=16418)
+ds4: Metal device Apple M5 Max, 128.00 GiB RAM
+ds4: Metal model views created in 2.183 ms, residency requested in 266.192 ms, warmup 3.796 ms (mapped 82697.67 MiB from offset 5.08 MiB)
+ds4: Metal mapped mmaped model as 2 overlapping shared buffers
+ds4: metal backend initialized for graph diagnostics
diff --git a/bench/results/run4.csv b/bench/results/run4.csv
new file mode 100644
index 00000000..1c9b20d8
--- /dev/null
+++ b/bench/results/run4.csv
@@ -0,0 +1,33 @@
+ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
+2048,2048,370.88,128,31.56,52184460
+4096,2048,319.03,128,31.02,80373132
+6144,2048,305.53,128,30.65,108561804
+8192,2048,290.21,128,30.39,136750476
+10240,2048,283.15,128,29.79,164939148
+12288,2048,269.26,128,28.24,193127820
+14336,2048,253.18,128,28.58,221316492
+16384,2048,256.40,128,27.12,249505164
+18432,2048,251.54,128,26.40,277693836
+20480,2048,250.63,128,26.30,305882508
+22528,2048,246.91,128,25.41,334071180
+24576,2048,239.75,128,25.01,362259852
+26624,2048,229.87,128,24.30,390448524
+28672,2048,221.75,128,23.84,418637196
+30720,2048,213.45,128,22.95,446825868
+32768,2048,202.83,128,21.81,475014540
+34816,2048,186.43,128,20.45,503203212
+36864,2048,172.81,128,18.80,531391884
+38912,2048,160.96,128,17.89,559580556
+40960,2048,157.57,128,18.09,587769228
+43008,2048,159.23,128,18.16,615957900
+45056,2048,156.37,128,18.01,644146572
+47104,2048,154.00,128,17.81,672335244
+49152,2048,153.41,128,17.89,700523916
+51200,2048,151.50,128,17.61,728712588
+53248,2048,150.99,128,17.99,756901260
+55296,2048,151.74,128,17.90,785089932
+57344,2048,152.20,128,18.20,813278604
+59392,2048,150.97,128,18.12,841467276
+61440,2048,151.35,128,18.32,869655948
+63488,2048,151.03,128,18.22,897844620
+65536,2048,150.88,128,18.31,926033292
diff --git a/bench/results/run4.log b/bench/results/run4.log
new file mode 100644
index 00000000..bbe66d58
--- /dev/null
+++ b/bench/results/run4.log
@@ -0,0 +1,5 @@
+ds4-bench: context buffers 1933.10 MiB (ctx=65665, backend=metal, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=16418)
+ds4: Metal device Apple M5 Max, 128.00 GiB RAM
+ds4: Metal model views created in 2.492 ms, residency requested in 250.995 ms, warmup 3.962 ms (mapped 82697.67 MiB from offset 5.08 MiB)
+ds4: Metal mapped mmaped model as 2 overlapping shared buffers
+ds4: metal backend initialized for graph diagnostics
diff --git a/bench/results/run5.csv b/bench/results/run5.csv
new file mode 100644
index 00000000..99e942fd
--- /dev/null
+++ b/bench/results/run5.csv
@@ -0,0 +1,33 @@
+ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
+2048,2048,367.94,128,31.39,52184460
+4096,2048,319.03,128,30.99,80373132
+6144,2048,306.48,128,30.55,108561804
+8192,2048,294.24,128,30.26,136750476
+10240,2048,282.32,128,29.58,164939148
+12288,2048,277.30,128,29.34,193127820
+14336,2048,265.96,128,29.07,221316492
+16384,2048,265.81,128,28.46,249505164
+18432,2048,256.22,128,27.78,277693836
+20480,2048,250.23,128,27.33,305882508
+22528,2048,253.48,128,27.12,334071180
+24576,2048,251.00,128,26.88,362259852
+26624,2048,244.92,128,26.28,390448524
+28672,2048,240.47,128,26.08,418637196
+30720,2048,235.57,128,25.42,446825868
+32768,2048,228.45,128,25.15,475014540
+34816,2048,220.91,128,24.25,503203212
+36864,2048,213.68,128,23.88,531391884
+38912,2048,205.29,128,23.32,559580556
+40960,2048,198.36,128,23.00,587769228
+43008,2048,190.39,128,22.23,615957900
+45056,2048,185.37,128,21.76,644146572
+47104,2048,180.74,128,21.47,672335244
+49152,2048,178.08,128,21.11,700523916
+51200,2048,173.44,128,20.96,728712588
+53248,2048,172.05,128,20.91,756901260
+55296,2048,170.83,128,20.83,785089932
+57344,2048,170.15,128,20.80,813278604
+59392,2048,168.85,128,20.88,841467276
+61440,2048,168.03,128,21.08,869655948
+63488,2048,168.35,128,21.04,897844620
+65536,2048,167.57,128,21.15,926033292
diff --git a/bench/results/run5.log b/bench/results/run5.log
new file mode 100644
index 00000000..a2d3b9b3
--- /dev/null
+++ b/bench/results/run5.log
@@ -0,0 +1,5 @@
+ds4-bench: context buffers 1933.10 MiB (ctx=65665, backend=metal, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=16418)
+ds4: Metal device Apple M5 Max, 128.00 GiB RAM
+ds4: Metal model views created in 2.665 ms, residency requested in 8132.561 ms, warmup 4.000 ms (mapped 82697.67 MiB from offset 5.08 MiB)
+ds4: Metal mapped mmaped model as 2 overlapping shared buffers
+ds4: metal backend initialized for graph diagnostics
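
Review note: the report says its aggregate table is regenerated from the per-run CSVs, but no script for that step is part of this diff. The following is a minimal sketch of how the per-frontier mean, population σ, and min/max could be recomputed, assuming only the CSV column names shown above. The function names `summarize` and `aggregate` are hypothetical and do not come from the DS4 repository.

```python
import csv
import statistics
from collections import defaultdict


def summarize(values):
    """Return mean, population sigma, min and max for one frontier."""
    return {
        "mean": statistics.fmean(values),
        "sigma": statistics.pstdev(values),  # population sigma, as the report states
        "min": min(values),
        "max": max(values),
    }


def aggregate(runs):
    """Group rows by ctx_tokens across runs and summarize both throughput columns.

    `runs` is an iterable of line iterables (open CSV files or lists of
    strings), one per run, each starting with the header row.
    """
    prefill = defaultdict(list)
    gen = defaultdict(list)
    for lines in runs:
        for row in csv.DictReader(lines):
            ctx = int(row["ctx_tokens"])
            prefill[ctx].append(float(row["prefill_tps"]))
            gen[ctx].append(float(row["gen_tps"]))
    return {
        ctx: {"prefill": summarize(prefill[ctx]), "gen": summarize(gen[ctx])}
        for ctx in sorted(prefill)
    }


HEADER = "ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes"
# First data row of each run's CSV (the ctx = 2048 frontier), copied from this diff.
sample_runs = [
    [HEADER, "2048,2048,367.41,128,31.37,52184460"],
    [HEADER, "2048,2048,371.11,128,31.51,52184460"],
    [HEADER, "2048,2048,371.25,128,31.45,52184460"],
    [HEADER, "2048,2048,370.88,128,31.56,52184460"],
    [HEADER, "2048,2048,367.94,128,31.39,52184460"],
]
stats = aggregate(sample_runs)[2048]
print(f"prefill {stats['prefill']['mean']:.2f} sigma {stats['prefill']['sigma']:.2f}")
```

On the five `ctx = 2048` rows this gives a prefill mean of 369.72 and σ of 1.68, matching the first row of the aggregate table. To run it over the full data, pass five open file handles for `bench/results/run1.csv` through `run5.csv` instead of `sample_runs`. Using `statistics.pstdev` rather than `statistics.stdev` matters: the sample standard deviation would give larger values than the ones reported.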