114 changes: 114 additions & 0 deletions bench/m5_max_128gb_report.md
@@ -0,0 +1,114 @@
# DS4 throughput on MacBook Pro M5 Max (128 GB) — Metal, IQ2

This report covers five sequential `ds4-bench` sweeps with the documented
long-context command, separated by a passive cool-down so that thermal
throttling does not dominate any single sweep.

## Setup

| Item | Value |
|---|---|
| Host | MacBook Pro Apple M5 Max (`Mac17,7`), 18 cores, 128 GB unified memory |
| OS | macOS 26.5 (build `25F71`), Darwin `25.5.0` |
| Backend | Metal (default on macOS) |
| Model | `DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf` |
| Model SHA-256 | `31598c67c8b8744d3bcebcd19aa62253c6dc43cef3b8adf9f593656c9e86fd8c` |
| DS4 commit | `920f987` |
| Context buffers | 1933.10 MiB at `ctx=65665`, `prefill_chunk=2048`, `raw_kv_rows=2304`, `compressed_kv_rows=16418` |

Command used for every run:

```sh
./ds4-bench -m ds4flash.gguf \
--prompt-file bench/promessi_sposi.txt \
--ctx-start 2048 --ctx-max 65536 \
--step-incr 2048 --gen-tokens 128
```

The sweep was run five times in sequence. Each run was followed by an idle
cool-down before the next was launched, so that every sweep starts from a
comparable thermal state.

## Per-run summary

`prefill avg` and `gen avg` are arithmetic means, in tok/s, across all 32
frontiers in the sweep (2048 to 65536 tokens in steps of 2048); the `@64k`
columns are the values at the last frontier (`ctx_tokens = 65536`).

| run | wall (s) | prefill avg | gen avg | prefill @64k | gen @64k |
|----:|---------:|------------:|--------:|-------------:|---------:|
| 1 | 491 | 215.77 | 24.68 | 166.11 | 21.53 |
| 2 | 467 | 231.78 | 25.08 | 162.05 | 20.56 |
| 3 | 477 | 226.16 | 24.68 | 160.92 | 19.93 |
| 4 | 527 | 209.87 | 22.66 | 150.88 | 18.31 |
| 5 | 479 | 227.23 | 25.01 | 167.57 | 21.15 |
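
As a minimal sketch of how one row of this summary can be derived from a run's CSV, the snippet below averages `prefill_tps` and `gen_tps` over all rows and picks the row with the largest `ctx_tokens`. The column names come from the CSV headers in `bench/results/`; the function name `summarize_run` and the three-row sample are illustrative assumptions, not the script used to build the table.

```python
import csv
import io


def summarize_run(csv_text: str) -> dict:
    """Summarize one ds4-bench sweep given the text of its CSV file."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    prefill = [float(r["prefill_tps"]) for r in rows]
    gen = [float(r["gen_tps"]) for r in rows]
    # The "@64k" columns are simply the row with the largest context.
    last = max(rows, key=lambda r: int(r["ctx_tokens"]))
    return {
        "frontiers": len(rows),
        "prefill_avg": round(sum(prefill) / len(prefill), 2),
        "gen_avg": round(sum(gen) / len(gen), 2),
        "prefill_last": float(last["prefill_tps"]),
        "gen_last": float(last["gen_tps"]),
    }


# First three rows of bench/results/run1.csv, used only as a small sample.
SAMPLE = """ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,367.41,128,31.37,52184460
4096,2048,308.93,128,30.69,80373132
6144,2048,287.41,128,29.88,108561804
"""

summary = summarize_run(SAMPLE)
print(summary)
```

Feeding the full 32-row file instead of the sample should reproduce the corresponding row of the table above.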

Run 4 is visibly the slowest pass, and presumably the hottest: its wall time is
about 10% above the median (527 s vs 479 s), and both averages and both `@64k`
numbers are the worst of the five. Runs 1, 2, 3 and 5 are tightly clustered,
suggesting the cool-down between runs is otherwise doing its job. Run 4 is
included as-is; the spread is part of the result.
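
The wall-time comparison can be checked directly from the summary table. This is a small sketch using only the `wall (s)` column; the variable names are illustrative.

```python
from statistics import median

# Wall-clock seconds per run, copied from the per-run summary table.
wall_s = {1: 491, 2: 467, 3: 477, 4: 527, 5: 479}

med = median(wall_s.values())
slowest = max(wall_s, key=wall_s.get)
# Fractional excess of the slowest run over the median run.
excess = wall_s[slowest] / med - 1.0

print(f"median={med} s, slowest=run {slowest}, excess={excess:.1%}")
```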

## Aggregated per-frontier throughput (n = 5)

Throughput columns are in tok/s. `σ` is the population standard deviation
(dividing by n, not n − 1) across the five runs at that frontier.

| ctx | KV (MiB) | prefill mean | prefill σ | prefill min/max | gen mean | gen σ | gen min/max |
|----:|---------:|-------------:|----------:|----------------:|---------:|------:|------------:|
| 2048 | 49.8 | 369.72 | 1.68 | 367.41 / 371.25 | 31.46 | 0.07 | 31.37 / 31.56 |
| 4096 | 76.6 | 316.62 | 4.07 | 308.93 / 320.03 | 30.85 | 0.17 | 30.61 / 31.02 |
| 6144 | 103.5 | 300.43 | 8.07 | 287.41 / 308.13 | 30.46 | 0.30 | 29.88 / 30.70 |
| 8192 | 130.4 | 293.18 | 4.23 | 287.25 / 299.70 | 30.21 | 0.40 | 29.44 / 30.58 |
| 10240 | 157.3 | 280.01 | 5.79 | 271.40 / 287.65 | 29.58 | 0.12 | 29.45 / 29.79 |
| 12288 | 184.2 | 272.51 | 7.61 | 262.67 / 284.56 | 29.03 | 0.47 | 28.24 / 29.61 |
| 14336 | 211.1 | 263.58 | 8.23 | 253.18 / 276.85 | 28.88 | 0.25 | 28.55 / 29.11 |
| 16384 | 237.9 | 261.95 | 6.76 | 253.12 / 272.18 | 28.50 | 0.80 | 27.12 / 29.31 |
| 18432 | 264.8 | 256.54 | 9.01 | 242.75 / 268.11 | 27.04 | 0.51 | 26.40 / 27.78 |
| 20480 | 291.7 | 253.61 | 6.66 | 244.86 / 263.49 | 26.96 | 0.47 | 26.30 / 27.36 |
| 22528 | 318.6 | 251.17 | 6.63 | 240.68 / 259.51 | 26.57 | 0.63 | 25.41 / 27.12 |
| 24576 | 345.5 | 247.47 | 7.40 | 237.63 / 256.34 | 26.30 | 0.67 | 25.01 / 26.88 |
| 26624 | 372.4 | 240.08 | 8.75 | 229.31 / 250.25 | 25.73 | 0.73 | 24.30 / 26.28 |
| 28672 | 399.2 | 234.19 | 10.30 | 221.75 / 246.17 | 25.46 | 0.82 | 23.84 / 26.08 |
| 30720 | 426.1 | 229.06 | 11.66 | 213.45 / 241.76 | 24.78 | 0.94 | 22.95 / 25.42 |
| 32768 | 453.0 | 222.39 | 13.50 | 202.83 / 238.00 | 24.29 | 1.27 | 21.81 / 25.22 |
| 34816 | 479.9 | 212.49 | 16.40 | 186.43 / 230.12 | 23.50 | 1.57 | 20.45 / 24.86 |
| 36864 | 506.8 | 203.31 | 18.22 | 172.81 / 222.65 | 22.92 | 2.13 | 18.80 / 24.67 |
| 38912 | 533.7 | 195.13 | 19.74 | 160.96 / 214.76 | 22.17 | 2.22 | 17.89 / 24.17 |
| 40960 | 560.5 | 189.39 | 18.41 | 157.57 / 208.13 | 21.89 | 2.00 | 18.09 / 23.75 |
| 43008 | 587.4 | 183.36 | 14.86 | 159.23 / 200.00 | 21.36 | 1.71 | 18.16 / 23.16 |
| 45056 | 614.3 | 178.57 | 13.04 | 156.37 / 194.05 | 21.01 | 1.63 | 18.01 / 22.90 |
| 47104 | 641.2 | 174.81 | 12.31 | 154.00 / 188.24 | 20.68 | 1.53 | 17.81 / 22.32 |
| 49152 | 668.1 | 171.82 | 10.44 | 153.41 / 182.24 | 20.45 | 1.34 | 17.89 / 21.78 |
| 51200 | 695.0 | 167.86 | 8.79 | 151.50 / 175.61 | 20.20 | 1.34 | 17.61 / 21.30 |
| 53248 | 721.8 | 165.84 | 7.91 | 150.99 / 172.05 | 20.17 | 1.14 | 17.99 / 21.05 |
| 55296 | 748.7 | 164.62 | 6.72 | 151.74 / 170.83 | 20.07 | 1.15 | 17.90 / 20.99 |
| 57344 | 775.6 | 164.83 | 6.44 | 152.20 / 170.15 | 20.15 | 1.03 | 18.20 / 20.90 |
| 59392 | 802.5 | 162.03 | 6.05 | 150.97 / 168.85 | 20.14 | 1.09 | 18.12 / 21.19 |
| 61440 | 829.4 | 161.08 | 5.72 | 151.35 / 168.03 | 20.29 | 1.09 | 18.32 / 21.43 |
| 63488 | 856.3 | 160.05 | 6.72 | 151.03 / 168.35 | 20.12 | 1.18 | 18.22 / 21.50 |
| 65536 | 883.1 | 161.51 | 5.86 | 150.88 / 167.57 | 20.30 | 1.13 | 18.31 / 21.53 |
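
To make the column definitions concrete, here is a sketch that recomputes the `ctx = 2048` row from the five per-run values. It uses `statistics.pstdev` for the population standard deviation and also prints the sample standard deviation (`statistics.stdev`) for contrast, since with n = 5 the two differ noticeably. The helper name `aggregate` is an assumption made for illustration.

```python
from statistics import mean, pstdev, stdev


def aggregate(values):
    """Mean, population sigma, min and max for one frontier across runs."""
    return {
        "mean": round(mean(values), 2),
        "sigma": round(pstdev(values), 2),  # population: divides by n
        "min": min(values),
        "max": max(values),
    }


# ctx = 2048 values from run1.csv .. run5.csv.
prefill_2048 = [367.41, 371.11, 371.25, 370.88, 367.94]
gen_2048 = [31.37, 31.51, 31.45, 31.56, 31.39]

prefill_row = aggregate(prefill_2048)
gen_row = aggregate(gen_2048)
# Sample sigma (divides by n - 1), shown only for comparison.
prefill_sample_sigma = round(stdev(prefill_2048), 2)
# The KV column is kvcache_bytes converted to MiB.
kv_mib = round(52184460 / 2**20, 1)

print(prefill_row, gen_row, prefill_sample_sigma, kv_mib)
```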

## Observations

- **Cold context is fast.** At `ctx = 2048` Metal hits **~370 tok/s prefill /
~31.5 tok/s generation** with σ < 2 tok/s and σ < 0.1 tok/s respectively —
the M5 Max is very repeatable when nothing is hot or memory-pressured.
- **Decode degrades smoothly with KV size.** Generation drops from 31.5 tok/s
at 2k to ~20 tok/s at 64k, a ~35% fall over a 32× context expansion. This is
consistent with KV attention reads dominating at the tail, although the sweep
does not measure that directly.
- **Prefill plateaus, doesn't collapse.** Mean prefill flattens in the
~160–168 tok/s range once context passes ~50k. The high-context floor is
stable and reproducible.
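- **Mid-context (~33k–45k) shows the highest run-to-run variance** (σ up to
~20 tok/s on prefill, ~2 tok/s on generation). Run 4 accounts for most of it:
it falls behind the other runs from about 35k and stays the slowest through
64k. σ narrows again at the tail largely because the other four runs decline
toward the same prefill floor, which is consistent with run 4 hitting thermal
load earlier in its sweep. The head of every sweep is much tighter.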
- **Best-row comparison to the README M3 Max single-run numbers**:
at the head, M5 Max prefill peaks at **371 tok/s** (M3 Max README short-prompt
prefill: 58 tok/s, measured a different way); the long-context generation
floor here is ~20 tok/s vs the README's 21.47 tok/s at 11.7k tokens for M3 Max.
These are not apples-to-apples (different prompt lengths and run shape),
but the M5 Max sustains its decode rate noticeably farther into the
context.
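
The percentages in the observations above follow from the aggregated table, and the `kvcache_bytes` column shows how the cache grows with context. The sketch below recomputes the head-to-tail drop and fits the reported cache sizes to a straight line. The linear reading only describes the numbers in the CSVs; it is not a statement about how DS4 lays out its cache internally.

```python
def relative_drop(head: float, tail: float) -> float:
    """Fractional fall from the head value to the tail value."""
    return 1.0 - tail / head


# Mean throughput at ctx = 2048 and ctx = 65536 from the aggregated table.
gen_drop = relative_drop(31.46, 20.30)
prefill_drop = relative_drop(369.72, 161.51)

# kvcache_bytes at the first and last frontier (identical in all five runs).
kv_bytes = {2048: 52184460, 65536: 926033292}
per_token = (kv_bytes[65536] - kv_bytes[2048]) / (65536 - 2048)
# Size the fitted line would report at zero context tokens.
fixed_bytes = kv_bytes[2048] - per_token * 2048

print(f"gen drop {gen_drop:.1%}, prefill drop {prefill_drop:.1%}")
print(f"{per_token:.0f} bytes/token, {fixed_bytes / 2**20:.1f} MiB fixed")
```

On these numbers the cache grows by a constant amount per token, which fits the smooth, gradual decline in generation speed.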

## Reproducing

The raw CSVs (one per run) and the captured `ds4-bench` stderr are in
`bench/results/run{1..5}.csv` and `bench/results/run{1..5}.log`. The aggregate
table above is regenerated from those CSVs.
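
One possible way to rebuild the aggregate rows from the raw files is sketched below. It assumes the `bench/results/run{N}.csv` layout described above; `load_runs` and `table_row` are hypothetical helper names, not the script that produced this report, and the KV column is omitted for brevity.

```python
import csv
from collections import defaultdict
from pathlib import Path
from statistics import mean, pstdev


def load_runs(results_dir, runs=range(1, 6)):
    """Group prefill_tps and gen_tps by ctx_tokens across run{N}.csv files."""
    by_ctx = defaultdict(lambda: {"prefill": [], "gen": []})
    for n in runs:
        with open(Path(results_dir) / f"run{n}.csv", newline="") as fh:
            for row in csv.DictReader(fh):
                ctx = int(row["ctx_tokens"])
                by_ctx[ctx]["prefill"].append(float(row["prefill_tps"]))
                by_ctx[ctx]["gen"].append(float(row["gen_tps"]))
    return dict(sorted(by_ctx.items()))


def table_row(ctx, series):
    """Format one markdown row: mean, population sigma, min / max."""
    p, g = series["prefill"], series["gen"]
    return (
        f"| {ctx} | {mean(p):.2f} | {pstdev(p):.2f} | {min(p):.2f} / {max(p):.2f} "
        f"| {mean(g):.2f} | {pstdev(g):.2f} | {min(g):.2f} / {max(g):.2f} |"
    )


# Only touch the filesystem when the results directory is actually present.
if Path("bench/results").is_dir():
    for ctx, series in load_runs("bench/results").items():
        print(table_row(ctx, series))
```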
33 changes: 33 additions & 0 deletions bench/results/run1.csv
@@ -0,0 +1,33 @@
ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,367.41,128,31.37,52184460
4096,2048,308.93,128,30.69,80373132
6144,2048,287.41,128,29.88,108561804
8192,2048,287.25,128,29.44,136750476
10240,2048,271.40,128,29.50,164939148
12288,2048,262.67,128,29.12,193127820
14336,2048,256.75,128,28.55,221316492
16384,2048,253.12,128,28.33,249505164
18432,2048,242.75,128,27.18,277693836
20480,2048,244.86,128,27.36,305882508
22528,2048,240.68,128,26.99,334071180
24576,2048,237.63,128,26.50,362259852
26624,2048,229.31,128,25.87,390448524
28672,2048,221.92,128,25.61,418637196
30720,2048,216.64,128,24.85,446825868
32768,2048,210.24,128,24.41,475014540
34816,2048,200.57,128,23.65,503203212
36864,2048,192.56,128,23.08,531391884
38912,2048,185.42,128,22.32,559580556
40960,2048,180.47,128,21.84,587769228
43008,2048,173.72,128,21.34,615957900
45056,2048,172.52,128,20.91,644146572
47104,2048,168.20,128,20.85,672335244
49152,2048,167.34,128,20.92,700523916
51200,2048,165.98,128,20.88,728712588
53248,2048,164.38,128,20.79,756901260
55296,2048,165.11,128,20.99,785089932
57344,2048,168.26,128,20.90,813278604
59392,2048,165.40,128,21.19,841467276
61440,2048,164.74,128,21.43,869655948
63488,2048,165.02,128,21.50,897844620
65536,2048,166.11,128,21.53,926033292
5 changes: 5 additions & 0 deletions bench/results/run1.log
@@ -0,0 +1,5 @@
ds4-bench: context buffers 1933.10 MiB (ctx=65665, backend=metal, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=16418)
ds4: Metal device Apple M5 Max, 128.00 GiB RAM
ds4: Metal model views created in 2.305 ms, residency requested in 286.884 ms, warmup 4.134 ms (mapped 82697.67 MiB from offset 5.08 MiB)
ds4: Metal mapped mmaped model as 2 overlapping shared buffers
ds4: metal backend initialized for graph diagnostics
33 changes: 33 additions & 0 deletions bench/results/run2.csv
@@ -0,0 +1,33 @@
ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,371.11,128,31.51,52184460
4096,2048,320.03,128,30.95,80373132
6144,2048,308.13,128,30.70,108561804
8192,2048,299.70,128,30.58,136750476
10240,2048,287.65,128,29.56,164939148
12288,2048,284.56,128,28.82,193127820
14336,2048,276.85,128,29.11,221316492
16384,2048,272.18,128,29.30,249505164
18432,2048,268.11,128,26.52,277693836
20480,2048,263.49,128,26.47,305882508
22528,2048,259.51,128,26.39,334071180
24576,2048,256.34,128,26.32,362259852
26624,2048,250.25,128,26.05,390448524
28672,2048,246.17,128,25.87,418637196
30720,2048,241.76,128,25.41,446825868
32768,2048,238.00,128,25.22,475014540
34816,2048,230.12,128,24.86,503203212
36864,2048,222.65,128,24.67,531391884
38912,2048,214.76,128,24.17,559580556
40960,2048,208.13,128,23.75,587769228
43008,2048,200.00,128,23.16,615957900
45056,2048,194.05,128,22.90,644146572
47104,2048,188.24,128,22.32,672335244
49152,2048,182.24,128,21.78,700523916
51200,2048,175.61,128,21.30,728712588
53248,2048,171.45,128,21.05,756901260
55296,2048,168.77,128,20.74,785089932
57344,2048,166.39,128,20.86,813278604
59392,2048,163.61,128,20.58,841467276
61440,2048,162.49,128,20.55,869655948
63488,2048,162.50,128,20.42,897844620
65536,2048,162.05,128,20.56,926033292
5 changes: 5 additions & 0 deletions bench/results/run2.log
@@ -0,0 +1,5 @@
ds4-bench: context buffers 1933.10 MiB (ctx=65665, backend=metal, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=16418)
ds4: Metal device Apple M5 Max, 128.00 GiB RAM
ds4: Metal model views created in 2.339 ms, residency requested in 259.419 ms, warmup 3.924 ms (mapped 82697.67 MiB from offset 5.08 MiB)
ds4: Metal mapped mmaped model as 2 overlapping shared buffers
ds4: metal backend initialized for graph diagnostics
33 changes: 33 additions & 0 deletions bench/results/run3.csv
@@ -0,0 +1,33 @@
ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,371.25,128,31.45,52184460
4096,2048,316.06,128,30.61,80373132
6144,2048,294.60,128,30.52,108561804
8192,2048,294.50,128,30.37,136750476
10240,2048,275.53,128,29.45,164939148
12288,2048,268.76,128,29.61,193127820
14336,2048,265.15,128,29.07,221316492
16384,2048,262.22,128,29.31,249505164
18432,2048,264.08,128,27.31,277693836
20480,2048,258.86,128,27.33,305882508
22528,2048,255.26,128,26.93,334071180
24576,2048,252.61,128,26.77,362259852
26624,2048,246.05,128,26.16,390448524
28672,2048,240.65,128,25.88,418637196
30720,2048,237.89,128,25.29,446825868
32768,2048,232.44,128,24.88,475014540
34816,2048,224.43,128,24.29,503203212
36864,2048,214.85,128,24.19,531391884
38912,2048,209.23,128,23.15,559580556
40960,2048,202.43,128,22.78,587769228
43008,2048,193.46,128,21.91,615957900
45056,2048,184.53,128,21.48,644146572
47104,2048,182.88,128,20.96,672335244
49152,2048,178.03,128,20.57,700523916
51200,2048,172.77,128,20.26,728712588
53248,2048,170.35,128,20.11,756901260
55296,2048,166.63,128,19.90,785089932
57344,2048,167.17,128,20.00,813278604
59392,2048,161.33,128,19.94,841467276
61440,2048,158.77,128,20.06,869655948
63488,2048,153.33,128,19.42,897844620
65536,2048,160.92,128,19.93,926033292
5 changes: 5 additions & 0 deletions bench/results/run3.log
@@ -0,0 +1,5 @@
ds4-bench: context buffers 1933.10 MiB (ctx=65665, backend=metal, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=16418)
ds4: Metal device Apple M5 Max, 128.00 GiB RAM
ds4: Metal model views created in 2.183 ms, residency requested in 266.192 ms, warmup 3.796 ms (mapped 82697.67 MiB from offset 5.08 MiB)
ds4: Metal mapped mmaped model as 2 overlapping shared buffers
ds4: metal backend initialized for graph diagnostics
33 changes: 33 additions & 0 deletions bench/results/run4.csv
@@ -0,0 +1,33 @@
ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,370.88,128,31.56,52184460
4096,2048,319.03,128,31.02,80373132
6144,2048,305.53,128,30.65,108561804
8192,2048,290.21,128,30.39,136750476
10240,2048,283.15,128,29.79,164939148
12288,2048,269.26,128,28.24,193127820
14336,2048,253.18,128,28.58,221316492
16384,2048,256.40,128,27.12,249505164
18432,2048,251.54,128,26.40,277693836
20480,2048,250.63,128,26.30,305882508
22528,2048,246.91,128,25.41,334071180
24576,2048,239.75,128,25.01,362259852
26624,2048,229.87,128,24.30,390448524
28672,2048,221.75,128,23.84,418637196
30720,2048,213.45,128,22.95,446825868
32768,2048,202.83,128,21.81,475014540
34816,2048,186.43,128,20.45,503203212
36864,2048,172.81,128,18.80,531391884
38912,2048,160.96,128,17.89,559580556
40960,2048,157.57,128,18.09,587769228
43008,2048,159.23,128,18.16,615957900
45056,2048,156.37,128,18.01,644146572
47104,2048,154.00,128,17.81,672335244
49152,2048,153.41,128,17.89,700523916
51200,2048,151.50,128,17.61,728712588
53248,2048,150.99,128,17.99,756901260
55296,2048,151.74,128,17.90,785089932
57344,2048,152.20,128,18.20,813278604
59392,2048,150.97,128,18.12,841467276
61440,2048,151.35,128,18.32,869655948
63488,2048,151.03,128,18.22,897844620
65536,2048,150.88,128,18.31,926033292
5 changes: 5 additions & 0 deletions bench/results/run4.log
@@ -0,0 +1,5 @@
ds4-bench: context buffers 1933.10 MiB (ctx=65665, backend=metal, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=16418)
ds4: Metal device Apple M5 Max, 128.00 GiB RAM
ds4: Metal model views created in 2.492 ms, residency requested in 250.995 ms, warmup 3.962 ms (mapped 82697.67 MiB from offset 5.08 MiB)
ds4: Metal mapped mmaped model as 2 overlapping shared buffers
ds4: metal backend initialized for graph diagnostics
33 changes: 33 additions & 0 deletions bench/results/run5.csv
@@ -0,0 +1,33 @@
ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,367.94,128,31.39,52184460
4096,2048,319.03,128,30.99,80373132
6144,2048,306.48,128,30.55,108561804
8192,2048,294.24,128,30.26,136750476
10240,2048,282.32,128,29.58,164939148
12288,2048,277.30,128,29.34,193127820
14336,2048,265.96,128,29.07,221316492
16384,2048,265.81,128,28.46,249505164
18432,2048,256.22,128,27.78,277693836
20480,2048,250.23,128,27.33,305882508
22528,2048,253.48,128,27.12,334071180
24576,2048,251.00,128,26.88,362259852
26624,2048,244.92,128,26.28,390448524
28672,2048,240.47,128,26.08,418637196
30720,2048,235.57,128,25.42,446825868
32768,2048,228.45,128,25.15,475014540
34816,2048,220.91,128,24.25,503203212
36864,2048,213.68,128,23.88,531391884
38912,2048,205.29,128,23.32,559580556
40960,2048,198.36,128,23.00,587769228
43008,2048,190.39,128,22.23,615957900
45056,2048,185.37,128,21.76,644146572
47104,2048,180.74,128,21.47,672335244
49152,2048,178.08,128,21.11,700523916
51200,2048,173.44,128,20.96,728712588
53248,2048,172.05,128,20.91,756901260
55296,2048,170.83,128,20.83,785089932
57344,2048,170.15,128,20.80,813278604
59392,2048,168.85,128,20.88,841467276
61440,2048,168.03,128,21.08,869655948
63488,2048,168.35,128,21.04,897844620
65536,2048,167.57,128,21.15,926033292
5 changes: 5 additions & 0 deletions bench/results/run5.log
@@ -0,0 +1,5 @@
ds4-bench: context buffers 1933.10 MiB (ctx=65665, backend=metal, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=16418)
ds4: Metal device Apple M5 Max, 128.00 GiB RAM
ds4: Metal model views created in 2.665 ms, residency requested in 8132.561 ms, warmup 4.000 ms (mapped 82697.67 MiB from offset 5.08 MiB)
ds4: Metal mapped mmaped model as 2 overlapping shared buffers
ds4: metal backend initialized for graph diagnostics