cuda: skip ordered f16 matmul on Blackwell (#121)
amarrmb wants to merge 1 commit into
Summary
On Blackwell-class CUDA GPUs tested so far, the regular 256-thread F16 decode matmul reduction is faster than the ordered 32-thread path. This patch records the CUDA compute capability at init and skips the ordered F16 decode matmul when sm_major >= 11. Older CUDA architectures keep the existing default unless explicitly overridden. Setting DS4_CUDA_FORCE_ORDERED_F16_MATMUL=1 keeps the old ordered path available for A/B testing; DS4_CUDA_NO_ORDERED_F16_MATMUL=1 continues to force the non-ordered path explicitly.

Speed-bench evidence
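For context, the gate described in the summary amounts to roughly the sketch below. All identifiers here (cuda_ctx, env_set, use_ordered_f16_matmul) are illustrative, not the names used by the actual patch:

```cpp
// Illustrative sketch of the capability gate; names are hypothetical.
#include <cstdlib>
#include <cstring>

struct cuda_ctx {
    int sm_major;  // CUDA compute capability major, recorded once at init
};

static bool env_set(const char *name) {
    const char *v = std::getenv(name);
    return v != nullptr && std::strcmp(v, "1") == 0;
}

// True when the ordered 32-thread F16 decode matmul should be used.
static bool use_ordered_f16_matmul(const cuda_ctx &ctx) {
    // Explicit overrides for A/B testing (precedence shown is illustrative).
    if (env_set("DS4_CUDA_FORCE_ORDERED_F16_MATMUL"))
        return true;
    if (env_set("DS4_CUDA_NO_ORDERED_F16_MATMUL"))
        return false;
    // Default: skip the ordered path on Blackwell-class parts (sm_major >= 11).
    return ctx.sm_major < 11;
}
```

In the real backend the major version would come from cudaDeviceProp::major at device init, so the gate itself is a branch on a cached integer rather than a per-call query.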
I used the repository speed-bench sweep for before/after comparisons on the same machine, backend, model file, context sweep, and idle background state. “Before” is the current ordered path forced with DS4_CUDA_FORCE_ORDERED_F16_MATMUL=1; “after” is the patched default.

Average generation throughput across the 2K-64K sweep:
Selected context points:
CSV and SVG artifacts were generated locally with speed-bench/plot_speed.py. I did not include them in this PR to keep the change code-only, but can add the standard speed-bench/*.csv and *_ts.svg files if you prefer benchmark artifacts in-tree.

Validation
On Jetson Thor using the clean branch on current upstream:
Both passed.
The same clean branch was also built and benchmarked on DGX Spark with make cuda-spark.