cuda: skip ordered f16 matmul on Blackwell #121

Open

amarrmb wants to merge 1 commit into antirez:main from amarrmb:thor-sm110-f16-dispatch


Conversation


@amarrmb amarrmb commented May 13, 2026

Summary

On Blackwell-class CUDA GPUs tested so far, the regular 256-thread F16 decode matmul reduction is faster than the ordered 32-thread path. This patch records the CUDA compute capability at init and skips the ordered F16 decode matmul when sm_major >= 11. Older CUDA architectures keep the existing default unless explicitly overridden.

DS4_CUDA_FORCE_ORDERED_F16_MATMUL=1 keeps the old ordered path available for A/B testing. DS4_CUDA_NO_ORDERED_F16_MATMUL=1 continues to force the non-ordered path explicitly.

Speed-bench evidence

I used the repository speed-bench sweep for before/after comparisons on the same machine, backend, model file, context sweep, and idle background state. “Before” is the current ordered path forced with DS4_CUDA_FORCE_ORDERED_F16_MATMUL=1; “after” is the patched default.

./ds4-bench \
  --cuda \
  -m ds4flash.gguf \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 \
  --ctx-max 65536 \
  --step-incr 2048 \
  --gen-tokens 128 \
  --csv <output.csv>

Average generation throughput across the 2K-64K sweep:

device              GPU      before        after         gain
Jetson Thor         sm_110   8.17 tok/s    9.66 tok/s    +18.2%
DGX Spark / GB10    sm_121   12.72 tok/s   13.33 tok/s   +4.8%
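As a quick sanity check, the gain column follows directly from the before/after averages above:

```python
# Recompute the reported gains from the averaged throughput numbers.
before_after = {
    "Jetson Thor (sm_110)": (8.17, 9.66),
    "DGX Spark / GB10 (sm_121)": (12.72, 13.33),
}
for device, (before, after) in before_after.items():
    gain = (after / before - 1) * 100
    print(f"{device}: +{gain:.1f}%")
# Jetson Thor (sm_110): +18.2%
# DGX Spark / GB10 (sm_121): +4.8%
```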

Selected context points (tok/s):

device              4K before   4K after   64K before   64K after
Jetson Thor         9.02        11.17      7.27         8.34
DGX Spark / GB10    13.88       14.55      11.66        12.19

CSV and SVG artifacts were generated locally with speed-bench/plot_speed.py; I did not include them in this PR to keep the change code-only, but can add standard speed-bench/*.csv / *_ts.svg files if you prefer benchmark artifacts in-tree.

Validation

On Jetson Thor using the clean branch on current upstream:

make ds4-bench ds4_test
./ds4_test --metal-kernels --server

Both passed.

The same clean branch was also built and benchmarked on DGX Spark with make cuda-spark.

@amarrmb amarrmb force-pushed the thor-sm110-f16-dispatch branch from eab774e to aa47b2c on May 13, 2026 at 18:39
@amarrmb amarrmb changed the title from "cuda: skip ordered f16 matmul on sm_110" to "cuda: skip ordered f16 matmul on Blackwell" on May 13, 2026