cuda: skip ordered f16 matmul on Blackwell (#121)
amarrmb wants to merge 1 commit into
Summary
On Blackwell-class CUDA GPUs tested so far, the regular 256-thread F16 decode matmul reduction is faster than the ordered 32-thread path. This patch records the CUDA compute capability at init and skips the ordered F16 decode matmul when sm_major >= 11. Older CUDA architectures keep the existing default unless explicitly overridden. Setting DS4_CUDA_FORCE_ORDERED_F16_MATMUL=1 keeps the old ordered path available for A/B testing; DS4_CUDA_NO_ORDERED_F16_MATMUL=1 continues to force the non-ordered path explicitly.

Speed-bench evidence
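For context, the gate described in the summary amounts to roughly the sketch below. All identifiers here (cuda_ctx, env_set, use_ordered_f16_matmul) are illustrative, not the names used by the actual patch:

```cpp
// Illustrative sketch of the capability gate; names are hypothetical.
#include <cstdlib>
#include <cstring>

struct cuda_ctx {
    int sm_major;  // CUDA compute capability major, recorded once at init
};

static bool env_set(const char *name) {
    const char *v = std::getenv(name);
    return v != nullptr && std::strcmp(v, "1") == 0;
}

// True when the ordered 32-thread F16 decode matmul should be used.
static bool use_ordered_f16_matmul(const cuda_ctx &ctx) {
    // Explicit overrides for A/B testing (precedence shown is illustrative).
    if (env_set("DS4_CUDA_FORCE_ORDERED_F16_MATMUL"))
        return true;
    if (env_set("DS4_CUDA_NO_ORDERED_F16_MATMUL"))
        return false;
    // Default: skip the ordered path on Blackwell-class parts (sm_major >= 11).
    return ctx.sm_major < 11;
}
```

In the real backend the major version would come from cudaDeviceProp::major at device init, so the gate itself is a branch on a cached integer rather than a per-call query.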
I used the repository speed-bench sweep for before/after comparisons on the same machine, backend, model file, context sweep, and idle background state. “Before” is the current ordered path forced with DS4_CUDA_FORCE_ORDERED_F16_MATMUL=1; “after” is the patched default.

Average generation throughput across the 2K-64K sweep:
Selected context points:
CSV and SVG artifacts were generated locally with speed-bench/plot_speed.py. I did not include them in this PR to keep the change code-only, but can add the standard speed-bench/*.csv and *_ts.svg files if you prefer benchmark artifacts in-tree.

Validation
On Jetson Thor using the clean branch on current upstream:
Both passed.
The same clean branch was also built and benchmarked on DGX Spark with make cuda-spark.