Optimize CUDA matmul for empty shards and MMQ dispatch #22170
Closed
nisparks wants to merge 2 commits into ggml-org:master from
Conversation
Add early guards in the CUDA matmul entry points so empty work does not fall through into kernel setup. Also skip zero-sized recurrent-state branches in build_rs(). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
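The kind of early guard the first commit describes can be sketched as follows. This is an illustrative host-side check, not the actual ggml code; `tensor_view` and its `ne` field are hypothetical stand-ins for the real tensor descriptor, and the guard simply returns before any kernel setup when an operand has zero elements.

```cpp
#include <cstdint>

// Hypothetical simplified tensor descriptor; the field name `ne`
// (number of elements per dimension) mirrors ggml's convention, but
// this struct is illustrative only.
struct tensor_view {
    int64_t ne[4];
};

// Sketch of an early guard for a matmul entry point: if either operand
// is empty in any dimension, there is no work to do, so the caller can
// return before touching any kernel launch configuration.
static bool matmul_is_empty(const tensor_view & src0, const tensor_view & src1) {
    for (int i = 0; i < 4; ++i) {
        if (src0.ne[i] == 0 || src1.ne[i] == 0) {
            return true;
        }
    }
    return false;
}
```

Under tensor-split MoE, shards routed zero rows hit exactly this case, which is why the guard avoids falling through into kernel setup for empty work.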
Keep the standard stream-k path on the upstream MMQ kernel and reserve the specialized MMQ path for explicit non-stream-k dispatch and tensor-split MoE. The backend defaults now only enable the MMQ tuning on the cases that actually benefit. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
JohannesGaessler (Contributor) requested changes, Apr 20, 2026:
Sorry, but this is not acceptable in terms of code quality. Since the MMQ kernel is both high-impact and high-maintenance, I will not accept PRs that duplicate large parts of the code like this. Also, from non-established contributors I will only accept performance optimizations if they are submitted as one PR per optimization, where they can show that each individual optimization is impactful.
nisparks (Author):
@JohannesGaessler I'll break it down further, but I'd encourage you to see the benefits for yourself.
nisparks (Author):
Closing since I have a follow-up PR that will add the changes in smaller chunks: #22174
Overview
Increases P2P prompt processing throughput by ~30-50% on dual RTX 3090s with NVLink on MoE models; net neutral otherwise.
Additional information
Why: stream-k splits the K work across SMs and then merges the partial tiles in a fixup step. That works well in many common cases, but in tensor-split MoE the work is already irregular and sparse, because rows are routed per expert and split across GPUs. For this workload, the stream-k decomposition/fixup overhead benchmarked worse than straight xy tiling. The faster combination was:
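The dispatch policy described above (keep stream-k on the upstream MMQ kernel; route only explicit non-stream-k dispatch and tensor-split MoE to the tuned path) can be sketched as a small host-side selector. The names here are illustrative assumptions, not the actual llama.cpp symbols.

```cpp
// Hypothetical dispatch heuristic mirroring the PR description.
// `mmq_path` and `select_mmq_path` are illustrative names only.
enum class mmq_path { upstream_stream_k, tuned_xy_tiling };

static mmq_path select_mmq_path(bool use_stream_k, bool tensor_split_moe) {
    if (tensor_split_moe) {
        // Per-expert routing across GPUs makes the work irregular and
        // sparse; stream-k's decomposition/fixup overhead loses to
        // straight xy tiling here.
        return mmq_path::tuned_xy_tiling;
    }
    if (!use_stream_k) {
        // Explicit non-stream-k dispatch also takes the tuned path.
        return mmq_path::tuned_xy_tiling;
    }
    // Default: keep the upstream MMQ kernel on the stream-k path.
    return mmq_path::upstream_stream_k;
}
```

Keeping the default on the upstream kernel is what makes the change net neutral outside the MoE/tensor-split case.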
Requirements
yes
Yes. I've been burning the midnight oil (and the unlimited-token benefit from my employer) reading the CUDA docs, trying various heuristics, profiling runs with nsys, and trying out many different things. Importantly, I made sure we didn't introduce regressions on the non-P2P or multi-GPU paths.
It's important to note that I went through many iterations with AI assistance, reviewed the code and decisions, benchmarked locally repeatedly, and tested to be sure no performance was lost on non-P2P/MoE workloads.