cuda: disable MMQ stream-k by default for MoE#22174
nisparks wants to merge 2 commits into ggml-org:master from
Conversation
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@JohannesGaessler, here is a smaller change.
JohannesGaessler
left a comment
The decision on whether or not to use stream-k should be done using only tensor properties such as the shape and explicitly not depend on --split-mode.
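A shape-only decision of the kind suggested here could be sketched as follows. This is purely illustrative: the struct, field names, and tile sizes are hypothetical stand-ins, not the actual llama.cpp MMQ code, and the threshold is an assumed heuristic.

```cpp
#include <cstdint>

// Hypothetical sketch of a stream-k decision that depends only on tensor
// shape and device properties, never on runtime flags like --split-mode.
struct mmq_shape {
    int64_t ne00; // K: shared (reduction) dimension
    int64_t ne01; // M: rows of the quantized weight matrix
    int64_t ne11; // N: columns of the activation (batch size)
};

static bool mmq_use_stream_k(const mmq_shape & s, int nsm /* SM count */) {
    // Stream-k pays off when the classic tile-per-block launch would leave
    // SMs idle. A shape-only proxy: count output tiles (assumed 128x64
    // tiling here) and compare against the number of SMs.
    const int64_t tiles = ((s.ne01 + 127) / 128) * ((s.ne11 + 63) / 64);
    // Few tiles -> poor occupancy, stream-k's load balancing helps.
    // Many tiles -> stream-k's fixup/reduction overhead dominates, skip it.
    return tiles < 4 * (int64_t) nsm;
}
```

The point of the proxy is that it stays valid regardless of how layers were distributed across GPUs, since the kernel only ever sees the local tensor shapes.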
Let me benchmark the other options now just to make sure those paths don't regress if I remove that gate.
A few days ago I started working on reducing the stream-k overhead in MMQ. I don't have Qwen 3.6 or 2x RTX 3090 handy, but on a single RTX 3090 I get this for Qwen 3.5:
Please check the performance of a176086 vs. your changes. I created the table above using these commands (adapt to your environment):
@JohannesGaessler, I see similar results. This cuts into some of the gains I saw with dual GPUs, but I still see an overall improvement in comparison. When I ran the benchmarks ungated (no tensor split-mode check), row split had a 0.1% regression and layer split had a 0.28% regression on prompt processing, averaged over 6 runs. I'm going to see if adding your changes fixes the regression I saw. I know it generally improves things, but I haven't profiled the losses on row/layer yet.
Also, that commit seems to crash when I use --split-mode row.
Looks like I don't need to gate on tensor split mode. Will be removing that now.
Pushed an update. I actually see large improvements on layer and row split, as well as with no split. Running the ./bench commands now for an A/B comparison.
Initial results with Qwen3.6 IQ4 on dual 3090s; still waiting for the other results. This is the delta between the first and second commits.
Overview
As the title says, I found that MMQ stream-k was inefficient with P2P + MoE. This PR disables it by default when that scenario occurs. All tests pass, and in MoE workloads I see a 9.67% increase in prompt processing performance on my dual 3090 + NVLink setup.
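A minimal sketch of what "disable by default for MoE" could look like, assuming the decision point receives a flag for the expert-dispatch path. The struct and function names here are hypothetical illustrations, not the PR's actual code.

```cpp
// Illustrative only: a default-off gate for stream-k on MoE matmuls.
struct mmq_args {
    bool is_mul_mat_id; // true for MoE expert dispatch (mul_mat_id-style op)
    bool use_stream_k;  // stream-k vs. classic fixed-split decomposition
};

static void mmq_pick_decomposition(mmq_args & args) {
    if (args.is_mul_mat_id) {
        // MoE splits the work into many small per-expert matmuls; in that
        // regime the stream-k fixup/reduction overhead outweighs its load
        // balancing, so fall back to the classic tile-per-block launch.
        args.use_stream_k = false;
    }
}
```

Non-MoE matmuls would pass through unchanged, keeping whatever decomposition the existing heuristics picked.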
Additional information
What it does
Why we need it
Performance
Requirements
Yes
Yes. Per my other PR, Optimize CUDA matmul for empty shards and MMQ dispatch #22170, I have been working with AI to profile and fix bottlenecks in workloads I come across. This was one of those bottlenecks. I will have more standalone improvements after this, but per the recommendation in that PR, I'm breaking the work up into smaller, standalone chunks.