
cuda: disable MMQ stream-k by default for MoE #22174

Open
nisparks wants to merge 2 commits into ggml-org:master from nisparks:pr2a-moe-streamk-gate

Conversation

@nisparks
Contributor

@nisparks nisparks commented Apr 20, 2026

Overview

As the title says, I found that MMQ stream-k was inefficient with P2P + MoE, so this PR disables it by default when that scenario occurs. All tests pass, and for MoE workloads I see a 9.67% prompt-processing speedup on my dual RTX 3090 + NVLink setup.

Additional information

  • Qwen3.6 dual-3090 tensor-split prompt: 2597.04 -> 2848.26 (+9.67%)
  • Single-GPU sanity set: effectively neutral (Qwen3.6-35B-A3B prompt -0.005%, decode -0.053%, Qwen3.5-35B-A3B prompt +0.043%, Gemma4-31B prompt +0.073%)
  • test-llama-archs passed
  • test-backend-ops -b CUDA0 -o MUL_MAT,MUL_MAT_ID passed (1858/1858)

What it does

  1. ggml-cuda.cu / common.cuh store a new flag in the CUDA backend context.
  2. mmq.cu consults that flag when deciding the MMQ use_stream_k default.
  3. mmq.cuh adds a compile-time use_stream_k=true/false specialization so the standalone non-stream-k path is stable and efficient.
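A minimal sketch of the dispatch pattern described in steps 1–3, assuming hypothetical names (`mmq_args` and `plan_tiles` are illustrative stand-ins, not the actual ggml identifiers):

```cpp
#include <cassert>

// Illustrative stand-in for the MMQ launch parameters.
struct mmq_args {
    int nrows;
    int ncols;
};

// Compile-time specialization: the bool template parameter lets the
// compiler emit two independent code paths, so the non-stream-k path
// carries no stream-k bookkeeping at runtime.
template <bool use_stream_k>
int plan_tiles(const mmq_args & args, int nsm) {
    if constexpr (use_stream_k) {
        // stream-k: partition the full tile space evenly across the SMs
        return nsm;
    } else {
        // data-parallel: one work unit per output tile
        return args.nrows * args.ncols;
    }
}

// A runtime flag (e.g. stored in the CUDA backend context, as in step 1)
// selects which instantiation to use at the launch site.
int plan(const mmq_args & args, int nsm, bool use_stream_k) {
    return use_stream_k ? plan_tiles<true>(args, nsm)
                        : plan_tiles<false>(args, nsm);
}
```

The point of the `if constexpr` split is that each instantiation is compiled as if the other branch did not exist, which is what keeps the standalone non-stream-k path lean.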

Why we need it

  • Stream-k is a good generic MMQ default, but for multi-GPU tensor-split MoE prompt work it is the wrong default.

Performance

| GPU | Model | Microbatch size | Test | t/s 4eac5b4 | t/s 42064fb | Speedup |
|-----|-------|-----------------|------|-------------|-------------|---------|
| 2x RTX 3090 | qwen35moe 35B.A3B IQ4_NL | 512 | pp2048 (tensor) | 2812.48 | 3117.56 | 1.11 |
| 2x RTX 3090 | qwen35moe 35B.A3B IQ4_NL | 512 | pp2048 (row) | 1949.66 | 2126.91 | 1.09 |
| 2x RTX 3090 | qwen35moe 35B.A3B IQ4_NL | 512 | pp2048 (layer) | 4362.49 | 5310.48 | 1.22 |
| RTX 3090 | qwen35moe 35B.A3B IQ4_NL | 512 | pp2048 (none) | 3200.05 | 3668.18 | 1.15 |
| RTX 3090 | qwen35moe 35B.A3B IQ4_NL | 512 | tg128 (none) | 136.14 | 165.75 | 1.22 |

@nisparks
Contributor Author

@JohannesGaessler, here is a smaller change.

Contributor

@JohannesGaessler JohannesGaessler left a comment


The decision on whether or not to use stream-k should be done using only tensor properties such as the shape and explicitly not depend on --split-mode.

@nisparks
Contributor Author

Let me benchmark the other options now just to make sure those paths don't regress if I remove that gate.

@JohannesGaessler
Contributor

A few days ago I started working on reducing the stream-k overhead in MMQ. I don't have Qwen 3.6 or 2x RTX 3090 handy but on a single RTX 3090 I get this for Qwen 3.5:

| GPU | Model | Microbatch size | Test | t/s a6cc43c | t/s a176086 | Speedup |
|-----|-------|-----------------|------|-------------|-------------|---------|
| RTX 3090 | qwen35moe 35B.A3B Q4_0 | 16 | pp2048 | 650.60 | 745.07 | 1.15 |
| RTX 3090 | qwen35moe 35B.A3B Q4_0 | 32 | pp2048 | 929.85 | 1042.95 | 1.12 |
| RTX 3090 | qwen35moe 35B.A3B Q4_0 | 64 | pp2048 | 1266.45 | 1360.01 | 1.07 |
| RTX 3090 | qwen35moe 35B.A3B Q4_0 | 128 | pp2048 | 1449.18 | 1509.50 | 1.04 |
| RTX 3090 | qwen35moe 35B.A3B Q4_0 | 256 | pp2048 | 2158.26 | 2241.32 | 1.04 |
| RTX 3090 | qwen35moe 35B.A3B Q4_0 | 512 | pp2048 | 2958.78 | 3066.14 | 1.04 |
| RTX 3090 | qwen35moe 35B.A3B Q4_0 | 1024 | pp2048 | 3667.64 | 3825.24 | 1.04 |
| RTX 3090 | qwen35moe 35B.A3B Q4_0 | 2048 | pp2048 | 4148.81 | 4347.48 | 1.05 |

Please check the performance of a176086 vs. your changes, I created the table above using these commands (adapt to your environment):

```shell
./bench --model models/opt/${mn}-${q}.gguf -r 10 -fa 1 -n 0 -p 2048 -ub "16-2048*2" -o sql|sqlite3 llama-bench.sqlite
py scripts/compare-llama-bench.py -s gpu_info,model_type,n_ubatch -i llama-bench.sqlite -b a6cc43c286a2ebc42 -c a1760869a8c49c5e5|tee bench.txt
```

@github-actions bot added labels: Nvidia GPU (issues specific to Nvidia GPUs), ggml (changes relating to the ggml tensor library for machine learning) — Apr 20, 2026
@nisparks
Contributor Author

@JohannesGaessler, I see similar results. Your change absorbs some of the gains I saw with dual GPU, but my changes still show additional gains on top when compared.

When I ran the benchmarks with an ungated build (no tensor-split-mode check), row split had a 0.1% regression and layer split had a 0.28% regression on prompt, averaged over 6 runs.

Going to see if adding your changes fixes the regressions I saw. I know your change generally improves things, but I haven't profiled the row/layer losses yet.

@nisparks
Contributor Author

Also that commit seems to crash when I use split-mode row.

@nisparks
Contributor Author

Looks like I don't need to gate on tensor split mode. Will be removing that now.

@nisparks
Contributor Author

nisparks commented Apr 20, 2026

Pushed the update. I actually see large improvements on layer and row split, as well as with no split (single GPU). Running the ./bench commands now for an A/B comparison.

@nisparks
Contributor Author

Initial results with Qwen3.6 IQ4 on dual 3090s; still waiting on the remaining results. This is the delta between the first and second commits.

| Case | Tensor-only PR2a | All-split PR2a | Delta |
|------|------------------|----------------|-------|
| dual tensor | 3114.66 | 3125.14 | +0.34% |
| dual row | 1949.21 | 2129.35 | +9.24% |
| dual layer | 4361.89 | 5315.90 | +21.87% |
| single prompt | 3197.55 | 3670.41 | +14.79% |
| single decode | 134.87 | 145.64 | +7.98% |
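The Delta column above is just the relative throughput change between the two commits; a quick sketch of that arithmetic (`delta_pct` is an illustrative helper, not part of the PR):

```cpp
#include <cassert>
#include <cmath>

// Relative change, in percent, between two throughput measurements (t/s).
double delta_pct(double tps_before, double tps_after) {
    return (tps_after / tps_before - 1.0) * 100.0;
}
```

For example, the dual-layer row works out to (5315.90 / 4361.89 - 1) * 100 ≈ +21.87%.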

@nisparks
Contributor Author

| GPU | Model | Microbatch size | Test | t/s 4eac5b4 | t/s 42064fb | Speedup |
|-----|-------|-----------------|------|-------------|-------------|---------|
| 2x RTX 3090 | qwen35moe 35B.A3B IQ4_NL | 512 | pp2048 (tensor) | 2812.48 | 3117.56 | 1.11 |
| 2x RTX 3090 | qwen35moe 35B.A3B IQ4_NL | 512 | pp2048 (row) | 1949.66 | 2126.91 | 1.09 |
| 2x RTX 3090 | qwen35moe 35B.A3B IQ4_NL | 512 | pp2048 (layer) | 4362.49 | 5310.48 | 1.22 |
| RTX 3090 | qwen35moe 35B.A3B IQ4_NL | 512 | pp2048 (none) | 3200.05 | 3668.18 | 1.15 |
| RTX 3090 | qwen35moe 35B.A3B IQ4_NL | 512 | tg128 (none) | 136.14 | 165.75 | 1.22 |

@nisparks nisparks changed the title cuda: disable MMQ stream-k by default for tensor-split MoE cuda: disable MMQ stream-k by default for MoE Apr 20, 2026