
cuda: disable MMQ stream-k by default for MoE #22174

Open
nisparks wants to merge 2 commits into ggml-org:master from nisparks:pr2a-moe-streamk-gate

Conversation

@nisparks
Contributor

@nisparks nisparks commented Apr 20, 2026

Overview

As the title says, I found that MMQ stream-k was inefficient with P2P + MoE, so this PR disables it by default when that scenario occurs. All tests pass, and for MoE workloads I see a 9.67% prompt-processing speedup on my dual RTX 3090 + NVLink setup.

Additional information

  • Qwen3.6 dual-3090 tensor-split prompt: 2597.04 -> 2848.26 (+9.67%)
  • Single-GPU sanity set: effectively neutral (Qwen3.6-35B-A3B prompt -0.005%, decode -0.053%, Qwen3.5-35B-A3B prompt +0.043%, Gemma4-31B prompt +0.073%)
  • test-llama-archs passed
  • test-backend-ops -b CUDA0 -o MUL_MAT,MUL_MAT_ID passed (1858/1858)

What it does

  1. ggml-cuda.cu / common.cuh store a new flag in the CUDA backend context.
  2. mmq.cu consults that flag when deciding the MMQ use_stream_k default.
  3. mmq.cuh adds a compile-time use_stream_k=true/false specialization so the standalone non-stream-k path is stable and efficient.
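A minimal sketch of the dispatch pattern described in steps 1–3, assuming hypothetical names (`mmq_args` and `plan_tiles` are illustrative stand-ins, not the actual ggml identifiers):

```cpp
#include <cassert>

// Illustrative stand-in for the MMQ launch parameters.
struct mmq_args {
    int nrows;
    int ncols;
};

// Compile-time specialization: the bool template parameter lets the
// compiler emit two independent code paths, so the non-stream-k path
// carries no stream-k bookkeeping at runtime.
template <bool use_stream_k>
int plan_tiles(const mmq_args & args, int nsm) {
    if constexpr (use_stream_k) {
        // stream-k: partition the full tile space evenly across the SMs
        return nsm;
    } else {
        // data-parallel: one work unit per output tile
        return args.nrows * args.ncols;
    }
}

// A runtime flag (e.g. stored in the CUDA backend context, as in step 1)
// selects which instantiation to use at the launch site.
int plan(const mmq_args & args, int nsm, bool use_stream_k) {
    return use_stream_k ? plan_tiles<true>(args, nsm)
                        : plan_tiles<false>(args, nsm);
}
```

The point of the `if constexpr` split is that each instantiation is compiled as if the other branch did not exist, which is what keeps the standalone non-stream-k path lean.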

Why we need it

  • Stream-k is a good generic MMQ default, but for multi-GPU tensor-split MoE prompt work it is the wrong default.

Performance

| GPU | Model | Microbatch size | Test | t/s 4eac5b4 | t/s 42064fb | Speedup |
|-----|-------|-----------------|------|-------------|-------------|---------|
| 2x RTX 3090 | qwen35moe 35B.A3B IQ4_NL | 512 | pp2048 (tensor) | 2812.48 | 3117.56 | 1.11 |
| 2x RTX 3090 | qwen35moe 35B.A3B IQ4_NL | 512 | pp2048 (row) | 1949.66 | 2126.91 | 1.09 |
| 2x RTX 3090 | qwen35moe 35B.A3B IQ4_NL | 512 | pp2048 (layer) | 4362.49 | 5310.48 | 1.22 |
| RTX 3090 | qwen35moe 35B.A3B IQ4_NL | 512 | pp2048 (none) | 3200.05 | 3668.18 | 1.15 |
| RTX 3090 | qwen35moe 35B.A3B IQ4_NL | 512 | tg128 (none) | 136.14 | 165.75 | 1.22 |

@nisparks
Contributor Author

@JohannesGaessler, here is a smaller change.

Contributor

@JohannesGaessler JohannesGaessler left a comment


The decision on whether or not to use stream-k should be done using only tensor properties such as the shape and explicitly not depend on --split-mode.

@nisparks
Contributor Author

Let me benchmark the other options now just to make sure those paths don't regress if I remove that gate.

@JohannesGaessler
Contributor

A few days ago I started working on reducing the stream-k overhead in MMQ. I don't have Qwen 3.6 or 2x RTX 3090 handy but on a single RTX 3090 I get this for Qwen 3.5:

| GPU | Model | Microbatch size | Test | t/s a6cc43c | t/s a176086 | Speedup |
|-----|-------|-----------------|------|-------------|-------------|---------|
| RTX 3090 | qwen35moe 35B.A3B Q4_0 | 16 | pp2048 | 650.60 | 745.07 | 1.15 |
| RTX 3090 | qwen35moe 35B.A3B Q4_0 | 32 | pp2048 | 929.85 | 1042.95 | 1.12 |
| RTX 3090 | qwen35moe 35B.A3B Q4_0 | 64 | pp2048 | 1266.45 | 1360.01 | 1.07 |
| RTX 3090 | qwen35moe 35B.A3B Q4_0 | 128 | pp2048 | 1449.18 | 1509.50 | 1.04 |
| RTX 3090 | qwen35moe 35B.A3B Q4_0 | 256 | pp2048 | 2158.26 | 2241.32 | 1.04 |
| RTX 3090 | qwen35moe 35B.A3B Q4_0 | 512 | pp2048 | 2958.78 | 3066.14 | 1.04 |
| RTX 3090 | qwen35moe 35B.A3B Q4_0 | 1024 | pp2048 | 3667.64 | 3825.24 | 1.04 |
| RTX 3090 | qwen35moe 35B.A3B Q4_0 | 2048 | pp2048 | 4148.81 | 4347.48 | 1.05 |

Please check the performance of a176086 vs. your changes, I created the table above using these commands (adapt to your environment):

```shell
./bench --model models/opt/${mn}-${q}.gguf -r 10 -fa 1 -n 0 -p 2048 -ub "16-2048*2" -o sql|sqlite3 llama-bench.sqlite
py scripts/compare-llama-bench.py -s gpu_info,model_type,n_ubatch -i llama-bench.sqlite -b a6cc43c286a2ebc42 -c a1760869a8c49c5e5|tee bench.txt
```

@github-actions bot added labels: Nvidia GPU (issues specific to Nvidia GPUs), ggml (changes relating to the ggml tensor library for machine learning) — Apr 20, 2026
@nisparks
Contributor Author

@JohannesGaessler, I see similar results. Your change absorbs some of the gains I saw with dual GPU, but my changes still show additional gains on top when compared.

When I ran the benchmarks with an ungated build (no tensor-split-mode check), row split had a 0.1% regression and layer split had a 0.28% regression on prompt, averaged over 6 runs.

Going to see if adding your changes fixes the regressions I saw. I know your change generally improves things, but I haven't profiled the row/layer losses yet.

@nisparks
Contributor Author

Also that commit seems to crash when I use split-mode row.

@nisparks
Contributor Author

Looks like I don't need to gate on tensor split mode. Will be removing that now.

@nisparks
Contributor Author

nisparks commented Apr 20, 2026

Pushed the update. I actually see large improvements on layer and row split, as well as with no split (single GPU). Running the ./bench commands now for an A/B comparison.

@nisparks
Contributor Author

Initial results with Qwen3.6 IQ4 on dual 3090s; still waiting on the remaining results. This is the delta between the first and second commits.

| Case | Tensor-only PR2a | All-split PR2a | Delta |
|------|------------------|----------------|-------|
| dual tensor | 3114.66 | 3125.14 | +0.34% |
| dual row | 1949.21 | 2129.35 | +9.24% |
| dual layer | 4361.89 | 5315.90 | +21.87% |
| single prompt | 3197.55 | 3670.41 | +14.79% |
| single decode | 134.87 | 145.64 | +7.98% |
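The Delta column above is just the relative throughput change between the two commits; a quick sketch of that arithmetic (`delta_pct` is an illustrative helper, not part of the PR):

```cpp
#include <cassert>
#include <cmath>

// Relative change, in percent, between two throughput measurements (t/s).
double delta_pct(double tps_before, double tps_after) {
    return (tps_after / tps_before - 1.0) * 100.0;
}
```

For example, the dual-layer row works out to (5315.90 / 4361.89 - 1) * 100 ≈ +21.87%.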

@nisparks
Contributor Author

| GPU | Model | Microbatch size | Test | t/s 4eac5b4 | t/s 42064fb | Speedup |
|-----|-------|-----------------|------|-------------|-------------|---------|
| 2x RTX 3090 | qwen35moe 35B.A3B IQ4_NL | 512 | pp2048 (tensor) | 2812.48 | 3117.56 | 1.11 |
| 2x RTX 3090 | qwen35moe 35B.A3B IQ4_NL | 512 | pp2048 (row) | 1949.66 | 2126.91 | 1.09 |
| 2x RTX 3090 | qwen35moe 35B.A3B IQ4_NL | 512 | pp2048 (layer) | 4362.49 | 5310.48 | 1.22 |
| RTX 3090 | qwen35moe 35B.A3B IQ4_NL | 512 | pp2048 (none) | 3200.05 | 3668.18 | 1.15 |
| RTX 3090 | qwen35moe 35B.A3B IQ4_NL | 512 | tg128 (none) | 136.14 | 165.75 | 1.22 |

@nisparks nisparks changed the title cuda: disable MMQ stream-k by default for tensor-split MoE cuda: disable MMQ stream-k by default for MoE Apr 20, 2026