Optimize CUDA matmul for empty shards and MMQ dispatch #22170
Closed
nisparks wants to merge 2 commits into ggml-org:master from
Conversation
Add early guards in the CUDA matmul entry points so empty work does not fall through into kernel setup. Also skip zero-sized recurrent-state branches in build_rs(). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
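The kind of early guard the first commit describes can be sketched as follows. This is an illustrative host-side check, not the actual ggml code; `tensor_view` and its `ne` field are hypothetical stand-ins for the real tensor descriptor, and the guard simply returns before any kernel setup when an operand has zero elements.

```cpp
#include <cstdint>

// Hypothetical simplified tensor descriptor; the field name `ne`
// (number of elements per dimension) mirrors ggml's convention, but
// this struct is illustrative only.
struct tensor_view {
    int64_t ne[4];
};

// Sketch of an early guard for a matmul entry point: if either operand
// is empty in any dimension, there is no work to do, so the caller can
// return before touching any kernel launch configuration.
static bool matmul_is_empty(const tensor_view & src0, const tensor_view & src1) {
    for (int i = 0; i < 4; ++i) {
        if (src0.ne[i] == 0 || src1.ne[i] == 0) {
            return true;
        }
    }
    return false;
}
```

Under tensor-split MoE, shards routed zero rows hit exactly this case, which is why the guard avoids falling through into kernel setup for empty work.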
Keep the standard stream-k path on the upstream MMQ kernel and reserve the specialized MMQ path for explicit non-stream-k dispatch and tensor-split MoE. The backend defaults now only enable the MMQ tuning on the cases that actually benefit. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
JohannesGaessler (Contributor) requested changes, Apr 20, 2026:
Sorry, but this is not acceptable in terms of code quality. Since the MMQ kernel is both high-impact and high-maintenance, I will not accept PRs that duplicate large parts of the code like this. Also, from non-established contributors I will only accept performance optimizations if they are submitted as one PR per optimization, where they can show that each individual optimization is impactful.
nisparks (Author):
@JohannesGaessler I'll break it down further, but I'd encourage you to see the benefits for yourself.
nisparks (Author):
Closing since I have a follow-up PR that will add the changes in smaller chunks: #22174
Overview
Increases P2P prompt processing throughput by ~30-50% on dual RTX 3090s with NVLink on MoE models; net neutral otherwise.
Additional information
Why: stream-k splits the K work across SMs and then merges the partial tiles in a fixup step. That works well in many common cases, but in tensor-split MoE the work is already irregular and sparse, because rows are routed per expert and split across GPUs. For this workload, the stream-k decomposition/fixup overhead benchmarked worse than straight xy tiling. The faster combination was:
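The dispatch policy described above (keep stream-k on the upstream MMQ kernel; route only explicit non-stream-k dispatch and tensor-split MoE to the tuned path) can be sketched as a small host-side selector. The names here are illustrative assumptions, not the actual llama.cpp symbols.

```cpp
// Hypothetical dispatch heuristic mirroring the PR description.
// `mmq_path` and `select_mmq_path` are illustrative names only.
enum class mmq_path { upstream_stream_k, tuned_xy_tiling };

static mmq_path select_mmq_path(bool use_stream_k, bool tensor_split_moe) {
    if (tensor_split_moe) {
        // Per-expert routing across GPUs makes the work irregular and
        // sparse; stream-k's decomposition/fixup overhead loses to
        // straight xy tiling here.
        return mmq_path::tuned_xy_tiling;
    }
    if (!use_stream_k) {
        // Explicit non-stream-k dispatch also takes the tuned path.
        return mmq_path::tuned_xy_tiling;
    }
    // Default: keep the upstream MMQ kernel on the stream-k path.
    return mmq_path::upstream_stream_k;
}
```

Keeping the default on the upstream kernel is what makes the change net neutral outside the MoE/tensor-split case.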
Requirements
yes
Yes. I've been burning the midnight oil (and the unlimited-token benefit from my employer) reading the CUDA docs, trying various heuristics, profiling runs with nsys, and trying out many different things. Importantly, I made sure we didn't introduce regressions on the non-P2P or multi-GPU paths.
It's important to note that I went through many iterations with AI assistance, reviewed the code and decisions, benchmarked locally repeatedly, and tested to be sure no performance was lost on non-P2P/MoE workloads.