
Optimize CUDA matmul for empty shards and MMQ dispatch #22170

Closed
nisparks wants to merge 2 commits into ggml-org:master from nisparks:pr3-mmq-cpasync

Conversation

@nisparks
Contributor

Overview

Increases P2P prompt processing by ~30-50% on dual 3090s with NVLink for MoE models; net neutral otherwise.

Additional information

  • llama-context.cpp passes disable_mmq_stream_k_default=1 only when split_mode == tensor && n_expert > 0.
  • For ordinary dense / single-GPU cases, the standard stream-k path stays effectively upstream-style.

Why: stream-k splits the K work across SMs and then has to merge partial tiles in a fixup step. That works well in many common cases, but in tensor-split MoE the work is already irregular and sparse because rows are routed per expert and split across GPUs. For this workload, the stream-k decomposition/fixup overhead benchmarked worse than straight xy tiling. The faster combination was:

  • disable stream-k for that specific workload, and
  • use cp.async to make the remaining tile loads cheaper.

Requirements

  • I have read and agree with the contributing guidelines
    yes
  • AI usage disclosure:
Yes. Been burning the midnight oil (and the unlimited-token benefit from my employer) to read the CUDA docs, try various heuristics, profile runs with nsys, and try out many different approaches. Importantly, I made sure that we didn't experience regressions on the non-P2P or multi-GPU path.

Important to note: I went through many iterations with AI assistance, reviewed the code and decisions myself, benchmarked locally over and over, and tested to be sure no performance was lost on non-P2P/MoE workloads.

| Scenario | Model | Metric | Upstream | Full changes | Delta | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| 1x3090 | Qwen3.5-27B IQ4_NL | prompt | 1344.37 | 1345.18 | +0.06% | effectively neutral |
| 1x3090 | Gemma4-31B IQ4_NL | prompt | 1360.80 | 1361.33 | +0.04% | effectively neutral |
| 1x3090 | Qwen3.6-35B-A3B IQ4_NL-pure | prompt | 3254.25 | 3254.02 | -0.01% | effectively neutral |
| 1x3090 | Qwen3.6-35B-A3B IQ4_NL-pure | decode | 176.08 | 176.47 | +0.22% | effectively neutral |
| 2x3090, tensor split | Qwen3.6-35B-A3B IQ4_NL-pure | prompt | 3021.22 | 3886.13 | +28.63% | main win |
| 2x3090, tensor split + P2P | Qwen3.6-35B-A3B IQ4_NL-pure | prompt | 4154.61 | 6209.02 | +49.45% | main win with peer access enabled |
| 2x3090, tensor split | Qwen3.6-35B-A3B Q6_K | prompt | 2856.65 | 3844.98 | +34.60% | gain holds at higher precision |
| 2x3090, tensor split | Qwen3.6-35B-A3B Q8_0 | prompt | 3062.58 | 4035.17 | +31.76% | gain holds at higher precision |

nisparks and others added 2 commits April 20, 2026 12:46
Add early guards in the CUDA matmul entry points so empty work does not
fall through into kernel setup. Also skip zero-sized recurrent-state
branches in build_rs().

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Keep the standard stream-k path on the upstream MMQ kernel and reserve
the specialized MMQ path for explicit non-stream-k dispatch and
tensor-split MoE. The backend defaults now only enable the MMQ tuning
on the cases that actually benefit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@nisparks nisparks requested review from a team, CISC and ggerganov as code owners April 20, 2026 13:52
Contributor

@JohannesGaessler JohannesGaessler left a comment


Sorry but this is not acceptable in terms of code quality. Since the MMQ kernel is both high impact and high maintenance I will not accept PRs that duplicate large parts of the code like this. Also, from non-established contributors I will only accept performance optimizations if they are submitted with one PR per optimization where they can show for each individual optimization that it is impactful.

@nisparks
Contributor Author

@JohannesGaessler I'll break it down further, but I'd encourage you to see the benefits for yourself.

@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Apr 20, 2026
@nisparks
Contributor Author

Closing since I have a follow up PR that will add in smaller chunks: #22174

@nisparks nisparks closed this Apr 20, 2026