fix(gemm_a8w8_bpreshuffle): pass splitK/KBatch to CK kernels#2335

Open
AviralGoelAMD wants to merge 3 commits into main from aviralgoel/fix-splitk-passthrough-bpreshuffle

@AviralGoelAMD AviralGoelAMD commented Mar 18, 2026

Summary

  • The tune entry points accepted the splitK parameter and computed KBatch = 2^splitK, but never forwarded KBatch to the CK/CKTile kernels; it was hardcoded to 1 in both paths.
  • Thread KBatch through the full dispatch chain: tune.cu → dispatch → generated kernel wrapper → impl → CK MakeArgument / CKTile args.k_batch.
  • Inference (non-tune) paths are unaffected, because the new parameter is declared with a default value (int KBatch = 1).
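
The plumbing described in the summary can be sketched roughly as follows. This is a minimal, host-only illustration and not the actual aiter code: `GemmArgs`, `make_gemm_args`, `kernel_wrapper`, `tune_entry`, and `inference_entry` are hypothetical names standing in for the CK/CKTile argument struct, the impl, the generated wrapper, and the two entry points. Only the `KBatch = 2^splitK` relation, the `int KBatch = 1` default, and the `k_batch` field name come from the PR description.

```cpp
#include <cassert>

// Hypothetical stand-in for the kernel argument struct (CKTile exposes a
// k_batch field per the PR description; everything else is illustrative).
struct GemmArgs {
    int M, N, K;
    int k_batch;
};

// Hypothetical impl. KBatch defaults to 1, so callers that do not pass it
// (the inference path) keep their previous behavior.
inline GemmArgs make_gemm_args(int M, int N, int K, int KBatch = 1) {
    assert(KBatch >= 1);
    return GemmArgs{M, N, K, KBatch};  // before the fix this was always 1
}

// Hypothetical generated kernel wrapper: forwards KBatch instead of dropping it.
inline GemmArgs kernel_wrapper(int M, int N, int K, int KBatch = 1) {
    return make_gemm_args(M, N, K, KBatch);
}

// Hypothetical tune entry point: converts the splitK exponent into KBatch.
inline GemmArgs tune_entry(int M, int N, int K, int splitK) {
    assert(splitK >= 0 && splitK < 31);
    const int KBatch = 1 << splitK;  // KBatch = 2^splitK
    return kernel_wrapper(M, N, K, KBatch);
}

// Hypothetical inference entry point: relies on the default KBatch = 1.
inline GemmArgs inference_entry(int M, int N, int K) {
    return kernel_wrapper(M, N, K);
}
```

With this shape, `tune_entry(128, 256, 4096, 3)` reaches the argument struct with `k_batch == 8`, while `inference_entry(128, 256, 4096)` still produces `k_batch == 1`.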

@AviralGoelAMD AviralGoelAMD requested review from a team and Copilot March 18, 2026 17:56
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:sglang | SGLang integration tests |
| ci:atom | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| ci:vllm | vLLM benchmark |
| ci:all | All of the above |

Add labels via the sidebar or gh pr edit 2335 --add-label <label>

Contributor

Copilot AI left a comment

Pull request overview

This PR plumbs a KBatch (split-K batch) parameter through the FP8 a8w8 bpreshuffle GEMM implementations for both the CK and CKTile backends, enabling the tuning path to run kernels with split-K.

Changes:

  • Add KBatch parameter to CK/CKTile GEMM implementation entrypoints and pass it into kernel args.
  • Update code generation (gen_instances.py) so generated kernel wrappers accept KBatch and forward it to the impl.
  • Update tune dispatchers to pass the computed KBatch into the selected kernel.
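
For context on what KBatch controls, here is a small CPU reference of split-K GEMM. It is only a sketch of the general technique, not the CK or CKTile kernel: the function name `gemm_split_k` is hypothetical, and int8 inputs with int32 accumulation are used as a stand-in for FP8 so that partial sums add exactly regardless of order. The K dimension is cut into KBatch chunks, each chunk produces a partial product, and the partials are accumulated into the same output.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference split-K GEMM: C[M x N] = A[M x K] * B[K x N], all row-major.
// Each of the KBatch chunks covers a contiguous slice of K and adds its
// partial result into C. On a GPU the chunks would run in parallel and be
// combined with atomics or a reduction pass; here they run sequentially.
std::vector<int32_t> gemm_split_k(const std::vector<int8_t>& A,
                                  const std::vector<int8_t>& B,
                                  int M, int N, int K, int KBatch = 1) {
    assert(KBatch >= 1);
    std::vector<int32_t> C(static_cast<std::size_t>(M) * N, 0);
    const int chunk = (K + KBatch - 1) / KBatch;  // ceil(K / KBatch)
    for (int kb = 0; kb < KBatch; ++kb) {
        const int k_begin = kb * chunk;
        const int k_end = std::min(K, k_begin + chunk);
        for (int m = 0; m < M; ++m) {
            for (int n = 0; n < N; ++n) {
                int32_t partial = 0;
                for (int k = k_begin; k < k_end; ++k) {
                    partial += static_cast<int32_t>(A[m * K + k]) *
                               static_cast<int32_t>(B[k * N + n]);
                }
                C[m * N + n] += partial;  // accumulate this chunk's share
            }
        }
    }
    return C;
}
```

Because integer addition is exact, this reference returns the same C for any KBatch. With floating-point accumulation the result can differ slightly between KBatch values, since the summation order changes.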

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| csrc/cktile_gemm_a8w8_bpreshuffle/include/gemm_a8w8_bpreshuffle_cktile_common.cuh | Add KBatch argument and plumb into args.k_batch. |
| csrc/cktile_gemm_a8w8_bpreshuffle/gen_instances.py | Generate kernel wrappers/manifest declarations with KBatch. |
| csrc/cktile_gemm_a8w8_bpreshuffle/gemm_a8w8_bpreshuffle_cktile_tune.cu | Update dispatch function types/calls to pass KBatch. |
| csrc/ck_gemm_a8w8_bpreshuffle/include/gemm_a8w8_bpreshuffle_common.cuh | Add KBatch argument and plumb into CK MakeArgument. |
| csrc/ck_gemm_a8w8_bpreshuffle/gen_instances.py | Generate kernel wrappers/manifest declarations with KBatch. |
| csrc/ck_gemm_a8w8_bpreshuffle/gemm_a8w8_bpreshuffle_tune.cu | Update dispatch function types/calls to pass KBatch. |

@AviralGoelAMD
Author

I've resolved all the Copilot comments.

@yzhou103
Contributor

I think the non-tune path should stay consistent with the tuned path. If a configuration with splitK > 1 turns out to be the best one, the splitK parameter should also be passed to the CK kernels.
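
One way to read this suggestion, sketched under stated assumptions: the non-tune path looks up the tuned configuration for the problem shape and derives KBatch from the stored splitK, falling back to 1 when no entry exists. The names `TunedConfig`, `ShapeKey`, and `select_kbatch`, and the use of an in-memory `std::map`, are hypothetical; the thread does not show how aiter stores or looks up tuned configurations.

```cpp
#include <cassert>
#include <map>
#include <tuple>

// Hypothetical record of a tuning result for one problem shape.
struct TunedConfig {
    int kernel_id;  // which generated kernel instance won tuning
    int splitK;     // exponent; KBatch = 2^splitK
};

// Hypothetical lookup key: the (M, N, K) problem shape.
using ShapeKey = std::tuple<int, int, int>;

// Returns the KBatch the non-tune path would forward to the kernel.
// Shapes without a tuned entry fall back to KBatch = 1 (no split-K).
inline int select_kbatch(const std::map<ShapeKey, TunedConfig>& table,
                         int M, int N, int K) {
    const auto it = table.find(ShapeKey{M, N, K});
    if (it == table.end()) {
        return 1;
    }
    assert(it->second.splitK >= 0 && it->second.splitK < 31);
    return 1 << it->second.splitK;
}
```

Under this sketch, a shape tuned with splitK = 2 would run with KBatch = 4 in inference as well, while untuned shapes keep the current KBatch = 1 behavior.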

3 participants