
Add FP8 Support For CK Tile Group GEMM #475

Draft
aris134 wants to merge 70 commits into dev from amartin/ck-grouped-gemm-fp8

Conversation

@aris134 aris134 commented Mar 6, 2026

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes https://github.com/ROCm/frameworks-internal/issues/15787

TODO:

  • Add support for other architectures (e.g., MI350X)
  • Add support for other quantization modes
  • Performance analysis and tuning

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Enables mixed precision (fp8/bf8 FNUZ variants) support for CK tile grouped GEMM with tensor quantization

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

matthiasdiener and others added 18 commits February 23, 2026 13:10
Align GemmRowColTensorQuantPipelineProblem with ck_tile V3 requirements by using AccType for intermediate C results. Specific to TensorQuant (per-tensor scaling); limited to e4m3/e5m2 FNUZ formats. Updates test_numerics.py to exercise FP8 inputs in the grouped linear accuracy suite.
Enable mixed FP8/BF8 grouped GEMM for the CK backend used by GroupedLinear backward.

Certain mixed-type combinations normalize to (AType=bf8_t, BType=fp8_t), but CK currently lacks a corresponding warp GEMM specialization for WarpGemmMfma_f32_32x32x32_bf8_fp8. This prevents the default FP8 tile configuration (K_Warp_Tile=32) from compiling or dispatching correctly.

To address this, a fallback tile policy is introduced that routes the (bf8_t, fp8_t) case to a supported kernel configuration using K_Warp_Tile=16. This preserves correct GEMM operand ordering and avoids unsafe operand-swapping workarounds.

Notes:
- Only tensor quantization mode is currently supported.
- Implementation targets MI300X (CDNA3) FP8/BF8 kernels.
- Additional kernel coverage may be required for MI350X (CDNA4).

With this change, mixed FP8/BF8 backprop paths are supported and all parametrized unit tests in test_grouped_linear_accuracy_cutlass() pass successfully.
@aris134 aris134 self-assigned this Mar 6, 2026
aris134 and others added 6 commits March 6, 2026 15:31
…e shared runner abstraction

This commit restructures the CK grouped GEMM implementation to improve
maintainability and better separate datatype-specific logic.

Key changes:

• Split the original single-source implementation into separate files for
  FP16 and FP8 grouped GEMM kernels.

• Introduced a shared header defining a common abstraction for grouped GEMM
  runners. The design is similar in spirit to the Primus Turbo dispatch.

• Added an abstract parent class that encapsulates the common interface and
  provides an overloaded operator() / run() entrypoint for launching
  kernels. Concrete runners implement datatype-specific behavior while
  sharing the same invocation path.

• Introduced a GroupedGemmRunContext structure that carries runtime
  configuration (layout, splits, pointers, etc.) through the dispatch
  pipeline. This removes large argument lists and centralizes execution
  state.

• Refactored dispatch code to construct the appropriate runner and invoke it
  through the unified interface.

• Added documentation comments explaining the new structure and the
  responsibilities of each component (context, runner base class, and
  datatype-specific implementations).

Functional behavior is unchanged. The refactor preserves the previous
execution paths and continues to pass all existing Transformer Engine
tests that exercised the original implementation.
Introduce extern template declarations and dedicated instantiation translation
units for GroupedGemmRunner and QuantGroupedGemmRunner.

This moves template instantiation for the supported dtype/layout/tile
combinations into separate compilation units to reduce duplicate template
instantiation across translation units and better isolate kernel codegen.

No functional changes; this is purely a build/compile-time refactor.
Break up the previous monolithic templated header into separate FP16 and FP8
implementation paths.

Key changes:
- Introduce a lightweight common header (ck_grouped_gemm_common.h) containing
  shared utilities, the run context, and the RunnerInterface abstraction.
- Move heavy template definitions into dtype-specific implementation headers:
  ck_grouped_gemm_fp16_impl.h and ck_grouped_gemm_fp8_impl.h.
- Add FP16/FP8 factory source files responsible for constructing the correct
  runner instances based on dtype/layout/tile configuration.
- Keep dispatch entry points thin and dependent only on the lightweight header.

This isolates the heavy CK template code to a smaller number of translation
units and prevents unnecessary template parsing across the codebase, improving
build scalability without changing runtime behavior.
Refactor the FP8 and FP16 explicit template instantiations into smaller,
more manageable translation units.

Key changes:
- Move explicit instantiations into a new subdirectory: gemm/instantiations/.
- Split FP8 and FP16 instantiations across multiple source files organized
  by operand data type combinations.
- Reduce the size of individual translation units to improve build
  parallelism and avoid long single-TU compile bottlenecks.

No functional changes; this is a build-structure refactor only.
…950)

Introduce runtime GPU architecture detection and split CK FP8 grouped GEMM
factories into arch-specific implementations for gfx942 and gfx950.

Key changes:
- Added GPUArch detection helper using hipGetDeviceProperties.
- Introduced common factory dispatcher that selects the correct arch-specific
  runner factory at runtime.
- Split FP8 runner factories into gfx942 and gfx950 implementations.
- Added arch-specific kernel instantiation translation units for each arch.
- Updated build to compile both arch implementations into the same library
  while selecting the correct one at runtime.

This enables a single TransformerEngine build to support both MI300 (gfx942)
and MI350 (gfx950) while avoiding invalid tile configurations during kernel
instantiation.
@aris134 aris134 force-pushed the amartin/ck-grouped-gemm-fp8 branch from 9f308a9 to e9cd6b8 Compare March 11, 2026 15:37
@aris134 aris134 force-pushed the amartin/ck-grouped-gemm-fp8 branch from c834302 to 32f2ac3 Compare March 11, 2026 15:51