[Feat] Add CUTLASS matmul-epilogue fusion path by wtr0504 · Pull Request #30 · SandAI-org/MagiCompiler

wtr0504 · 2026-04-28T12:22:04Z

🗂️ PR Category

📝 Description

Add a CUTLASS-based matmul + epilogue fusion pass that fuses aten.mm followed by elementwise chains (activations, scalar ops, bias-add, residual loads) into a single GPU kernel, eliminating intermediate global memory round-trips.
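As a rough illustration of what "eliminating intermediate global memory round-trips" means, the pure-Python sketch below compares an unfused `mm → bias-add → SiLU` chain with a fused version that applies the epilogue to each accumulator value before it is stored. The function names and the nested-list representation are illustrative assumptions, not code from this PR. The real pass operates on FX graphs and emits CUTLASS kernels.

```python
import math


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matmul(a, b):
    """Plain (M, K) @ (K, N) on nested lists; stands in for aten.mm."""
    n = len(b[0])
    return [[sum(a_ik * b[k][j] for k, a_ik in enumerate(row)) for j in range(n)]
            for row in a]


def unfused(a, b, bias):
    # Each step materializes a full (M, N) intermediate, the way three
    # separate kernels would write to and re-read from global memory.
    acc = matmul(a, b)
    biased = [[v + bias[j] for j, v in enumerate(row)] for row in acc]
    return [[silu(v) for v in row] for row in biased]


def fused(a, b, bias):
    # The epilogue (bias-add, then SiLU) runs on each accumulator value
    # while it is still "in registers"; only the final result is stored.
    n = len(b[0])
    out = []
    for row in a:
        out_row = []
        for j in range(n):
            acc = sum(a_ik * b[k][j] for k, a_ik in enumerate(row))
            out_row.append(silu(acc + bias[j]))
        out.append(out_row)
    return out
```

Both functions compute the same values; the difference is only in how many full-size intermediates exist along the way, which is the cost the fusion pass targets.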

Two fusion backends

Generic EVT (Epilogue Visitor Tree): builds an IR tree from the FX epilogue chain and JIT-compiles a CUTLASS kernel templated on that tree. Supports unary activations (SiLU, Sigmoid, GeLU, ReLU, etc.), scalar arithmetic, 1-D bias (RowBroadcast), column scaling (ColBroadcast), and full (M, N) auxiliary loads. Codegen renders CUTLASS 3.x Sm90EVT on H100 (sm_90) and CUTLASS 2.x Sm80EVT on consumer Blackwell such as the RTX 5090 (sm_120).
SwiGLU DualGemm: pattern-matches the canonical SwiGLU recipe (slice-stride-2 → dual clamp → scaled SiLU → multiply) and dispatches to a vendored CUTLASS DualGemm kernel that runs both GEMMs (A @ W_gate.T and A @ W_linear.T) in the same threadblock, sharing A's SMEM stages. The kernel writes the (M, N/2) output directly. It routes to the SM80 cp.async multistage path on sm_120 and to the SM90 TMA + WGMMA warp-specialized path on sm_90. The recipe's constants (alpha, limit, one) are read from the FX graph at match time rather than hard-coded.
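The generic EVT backend can be pictured as a small tree of dataclasses with a canonical JSON form. The sketch below is a minimal illustration, assuming hypothetical field names (`op`, `compute_dtype`, `children`, `input_index`, `out_dtype`) and a hypothetical `cache_key` helper. Only the node names `Accum`, `Compute`, `Store`, and `RowBroadcast` come from the PR; the actual `evt_ir.py` may be structured differently.

```python
import hashlib
import json
from dataclasses import dataclass, fields, is_dataclass


@dataclass(frozen=True)
class Accum:
    """The matmul accumulator (the aten.mm result before any epilogue)."""


@dataclass(frozen=True)
class RowBroadcast:
    """A 1-D (N,) operand such as a bias, broadcast across rows."""
    input_index: int  # assumed: position of the tensor in the fused op's inputs


@dataclass(frozen=True)
class Compute:
    """An elementwise op applied to its children."""
    op: str
    compute_dtype: str
    children: tuple


@dataclass(frozen=True)
class Store:
    """Root of the tree: writes the child value to D in out_dtype."""
    out_dtype: str
    child: object


def to_dict(node) -> dict:
    """Recursively convert a node to a plain dict tagged with its class name."""
    d = {"kind": type(node).__name__}
    for f in fields(node):
        v = getattr(node, f.name)
        if is_dataclass(v):
            v = to_dict(v)
        elif isinstance(v, tuple):
            v = [to_dict(c) if is_dataclass(c) else c for c in v]
        d[f.name] = v
    return d


def canonical_json(node) -> str:
    # Sorted keys and fixed separators make equal trees serialize identically.
    return json.dumps(to_dict(node), sort_keys=True, separators=(",", ":"))


def cache_key(node) -> str:
    """Hypothetical JIT-cache key derived from the canonical JSON."""
    return hashlib.sha256(canonical_json(node).encode()).hexdigest()[:16]


# silu(mm + bias), computed in float32 and stored as bfloat16
tree = Store("bfloat16", Compute("silu", "float32", (
    Compute("add", "float32", (Accum(), RowBroadcast(0))),)))
```

A stable serialization of this kind lets two FX graphs with the same epilogue structure share one compiled kernel, while any change in op, operand, or dtype produces a different key.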

Key design points

Per-node compute_dtype: the IR walker tracks _to_copy / convert_element_type nodes and stamps each Compute node with the precision active at that point (float32, bfloat16, or float16). Codegen emits per-node element types in VisitorCompute / Sm90Compute.
Greedy alignment: tries 128-bit loads first, falls back to 64-bit when K or N only meets 8-byte alignment. The runtime pads D's row stride to 16-byte boundaries for TMA compatibility.
Autotune: both DualGemm paths register multiple (TileShape, Stages) candidates and time them at first call per shape bucket; the winner is cached for all subsequent calls.
Build robustness: _track_build / _untrack_build + atexit + SIGTERM/SIGINT/SIGHUP handlers ensure interrupted cpp_extension.load calls don't leave stale lock files. Warm-cache fast path (_try_dlopen_prebuilt) skips cpp_extension.load entirely when the .so already exists, preventing multi-rank FileBaton hangs.
SM90 multi-AuxLoad: multiple (M, N) auxiliary tensors are loaded via inline ld.global (Sm90AuxLoad<0>) — no SMEM staging needed — enabling patterns like mm + R1 + R2 + R3.
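The greedy alignment rule and the D-stride padding above reduce to a few lines of integer arithmetic. The sketch below is an illustration under stated assumptions: the helper names `pick_alignment_bits` and `padded_row_stride` are hypothetical, and returning `None` for shapes that meet neither alignment is an assumed convention (the PR's negative "misalign" tests suggest such shapes are not fused, but the exact behavior is not described).

```python
def pick_alignment_bits(k: int, n: int, elem_bytes: int):
    """Return the widest vector-load width (128 or 64 bits) that both the
    K and N extents satisfy, or None if neither width fits."""
    for bits in (128, 64):  # greedy: try the widest load first
        nbytes = bits // 8
        if (k * elem_bytes) % nbytes == 0 and (n * elem_bytes) % nbytes == 0:
            return bits
    return None


def padded_row_stride(n: int, elem_bytes: int, boundary_bytes: int = 16) -> int:
    """Row stride of D in elements, rounded up so every row starts on a
    boundary_bytes boundary. Assumes boundary_bytes is a multiple of
    elem_bytes (true for 2-byte and 4-byte element types)."""
    row_bytes = n * elem_bytes
    padded_bytes = -(-row_bytes // boundary_bytes) * boundary_bytes  # ceil
    return padded_bytes // elem_bytes
```

For example, with bfloat16 (2 bytes per element) and N = 1028, a row spans 2056 bytes: that is a multiple of 8 but not of 16, so the 64-bit path is chosen and the row stride is padded from 1028 to 1032 elements.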

Files added / changed

| Area | Files | What |
|---|---|---|
| IR | `evt_ir.py` | Dataclass IR (Accum, Compute, Store, RowBroadcast, ColBroadcast, AuxLoad) + canonical JSON serialization |
| FX pass | `matmul_epilogue_fusion.py` | Graph walker, epilogue chain absorption, SwiGLU pattern matching, B-layout classification |
| Runtime | `evt_runtime.py` | `torch.library` op, JIT compile cache, DualGemm loader, dispatch fast-cache, build cleanup |
| SM80 codegen | `sm80/evt_codegen.py` | CUTLASS 2.x Sm80EVT `.cu` renderer |
| SM90 codegen | `sm90/evt_codegen.py` | CUTLASS 3.x Sm90EVT `.cu` renderer with Sm90Compute + Sm90AuxLoad |
| SM80 DualGemm | `sm80/cutlass_kernels/swiglu_one_stage.cu` | Vendored DualGemm + SwigluCombine epilogue + autotune runner |
| SM90 DualGemm | `sm90/cutlass_kernels/swiglu_one_stage.cu` | TMA + WGMMA DualGemm + SwigluCombine + autotune runner |
| SM90 device wrapper | `sm90/cutlass_kernels/hopper_dual_gemm/` | Vendored Sm90DualGemm with LayoutTraits for 2.x→3.x layout translation |
| Shared | `common/cutlass_kernels/swiglu_combine.h`, `common/codegen_shared.py` | SwigluCombine functor, shared codegen utilities |
| Config | `config.py` | `enable_mm_epilogue_fusion` flag in PassConfig |
| Infra | `Dockerfile`, `.pre-commit-config.yaml` | CUTLASS install, copyright hook |
| Tests | `test_matmul_epilogue_fusion.py` | 30+ tests: positive (activations, scalar ops, bias, AuxLoad, SwiGLU), negative (escape, misalign, bare mm), out_dtype matrix, compute_dtype, D-stride padding, SM90 parity |
| Tests | `test_build_cleanup.py` | 5 tests for the `_track_build`/`_untrack_build` + signal-handler cleanup mechanism |

fix ci

jiahy0825 mentioned this pull request May 8, 2026

MagiCompiler v1.1.0 Release RoadMap (2026 Q2) #31

Open

23 tasks

jiahy0825 reviewed May 8, 2026

jiahy0825 linked an issue May 8, 2026 that may be closed by this pull request

MagiCompiler v1.1.0 Release RoadMap (2026 Q2) #31

jiahy0825 removed a link to an issue May 8, 2026

MagiCompiler v1.1.0 Release RoadMap (2026 Q2) #31

jiahy0825 reviewed May 8, 2026

Comment thread magi_compiler/passes/piecewise_graph/fusion/blackwell_geforce/evt_runtime.py Outdated

Comment thread magi_compiler/passes/piecewise_graph/fusion/matmul_epilogue_fusion.py Outdated

jiahy0825 added this to the MagiCompiler v1.1.0 (2026 Q2) milestone May 8, 2026

wtr0504 force-pushed the feat/matmul_epilogue branch from 548ae66 to 7a3bed3 Compare May 19, 2026 06:57

wtr0504 added 23 commits May 28, 2026 11:08

add triton matmul fusion

cdf5979

add cute kernel

20054d1

[Feat] Add CUTLASS matmul-epilogue fusion path for sm_120

292e5cd

add cutlass install in Dockerfile & update

ce3f7b4

add enable_mm_epilogue_fusion & chore

4474bbd

chore

bd5a2e6

update .github/codestyle/copyright.hook

2239be7

Fix: unify Alignment and padding D Tensor

68ecbee

add more flexible align for matrix

efd5193

refactor & add sm90 c++ code

0868864

add sm90 multi-extra

0a3d89f

refactor & handle type conversion in epilogue

0e77026

fix ci

fix static param handling in swiglu

75004b4

refactor & fix sm90 ldd & chore

7a0b3b5

Improve cleanup handling for interrupted C++ compilation

0b8082c

chore & add ci test

4ea07f6

Update Dockerfile

6535f96

Update README.md

0f65438

fix matmul epilogue fusion correctness

3242d8d

change cutlass root path

16a1679

chore

0ddef80

chore

ded0b34

rm some tests

468e936

wtr0504 added 4 commits May 28, 2026 11:08

rm some tests

3e65cbd

chore

97ddf50

rm some tests

8b84c6f

rm some tests

d546b70

wtr0504 force-pushed the feat/matmul_epilogue branch from b92c6e7 to d546b70 Compare May 28, 2026 07:31

chore

3587de6

wtr0504 changed the title ~~[Feat] Add CUTLASS matmul-epilogue fusion path for sm_120~~ [Feat] Add CUTLASS matmul-epilogue fusion path May 28, 2026

wtr0504 wants to merge 28 commits into SandAI-org:main from wtr0504:feat/matmul_epilogue
