Enable tcgen05 blockscaled ops on Thor SM110 by xiangg-nv · Pull Request #3283 · NVIDIA/cutlass

xiangg-nv · 2026-05-28T15:49:33Z

Edge-LLM NvFP4 MoE CuTeDSL kernels on Thor use tcgen05 blockscaled MMA and SMEM-to-TMEM scale-factor copies. (We currently apply this patch locally as a workaround.)

The existing checks only admitted the SM100/SM103 paths, so source-built CuTeDSL rejected SM110.

Admit Thor's blockscaled MMA arch aliases sm_101a and sm_110a, and allow the SM110f family for S2T tcgen05 copy ops.
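The change amounts to widening two allow-lists. A minimal sketch of that idea follows; it is not the CuTeDSL source. The set names, the `check_admitted` helper, and the spellings used for the previously admitted SM100/SM103 targets are assumptions made for illustration. Only `sm_101a`, `sm_110a`, and the SM110f family come from this PR.

```python
# Illustrative only: set names, the helper, and the SM100/SM103 target
# spellings are assumptions, not taken from the CuTeDSL source.
BLOCKSCALED_MMA_ARCHS_BEFORE = frozenset({"sm_100a", "sm_103a"})
# This PR admits Thor's two arch aliases for blockscaled MMA.
BLOCKSCALED_MMA_ARCHS_AFTER = BLOCKSCALED_MMA_ARCHS_BEFORE | {"sm_101a", "sm_110a"}

S2T_COPY_FAMILIES_BEFORE = frozenset({"sm_100f", "sm_103f"})
# This PR allows the SM110f family for S2T tcgen05 copy ops.
S2T_COPY_FAMILIES_AFTER = S2T_COPY_FAMILIES_BEFORE | {"sm_110f"}


def check_admitted(op_name, arch, admitted):
    """Raise ValueError when `arch` is not in the admitted set for `op_name`."""
    if arch not in admitted:
        raise ValueError(
            f"{op_name} is not supported on {arch}; "
            f"expected one of {sorted(admitted)}"
        )


def rejects(op_name, arch, admitted):
    """Return True if the check would reject `arch`, False otherwise."""
    try:
        check_admitted(op_name, arch, admitted)
    except ValueError:
        return True
    return False
```

With the "before" sets, `rejects("blockscaled MMA", "sm_110a", BLOCKSCALED_MMA_ARCHS_BEFORE)` is true, which mirrors the rejection of SM110 described above; with the "after" sets the same call is false.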

Validation:

- git diff --check
- python3 -m py_compile python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/mma.py python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/copy.py
- DKG grouped_blockscaled_gemm.py documented 4-group example on Thor SM110: PASS
- Edge-LLM nvfp4_moe AOT for sm_110/aarch64: 12/12 variants PASS
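The `py_compile` step only confirms that the two edited modules still parse; it does not execute them. For anyone who prefers to script the same check, a small sketch using the standard-library `py_compile` module is shown here. The `syntax_check` helper name is hypothetical.

```python
import py_compile


def syntax_check(paths):
    """Byte-compile each file and return the paths that fail to compile."""
    failed = []
    for path in paths:
        try:
            # doraise=True turns compile errors into PyCompileError
            # instead of printing them to stderr.
            py_compile.compile(path, doraise=True)
        except py_compile.PyCompileError:
            failed.append(path)
    return failed
```

Calling `syntax_check` with the `mma.py` and `copy.py` paths listed above from a CUTLASS checkout should return an empty list when both files compile.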

xiangg-nv wants to merge 1 commit into NVIDIA:main from xiangg-nv:feature/cutedsl-thor-sm110
