Enable tcgen05 blockscaled ops on Thor SM110 #3283

Open
xiangg-nv wants to merge 1 commit into NVIDIA:main from xiangg-nv:feature/cutedsl-thor-sm110

@xiangg-nv

Edge-LLM NvFP4 MoE CuTeDSL kernels on Thor use tcgen05 blockscaled MMA and SMEM-to-TMEM scale-factor copies. (We currently carry this change as a local patch as a workaround.)

The existing checks only admitted the SM100/SM103 paths, so source-built CuTeDSL rejected SM110.

Admit Thor's blockscaled MMA arch aliases sm_101a and sm_110a, and allow the SM110f family for S2T tcgen05 copy ops.
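For reviewers unfamiliar with the gating, the change amounts to widening two allow-lists. The following is a minimal illustrative sketch, not the code in `mma.py` or `copy.py`: the function and constant names are hypothetical, and the "previously admitted" entries (`sm_100a`, `sm_103a`, `sm_100f`) are assumptions inferred from "SM100/SM103 paths" rather than copied from the source.

```python
# Hypothetical sketch of allow-list gating; these names are NOT CuTeDSL's API.

# Assumed entries for the previously admitted SM100/SM103 paths.
PREVIOUS_MMA_ARCHS = frozenset({"sm_100a", "sm_103a"})
# Thor aliases that this PR describes admitting for blockscaled MMA.
THOR_MMA_ARCHS = frozenset({"sm_101a", "sm_110a"})

# Assumed previously admitted family, plus the SM110f family named in this PR.
S2T_COPY_FAMILIES = frozenset({"sm_100f", "sm_110f"})


def admits_blockscaled_mma(arch: str) -> bool:
    """Return True if `arch` is on the (widened) blockscaled MMA allow-list."""
    return arch in PREVIOUS_MMA_ARCHS | THOR_MMA_ARCHS


def admits_s2t_copy(family: str) -> bool:
    """Return True if `family` is on the (widened) S2T copy allow-list."""
    return family in S2T_COPY_FAMILIES
```

With a gate of this shape, an arch string that is absent from every list is still rejected, so the change only affects the Thor targets named above.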

Validation:

  • git diff --check

  • python3 -m py_compile python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/mma.py python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/copy.py

  • DKG grouped_blockscaled_gemm.py documented 4-group example on Thor SM110: PASS

  • Edge-LLM nvfp4_moe AOT for sm_110/aarch64: 12/12 variants PASS
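The `py_compile` step above can also be run from Python when scripting the checks. This is a small sketch under the assumption that it is run from the repository root; the helper name `syntax_failures` is hypothetical and not part of the repository.

```python
import py_compile


def syntax_failures(paths):
    """Byte-compile each path and return a list of (path, error message) pairs."""
    failures = []
    for path in paths:
        try:
            # doraise=True turns compile errors into PyCompileError exceptions.
            py_compile.compile(str(path), doraise=True)
        except (py_compile.PyCompileError, OSError) as exc:
            # OSError covers missing or unreadable files.
            failures.append((str(path), str(exc)))
    return failures


# The two files touched by this PR, relative to the repository root.
TOUCHED_FILES = [
    "python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/mma.py",
    "python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/copy.py",
]
```

Calling `syntax_failures(TOUCHED_FILES)` from the repository root returns an empty list when both files compile. Like the command-line check, this only catches syntax errors, not behavioral regressions.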
