feat: add INT8/INT4 quantization support for 2-stage ASM MoE kernels #2340

Open
amd-zfyu wants to merge 23 commits into ROCm:main from amd-zfyu:asm_moe_2stages_int8_v2
Conversation

@amd-zfyu
Contributor

Summary

Add INT8 per-token and INT4 (LQQ) quantization support for the 2-stage ASM MoE pipeline.

Changes

  • aiter/fused_moe_bf16_asm.py: Add asm_moe_stage2() wrapper and a 2-stage ASM MoE pipeline with INT8/INT4 support; add CSV-based kernel config lookup via pandas; refactor shared logic into a _run_asm_moe_a16() helper
  • csrc/py_itfs_cu/asm_moe_2stage.cu: Add Kernel2Args struct for stage2 kernels, INT8/INT4 kernel launch paths with splitk support, new fields (total_tgs, ps_deno, ptr_Qscl, ptr_Qzero, eLQQs)
  • csrc/include/moe_op.h: Add moe_stage2_g1u1 declaration
  • csrc/include/rocm_ops.hpp: Add INT8/INT4 MoE bindings
  • aiter/ops/moe_op.py: Register moe_stage2_g1u1 op
  • Pre-compiled .co kernels: Stage1 and stage2 binaries for INT8 per-token and INT4 LQQ quantization (gfx942, various tile sizes: 32x128 to 80x128)
  • op_tests/test_moe_ep.py: Add INT8/FP8 smoothquant EP test cases
  • Smoothquant fix: Use smooth_per_token_scaled_quant for both INT8 and FP8 smoothquant paths in EP mode
  • Backward compat: Fix legacy .co kernel loading in asm_moe_2stage

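The CSV-based kernel config lookup mentioned above can be sketched roughly as follows. This is a hypothetical illustration only: the column names, the selection key (token count `M` and `model_dim`), and the bucketing strategy are assumptions for the example, not the actual aiter schema.

```python
# Hypothetical CSV-driven kernel config lookup via pandas.
# Columns and selection logic are illustrative assumptions.
import io

import pandas as pd

CONFIG_CSV = """M,model_dim,tile_m,tile_n,kernel_name
32,4096,32,128,moe_stage2_int8_32x128
64,4096,64,128,moe_stage2_int8_64x128
128,4096,80,128,moe_stage2_int8_80x128
"""


def lookup_kernel_config(m: int, model_dim: int) -> dict:
    """Pick the smallest tile config whose M bucket covers the token count."""
    df = pd.read_csv(io.StringIO(CONFIG_CSV))
    candidates = df[(df["model_dim"] == model_dim) & (df["M"] >= m)]
    if candidates.empty:
        # Fall back to the largest bucket when M exceeds every row.
        candidates = df[df["model_dim"] == model_dim]
    row = candidates.sort_values("M").iloc[0]
    return row.to_dict()


cfg = lookup_kernel_config(48, 4096)
```

In a real deployment the CSV would be shipped alongside the pre-compiled .co kernels and keyed to the tile sizes they were built for (here, 32x128 through 80x128 on gfx942, per the bullet above).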
Test plan

  • CI: gfx942/gfx950 op tests
  • test_moe_ep.py INT8 smoothquant tests
  • test_moe_ep.py FP8 smoothquant tests

amd-zfyu and others added 23 commits March 19, 2026 02:28
…_2stage

Old .co files expect the legacy KernelArgs size (up to ptr_SW + padding)
and gdy = sub_X_cnt. The extended args (total_tgs, ps_deno, ptr_Qscl,
ptr_Qzero, eLQQs) and gdy = sz_stp are only used by multix kernels.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: root <root@hjbog-srdc-24.amd.com>

The 1-stage g1u0 int8 smoothquant path was using moe_smoothquant_fwd with
3D tensor shapes (topk, M, model_dim), causing GPU memory access fault
when passed to fmoe_int8_g1u0 which expects 2D input. Align with main
branch behavior by using smooth_per_token_scaled_quant for all int8
smoothquant cases, with shape adjusted per 1-stage (2D) vs 2-stage (3D).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
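The 2D-vs-3D shape handling described in this commit can be sketched with a minimal per-token quantizer. This is a NumPy illustration of the general technique, not the aiter kernel: the function name mirrors `smooth_per_token_scaled_quant`, but the flattening strategy and return layout are assumptions.

```python
# Minimal sketch of per-token smoothquant to int8, showing how one
# quantizer can serve both the 1-stage (2D: M x model_dim) and 2-stage
# (3D: topk x M x model_dim) inputs by collapsing the leading dims.
# Illustrative only; not the aiter implementation.
import numpy as np


def smooth_per_token_scaled_quant(x: np.ndarray, smooth_scale: np.ndarray):
    """Quantize the last dim per token: q = round(x * smooth / scale)."""
    orig_shape = x.shape
    x2d = x.reshape(-1, orig_shape[-1])      # collapse (topk, M) or (M,)
    smoothed = x2d * smooth_scale            # per-channel smoothing
    per_token_scale = np.abs(smoothed).max(axis=-1, keepdims=True) / 127.0
    per_token_scale = np.maximum(per_token_scale, 1e-8)  # avoid div by 0
    q = np.clip(np.round(smoothed / per_token_scale), -128, 127)
    return (
        q.astype(np.int8).reshape(orig_shape),
        per_token_scale.reshape(*orig_shape[:-1], 1),
    )


# 1-stage path: 2D input (M, model_dim)
x1 = np.random.randn(4, 8).astype(np.float32)
q1, s1 = smooth_per_token_scaled_quant(x1, np.ones(8, dtype=np.float32))

# 2-stage path: 3D input (topk, M, model_dim), same quantizer
x2 = np.random.randn(2, 4, 8).astype(np.float32)
q2, s2 = smooth_per_token_scaled_quant(x2, np.ones(8, dtype=np.float32))
```

The bug being fixed was the inverse of this: handing a 3D tensor to a kernel that assumed 2D input, which faulted on GPU memory access.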
In expert parallelism mode, moe_smoothquant_fwd crashes with a GPU
memory access fault because local_expert_hash remaps masked experts
to -1, which causes out-of-bounds indexing into fc1_smooth_scale.
Use smooth_per_token_scaled_quant instead, which safely handles
expert mapping via smooth_scale_map_hash.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
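The expert-parallel hazard described in this commit can be shown in a few lines: once masked experts are remapped to -1, a direct gather into `fc1_smooth_scale` indexes out of bounds (or silently wraps to the last entry in NumPy/PyTorch semantics). The sketch below is illustrative; the guard-and-substitute pattern stands in for what the commit attributes to `smooth_scale_map_hash`, and all names other than `fc1_smooth_scale` are assumptions.

```python
# Sketch of the EP masked-expert hazard: topk ids of -1 must not be
# used to index the per-expert smooth-scale table directly.
# Illustrative only; names other than fc1_smooth_scale are assumptions.
import numpy as np

n_local_experts = 4
fc1_smooth_scale = np.arange(1, n_local_experts + 1, dtype=np.float32)

# topk expert ids after EP remap: -1 marks experts not on this rank.
topk_ids = np.array([2, -1, 0, 3])


def gather_scales_safe(ids: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Gather per-expert scales, substituting 1.0 for masked experts."""
    safe_ids = np.where(ids < 0, 0, ids)        # never index with -1
    gathered = scales[safe_ids]
    return np.where(ids < 0, 1.0, gathered)     # neutral scale when masked


s = gather_scales_safe(topk_ids, fc1_smooth_scale)
```

The unsafe version, `fc1_smooth_scale[topk_ids]`, is exactly the out-of-bounds access pattern the commit message attributes to the moe_smoothquant_fwd path.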
@amd-zfyu amd-zfyu requested a review from a team March 19, 2026 02:33
