feat: add INT8/INT4 quantization support for 2-stage ASM MoE kernels #2340

Open
amd-zfyu wants to merge 23 commits into ROCm:main from amd-zfyu:asm_moe_2stages_int8_v2
Conversation

@amd-zfyu
Contributor

Summary

Add INT8 per-token and INT4 (LQQ) quantization support for the 2-stage ASM MoE pipeline.

Changes

  • aiter/fused_moe_bf16_asm.py: Add asm_moe_stage2() wrapper and a 2-stage ASM MoE pipeline with INT8/INT4 support; add CSV-based kernel config lookup via pandas; refactor shared logic into a _run_asm_moe_a16() helper
  • csrc/py_itfs_cu/asm_moe_2stage.cu: Add Kernel2Args struct for stage2 kernels, INT8/INT4 kernel launch paths with splitk support, new fields (total_tgs, ps_deno, ptr_Qscl, ptr_Qzero, eLQQs)
  • csrc/include/moe_op.h: Add moe_stage2_g1u1 declaration
  • csrc/include/rocm_ops.hpp: Add INT8/INT4 MoE bindings
  • aiter/ops/moe_op.py: Register moe_stage2_g1u1 op
  • Pre-compiled .co kernels: Stage1 and stage2 binaries for INT8 per-token and INT4 LQQ quantization (gfx942, various tile sizes: 32x128 to 80x128)
  • op_tests/test_moe_ep.py: Add INT8/FP8 smoothquant EP test cases
  • Smoothquant fix: Use smooth_per_token_scaled_quant for both INT8 and FP8 smoothquant paths in EP mode
  • Backward compat: Fix legacy .co kernel loading in asm_moe_2stage

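The CSV-based kernel config lookup mentioned above can be sketched roughly as follows. This is a hypothetical illustration only: the column names, the selection key (token count `M` and `model_dim`), and the bucketing strategy are assumptions for the example, not the actual aiter schema.

```python
# Hypothetical CSV-driven kernel config lookup via pandas.
# Columns and selection logic are illustrative assumptions.
import io

import pandas as pd

CONFIG_CSV = """M,model_dim,tile_m,tile_n,kernel_name
32,4096,32,128,moe_stage2_int8_32x128
64,4096,64,128,moe_stage2_int8_64x128
128,4096,80,128,moe_stage2_int8_80x128
"""


def lookup_kernel_config(m: int, model_dim: int) -> dict:
    """Pick the smallest tile config whose M bucket covers the token count."""
    df = pd.read_csv(io.StringIO(CONFIG_CSV))
    candidates = df[(df["model_dim"] == model_dim) & (df["M"] >= m)]
    if candidates.empty:
        # Fall back to the largest bucket when M exceeds every row.
        candidates = df[df["model_dim"] == model_dim]
    row = candidates.sort_values("M").iloc[0]
    return row.to_dict()


cfg = lookup_kernel_config(48, 4096)
```

In a real deployment the CSV would be shipped alongside the pre-compiled .co kernels and keyed to the tile sizes they were built for (here, 32x128 through 80x128 on gfx942, per the bullet above).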
Test plan

  • CI: gfx942/gfx950 op tests
  • test_moe_ep.py INT8 smoothquant tests
  • test_moe_ep.py FP8 smoothquant tests

amd-zfyu and others added 23 commits March 19, 2026 02:28
…_2stage

Old .co files expect the legacy KernelArgs size (up to ptr_SW + padding)
and gdy = sub_X_cnt. The extended args (total_tgs, ps_deno, ptr_Qscl,
ptr_Qzero, eLQQs) and gdy = sz_stp are only used by multix kernels.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: root <root@hjbog-srdc-24.amd.com>

The 1-stage g1u0 int8 smoothquant path was using moe_smoothquant_fwd with
3D tensor shapes (topk, M, model_dim), causing GPU memory access fault
when passed to fmoe_int8_g1u0 which expects 2D input. Align with main
branch behavior by using smooth_per_token_scaled_quant for all int8
smoothquant cases, with shape adjusted per 1-stage (2D) vs 2-stage (3D).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
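The 2D-vs-3D shape handling described in this commit can be sketched with a minimal per-token quantizer. This is a NumPy illustration of the general technique, not the aiter kernel: the function name mirrors `smooth_per_token_scaled_quant`, but the flattening strategy and return layout are assumptions.

```python
# Minimal sketch of per-token smoothquant to int8, showing how one
# quantizer can serve both the 1-stage (2D: M x model_dim) and 2-stage
# (3D: topk x M x model_dim) inputs by collapsing the leading dims.
# Illustrative only; not the aiter implementation.
import numpy as np


def smooth_per_token_scaled_quant(x: np.ndarray, smooth_scale: np.ndarray):
    """Quantize the last dim per token: q = round(x * smooth / scale)."""
    orig_shape = x.shape
    x2d = x.reshape(-1, orig_shape[-1])      # collapse (topk, M) or (M,)
    smoothed = x2d * smooth_scale            # per-channel smoothing
    per_token_scale = np.abs(smoothed).max(axis=-1, keepdims=True) / 127.0
    per_token_scale = np.maximum(per_token_scale, 1e-8)  # avoid div by 0
    q = np.clip(np.round(smoothed / per_token_scale), -128, 127)
    return (
        q.astype(np.int8).reshape(orig_shape),
        per_token_scale.reshape(*orig_shape[:-1], 1),
    )


# 1-stage path: 2D input (M, model_dim)
x1 = np.random.randn(4, 8).astype(np.float32)
q1, s1 = smooth_per_token_scaled_quant(x1, np.ones(8, dtype=np.float32))

# 2-stage path: 3D input (topk, M, model_dim), same quantizer
x2 = np.random.randn(2, 4, 8).astype(np.float32)
q2, s2 = smooth_per_token_scaled_quant(x2, np.ones(8, dtype=np.float32))
```

The bug being fixed was the inverse of this: handing a 3D tensor to a kernel that assumed 2D input, which faulted on GPU memory access.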
In expert parallelism mode, moe_smoothquant_fwd crashes with a GPU
memory access fault because local_expert_hash remaps masked experts
to -1, which causes out-of-bounds indexing into fc1_smooth_scale.
Use smooth_per_token_scaled_quant instead, which safely handles
expert mapping via smooth_scale_map_hash.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
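The expert-parallel hazard described in this commit can be shown in a few lines: once masked experts are remapped to -1, a direct gather into `fc1_smooth_scale` indexes out of bounds (or silently wraps to the last entry in NumPy/PyTorch semantics). The sketch below is illustrative; the guard-and-substitute pattern stands in for what the commit attributes to `smooth_scale_map_hash`, and all names other than `fc1_smooth_scale` are assumptions.

```python
# Sketch of the EP masked-expert hazard: topk ids of -1 must not be
# used to index the per-expert smooth-scale table directly.
# Illustrative only; names other than fc1_smooth_scale are assumptions.
import numpy as np

n_local_experts = 4
fc1_smooth_scale = np.arange(1, n_local_experts + 1, dtype=np.float32)

# topk expert ids after EP remap: -1 marks experts not on this rank.
topk_ids = np.array([2, -1, 0, 3])


def gather_scales_safe(ids: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Gather per-expert scales, substituting 1.0 for masked experts."""
    safe_ids = np.where(ids < 0, 0, ids)        # never index with -1
    gathered = scales[safe_ids]
    return np.where(ids < 0, 1.0, gathered)     # neutral scale when masked


s = gather_scales_safe(topk_ids, fc1_smooth_scale)
```

The unsafe version, `fc1_smooth_scale[topk_ids]`, is exactly the out-of-bounds access pattern the commit message attributes to the moe_smoothquant_fwd path.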
@amd-zfyu amd-zfyu requested a review from a team March 19, 2026 02:33
