[ROCm] Enable VLLM triton FP8 moe for gfx1201, tuned for Qwen3-30B-A3B-FP8 tp=2 #79

Open
big-yellow-duck wants to merge 5 commits into main from rdna4-moe
@big-yellow-duck commented Mar 12, 2026

Purpose

gfx12xx cards support FP8, so this PR enables the FP8 Triton MoE path in vLLM on gfx1201 and tunes it for Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 at tp=2.

Test Plan

Benchmark Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 with the tuned Triton MoE kernels on 2x Radeon PRO 9700, with AITER disabled and the server launched as follows:

VLLM_ROCM_USE_AITER=0 vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 -tp 2 --enable-expert-parallel

Test Results

Benchmark using Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 on 2x Radeon PRO 9700

TTFT (ms)

| ISL-OSL | Triton MoE Tuned (ms) |
| --- | --- |
| 512-512 | 982.67 |
| 1024-1024 | 711.35 |
| 2048-2048 | 1250.83 |
| 4096-4096 | 4329.35 |
| 8192-1024 | 11351.34 |
| 16384-2048 | 131106.68 |
| Average | 24955.37 |

TPOT (ms)

| ISL-OSL | Triton MoE Tuned (ms) |
| --- | --- |
| 512-512 | 33.18 |
| 1024-1024 | 34.94 |
| 2048-2048 | 36.65 |
| 4096-4096 | 45.05 |
| 8192-1024 | 73.46 |
| 16384-2048 | 70.46 |
| Average | 48.95 |

E2E Latency (ms)

| ISL-OSL | Triton MoE Tuned (ms) |
| --- | --- |
| 512-512 | 17936.15 |
| 1024-1024 | 36449.97 |
| 2048-2048 | 76264.82 |
| 4096-4096 | 188823.78 |
| 8192-1024 | 86497.21 |
| 16384-2048 | 275340.95 |
| Average | 113552.15 |
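
As a sanity check on the three latency tables, the reported numbers can be cross-checked against each other. This is a minimal sketch, assuming the usual relationship that end-to-end latency is roughly TTFT plus TPOT for each of the remaining OSL - 1 output tokens. The helper names `estimate_e2e_ms` and `relative_error` are made up for this illustration and are not part of the benchmark tooling; the inputs are copied from the tables above.

```python
# Reported means from the tables above:
# (ISL, OSL) -> (TTFT ms, TPOT ms, E2E latency ms)
reported = {
    (512, 512): (982.67, 33.18, 17936.15),
    (1024, 1024): (711.35, 34.94, 36449.97),
    (2048, 2048): (1250.83, 36.65, 76264.82),
    (4096, 4096): (4329.35, 45.05, 188823.78),
    (8192, 1024): (11351.34, 73.46, 86497.21),
    (16384, 2048): (131106.68, 70.46, 275340.95),
}


def estimate_e2e_ms(ttft_ms: float, tpot_ms: float, osl: int) -> float:
    """Assumed model: the first token costs TTFT, each later token costs TPOT."""
    return ttft_ms + tpot_ms * (osl - 1)


def relative_error(estimate: float, actual: float) -> float:
    """Absolute difference as a fraction of the reported value."""
    return abs(estimate - actual) / actual


for (isl, osl), (ttft, tpot, e2e) in reported.items():
    est = estimate_e2e_ms(ttft, tpot, osl)
    print(
        f"{isl}-{osl}: estimated {est:.2f} ms, reported {e2e:.2f} ms, "
        f"relative error {relative_error(est, e2e):.4%}"
    )
```

Under this assumption every row agrees with the reported E2E latency to well under 0.1%, which suggests the three tables come from the same runs. The small residual is expected because each metric is averaged over requests separately.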

Accuracy checks

GSM8K Accuracy

| Metric | Triton MoE Tuned |
| --- | --- |
| exact_match,strict-match | 83.40% |
| exact_match,flexible-extract | 86.43% |

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

      """
      if current_platform.is_rocm() and IS_AITER_FOUND:
    -     from vllm.platforms.rocm import on_gfx9
    +     from vllm.platforms.rocm import on_gfx9, on_gfx12x

@tjtanaa (Member) commented Mar 12, 2026

This should not be included here, because this PR is only about pushing the tuned Triton config JSON.
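
For readers unfamiliar with what a tuned config JSON carries, here is a minimal sketch. It assumes the file maps batch sizes (as string keys) to Triton launch parameters and that the entry nearest to the runtime token count is chosen. The parameter values below are invented for illustration and are not taken from this PR, and `pick_config` is a hypothetical helper, not vLLM's API.

```python
import json

# Hypothetical tuned-config contents. Keys are batch sizes (number of tokens);
# values are Triton kernel launch parameters. All numbers are made up.
example_json = """
{
  "1":   {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 32,  "BLOCK_SIZE_K": 128,
          "GROUP_SIZE_M": 1,  "num_warps": 4, "num_stages": 2},
  "16":  {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64,  "BLOCK_SIZE_K": 128,
          "GROUP_SIZE_M": 1,  "num_warps": 4, "num_stages": 2},
  "256": {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
          "GROUP_SIZE_M": 8,  "num_warps": 8, "num_stages": 2}
}
"""


def pick_config(configs: dict, num_tokens: int) -> dict:
    """Return the entry whose batch-size key is closest to num_tokens."""
    by_batch_size = {int(key): value for key, value in configs.items()}
    closest = min(by_batch_size, key=lambda m: abs(m - num_tokens))
    return by_batch_size[closest]


configs = json.loads(example_json)
for tokens in (1, 20, 1000):
    chosen = pick_config(configs, tokens)
    print(tokens, "->", chosen["BLOCK_SIZE_M"], chosen["BLOCK_SIZE_N"])
```

Because such a file is pure data, it can be shipped without touching platform-detection code, which is the separation the review comment asks for.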

@big-yellow-duck changed the title from "[ROCm] Enable VLLM triton FP8 moe for gfx1201" to "[ROCm] Enable VLLM triton FP8 moe for gfx1201, tuned for Qwen3-30B-A3B-FP8 tp=2" on Mar 12, 2026
@big-yellow-duck marked this pull request as ready for review March 16, 2026 04:52
Signed-off-by: big-yellow-duck <jeffaw99@hotmail.com>