[ROCm] Enable VLLM triton FP8 moe for gfx1201, tuned for Qwen3-30B-A3B-FP8 tp=2 by big-yellow-duck · Pull Request #79 · EmbeddedLLM/vllm

big-yellow-duck · 2026-03-12T08:13:56Z

Purpose

gfx12xx cards support FP8, so this PR enables the FP8 Triton MoE path in vLLM for gfx1201 and adds configs tuned for Qwen/Qwen3-30B-A3B-Instruct-2507-FP8.

Test Plan

Benchmark Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 with the tuned Triton MoE configs on 2x Radeon PRO 9700.

VLLM_ROCM_USE_AITER=0 vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 -tp 2 --enable-expert-parallel
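
For scripted runs, the same launch can be assembled in Python. This is only a sketch: `build_serve_command` is a hypothetical helper, not part of vLLM, and it simply mirrors the environment variable and flags from the command above. Setting `VLLM_ROCM_USE_AITER=0` presumably keeps AITER out of the way so the Triton MoE path is the one being exercised.

```python
import os
import shlex

MODEL = "Qwen/Qwen3-30B-A3B-Instruct-2507-FP8"


def build_serve_command(model: str = MODEL, tensor_parallel: int = 2):
    """Return (env, argv) mirroring the serve command in the test plan.

    Hypothetical helper for illustration; it does not start the server.
    """
    env = dict(os.environ)
    # Same setting as the test plan command: AITER disabled.
    env["VLLM_ROCM_USE_AITER"] = "0"
    argv = [
        "vllm", "serve", model,
        "-tp", str(tensor_parallel),
        "--enable-expert-parallel",
    ]
    return env, argv


env, argv = build_serve_command()
# Print the equivalent shell command for inspection.
print("VLLM_ROCM_USE_AITER=" + env["VLLM_ROCM_USE_AITER"], shlex.join(argv))
# To actually launch the server: subprocess.run(argv, env=env, check=True)
```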

Test Results

Benchmark using Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 on 2x Radeon PRO 9700

TTFT (ms)

ISL-OSL	Triton MoE Tuned
512-512	982.67
1024-1024	711.35
2048-2048	1250.83
4096-4096	4329.35
8192-1024	11351.34
16384-2048	131106.68
Average	24955.37

TPOT (ms)

ISL-OSL	Triton MoE Tuned
512-512	33.18
1024-1024	34.94
2048-2048	36.65
4096-4096	45.05
8192-1024	73.46
16384-2048	70.46
Average	48.95

E2E Latency (ms)

ISL-OSL	Triton MoE Tuned
512-512	17936.15
1024-1024	36449.97
2048-2048	76264.82
4096-4096	188823.78
8192-1024	86497.21
16384-2048	275340.95
Average	113552.15
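
As a sanity check, the three tables can be cross-checked against each other. The sketch below assumes that ISL-OSL means input and output sequence length, and that the benchmark tool relates the metrics as E2E ≈ TTFT + TPOT × (OSL − 1). Neither is stated in this PR, so treat both as assumptions. The numbers are copied from the tables above.

```python
# (label, output length, TTFT ms, TPOT ms, reported E2E ms), copied from the tables.
ROWS = [
    ("512-512", 512, 982.67, 33.18, 17936.15),
    ("1024-1024", 1024, 711.35, 34.94, 36449.97),
    ("2048-2048", 2048, 1250.83, 36.65, 76264.82),
    ("4096-4096", 4096, 4329.35, 45.05, 188823.78),
    ("8192-1024", 1024, 11351.34, 73.46, 86497.21),
    ("16384-2048", 2048, 131106.68, 70.46, 275340.95),
]


def estimate_e2e_ms(ttft_ms: float, tpot_ms: float, output_len: int) -> float:
    """Assumed relation: first token costs TTFT, each later token costs TPOT."""
    return ttft_ms + tpot_ms * (output_len - 1)


def relative_errors(rows=ROWS) -> dict:
    """Relative gap between the estimated and the reported E2E latency per row."""
    return {
        label: abs(estimate_e2e_ms(ttft, tpot, osl) - e2e) / e2e
        for label, osl, ttft, tpot, e2e in rows
    }


for label, err in relative_errors().items():
    print(f"{label}: relative error {err:.2e}")
```

Under these assumptions every row agrees with the reported E2E latency to well under 0.1%, which suggests the three tables are internally consistent.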

Accuracy checks

GSM8K Accuracy

Metric	Triton MoE Tuned
exact_match,strict-match	83.40%
exact_match,flexible-extract	86.43%
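
The metric names above look like lm-evaluation-harness result keys, although the PR does not say which tool produced them. Assuming a results dictionary of that shape, a minimal sketch for turning the raw fractions into the percentages shown could look like this. `format_accuracy` is a hypothetical helper, and the sample values are the table entries converted back to fractions.

```python
def format_accuracy(results: dict, task: str = "gsm8k") -> dict:
    """Format exact-match scores for one task as percentage strings.

    Assumes keys of the form "exact_match,<filter>"; other keys
    (for example standard-error entries) are skipped.
    """
    task_results = results[task]
    return {
        key: f"{value * 100:.2f}%"
        for key, value in task_results.items()
        if key.startswith("exact_match,")
    }


# Sample input reconstructed from the table above (assumed layout).
sample = {
    "gsm8k": {
        "exact_match,strict-match": 0.8340,
        "exact_match,flexible-extract": 0.8643,
    }
}

for metric, value in format_accuracy(sample).items():
    print(metric, value)
```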

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

tjtanaa · 2026-03-12T08:44:20Z

vllm/_aiter_ops.py

    """
    if current_platform.is_rocm() and IS_AITER_FOUND:
-        from vllm.platforms.rocm import on_gfx9
+        from vllm.platforms.rocm import on_gfx9, on_gfx12x


This should not be included here, because this PR is only about pushing the tuned Triton config JSON.

Signed-off-by: big-yellow-duck <jeffaw99@hotmail.com>

big-yellow-duck added 2 commits March 11, 2026 03:28

add aiter gemm_a8w8_blockscale support for gfx1201

3e9d168

add tuned moe configs for gfx1201

ca40896

tjtanaa reviewed Mar 12, 2026

big-yellow-duck changed the title ~~[ROCm] Enable VLLM triton FP8 moe for gfx1201~~ [ROCm] Enable VLLM triton FP8 moe for gfx1201, tuned for Qwen3-30B-A3B-FP8 tp=2 Mar 12, 2026

big-yellow-duck and others added 2 commits March 16, 2026 04:47

remove unneceesary gfx12 in aiter_ops

6fb9a3e

Merge branch 'main' into rdna4-moe

f108f89

big-yellow-duck marked this pull request as ready for review March 16, 2026 04:52

fix formatting

826b9f1

big-yellow-duck wants to merge 5 commits into main from rdna4-moe
