[ROCm] Enable aiter group quant FP8 for RDNA4 gpus by big-yellow-duck · Pull Request #78 · EmbeddedLLM/vllm

big-yellow-duck · 2026-03-12T06:31:15Z

Purpose

This PR combines two AITER kernel patches for gfx12xx support:

RMSNorm kernel - Enables AITER RMSNorm via VLLM_ROCM_USE_AITER_RMSNORM=1
GroupQuant FP8 kernel - Enables the full W8A8BlockFp8LinearOp _run_aiter pipeline, following the AITER enablement in "[ROCm] Enable Aiter ck_gemm_a8w8_blockscale for RDNA4 gpus. Qwen3.5-27B-FP8 tp=2, Qwen3-0.6B-FP8 tp=1" (#77)

Note: The AITER RMSNorm patch is only on AITER's side.

Test Plan

Benchmark Qwen3.5-27B-FP8 on 2x Radeon PRO 9700, comparing default vLLM against vLLM with AITER enabled.

Default (baseline)

VLLM_ROCM_USE_AITER=0 vllm serve Qwen/Qwen3.5-27B-FP8 -tp 2 --gpu-memory-utilization 0.98 --max-model-len 65536

AITER with CK gemm_a8w8_blockscale + GroupQuant FP8

VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=0 \
VLLM_ROCM_USE_AITER_MHA=0 \
VLLM_ROCM_USE_AITER_RMSNORM=0 \
vllm serve Qwen/Qwen3.5-27B-FP8 -tp 2 --gpu-memory-utilization 0.98 --max-model-len 65536

AITER with CK gemm_a8w8_blockscale + GroupQuant FP8 + RMSNorm

VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=0 \
VLLM_ROCM_USE_AITER_MHA=0 \
VLLM_ROCM_USE_AITER_RMSNORM=1 \
vllm serve Qwen/Qwen3.5-27B-FP8 -tp 2 --gpu-memory-utilization 0.95 --max-model-len 65536

Test Results

Aiter ck_gemm_a8w8_blockscale + GroupQuant fp8

TTFT (ms)

ISL-OSL	Default	Aiter	Speedup
512-512	7540.37	7764.62	-3.0%
1024-1024	3728.69	4394.07	-17.8%
2048-2048	5666.84	6830.08	-20.5%
4096-4096	10358.16	12438.51	-20.1%
8192-1024	21272.21	25201.75	-18.5%
16384-2048	49513.83	56903.10	-14.9%
Average	16346.68	18922.02	-15.8%

TPOT (ms)

ISL-OSL	Default	Aiter	Speedup
512-512	62.51	42.85	+31.5%
1024-1024	65.42	45.78	+30.0%
2048-2048	65.82	46.37	+29.6%
4096-4096	69.03	49.65	+28.1%
8192-1024	108.26	96.45	+10.9%
16384-2048	125.64	113.91	+9.3%
Average	82.78	65.83	+20.5%

E2E Latency (ms)

ISL-OSL	Default	Aiter	Speedup
512-512	39483.28	29659.66	+24.9%
1024-1024	70657.76	51230.64	+27.5%
2048-2048	140390.78	101740.90	+27.5%
4096-4096	293054.02	215751.43	+26.4%
8192-1024	132024.15	123866.82	+6.2%
16384-2048	306707.17	290067.59	+5.4%
Average	163719.53	135386.17	+17.3%

Aiter ck_gemm_a8w8_blockscale + GroupQuant fp8 + RMSNorm

TTFT (ms)

ISL-OSL	Default	Aiter	Speedup
512-512	7540.37	7598.87	-0.8%
1024-1024	3728.69	3903.31	-4.7%
2048-2048	5666.84	7068.94	-24.7%
4096-4096	10358.16	12432.28	-20.0%
8192-1024	21272.21	25171.34	-18.3%
16384-2048	49513.83	57003.19	-15.1%
Average	16346.68	18862.99	-15.4%

TPOT (ms)

ISL-OSL	Default	Aiter	Speedup
512-512	62.51	42.99	+31.2%
1024-1024	65.42	44.36	+32.2%
2048-2048	65.82	47.17	+28.3%
4096-4096	69.03	49.63	+28.1%
8192-1024	108.26	96.54	+10.8%
16384-2048	125.64	113.97	+9.3%
Average	82.78	65.78	+20.5%

E2E Latency (ms)

ISL-OSL	Default	Aiter	Speedup
512-512	39483.28	29565.19	+25.1%
1024-1024	70657.76	49287.97	+30.2%
2048-2048	140390.78	103622.45	+26.2%
4096-4096	293054.02	215669.96	+26.4%
8192-1024	132024.15	123931.44	+6.1%
16384-2048	306707.17	290296.08	+5.4%
Average	163719.53	135395.51	+17.3%
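
The Speedup column in the tables above is consistent with a relative latency reduction, (Default - Aiter) / Default, so a negative value means the AITER run was slower. The following is a minimal sketch that reproduces a few rows under that assumption; the helper name speedup_pct is illustrative and not part of the PR.

```python
def speedup_pct(default_ms: float, aiter_ms: float) -> float:
    """Relative latency reduction in percent; positive means AITER is faster."""
    return (default_ms - aiter_ms) / default_ms * 100.0


# Rows copied from the first result set: (label, default, aiter)
rows = [
    ("TTFT 512-512", 7540.37, 7764.62),
    ("TPOT 512-512", 62.51, 42.85),
    ("E2E 16384-2048", 306707.17, 290067.59),
]

for label, default_ms, aiter_ms in rows:
    print(f"{label}: {speedup_pct(default_ms, aiter_ms):+.1f}%")
```

Read this way, the TTFT regressions and the TPOT gains share one sign convention: prefill got slower while per-token decode got faster.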

Accuracy checks

GSM8K Accuracy

Aiter ck_gemm_a8w8_blockscale + GroupQuant fp8

Metric	Default	AITER	Diff
exact_match,strict-match	86.13%	86.05%	-0.07%
exact_match,flexible-extract	87.79%	87.41%	-0.38%

Aiter ck_gemm_a8w8_blockscale + GroupQuant fp8 + RMSNorm

Metric	Default	Aiter	Diff
exact_match,strict-match	86.13%	85.29%	-0.84%
exact_match,flexible-extract	87.79%	86.95%	-0.84%

None of the accuracy differences is statistically significant.
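
One rough way to sanity-check this claim is a pooled two-proportion z-test on the reported accuracies. The sketch below assumes each run was scored on the full GSM8K test split of 1319 problems and treats the two runs as independent samples. Both are assumptions, not statements from the PR. Because both runs answer the same questions, a paired test such as McNemar's would be more appropriate if per-question results were available.

```python
import math


def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for a pooled two-proportion z-test."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


N = 1319  # assumption: size of the GSM8K test split used for every run

# (default accuracy, AITER accuracy) copied from the tables above
comparisons = {
    "strict-match, GroupQuant": (0.8613, 0.8605),
    "flexible-extract, GroupQuant": (0.8779, 0.8741),
    "strict-match, GroupQuant + RMSNorm": (0.8613, 0.8529),
    "flexible-extract, GroupQuant + RMSNorm": (0.8779, 0.8695),
}

for name, (default_acc, aiter_acc) in comparisons.items():
    z, p = two_proportion_z(default_acc, aiter_acc, N, N)
    print(f"{name}: z={z:.2f}, p={p:.2f}")
```

Under these assumptions even the largest gap (0.84 percentage points) gives |z| of roughly 0.6, well below the 1.96 threshold for significance at the 5% level.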

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

github-actions · 2026-03-12T06:31:25Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀


tjtanaa · 2026-03-16T07:45:55Z

@big-yellow-duck @BadrBasowid

We need to guard all of the other ops with an additional on_mi3xx() condition so that users don't need to know which flags to switch off. On Radeon, users can then just set VLLM_ROCM_USE_AITER=1.

https://github.com/vllm-project/vllm/blob/8d3f8f485efc0b812f91ecf19a3a12232587550c/vllm/_aiter_ops.py#L1129-L1201
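
The guard requested above can be pictured as a per-op check that combines the master switch, the op's own flag, and an architecture test. The sketch below is illustrative only: the Arch class, the SUPPORTED mapping, and op_enabled are hypothetical names, and the op-to-architecture mapping is an assumption inferred from the flags used in the test plan, not the actual contents of _aiter_ops.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arch:
    """Hypothetical stand-in for the results of on_mi3xx() / on_gfx12x()."""
    mi3xx: bool = False
    gfx12x: bool = False


# Assumed mapping from op name to the architectures it may run on.
SUPPORTED = {
    "mha": ("mi3xx",),
    "unified_attention": ("mi3xx",),
    "rmsnorm": ("mi3xx", "gfx12x"),
    "group_quant_fp8": ("mi3xx", "gfx12x"),
}


def op_enabled(op: str, arch: Arch, master: bool, op_flag: bool = True) -> bool:
    """An op runs only if the master switch, its own flag, and the arch check all pass."""
    arch_ok = any(getattr(arch, name) for name in SUPPORTED[op])
    return master and op_flag and arch_ok


radeon = Arch(gfx12x=True)
for op in SUPPORTED:
    print(op, op_enabled(op, radeon, master=True))
```

With a guard of this shape, a Radeon user who sets only the master switch gets the supported ops, and the unsupported ones switch themselves off.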


tjtanaa · 2026-03-19T03:59:32Z

vllm/_aiter_ops.py


 import vllm.envs as envs
 from vllm.platforms import current_platform
+from vllm.platforms.rocm import on_gfx12x, on_mi3xx


We cannot do this as a module-level import:
from vllm.platforms.rocm import on_gfx12x, on_mi3xx

It will break our contract that _aiter_ops.py can always be imported, even on non-ROCm platforms.
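
One common way to honor such a contract is to defer the platform-specific import into the function that needs it, so importing the module itself never touches ROCm-only code. The following is a minimal sketch, not the fix that landed in this PR: the helper name aiter_arch_supported is hypothetical, on_gfx12x and on_mi3xx are taken from the diff above, and falling back to False when the import fails is an assumption about the desired behavior.

```python
from functools import cache


@cache
def aiter_arch_supported() -> bool:
    """Resolve the GPU architecture check lazily, on first use.

    The import lives inside the function, so importing this module
    succeeds even where vllm.platforms.rocm cannot be imported.
    """
    try:
        # Names taken from the diff under review; assumed to return bool.
        from vllm.platforms.rocm import on_gfx12x, on_mi3xx
    except ImportError:
        # Assumption: treat a failed import as "AITER is not supported here".
        return False
    return bool(on_mi3xx() or on_gfx12x())


print(aiter_arch_supported())
```

Caching the result keeps the check cheap on hot paths while still postponing it until an op actually asks.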


big-yellow-duck added 3 commits March 11, 2026 03:28

add aiter gemm_a8w8_blockscale support for gfx1201

3e9d168

use triton quant fp8

0c8b931

enable aiter quant fp8 for gfx1201

c4b46fd

big-yellow-duck mentioned this pull request Mar 12, 2026

[ROCm] Enable VLLM triton FP8 moe for gfx1201, tuned for Qwen3-30B-A3B-FP8 tp=2 #79

Open

5 tasks

big-yellow-duck marked this pull request as ready for review March 16, 2026 04:50

big-yellow-duck requested a review from tjtanaa as a code owner March 16, 2026 04:50

fix formattingg

cd54e64

Signed-off-by: big-yellow-duck <jeffaw99@hotmail.com>

big-yellow-duck added 3 commits March 16, 2026 09:21

add conditional aiter_ops for gfx12x

d95a214

Signed-off-by: big-yellow-duck <jeffaw99@hotmail.com>

change to explicit aiter support archs

998b366

Signed-off-by: big-yellow-duck <jeffaw99@hotmail.com>

check supported arch at register custom ops

4afe02e

Signed-off-by: big-yellow-duck <jeffaw99@hotmail.com>

tjtanaa reviewed Mar 19, 2026


fix aiter_ops dynamic check gpu arch

63ae8d7

Signed-off-by: big-yellow-duck <jeffaw99@hotmail.com>

big-yellow-duck wants to merge 8 commits into main from rdna4-aiter-quantfp8


2 participants
