[ROCm] Enable Aiter ck_gemm_a8w8_blockscale for RDNA4 gpus. Qwen3.5-27B-FP8 tp=2, Qwen3-0.6B-FP8 tp=1 #77
big-yellow-duck wants to merge 10 commits into main
Signed-off-by: big-yellow-duck <jeffaw99@hotmail.com>
Signed-off-by: big-yellow-duck <jeffaw99@hotmail.com>
We need to guard all of the other ops with an additional condition.
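One way to read this comment is that only the blockscale GEMM should become available on gfx12x, while every other Aiter op stays restricted to the platforms it already supported. The sketch below illustrates that gating under that assumption. `Platform`, `GFX12X_ENABLED_OPS`, `aiter_op_enabled`, and the op names are all hypothetical and are not taken from vLLM or from this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    """Hypothetical stand-in for the platform checks; not a vLLM type."""

    is_mi3xx: bool
    is_gfx12x: bool


# Hypothetical allow-list: the only op assumed to be enabled on gfx12x.
GFX12X_ENABLED_OPS = frozenset({"gemm_a8w8_blockscale"})


def aiter_op_enabled(op: str, platform: Platform, aiter_enabled: bool = True) -> bool:
    """Return True if the named Aiter op may be used on this platform."""
    if not aiter_enabled:
        return False
    if platform.is_mi3xx:
        # Assumed: MI3xx keeps access to every Aiter op.
        return True
    # The additional condition: gfx12x only gets ops on the allow-list.
    return platform.is_gfx12x and op in GFX12X_ENABLED_OPS
```

With this shape, widening gfx12x support later means adding an op name to the allow-list rather than editing each op's own check.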
Signed-off-by: big-yellow-duck <jeffaw99@hotmail.com>
Signed-off-by: big-yellow-duck <jeffaw99@hotmail.com>
Signed-off-by: big-yellow-duck <jeffaw99@hotmail.com>
Signed-off-by: big-yellow-duck <jeffaw99@hotmail.com>
vllm/_aiter_ops.py
import vllm.envs as envs
from vllm.platforms import current_platform
from vllm.platforms.rocm import on_gfx12x, on_mi3xx
Do not import here. It has to be a lazy import inside the functions.
Signed-off-by: big-yellow-duck <jeffaw99@hotmail.com>
Purpose
We enabled CK gemm_a8w8_blockscale in Aiter for gfx1201, but vLLM has not enabled support for it yet. This PR enables Aiter on gfx12xx for FP8 inference. The CK gemm_a8w8_blockscale kernel from Aiter provides better performance than the default untuned Triton kernel in vLLM.
The change enables the Aiter FP8 path on gfx12x cards so that they use the tuned CK GEMM configs from Aiter.
Test Plan
Benchmark Qwen3.5-27B-FP8 on 2x Radeon PRO 9700, comparing default vLLM against vLLM with Aiter enabled.
default
using aiter ck gemm_a8w8_blockscale
Test Results
Benchmark using Qwen/Qwen3.5-27B-FP8
TTFT (ms)
TPOT (ms)
E2E Latency (ms)
Accuracy checks
GSM8K Accuracy
None of the accuracy differences are statistically significant.