add tuned config for qwen3.5 fp8, a8w8 blockscale gemm #2324
Pull request overview
This PR adds a much larger set of tuned CK blockscale GEMM configurations intended to eliminate “missing tuned config” lookup/logging overhead for common Qwen3.5 FP8 benchmark shapes, so sglang’s reported throughput better reflects actual compute performance.
Changes:
- Expands the `a8w8_blockscale_tuned_gemm.csv` tuned-shape database from a handful of entries to a comprehensive set covering many small/medium `M` values and common `N`/`K` combinations.
- Adds both `ck` and `cktile` libtype entries for various shapes to improve the hit rate in the runtime config lookup.
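The runtime config lookup that these rows feed can be sketched as follows. This is a hypothetical illustration, not aiter's actual implementation: the function names (`load_tuned_configs`, `get_config`) and the interpretation of the first four CSV columns as a (cu_num, M, N, K) key are assumptions inferred from the rows shown in the diff.

```python
import csv
import io

def load_tuned_configs(csv_text):
    """Parse tuned-config rows into a dict keyed by (cu_num, M, N, K).

    Assumed column layout, based on the diff rows:
    cu_num, M, N, K, libtype, ..., kernel_name, <measured metrics...>
    """
    table = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        cu_num, m, n, k = (int(x) for x in row[:4])
        libtype, kernel_name = row[4], row[8]
        table[(cu_num, m, n, k)] = (libtype, kernel_name)
    return table

def get_config(table, cu_num, m, n, k):
    """Return the tuned (libtype, kernel) for a shape, or None on a miss
    (the caller would then fall back to a default kernel and log a warning)."""
    return table.get((cu_num, m, n, k))

# One row adapted from the diff (kernel name and metrics abbreviated):
rows = "256,1,256,4096,ck,8,0,11.6484,a8w8_blockscale_..._intrawave_v1,0.18,90.41,0.0\n"
table = load_tuned_configs(rows)
print(get_config(table, 256, 1, 256, 4096))  # exact-match hit
print(get_config(table, 256, 5, 256, 4096))  # miss -> None
```

With this kind of exact-match keying, every benchmark shape that is absent from the CSV misses on each call, which is why widening the shape coverage removes the per-call overhead.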
256,128,4096,1280,ck,7,0,7.4194,a8w8_blockscale_1x128x128_256x16x128x256_16x16_16x16_1x2_16x16x1_16x16x1_1x16x1x16_8_1x2_intrawave_v1,180.9,870.06,0.0
256,1,256,4096,ck,8,0,11.6484,a8w8_blockscale_1x128x128_256x16x64x256_16x16_16x16_1x1_16x16x1_16x16x1_1x16x1x16_4_1x1_intrawave_v1,0.18,90.41,0.0
256,2,256,4096,ck,8,0,11.5873,a8w8_blockscale_1x128x128_256x16x64x256_16x16_16x16_1x1_16x16x1_16x16x1_1x16x1x16_4_1x1_intrawave_v1,0.36,91.29,0.0
256,3,256,4096,ck,8,0,11.6965,a8w8_blockscale_1x128x128_256x16x64x256_16x16_16x16_1x1_16x16x1_16x16x1_1x16x1x16_4_1x1_intrawave_v1,0.54,90.83,0.0
OK, let me check the shapes and delete some redundant config items.
@yzhou103
256,15872,4096,128,cktile,10,0,39.3741,a8w8_blockscale_cktile_128x128x128_2x2x1_16x16x128_intrawave_0x1x0_2,419.89,3344.98,0.0
256,16128,4096,128,cktile,10,0,39.0049,a8w8_blockscale_cktile_128x128x128_2x2x1_16x16x128_intrawave_0x1x0_2,430.16,3426.56,0.0
256,16256,4096,128,cktile,10,0,38.733,a8w8_blockscale_cktile_128x128x128_2x2x1_16x16x128_intrawave_0x1x0_2,436.91,3480.26,0.0
256,16384,4096,128,cktile,10,0,39.0112,a8w8_blockscale_cktile_128x128x128_2x2x1_16x16x128_intrawave_0x1x0_2,439.31,3499.16,0.0
It seems the same kernel is chosen for M from 1536 to 16384, so we can keep M = 1536, 2048, 4096, 8192, 16384 and remove the others.
256,15872,2560,4096,cktile,11,0,225.2821,a8w8_blockscale_cktile_192x256x128_4x2x1_16x16x128_intrawave_0x1x0_1,1467.75,691.55,0.0
256,16128,2560,4096,cktile,11,0,225.3868,a8w8_blockscale_cktile_192x256x128_4x2x1_16x16x128_intrawave_0x1x0_1,1488.84,700.8,0.0
256,16256,2560,4096,cktile,11,0,224.6703,a8w8_blockscale_cktile_192x256x128_4x2x1_16x16x128_intrawave_0x1x0_1,1506.47,708.7,0.0
256,16384,2560,4096,cktile,11,0,226.3441,a8w8_blockscale_cktile_192x256x128_4x2x1_16x16x128_intrawave_0x1x0_1,1514.32,711.8,0.0
It seems the same kernel is chosen for M from 4096 to 16384, so we can keep M = 4096, 8192, 16384 and remove the others.
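Since one kernel wins across a wide M range, the table only needs a few representative M values, with an incoming M rounded up to the nearest kept bucket. A minimal sketch of that rounding; the bucket list follows the review suggestion above, and the `bucket_m` helper is hypothetical, not aiter's API:

```python
import bisect

# Hypothetical kept M values, per the review suggestion above.
M_BUCKETS = [1536, 2048, 4096, 8192, 16384]

def bucket_m(m, buckets=M_BUCKETS):
    """Round M up to the nearest kept bucket. Returns None when M exceeds
    the largest tuned value (the caller would fall back to a default kernel)."""
    i = bisect.bisect_left(buckets, m)
    return buckets[i] if i < len(buckets) else None

print(bucket_m(1536))   # exact hit -> 1536
print(bucket_m(3000))   # rounded up -> 4096
print(bucket_m(20000))  # beyond tuned range -> None
```

Rounding up is the conservative choice here: a kernel tuned for a larger M still covers the smaller problem, at the cost of some tile-level padding.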
Yes, I will modify this config and run the tests again locally.
Great, none of the test cases report warnings anymore.

Add tuned config for Qwen3.5-397B-A17B-FP8 CK blockscale GEMM
Motivation
According to the data on https://inferencex.semianalysis.com/, the throughput performance of Qwen3.5 FP8 is significantly worse than BF16. Here are our benchmark results (docker image: rocm/sgl-dev:v0.5.9-rocm720-mi35x-20260316):
Root Cause
The root cause is that the FP8 model uses CK blockscale GEMM, which queries tuning configurations at runtime. However, most GEMM shapes have no matching tuned config, so the lookup overhead is wasted and a large number of warning messages are printed. All of this time is counted by sglang, so the FP8 throughput in sglang's performance report falls far behind BF16, even though this does not reflect the actual compute performance. Below are the runtime logs from identical configurations (concurrency=32, prompts=32, input=1024, output=1024), comparing BF16 and FP8:
As shown above, while maintaining 32 concurrent requests, FP8 actually outperforms BF16 in generation throughput (1640 vs. 1503 tok/s). However, sglang's final performance report shows FP8 far behind BF16 (158 vs. 1298 tok/s) due to the overhead from missing tuned GEMM configs.
Solution
This PR provides CK blockscale GEMM tuning configurations for the following common benchmark scenarios:
Performance
Below are the FP8 benchmark results after applying the tuning configurations:
After applying the tuning config, sglang now reports correct performance numbers for FP8. The runtime logs also show that FP8 achieves slightly higher generation throughput than BF16. Using the same configuration (concurrency=32, prompts=32, input=1024, output=1024) as an example, here are the logs after tuning: