add tuned config for qwen3.5 fp8, a8w8 blockscale gemm #2324
Pull request overview
This PR adds a much larger set of tuned CK blockscale GEMM configurations intended to eliminate “missing tuned config” lookup/logging overhead for common Qwen3.5 FP8 benchmark shapes, so sglang’s reported throughput better reflects actual compute performance.
Changes:
- Expands the `a8w8_blockscale_tuned_gemm.csv` tuned-shape database from a handful of entries to a comprehensive set covering many small/medium `M` values and common `N`/`K` combinations.
- Adds both `ck` and `cktile` libtype entries for various shapes to improve the hit rate in the runtime config lookup.
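The runtime config lookup that these rows feed can be sketched as follows. This is a hypothetical illustration, not aiter's actual implementation: the function names (`load_tuned_configs`, `get_config`) and the interpretation of the first four CSV columns as a (cu_num, M, N, K) key are assumptions inferred from the rows shown in the diff.

```python
import csv
import io

def load_tuned_configs(csv_text):
    """Parse tuned-config rows into a dict keyed by (cu_num, M, N, K).

    Assumed column layout, based on the diff rows:
    cu_num, M, N, K, libtype, ..., kernel_name, <measured metrics...>
    """
    table = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        cu_num, m, n, k = (int(x) for x in row[:4])
        libtype, kernel_name = row[4], row[8]
        table[(cu_num, m, n, k)] = (libtype, kernel_name)
    return table

def get_config(table, cu_num, m, n, k):
    """Return the tuned (libtype, kernel) for a shape, or None on a miss
    (the caller would then fall back to a default kernel and log a warning)."""
    return table.get((cu_num, m, n, k))

# One row adapted from the diff (kernel name and metrics abbreviated):
rows = "256,1,256,4096,ck,8,0,11.6484,a8w8_blockscale_..._intrawave_v1,0.18,90.41,0.0\n"
table = load_tuned_configs(rows)
print(get_config(table, 256, 1, 256, 4096))  # exact-match hit
print(get_config(table, 256, 5, 256, 4096))  # miss -> None
```

With this kind of exact-match keying, every benchmark shape that is absent from the CSV misses on each call, which is why widening the shape coverage removes the per-call overhead.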
256,128,4096,1280,ck,7,0,7.4194,a8w8_blockscale_1x128x128_256x16x128x256_16x16_16x16_1x2_16x16x1_16x16x1_1x16x1x16_8_1x2_intrawave_v1,180.9,870.06,0.0
256,1,256,4096,ck,8,0,11.6484,a8w8_blockscale_1x128x128_256x16x64x256_16x16_16x16_1x1_16x16x1_16x16x1_1x16x1x16_4_1x1_intrawave_v1,0.18,90.41,0.0
256,2,256,4096,ck,8,0,11.5873,a8w8_blockscale_1x128x128_256x16x64x256_16x16_16x16_1x1_16x16x1_16x16x1_1x16x1x16_4_1x1_intrawave_v1,0.36,91.29,0.0
256,3,256,4096,ck,8,0,11.6965,a8w8_blockscale_1x128x128_256x16x64x256_16x16_16x16_1x1_16x16x1_16x16x1_1x16x1x16_4_1x1_intrawave_v1,0.54,90.83,0.0
OK, let me check the shapes and delete some redundant config items.
@yzhou103
256,15872,4096,128,cktile,10,0,39.3741,a8w8_blockscale_cktile_128x128x128_2x2x1_16x16x128_intrawave_0x1x0_2,419.89,3344.98,0.0
256,16128,4096,128,cktile,10,0,39.0049,a8w8_blockscale_cktile_128x128x128_2x2x1_16x16x128_intrawave_0x1x0_2,430.16,3426.56,0.0
256,16256,4096,128,cktile,10,0,38.733,a8w8_blockscale_cktile_128x128x128_2x2x1_16x16x128_intrawave_0x1x0_2,436.91,3480.26,0.0
256,16384,4096,128,cktile,10,0,39.0112,a8w8_blockscale_cktile_128x128x128_2x2x1_16x16x128_intrawave_0x1x0_2,439.31,3499.16,0.0
It seems the same kernel is chosen for M from 1536 to 16384, so we can keep M = 1536, 2048, 4096, 8192, 16384 and remove the others.
256,15872,2560,4096,cktile,11,0,225.2821,a8w8_blockscale_cktile_192x256x128_4x2x1_16x16x128_intrawave_0x1x0_1,1467.75,691.55,0.0
256,16128,2560,4096,cktile,11,0,225.3868,a8w8_blockscale_cktile_192x256x128_4x2x1_16x16x128_intrawave_0x1x0_1,1488.84,700.8,0.0
256,16256,2560,4096,cktile,11,0,224.6703,a8w8_blockscale_cktile_192x256x128_4x2x1_16x16x128_intrawave_0x1x0_1,1506.47,708.7,0.0
256,16384,2560,4096,cktile,11,0,226.3441,a8w8_blockscale_cktile_192x256x128_4x2x1_16x16x128_intrawave_0x1x0_1,1514.32,711.8,0.0
It seems the same kernel is chosen for M from 4096 to 16384, so we can keep M = 4096, 8192, 16384 and remove the others.
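Since one kernel wins across a wide M range, the table only needs a few representative M values, with an incoming M rounded up to the nearest kept bucket. A minimal sketch of that rounding; the bucket list follows the review suggestion above, and the `bucket_m` helper is hypothetical, not aiter's API:

```python
import bisect

# Hypothetical kept M values, per the review suggestion above.
M_BUCKETS = [1536, 2048, 4096, 8192, 16384]

def bucket_m(m, buckets=M_BUCKETS):
    """Round M up to the nearest kept bucket. Returns None when M exceeds
    the largest tuned value (the caller would fall back to a default kernel)."""
    i = bisect.bisect_left(buckets, m)
    return buckets[i] if i < len(buckets) else None

print(bucket_m(1536))   # exact hit -> 1536
print(bucket_m(3000))   # rounded up -> 4096
print(bucket_m(20000))  # beyond tuned range -> None
```

Rounding up is the conservative choice here: a kernel tuned for a larger M still covers the smaller problem, at the cost of some tile-level padding.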
Yes, I will modify this config and run the tests again locally.
Great, none of the test cases report warnings anymore.

Add tuned config for Qwen3.5-397B-A17B-FP8 CK blockscale GEMM
Motivation
According to the data on https://inferencex.semianalysis.com/, the throughput performance of Qwen3.5 FP8 is significantly worse than BF16. Here are our benchmark results (docker image: rocm/sgl-dev:v0.5.9-rocm720-mi35x-20260316):
Root Cause
The root cause is that the FP8 model uses CK blockscale GEMM, which queries tuning configurations at runtime. However, most GEMM shapes have no matching tuned config, so the lookup overhead is wasted and a large number of warning messages are printed. All of this time is counted by sglang, so the FP8 throughput in sglang's performance report falls far behind BF16, even though this does not reflect the actual compute performance. Below are the runtime logs from identical configurations (concurrency=32, prompts=32, input=1024, output=1024), comparing BF16 and FP8:
As shown above, while maintaining 32 concurrent requests, FP8 actually outperforms BF16 in generation throughput (1640 vs. 1503 tok/s). However, sglang's final performance report shows FP8 far behind BF16 (158 vs. 1298 tok/s) due to the overhead from missing tuned GEMM configs.
Solution
This PR provides CK blockscale GEMM tuning configurations for the following common benchmark scenarios:
Performance
Below are the FP8 benchmark results after applying the tuning configurations:
After applying the tuning config, sglang now reports correct performance numbers for FP8. The runtime logs also show that FP8 achieves slightly higher generation throughput than BF16. Using the same configuration (concurrency=32, prompts=32, input=1024, output=1024) as an example, here are the logs after tuning: