[gfx1201] Enable quantization kernels for gfx1201 #2351
Motivation
FP8 quantization operations fail on the AMD gfx1201 (RDNA4) architecture due to three compatibility issues:
- the default FP8 dtype mapping does not include gfx1201
- the v_pk_mul_f32 assembly instruction is not supported on gfx11/gfx12
- DPP broadcast operations are not supported on gfx11/gfx12

This PR enables FP8 quantization support on gfx1201 by addressing these incompatibilities.
Technical Details
1. FP8 Dtype Registration (`aiter/utility/dtypes.py`)

Added gfx1201 to the default FP8 dtype mapping to enable `torch.float8_e4m3fn` support on RDNA4.
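A minimal sketch of what such a mapping change looks like; the dictionary name, the neighbouring entry, and the helper function are illustrative assumptions, not the actual contents of `dtypes.py`:

```python
# Illustrative only: the real mapping in aiter/utility/dtypes.py may be named
# and structured differently. The relevant point is the new gfx1201 entry.
import torch

FP8_DTYPE_BY_ARCH = {
    "gfx942": torch.float8_e4m3fnuz,   # existing CDNA3 entry (illustrative)
    "gfx1201": torch.float8_e4m3fn,    # new: RDNA4 uses the OCP e4m3 format
}

def default_fp8_dtype(gfx_arch: str) -> torch.dtype:
    # Hypothetical helper: fall back to the fnuz variant for unlisted targets.
    return FP8_DTYPE_BY_ARCH.get(gfx_arch, torch.float8_e4m3fnuz)
```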
2. Scalar Multiplication Fallback (`csrc/include/ck_tile/vec_convert.h`)

The `v_pk_mul_f32` assembly instruction is not supported on gfx11/gfx12. Added an `amd_scalar_mul_f32()` function as a portable fallback. The conversion functions `fp32x2_t_to_fp8x2_t` and `fp32x2_t_to_int8x2_t` now conditionally use the scalar path.
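A compilable sketch of the idea, assuming an ext_vector float-pair type and per-architecture predefines as the guard; only `amd_scalar_mul_f32()` itself is named in the PR, everything else below is illustrative:

```cpp
// Sketch of the fallback; fp32x2_t's definition, the guard, and the caller are
// assumptions -- only amd_scalar_mul_f32() itself comes from the PR.
#include <hip/hip_runtime.h>

typedef float fp32x2_t __attribute__((ext_vector_type(2)));

// Portable replacement for the packed v_pk_mul_f32 instruction: two ordinary
// scalar multiplies, which lower to plain v_mul_f32 on every AMD GPU.
__device__ inline fp32x2_t amd_scalar_mul_f32(fp32x2_t a, fp32x2_t b)
{
    fp32x2_t c;
    c.x = a.x * b.x;
    c.y = a.y * b.y;
    return c;
}

// Scaling step as used by fp32x2_t_to_fp8x2_t / fp32x2_t_to_int8x2_t: keep the
// packed multiply where it exists, take the scalar path elsewhere (the guard
// below is illustrative; the real code keys specifically off gfx11/gfx12).
__device__ inline fp32x2_t apply_scale(fp32x2_t v, fp32x2_t scale)
{
#if defined(__gfx90a__) || defined(__gfx940__) || defined(__gfx941__) || defined(__gfx942__)
    fp32x2_t c;
    asm volatile("v_pk_mul_f32 %0, %1, %2" : "=v"(c) : "v"(v), "v"(scale));
    return c;
#else
    return amd_scalar_mul_f32(v, scale);   // gfx11/gfx12 and other targets
#endif
}
```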
3. DPP Broadcast Replacement (`csrc/include/hip_reduce.h`)

DPP broadcast operations are not supported on gfx11/gfx12. Replaced with `rocprim::warp_shuffle()` for cross-lane communication in:
- `wave_reduce()` - for the WarpSize > 16 and WarpSize > 32 reductions
- `multithread_reduce()` - for the 16-thread and 32-thread reduction paths

Example change:
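The reduction body and template parameters below are reconstructed assumptions; only the move from a DPP broadcast to `rocprim::warp_shuffle()` comes from the PR:

```cpp
// Sketch of the wave_reduce() pattern after the change.
#include <hip/hip_runtime.h>
#include <rocprim/rocprim.hpp>

template <typename T, int WarpSize>
__device__ T wave_reduce_sum(T val)
{
    // Tree reduction: after this loop, lane 0 holds the sum of the wave.
    for (int offset = WarpSize / 2; offset > 0; offset >>= 1)
        val += rocprim::warp_shuffle_down(val, offset, WarpSize);

    // Broadcast lane 0's result to every lane of the wave.
    // Before: a DPP broadcast (e.g. v_mov_b32_dpp with row_bcast), CDNA-only.
    // After: rocprim::warp_shuffle(), which also works on gfx11/gfx12.
    return rocprim::warp_shuffle(val, 0, WarpSize);
}
```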
4. Naive Load to LDS Fallback (`csrc/kernels/quant_kernels.cu`)

On gfx12x, fall back to naive loading from global memory to LDS in `smooth_per_token_scaled_quant_kernel`.
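A minimal sketch of this fallback; the helper name and template parameters are assumptions, and only the "plain per-thread copy into LDS" idea is from the PR:

```cpp
// On gfx12x, the kernel stages each input row into LDS with an ordinary
// strided loop through registers instead of a direct global-to-LDS load.
#include <hip/hip_runtime.h>

template <typename T, int BLOCK_SIZE>
__device__ void naive_load_to_lds(const T* __restrict__ src,
                                  T* __restrict__ lds,
                                  int row_len)
{
    for (int i = threadIdx.x; i < row_len; i += BLOCK_SIZE)
        lds[i] = src[i];     // thread t copies elements t, t+BLOCK_SIZE, ...
    __syncthreads();         // make the staged row visible to the whole block
}
```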
Test Plan

Run the quantization test suite with various tensor sizes:
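```bash
# Hypothetical invocation -- assumes a pytest-based quantization suite under
# op_tests/; the actual test file name and flags used in the PR may differ.
pytest op_tests/test_quant.py -v
```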
Test Result
All quantization tests pass successfully on gfx1201.
Submission Checklist