
[gfx1201] Enable RMSNorm support for gfx1201 #4

Closed

big-yellow-duck wants to merge 1460 commits into main from rdna4-rmsnorm-support

Conversation

@big-yellow-duck

Motivation

The RMSNorm kernels in csrc/kernels/rmsnorm_quant_kernels.cu use CDNA-specific inline assembly instructions that are not supported on the RDNA4 (gfx1201) architecture, so the RMSNorm operation fails on gfx1201 GPUs. This PR enables RMSNorm on gfx1201 by replacing the unsupported assembly instructions with portable HIP/C++ alternatives.

Technical Details

Changes Overview

Modified csrc/kernels/rmsnorm_quant_kernels.cu to replace CDNA-specific inline assembly with portable implementations for gfx11/gfx12:

  1. Replaced v_pk_mul_f32 inline assembly (lines 146-151, 196-201)

    • Changed from: asm volatile("v_pk_mul_f32 %0, %1, %2" ...)
    • Changed to: standard float multiplication under a __gfx11__ || __gfx12__ guard
    • v_pk_mul_f32 is not supported on the RDNA4 (gfx12xx) architecture
  2. Replaced bf16 unpacking inline assembly (lines 162-176)

    • Changed from: v_lshlrev_b32_e32 and v_and_b32_e32 instructions
    • Changed to: ck_tile::bit_cast with shift operations for unpacking bf16 values
    • Provides equivalent functionality using portable HIP/C++ code
  3. Replaced fp16 unpacking inline assembly (lines 180-194)

    • Changed from: v_cvt_f32_f16_e32 and v_cvt_f32_f16_sdwa instructions
    • Changed to: ck_tile::bit_cast with shift operations for unpacking fp16 values
    • SDWA (Sub-Dword Addressing) instructions are CDNA-specific

Compatibility

  • CDNA (gfx90a, gfx942): no functional change; the optimized inline-assembly path is unchanged
  • RDNA4 (gfx1201): now uses the portable HIP/C++ implementation
  • All changes are guarded by preprocessor conditions (#if defined(__gfx11__) || defined(__gfx12__)) so each architecture keeps its best-performing code path

Test Plan

Run the RMSNorm test suites to validate the changes:

# Test fused RMSNorm with add and quantization (FP8)
python op_tests/test_rmsnorm2dFusedAddQuant.py --mode 7 -q fp8

# Test standard RMSNorm operations
python op_tests/test_rmsnorm2d.py

Tests cover:

  • RMSNorm with residual addition
  • FP8 quantization paths
  • Various hidden dimension sizes (up to 8192)
  • Both bf16 and fp16 input data types
  • Per-token and per-channel quantization modes

Test Result

Test result for `python op_tests/test_rmsnorm2dFusedAddQuant.py --mode 7 -q fp8` (all rows: quant_type = 4, add_residual = True, dtype = torch.bfloat16, quant_dtype = torch.float8_e4m3fn, smoothquant = False):

| m | n | torch us | hip us | hip err | hip bw (GB/s) |
|---|---|---|---|---|---|
| 8 | 1024 | 33.2036 | 2.18041 | 0.0317383 | 25.4776 |
| 256 | 1024 | 100.013 | 4.89922 | 0.0306206 | 350.774 |
| 2048 | 1024 | 163.156 | 22.8312 | 0.0308919 | 601.581 |
| 2560 | 1024 | 141.778 | 23.6688 | 0.0309967 | 725.346 |
| 32768 | 1024 | 2402.92 | 390.831 | 0.0310231 | 562.208 |
| 8 | 2048 | 24.3369 | 2.30625 | 0.0294189 | 48.1748 |
| 256 | 2048 | 84.8793 | 14.8431 | 0.0314503 | 231.558 |
| 2048 | 2048 | 257.192 | 39.1746 | 0.0309095 | 701.209 |
| 2560 | 2048 | 308.477 | 47.5147 | 0.0309946 | 722.641 |
| 32768 | 2048 | 5155.18 | 777.68 | 0.0310699 | 565.087 |
| 8 | 4096 | 53.5412 | 2.74501 | 0.0287476 | 80.9491 |
| 256 | 4096 | 105.035 | 18.4179 | 0.0307817 | 373.228 |
| 2048 | 4096 | 400.869 | 102.266 | 0.0310012 | 537.218 |
| 2560 | 4096 | 590.208 | 126.793 | 0.0310267 | 541.608 |
| 32768 | 4096 | 9250.57 | 1574.4 | 0.0310284 | 558.254 |
| 8 | 8192 | 36.7148 | 2.96228 | 0.0305786 | 150.024 |
| 256 | 8192 | 116.373 | 31.7526 | 0.0309405 | 432.977 |
| 2048 | 8192 | 1132.48 | 200.516 | 0.0310664 | 547.979 |
| 2560 | 8192 | 1471.56 | 249.542 | 0.0309757 | 550.385 |
| 32768 | 8192 | 18541.1 | 3156.56 | 0.0310378 | 556.88 |

Submission Checklist

@github-actions

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
|---|---|
| `ci:sglang` | SGLang integration tests |
| `ci:atom` | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| `ci:vllm` | vLLM benchmark |
| `ci:all` | All of the above |

Add labels via the sidebar or `gh pr edit 4 --add-label <label>`
