[ROCm] Add AMD MI300X agent optimization support: HIP compat, knowledge base, program playbook #6
Open
ZJLi2013 wants to merge 1 commit into RightNow-AI:main
…, rocm-smi fallback
Enable AutoKernel on AMD Instinct MI300X (gfx942, CDNA3) with ROCm/HIP backend.
Changes:
- kernels/fused_mlp.py: replace tl.math.tanh with sigmoid identity
(tl.math.tanh is unavailable on Triton HIP backend, crashes at compile)
- knowledge/amd_cdna3_optimization.md: MI300X architecture guide for agents
(304 CUs, 64-thread wavefronts, MFMA, LDS, HBM3 hierarchy, perf counters)
- knowledge/workload_guidance.md: bottleneck-aware optimization strategies
(memory/compute/latency-bound decision framework for Triton and HIP)
- program.md: add AMD CDNA3 (gfx942) Tier 5 optimization playbook
- prepare.py: add rocm-smi fallback when nvidia-smi is unavailable
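The fused_mlp.py change above relies on the identity tanh(x) = 2·sigmoid(2x) − 1 (in a Triton kernel this would presumably read `2 * tl.sigmoid(2 * x) - 1`). A quick numerical check of the identity in plain Python, no Triton required:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x: float) -> float:
    # The substitution used in place of tl.math.tanh: tanh(x) = 2*sigmoid(2x) - 1
    return 2.0 * sigmoid(2.0 * x) - 1.0

# Agrees with the reference tanh across sign and magnitude
for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert abs(tanh_via_sigmoid(x) - math.tanh(x)) < 1e-12
```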
Tested on MI300X (gfx942), ROCm 6.4, PyTorch 2.6.0, Triton 3.2.0 HIP backend.
All 9 Triton starter kernels pass correctness checks on MI300X.
AMD optimization knowledge sourced from AMD-AGI/GEAK with attribution.
Made-with: Cursor
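The architecture figures cited above (304 CUs, 64-thread wavefronts) feed a simple occupancy calculation when sizing Triton blocks. A rough sketch under those published specs (the helper names are illustrative, not from the PR; real occupancy also depends on registers and LDS usage):

```python
import math

# MI300X (gfx942) figures from the knowledge base: 304 CUs, 64-thread wavefronts
CUS = 304
WAVEFRONT = 64

def wavefronts_per_block(block_threads: int) -> int:
    # Each block of block_threads threads occupies this many wavefronts
    return math.ceil(block_threads / WAVEFRONT)

def blocks_for_full_launch(block_threads: int, waves_per_cu: int = 1) -> int:
    # Blocks needed so every CU has at least waves_per_cu wavefronts resident
    # (a lower bound only; register and LDS pressure can reduce real occupancy)
    total_waves = CUS * waves_per_cu
    return math.ceil(total_waves / wavefronts_per_block(block_threads))
```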
Summary
Enable AutoKernel's optimization agent to work effectively on AMD Instinct MI300X (gfx942, CDNA3) with ROCm/HIP backend. This complements PR #3 (GPU detection) by adding the pieces needed for agents to actually optimize kernels on MI300X — compatibility fixes, architecture knowledge, and an optimization playbook.
Builds on top of v1.3.0 AMD GPU detection by @andyluo7.
Changes
HIP Compatibility Fix

- `kernels/fused_mlp.py`: Replace `tl.math.tanh` with the sigmoid identity (tanh(x) = 2·sigmoid(2x) − 1). `tl.math.tanh` is unavailable on the Triton HIP backend and crashes at compile time. Numerically equivalent on both CUDA and ROCm.

Agent Knowledge Base (new files)

- `knowledge/amd_cdna3_optimization.md` (249 lines): MI300X architecture reference — 304 CUs, 64-thread wavefronts, MFMA instructions, LDS/HBM3 memory hierarchy, `waves_per_eu` tuning, ROCProfiler counters. Curated from AMD-AGI/GEAK with attribution.
- `knowledge/workload_guidance.md` (134 lines): Bottleneck-aware optimization strategy framework — guides agents to prioritize kernel-body changes over parameter sweeps, with specific "Prefer First / Consider / Deprioritize" lists for memory-bound, compute-bound, and latency-bound workloads.

Agent Playbook

- `program.md`: Add MI300X (CDNA3, gfx942) sections to both the Triton Tier 5 and CUDA Tier 5 playbooks, covering wavefront sizing, MFMA, LDS limits, `waves_per_eu`, the `tl.math.tanh` workaround, profiling counters, and HIP compilation flags.

ROCm Environment Support

- `prepare.py`: Add a `rocm-smi` fallback for driver detection when `nvidia-smi` is unavailable. Also detects the ROCm/HIP version from `torch.version.hip`.

Testing
Tested end-to-end on AMD Instinct MI300X (gfx942, ROCm 6.4, PyTorch 2.6.0, Triton 3.2.0 HIP backend):
- `bench.py` correctness checks pass (smoke test, shape sweep, numerical stability, determinism, edge cases)
- `prepare.py` correctly detects the ROCm driver via `rocm-smi`

Optimization results from closed-loop agent testing (for reference, not included in this PR):

| Kernel | Starter → Optimized | Key technique |
|--------|---------------------|---------------|
| flash_attention | 0.50x → 2.20x | Remove `.to(tl.float32)` on `tl.dot` inputs → enable MFMA FP16 |
| softmax | 1.16x → 2.26x | Multi-row processing (4 rows/program) |
| matmul | 0.44x → 0.53x | 3-arg `tl.dot(a, b, acc)` + autotune |
| rotary_embedding | 0.82x → 1.09x | Native dtype computation |
| fused_mlp | 0.92x → 1.02x | Grouped tile ordering + autotune |
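On the 3-argument form credited in the matmul row: `tl.dot(a, b, acc)` fuses each K-tile's partial product with the accumulator update that the 2-argument form would leave as a separate add. A pure-Python stand-in for that loop structure (hypothetical 2×2 tiles, not Triton code):

```python
def dot_acc(a, b, acc):
    # Analogue of 3-arg tl.dot(a, b, acc): multiply-accumulate in one step
    n = len(a)
    return [[acc[i][j] + sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Two K-tiles of a tiled matmul, accumulated tile by tile
a_tiles = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
b_tiles = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]

acc = [[0, 0], [0, 0]]
for a_t, b_t in zip(a_tiles, b_tiles):
    acc = dot_acc(a_t, b_t, acc)  # in Triton: acc = tl.dot(a_t, b_t, acc)
```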
Zero NVIDIA impact
- `fused_mlp.py`: the sigmoid identity is mathematically equivalent to `tl.math.tanh` and works on both backends
- `prepare.py`: the `rocm-smi` path only triggers when `nvidia-smi` fails (never on NVIDIA)
- `program.md` / `knowledge/`: additive content in new sections, no existing content modified
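The fallback ordering described for prepare.py can be sketched roughly as follows (a hypothetical helper, not the actual prepare.py code): try `nvidia-smi` first, and only consult `rocm-smi` when it is absent or fails. The ROCm/HIP version can then be read from `torch.version.hip` when PyTorch is built for ROCm.

```python
import shutil
import subprocess

def detect_gpu_driver() -> str:
    # Prefer nvidia-smi; fall back to rocm-smi only when nvidia-smi is
    # missing or errors out, so the NVIDIA path is never affected (sketch).
    for tool in ("nvidia-smi", "rocm-smi"):
        if shutil.which(tool) is None:
            continue
        try:
            subprocess.run([tool], check=True, capture_output=True, timeout=10)
            return tool
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            continue
    return "none"
```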