Imported from GitHub PR openxla#38507

📝 Summary of Changes: Updated functions in triton/support.cc and triton/support_test to execute correctly on ROCm.
🎯 Justification: support_test was failing on ROCm.
🚀 Kind of Contribution: 🐛 Bug Fix, 🧪 Tests
🧪 Unit Tests: Used the existing triton/support_test.
🧪 Execution Tests: None added.

Copybara import of the project:
-- e067431 by zoranjovanovic-ns <126815388+zoranjovanovic-ns@users.noreply.github.com>: Fixed triton_support_test on rocm.
-- 16f1f07 by Zoran Jovanovic <zjovanov@amd.com>: Review comments.
-- 731fb44 by Zoran Jovanovic <zjovanov@amd.com>: Code review 2.

Merging this change closes openxla#38507
COPYBARA_INTEGRATE_REVIEW=openxla#38507 from ROCm:ci_rocm-fix-triton-support-4 731fb44
PiperOrigin-RevId: 877900520
Imported from GitHub PR openxla#38742

📝 Summary of Changes: Created the expected output for FuseSubchannelDequantizationWithTranspose in triton/fusion_emitter_int4_device_test on ROCm.
🎯 Justification: triton/fusion_emitter_int4_device_test was failing on ROCm.
🚀 Kind of Contribution: 🧪 Tests
🧪 Unit Tests: triton/fusion_emitter_int4_device_test
🧪 Execution Tests: None added.

Copybara import of the project:
-- d00e6d1 by Zoran Jovanovic <zjovanov@amd.com>: Fix expected output in fusion_emitter_int4_device_test for ROCm.

Merging this change closes openxla#38742
COPYBARA_INTEGRATE_REVIEW=openxla#38742 from ROCm:rocm-fusion_emitter_int4_device_test d00e6d1
PiperOrigin-RevId: 880942122
Imported from GitHub PR openxla#38801

📝 Summary of Changes: Skipped CanNotEmitTritonCustomCallOnPreAmpereGpu in gpu_triton_custom_call_test for ROCm.
🎯 Justification: The unit test was failing on ROCm because emitting the Triton custom call does work there, so the pre-Ampere failure check does not apply.
🚀 Kind of Contribution: 🧪 Tests
🧪 Unit Tests: gpu_triton_custom_call_test
🧪 Execution Tests: None added.

Copybara import of the project:
-- 6cf15ac by Zoran Jovanovic <zjovanov@amd.com>: [ROCm] Skip CanNotEmitTritonCustomCallOnPreAmpereGpu in gpu_triton_custom_call_test for ROCm

Merging this change closes openxla#38801
COPYBARA_INTEGRATE_REVIEW=openxla#38801 from ROCm:rocm-fix-gpu_triton_custom_call_test 6cf15ac
PiperOrigin-RevId: 881953512
Imported from GitHub PR openxla#38759

📝 Summary of Changes: Modified IsDotAlgorithmSupportedByTriton to reflect the implementation on ROCm.
🎯 Justification: triton/dot_algorithms_test was failing on ROCm.
🚀 Kind of Contribution: 🐛 Bug Fix, 🧪 Tests
🧪 Unit Tests: triton/dot_algorithms_test
🧪 Execution Tests: None added.

Copybara import of the project:
-- 757877c by Zoran Jovanovic <zjovanov@amd.com>: Fixed dot_algorithms_test. Updated support_legacy and test itself.
-- 5685c9c by Zoran Jovanovic <zjovanov@amd.com>: Review comments.
-- d801633 by Zoran Jovanovic <zjovanov@amd.com>: Review comment.

Merging this change closes openxla#38759
COPYBARA_INTEGRATE_REVIEW=openxla#38759 from ROCm:rocm-fix-dot_algorithms_test d801633
PiperOrigin-RevId: 885489190
Imported from GitHub PR openxla#38792

Enable FissionBackend autotuning for ROCm (rocBLAS + hipBLASLt):
- Added HIPBLASLT_FISSION to the backend proto.
- Updated factory_rocm.cc to register the backends.
- The xla_gpu_experimental_disable_binary_libraries and xla_gpu_enable_cublaslt behavior mirrors CUDA.
- Minor fix: removed a dead return in fission_backend.cc.

🚀 Kind of Contribution: ✨ New Feature

Copybara import of the project:
-- ad88d8c by Eetu Sjöblom <eetu.sjoblom@amd.com>: Enable rocblas/hipBLASLt fission on ROCm

Merging this change closes openxla#38792
COPYBARA_INTEGRATE_REVIEW=openxla#38792 from ROCm:ci_rocm_enable_fission_upstream ad88d8c
PiperOrigin-RevId: 884317138 (cherry picked from commit 77ecf53)
Imported from GitHub PR openxla#37074

📝 Summary of Changes: Adapted GpuCompiler::AddConvAndGemmAutotuningPass to match the pre-refactor behavior of AMDGPUCompiler.
🎯 Justification: On ROCm the MIOpen backend must run even when autotuning is disabled, in order to decompose unsupported fused convolutions back; there is no runtime fallback.
🚀 Kind of Contribution: 🐛 Bug Fix
📊 Benchmark: N/A
🧪 Unit Tests: None
🧪 Execution Tests: None

Copybara import of the project:
-- 486498b by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>: [ROCm] Reenable miopen autotune when xla_gpu_autotune_level == 0
-- feec478 by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>: Fix //xla/backends/gpu/autotuner:miopen_test_amdgpu_any

Merging this change closes openxla#37074
COPYBARA_INTEGRATE_REVIEW=openxla#37074 from ROCm:miopen_autotune feec478
PiperOrigin-RevId: 877364967
Backport upstream fix that prevents a crash in the BlockPingpong optimization when applied to FMA (non-MFMA) dot operations. The pass now uses dyn_cast instead of cast for AMDMfmaEncodingAttr and returns early when the encoding is not MFMA.
Motivation
Applied code review comments from Claude.