Conversation
    parser.add_argument("--seed", type=int, default=1234, help="RNG seed.")
    parser.add_argument(
-       "--fp8", action="store_true", default=False, help="Enables the te.fp8_autocast() context."
+       "--fp8", action="store_true", default=False, help="Enables the te.autocast() context."
Up to TE v2.8, I think it's still fp8_autocast. Were you targeting higher versions?

I think you had a few comments on this, so I will address them all here. I moved the UB code up to release 2.10, as there were a few bugs and inefficiencies that NV fixed. Most of the changes that aren't guarded in the files are NV upstream changes.
I am fixing up the te_layer_with_overlap differences, and working on integrating the benchmark script into the file directly.
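Since this thread hinges on the autocast rename, here is a minimal sketch of a version guard that would keep the example working on both sides. The helper is hypothetical, and it assumes the rename from te.fp8_autocast() to te.autocast() landed by release 2.10, per the discussion above:

```python
def autocast_ctx_name(te_version: str) -> str:
    """Pick the autocast context-manager name for a given TE version.

    Hypothetical helper: assumes te.fp8_autocast() was renamed to
    te.autocast() by release 2.10, as discussed in this thread.
    """
    major, minor = (int(p) for p in te_version.split(".")[:2])
    return "autocast" if (major, minor) >= (2, 10) else "fp8_autocast"
```

In the benchmark script one could then do something like `getattr(te, autocast_ctx_name(te.__version__))` instead of hard-coding either name.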
# This file was modified for portability to AMDGPU
# Copyright (c) 2025-2026, Advanced Micro Devices, Inc. All rights reserved.
# Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Does this file share a lot of code with examples/pytorch/comm_gemm_overlap/te_layer_with_overlap.py? Is it possible to consolidate those two files?
import transformer_engine.pytorch.cpp_extensions as tex
from transformer_engine.pytorch.fp8 import FP8GlobalStateManager

from transformer_engine.jax.cpp_extensions.misc import is_hip_extension
Let's not import jax-specific code into the pytorch side. Use this instead:
Good catch, this is a mistake. Will fix.
-   if (_ub_comm->myrank == 0) printf("!!! [UB] Register UBuf %d\n", _ub_reg);
+   if (_ub_comm->myrank == 0) {
+     printf("!!! [UB] Register UBuf %d\n", _ub_reg);
+   }
I would prefer aligning the coding style with NV upstream so it's easier for us to maintain/IFU later
          allgather_handle, barrier_handle, tp_size, num_max_streams, comm_cga_size,
          gemm_priority, comm_priority, num_comm_sm, set_sm_margin, use_ce,
          atomic_gemm) {
+   initialize(buffer_shape, buffer_dtype, comm_type, aggregate);
Same question here for the motivation of this initialize function in the constructor
transformer_engine/common/comm_gemm_overlap/comm_gemm_overlap.cpp (resolved)
    NVTE_CHECK_CUDA(cudaMemset((*comm)->flags_baseptr, 0, 2 * GPU_PAGE_SIZE));
    (*comm)->flags = reinterpret_cast<int *>(
#ifdef __HIP_PLATFORM_AMD__
        (reinterpret_cast<uintptr_t>((*comm)->flags) + GPU_PAGE_SIZE - 1) & GPU_PAGE_MASK);
Should it be (*comm)->flags_baseptr, as in the NV upstream below? (*comm)->flags is not allocated/assigned above.

Yes, I have fixed that. Thanks!
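For reference, the rounding that the fix applies to flags_baseptr can be restated in isolation. This is a pure-Python sketch, assuming the usual 64 KiB GPU page size; the real code of course operates on a device pointer in C++:

```python
GPU_PAGE_SIZE = 1 << 16            # assumption: 64 KiB GPU page
GPU_PAGE_MASK = ~(GPU_PAGE_SIZE - 1)

def page_align_up(addr: int) -> int:
    # Round an address up to the next GPU page boundary, as the corrected
    # code does with (*comm)->flags_baseptr before assigning (*comm)->flags.
    return (addr + GPU_PAGE_SIZE - 1) & GPU_PAGE_MASK
```

Using the unassigned flags pointer instead of flags_baseptr here would align garbage rather than the freshly memset allocation, which is why the original line was wrong.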
    __syncthreads();
-   if (threadIdx.x == 0) __threadfence_system();
+   if (threadIdx.x == 0) __threadfence();
Looks like __threadfence_system() is now supported in ROCm 7.2: https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html#memory-fence-instructions
  void userbuffers_send(const int srchandler, const size_t srcoffset, const int dsthandler,
                        const size_t dstoffset, const size_t bytes, communicator *comm,
-                       const int peer, cudaStream_t stream) {
+                       const int peer, cudaStream_t stream, int ring_id) {
Hmm, I guess my question then would be: why does NV upstream not need a ring_id? Is it because we have a different implementation, i.e. NVTE_ROCM_MAX_RINGS?
    _comm_priority = comm_priority;
  }
-   for (int i = 0; i < std::min(num_max_streams, num_splits); i++) {
+   for (int i = 0; i < std::max(num_max_streams, num_splits); i++) {
In fact, do we need more streams than the min of num_max_streams and num_splits?

We do. I am convinced this is an upstream bug, as num_splits has a default value of 0, which seems off. Either way, we need at least one stream for each TP peer, which is what num_splits is supposed to be.
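The difference is easy to see with the defaults mentioned above. A pure-Python sketch of just the loop bound (not the TE code itself):

```python
def overlap_stream_count(num_max_streams: int, num_splits: int) -> int:
    # Upstream's std::min collapses to 0 streams when num_splits keeps its
    # default of 0; std::max guarantees at least one stream per TP peer
    # (num_splits) and never fewer than num_max_streams.
    return max(num_max_streams, num_splits)
```

With num_splits left at its default of 0, the upstream min() bound would create zero streams, which supports the upstream-bug reading.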
  NVTE_DIM_CHECK(chunk_height > 0 && chunk_width > 0, "Attempted to get empty tensor chunk");
  NVTE_DIM_CHECK(chunk_height <= height && chunk_width <= width,
                 "Attempted to get out-of-bounds tensor chunk");
#ifndef __HIP_PLATFORM_AMD__
Since we already support mxfp8, add a to-do comment so that we won't forget to turn it on later.

This is an ifndef, so it is enabled for us, since we don't have the padding issues.
  // Input data
  const size_t source_size = source.numel();
  const void *src_ptr = (rowwise) ? source.dptr() : source.columnwise_dptr();
Well, what if we need both row-wise and column-wise data? How about other fields of a tensor, for example scale_inv?

Within these functions we work with only column-wise or row-wise data. When we call hipBLASLt we pass in only one or the other for the GEMM, so there is no need for both within a single overlap call.
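A toy restatement of that contract (stub class standing in for the TE tensor, not the real API): each overlap call resolves exactly one layout before handing the pointer to the GEMM.

```python
class FakeTensor:
    """Stub for the TE tensor: holds one data pointer per layout."""
    def __init__(self, row_ptr, col_ptr):
        self._row, self._col = row_ptr, col_ptr

    def dptr(self):
        return self._row

    def columnwise_dptr(self):
        return self._col

def src_ptr_for_overlap(source: FakeTensor, rowwise: bool):
    # Mirrors: src_ptr = (rowwise) ? source.dptr() : source.columnwise_dptr();
    return source.dptr() if rowwise else source.columnwise_dptr()
```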
    "num_sm": 1 if method == "ring_exchange" else 16,
    "cga_size": 1 if method == "ring_exchange" else 2,
-   "set_sm_margin": not method == "ring_exchange",
+   "set_sm_margin": not method == "ring_exchange" and not IS_HIP_EXTENSION,
Ilya already has the sm_margin feature supported on ROCm.

This was a performance decision, not a functionality one. set_sm_margin seems to slow down UB on ROCm, probably because we have dedicated SDMA engines in use that don't require CU blocking.

Please add a comment that it is disabled for performance reasons, then.
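A sketch of what the requested comment could look like around the defaults above. The dict is simplified and the IS_HIP_EXTENSION stub is an assumption (in TE it comes from the build):

```python
IS_HIP_EXTENSION = True  # assumption: set elsewhere based on the build

def default_overlap_cfg(method: str) -> dict:
    return {
        "num_sm": 1 if method == "ring_exchange" else 16,
        "cga_size": 1 if method == "ring_exchange" else 2,
        # Disabled on ROCm purely for performance: dedicated SDMA engines
        # handle the comms, so reserving CUs only slows down the GEMM.
        "set_sm_margin": method != "ring_exchange" and not IS_HIP_EXTENSION,
    }
```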
    if IS_HIP_EXTENSION and user_ub_cfg is not None:
        for name, cfg in user_ub_cfg.items():
            assert cfg.get("method") != "bulk", (
                f"Bulk overlap method for '{name}' is not supported on HIP/ROCm. "
I recall we supported bulk overlap, but the performance was not great?

Yeonsoo was seeing some race conditions and weird hangs, and submitted a PR upstream. I am still seeing failures after rebasing to IFU 2.10, so I think the issue is still there. If we want to enable this, we should consider it for a different PR, as it will require a new implementation.
build_tools/hipify/custom_map.json
@@ -6,7 +6,14 @@
    "ATen/cudnn/Handle.h" : "ATen/miopen/Handle.h",
    "CUfunc_cache" : "hipFuncCache_t",
    "<nvtx3/nvToolsExt.h>" : "<roctracer/roctx.h>",
-   "cudaFuncSetAttribute(" : "hipFuncSetAttribute((const void*)"
+   "cudaFuncSetAttribute(" : "hipFuncSetAttribute((const void*)",
+   "cudaLaunchKernel": "hipLaunchKernel",
cudaLaunchKernel cannot be hipified?

Looks like this was very recently added to hipify_torch, so we can probably remove it after we update our hipify_torch commit. I would recommend we do that separately, however.

I can actually see it in the hipify_torch maps. I think the custom map is not needed after the recent TE hipification changes.

Looking back, I added this since we need to hipify cudaLaunchKernelExC as well as cudaLaunchKernel. The former is still not in the map, so I have updated the custom map to specifically pick up the ExC variant.
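In other words, only the ExC variant still needs a custom rule. A minimal sketch of what the updated entry might look like, expressed as a Python dict; the exact key/value strings are hypothetical and depend on how the hipify pass matches tokens:

```python
# Hypothetical excerpt of the updated custom map: plain cudaLaunchKernel is
# now covered by hipify_torch, so only the ExC variant keeps a custom rule.
custom_map = {
    "cudaLaunchKernelExC": "hipLaunchKernelExC",
}
```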
build_tools/hipify/custom_map.json
{
  "custom_map" : {
    "<cuda_bf16.h>" : "<hip/hip_bfloat16.h>",
    "util/cuda_runtime.h" : "util/hip_runtime.h",
It should be covered by line 5.

Thanks, have removed.
L3 CI -- missing distributed/test_cast_master_weights_to_fp8.py hotfix that is now in dev.
This is the userbuffer_epic branch, to be merged only once all epic tasks have been completed. PRs for epic tasks will target this branch.