Conversation
    parser.add_argument("--seed", type=int, default=1234, help="RNG seed.")
    parser.add_argument(
-       "--fp8", action="store_true", default=False, help="Enables the te.fp8_autocast() context."
+       "--fp8", action="store_true", default=False, help="Enables the te.autocast() context."
Up to TE v2.8, I think it's still fp8_autocast. Were you targeting higher versions?

I think you had a few comments on this, so I will address them all here. I moved the UB code up to release 2.10, as there were a few bugs and inefficiencies that NV fixed. Most of the changes that aren't guarded in the files are NV upstream changes.
I am fixing up the te_layer_with_overlap differences, and working on integrating the benchmark script into the file directly.
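Since this thread hinges on the autocast rename, here is a minimal sketch of a version guard that would keep the example working on both sides. The helper is hypothetical, and it assumes the rename from te.fp8_autocast() to te.autocast() landed by release 2.10, per the discussion above:

```python
def autocast_ctx_name(te_version: str) -> str:
    """Pick the autocast context-manager name for a given TE version.

    Hypothetical helper: assumes te.fp8_autocast() was renamed to
    te.autocast() by release 2.10, as discussed in this thread.
    """
    major, minor = (int(p) for p in te_version.split(".")[:2])
    return "autocast" if (major, minor) >= (2, 10) else "fp8_autocast"
```

In the benchmark script one could then do something like `getattr(te, autocast_ctx_name(te.__version__))` instead of hard-coding either name.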
# This file was modified for portability to AMDGPU
# Copyright (c) 2025-2026, Advanced Micro Devices, Inc. All rights reserved.
# Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Does this file share a lot of code with examples/pytorch/comm_gemm_overlap/te_layer_with_overlap.py? Is it possible to consolidate those two files?
import transformer_engine.pytorch.cpp_extensions as tex
from transformer_engine.pytorch.fp8 import FP8GlobalStateManager

from transformer_engine.jax.cpp_extensions.misc import is_hip_extension
Let's not import jax-specific code into the pytorch side. Use this instead:
Good catch, this is a mistake. Will fix.
-   if (_ub_comm->myrank == 0) printf("!!! [UB] Register UBuf %d\n", _ub_reg);
+   if (_ub_comm->myrank == 0) {
+     printf("!!! [UB] Register UBuf %d\n", _ub_reg);
+   }
I would prefer aligning the coding style with NV upstream so it's easier for us to maintain/IFU later
          allgather_handle, barrier_handle, tp_size, num_max_streams, comm_cga_size,
          gemm_priority, comm_priority, num_comm_sm, set_sm_margin, use_ce,
          atomic_gemm) {
+   initialize(buffer_shape, buffer_dtype, comm_type, aggregate);
Same question here for the motivation of this initialize function in the constructor
transformer_engine/common/comm_gemm_overlap/comm_gemm_overlap.cpp (resolved)
    NVTE_CHECK_CUDA(cudaMemset((*comm)->flags_baseptr, 0, 2 * GPU_PAGE_SIZE));
    (*comm)->flags = reinterpret_cast<int *>(
#ifdef __HIP_PLATFORM_AMD__
        (reinterpret_cast<uintptr_t>((*comm)->flags) + GPU_PAGE_SIZE - 1) & GPU_PAGE_MASK);
Should it be (*comm)->flags_baseptr, as in the NV upstream below? (*comm)->flags is not allocated/assigned above.

Yes, I have fixed that. Thanks!
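For reference, the rounding that the fix applies to flags_baseptr can be restated in isolation. This is a pure-Python sketch, assuming the usual 64 KiB GPU page size; the real code of course operates on a device pointer in C++:

```python
GPU_PAGE_SIZE = 1 << 16            # assumption: 64 KiB GPU page
GPU_PAGE_MASK = ~(GPU_PAGE_SIZE - 1)

def page_align_up(addr: int) -> int:
    # Round an address up to the next GPU page boundary, as the corrected
    # code does with (*comm)->flags_baseptr before assigning (*comm)->flags.
    return (addr + GPU_PAGE_SIZE - 1) & GPU_PAGE_MASK
```

Using the unassigned flags pointer instead of flags_baseptr here would align garbage rather than the freshly memset allocation, which is why the original line was wrong.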
    __syncthreads();
-   if (threadIdx.x == 0) __threadfence_system();
+   if (threadIdx.x == 0) __threadfence();
Looks like __threadfence_system() is now supported in ROCm 7.2: https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html#memory-fence-instructions
  void userbuffers_send(const int srchandler, const size_t srcoffset, const int dsthandler,
                        const size_t dstoffset, const size_t bytes, communicator *comm,
-                       const int peer, cudaStream_t stream) {
+                       const int peer, cudaStream_t stream, int ring_id) {
Hmm, I guess my question then would be: why does NV upstream not need a ring_id? Is it because we have a different implementation, i.e. NVTE_ROCM_MAX_RINGS?
    _comm_priority = comm_priority;
  }
-   for (int i = 0; i < std::min(num_max_streams, num_splits); i++) {
+   for (int i = 0; i < std::max(num_max_streams, num_splits); i++) {
In fact, do we need more streams than the min of num_max_streams and num_splits?

We do. I am convinced this is an upstream bug, as num_splits has a default value of 0, which seems off. Either way, we need at least one stream for each TP peer, which is what num_splits is supposed to be.
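The difference is easy to see with the defaults mentioned above. A pure-Python sketch of just the loop bound (not the TE code itself):

```python
def overlap_stream_count(num_max_streams: int, num_splits: int) -> int:
    # Upstream's std::min collapses to 0 streams when num_splits keeps its
    # default of 0; std::max guarantees at least one stream per TP peer
    # (num_splits) and never fewer than num_max_streams.
    return max(num_max_streams, num_splits)
```

With num_splits left at its default of 0, the upstream min() bound would create zero streams, which supports the upstream-bug reading.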
  NVTE_DIM_CHECK(chunk_height > 0 && chunk_width > 0, "Attempted to get empty tensor chunk");
  NVTE_DIM_CHECK(chunk_height <= height && chunk_width <= width,
                 "Attempted to get out-of-bounds tensor chunk");
#ifndef __HIP_PLATFORM_AMD__
Since we already support mxfp8, add a to-do comment so that we won't forget to turn it on later.

This is an ifndef, so it is enabled for us, since we don't have the padding issues.
  // Input data
  const size_t source_size = source.numel();
  const void *src_ptr = (rowwise) ? source.dptr() : source.columnwise_dptr();
Well, what if we need both row-wise and column-wise data? How about other fields of a tensor, for example scale_inv?

Within these functions we work with only column-wise or row-wise data. When we call hipBLASLt we pass in only one or the other for the GEMM, so there is no need for both within a single overlap call.
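A toy restatement of that contract (stub class standing in for the TE tensor, not the real API): each overlap call resolves exactly one layout before handing the pointer to the GEMM.

```python
class FakeTensor:
    """Stub for the TE tensor: holds one data pointer per layout."""
    def __init__(self, row_ptr, col_ptr):
        self._row, self._col = row_ptr, col_ptr

    def dptr(self):
        return self._row

    def columnwise_dptr(self):
        return self._col

def src_ptr_for_overlap(source: FakeTensor, rowwise: bool):
    # Mirrors: src_ptr = (rowwise) ? source.dptr() : source.columnwise_dptr();
    return source.dptr() if rowwise else source.columnwise_dptr()
```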
    "num_sm": 1 if method == "ring_exchange" else 16,
    "cga_size": 1 if method == "ring_exchange" else 2,
-   "set_sm_margin": not method == "ring_exchange",
+   "set_sm_margin": not method == "ring_exchange" and not IS_HIP_EXTENSION,
Ilya already has the sm_margin feature supported on ROCm.

This was a performance decision, not a functionality one. set_sm_margin seems to slow down UB on ROCm, probably because we have dedicated SDMA engines in use that don't require CU blocking.

Please add a comment that it is disabled for performance reasons, then.
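A sketch of what the requested comment could look like around the defaults above. The dict is simplified and the IS_HIP_EXTENSION stub is an assumption (in TE it comes from the build):

```python
IS_HIP_EXTENSION = True  # assumption: set elsewhere based on the build

def default_overlap_cfg(method: str) -> dict:
    return {
        "num_sm": 1 if method == "ring_exchange" else 16,
        "cga_size": 1 if method == "ring_exchange" else 2,
        # Disabled on ROCm purely for performance: dedicated SDMA engines
        # handle the comms, so reserving CUs only slows down the GEMM.
        "set_sm_margin": method != "ring_exchange" and not IS_HIP_EXTENSION,
    }
```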
    if IS_HIP_EXTENSION and user_ub_cfg is not None:
        for name, cfg in user_ub_cfg.items():
            assert cfg.get("method") != "bulk", (
                f"Bulk overlap method for '{name}' is not supported on HIP/ROCm. "
I recall we supported bulk overlap, but the performance was not great?

Yeonsoo was seeing some race conditions and weird hangs, and submitted a PR upstream. I am still seeing failures after rebasing to IFU 2.10, so I think the issue is still there. If we want to enable this, we should consider it for a different PR, as it will require a new implementation.
build_tools/hipify/custom_map.json
@@ -6,7 +6,14 @@
    "ATen/cudnn/Handle.h" : "ATen/miopen/Handle.h",
    "CUfunc_cache" : "hipFuncCache_t",
    "<nvtx3/nvToolsExt.h>" : "<roctracer/roctx.h>",
-   "cudaFuncSetAttribute(" : "hipFuncSetAttribute((const void*)"
+   "cudaFuncSetAttribute(" : "hipFuncSetAttribute((const void*)",
+   "cudaLaunchKernel": "hipLaunchKernel",
cudaLaunchKernel cannot be hipified?

Looks like this was very recently added to hipify_torch, so we can probably remove it after we update our hipify_torch commit. I would recommend we do that separately, however.

I can actually see it in the hipify_torch maps. I think the custom map is not needed after the recent TE hipification changes.

Looking back, I added this since we need to hipify cudaLaunchKernelExC as well as cudaLaunchKernel. The former is still not in the map, so I have updated the custom map to specifically pick up the ExC variant.
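In other words, only the ExC variant still needs a custom rule. A minimal sketch of what the updated entry might look like, expressed as a Python dict; the exact key/value strings are hypothetical and depend on how the hipify pass matches tokens:

```python
# Hypothetical excerpt of the updated custom map: plain cudaLaunchKernel is
# now covered by hipify_torch, so only the ExC variant keeps a custom rule.
custom_map = {
    "cudaLaunchKernelExC": "hipLaunchKernelExC",
}
```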
build_tools/hipify/custom_map.json
{
  "custom_map" : {
    "<cuda_bf16.h>" : "<hip/hip_bfloat16.h>",
    "util/cuda_runtime.h" : "util/hip_runtime.h",
It should be covered by line 5.

Thanks, have removed.
L3 CI -- missing distributed/test_cast_master_weights_to_fp8.py hotfix that is now in dev.
This is the userbuffer_epic branch, to be merged only once all epic tasks have been completed. PRs for epic tasks will target this branch.