NVFP4 cast/transpose without TMA #472

Open
matthiasdiener wants to merge 47 commits into dev from mdiener/fp4-cast-transpose

Conversation


@matthiasdiener (Contributor) commented Mar 4, 2026

Description

Fixes https://github.com/ROCm/frameworks-internal/issues/15731

TODO:

  • Implement the other cases, not just fwd 1D
  • Add tests for the other cases

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@matthiasdiener matthiasdiener self-assigned this Mar 4, 2026
@matthiasdiener matthiasdiener changed the base branch from IFU-dev-20251114-v2.10 to dev March 6, 2026 19:17
@matthiasdiener matthiasdiener changed the base branch from dev to IFU-dev-20251114-v2.10 March 6, 2026 19:20
@matthiasdiener matthiasdiener changed the base branch from IFU-dev-20251114-v2.10 to dev March 11, 2026 16:19
@matthiasdiener matthiasdiener changed the title [WIP] NVFP4 cast/transpose NVFP4 cast/transpose without TMA Mar 16, 2026
@matthiasdiener matthiasdiener marked this pull request as ready for review March 16, 2026 15:49
@ipanfilo (Collaborator):

The code contains a number of ifdefs that merely substitute the cuda_fp4.h types (__nv_fp4_e2m1, etc.) with their HIP counterparts. I suggest using the custom hipification map (build_tools/hipify/custom_map.json) and removing the ifdefs from the code. It can also be used for #include <cudaTypedefs.h>

test_swizzle.cu)
else()
list(APPEND test_cuda_sources
test_cast_nvfp4_transpose.cu
Collaborator:

It should rather go to the common section, where all the other cast tests are.

Contributor Author:

Fixed in 95d0c9f


__device__ __forceinline__ float ComputeGlobalEncodeScaleFP4(const float global_amax) {
#ifdef __HIP_PLATFORM_AMD__
const float fp8_max = TypeExtrema<fp8e4m3>::max;
Collaborator:

For device code, constexpr should still work.

Contributor Author:

I changed it to #if defined(__HIP_PLATFORM_AMD__) && !defined(__HIP_DEVICE_COMPILE__) in 95d0c9f

Collaborator:

Yes, for host code it is not a constexpr. However, the host-side translation results should be eliminated from the final binary, so the value is not important. Even the whole method could be guarded with the same final result, so #if defined(__HIP_PLATFORM_AMD__) && !defined(__HIP_DEVICE_COMPILE__) is good.
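As background for this thread, here is a minimal host-side sketch in Python of what a function like ComputeGlobalEncodeScaleFP4 typically computes. It assumes the common NVFP4 convention where the global encode scale maps the tensor's amax onto the combined fp8 e4m3 and fp4 e2m1 range; the exact constants, clamping, and special-case handling in the PR may differ.

```python
# Hypothetical sketch of a ComputeGlobalEncodeScaleFP4-style function.
# Assumes the common NVFP4 convention: scale = fp8_max * fp4_max / amax.
FP8_E4M3_MAX = 448.0   # largest finite value of fp8 e4m3
FP4_E2M1_MAX = 6.0     # largest finite value of fp4 e2m1

def compute_global_encode_scale_fp4(global_amax: float) -> float:
    # Guard against a zero or invalid amax to avoid dividing by zero.
    if global_amax <= 0.0:
        return 1.0
    return FP8_E4M3_MAX * FP4_E2M1_MAX / global_amax

print(compute_global_encode_scale_fp4(2688.0))  # amax == fp8_max * fp4_max -> 1.0
```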

fused_attn_rocm/fused_attn_ck.cpp
fused_attn_rocm/utils.cpp)
fused_attn_rocm/utils.cpp
transpose/quantize_transpose_vector_blockwise_fp4.cu)
Collaborator:

It is not ROCm-specific source code; it should be added outside the IS_ROCM/IS_CUDA if.

Contributor Author:

Fixed in 95d0c9f

auto &global_amax = (output_tensor->amax.dptr != nullptr) ? output_tensor->amax
: output_tensor->columnwise_amax;

// If amax was not explicitly set, fall back to the scale field which
Collaborator:

Guard it.

Contributor Author:

Added a guard in 95d0c9f

@matthiasdiener matthiasdiener force-pushed the mdiener/fp4-cast-transpose branch from 472372b to 55a8c84 Compare March 17, 2026 22:12
@matthiasdiener (Contributor Author):

The code contains a number of ifdefs that merely substitute the cuda_fp4.h types (__nv_fp4_e2m1, etc.) with their HIP counterparts. I suggest using the custom hipification map (build_tools/hipify/custom_map.json) and removing the ifdefs from the code. It can also be used for #include <cudaTypedefs.h>

Thank you for the suggestion. I changed to using the hipify map in 55a8c84
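For readers unfamiliar with the mechanism: the custom map is a JSON dictionary of textual substitutions applied while generating the ROCm sources. A minimal sketch of the kind of entries meant here, based on the names discussed in this thread (the actual contents of build_tools/hipify/custom_map.json may differ):

```json
{
  "<cuda_fp4.h>": "<hip/hip_fp4.h>",
  "__nv_fp4_e2m1": "__hip_fp4_e2m1",
  "__nv_fp4x2_e2m1": "__hip_fp4x2_e2m1"
}
```

Each key is replaced verbatim by its value during hipification, so the CUDA sources can stay free of platform ifdefs.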

@wangye805 (Collaborator) left a comment:

So currently we don't have any workaround for the stochastic rounding path?

"__nv_fp4x4_e2m1" : "__hip_fp4x4_e2m1",
"__nv_fp4x2_storage_t" : "__hip_fp4x2_storage_t",
"<cudaTypedefs.h>" : "<hip/hip_version.h>",
"<cuda/barrier>" : "<hip/hip_version.h>"
Collaborator:

Why are cudaTypedefs.h and cuda/barrier both translated to hip/hip_version.h?

Contributor Author:

It is just to "disable" those particular includes, for which there seems to be no equivalent in ROCm. I changed this slightly in b4caf6f; let me know if you prefer that (or just restoring the previous version that uses #ifndef guards around the nonexistent headers).

Collaborator:

Oh, I see. Then I prefer to just guard those header files via #ifndef __HIP_PLATFORM_AMD__. Maybe several months from now hipify will support those two header files, but we would still be mapping them to null or hip_version.h.

Collaborator:

Well, hipify changes won't affect us until we update our submodule, so the custom map can be reviewed. Also, if they are guarded with ifndef, hipify support will not affect TE code either.
Moreover, if those headers are not needed for the ROCm path, there is no reason to hipify them to anything other than void.
Thus, depending on how often those headers appear and whether there are already HIP ifdefs around them, the custom map may be a good alternative to adding multiple guards.

Contributor Author:

Disabling the #ifdefs in the custom map file reduces the number of files that we need to touch in this PR by 3. Are you ok with the way it is currently implemented @wangye805 ?

#include <hip/hip_bfloat16.h>
#include "amd_detail/hip_float8.h"
#endif
#include <hip/hip_fp4.h>
Collaborator:

I recall you already put a cuda_fp4.h translation rule into our hipify JSON file?

Contributor Author:

I just adapted it to the surrounding code, which already includes the HIP headers directly: #include <hip/hip_bfloat16.h>. Let me know which one you prefer.

Collaborator:

The cuda_bf16 and cuda_fp8 headers are also hipified with the custom map.

Collaborator:

I mean we can simplify your lines 16-28 with just one USE_ROCM guard on FP4_TYPE_SUPPORTED, keeping the other lines unchanged from NV upstream? This would make our code cleaner.

#ifdef __HIP_PLATFORM_AMD__
// If amax was not explicitly set, fall back to the scale field which
// holds the same value when set via set_scale().
NVTE_CHECK(global_amax.dptr != nullptr || output_tensor->scale.dptr != nullptr,
Collaborator:

Is it a bug fix for upstream? If not, why do we need this specific treatment for global amax?

Contributor Author:

Yes, I believe this is a bug in upstream.

Collaborator:

Maybe put a comment then.

Collaborator:

I see. Thanks!

Also, check whether upstream already has a fix. If not, I think it's okay to drop the ROCm-specific guard. What do you think @ipanfilo?

@matthiasdiener matthiasdiener requested a review from ipanfilo March 18, 2026 18:28
@matthiasdiener (Contributor Author):

So currently we don't have any workaround for the stochastic rounding path?

I was able to implement SR via intrinsics on gfx950 in 36cf73a. I also expanded the test to use it.
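For readers unfamiliar with the technique, here is a software illustration of stochastic rounding onto the FP4 E2M1 value grid. This is not the PR's gfx950 intrinsic path; it is a hedged Python sketch of the idea: round a value to one of its two neighbouring representable values, with probability proportional to proximity, so that the rounding is unbiased in expectation.

```python
import random

# Software illustration of stochastic rounding to the FP4 E2M1 grid.
# Positive representable E2M1 magnitudes (the format is symmetric in sign).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def stochastic_round_e2m1(x: float, rng: random.Random) -> float:
    sign = -1.0 if x < 0 else 1.0
    m = min(abs(x), 6.0)  # saturate at the E2M1 maximum
    lo = max(v for v in E2M1_GRID if v <= m)
    hi = min(v for v in E2M1_GRID if v >= m)
    if lo == hi:
        return sign * lo
    p_up = (m - lo) / (hi - lo)  # round up with probability proportional to proximity
    return sign * (hi if rng.random() < p_up else lo)

rng = random.Random(0)
samples = [stochastic_round_e2m1(2.25, rng) for _ in range(10000)]
# Unbiased in expectation: the mean of the rounded values approaches 2.25.
print(sum(samples) / len(samples))
```

The hardware intrinsic performs the same trade (per-element randomness in exchange for zero rounding bias) in a single instruction.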

}

#ifdef __HIP_PLATFORM_AMD__
__device__ __forceinline__ fp4x4_storage_t cvt_fp32_to_fp4_4x_with_stochastic_rounding(
Collaborator:

fp4x4_storage_t is already correctly redefined for both HIP and CUDA, so there is no need for an ifdef here. Or, if you want to keep the original declaration unchanged, you can use 'using __nv_fp4x4_e2m1 = __hip_fp4x4_storage_t' on AMD.

@matthiasdiener (Contributor Author), Mar 19, 2026:

I'm not sure this can be simplified further.
Both sides need different return types, and fp4x4_storage_t can only be used on AMD.
The map file has "__nv_fp4x2_e2m1" : "__hip_fp4x2_e2m1", so using __nv_fp4x4_e2m1 = __hip_fp4x4_storage_t would become using __hip_fp4x4_e2m1 = __hip_fp4x4_storage_t after hipification, which is a redefinition of the existing struct in amd_hip_fp4.h.

"FP4 cvt.rs PTX instructions are architecture-specific. "
"Try recompiling with sm_XXXa instead of sm_XXX.");
#else
#ifdef __gfx950__
Collaborator:

It may make sense to have an analogue of an ARCH_HAS_STOCHASTIC_ROUNDING define if such guarding is used in multiple places - we'll later add more platforms with FP4 support.


// for 2D block scaling, we need to reduce amax in warp
#ifdef __HIP_PLATFORM_AMD__
static __device__ constexpr uint64_t WARP_REDUCE_AMAX_GROUP_MASKS[8] = {
Collaborator:

I think that with only 32 threads per wavefront actively used, the high half of the mask should be 0.
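To illustrate the reviewer's point, here is a hypothetical Python sketch; the group_mask helper and the group sizes are illustrative, not the PR's actual mask table. It shows that when only the low 32 lanes of a 64-wide wavefront participate, every per-group lane mask necessarily has a zero high half.

```python
# Build per-group lane masks for groups of consecutive lanes among the
# 32 active lanes of a 64-wide wavefront (illustrative helper, not the
# PR's WARP_REDUCE_AMAX_GROUP_MASKS values).
def group_mask(group_size: int, group_index: int) -> int:
    base = (1 << group_size) - 1           # group_size consecutive set bits
    return base << (group_index * group_size)

ACTIVE_LANES = 32
masks = [group_mask(4, i) for i in range(ACTIVE_LANES // 4)]  # 8 groups of 4 lanes
for m in masks:
    assert m >> 32 == 0  # high half of the 64-bit mask stays zero
print([hex(m) for m in masks])
```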

@wangye805 (Collaborator):

So currently we don't have any workaround for the stochastic rounding path?

I was able to implement SR via intrinsics on gfx950 in 36cf73a. I also expanded the test to use it.

Great. Thanks
