NVFP4 cast/transpose without TMA #472

Open
matthiasdiener wants to merge 47 commits into dev from mdiener/fp4-cast-transpose

Conversation


@matthiasdiener (Contributor) commented Mar 4, 2026

Description

Fixes https://github.com/ROCm/frameworks-internal/issues/15731

TODO:

  • Implement the other cases, not just fwd 1D
  • Add tests for the other cases

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@matthiasdiener matthiasdiener self-assigned this Mar 4, 2026
@matthiasdiener matthiasdiener changed the base branch from IFU-dev-20251114-v2.10 to dev March 6, 2026 19:17
@matthiasdiener matthiasdiener changed the base branch from dev to IFU-dev-20251114-v2.10 March 6, 2026 19:20
@matthiasdiener matthiasdiener changed the base branch from IFU-dev-20251114-v2.10 to dev March 11, 2026 16:19
@matthiasdiener matthiasdiener changed the title [WIP] NVFP4 cast/transpose NVFP4 cast/transpose without TMA Mar 16, 2026
@matthiasdiener matthiasdiener marked this pull request as ready for review March 16, 2026 15:49
@ipanfilo (Collaborator):

The code contains a number of ifdefs that merely substitute the cuda_fp4.h types (__nv_fp4_e2m1, etc.) with their HIP counterparts. I suggest using the custom hipification map (build_tools/hipify/custom_map.json) and removing the ifdefs from the code. It can also be used for #include <cudaTypedefs.h>

test_swizzle.cu)
else()
list(APPEND test_cuda_sources
test_cast_nvfp4_transpose.cu
Collaborator:

It should rather go to the common section, where all the other cast tests are.

Contributor Author:

Fixed in 95d0c9f


__device__ __forceinline__ float ComputeGlobalEncodeScaleFP4(const float global_amax) {
#ifdef __HIP_PLATFORM_AMD__
const float fp8_max = TypeExtrema<fp8e4m3>::max;
Collaborator:

For device code, constexpr should still work.

Contributor Author:

I changed it to #if defined(__HIP_PLATFORM_AMD__) && !defined(__HIP_DEVICE_COMPILE__) in 95d0c9f

Collaborator:

Yes, for host code it is not a constexpr. However, the host-side translation results should be eliminated from the final binary, so the value is not important. Even the whole method could be guarded with the same final result, so #if defined(__HIP_PLATFORM_AMD__) && !defined(__HIP_DEVICE_COMPILE__) is good.
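As background for this thread, here is a minimal host-side sketch in Python of what a function like ComputeGlobalEncodeScaleFP4 typically computes. It assumes the common NVFP4 convention where the global encode scale maps the tensor's amax onto the combined fp8 e4m3 and fp4 e2m1 range; the exact constants, clamping, and special-case handling in the PR may differ.

```python
# Hypothetical sketch of a ComputeGlobalEncodeScaleFP4-style function.
# Assumes the common NVFP4 convention: scale = fp8_max * fp4_max / amax.
FP8_E4M3_MAX = 448.0   # largest finite value of fp8 e4m3
FP4_E2M1_MAX = 6.0     # largest finite value of fp4 e2m1

def compute_global_encode_scale_fp4(global_amax: float) -> float:
    # Guard against a zero or invalid amax to avoid dividing by zero.
    if global_amax <= 0.0:
        return 1.0
    return FP8_E4M3_MAX * FP4_E2M1_MAX / global_amax

print(compute_global_encode_scale_fp4(2688.0))  # amax == fp8_max * fp4_max -> 1.0
```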

fused_attn_rocm/fused_attn_ck.cpp
fused_attn_rocm/utils.cpp)
fused_attn_rocm/utils.cpp
transpose/quantize_transpose_vector_blockwise_fp4.cu)
Collaborator:

It is not ROCm-specific source code; it should be added outside the IS_ROCM/IS_CUDA if.

Contributor Author:

Fixed in 95d0c9f

auto &global_amax = (output_tensor->amax.dptr != nullptr) ? output_tensor->amax
: output_tensor->columnwise_amax;

// If amax was not explicitly set, fall back to the scale field which
Collaborator:

Guard it.

Contributor Author:

Added a guard in 95d0c9f

@matthiasdiener matthiasdiener force-pushed the mdiener/fp4-cast-transpose branch from 472372b to 55a8c84 Compare March 17, 2026 22:12
@matthiasdiener (Contributor Author):

The code contains a number of ifdefs that merely substitute the cuda_fp4.h types (__nv_fp4_e2m1, etc.) with their HIP counterparts. I suggest using the custom hipification map (build_tools/hipify/custom_map.json) and removing the ifdefs from the code. It can also be used for #include <cudaTypedefs.h>

Thank you for the suggestion. I changed to using the hipify map in 55a8c84
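For readers unfamiliar with the mechanism: the custom map is a JSON dictionary of textual substitutions applied while generating the ROCm sources. A minimal sketch of the kind of entries meant here, based on the names discussed in this thread (the actual contents of build_tools/hipify/custom_map.json may differ):

```json
{
  "<cuda_fp4.h>": "<hip/hip_fp4.h>",
  "__nv_fp4_e2m1": "__hip_fp4_e2m1",
  "__nv_fp4x2_e2m1": "__hip_fp4x2_e2m1"
}
```

Each key is replaced verbatim by its value during hipification, so the CUDA sources can stay free of platform ifdefs.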

@wangye805 (Collaborator) left a comment:

So currently we don't have any workaround for the stochastic rounding path?

"__nv_fp4x4_e2m1" : "__hip_fp4x4_e2m1",
"__nv_fp4x2_storage_t" : "__hip_fp4x2_storage_t",
"<cudaTypedefs.h>" : "<hip/hip_version.h>",
"<cuda/barrier>" : "<hip/hip_version.h>"
Collaborator:

Why are cudaTypedefs.h and cuda/barrier both translated to hip/hip_version.h?

Contributor Author:

It is just to "disable" those particular includes, for which there seems to be no equivalent in ROCm. I changed this slightly in b4caf6f; let me know if you prefer that (or just restoring the previous version that uses #ifndef guards around the nonexistent headers).

Collaborator:

Oh, I see. Then I prefer to just guard those header files via #ifndef __HIP_PLATFORM_AMD__. Maybe several months from now hipify will support those two header files, but we would still be mapping them to null or hip_version.h.

Collaborator:

Well, hipify changes won't affect us until we update our submodule, so the custom map can be reviewed. Also, if they are guarded with ifndef, hipify support will not affect TE code either.
Moreover, if those headers are not needed for the ROCm path, there is no reason to hipify them to anything other than void.
Thus, depending on how often those headers appear and whether there are already HIP ifdefs around them, the custom map may be a good alternative to adding multiple guards.

Contributor Author:

Disabling the #ifdefs in the custom map file reduces the number of files that we need to touch in this PR by 3. Are you ok with the way it is currently implemented @wangye805 ?

#include <hip/hip_bfloat16.h>
#include "amd_detail/hip_float8.h"
#endif
#include <hip/hip_fp4.h>
Collaborator:

I recall you already put a cuda_fp4.h translation rule into our hipify JSON file?

Contributor Author:

I just adapted it to the surrounding code, which already includes the HIP headers directly: #include <hip/hip_bfloat16.h>. Let me know which one you prefer.

Collaborator:

The cuda_bf16 and cuda_fp8 headers are also hipified with the custom map.

Collaborator:

I mean we can simplify your lines 16-28 with just one USE_ROCM guard on FP4_TYPE_SUPPORTED, keeping the other lines unchanged from NV upstream? This would make our code cleaner.

#ifdef __HIP_PLATFORM_AMD__
// If amax was not explicitly set, fall back to the scale field which
// holds the same value when set via set_scale().
NVTE_CHECK(global_amax.dptr != nullptr || output_tensor->scale.dptr != nullptr,
Collaborator:

Is it a bug fix for upstream? If not, why do we need this specific treatment for global amax?

Contributor Author:

Yes, I believe this is a bug in upstream.

Collaborator:

Maybe put a comment then.

Collaborator:

I see. Thanks!

Also, check whether upstream already has a fix. If not, I think it's okay to drop the ROCm-specific guard. What do you think @ipanfilo?

@matthiasdiener matthiasdiener requested a review from ipanfilo March 18, 2026 18:28
@matthiasdiener (Contributor Author):

So currently we don't have any workaround for the stochastic rounding path?

I was able to implement SR via intrinsics on gfx950 in 36cf73a. I also expanded the test to use it.
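For readers unfamiliar with the technique, here is a software illustration of stochastic rounding onto the FP4 E2M1 value grid. This is not the PR's gfx950 intrinsic path; it is a hedged Python sketch of the idea: round a value to one of its two neighbouring representable values, with probability proportional to proximity, so that the rounding is unbiased in expectation.

```python
import random

# Software illustration of stochastic rounding to the FP4 E2M1 grid.
# Positive representable E2M1 magnitudes (the format is symmetric in sign).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def stochastic_round_e2m1(x: float, rng: random.Random) -> float:
    sign = -1.0 if x < 0 else 1.0
    m = min(abs(x), 6.0)  # saturate at the E2M1 maximum
    lo = max(v for v in E2M1_GRID if v <= m)
    hi = min(v for v in E2M1_GRID if v >= m)
    if lo == hi:
        return sign * lo
    p_up = (m - lo) / (hi - lo)  # round up with probability proportional to proximity
    return sign * (hi if rng.random() < p_up else lo)

rng = random.Random(0)
samples = [stochastic_round_e2m1(2.25, rng) for _ in range(10000)]
# Unbiased in expectation: the mean of the rounded values approaches 2.25.
print(sum(samples) / len(samples))
```

The hardware intrinsic performs the same trade (per-element randomness in exchange for zero rounding bias) in a single instruction.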

}

#ifdef __HIP_PLATFORM_AMD__
__device__ __forceinline__ fp4x4_storage_t cvt_fp32_to_fp4_4x_with_stochastic_rounding(
Collaborator:

fp4x4_storage_t is already correctly redefined for both HIP and CUDA, so there is no need for an ifdef here. Or, if you want to keep the original declaration unchanged, you can use 'using __nv_fp4x4_e2m1 = __hip_fp4x4_storage_t' on AMD.

@matthiasdiener (Contributor Author), Mar 19, 2026:

I'm not sure this can be simplified further.
Both sides need different return types, and fp4x4_storage_t can only be used on AMD.
The map file has "__nv_fp4x2_e2m1" : "__hip_fp4x2_e2m1", so using __nv_fp4x4_e2m1 = __hip_fp4x4_storage_t would become using __hip_fp4x4_e2m1 = __hip_fp4x4_storage_t after hipification, which is a redefinition of the existing struct in amd_hip_fp4.h.

"FP4 cvt.rs PTX instructions are architecture-specific. "
"Try recompiling with sm_XXXa instead of sm_XXX.");
#else
#ifdef __gfx950__
Collaborator:

It may make sense to have an analogue of an ARCH_HAS_STOCHASTIC_ROUNDING define if such guarding is used in multiple places - we'll later add more platforms with FP4 support.


// for 2D block scaling, we need to reduce amax in warp
#ifdef __HIP_PLATFORM_AMD__
static __device__ constexpr uint64_t WARP_REDUCE_AMAX_GROUP_MASKS[8] = {
Collaborator:

I think that with only 32 threads per wavefront actively used, the high half of the mask should be 0.
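To illustrate the reviewer's point, here is a hypothetical Python sketch; the group_mask helper and the group sizes are illustrative, not the PR's actual mask table. It shows that when only the low 32 lanes of a 64-wide wavefront participate, every per-group lane mask necessarily has a zero high half.

```python
# Build per-group lane masks for groups of consecutive lanes among the
# 32 active lanes of a 64-wide wavefront (illustrative helper, not the
# PR's WARP_REDUCE_AMAX_GROUP_MASKS values).
def group_mask(group_size: int, group_index: int) -> int:
    base = (1 << group_size) - 1           # group_size consecutive set bits
    return base << (group_index * group_size)

ACTIVE_LANES = 32
masks = [group_mask(4, i) for i in range(ACTIVE_LANES // 4)]  # 8 groups of 4 lanes
for m in masks:
    assert m >> 32 == 0  # high half of the 64-bit mask stays zero
print([hex(m) for m in masks])
```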

@wangye805 (Collaborator):

So currently we don't have any workaround for the stochastic rounding path?

I was able to implement SR via intrinsics on gfx950 in 36cf73a. I also expanded the test to use it.

Great. Thanks
