cuda::std::simd Optimize Arithmetic Floating-point Operations #8775

Open
fbusato wants to merge 8 commits into NVIDIA:main from fbusato:simd-optimize-add-mul-fma

@fbusato fbusato commented Apr 30, 2026

Description

This PR introduces the following optimizations and checks:

  • HADD2, HMUL2, HFMA for Half and Bfloat16: There are no single-element variants, so here we only check that all operators relying on them generate the expected number of instructions of that type.

  • F32x2 Blackwell SM100: Ensure that the following operations are mapped to F32x2 instructions:

    • Plus +
    • Minus -
    • Unary minus -
    • Increment ++
    • Decrement --
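
The "expected number of instructions" check from the first bullet can be pictured as counting how often a mnemonic appears in a disassembly listing. The following is only a minimal C++ sketch of that idea, not the PR's test harness: `count_mnemonic` and `has_expected_count` are hypothetical names, and the listing is assumed to be plain text supplied by the caller.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: counts non-overlapping occurrences of `mnemonic`
// in `listing`. This is plain substring matching, so searching for "HFMA"
// would also count longer mnemonics that start with "HFMA".
std::size_t count_mnemonic(std::string_view listing, std::string_view mnemonic)
{
  if (mnemonic.empty())
  {
    return 0; // an empty pattern is treated as "no matches"
  }
  std::size_t count = 0;
  for (std::size_t pos = listing.find(mnemonic); pos != std::string_view::npos;
       pos = listing.find(mnemonic, pos + mnemonic.size()))
  {
    ++count;
  }
  return count;
}

// Hypothetical helper: true when the listing contains exactly `expected`
// occurrences of `mnemonic`.
bool has_expected_count(std::string_view listing, std::string_view mnemonic, std::size_t expected)
{
  return count_mnemonic(listing, mnemonic) == expected;
}
```

A check of this shape only compares counts; it does not verify operands or instruction order.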

@fbusato fbusato self-assigned this Apr 30, 2026
@fbusato fbusato requested a review from a team as a code owner April 30, 2026 23:43
@fbusato fbusato added this to CCCL Apr 30, 2026
@fbusato fbusato requested a review from a team as a code owner April 30, 2026 23:43
@fbusato fbusato requested a review from bernhardmgruber April 30, 2026 23:43
@fbusato fbusato added the libcu++ For all items related to libcu++ label Apr 30, 2026
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Apr 30, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Apr 30, 2026
@fbusato fbusato changed the title [DRAFT] cuda::std::simd Optimize basic vector operations cuda::std::simd Optimize Arithmentic Floating-point Operations May 1, 2026

Comment thread libcudacxx/include/cuda/std/__simd/specializations/fixed_size_storage.h Outdated
Comment thread libcudacxx/include/cuda/std/__simd/specializations/fixed_size_vec.h Outdated
Comment thread libcudacxx/include/cuda/std/__simd/specializations/fp32x2_intrinsics.h Outdated
# define _CCCL_HAS_SIMD_F32X2() 0
#endif // _CCCL_CUDA_COMPILER(NVCC, >=, 12, 8) || (__cccl_ptx_isa >= 860ULL)

#if _CCCL_HAS_SIMD_F32X2()
Contributor

Does the entire file need to be gated behind this? I notice that only a few functions specifically need it, and for those you can just add an extra #else clause that does the naive algorithm.

This also saves downstream users from having to gate their code behind _CCCL_HAS_SIMD_F32X2(), since the symbols always exist.
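
The suggestion could look roughly like the sketch below. It is host-only C++ with made-up names: `SKETCH_HAS_F32X2()` stands in for `_CCCL_HAS_SIMD_F32X2()`, and `float_pair`, `add_pair`, and `packed_add` are hypothetical and do not come from the PR.

```cpp
// Hypothetical feature macro standing in for _CCCL_HAS_SIMD_F32X2().
// It is fixed to 0 here so the sketch compiles with any host compiler.
#define SKETCH_HAS_F32X2() 0

// Hypothetical two-lane float value.
struct float_pair
{
  float x;
  float y;
};

// The function is always declared; only its body depends on the feature macro.
inline float_pair add_pair(float_pair a, float_pair b)
{
#if SKETCH_HAS_F32X2()
  // The packed two-lane path would go here. `packed_add` is a placeholder
  // name, not a real function; this branch is not compiled in the sketch.
  return packed_add(a, b);
#else
  // Naive fallback: one scalar add per lane.
  return float_pair{a.x + b.x, a.y + b.y};
#endif
}
```

With this layout, callers can use `add_pair` without checking the macro themselves, which is the trade-off the comment describes.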

Contributor Author

This is intentional. Users should never use this file.

Comment thread libcudacxx/test/atomic_codegen/dump_and_check.bash Outdated
Comment thread libcudacxx/test/simd_codegen/CMakeLists.txt
endif()

set(simd_codegen_cuda_archs 80 90)
if (CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.9)
Contributor

Don't assume nvcc here. The compiler can also be clang-cuda.

Contributor Author

@fbusato fbusato May 1, 2026

I don't think checking the SASS code for clang-cuda is critical.

Contributor

License header?

Contributor Author

I followed atomic_codegen. I'll defer to @wmaxey on this.

Comment thread libcudacxx/test/simd_codegen/decrement_f32x2.cu Outdated

@fbusato fbusato changed the title cuda::std::simd Optimize Arithmentic Floating-point Operations cuda::std::simd Optimize Arithmetic Floating-point Operations May 1, 2026
@fbusato fbusato moved this from In Review to In Progress in CCCL May 1, 2026
github-actions Bot commented May 2, 2026

😬 CI Workflow Results

🟥 Finished in 1h 08m: Pass: 98%/110 | Total: 19h 00m | Max: 43m 45s | Hits: 98%/307133
