
yaksa/cuda: Use templates for device code #7747

Draft
raffenet wants to merge 3 commits into pmodels:main from raffenet:yaksa-cuda

Conversation

@raffenet (Contributor) commented Mar 11, 2026

Pull Request Description

The generated CUDA pack/unpack kernels in yaksa follow a very C-like style, with separate functions for each combination of type and operation. Since CUDA is a variant of C++, we can use templates to limit this duplication. Implement one pack and one unpack device function per layout that handles all ops and datatypes. Wrappers for each type are maintained to avoid changing the internal abstraction for now.

We also replace hand-written min/max functions with CUDA math API calls that should generate optimal device code.

TODO: quantify compile time and binary size benefits (if any)

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@raffenet (Contributor, Author)
test:mpich/ch4/gpu

Replace the per-operation __global__ kernel functions with a single
template<typename Op> kernel per direction (pack/unpack) per layout×type
combination. Add yaksuri_cudai_ops.cuh defining one C++ functor struct
per operation.

Template pack/unpack kernels on both operation and element type using
template<template<typename> class Op, typename T>, allowing all 9
builtin types to share a single kernel template per layout×direction.
In a 2-level nesting configuration, the files (1 builtin, 5 single-level
+ 25 double-level) now contain 2 kernel templates plus 9×2 host
functions that instantiate them as kernel<YaksuriOpFoo, type>.

Replace the if-based specializations and integer bitwise trick with
CUDA's overloaded max()/min(), which resolve to hardware MAX/MIN
instructions for integer types and FMAX/FMIN for float/double.
No type-specific specializations are needed.
Note that CUDA's max()/min() for floating-point follow IEEE 754-2008
and return the non-NaN operand when one input is NaN.
@raffenet (Contributor, Author)

test:mpich/ch4/gpu

@raffenet (Contributor, Author)

TODO: quantify compile time and binary size benefits (if any)

Binary size is unchanged to slightly larger with this PR using CUDA 12.5.

pmrs-gpu-240-01% ls -lh src/mpi/datatype/typerep/yaksa/.libs/libyaksa.a
-rw-r--r-- 1 raffenet cels 60M Mar 12 11:48 src/mpi/datatype/typerep/yaksa/.libs/libyaksa.a
pmrs-gpu-240-01% ls -lh ../old/src/mpi/datatype/typerep/yaksa/.libs/libyaksa.a
-rw-r--r-- 1 raffenet cels 59M Mar 12 11:55 ../old/src/mpi/datatype/typerep/yaksa/.libs/libyaksa.a

@raffenet (Contributor, Author) commented Mar 12, 2026

TODO: quantify compile time and binary size benefits (if any)

compile time comparison on pmrs-gpu-01 (just yaksa subdirectory)

old: make -j16 846.39s user 198.92s system 1221% cpu 1:25.55 total
new: make -j16 379.03s user 59.51s system 1081% cpu 40.551 total

so a significant improvement in build time, roughly 2× faster wall-clock (1:25.55 down to 40.55s)!

