
yaksa/cuda: Use templates for device code #7747

Draft
raffenet wants to merge 3 commits into pmodels:main from raffenet:yaksa-cuda

Conversation

@raffenet (Contributor) commented Mar 11, 2026

Pull Request Description

The generated CUDA pack/unpack kernels in yaksa follow a very C-like style, with separate functions for each combination of type and operation. Since CUDA is a variant of C++, we can use templates to limit this duplication. Implement one pack and one unpack device function per layout that handles all ops and datatypes. Wrappers for each type are maintained to avoid changing the internal abstraction for now.

We also replace hand-written min/max functions with CUDA math API calls that should generate optimal device code.

TODO: quantify compile time and binary size benefits (if any)

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@raffenet (Contributor, Author)
test:mpich/ch4/gpu

Replace the per-operation __global__ kernel functions with a single
template<typename Op> kernel per direction (pack/unpack) per layout×type
combination. Add yaksuri_cudai_ops.cuh defining one C++ functor struct
per operation.

Template pack/unpack kernels on both operation and element type using
template<template<typename> class Op, typename T>, allowing all 9
builtin types to share a single kernel template per layout×direction.
In a 2-level nesting configuration, the files (1 builtin, 5 single-level
+ 25 double-level) now contain 2 kernel templates plus 9×2 host
functions that instantiate them as kernel<YaksuriOpFoo, type>.

Replace the if-based specializations and integer bitwise trick with
CUDA's overloaded max()/min(), which resolve to hardware MAX/MIN
instructions for integer types and FMAX/FMIN for float/double.
No type-specific specializations are needed.
Note that CUDA's max()/min() for floating-point follow IEEE 754-2008
and return the non-NaN operand when one input is NaN.
@raffenet (Contributor, Author)

test:mpich/ch4/gpu

@raffenet (Contributor, Author)

TODO: quantify compile time and binary size benefits (if any)

Binary size is unchanged to slightly larger with this PR using CUDA 12.5.

pmrs-gpu-240-01% ls -lh src/mpi/datatype/typerep/yaksa/.libs/libyaksa.a
-rw-r--r-- 1 raffenet cels 60M Mar 12 11:48 src/mpi/datatype/typerep/yaksa/.libs/libyaksa.a
pmrs-gpu-240-01% ls -lh ../old/src/mpi/datatype/typerep/yaksa/.libs/libyaksa.a
-rw-r--r-- 1 raffenet cels 59M Mar 12 11:55 ../old/src/mpi/datatype/typerep/yaksa/.libs/libyaksa.a

@raffenet (Contributor, Author) commented Mar 12, 2026

TODO: quantify compile time and binary size benefits (if any)

compile time comparison on pmrs-gpu-01 (just yaksa subdirectory)

old: make -j16 846.39s user 198.92s system 1221% cpu 1:25.55 total
new: make -j16 379.03s user 59.51s system 1081% cpu 40.551 total

so a significant improvement in build time, roughly 2× faster wall-clock (1:25.55 down to 40.55s)!

