yaska/cuda: Use templates for device code #7747
Draft
raffenet wants to merge 3 commits into pmodels:main
test:mpich/ch4/gpu
Replace the per-operation __global__ kernel functions with a single template<typename Op> kernel per direction (pack/unpack) per layout×type combination. Add yaksuri_cudai_ops.cuh defining one C++ functor struct per operation.
Template pack/unpack kernels on both operation and element type using template<template<typename> class Op, typename T>, allowing all 9 builtin types to share a single kernel template per layout×direction. In a 2-level nesting configuration, each generated file (1 builtin, 5 single-level, and 25 double-level) now contains 2 kernel templates plus 9×2 host functions that instantiate them as kernel<YaksuriOpFoo, type>.
Replace the if-based specializations and integer bitwise trick with CUDA's overloaded max()/min(), which resolve to hardware MAX/MIN instructions for integer types and FMAX/FMIN for float/double. No type-specific specializations are needed. Note that CUDA's max()/min() for floating-point follow IEEE 754-2008 and return the non-NaN operand when one input is NaN.
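To illustrate why the NaN note matters, here is a small host-side sketch. It assumes `std::fmax` as a stand-in for CUDA's floating-point `max()` overload (both return the non-NaN operand when exactly one input is NaN); `sketch_max`, `OpMax`, and `naive_max` are hypothetical names, and in device code the overload set would come from the CUDA toolkit rather than being written by hand.

```cpp
#include <cmath>
#include <cstdint>

// Host stand-ins for CUDA's overloaded max(); assumed to mirror its behavior.
inline std::int32_t sketch_max(std::int32_t a, std::int32_t b) { return a > b ? a : b; }
inline float sketch_max(float a, float b) { return std::fmax(a, b); }
inline double sketch_max(double a, double b) { return std::fmax(a, b); }

// A single generic functor: overload resolution picks the right variant,
// so no per-type specialization is needed.
template <typename T>
struct OpMax {
    static void apply(const T &in, T &out) { out = sketch_max(in, out); }
};

// The if-based form for comparison. Any comparison with NaN is false,
// so the result depends on which argument holds the NaN.
template <typename T>
T naive_max(T a, T b) { return (a > b) ? a : b; }
```

With these definitions, `naive_max(1.0, NaN)` yields NaN while `naive_max(NaN, 1.0)` yields 1.0, whereas `OpMax<double>` yields 1.0 in both argument orders. Reductions that previously propagated NaN in some cases may therefore behave differently after the switch.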
test:mpich/ch4/gpu
Binary size is unchanged to slightly worse with this PR using CUDA 12.5.
pmrs-gpu-240-01% ls -lh src/mpi/datatype/typerep/yaksa/.libs/libyaksa.a
-rw-r--r-- 1 raffenet cels 60M Mar 12 11:48 src/mpi/datatype/typerep/yaksa/.libs/libyaksa.a
pmrs-gpu-240-01% ls -lh ../old/src/mpi/datatype/typerep/yaksa/.libs/libyaksa.a
-rw-r--r-- 1 raffenet cels 59M Mar 12 11:55 ../old/src/mpi/datatype/typerep/yaksa/.libs/libyaksa.a
Compile time comparison on pmrs-gpu-01 (just the yaksa subdirectory):
old: make -j16 846.39s user 198.92s system 1221% cpu 1:25.55 total
So at least some significant improvement in build time!
Pull Request Description
The generated CUDA pack/unpack kernels in yaksa follow a very C-like style where we have separate functions for different types and operations. Since CUDA code is a variant of C++, we can utilize templates to limit duplication. Implement one pack and unpack device function per layout that handles all different ops and datatypes. Wrappers for each type are maintained to avoid changing the internal abstraction for now.
We also replace hand-written min/max functions with CUDA math API calls that should generate optimal device code.
TODO: quantify compile time and binary size benefits (if any)
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form: module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your company's PR approval manager.