
Modified CUDA/HIP reduction algorithm#62

Merged
vikaskurapati merged 4 commits into master from vikas/cudareduction
Mar 18, 2026

Conversation


@vikaskurapati vikaskurapati commented Feb 25, 2026

Modified the CUDA reduction algorithm based on #56. Tested and benchmarked against the current implementation on Hopper using Vista's GPUs; the average runtime of the reduction alone is around 1.5x-6x faster on Hopper. I did not test it with SeisSol, and since reduction is only a small component of SeisSol, I do not anticipate any significant speed-up in SeisSol runs from this.

The WorkGroupSize and ItemsPerWorkItem were picked heuristically after benchmarking different configurations. The best configurations, depending on vector size and data type, are listed below. I currently chose (1024, 4) based on the 1e5-1e6 range for float, which covers the typical time-cluster sizes of the big problems we run. If there is a better way to choose these, please let me know.

=== Benchmark Comparison for Type: float (Time Metric) ===
Vector Size | Old Time (ms) | New Time (ms) | Speedup | Best Config (WG, Items)
--- | --- | --- | --- | ---
1e+00 | 0.0030 | 0.0041 | 0.72x | (512, 8)
1e+01 | 0.0030 | 0.0041 | 0.74x | (128, 4)
1e+02 | 0.0029 | 0.0042 | 0.69x | (128, 4)
1e+03 | 0.0029 | 0.0043 | 0.67x | (256, 4)
1e+04 | 0.0054 | 0.0044 | 1.24x | (256, 8)
1e+05 | 0.0304 | 0.0044 | 6.86x | (1024, 4)
1e+06 | 0.2783 | 0.0048 | 57.49x | (1024, 4)
1e+07 | 3.8529 | 0.0111 | 348.49x | (512, 16)
1e+08 | 46.2797 | 0.1087 | 425.86x | (256, 16)
1e+09 | 461.8960 | 1.0327 | 447.28x | (256, 16)

=== Benchmark Comparison for Type: int (Time Metric) ===
Vector Size | Old Time (ms) | New Time (ms) | Speedup | Best Config (WG, Items)
--- | --- | --- | --- | ---
1e+00 | 0.0030 | 0.0041 | 0.73x | (256, 8)
1e+01 | 0.0029 | 0.0042 | 0.69x | (256, 8)
1e+02 | 0.0029 | 0.0041 | 0.71x | (128, 8)
1e+03 | 0.0030 | 0.0040 | 0.74x | (128, 4)
1e+04 | 0.0054 | 0.0043 | 1.27x | (128, 8)
1e+05 | 0.0307 | 0.0043 | 7.07x | (512, 4)
1e+06 | 0.2807 | 0.0048 | 58.37x | (512, 8)
1e+07 | 3.8434 | 0.0107 | 358.10x | (512, 8)
1e+08 | 46.2018 | 0.1086 | 425.34x | (256, 16)
1e+09 | 461.2680 | 1.0323 | 446.82x | (512, 16)

=== Benchmark Comparison for Type: double (Time Metric) ===
Vector Size | Old Time (ms) | New Time (ms) | Speedup | Best Config (WG, Items)
--- | --- | --- | --- | ---
1e+00 | 0.0030 | 0.0042 | 0.73x | (128, 8)
1e+01 | 0.0029 | 0.0040 | 0.73x | (256, 8)
1e+02 | 0.0030 | 0.0041 | 0.72x | (256, 8)
1e+03 | 0.0030 | 0.0043 | 0.70x | (512, 4)
1e+04 | 0.0061 | 0.0043 | 1.41x | (512, 4)
1e+05 | 0.0335 | 0.0045 | 7.44x | (256, 4)
1e+06 | 0.3053 | 0.0056 | 54.97x | (512, 8)
1e+07 | 4.9312 | 0.0262 | 188.43x | (256, 8)
1e+08 | 49.1560 | 0.2110 | 232.95x | (256, 8)
1e+09 | 491.2810 | 2.0570 | 238.83x | (128, 16)
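For context, a given (WorkGroupSize, ItemsPerWorkItem) pair fixes the grid size: each workgroup consumes WorkGroupSize × ItemsPerWorkItem elements, so the number of workgroups is the ceiling division of the vector length by that tile. A host-side sketch (the helper name is illustrative, not from the PR):

```cpp
#include <cstddef>

// Illustrative helper (not the PR's code): number of workgroups needed when
// each of the `workGroupSize` threads processes `itemsPerWorkItem` elements.
constexpr std::size_t blocksForReduction(std::size_t n,
                                         std::size_t workGroupSize,
                                         std::size_t itemsPerWorkItem) {
  const std::size_t tile = workGroupSize * itemsPerWorkItem;
  return (n + tile - 1) / tile;  // ceil(n / tile)
}
```

For the chosen (1024, 4) configuration and a 1e6-element float vector this gives 245 workgroups, i.e. the first pass already compresses the problem by a factor of 4096.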

bug fix
clang-format
modified workgroup sizes

@davschneller davschneller left a comment


LGTM.

Large workgroup sizes are a good idea—just to keep enough data in flight in case some other wave/warp has to wait.


```cpp
dim3 grid(1, 1, 1);
dim3 block(1024, 1, 1);
size_t totalItems = WorkGroupSize * ItemsPerWorkItem;
```
Contributor

Actually more an AMD name :) (NVIDIA would call it a "block")
Also make it const; same for the block count.


```diff
-for (int offset = 1; offset < warpSize; offset *= 2) {
+for (int offset = warpSize / 2; offset > 0; offset /= 2) {
   value = operation(value, shuffledown(value, offset));
```
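The quoted change halves the offset from warpSize/2 down to 1. A host-side C++ emulation (a sketch, not the PR's kernel) shows why lane 0 ends up holding the full reduction: `lanes[lane + offset]` stands in for `shuffledown(value, offset)`, and each step folds the upper half of the active range onto the lower half.

```cpp
#include <array>

// Host-side emulation of the shuffle-down reduction loop (a sketch only).
// After log2(32) = 5 halving steps, lane 0 has combined every lane's value.
template <typename T, typename Op>
T emulateWarpReduce(std::array<T, 32> lanes, Op operation) {
  constexpr int warpSize = 32;
  for (int offset = warpSize / 2; offset > 0; offset /= 2) {
    for (int lane = 0; lane + offset < warpSize; ++lane) {
      lanes[lane] = operation(lanes[lane], lanes[lane + offset]);
    }
  }
  return lanes[0];  // the reduced value lives in lane 0
}
```

Ascending lane order matters in this emulation: each write to `lanes[lane]` only reads `lanes[lane + offset]`, which has not yet been overwritten in that pass, mirroring the simultaneous register exchange on the GPU.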
Contributor

@davschneller davschneller Feb 25, 2026

Just as a side note: you could use cub in the CUDA case here (there are sometimes, e.g. for min/max, more advanced reduction functions starting with Ampere). That should come included with CUDA.

For the AMD case, there should be an equivalent library (which might require explicit linking, however) — or you could use the OCKL functions (as long as they exist). See here for a forward as it's used: https://github.com/ROCm/clr/blob/b90c29358c694e66ea78cb1e3957edad09f35cbf/hipamd/include/hip/amd_detail/amd_warp_sync_functions.h#L57-L92 (you can almost just take a copy of that declaration code for the relevant functions—or try to use the HIP functions below ... EDIT: found the docs: https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html )

... though I'm not sure if there's much speedup to gain in either case.
And it could also be put in some future PR.

Contributor Author

I am not sure if I understand it correctly; in https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html I only found the __reduce warp functions for int or unsigned int types. So I added these if conditions, though I am unsure how beneficial they are. If you suggest removing them and keeping the manual ones, I will remove them.
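The int/unsigned int restriction can be expressed as a compile-time gate. This is an illustrative sketch (trait and function names are hypothetical, not the PR's code); on the host the "intrinsic" branch just loops, where device code would call `__reduce_add_sync` instead:

```cpp
#include <type_traits>

// Hypothetical trait: the warp-level __reduce_*_sync intrinsics only cover
// int and unsigned int, so a gate like this could pick the intrinsic path
// and fall back to the manual shuffle loop for float, double, etc.
template <typename T>
inline constexpr bool hasWarpReduceIntrinsic =
    std::is_same_v<T, int> || std::is_same_v<T, unsigned int>;

// Host stand-in for the dispatch (both branches loop here; on the device the
// first branch would map to __reduce_add_sync(mask, value)).
template <typename T>
T reduceSketch(const T* vals, int n) {
  T acc = T(0);
  if constexpr (hasWarpReduceIntrinsic<T>) {
    for (int i = 0; i < n; ++i) acc += vals[i];  // "intrinsic" path
  } else {
    for (int i = 0; i < n; ++i) acc += vals[i];  // manual shuffle path
  }
  return acc;
}
```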


Surprisingly, these functions are not recognized by HIP in the CI, and they fail, while the NVIDIA tests pass silently. I am a bit confused, but for now, I removed this for HIP too. If there is a particular way to get this done, please let me know. :)

Contributor

@davschneller davschneller Mar 18, 2026

Hmm; I guess only by a ton of ifdefs and the likes. :D

I guess the HIP in the CI is a bit older, then; and still doesn't have the _sync instructions. It would have the OCKL instructions available (I've tested them with an older HIP version). But of course only on AMD.
They should in principle be a little faster, because they avoid accessing the LDS unit (and then waiting upon the result for each stage), but can work completely in VALU and SALU, and also combine with the reduction operation. (cf. https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/ ; the blogpost is a little older—but in essence, the same features plus a few more are also available on modern cards)
Though I think the whole discussion can also be postponed to a future issue.

Contributor Author

Okay. I will remove the ifdefs now; you could probably look at it in the future in a somewhat more systematic way than I did.

```cpp
const int warpId = threadIdx.x / warpSize;

val = warpReduce(val, operation);
if (laneId == 0)
```
Contributor

Brackets


```cpp
auto value = operation.defaultValue;
auto lastAcc = operation.defaultValue;
if (warpId == 0)
```
Contributor

Brackets (it's too easy to slip in another statement between those two at some point otherwise)


```cpp
val = warpReduce(val, operation);
if (laneId == 0)
  shmem[warpId] = val;
```
Contributor

@davschneller davschneller Feb 25, 2026

In principle, you could already use the atomics here.
That's most likely how e.g. SYCL did it internally (or the respective OCKL reductions would do it, or maybe cub has something as well—though I'm not sure how public those are).

(though IIRC there was a performance regression for FP32 on the MI250X in LDS, and only there with that)
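The atomic variant described here can be emulated on the host, with `std::atomic` playing the role of `atomicAdd` and one thread per "warp" (a sketch of the idea only, not the PR's code):

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Each "warp" reduces its chunk to a partial (standing in for the shuffle
// reduction), then its leader adds the partial straight into one shared
// accumulator, skipping the shared-memory staging pass entirely.
int atomicStyleReduce(const std::vector<int>& data, int numWarps) {
  std::atomic<int> total{0};
  const std::size_t chunk = (data.size() + numWarps - 1) / numWarps;
  std::vector<std::thread> warps;
  for (int w = 0; w < numWarps; ++w) {
    warps.emplace_back([&, w] {
      const std::size_t begin = w * chunk;
      const std::size_t end = std::min(data.size(), begin + chunk);
      int partial = 0;
      for (std::size_t i = begin; i < end; ++i) partial += data[i];
      total.fetch_add(partial, std::memory_order_relaxed);  // ~ atomicAdd
    });
  }
  for (auto& t : warps) t.join();
  return total.load();
}
```

The result is deterministic for integer addition regardless of thread ordering; for floating-point types the atomic ordering would change rounding run to run, which is one more reason the two-pass scheme can be attractive.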

Contributor Author

Thanks for pointing it out. However, given the exact LDS FP32 regression you mentioned on the MI250X, sticking to the two-pass shuffle (warp 0 sweeping the shared memory) seems like the safer bet for now. If you would rather have it implemented with atomics, I can try in a different PR, but I would need to benchmark the difference, and I don't have access to LUMI at the moment.
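For comparison, the two-pass path being kept looks roughly like this when emulated on the host (a sketch with illustrative names, not the PR's kernel; plain loops stand in for the warp shuffle reductions):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Pass 1: each warp reduces its 32 values into one "shared memory" slot
// (lane 0 writes). Pass 2: warp 0 sweeps those slots to the final result.
double twoPassBlockReduce(const std::vector<double>& block) {
  constexpr std::size_t warpSize = 32;
  const std::size_t numWarps = (block.size() + warpSize - 1) / warpSize;
  std::vector<double> shmem(numWarps, 0.0);  // one partial per warp
  for (std::size_t w = 0; w < numWarps; ++w) {  // pass 1: per-warp reduce
    const std::size_t end = std::min(block.size(), (w + 1) * warpSize);
    for (std::size_t i = w * warpSize; i < end; ++i) {
      shmem[w] += block[i];
    }
  }
  double result = 0.0;  // pass 2: warp 0 sweeps the partials
  for (double partial : shmem) {
    result += partial;
  }
  return result;
}
```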

Contributor

Ok; sure. You can try it some other time.

@davschneller davschneller changed the title Modified CUDA reduction algorithm Modified CUDA/HIP reduction algorithm Feb 25, 2026
…ake const

- Rename WorkGroupSize -> BlockSize and ItemsPerWorkItem -> ItemsPerThread
  (WorkGroup/WorkItem are AMD/SYCL terminology; block/thread are NVIDIA terms)
- Add curly brackets to if (laneId == 0) and if (warpId == 0) bodies in blockReduce
- Make totalItems and numBlocks const in reduceVector
@vikaskurapati vikaskurapati force-pushed the vikas/cudareduction branch 2 times, most recently from 398d6b7 to 0192755 on March 3, 2026 at 10:51
Format and fix
compilation fix
remove hip for warp reduce functions
@vikaskurapati vikaskurapati merged commit 3c90852 into master Mar 18, 2026
17 checks passed
@vikaskurapati vikaskurapati deleted the vikas/cudareduction branch March 18, 2026 13:50
