Draft of segmented reduce optimization by gevtushenko · Pull Request #578 · NVIDIA/cub

gevtushenko · 2022-09-30T19:36:44Z

This PR applies a technique similar to the one used in the segmented sort algorithm: segments are partitioned, and different thread groups are applied to different segment categories. While optimizing segmented reduction, I introduced a warp reduce agent and generalized the reduce agent implementation. Small segments see the largest gains, with a best-case speedup of about 66x.

Medium-size segments experience minor slowdowns, which could be addressed by further tuning.

Large segments are not affected by the optimization.
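To make the small, medium, and large categories above concrete, here is a minimal host-side sketch of size-based partitioning. It is not the code from this PR: the names `SegmentClass`, `PartitionedSegments`, and `partition_segments`, as well as the two thresholds, are hypothetical and chosen only for illustration. The actual implementation partitions on the device.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical size classes; names and thresholds are illustrative only.
enum class SegmentClass { Small, Medium, Large };

// Segment ids grouped by class, so that each group could later be
// handed to a differently sized thread group.
struct PartitionedSegments {
  std::vector<int> small;
  std::vector<int> medium;
  std::vector<int> large;
};

inline SegmentClass classify(int size, int small_max, int medium_max) {
  if (size <= small_max) return SegmentClass::Small;
  if (size <= medium_max) return SegmentClass::Medium;
  return SegmentClass::Large;
}

// offsets holds num_segments + 1 entries; segment i covers the
// half-open range [offsets[i], offsets[i + 1]).
inline PartitionedSegments partition_segments(const std::vector<int>& offsets,
                                              int small_max,
                                              int medium_max) {
  assert(!offsets.empty());
  assert(small_max <= medium_max);

  PartitionedSegments out;
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    const int size = offsets[i + 1] - offsets[i];
    assert(size >= 0);  // offsets must be non-decreasing

    switch (classify(size, small_max, medium_max)) {
      case SegmentClass::Small:  out.small.push_back(static_cast<int>(i));  break;
      case SegmentClass::Medium: out.medium.push_back(static_cast<int>(i)); break;
      case SegmentClass::Large:  out.large.push_back(static_cast<int>(i));  break;
    }
  }
  return out;
}
```

With offsets `{0, 3, 3, 103, 5103}` and thresholds 16 and 1024, segments 0 and 1 (sizes 3 and 0) fall into the small group, segment 2 (size 100) into the medium group, and segment 3 (size 5000) into the large group.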

In the commits, there's an attempt to fuse small-segment reduction with the partitioning stage. This optimization doesn't perform as well. My guess is that it slows down the decoupled look-back in the partitioning stage or affects its occupancy, which leads to an overall slowdown.

To avoid breaking stream capture (if it is in use), I added a separate check for it. We might need to cover stream capture mode in our tests later.
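The dispatch decision around the stream capture check might look roughly like the sketch below. This is an assumption-heavy illustration, not the PR's code: `ReducePath`, `choose_path`, and `min_segments_for_partitioning` are hypothetical names, the fallback to the plain path while capturing is assumed behavior, and the `is_capturing` callback stands in for a runtime query such as `cudaStreamIsCapturing` so that the example compiles without the CUDA toolkit. The one detail taken from the commit list is that the capture query is skipped when there are not enough segments.

```cpp
#include <cassert>
#include <functional>

// Hypothetical dispatch outcomes.
enum class ReducePath { Plain, Partitioned };

// is_capturing stands in for a runtime query such as cudaStreamIsCapturing.
// min_segments_for_partitioning is an assumed tuning threshold.
inline ReducePath choose_path(int num_segments,
                              int min_segments_for_partitioning,
                              const std::function<bool()>& is_capturing) {
  assert(num_segments >= 0);

  // Too few segments: partitioning would not pay off, so the capture
  // status is never queried.
  if (num_segments < min_segments_for_partitioning) {
    return ReducePath::Plain;
  }

  // Assumed behavior: fall back to the plain path while the stream is
  // being captured, so the capture is not invalidated.
  if (is_capturing()) {
    return ReducePath::Plain;
  }

  return ReducePath::Partitioned;
}
```

Passing the query as a callback keeps it lazy: it runs only after the cheaper segment-count test has passed.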

gevtushenko · 2022-10-09T00:53:58Z

Experimented with a deterministic version of the large-segment optimization that assigns a number of CTAs to each segment. The optimization is quite expensive in terms of memory: it requires about num_segments * (sizeof(int) + 4 * sizeof(AccumulatorT)) bytes. The speedup disappears as soon as there are about 16 large segments (the exact number depends on the number of SMs), so I don't think it's worth it. Just in case, I pushed and then reverted the optimization.
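To put the memory estimate above into numbers, here is a small helper that evaluates the same expression. The function name `large_segment_scratch_bytes` is hypothetical. It only reproduces the formula quoted in the comment and says nothing about how the reverted implementation actually laid out its temporary storage.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Evaluates num_segments * (sizeof(int) + 4 * sizeof(AccumulatorT)),
// the approximate scratch size quoted in the comment above.
template <typename AccumulatorT>
std::size_t large_segment_scratch_bytes(std::size_t num_segments) {
  const std::size_t per_segment = sizeof(int) + 4 * sizeof(AccumulatorT);

  // Guard against overflow of the multiplication below.
  assert(num_segments <= std::numeric_limits<std::size_t>::max() / per_segment);

  return num_segments * per_segment;
}
```

On a platform where `int` is 4 bytes and `double` is 8 bytes, this comes to 36 bytes per segment, so a million segments with a `double` accumulator would need roughly 36 MB of extra temporary storage.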

gevtushenko added 5 commits September 30, 2022 21:22

Optimize segmented reduce

224f433

Fuse partitioning and small segments processing

8239e36

Revert partitioning and small segments processing fusion

2a45a70

This reverts commit 8239e36.

Don't query stream capture if there's not enough segments

82dd606

Fix temporary storage names

0121c2e

gevtushenko added the "P2: nice to have" label (desired, but not necessary) Sep 30, 2022

gevtushenko added 2 commits October 9, 2022 04:21

Optimize large segments

84c02eb

Revert large segments optimization

556f139

This reverts commit 84c02eb.

gevtushenko mentioned this pull request Mar 29, 2023

Performance of small sums could be improved NVIDIA/cccl#921

Closed

leofang mentioned this pull request Jan 26, 2025

Batched cupy.sum on short reduction axes are slow cupy/cupy#8092

Open

jrhemstad mentioned this pull request Oct 9, 2025

[FEA]: Load-balanced segmented reduce NVIDIA/cccl#6171

Open

gevtushenko wants to merge 7 commits into NVIDIA:main from gevtushenko:enh-main/github/segmented_reduce