coll: circulant graph allreduce algorithm by hzhou · Pull Request #7748 · pmodels/mpich

hzhou · 2026-03-16T21:31:37Z

Pull Request Description

An Allreduce algorithm can be composed as Reduce + Bcast. This composition is not round-efficient for small to medium messages, but for large messages it can be efficient because the number of chunks saturates the pipeline.
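To make the composition concrete, here is a minimal sketch that models ranks as in-memory buffers rather than MPI processes. The function name and the buffer layout are hypothetical, and the sketch ignores chunking and pipelining entirely; it only shows the two phases, a reduce into a root followed by a broadcast of the result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory stand-in: bufs[r] is the buffer owned by rank r.
 * Phase 1 (Reduce): sum every rank's data into the root's buffer.
 * Phase 2 (Bcast):  copy the root's result to every other rank. */
static void sketch_allreduce_sum(int **bufs, int nranks, size_t count, int root)
{
    assert(nranks > 0 && root >= 0 && root < nranks);

    /* Reduce: accumulate all non-root contributions into the root. */
    for (int r = 0; r < nranks; r++) {
        if (r == root)
            continue;
        for (size_t i = 0; i < count; i++)
            bufs[root][i] += bufs[r][i];
    }

    /* Bcast: every rank ends up with the reduced result. */
    for (int r = 0; r < nranks; r++) {
        if (r == root)
            continue;
        for (size_t i = 0; i < count; i++)
            bufs[r][i] = bufs[root][i];
    }
}
```

In the real algorithm the two phases would overlap chunk by chunk, which is where the pipelining benefit for large messages comes from; this sketch runs them strictly one after the other.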

composition

Buffer management

[skip warnings]

Author Checklist

Provide Description
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits Follow Good Practice
Commits are self-contained and do not do two things at once.
Commit message is of the form: module: short description
Commit message explains what's in the commit.
Passes All Tests
Whitespace checker. Warnings test. Additional tests via comments.
Contribution Agreement
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your company's PR approval manager.

The chunk_count should fit within chunk_size. Also fix checking of contig datatypes.

Users may set an unreasonable value for MPIR_CVAR_CIRC_GRAPH_CHUNK_SIZE, such as 1, which can cause integer overflow.
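The two fixes above (chunk_count fitting within chunk_size, and using a wide integer type) can be illustrated with a small chunk-planning helper. This is a sketch under assumptions, not the MPICH code: the struct and function names are hypothetical, `int64_t` stands in for MPI_Aint, and the rule that a chunk holds at least one element is an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: split `count` elements of `type_size` bytes into
 * chunks of at most `chunk_size` bytes. int64_t stands in for MPI_Aint. */
typedef struct {
    int64_t chunk_count;        /* elements per chunk */
    int64_t n_chunks;           /* number of chunks needed */
} sketch_chunk_plan;

static sketch_chunk_plan sketch_plan_chunks(int64_t count, int64_t type_size,
                                            int64_t chunk_size)
{
    assert(count > 0 && type_size > 0 && chunk_size > 0);

    sketch_chunk_plan p;
    /* As many whole elements as fit in one chunk. */
    p.chunk_count = chunk_size / type_size;
    /* Assumption: an element is never split, so a chunk holds at least one. */
    if (p.chunk_count < 1)
        p.chunk_count = 1;
    if (p.chunk_count > count)
        p.chunk_count = count;
    /* Ceiling division written to avoid adding before dividing. */
    p.n_chunks = (count - 1) / p.chunk_count + 1;
    return p;
}
```

With a chunk size of 1 byte and three billion 4-byte elements, the chunk count is three billion, which does not fit in a 32-bit `int`; keeping the arithmetic in a 64-bit type avoids the wraparound.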

If MPIR_CVAR_CIRC_GRAPH_CHUNK_SIZE is 0, it forces the circ_graph algorithm to use a single chunk. Since we can't ensure the data will fit in a pre-allocated genq buffer pool, we fall back to malloc.
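A rough sketch of that fallback, assuming a pool with fixed-size cells. Everything here is hypothetical: the pool is modeled as a single static cell, and the names do not correspond to the genq API.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_POOL_CELL_SIZE 4096      /* hypothetical fixed cell size */

typedef struct {
    void *ptr;
    int from_malloc;            /* remember how to release the buffer */
} sketch_packbuf;

/* Stand-in for a pre-allocated buffer pool with a single cell. */
static unsigned char sketch_pool_cell[SKETCH_POOL_CELL_SIZE];

static sketch_packbuf sketch_packbuf_get(size_t chunk_size_setting, size_t data_bytes)
{
    assert(data_bytes > 0);

    sketch_packbuf b;
    if (chunk_size_setting == 0) {
        /* Single chunk: the whole message must fit, so a fixed-size
         * pool cell cannot be relied on. Fall back to malloc. */
        b.ptr = malloc(data_bytes);
        b.from_malloc = 1;
    } else {
        /* Chunked: each chunk is bounded by the setting and fits a cell. */
        assert(chunk_size_setting <= SKETCH_POOL_CELL_SIZE);
        b.ptr = sketch_pool_cell;
        b.from_malloc = 0;
    }
    return b;
}

static void sketch_packbuf_release(sketch_packbuf *b)
{
    if (b->from_malloc)
        free(b->ptr);
    b->ptr = NULL;
}
```

Recording the allocation source in the handle keeps the release path simple: only malloc'd buffers are freed, and pool cells are just dropped.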

To support allreduce as a composition of Reduce + Bcast, we need to allow switching coll_type in the middle of the algorithm. Previously, if there was a dependent recv operation, it was always issued before any sends. If we allow switching coll_type, there will be cases such as: recv->recv->...->send->recv->send->send->... So we need to check every previously issued request in order to clear the recv before the send.
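One way to picture the dependency check is a scan over all previously issued operations. The sketch below is illustrative only: the op record, the per-chunk dependency rule, and the function name are assumptions, not the structures used in the PR.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { SKETCH_OP_RECV, SKETCH_OP_SEND } sketch_op_kind;

/* Hypothetical record of an issued operation. */
typedef struct {
    sketch_op_kind kind;
    int chunk;                  /* which chunk the operation carries */
    bool done;                  /* has the operation completed? */
} sketch_op;

/* A send of `chunk` may proceed only if every previously issued recv of
 * the same chunk has completed. Because recvs and sends can interleave
 * once coll_type switches, all earlier ops are scanned, not just the
 * most recent one. */
static bool sketch_send_ready(const sketch_op *ops, int n_issued, int chunk)
{
    assert(n_issued >= 0);
    for (int i = 0; i < n_issued; i++) {
        if (ops[i].kind == SKETCH_OP_RECV && ops[i].chunk == chunk && !ops[i].done)
            return false;
    }
    return true;
}
```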

1. The persist_packbuf is only checked during the bcast stage.
2. Add a flag, persist_packbuf_loaded, as a dependency to prevent a bcast send from proceeding while the previous packbuf copy is incomplete.
3. To ensure all sends and all recvs are issued in order, we only need to check that the op_stage is past START or COPY (for sends).
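The three points above can be read as a single readiness predicate. The sketch below is one possible reading, not the PR's code: the stage enum values beyond START and COPY, the request struct, and the function names are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stage progression; only START and COPY are named in the text. */
typedef enum {
    SKETCH_STAGE_START,
    SKETCH_STAGE_COPY,
    SKETCH_STAGE_ISSUED,
    SKETCH_STAGE_DONE
} sketch_stage;

typedef struct {
    bool is_send;
    sketch_stage stage;
} sketch_req;

/* A previous op counts as issued once it is past START (recv)
 * or past COPY (send). */
static bool sketch_prev_issued(const sketch_req *prev)
{
    assert(prev != NULL);
    return prev->is_send ? prev->stage > SKETCH_STAGE_COPY
                         : prev->stage > SKETCH_STAGE_START;
}

/* A bcast send may proceed only if the previous op (if any) has been
 * issued and, when it reads from the persistent pack buffer, that
 * buffer has finished loading. */
static bool sketch_bcast_send_ready(const sketch_req *prev, bool uses_persist_packbuf,
                                    bool persist_packbuf_loaded)
{
    if (prev != NULL && !sketch_prev_issued(prev))
        return false;
    if (uses_persist_packbuf && !persist_packbuf_loaded)
        return false;
    return true;
}
```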

Implemented as a composition of Reduce + Bcast.

When the count argument is big, the reduction result, especially for MPI_PROD and MPI_FLOAT, can easily exceed floating-point precision. The test becomes unstable for algorithms that don't perform the reduction in the same order as the provided reference solution. Avoid big values by taking the data mod 10 so we can test arbitrarily large data sizes without worrying about loss of precision.
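The effect of bounding the test data can be sketched as follows. The generator formula and function names are hypothetical, and the sketch uses a sum rather than MPI_PROD to keep the exactness argument simple: sums of a few small integers are exactly representable in `float`, so the result does not depend on the order of reduction.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical test-data generator: every value stays in 0..9,
 * no matter how large the element index is. */
static float sketch_init_val(int rank, size_t i)
{
    return (float) ((i + (size_t) rank) % 10);
}

/* Sum element i across ranks, either in rank order or reversed,
 * to mimic algorithms that reduce in different orders. */
static float sketch_sum_over_ranks(int nranks, size_t i, int reverse)
{
    assert(nranks > 0);
    float acc = 0.0f;
    for (int k = 0; k < nranks; k++) {
        int r = reverse ? nranks - 1 - k : k;
        acc += sketch_init_val(r, i);
    }
    return acc;
}
```

Because each partial sum is a small integer, forward and reverse reduction give bit-identical results even for very large element indices.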

hzhou · 2026-03-19T02:01:17Z

test:mpich/ch3/most
test:mpich/ch4/most

In allreduce, intermediate reductions should avoid operating directly on GPU buffers to cut unnecessary latency.
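The staging idea can be sketched with plain host memory. This is an illustration under assumptions: `memcpy` stands in for device-to-host and host-to-device copies, and the function name and argument layout are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Reduce `n_incoming` contributions into `device_buf` through a host
 * staging buffer. memcpy is a stand-in for real device copies. */
static int sketch_staged_reduce(float *device_buf, const float *const *incoming,
                                int n_incoming, size_t count)
{
    assert(device_buf != NULL && count > 0);

    float *host = malloc(count * sizeof(float));
    if (host == NULL)
        return -1;

    /* One copy out of the (pretend) device buffer. */
    memcpy(host, device_buf, count * sizeof(float));

    /* All intermediate reductions happen in host memory. */
    for (int k = 0; k < n_incoming; k++) {
        for (size_t i = 0; i < count; i++)
            host[i] += incoming[k][i];
    }

    /* One copy back with the final result. */
    memcpy(device_buf, host, count * sizeof(float));
    free(host);
    return 0;
}
```

The device buffer is touched twice regardless of how many intermediate contributions arrive, instead of once per reduction step.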

hzhou force-pushed the 2603_cga_allreduce branch 3 times, most recently from e9d667c to 47a2d4c, on March 17, 2026 14:59

coll/cga: fix reduce chunk_count

87cb737


hzhou force-pushed the 2603_cga_allreduce branch from 47a2d4c to 43f14c6 on March 18, 2026 03:09

hzhou added 3 commits March 18, 2026 10:58

coll/cga: use MPI_Aint to prevent overflow

44dfdcd


coll/cga: use malloc for packbuf if CHUNK_SIZE is 0

84748c4


hzhou force-pushed the 2603_cga_allreduce branch 2 times, most recently from 17c0a0e to 26a3d84, on March 18, 2026 16:05

hzhou added 3 commits March 18, 2026 11:09

coll: add intra_circ_graph allreduce algorithm

e5be20c


hzhou force-pushed the 2603_cga_allreduce branch from 26a3d84 to 5866780 on March 18, 2026 16:10

hzhou force-pushed the 2603_cga_allreduce branch from a14e41d to 4c4fbc9 on March 19, 2026 13:55

coll/cga: add staging to avoid direct gpu reduction

1856eee


hzhou force-pushed the 2603_cga_allreduce branch from 4c4fbc9 to 1856eee on March 19, 2026 14:49

hzhou mentioned this pull request Mar 19, 2026

Allreduce algorithm, performance and codepath issue on ZE gpus #7024

Open

hzhou wants to merge 8 commits into pmodels:main from hzhou:2603_cga_allreduce
