coll: circulant graph allreduce algorithm#7748

Open
hzhou wants to merge 8 commits into pmodels:main from hzhou:2603_cga_allreduce


@hzhou hzhou commented Mar 16, 2026

Pull Request Description

An Allreduce algorithm can be composed as Reduce + Bcast. This composition is not round-efficient for small to medium messages, but it can be efficient for large messages, where the number of chunks saturates the pipeline.
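As a rough illustration of the composition only, the following sketch simulates the ranks as in-memory buffers and processes the data chunk by chunk: each chunk is first reduced onto rank 0, then copied back to every rank. The function names, the fixed root, and the sum operation are assumptions made for this example. It runs sequentially and does not model the circulant graph schedule or the pipelining across chunks that the PR implements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: element-wise sum of src into dst. */
static void reduce_sum(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}

/* Simulated allreduce over nranks buffers stored back to back:
 * element i of rank r lives at bufs[r * count + i].
 * Each chunk goes through a Reduce stage (onto rank 0) followed by
 * a Bcast stage (from rank 0 to every other rank). */
static void allreduce_reduce_bcast(double *bufs, int nranks,
                                   size_t count, size_t chunk)
{
    assert(nranks > 0 && chunk > 0);
    for (size_t off = 0; off < count; off += chunk) {
        size_t n = (count - off < chunk) ? count - off : chunk;
        double *root = bufs + off;      /* rank 0's slice of this chunk */

        /* Reduce stage: accumulate every rank's chunk into rank 0. */
        for (int r = 1; r < nranks; r++)
            reduce_sum(root, bufs + (size_t) r * count + off, n);

        /* Bcast stage: copy the reduced chunk to every other rank. */
        for (int r = 1; r < nranks; r++)
            for (size_t i = 0; i < n; i++)
                bufs[(size_t) r * count + off + i] = root[i];
    }
}
```

In a real implementation the Bcast of one chunk can overlap with the Reduce of later chunks, which is where the large-message efficiency mentioned above would come from.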

composition

image

Buffer management

image

[skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@hzhou hzhou force-pushed the 2603_cga_allreduce branch 3 times, most recently from e9d667c to 47a2d4c Compare March 17, 2026 14:59
The chunk_count should fit within chunk_size.

Also fix checking of contig datatypes.
@hzhou hzhou force-pushed the 2603_cga_allreduce branch from 47a2d4c to 43f14c6 Compare March 18, 2026 03:09
hzhou added 3 commits March 18, 2026 10:58
Users may set an unreasonable value for
MPIR_CVAR_CIRC_GRAPH_CHUNK_SIZE, such as 1, which may cause an integer
overflow.
If MPIR_CVAR_CIRC_GRAPH_CHUNK_SIZE is 0, the circ_graph algorithm is
forced to use a single chunk. Since we can't ensure the data will fit
in a pre-allocated genq buffer pool, we fall back to malloc.
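The two chunk-size rules above can be sketched as a small planning function. This is a sketch under stated assumptions, not the code in this PR: `plan_chunks`, `struct chunk_plan`, and the `MIN_CHUNK_SIZE` bound of 1024 bytes are hypothetical names and values chosen for illustration. It assumes the overflow risk comes from a tiny chunk size producing a very large chunk count, so it clamps the chunk size from below and computes the count in 64-bit arithmetic without forming `data_size + chunk_size - 1`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lower bound on the chunk size, in bytes. */
#define MIN_CHUNK_SIZE 1024

struct chunk_plan {
    int64_t chunk_size;   /* bytes per chunk */
    int64_t num_chunks;   /* always at least 1 */
    bool use_malloc;      /* true when pooled buffers are bypassed */
};

/* cvar_chunk_size stands in for the value of
 * MPIR_CVAR_CIRC_GRAPH_CHUNK_SIZE; 0 means "use a single chunk". */
static struct chunk_plan plan_chunks(int64_t data_size, int64_t cvar_chunk_size)
{
    struct chunk_plan p;
    assert(data_size >= 0 && cvar_chunk_size >= 0);

    if (cvar_chunk_size == 0) {
        /* Single chunk: its size is unbounded, so do not rely on a
         * fixed-size buffer pool. */
        p.chunk_size = data_size;
        p.num_chunks = 1;
        p.use_malloc = true;
        return p;
    }

    /* Clamp unreasonably small values such as 1. */
    p.chunk_size = cvar_chunk_size < MIN_CHUNK_SIZE ? MIN_CHUNK_SIZE
                                                    : cvar_chunk_size;
    /* Ceiling division written to avoid overflow in the numerator. */
    p.num_chunks = data_size / p.chunk_size + (data_size % p.chunk_size != 0);
    if (p.num_chunks == 0)
        p.num_chunks = 1;
    p.use_malloc = false;
    return p;
}
```

For example, under these assumptions a 5000-byte message with a chunk size of 2048 is planned as three chunks, while a chunk size of 0 yields one 5000-byte chunk backed by malloc.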
To support allreduce as a composition of Reduce + Bcast, we need to
allow switching coll_type in the middle of the algorithm.

Previously, if there was a dependent recv operation, it was always
issued before any sends. If we allow switching coll_type, there will be
cases of:
    recv->recv->...->send->recv->send->send->...
So we need to check every previously issued request in order to clear
recvs before a send.
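The ordering concern above can be illustrated with a simplified model. The types and the function `can_issue_send` are hypothetical and are not taken from the MPICH source. The sketch assumes that a send must wait for every earlier recv, which is one conservative reading of the commit message; the actual dependency tracking in the PR may be narrower. When recvs and sends interleave, looking only at the head of the request list is no longer enough, so the whole list is scanned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified request record. */
enum req_kind { REQ_RECV, REQ_SEND };

struct req {
    enum req_kind kind;
    bool complete;
};

/* Returns true when a new send may be issued, i.e. every recv among
 * the n previously issued requests has completed. Because recvs may
 * now appear after sends (recv->recv->send->recv->send...), all
 * entries are checked rather than only a leading run of recvs. */
static bool can_issue_send(const struct req *reqs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (reqs[i].kind == REQ_RECV && !reqs[i].complete)
            return false;
    }
    return true;
}
```

With the sequence recv (done), recv (done), send (pending), recv (pending), a check that stops at the first send would wrongly allow the next send; the full scan catches the trailing recv.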
@hzhou hzhou force-pushed the 2603_cga_allreduce branch 2 times, most recently from 17c0a0e to 26a3d84 Compare March 18, 2026 16:05
hzhou added 3 commits March 18, 2026 11:09
1. The persist_packbuf is only checked during the bcast stage.

2. Add a flag, persist_packbuf_loaded, as a dependency to prevent
a bcast send from proceeding before the previous packbuf copy completes.

3. To ensure that all sends and all recvs are issued in order, we only
need to check that the op_stage is past START or COPY (for sends).
Implemented as a composition of Reduce + Bcast.
When the count argument is big, the reduction result, especially for
MPI_PROD and MPI_FLOAT, can easily exceed floating-point precision.
The test becomes unstable for algorithms that don't perform the
reduction in the same order as the provided reference solution.

Avoid big values by applying mod 10 to the data so we can test
arbitrarily large data sizes without worrying about loss of precision.
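The effect of keeping the data small can be sketched for a sum in single precision. The helper names are hypothetical and this is not the test code from the PR. A `float` represents every integer up to 2^24 exactly, so as long as all partial sums stay below that bound the result does not depend on the order of the additions. The sketch covers sums only; it makes no claim about products.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical test-data initializer: every value is in 0..9. */
static void fill_mod10(float *v, size_t n, size_t seed)
{
    for (size_t i = 0; i < n; i++)
        v[i] = (float) ((i + seed) % 10);
}

/* Sum in ascending index order, as a reference solution might. */
static float sum_forward(const float *v, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Sum in descending index order, standing in for an algorithm that
 * reduces in a different order from the reference. */
static float sum_backward(const float *v, size_t n)
{
    float s = 0.0f;
    for (size_t i = n; i > 0; i--)
        s += v[i - 1];
    return s;
}
```

With 1000 values the largest partial sum is 4500, far below 2^24, so both orders give the same exactly representable result and an exact comparison against the reference is safe.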
@hzhou hzhou force-pushed the 2603_cga_allreduce branch from 26a3d84 to 5866780 Compare March 18, 2026 16:10

hzhou commented Mar 19, 2026

test:mpich/ch3/most
test:mpich/ch4/most

@hzhou hzhou force-pushed the 2603_cga_allreduce branch from a14e41d to 4c4fbc9 Compare March 19, 2026 13:55
In allreduce, intermediate reductions should avoid operating directly
on GPU buffers to cut unnecessary latency.