coll: circulant graph allreduce algorithm #7748
Open
hzhou wants to merge 8 commits into pmodels:main from
e9d667c to 47a2d4c
The chunk_count should fit within chunk_size. Also fix checking of contig datatypes.
47a2d4c to 43f14c6
Users may set an unreasonable value for MPIR_CVAR_CIRC_GRAPH_CHUNK_SIZE, such as 1, which can cause an integer overflow.
If MPIR_CVAR_CIRC_GRAPH_CHUNK_SIZE is 0, it forces the circ_graph algorithm to use a single chunk. Since we can't ensure the data will fit in a pre-allocated genq buffer pool, we fall back to malloc.
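The two chunk-size cases above can be illustrated with a minimal sketch. It is not the code from this PR: `plan_chunks`, `chunk_plan_t`, and the `MIN_CHUNK_SIZE` floor are hypothetical names and values chosen for illustration, and clamping is only one possible way to guard against a tiny chunk size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lower bound for the chunk size; not taken from the PR. */
#define MIN_CHUNK_SIZE ((size_t) 1024)

/* Hypothetical result type describing how a message is split. */
typedef struct {
    size_t chunk_size;   /* bytes per chunk */
    size_t chunk_count;  /* number of chunks, always >= 1 */
    int use_malloc;      /* 1 if the staging buffer should come from malloc */
} chunk_plan_t;

/* cvar_chunk_size stands in for MPIR_CVAR_CIRC_GRAPH_CHUNK_SIZE. */
static chunk_plan_t plan_chunks(size_t data_size, size_t cvar_chunk_size)
{
    chunk_plan_t plan;

    if (cvar_chunk_size == 0) {
        /* Single chunk: the data may not fit a pre-allocated pool cell,
         * so request a malloc'ed buffer instead. */
        plan.chunk_size = data_size;
        plan.chunk_count = 1;
        plan.use_malloc = 1;
        return plan;
    }

    /* Assumed guard: clamp tiny values (e.g. 1) so the chunk count stays
     * within a sane range. */
    size_t cs = cvar_chunk_size < MIN_CHUNK_SIZE ? MIN_CHUNK_SIZE : cvar_chunk_size;

    plan.chunk_size = cs;
    /* Ceiling division written so that data_size + cs cannot overflow. */
    plan.chunk_count = data_size / cs + (data_size % cs != 0 ? 1 : 0);
    if (plan.chunk_count == 0)
        plan.chunk_count = 1;
    plan.use_malloc = 0;

    assert(plan.chunk_count >= 1);
    return plan;
}
```

With these assumptions, a 10000-byte message and a chunk size of 1 is planned as 10 chunks of 1024 bytes rather than 10000 one-byte chunks.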
To support allreduce as a composition of Reduce + Bcast, we need to allow
switching coll_type in the middle of the algorithm.
Previously, if there was a dependent recv operation, it was always issued
before any sends. If we allow switching coll_type, there will be cases such as:
recv->recv->...->send->recv->send->send->...
So we need to check every previously issued request in order to clear recvs
before a send.
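The check described above can be sketched as a scan over all previously issued requests. This is an illustration under assumptions, not the PR's implementation: `issued_op_t`, `op_kind_t`, and `can_issue_send` are hypothetical names, and a plain `done` flag stands in for request completion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of an operation that has already been issued. */
typedef enum { OP_RECV, OP_SEND } op_kind_t;

typedef struct {
    op_kind_t kind;
    bool done;   /* true once the request has completed */
} issued_op_t;

/* Returns true when a new send may be issued, i.e. every recv issued
 * earlier has completed. Because sends and recvs can interleave once
 * coll_type switches mid-algorithm, the whole list is scanned instead of
 * only its leading entries. */
static bool can_issue_send(const issued_op_t *issued, size_t n)
{
    assert(issued != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (issued[i].kind == OP_RECV && !issued[i].done)
            return false;
    }
    return true;
}
```

For a history of recv (done), send (pending), recv (pending), the function returns false until the trailing recv completes.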
17c0a0e to 26a3d84
1. The persist_packbuf is only checked during the bcast stage.
2. Add a flag, persist_packbuf_loaded, as a dependency to prevent a bcast send from proceeding while the previous packbuf copy is incomplete.
3. To keep all sends and all recvs issued in order, we only need to check whether the op_stage is past START or COPY (for sends).
Implemented as a composition of Reduce + Bcast.
When the count argument is big, the reduction result, especially for MPI_PROD and MPI_FLOAT, can easily exceed floating-point precision. The test becomes unstable for algorithms that don't perform the reduction in the same order as the provided reference solution. Avoid big values by taking the data mod 10 so we can test arbitrarily large data sizes without worrying about loss of precision.
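As a rough illustration of the mod-10 idea, the sketch below fills a test buffer with small integer values and computes the expected sum. The helper names `fill_test_data` and `expected_sum` are hypothetical and not taken from the MPICH test suite, and the sketch covers only a sum reduction. Sums of small integers are exactly representable in a float as long as they stay below 2^24, so the result does not depend on the order of the reduction.

```c
#include <assert.h>
#include <stddef.h>

/* Fill a rank's buffer with values in 0..9, regardless of how large
 * count is. The (i + rank) pattern is an assumption for illustration. */
static void fill_test_data(float *buf, size_t count, int rank)
{
    assert(buf != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        buf[i] = (float) ((i + (size_t) rank) % 10);
}

/* Reference result of a sum reduction for element i over nprocs ranks. */
static float expected_sum(size_t i, int nprocs)
{
    float sum = 0.0f;
    for (int r = 0; r < nprocs; r++)
        sum += (float) ((i + (size_t) r) % 10);
    return sum;
}
```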
26a3d84 to 5866780
test:mpich/ch3/most
a14e41d to 4c4fbc9
In allreduce, intermediate reductions should avoid operating directly on GPU buffers to cut unnecessary latency.
4c4fbc9 to 1856eee
Pull Request Description
An Allreduce algorithm can be composed as Reduce + Bcast. It is not round-efficient for small to medium messages, but for large messages it can be efficient as the number of chunks saturates the pipeline.
composition
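The Reduce + Bcast composition can be illustrated with a small single-process simulation. This is a sketch under assumptions, not the circulant graph schedule from this PR: `sim_allreduce_sum` is a hypothetical helper, ranks are modelled as rows of one array, rank 0 is assumed to be the root, and the chunks are processed one after another rather than overlapped in a pipeline.

```c
#include <assert.h>
#include <stddef.h>

/* Simulated allreduce (sum) over nranks buffers of count ints each,
 * stored row-major as data[rank * count + i]. Each chunk is first reduced
 * into the root's buffer, then copied back out to every other rank. */
static void sim_allreduce_sum(int *data, int nranks, size_t count, size_t chunk_len)
{
    const size_t root = 0;
    assert(data != NULL && nranks > 0);

    if (chunk_len == 0)
        chunk_len = count;   /* treat 0 as "use a single chunk" */

    for (size_t off = 0; off < count; off += chunk_len) {
        size_t len = (count - off < chunk_len) ? count - off : chunk_len;

        /* Reduce stage: accumulate this chunk from every rank into the root. */
        for (size_t r = 0; r < (size_t) nranks; r++) {
            if (r == root)
                continue;
            for (size_t i = 0; i < len; i++)
                data[root * count + off + i] += data[r * count + off + i];
        }

        /* Bcast stage: copy the reduced chunk from the root to every rank. */
        for (size_t r = 0; r < (size_t) nranks; r++) {
            if (r == root)
                continue;
            for (size_t i = 0; i < len; i++)
                data[r * count + off + i] = data[root * count + off + i];
        }
    }
}
```

In a real implementation the bcast of one chunk can overlap with the reduce of the next, which is where the benefit for large messages comes from; the simulation only shows that the composition yields the same result on every rank.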
Buffer management
[skip warnings]
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form: module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your company's PR approval manager.