Proposed Changes
Update hierarchical all_reduce and all_gather internal generation arithmetic from (2k-1, 2k) to (3k-2, 3k) (all_reduce.hpp:459-460, all_gather.hpp:420-421). This reserves a uniform stride-three internal generation footprint per user-level call across all hierarchical collectives, leaving slot 3k-1 available for the upcoming three-phase hierarchical all_to_all without colliding on the shared communicators' generation namespace.
Fix the pre-existing bad_parameter strings in all_reduce.hpp:451 and all_gather.hpp:412 from "the 2k/2k+1 internal mapping" to "the 3k-2/3k internal mapping". The previous strings described arithmetic that did not match the actual code (which used 2k-1, 2k).
Add a cross-collective generation regression test that interleaves all_reduce and all_gather calls at consecutive user generations on a single shared hierarchical_communicator and asserts that all calls succeed under the new arithmetic. The four-call version that also exercises all_to_all will be added in a follow-up PR when hierarchical all_to_all lands.
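As an illustration of the new arithmetic, here is a minimal sketch of the mapping from a user-level generation `k` to the internal generations used by a two-phase collective. The helper names are hypothetical and are not part of HPX; the sketch assumes user generations start at 1.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical helper (not HPX API): internal generations consumed by a
// two-phase hierarchical collective at user-level generation k (k >= 1).
constexpr std::pair<std::size_t, std::size_t> two_phase_generations(
    std::size_t k)
{
    return {3 * k - 2, 3 * k};
}

// Hypothetical helper: the slot that two-phase collectives skip and that a
// three-phase collective would use for its middle phase.
constexpr std::size_t reserved_middle_generation(std::size_t k)
{
    return 3 * k - 1;
}

// User generation 1 maps to internal generations 1 and 3; slot 2 is reserved.
static_assert(two_phase_generations(1).first == 1);
static_assert(two_phase_generations(1).second == 3);
static_assert(reserved_middle_generation(1) == 2);

// The first slot of call k+1 directly follows the last slot of call k, so
// consecutive user-level calls never share an internal generation.
static_assert(
    two_phase_generations(2).first == two_phase_generations(1).second + 1);
```

Each user-level call therefore owns the three consecutive slots `3k-2`, `3k-1`, and `3k`, whether or not it uses the middle one.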
Any background context you want to provide?
This is the first prerequisite PR for hierarchical all_to_all, per the direction in discussion #7200. Hierarchical all_to_all is a three-phase decomposition (intra-subtree gather → inter-representative all_to_all → intra-subtree scatter) that consumes three internal generation slots per user-level call. Two-phase hierarchical collectives sharing the same hierarchical_communicator currently use stride-two arithmetic, which collides on the gate's generation counter once a stride-three call has run on the same communicator. The architectural decision recorded in #7200 is to use a uniform stride of three across all hierarchical collectives; two-phase ones use slots 3k-2 and 3k and skip 3k-1.
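The collision can be illustrated with plain slot arithmetic. The sketch below is not HPX code; the helper names are hypothetical, and it only compares the sets of internal generations that each shape would claim for a given user generation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helpers for illustration only (k is the user generation, k >= 1).
std::vector<std::size_t> old_two_phase_slots(std::size_t k)
{
    return {2 * k - 1, 2 * k};    // stride-two arithmetic before this PR
}

std::vector<std::size_t> new_two_phase_slots(std::size_t k)
{
    return {3 * k - 2, 3 * k};    // stride-three arithmetic, 3k-1 skipped
}

std::vector<std::size_t> three_phase_slots(std::size_t k)
{
    return {3 * k - 2, 3 * k - 1, 3 * k};    // gather, all_to_all, scatter
}

// True if the two slot sets share at least one internal generation.
bool overlaps(
    std::vector<std::size_t> const& a, std::vector<std::size_t> const& b)
{
    return std::any_of(a.begin(), a.end(), [&](std::size_t s) {
        return std::find(b.begin(), b.end(), s) != b.end();
    });
}
```

A three-phase call at user generation 1 claims slots 1, 2, and 3. A stride-two call at user generation 2 would then ask for slots 3 and 4, reusing slot 3, whereas the stride-three mapping asks for slots 4 and 6.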
The skip carries no extra round trip and requires no internal communicator API changes. next_generation (detail/communicator.hpp:447) already accepts new_generation >= generation_ and post-increments, so issuing the last phase at internal generation 3k directly leaves the gate at 3k+1, ready for the next user-level call. Both two-phase and three-phase shapes leave the gate in the same state, so they can be freely interleaved on the same communicator across user generations.
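The gate behavior described above can be modeled with a small toy. This is a sketch under stated assumptions, not the code in detail/communicator.hpp: `toy_gate`, `run_two_phase`, and `run_three_phase` are hypothetical names, and the gate is assumed to start at generation 1.

```cpp
#include <cstddef>

// Toy model only; it mirrors the described behavior (accept
// new_generation >= generation_, then move one past it).
struct toy_gate
{
    std::size_t generation_ = 1;    // assumed initial state

    bool next_generation(std::size_t new_generation)
    {
        if (new_generation < generation_)
            return false;    // stale generation, rejected
        generation_ = new_generation + 1;
        return true;
    }
};

// Two-phase call at user generation k: slots 3k-2 and 3k, skipping 3k-1.
bool run_two_phase(toy_gate& g, std::size_t k)
{
    return g.next_generation(3 * k - 2) && g.next_generation(3 * k);
}

// Three-phase call at user generation k: slots 3k-2, 3k-1, and 3k.
bool run_three_phase(toy_gate& g, std::size_t k)
{
    return g.next_generation(3 * k - 2) && g.next_generation(3 * k - 1) &&
        g.next_generation(3 * k);
}
```

In this model both shapes leave the gate at `3k+1` after user generation `k`, which is why a two-phase call at generation 1, a three-phase call at generation 2, and another two-phase call at generation 3 are all accepted in sequence.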
The bad_parameter string fix is a small pre-existing documentation slip that this PR cleans up while in the same files. The strings already described arithmetic that did not match the actual code, so updating them to "3k-2/3k" leaves the source consistent with both the new code and the new design note.
The full design and the broader prerequisite PR sequence are written up in docs/gsoc-2026/hierarchical_all_to_all_design.md (§9 covers the stride-three mechanism in detail; §14, step 0, scopes this PR).
Checklist
Not all points below apply to all pull requests.
I have added a new feature and have added tests to go along with it.
I have fixed a bug and have added a regression test.
I have added a test using random numbers; I have made sure it uses a seed, and that random numbers generated are valid inputs for the tests.
I updated the design note to reflect the decisions from this thread and to make the prerequisite PR sequence explicit. In particular, it records the stride-three generation plan, the shared top-level partition helper, the scoped gather/scatter helpers needed for top reps, and the padded Phase 2 block layout.
Living copy: https://github.com/iemAnshuman/hpx/blob/docs/hierarchical-all-to-all-design/docs/hierarchical_all_to_all_design.md
I will use this as the working reference for the stride-three and partition-helper PRs unless there are objections.