`MPIR_CVAR_ENABLE_YAKSA_REDUCTION = 0`; this enables the fallback path
(host-based) for reduction.

* `MPIR_CVAR_YAKSA_REDUCTION_THRESHOLD`: This CVAR determines the threshold to

If the message size is smaller than `MPIR_CVAR_YAKSA_REDUCTION_THRESHOLD`, reduction collectives pass the GPU data directly to the reduction algorithms, assuming the internal yaksa engine can operate on GPU data directly, potentially using a GPU kernel. If the message size is larger than `MPIR_CVAR_YAKSA_REDUCTION_THRESHOLD`, reduction collectives always pack the GPU data into host memory before passing it on to the reduction algorithms. The motivation for this CVAR is that the current reduction algorithms are optimized for host memory and underperform with large GPU messages.
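To make the suggested wording concrete, the decision could be sketched roughly as below. This is only an illustration of the rule as described in this comment, not MPICH code: `choose_reduce_path`, `reduce_path_t`, and both parameter names are hypothetical, and the behavior when the message size equals the threshold is an assumption, since the comment covers only "smaller" and "larger".

```c
#include <stddef.h>

/* Hypothetical names for illustration only; these are not MPICH symbols. */
typedef enum {
    REDUCE_PATH_DIRECT_GPU,   /* hand the GPU buffer straight to the reduction algorithm */
    REDUCE_PATH_PACK_TO_HOST  /* pack GPU data into host memory first */
} reduce_path_t;

/* threshold_bytes stands in for the value of MPIR_CVAR_YAKSA_REDUCTION_THRESHOLD. */
static reduce_path_t choose_reduce_path(size_t msg_bytes, size_t threshold_bytes)
{
    /* Smaller than the threshold: let yaksa operate on the GPU data directly. */
    if (msg_bytes < threshold_bytes)
        return REDUCE_PATH_DIRECT_GPU;

    /* Larger than the threshold: stage through host memory.
     * Assumption: the equal case is treated the same way here, because the
     * description above does not say which side it falls on. */
    return REDUCE_PATH_PACK_TO_HOST;
}
```

If the documentation adopts this wording, it would help to state explicitly which path is taken when the message size equals the threshold.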

### 2.7. Fallback Behavior for Collective Algorithm

MPICH will fall back if the selected algorithm is not applicable to the

Can we improve this description of when fallback occurs? I think we should state that fallback can occur both for user-specified selections and for default selections from the .json configurations and CVAR overrides. It is slightly counter-intuitive that even if you force a particular algorithm with a CVAR, you may still fall back, and that should be clear here.
Pull Request Description
This PR updates and adds descriptions for some MPICH CVARs.
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form: `module: short description`
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your company's PR approval manager.