Upgrade to xpk v1.0.0, disable nccl installation on host on cluster creation #1881
Open
aybchan wants to merge 25 commits into
a5e71e6 to fca94d5
fca94d5 to e0d315f
b6b3fec to 67d047c
…olbox into aybchan/upgrade-xpk-v1.0.0
f8ac796 to 1ef6d3b
1ef6d3b to fa2fbc5
olupton reviewed on Feb 4, 2026
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
NVCR_TOKEN: ${{ secrets.NVCR_TOKEN }}
ENVS:
NCCL_NET_PLUGIN=/opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so;
Collaborator
This is disabling the TCPXO plugin on GCP, isn't it?
Comment on lines -143 to -148
# Work around GCP's deployment model that munges together three
# mostly unrelated things: (1) the host machine's CUDA driver/libs,
# (2) the version of NCCL installed on the host machine, and (3)
# the GCP-specific NCCL plugins. These are jumbled together and
# mounted into the container at /usr/local/nvidia/lib64. We only want
# #3, so copy them to a separate directory.
Collaborator
IIUC this PR is stopping NCCL from being installed on the host machine, i.e. negating (2), but (3) and (1) are still being jumbled together in the same directory on the host machine? So we'd also need to stop that to get good perf OOTB.
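For illustration only, the "copy them to a separate directory" step described in the quoted comment could be sketched as below. This is not the code from the PR. The function name `copy_plugin_libs`, the destination directory, and the `libnccl-net*` filename pattern are assumptions made for the example; the actual set of GCP plugin files may differ.

```python
import shutil
from pathlib import Path


def copy_plugin_libs(src_dir, dst_dir, patterns=("libnccl-net*",)):
    """Copy only files matching `patterns` from src_dir into dst_dir.

    Hypothetical helper: the default pattern is an assumption about how
    the NCCL network plugin libraries are named, not a list of the files
    GCP actually ships.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in patterns:
        for path in sorted(src.glob(pattern)):
            # Skip directories; driver libraries and the host NCCL are
            # left behind simply because they do not match the pattern.
            if path.is_file() and path.name not in copied:
                shutil.copy2(path, dst / path.name)
                copied.append(path.name)
    return copied
```

A container entrypoint could call something like `copy_plugin_libs("/usr/local/nvidia/lib64", "/opt/nccl-plugins")` (the destination path is hypothetical) and then expose only the destination directory to the loader, so that the host's CUDA libraries and NCCL build are not picked up alongside the plugins.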
No description provided.