Upgrade to xpk v1.0.0, disable nccl installation on host on cluster creation #1881

Open
aybchan wants to merge 25 commits into main from aybchan/upgrade-xpk-v1.0.0

Conversation

Member

@aybchan aybchan commented Jan 7, 2026

No description provided.

@aybchan aybchan self-assigned this Jan 7, 2026
@aybchan aybchan marked this pull request as draft January 7, 2026 12:04
@aybchan aybchan marked this pull request as ready for review January 8, 2026 16:45
@aybchan aybchan force-pushed the aybchan/upgrade-xpk-v1.0.0 branch 2 times, most recently from a5e71e6 to fca94d5 January 9, 2026 16:44
@aybchan aybchan force-pushed the aybchan/upgrade-xpk-v1.0.0 branch from fca94d5 to e0d315f January 9, 2026 17:50
@aybchan aybchan force-pushed the aybchan/upgrade-xpk-v1.0.0 branch from b6b3fec to 67d047c January 9, 2026 21:46
@aybchan aybchan marked this pull request as draft January 12, 2026 09:43
@aybchan aybchan marked this pull request as ready for review January 12, 2026 14:05
@aybchan aybchan marked this pull request as draft January 12, 2026 14:05
@aybchan aybchan marked this pull request as ready for review January 14, 2026 15:18
@aybchan aybchan force-pushed the aybchan/upgrade-xpk-v1.0.0 branch from f8ac796 to 1ef6d3b January 14, 2026 16:38
@aybchan aybchan force-pushed the aybchan/upgrade-xpk-v1.0.0 branch from 1ef6d3b to fa2fbc5 January 14, 2026 16:48
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
NVCR_TOKEN: ${{ secrets.NVCR_TOKEN }}
ENVS:
NCCL_NET_PLUGIN=/opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so;
Collaborator

This is disabling the TCPXO plugin on GCP, isn't it?

Comment on lines -143 to -148
# Work around GCP's deployment model that munges together three
# mostly unrelated things: (1) the host machine's CUDA driver/libs,
# (2) the version of NCCL installed on the host machine, and (3)
# the GCP-specific NCCL plugins. These are jumbled together and
# mounted into the container at /usr/local/nvidia/lib64. We only want
# #3, so copy them to a separate directory.
Collaborator

IIUC this PR is stopping NCCL from being installed on the host machine, i.e. negating (2), but (3) and (1) are still being jumbled together in the same directory on the host machine? So we'd also need to stop that to get good perf OOTB.
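
For reference, the workaround described in the removed comment (keep only the GCP-specific NCCL plugins from the jumbled mount and copy them to a separate directory) could be sketched roughly as below. This is an illustrative sketch, not the code that was removed: the helper name `copy_plugin_libs`, the filename patterns, and the destination directory in the usage note are all assumptions, since the actual plugin file names are not given in this thread.

```python
import fnmatch
import shutil
from pathlib import Path

# Hypothetical filename patterns for the GCP-specific NCCL plugins.
# The real file names are not stated in this thread; adjust as needed.
PLUGIN_PATTERNS = ("libnccl-net*.so*", "libnccl-tuner*.so*")


def copy_plugin_libs(src_dir, dst_dir, patterns=PLUGIN_PATTERNS):
    """Copy only files whose names match `patterns` from src_dir to dst_dir.

    Returns the sorted list of copied file names. Everything else in
    src_dir (e.g. the host CUDA driver libs and the host NCCL) is left behind.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.iterdir()):
        if path.is_file() and any(fnmatch.fnmatch(path.name, p) for p in patterns):
            shutil.copy2(path, dst / path.name)
            copied.append(path.name)
    return copied
```

A container entrypoint might call `copy_plugin_libs("/usr/local/nvidia/lib64", "/opt/gcp-nccl-plugins")` (the destination path is hypothetical) and then put only the destination directory on the loader path, so that the host's NCCL in the source directory is never picked up.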

2 participants