fix: skip self NVLink connection check #582#609
Open
pazyork wants to merge 1 commit intodeepseek-ai:mainfrom
Open
fix: skip self NVLink connection check #582#609pazyork wants to merge 1 commit intodeepseek-ai:mainfrom
pazyork wants to merge 1 commit intodeepseek-ai:mainfrom
Conversation
Fixes deepseek-ai#582: avoid false assertion when multiple ranks share the same physical GPU (common in local single-GPU debugging, container GPU sharing, MIG scenarios). Add unit test for this case.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #582
Problem
When multiple ranks share the same physical GPU (common in local single-GPU debugging, container GPU sharing, MIG vGPU scenarios, even on multi-GPU servers), the NVLink initialization check triggers a false assertion:
AssertionError: No NVLink connection between GPU 0 and GPU 0Root cause
The original code only skips duplicate checks by index
i >= j, but doesn't handle the case where different rank indices map to the same physical GPU ID.Fix
Add
or physical_device_indices[i] == physical_device_indices[j]condition to the check loop, skip NVLink check for same physical GPU pairs.Test
Added lightweight unit test in
tests/test_utils.pywith mocked pynvml/distributed, no real multi-GPU hardware required. Verified:Impact
No breaking changes, only affects the NVLink initialization check process. All existing normal deployment scenarios are completely unaffected.
Thanks for reviewing! It's a tiny fix, feel free to let me know if you have any suggestions and I'll adjust promptly. @jershi425 @sphish