fix: skip self NVLink connection check #582 #609

Open
pazyork wants to merge 1 commit into deepseek-ai:main from pazyork:main

pazyork commented Apr 24, 2026

Fixes #582

Problem
When multiple ranks share the same physical GPU (common in local single-GPU debugging, container GPU sharing, and MIG/vGPU scenarios, and possible even on multi-GPU servers), the NVLink initialization check raises a false assertion:
AssertionError: No NVLink connection between GPU 0 and GPU 0

Root cause
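The original code skips only index pairs with i >= j, which avoids checking a pair twice or checking a rank index against itself. It does not handle the case where two different rank indices map to the same physical GPU ID, so the loop ends up checking NVLink between a GPU and itself and the assertion fails.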

Fix
Extend the skip condition in the check loop with "or physical_device_indices[i] == physical_device_indices[j]", so that the NVLink check is skipped for any pair of ranks on the same physical GPU.
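For reviewers, the changed condition can be sketched in isolation as below. This is a minimal illustration rather than the code in this PR: the function name check_nvlink_pairs and the has_nvlink callback are hypothetical stand-ins for the real pynvml query, and only physical_device_indices and the assertion message come from the description above.

```python
from typing import Callable, List, Tuple


def check_nvlink_pairs(
    physical_device_indices: List[int],
    has_nvlink: Callable[[int, int], bool],
) -> List[Tuple[int, int]]:
    """Assert NVLink connectivity for every pair of ranks on distinct GPUs.

    `has_nvlink` is a hypothetical callback standing in for the real NVML
    query. Returns the list of physical GPU pairs that were actually checked.
    """
    checked = []
    for i, dev_i in enumerate(physical_device_indices):
        for j, dev_j in enumerate(physical_device_indices):
            # Skip mirrored/self index pairs (i >= j) and, with this fix,
            # any pair of ranks that map to the same physical GPU.
            if i >= j or dev_i == dev_j:
                continue
            assert has_nvlink(dev_i, dev_j), (
                f'No NVLink connection between GPU {dev_i} and GPU {dev_j}'
            )
            checked.append((dev_i, dev_j))
    return checked
```

With two ranks on GPU 0, the function returns an empty list instead of asserting, while ranks on distinct GPUs are still checked as before.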

Test
Added a lightweight unit test in tests/test_utils.py that mocks pynvml and distributed, so no real multi-GPU hardware is required. Verified:

  1. Duplicate GPU scenario no longer raises assertion
  2. Normal multi-GPU scenario works unchanged
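The two cases above could be exercised roughly as follows. This is a sketch under stated assumptions, not the test added in tests/test_utils.py: check_nvlink and make_fake_nvml are hypothetical helpers, the fake module is a MagicMock whose attribute names mirror pynvml's P2P query, and the distributed gather step is left out by passing the device indices in directly.

```python
from unittest import mock


def check_nvlink(physical_device_indices, nvml):
    """Hypothetical stand-in for the check loop; `nvml` is a pynvml-like module."""
    handles = [nvml.nvmlDeviceGetHandleByIndex(i) for i in physical_device_indices]
    for i, handle in enumerate(handles):
        for j, peer_handle in enumerate(handles):
            # Skip mirrored pairs and ranks that share one physical GPU.
            if i >= j or physical_device_indices[i] == physical_device_indices[j]:
                continue
            status = nvml.nvmlDeviceGetP2PStatus(
                handle, peer_handle, nvml.NVML_P2P_CAPS_INDEX_NVLINK)
            assert status == nvml.NVML_P2P_STATUS_OK, (
                f'No NVLink connection between GPU {physical_device_indices[i]} '
                f'and GPU {physical_device_indices[j]}'
            )


def make_fake_nvml():
    """Build a mock that reports NVLink only between distinct devices."""
    fake = mock.MagicMock()
    fake.NVML_P2P_STATUS_OK = 0
    # Use the device index itself as the "handle" to keep the fake simple.
    fake.nvmlDeviceGetHandleByIndex.side_effect = lambda idx: idx
    fake.nvmlDeviceGetP2PStatus.side_effect = lambda a, b, caps: 0 if a != b else 1
    return fake


def test_duplicate_gpu_is_skipped():
    fake = make_fake_nvml()
    check_nvlink([0, 0], fake)  # must not raise
    fake.nvmlDeviceGetP2PStatus.assert_not_called()


def test_distinct_gpus_are_checked():
    fake = make_fake_nvml()
    check_nvlink([0, 1], fake)
    fake.nvmlDeviceGetP2PStatus.assert_called_once()
```

Passing the NVML module as a parameter is a choice made for this sketch; patching the import with mock.patch would serve equally well if the function imports pynvml internally.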

Impact
No breaking changes. The fix only affects the NVLink initialization check, and only for rank pairs that share a physical GPU; existing deployments with one rank per GPU are unaffected.

Thanks for reviewing! It's a tiny fix, feel free to let me know if you have any suggestions and I'll adjust promptly. @jershi425 @sphish

Fixes deepseek-ai#582: avoid false assertion when multiple ranks share the same physical GPU (common in local single-GPU debugging, container GPU sharing, MIG scenarios). Add unit test for this case.