#54 - Add RTX PRO 6000 Blackwell Server Edition support to tune_system.py #121
chloecrozier wants to merge 1 commit into
| Filename | Overview |
|---|---|
| python/tune_system.py | Adds RTX PRO 6000 Blackwell Server Edition detection with per-GPU BAR1 threshold, dma-buf path detection for peermem check, CPU governor output aggregation, and lru_cache on get_nic_info; logic is sound with one minor architecture-agnostic detection concern. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[check_peermem_kernel] --> B{peermem loaded?}
B -- Yes --> C[INFO: loaded]
B -- No --> D{is_any_integrated_gpu?}
D -- Yes --> E[INFO: integrated GPU, no peermem needed]
D -- No --> F{_dmabuf_gpu_path_available?\n/dev/dma_heap/system exists?}
F -- Yes --> G[INFO: dma-buf path available,\npatched DPDK takes this route]
F -- No --> H[WARNING: peermem not loaded]
I[check_bar1_size] --> J{is_any_integrated_gpu?}
J -- Yes --> K[skip: no resizable BAR1]
J -- No --> L[_gpu_name_by_bdf\nnvidia-smi --query-gpu=pci.bus_id,name]
L --> M[nvidia-smi -q -d MEMORY\nparse BAR1 per GPU]
M --> N{Blackwell Server Edition\nin gpu_names?}
N -- Yes --> O{bar1_total < 32 GiB?}
O -- Yes --> P[WARNING: BAR1 low,\ncheck BIOS Resizable BAR]
O -- No --> Q[INFO: BAR1 size OK]
N -- No --> R{bar1_total > 1024 MiB?}
R -- Yes --> Q
R -- No --> S[WARNING: BAR1 may indicate issue]
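The peermem half of the flowchart reduces to a short chain of early returns. The following is a minimal sketch of that branch order, not the code in `tune_system.py`: the function name `peermem_verdict`, its boolean parameters, and the message strings are hypothetical, and only the `/dev/dma_heap/system` path comes from the diagram.

```python
import os


def dmabuf_gpu_path_available(path="/dev/dma_heap/system"):
    """Return True if the dma-buf system heap device node exists."""
    return os.path.exists(path)


def peermem_verdict(peermem_loaded, integrated_gpu, dmabuf_available):
    """Return (level, message) following the flowchart's branch order.

    All three inputs are plain booleans so the decision logic can be
    exercised without touching the real system.
    """
    if peermem_loaded:
        return ("INFO", "peermem loaded")
    if integrated_gpu:
        return ("INFO", "integrated GPU, no peermem needed")
    if dmabuf_available:
        return ("INFO", "dma-buf path available")
    return ("WARNING", "peermem not loaded")


# Example: discrete GPU, peermem missing, dma-buf heap present.
print(peermem_verdict(False, False, True))
```

Keeping the decision separate from the probing means the warning only fires when every earlier branch has been ruled out, which matches the "No GPU-type gate" behavior described in the commit message.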
Reviews (3): Last reviewed commit: "#54 - Add RTX PRO 6000 support"
      except FileNotFoundError:
    -     print(
    -         "The ibdev2netdev command is not found. Ensure that it is installed and available in your PATH."
    +     logging.warning(
    +         "The ibdev2netdev command is not found (try: apt install infiniband-diags). "
    +         "Skipping NIC-dependent checks (mrrs, mps, mtu)."
          )
    -     return [], []
    +     return []
      except subprocess.CalledProcessError as e:
    -     print(f"Error while executing ibdev2netdev: {e}")
    -     return [], []
    +     logging.error(f"Error while executing ibdev2netdev: {e}")
    +     return []
      except Exception as e:
    -     print(f"An unexpected error occurred: {e}")
    -     return [], []
    +     logging.error(f"Unexpected error while running ibdev2netdev: {e}")
    +     return []
ibdev2netdev warning emitted multiple times under --check all
get_nic_info() now owns the warning, but --check all calls it three times independently — once each from check_mrrs(), check_max_payload_size(), and check_mtu_size() — so a missing ibdev2netdev produces three copies of the same warning in a single run. The PR's stated goal is collapsing redundant output (done for the CPU governor), but this case is left un-collapsed. A caller-side guard (cache the result, or emit the warning only once with a module-level flag) would be consistent with that goal.
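One way to get the "emit once" behavior suggested here is to memoize the lookup, so the failure path (and its warning) runs only on the first call. This is a minimal sketch under assumptions: `get_nic_info_cached` and its `command` parameter are hypothetical names, and the parsing is a placeholder rather than the real `get_nic_info()` logic.

```python
import functools
import logging
import subprocess


@functools.lru_cache(maxsize=None)
def get_nic_info_cached(command=("ibdev2netdev",)):
    """Run the NIC query once per distinct command; later calls hit the cache.

    Returns a tuple (immutable, so callers cannot corrupt the cached value).
    An empty tuple signals "no NIC information available".
    """
    try:
        result = subprocess.run(
            list(command), capture_output=True, text=True, check=True
        )
    except FileNotFoundError:
        # Logged only on the first call; cached calls never reach this line.
        logging.warning(
            "%s not found; skipping NIC-dependent checks", command[0]
        )
        return ()
    except subprocess.CalledProcessError as e:
        logging.error("Error while executing %s: %s", command[0], e)
        return ()
    # Placeholder parsing: one tuple of whitespace-separated fields per line.
    return tuple(
        tuple(line.split()) for line in result.stdout.splitlines() if line.strip()
    )
```

Because the cache stores the error result too, three callers under `--check all` would share a single subprocess invocation and a single warning. The trade-off is that a tool installed mid-run is not picked up until `get_nic_info_cached.cache_clear()` is called.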
a4be5be to 7a1daa1
Addressed and applied the Greptile suggestions.
    else()
        message(WARNING
            "CUDA Toolkit ${CMAKE_CUDA_COMPILER_VERSION} is older than 13.0; sm_120 "
            "(RTX PRO 6000 Blackwell Server Edition) will be omitted from "
What's the reason we have a message specific to this GPU? In theory we should support any reasonably new GPU I think
Good point, I removed it! I see that the tune script already catches arch mismatches at runtime, so this is redundant
            "(e.g. GB10 / DGX Spark) where peermem does not apply. Use kind: host_pinned "
            "in the daqiri YAML for GPUDirect on this platform."
        )
    elif is_any_blackwell_server_edition_discrete() and _dmabuf_gpu_path_available():
Same question as above about having a message specific to a GPU type
Yeah since dma-buf availability is what matters, I dropped the SKU gate and just made the message generic
Updates python/tune_system.py so the existing IGX / DGX Spark detection
paths have a discrete-Blackwell sibling, while keeping every user-facing
message hardware-agnostic.
- check_peermem_kernel: when /dev/dma_heap/system is present, replaces
the misleading "load nvidia-peermem" warning with a hardware-agnostic
INFO that points at the patched-DPDK dma-buf path. Falls back to the
original WARN on stock-DPDK builds. No GPU-type gate.
- check_bar1_size: per-GPU 32 GiB Blackwell-class threshold via
_gpu_name_by_bdf(), so heterogeneous boxes only get the Blackwell
rule on the Blackwell card. The user-visible message includes the
actual nvidia-smi product name rather than a hard-coded SKU string.
- check_cpu_governor: aggregates per-CPU output into one summary line
so a 256-core system is not buried in 256 identical errors.
- get_nic_info: returns [] consistently on error paths (was returning
([], []) which crashed callers); cached via lru_cache so --check all
runs ibdev2netdev once and emits the missing-tool warning at most once.
Validated on a 5x RTX PRO 6000 Blackwell SE / 256-core EPYC dev box:
--check peermem produces the new generic INFO, BAR1 verified at 128 GiB
per card, cpu-freq summarizes 256 cores in one line, and the
ibdev2netdev-missing path emits a single WARNING.
Signed-off-by: Chloe Crozier <chloecrozier@gmail.com>
7a1daa1 to 5c99511
dleshchev left a comment
Can be merged after fixing the nits below and one more thing (identified by Claude):
The PR body lists under Added: "CMake warns when CUDA Toolkit < 13.0 omits sm_120." That is not in this diff (git show HEAD --name-only → only tune_system.py), and the only related in-tree logic doesn't match the claim on two counts:
- CMakeLists.txt:28-31 silently appends arch 121 when CUDA ≥ 13.0 — it does not warn when CUDA < 13.0.
- Arch 121 is GB10 / DGX Spark (sm_121), per AGENTS.md. RTX PRO 6000 Blackwell SE is sm_120, which is not in the default arch list (80;90 + 121).
Net effect: the tuning script now advertises/validates RTX PRO 6000 support, but a from-source build with default CMAKE_CUDA_ARCHITECTURES won't actually compile sm_120 kernels for it. Please either (a) drop the CMake bullet from the description if it belongs to a different PR, or (b) include the intended CMake change here — and if real RTX PRO 6000 support is the goal, confirm whether 120 needs adding to the default arch list (out of scope for this file, but it's what "support" implies). At minimum the description should match what merges.
        return os.path.exists("/dev/dma_heap/system")


    def _gpu_name_by_bdf():
Doesn't this mirror the existing get_nvidia_gpu_info_by_bdf?
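For context on what the helper in question has to do: the flowchart describes it as running `nvidia-smi --query-gpu=pci.bus_id,name` and keying product names by bus ID. The sketch below shows only the parsing step, assuming the CSV form produced with `--format=csv,noheader`. The function name `parse_gpu_names_by_bdf` is hypothetical, and the sample rows are illustrative values, not captured output.

```python
def parse_gpu_names_by_bdf(csv_text):
    """Map PCI bus ID -> product name from 'pci.bus_id, name' CSV rows."""
    names = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        # Split on the first comma only, in case a product name contains one.
        bdf, _, name = line.partition(",")
        names[bdf.strip()] = name.strip()
    return names


# Illustrative input for a heterogeneous box (made-up rows).
sample = (
    "00000000:01:00.0, RTX PRO 6000 Blackwell Server Edition\n"
    "00000000:41:00.0, NVIDIA H100\n"
)
print(parse_gpu_names_by_bdf(sample))
```

If the existing `get_nvidia_gpu_info_by_bdf` already returns the product name per BDF, a separate parser like this would be redundant, which is the point the question raises.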
        """
        Checks if the CPU frequency governor is set to 'performance' for all online CPUs.
        Output is bucketed by result so a 256-core system does not emit 256 lines when
        every CPU is in the same state. Per-CPU detail still surfaces if results vary.
Not sure this docstring is correct: so far the function only reports overall stats, not per-CPU detail.
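To make the docstring's two claims concrete, here is a minimal sketch of bucketing that emits one line when every CPU agrees and lists the CPU IDs per bucket when results vary. It is an illustration, not the PR's implementation: `summarize_governors`, its input shape (a dict of CPU index to governor name), and the message wording are all assumptions.

```python
from collections import defaultdict


def summarize_governors(governor_by_cpu, expected="performance"):
    """Return one summary line per distinct governor value.

    When all CPUs share a governor, a single line is produced. When they
    differ, each bucket's line also lists the CPU indices it contains.
    """
    buckets = defaultdict(list)
    for cpu, governor in sorted(governor_by_cpu.items()):
        buckets[governor].append(cpu)

    lines = []
    for governor, cpus in sorted(buckets.items()):
        level = "INFO" if governor == expected else "ERROR"
        if len(buckets) == 1:
            lines.append(f"{level}: all {len(cpus)} CPU(s) use governor '{governor}'")
        else:
            lines.append(
                f"{level}: {len(cpus)} CPU(s) use governor '{governor}': {cpus}"
            )
    return lines


print(summarize_governors({cpu: "performance" for cpu in range(256)}))
print(summarize_governors({0: "performance", 1: "powersave", 2: "performance"}))
```

With this shape, the uniform 256-core case collapses to one line, while a mixed system still tells the reader which cores are off.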
    # The threshold is applied per-GPU via gpu_names below so heterogeneous
    # boxes (e.g. RTX PRO 6000 + H100) only get the Blackwell rule on the
    # Blackwell card.
    BAR1_BLACKWELL_MIN_MIB = 32768  # 32 GiB
Do we want to move this to the top, along with the other constants?
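A rough sketch of what hoisting the threshold to module level could look like, with the per-GPU rule from the flowchart applied by product name. `BAR1_BLACKWELL_MIN_MIB` and the 1024 MiB figure come from the diff and the flowchart. The name `BAR1_DEFAULT_MIN_MIB` and the function `bar1_ok` are hypothetical.

```python
# Module-level constants, grouped with the script's other thresholds.
BAR1_BLACKWELL_MIN_MIB = 32768  # 32 GiB, value taken from the diff
BAR1_DEFAULT_MIN_MIB = 1024  # hypothetical name; 1024 MiB is from the flowchart


def bar1_ok(gpu_name, bar1_total_mib):
    """Apply the Blackwell rule only to GPUs whose name identifies them as such."""
    if "Blackwell Server Edition" in gpu_name:
        # Flowchart: warn when bar1_total < 32 GiB, otherwise OK.
        return bar1_total_mib >= BAR1_BLACKWELL_MIN_MIB
    # Flowchart: OK only when bar1_total > 1024 MiB.
    return bar1_total_mib > BAR1_DEFAULT_MIN_MIB


# Heterogeneous box: each card is judged against its own threshold.
print(bar1_ok("RTX PRO 6000 Blackwell Server Edition", 131072))
print(bar1_ok("NVIDIA H100", 2048))
```

Note the asymmetry carried over from the flowchart: the Blackwell check passes at exactly 32 GiB, while the default check requires strictly more than 1024 MiB.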
Updates the existing tune script and the CMake build to account for RTX PRO 6000 Blackwell Server Edition, alongside the IGX / DGX Spark paths. Every check still answers "is the system tuned for max throughput?".
Added
Fixed
- ibdev2netdev missing no longer crashes --check all
- ibdev2netdev warning emits once per run, not three times
Example output (dev box: 5× RTX PRO 6000 Blackwell SE, 256-core EPYC)