#54 - Add RTX PRO 6000 Blackwell Server Edition support to tune_system.py by chloecrozier · Pull Request #121 · NVIDIA/daqiri

chloecrozier · 2026-06-04T03:24:38Z

Updates the existing tune script and the CMake build to recognize RTX PRO 6000 Blackwell Server Edition, alongside the existing IGX / DGX Spark paths. Every check still answers the same question: "is the system tuned for max throughput?".

Added

Detects RTX PRO 6000 Blackwell Server Edition cards
Flags low BAR1 only on that card
Explains why peermem is optional on Blackwell
CMake warns when CUDA Toolkit < 13.0 omits sm_120

Fixed

ibdev2netdev missing no longer crashes --check all
256 CPU governor lines collapsed to one summary
ibdev2netdev warning emits once per run, not three times

Example output (dev box: 5× RTX PRO 6000 Blackwell SE, 256-core EPYC)

$ sudo python3 python/tune_system.py --check cpu-freq
ERROR - CPU governor: scaling_governor file not found on 256/256 online CPUs.
        The cpufreq driver may not be loaded (e.g. amd-pstate, intel_pstate, or
        cppc_cpufreq). Performance scaling cannot be checked.

$ sudo python3 python/tune_system.py --check bar1-size
INFO - GPU 00000000:04:00.0: BAR1 size is 131072 MiB.
INFO - GPU 00000000:73:00.0: BAR1 size is 131072 MiB.
INFO - GPU 00000000:74:00.0: BAR1 size is 131072 MiB.
INFO - GPU 00000000:84:00.0: BAR1 size is 131072 MiB.
INFO - GPU 00000000:F3:00.0: BAR1 size is 131072 MiB.

$ sudo python3 python/tune_system.py --check peermem
INFO - nvidia-peermem module is not loaded. On this RTX PRO 6000 Blackwell Server
       Edition system /dev/dma_heap/system is available, so the patched DPDK shipped
       with this repo (dpdk_patches/dmabuf.patch) takes the dma-buf GPUDirect path
       and does not need peermem. If you are building DAQIRI against stock DPDK
       instead, load nvidia-peermem.

$ sudo python3 python/tune_system.py --check mrrs   # when ibdev2netdev is missing
WARNING - The ibdev2netdev command is not found (try: apt install infiniband-diags).
          Skipping NIC-dependent checks (mrrs, mps, mtu).

greptile-apps · 2026-06-04T03:30:17Z

Greptile Summary

This PR extends python/tune_system.py to handle RTX PRO 6000 Blackwell Server Edition hardware: a per-GPU 32 GiB BAR1 threshold (applied only to cards whose nvidia-smi product name contains "Blackwell Server Edition"), a dma-buf GPUDirect path check that replaces the nvidia-peermem warning when /dev/dma_heap/system is available, CPU governor output collapsed from one-per-core to one summary line, and get_nic_info fixed to return a consistent [] (previously returned ([], []) on error, crashing callers) and cached via lru_cache so --check all runs ibdev2netdev once.

BAR1 / Blackwell path (check_bar1_size): calls _gpu_name_by_bdf() to build a {pci_bdf: product_name} map, then matches each GPU against the threshold only when "Blackwell Server Edition" appears in the name; heterogeneous multi-GPU boxes with mixed architectures are handled correctly.
Peermem check (check_peermem_kernel): adds a new elif _dmabuf_gpu_path_available() branch that emits INFO instead of WARNING when the kernel dma-buf heap is present; the check is architecture-agnostic by design but relies solely on the kernel-side device node rather than confirming driver-side dma-buf support.
get_nic_info fix: old ([], []) error-path return caused IndexError in callers on any ibdev2netdev failure; the fix returns [] consistently and replaces bare print() calls with logging.warning/error.

Confidence Score: 5/5

Safe to merge; changes are confined to a diagnostic script with no impact on the core C++/CUDA library or its build system.

All logic paths have been validated on the author's 5-GPU/256-core dev box and the output matches expectations. The get_nic_info return-type fix is correct and callers handle the new [] return cleanly. The BAR1 threshold and Blackwell name-matching logic are sound. The one notable concern — that _dmabuf_gpu_path_available() tests only the kernel-side device node and not driver-side dma-buf support — affects a diagnostic INFO message rather than any data path or build output.

No files require special attention; the single changed file is a standalone Python diagnostic script.

Important Files Changed

Filename	Overview
python/tune_system.py	Adds RTX PRO 6000 Blackwell Server Edition detection with per-GPU BAR1 threshold, dma-buf path detection for peermem check, CPU governor output aggregation, and lru_cache on get_nic_info; logic is sound with one minor architecture-agnostic detection concern.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[check_peermem_kernel] --> B{peermem loaded?}
    B -- Yes --> C[INFO: loaded]
    B -- No --> D{is_any_integrated_gpu?}
    D -- Yes --> E[INFO: integrated GPU, no peermem needed]
    D -- No --> F{_dmabuf_gpu_path_available?\n/dev/dma_heap/system exists?}
    F -- Yes --> G[INFO: dma-buf path available,\npatched DPDK takes this route]
    F -- No --> H[WARNING: peermem not loaded]

    I[check_bar1_size] --> J{is_any_integrated_gpu?}
    J -- Yes --> K[skip: no resizable BAR1]
    J -- No --> L[_gpu_name_by_bdf\nnvidia-smi --query-gpu=pci.bus_id,name]
    L --> M[nvidia-smi -q -d MEMORY\nparse BAR1 per GPU]
    M --> N{Blackwell Server Edition\nin gpu_names?}
    N -- Yes --> O{bar1_total < 32 GiB?}
    O -- Yes --> P[WARNING: BAR1 low,\ncheck BIOS Resizable BAR]
    O -- No --> Q[INFO: BAR1 size OK]
    N -- No --> R{bar1_total > 1024 MiB?}
    R -- Yes --> Q
    R -- No --> S[WARNING: BAR1 may indicate issue]
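
The BAR1 branch of the flowchart can be read as a small per-GPU classifier. The following is a minimal sketch of that decision only, assuming the two thresholds shown above (32 GiB for cards whose name contains "Blackwell Server Edition", 1024 MiB otherwise). The function name `classify_bar1` and the constant `BAR1_DEFAULT_MIN_MIB` are hypothetical and are not taken from tune_system.py.

```python
# Thresholds as described in this PR; names other than
# BAR1_BLACKWELL_MIN_MIB are illustrative assumptions.
BAR1_BLACKWELL_MIN_MIB = 32768  # 32 GiB
BAR1_DEFAULT_MIN_MIB = 1024     # generic rule from the flowchart


def classify_bar1(gpu_name: str, bar1_total_mib: int) -> str:
    """Return the log level ("INFO" or "WARNING") for a single GPU."""
    if "Blackwell Server Edition" in gpu_name:
        # Blackwell rule: warn only when BAR1 is below 32 GiB.
        return "WARNING" if bar1_total_mib < BAR1_BLACKWELL_MIN_MIB else "INFO"
    # Generic rule: BAR1 must be strictly larger than 1024 MiB.
    return "INFO" if bar1_total_mib > BAR1_DEFAULT_MIN_MIB else "WARNING"
```

Because the name test runs per GPU, a mixed box gets the 32 GiB rule only on the matching cards, which is the behaviour the summary describes for heterogeneous systems.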

greptile-apps · 2026-06-04T03:30:21Z

    except FileNotFoundError:
-        print(
-            "The ibdev2netdev command is not found. Ensure that it is installed and available in your PATH."
+        logging.warning(
+            "The ibdev2netdev command is not found (try: apt install infiniband-diags). "
+            "Skipping NIC-dependent checks (mrrs, mps, mtu)."
        )
-        return [], []
+        return []
    except subprocess.CalledProcessError as e:
-        print(f"Error while executing ibdev2netdev: {e}")
-        return [], []
+        logging.error(f"Error while executing ibdev2netdev: {e}")
+        return []
    except Exception as e:
-        print(f"An unexpected error occurred: {e}")
-        return [], []
+        logging.error(f"Unexpected error while running ibdev2netdev: {e}")
+        return []


ibdev2netdev warning emitted multiple times under --check all

get_nic_info() now owns the warning, but --check all calls it three times independently — once each from check_mrrs(), check_max_payload_size(), and check_mtu_size() — so a missing ibdev2netdev produces three copies of the same warning in a single run. The PR's stated goal is collapsing redundant output (done for the CPU governor), but this case is left un-collapsed. A caller-side guard (cache the result, or emit the warning only once with a module-level flag) would be consistent with that goal.
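
One way to get the "cache the result" behaviour is `functools.lru_cache`, which stores the return value of the first call so later callers never re-run the command or re-emit the warning. This is a rough sketch under assumptions, not the code in this PR: the function name `get_nic_info_cached`, its `command` parameter, and the tuple return type are all illustrative.

```python
import functools
import logging
import subprocess


@functools.lru_cache(maxsize=None)
def get_nic_info_cached(command: str = "ibdev2netdev") -> tuple:
    """Run the command once per process and return its non-empty output lines.

    A tuple is returned (rather than a list) so callers cannot mutate the
    cached value. Error paths return an empty tuple, which is also cached,
    so the warning below is logged at most once per distinct command.
    """
    try:
        result = subprocess.run(
            [command], capture_output=True, text=True, check=True
        )
    except FileNotFoundError:
        logging.warning(
            "The %s command is not found. Skipping NIC-dependent checks.", command
        )
        return ()
    except subprocess.CalledProcessError as e:
        logging.error("Error while executing %s: %s", command, e)
        return ()
    return tuple(line for line in result.stdout.splitlines() if line.strip())
```

With this shape, three checks calling the function in one run share a single subprocess invocation. The trade-off is that the result is never refreshed within a process, which seems acceptable for a one-shot diagnostic script.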

chloecrozier · 2026-06-04T03:40:33Z

Addressed and applied the greptile suggestions

cliffburdick · 2026-06-04T04:34:43Z

+  else()
+    message(WARNING
+      "CUDA Toolkit ${CMAKE_CUDA_COMPILER_VERSION} is older than 13.0; sm_120 "
+      "(RTX PRO 6000 Blackwell Server Edition) will be omitted from "


What's the reason we have a message specific to this GPU? In theory we should support any reasonably new GPU I think

chloecrozier · Jun 4, 2026

Good point, I removed it! The tune script already catches arch mismatches at runtime, so this is redundant.

cliffburdick · 2026-06-04T04:36:55Z

                "(e.g. GB10 / DGX Spark) where peermem does not apply. Use kind: host_pinned "
                "in the daqiri YAML for GPUDirect on this platform."
            )
+        elif is_any_blackwell_server_edition_discrete() and _dmabuf_gpu_path_available():


Same question as above about having a message specific to a GPU type

chloecrozier · Jun 4, 2026

Yeah, since dma-buf availability is what matters, I dropped the SKU gate and just made the message generic.
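
With the SKU gate dropped, the peermem check reduces to an ordered chain of conditions. The sketch below illustrates that ordering only. `peermem_verdict` and its parameters are hypothetical names, and the messages are paraphrased rather than copied from the script. As noted in the Greptile summary, testing for /dev/dma_heap/system confirms only the kernel-side device node, not driver-side dma-buf support.

```python
import os


def peermem_verdict(peermem_loaded: bool,
                    integrated_gpu: bool,
                    dma_heap_path: str = "/dev/dma_heap/system") -> tuple:
    """Return (level, message) following the order used in the review thread."""
    if peermem_loaded:
        return ("INFO", "nvidia-peermem module is loaded")
    if integrated_gpu:
        return ("INFO", "integrated GPU: peermem does not apply")
    if os.path.exists(dma_heap_path):
        # Hardware-agnostic: only the presence of the dma-buf heap matters here.
        return ("INFO", "dma-buf path available; patched DPDK does not need peermem")
    return ("WARNING", "nvidia-peermem module is not loaded")
```

Passing the heap path as a parameter keeps the function testable on machines that have no dma-buf heap.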

Updates python/tune_system.py so the existing IGX / DGX Spark detection paths have a discrete-Blackwell sibling, while keeping every user-facing message hardware-agnostic.

- check_peermem_kernel: when /dev/dma_heap/system is present, replaces the misleading "load nvidia-peermem" warning with a hardware-agnostic INFO that points at the patched-DPDK dma-buf path. Falls back to the original WARN on stock-DPDK builds. No GPU-type gate.
- check_bar1_size: per-GPU 32 GiB Blackwell-class threshold via _gpu_name_by_bdf(), so heterogeneous boxes only get the Blackwell rule on the Blackwell card. The user-visible message includes the actual nvidia-smi product name rather than a hard-coded SKU string.
- check_cpu_governor: aggregates per-CPU output into one summary line so a 256-core system is not buried in 256 identical errors.
- get_nic_info: returns [] consistently on error paths (was returning ([], []) which crashed callers); cached via lru_cache so --check all runs ibdev2netdev once and emits the missing-tool warning at most once.

Validated on a 5x RTX PRO 6000 Blackwell SE / 256-core EPYC dev box: --check peermem produces the new generic INFO, BAR1 verified at 128 GiB per card, cpu-freq summarizes 256 cores in one line, and the ibdev2netdev-missing path emits a single WARNING.

Signed-off-by: Chloe Crozier <chloecrozier@gmail.com>

dleshchev

Can be merged after fixing some nits below and one more thing (id'ed by claude)

The PR body lists under Added: "CMake warns when CUDA Toolkit < 13.0 omits sm_120." That is not in this diff (git show HEAD --name-only → only tune_system.py), and the only related in-tree logic doesn't match the claim on two counts:

CMakeLists.txt:28-31 silently appends arch 121 when CUDA ≥ 13.0 — it does not warn when CUDA < 13.0.
Arch 121 is GB10 / DGX Spark (sm_121), per AGENTS.md. RTX PRO 6000 Blackwell SE is sm_120, which is not in the default arch list (80;90 + 121).

Net effect: the tuning script now advertises/validates RTX PRO 6000 support, but a from-source build with default CMAKE_CUDA_ARCHITECTURES won't actually compile sm_120 kernels for it. Please either (a) drop the CMake bullet from the description if it belongs to a different PR, or (b) include the intended CMake change here — and if real RTX PRO 6000 support is the goal, confirm whether 120 needs adding to the default arch list (out of scope for this file, but it's what "support" implies). At minimum the description should match what merges.

dleshchev · 2026-06-05T13:24:12Z

+    return os.path.exists("/dev/dma_heap/system")
+
+
+def _gpu_name_by_bdf():


doesn't it mirror existing get_nvidia_gpu_info_by_bdf?
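
For context on what such a helper has to do: the flowchart above shows it querying `nvidia-smi --query-gpu=pci.bus_id,name`. Assuming the output is requested with `--format=csv,noheader`, the parsing step might look like the sketch below. The function name `parse_gpu_names_by_bdf` is hypothetical, and whether this duplicates get_nvidia_gpu_info_by_bdf cannot be judged from this thread alone.

```python
def parse_gpu_names_by_bdf(csv_text: str) -> dict:
    """Map PCI bus ID -> product name from 'pci.bus_id, name' CSV lines."""
    names = {}
    for line in csv_text.splitlines():
        # Split on the first comma only, so the name field stays intact.
        bdf, sep, name = line.partition(",")
        if not sep or not bdf.strip():
            continue  # skip blank or malformed lines
        # Normalise case so lookups match regardless of hex digit casing.
        names[bdf.strip().upper()] = name.strip()
    return names
```

Keeping the parser separate from the subprocess call makes it easy to test against captured output.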

dleshchev · 2026-06-05T13:25:22Z

    """
    Checks if the CPU frequency governor is set to 'performance' for all online CPUs.
+    Output is bucketed by result so a 256-core system does not emit 256 lines when
+    every CPU is in the same state. Per-CPU detail still surfaces if results vary.


not sure if this is correct - so far it only reports overall stats
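
To make the docstring's claim concrete, here is one possible shape for bucketed output that also surfaces per-CPU detail when results differ. It is a sketch under assumptions, not the PR's implementation: `summarize_governors` is a hypothetical name, and it takes a `{cpu_index: governor}` mapping in which `None` stands for a missing scaling_governor file.

```python
from collections import defaultdict


def summarize_governors(governor_by_cpu: dict) -> list:
    """Return one summary line per distinct result.

    When every CPU is in the same state the result is a single line.
    When states differ, each line also lists the CPUs in that bucket.
    """
    buckets = defaultdict(list)
    for cpu in sorted(governor_by_cpu):
        buckets[governor_by_cpu[cpu]].append(cpu)

    total = len(governor_by_cpu)
    uniform = len(buckets) == 1
    lines = []
    for governor, cpus in buckets.items():
        if governor is None:
            state = "scaling_governor file not found"
        else:
            state = f"governor is '{governor}'"
        line = f"CPU governor: {state} on {len(cpus)}/{total} online CPUs"
        if not uniform:
            # Per-CPU detail only when results vary across CPUs.
            line += " (CPUs " + ", ".join(map(str, cpus)) + ")"
        lines.append(line + ".")
    return lines
```

On a uniform 256-core box this yields one line, and on a mixed box it yields one line per governor with the affected CPU indices.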

dleshchev · 2026-06-05T13:26:51Z

+    # The threshold is applied per-GPU via gpu_names below so heterogeneous
+    # boxes (e.g. RTX PRO 6000 + H100) only get the Blackwell rule on the
+    # Blackwell card.
+    BAR1_BLACKWELL_MIN_MIB = 32768  # 32 GiB


do we want to move it to the top along with other constants?

greptile-apps Bot reviewed Jun 4, 2026

NVIDIA deleted a comment from greptile-apps Bot Jun 4, 2026

chloecrozier force-pushed the rtx-pro-6000-system-tuning branch from a4be5be to 7a1daa1 Compare June 4, 2026 03:37

chloecrozier requested review from cliffburdick and dleshchev June 4, 2026 03:45

cliffburdick reviewed Jun 4, 2026

chloecrozier force-pushed the rtx-pro-6000-system-tuning branch from 7a1daa1 to 5c99511 Compare June 4, 2026 06:05

chloecrozier requested a review from cliffburdick June 4, 2026 06:06

dleshchev approved these changes Jun 5, 2026

chloecrozier wants to merge 1 commit into main from rtx-pro-6000-system-tuning

3 participants
