#15 - Add DGX Spark sweep tooling and RDMA loopback prereqs by RamyaGuru · Pull Request #97 · NVIDIA/daqiri

RamyaGuru · 2026-05-26T20:54:59Z

Summary

Second of a 3-PR stack replacing #72. Layers DGX Spark-specific sweep tooling and the RDMA-bench depth bump on top of #96 (reusable bench infra).

examples/run_spark_bench.sh — Spark-tuned sweep driver with per-backend payload/batch matrices, CPU pins, drop-source dispatch (DPDK imissed/ierrors/nombuf, RDMA CQ error, socket /proc/net/udp drops + nstat retrans), and one CSV row per cell into bench-results/.
scripts/spark_data_fill.sh — one-shot driver that runs the full bench matrix across DPDK / socket-UDP / socket-TCP, with hugepage pre-flight and orphan-hugepage cleanup between runs.
scripts/setup_spark_rdma_loopback.sh — idempotent host prereq pinning static ARP entries and source-based policy routing for the Spark single-NIC cross-cable RoCE loopback. The script hardcodes the Spark CX-7 netdev names, MACs, and 1.1.1.1 / 2.2.2.2 IPs from daqiri_bench_rdma_tx_rx_spark.yaml — the technique generalizes to any single-NIC RoCE loopback, the script does not (the spark_ prefix is intentional).
examples/rdma_bench.cpp — raise kMaxOutstanding 5→20 to match num_bufs in the YAML configs. Lifts small-payload pps 8–22× on Spark (4 KB: 4→39 msg/s, 8 KB: 4→88, 64 KB: 32→255) without affecting 8 MB / 1 MB cells already saturated at depth 5. The new comment documents why the constant cannot exceed num_bufs (post_req / free_tx_burst ordering in the same loop iteration would deadlock instead of throttling). A follow-up tracks the deeper architectural fix.
.gitignore — pcie_schematic.png (generated by tune_system.py).

Also includes two parse fixes Greptile flagged on the draft of run_spark_bench.sh:

/proc/net/udp column 13 (drops) is decimal (%lu in net/ipv4/udp.c), not hex. Dropped the strtonum("0x" ...) treatment, which silently inflated drop counts whenever the value was 10 or greater (for example, a decimal 25 was read as 0x25 = 37).
The socket bench emits sent_packets / sent_bytes, not RDMA's send_completions / send_bytes, so the RDMA-keyed fallback was always returning empty for socket backends and producing zero-filled CSV rows. The fallback now dispatches on $BACKEND.
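The backend dispatch described in the second fix could look roughly like the sketch below. The counter names (sent_packets / sent_bytes, send_completions / send_bytes) come from the description above; the helper names `extract_field` and `tx_counters`, the space-separated key=value log format, and the `socket-*` backend pattern are assumptions made for illustration, not the actual contents of run_spark_bench.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the last value seen for KEY in a log made of
# whitespace-separated key=value tokens (the log format is an assumption).
extract_field() {
  local key="$1" log="$2"
  awk -v k="$key" '
    {
      for (i = 1; i <= NF; i++) {
        n = index($i, "=")
        if (n > 0 && substr($i, 1, n - 1) == k) val = substr($i, n + 1)
      }
    }
    END { if (val != "") print val }
  ' "$log"
}

# Hypothetical helper: pick the counter names by backend, then emit
# "packets,bytes". Fails loudly instead of emitting a zero-filled row.
tx_counters() {
  local backend="$1" log="$2" pkt_key byte_key packets bytes
  case "$backend" in
    rdma)     pkt_key=send_completions; byte_key=send_bytes ;;
    socket-*) pkt_key=sent_packets;     byte_key=sent_bytes ;;
    *) echo "unsupported backend: $backend" >&2; return 1 ;;
  esac
  packets="$(extract_field "$pkt_key" "$log")"
  bytes="$(extract_field "$byte_key" "$log")"
  if [ -z "$packets" ] || [ -z "$bytes" ]; then
    echo "missing $pkt_key/$byte_key in $log" >&2
    return 1
  fi
  printf '%s,%s\n' "$packets" "$bytes"
}
```

Returning a non-zero status when a key is missing is a deliberate choice in this sketch: a mismatched key then surfaces as an error rather than as a plausible-looking row of zeros.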

Stack

Stacked on #96 — must land after.

PR γ (Spark perf doc) opens next once its checks pass.

Test plan

Re-build clean on top of β: cmake --build build -j — no warnings.
setup_spark_rdma_loopback.sh idempotent: two consecutive sudo ./scripts/setup_spark_rdma_loopback.sh runs both exit 0; ip rule list shows 4 rules and arp -n shows 4 PERM entries for 1.1.1.1 / 2.2.2.2.
Greptile field-name fix regression test: ./examples/run_spark_bench.sh socket-udp smoke now produces a CSV row with packets=1000, bytes=1472000 (pre-fix the same run produced all zeros). Throughput is low because socket-UDP at MTU is bench-bottlenecked on Spark — that's a separate characterization concern for the perf-doc data-fill follow-up, not a wrapper bug.
kMaxOutstanding bump + RDMA smoke: ./examples/run_spark_bench.sh rdma smoke produces gbps=83.764 on the 8 MB native cell (matches the perf-doc headline) with 0 CQ errors.
/proc/net/udp decimal-parse sanity: no order-of-magnitude inflation between the host-side awk sum and the CSV drops field.
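For the decimal-parse sanity item, a host-side sum can be sketched as follows. Column 13 as the drops field is taken from the description above; the function names `sum_udp_drops` and `misread_as_hex` are hypothetical, and the second one exists only to show how large the old inflation would have been for a given value.

```shell
#!/usr/bin/env bash
# Sum the drops column (field 13) of a /proc/net/udp-style file as DECIMAL.
# The first line is the header, so it is skipped.
sum_udp_drops() {
  local file="${1:-/proc/net/udp}"
  awk 'NR > 1 { sum += $13 } END { printf "%d\n", sum + 0 }' "$file"
}

# Illustration only: what a decimal string becomes when wrongly read as hex.
misread_as_hex() {
  printf '%d\n' "0x$1"
}
```

Comparing `sum_udp_drops` against the CSV drops field for the same cell is one way to run the check; `misread_as_hex 25` shows the kind of inflation the hex treatment would have introduced for a two-digit count.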

🤖 Generated with Claude Code

Stacks on the bench-infra PR (#α), which provides --seconds, TokenBucketPacer, and the bench-output format these scripts parse.

- examples/run_spark_bench.sh: Spark-tuned sweep driver with per-backend payload/batch matrices, CPU pins, drop-source dispatch (DPDK imissed/ierrors/nombuf, RDMA CQ errors, socket /proc/net/udp drops + nstat retrans), and one CSV row per cell into bench-results/.
- scripts/spark_data_fill.sh: one-shot driver that runs the full bench matrix across DPDK / socket-UDP / socket-TCP, with hugepage pre-flight and orphan-hugepage cleanup between runs.
- scripts/setup_spark_rdma_loopback.sh: idempotent host prereq pinning static ARP entries and source-based policy routing for the Spark single-NIC cross-cable RoCE loopback. Hardcodes the Spark CX-7 netdev names, MAC addresses, and 1.1.1.1 / 2.2.2.2 IPs from daqiri_bench_rdma_tx_rx_spark.yaml. The technique generalizes to any single-NIC RoCE loopback; the script does not. Renamed from the earlier setup_rdma_loopback.sh draft to make the platform scope obvious in the directory listing.
- examples/rdma_bench.cpp: raise kMaxOutstanding 5→20 to match num_bufs in the YAML configs. Lifts small-payload pps 8–22× on Spark (4 KB: 4→39 msg/s, 8 KB: 4→88, 64 KB: 32→255) without affecting the 8 MB / 1 MB cells already saturated at depth 5. The new comment documents why the constant cannot exceed num_bufs (post_req / free_tx_burst ordering in the same loop iteration would deadlock instead of throttling). A follow-up tracks the deeper architectural fix (interleave drain with post, bulk tx_ring dequeue, configurable depth).
- .gitignore: pcie_schematic.png (generated by tune_system.py).

Includes fixes for two parsing bugs Greptile flagged on the original draft of run_spark_bench.sh:

- /proc/net/udp drops column is decimal (%lu in net/ipv4/udp.c), not hex. Drop the strtonum("0x" ...) treatment, which silently inflated drop counts whenever the value was 10 or greater.
- Socket bench emits sent_packets / sent_bytes (not RDMA's send_completions / send_bytes), so the RDMA-keyed fallback was always returning empty for socket backends and producing zero-filled CSV rows. Dispatch the fallback on $BACKEND.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rgurunathan <rgurunathan@nvidia.com>

greptile-apps · 2026-05-27T13:52:55Z

Greptile Summary

This PR layers DGX Spark-specific bench tooling (run_spark_bench.sh, spark_data_fill.sh, setup_spark_rdma_loopback.sh) and a kMaxOutstanding 5→20 bump in rdma_bench.cpp on top of the reusable bench infra from #96. The constant change is well-documented with an explanation of the post_req/free_tx_burst deadlock bound.

examples/run_spark_bench.sh — new Spark sweep driver with per-backend payload/batch matrices, drop-source dispatch (DPDK counters, RDMA CQ errors, /proc/net/udp decimal parse, TCP nstat), CPU/GPU stat capture, and one CSV row per cell.
scripts/spark_data_fill.sh — orchestration driver with hugepage pre-flight, auto-detected ETH_DST_ADDR, and orphan-hugepage cleanup between runs.
scripts/setup_spark_rdma_loopback.sh — idempotent host prereq script for the CX-7 cross-cable RoCE loopback, pinning static ARP entries and source-based policy routing for 1.1.1.1/2.2.2.2.

Confidence Score: 3/5

Safe to merge for the core library and RDMA loopback setup; the bench CSV tooling has a data-labeling bug that would silently record encoder/decoder utilization under the gpu_sm_pct/gpu_mem_pct column names.

The kMaxOutstanding bump in rdma_bench.cpp is correct and well-commented, and the loopback setup script is clean. The column-index mismatch in run_spark_bench.sh — where -s pucvmet moves sm%/mem% from $5/$6 to $3/$4 but the awk extractors still read $5/$6 — means every CSV row produced by the script mislabels GPU encoder% and decoder% as SM% and memory%. Both happen to be near-zero for GPUDirect workloads, so the bug doesn't surface in the test plan's smoke results, but it would quietly corrupt any GPU-utilization analysis drawn from the CSVs.

examples/run_spark_bench.sh — specifically the nvidia-smi dmon awk column extractors for gpu_sm and gpu_mem

Important Files Changed

Filename	Overview
examples/run_spark_bench.sh	New Spark sweep driver; good drop-source dispatch and decimal /proc/net/udp parse, but nvidia-smi dmon column indices ($5/$6) are wrong for -s pucvmet — sm% and mem% land at $3/$4 with that flag, so the CSV records encoder/decoder utilization under the gpu_sm_pct/gpu_mem_pct headers.
examples/rdma_bench.cpp	kMaxOutstanding raised 5→20 with a clear comment explaining the deadlock bound (post_req/free_tx_burst ordering); BurstParams free contract is correctly maintained on all exit paths in post_req.
scripts/setup_spark_rdma_loopback.sh	Idempotent RDMA loopback prereq script; correctly flushes routes and deletes rules before re-adding, static ARP self-entries are intentional for the cross-cable RoCE loopback topology.
scripts/spark_data_fill.sh	Orchestration driver with solid hugepage pre-flight checks, auto-detected ETH_DST_ADDR, and orphan-hugepage cleanup between runs; correctly captures wrapper exit code via PIPESTATUS[0].
.gitignore	Adds bench-results/ and pcie_schematic.png (tune_system.py output) to .gitignore.

Reviews (1): Last reviewed commit: "#15 - Add DGX Spark sweep tooling and RD..."

greptile-apps · 2026-05-27T13:52:59Z

+  local gpu_sm gpu_mem
+  gpu_sm="$(awk '/^ *[0-9]/ { count++; sum += $5 } END { if (count) printf "%.1f", sum/count; else print 0 }' \
+               "$cell_dir/nvidia_smi_dmon.txt" 2>/dev/null || echo 0)"
+  gpu_mem="$(awk '/^ *[0-9]/ { count++; sum += $6 } END { if (count) printf "%.1f", sum/count; else print 0 }' \
+                "$cell_dir/nvidia_smi_dmon.txt" 2>/dev/null || echo 0)"


The column indices $5 and $6 match the default nvidia-smi dmon output (where gtemp/mtemp precede sm/mem, giving: idx=$1 pwr=$2 gtemp=$3 mtemp=$4 sm=$5 mem=$6). But the script explicitly passes -s pucvmet, which reorders the groups: power → utilization → clocks → violations → bandwidth → errors → temperature. With that ordering, temperature moves to the end and the utilization metrics shift left: idx=$1 pwr=$2 sm=$3 mem=$4 enc=$5 dec=$6. As written, the CSV columns gpu_sm_pct and gpu_mem_pct will silently record encoder% and decoder% instead of SM and memory utilization. Both happen to be near-zero for GPUDirect workloads, making this easy to miss during testing.
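Because the position of sm/mem depends on which metric groups nvidia-smi dmon prints, one way to sidestep hard-coded indices altogether is to resolve the column from the header line that dmon writes at the top of its output. The sketch below assumes the captured file starts with a "#"-prefixed header naming each column (for example "# gpu pwr ... sm mem ..."); the function name `dmon_avg` is hypothetical and is not part of run_spark_bench.sh.

```shell
#!/usr/bin/env bash
# Average one named column of captured nvidia-smi dmon output, locating the
# column by its header name instead of a fixed index.
# Assumption: the first "#" line lists column names in data-column order.
dmon_avg() {
  local name="$1" file="$2"
  awk -v want="$name" '
    # Header lines start with "#". Strip the marker, split the names, and
    # remember the position of the wanted one (first header that has it wins).
    /^#/ && col == 0 {
      line = $0
      sub(/^#[ \t]*/, "", line)
      n = split(line, names, /[ \t]+/)
      for (i = 1; i <= n; i++) if (names[i] == want) col = i
      next
    }
    # Data rows start with the GPU index. Non-numeric cells such as "-"
    # contribute 0 to the sum.
    /^ *[0-9]/ && col > 0 { sum += $col; count++ }
    END { if (count) printf "%.1f\n", sum / count; else print "NA" }
  ' "$file"
}
```

With this shape, `gpu_sm="$(dmon_avg sm "$cell_dir/nvidia_smi_dmon.txt")"` would keep working if the -s selection changes. Printing NA rather than 0 when the column is absent is a design choice of the sketch, so a missing metric is not mistaken for an idle GPU.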

Suggested change

-  local gpu_sm gpu_mem
-  gpu_sm="$(awk '/^ *[0-9]/ { count++; sum += $5 } END { if (count) printf "%.1f", sum/count; else print 0 }' \
-               "$cell_dir/nvidia_smi_dmon.txt" 2>/dev/null || echo 0)"
-  gpu_mem="$(awk '/^ *[0-9]/ { count++; sum += $6 } END { if (count) printf "%.1f", sum/count; else print 0 }' \
-                "$cell_dir/nvidia_smi_dmon.txt" 2>/dev/null || echo 0)"
+  local gpu_sm gpu_mem
+  # With -s pucvmet the column order is: idx($1) pwr($2) sm($3) mem($4) enc($5)...
+  # (temperature is moved to the end; $5/$6 in the default layout only hold sm/mem
+  # when gtemp/mtemp appear between power and utilization, which -s overrides).
+  gpu_sm="$(awk '/^ *[0-9]/ { count++; sum += $3 } END { if (count) printf "%.1f", sum/count; else print 0 }' \
+               "$cell_dir/nvidia_smi_dmon.txt" 2>/dev/null || echo 0)"
+  gpu_mem="$(awk '/^ *[0-9]/ { count++; sum += $4 } END { if (count) printf "%.1f", sum/count; else print 0 }' \
+                "$cell_dir/nvidia_smi_dmon.txt" 2>/dev/null || echo 0)"

* #15 - Add RDMA benchmark profiling knobs

  Instrument the RDMA benchmark and manager hot paths so SEND/RECV posting, CQ polling, ring traffic, and app refill behavior can be measured while debugging the perftest gap. Add RDMA tuning hooks for relaxed ordering, SEND signaling cadence, queue depth, SPSC rings, and RDMA buffer alignment, plus a server connection readiness guard.

  Signed-off-by: Cliff Burdick <cburdick@nvidia.com>

* #15 - Remove RDMA benchmark debug instrumentation

  Drop benchmark-side timing/profile output and remove RDMA manager profiling timers while keeping the non-debug RDMA transport tuning changes.

  Signed-off-by: Cliff Burdick <cburdick@nvidia.com>

* #15 - Avoid RDMA benchmark buffer starvation deadlock

  Treat TX buffer exhaustion as backpressure in the RDMA benchmark so the worker returns to completion draining before retrying. Also clean up RDMA TX packet burst allocations on partial failure.

  Signed-off-by: Cliff Burdick <cburdick@nvidia.com>

* #15 - Address Greptile RDMA review comments

  Fix RDMA completion enqueue failure cleanup to avoid returning metadata twice, and document the single-producer/single-consumer invariant for RDMA connection rings.

  Signed-off-by: Cliff Burdick <cburdick@nvidia.com>

* #15 - Retire RDMA send completions on CQ errors

  Sweep outstanding SEND work requests through an errored CQE so selective signaling cannot orphan unsignaled bursts on TX errors.

  Signed-off-by: Cliff Burdick <cburdick@nvidia.com>

---------

Signed-off-by: Cliff Burdick <cburdick@nvidia.com>

dleshchev · 2026-06-03T17:51:07Z

Clarifying stack repair state as of 2026-06-03: this PR shows MERGED because it was merged into the stacked base branch 15-bench-infra, not because merge commit 73cabc97a29fa3349b6611d8fb3eacf3df5f8373 is the mainline landing path. The main-bound content was rebuilt on top of current main in follow-up PR #116, which has now landed on main.

Cleanup note: once PR #118 lands, the stale stack branches under 15-bench-* / cburdick/* can be pruned. Keep backup/15-bench-infra-stacked-20260603 until #118 is safely landed.

This was referenced May 26, 2026

#15 - Add reusable C++ bench infrastructure for per-platform performance reports #96

Merged

#15 - Add DGX Spark v1 performance report #98

Draft

#15 - Add C++ bench infrastructure for DGX Spark performance report #72

Closed

RamyaGuru marked this pull request as ready for review May 27, 2026 13:46

greptile-apps Bot reviewed May 27, 2026

View reviewed changes

dleshchev merged commit 73cabc9 into 15-bench-infra Jun 3, 2026
1 check passed

dleshchev merged 2 commits into 15-bench-infra from 15-bench-spark-tooling
