Updates to RDMA bench profiling by cliffburdick · Pull Request #103 · NVIDIA/daqiri

cliffburdick · 2026-05-27T22:54:42Z

No description provided.

Instrument the RDMA benchmark and manager hot paths so SEND/RECV posting, CQ polling, ring traffic, and app refill behavior can be measured while debugging the perftest gap. Add RDMA tuning hooks for relaxed ordering, SEND signaling cadence, queue depth, SPSC rings, and RDMA buffer alignment, plus a server connection readiness guard. Signed-off-by: Cliff Burdick <cburdick@nvidia.com>

Drop benchmark-side timing/profile output and remove RDMA manager profiling timers while keeping the non-debug RDMA transport tuning changes. Signed-off-by: Cliff Burdick <cburdick@nvidia.com>

Treat TX buffer exhaustion as backpressure in the RDMA benchmark so the worker returns to completion draining before retrying. Also clean up RDMA TX packet burst allocations on partial failure. Signed-off-by: Cliff Burdick <cburdick@nvidia.com>
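
A minimal sketch of the backpressure idea described above, assuming a toy buffer pool and an in-memory stand-in for in-flight sends. `ToyPool`, `WorkerStats`, and `run_worker` are hypothetical names for illustration and are not part of the DAQIRI API. The point it shows is that a failed buffer acquisition sends the worker back to the drain step rather than into a spin-wait.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical stand-in for a fixed-size TX buffer pool; not the DAQIRI API.
struct ToyPool {
  std::size_t free_bufs;
  bool try_acquire() {
    if (free_bufs == 0) return false;  // exhaustion is reported, not waited on
    --free_bufs;
    return true;
  }
  void release() { ++free_bufs; }
};

struct WorkerStats {
  std::size_t posted = 0;
  std::size_t completed = 0;
  std::size_t backpressure_events = 0;
};

// Posts `total` sends. `in_flight` models sends whose completions have not
// been drained yet. Assumes the pool holds at least one buffer.
inline WorkerStats run_worker(ToyPool& pool, std::size_t total,
                              std::size_t max_outstanding) {
  WorkerStats stats;
  std::deque<int> in_flight;
  while (stats.completed < total) {
    // 1. Drain a completion first so its buffer returns to the pool.
    if (!in_flight.empty()) {
      in_flight.pop_front();
      pool.release();
      ++stats.completed;
    }
    // 2. Post until the depth limit, the workload, or the pool runs out.
    while (stats.posted < total && in_flight.size() < max_outstanding) {
      if (!pool.try_acquire()) {
        ++stats.backpressure_events;  // go back to draining instead of spinning
        break;
      }
      in_flight.push_back(0);
      ++stats.posted;
    }
  }
  return stats;
}
```

With a pool smaller than `max_outstanding`, the worker still finishes every send; the pool size simply becomes the effective depth limit.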

greptile-apps · 2026-05-27T23:02:33Z

Greptile Summary

Adds benchmark profiling infrastructure (token-bucket pacer, rate-limiting flag, structured stdout output, DGX Spark sweep drivers) and applies several RDMA manager tuning changes.

RDMA manager: switches to std::map for ordered WR tracking with upper_bound predecessor sweeps; adds complete_send_wrs to retire unsignaled WRs on any CQE (success or error); enlarges pools to 2047 entries; adds relaxed-ordering MR registration with fallback; raises MR alignment to max(existing, GPU_PAGE_SIZE); fixes a pool leak in get_tx_packet_burst partial-failure paths.
RDMA bench: raises kMaxOutstanding from 5 to 64 with backpressure replacing the spin-wait; drains completions before posting; uses free_tx_metadata (not free_tx_burst) on pre-packet-allocation failure.
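
As an illustration of the "max(existing, GPU_PAGE_SIZE)" alignment rule in the summary above, here is a small sketch. The 64 KiB value is an assumption made for this example only; the real `GPU_PAGE_SIZE` constant is defined by the project and may differ. `effective_alignment` and `align_up` are hypothetical helper names.

```cpp
#include <algorithm>
#include <cstddef>

// Assumed value for illustration only; the project defines the real constant.
constexpr std::size_t kAssumedGpuPageSize = 64 * 1024;

// Never lowers an alignment that is already stricter than the GPU page size.
constexpr std::size_t effective_alignment(std::size_t existing) {
  return std::max(existing, kAssumedGpuPageSize);
}

// Rounds `size` up to a multiple of `alignment` (must be a power of two).
constexpr std::size_t align_up(std::size_t size, std::size_t alignment) {
  return (size + alignment - 1) & ~(alignment - 1);
}
```

Taking the maximum rather than overwriting the value keeps configurations that already request a larger alignment, such as 2 MiB hugepages, unchanged.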

Confidence Score: 5/5

Safe to merge; the three previously-flagged correctness issues (double-free, missing error-path sweep, undocumented SPSC invariant) are all addressed in this commit stack.

The RDMA thread changes are substantial but well-contained: the selective-signaling predecessor sweep correctly uses std::map::upper_bound, error CQEs now call complete_send_wrs before continuing, and the SPSC ring assumption is documented. The benchmark layer changes are additive and low-risk.
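
The predecessor sweep mentioned above can be sketched as follows. This is a simplified model, not the code in daqiri_rdma_mgr.cpp: it assumes work-request IDs increase monotonically without wrapping, and it uses a plain `int` (`BurstId`, a hypothetical placeholder) where the real code tracks `BurstParams*`.

```cpp
#include <cstdint>
#include <map>
#include <vector>

using BurstId = int;  // hypothetical placeholder for a BurstParams*

// Retires every outstanding WR whose id is <= signaled_wr_id and returns the
// associated bursts in posting order. Assumes ids increase without wrapping.
inline std::vector<BurstId> sweep_completed(
    std::map<std::uint64_t, BurstId>& outstanding,
    std::uint64_t signaled_wr_id) {
  std::vector<BurstId> retired;
  // upper_bound yields the first key strictly greater than the signaled id,
  // so [begin, end) covers the signaled WR and all unsignaled predecessors.
  const auto end = outstanding.upper_bound(signaled_wr_id);
  for (auto it = outstanding.begin(); it != end; ++it) {
    retired.push_back(it->second);
  }
  outstanding.erase(outstanding.begin(), end);
  return retired;
}
```

Because the sweep does not depend on the completion status, the same call can retire predecessors for both successful and errored CQEs, which is the behavior the review describes.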

The complete_send_wrs lambda in src/managers/rdma/daqiri_rdma_mgr.cpp is the highest-density changed path and worth a second human read, particularly the force-signal threshold and the two-case CRITICAL logging behavior.
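
One plausible shape for the force-signal decision, written as a sketch under assumptions: the PR text only says a send is signaled "every N or when map>=threshold", so the field names, the counter reset, and the exact comparison operators below are guesses rather than the manager's actual logic.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical policy object; names and semantics are assumptions.
struct SignalPolicy {
  std::uint32_t cadence;        // request a CQE every `cadence` sends
  std::size_t force_threshold;  // or once this many WRs are outstanding
  std::uint32_t since_last = 0;

  // Call once per posted send; returns true if this send should be signaled.
  bool should_signal(std::size_t outstanding) {
    ++since_last;
    if (since_last >= cadence || outstanding >= force_threshold) {
      since_last = 0;
      return true;
    }
    return false;
  }
};
```

The threshold branch matters when the cadence is large relative to the queue depth: without it, the tracking map could fill with unsignaled work requests that no CQE will ever retire.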

Important Files Changed

Filename	Overview
src/managers/rdma/daqiri_rdma_mgr.cpp	Major RDMA thread overhaul: selective signaling, std::map for ordered WR tracking, complete_send_wrs lambda, SPSC rings, pool enlargements, relaxed-ordering MR fallback, and fixed get_tx_packet_burst error paths.
examples/rdma_bench.cpp	kMaxOutstanding raised 5→64 with backpressure instead of spin-wait; completion drain loop; TokenBucketPacer; free_tx_metadata on pre-packet-allocation failure. Free contract looks correct.
examples/socket_bench.cpp	Adds TokenBucketPacer, byte counters, structured output, and fixes a latent bug where iterations==0 would immediately exit the send/recv loop instead of running time-bounded.
examples/raw_bench_common.cpp	TokenBucketPacer implementation with chunked 10 ms sleep for stop-flag responsiveness and single-write stringstream stdout to prevent RX/TX line interleaving.
examples/run_spark_bench.sh	New DGX Spark sweep driver with per-backend payload/batch matrices, drop-source dispatch, CPU/GPU counter capture, and one CSV row per cell. RDMA and socket stdout parsing is correctly keyed by backend type.
examples/CMakeLists.txt	raw_bench_common.cpp and CUDA::cudart linked into rdma_bench and socket_bench; BUILD_RPATH added to socket_bench to fix runtime libdaqiri.so discovery.
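
To illustrate the token-bucket pacing and chunked sleep described for raw_bench_common.cpp, here is a rough sketch. `ToyTokenBucket` and `sleep_interruptible` are hypothetical names, and the real `TokenBucketPacer` interface may look quite different; only the 10 ms chunk size is taken from the table above.

```cpp
#include <algorithm>
#include <atomic>
#include <chrono>
#include <thread>

// Sleeps for up to `total`, waking every `chunk` to check the stop flag.
// Returns true if the full duration elapsed, false if interrupted.
inline bool sleep_interruptible(
    std::chrono::nanoseconds total, const std::atomic<bool>& stop,
    std::chrono::nanoseconds chunk = std::chrono::milliseconds(10)) {
  while (total > std::chrono::nanoseconds::zero()) {
    if (stop.load(std::memory_order_relaxed)) return false;
    const auto step = std::min(total, chunk);
    std::this_thread::sleep_for(step);
    total -= step;
  }
  return true;
}

// Hypothetical minimal token bucket measured in bytes.
class ToyTokenBucket {
 public:
  ToyTokenBucket(double bytes_per_sec, double burst_bytes)
      : rate_(bytes_per_sec), capacity_(burst_bytes), tokens_(burst_bytes) {}

  // Credits tokens for elapsed time, capped at the bucket capacity.
  void refill(std::chrono::nanoseconds elapsed) {
    const double secs = std::chrono::duration<double>(elapsed).count();
    tokens_ = std::min(capacity_, tokens_ + rate_ * secs);
  }

  // Returns zero and consumes tokens if `bytes` can be sent now; otherwise
  // returns how long the caller should wait, leaving the bucket unchanged.
  std::chrono::nanoseconds try_consume(double bytes) {
    if (tokens_ >= bytes) {
      tokens_ -= bytes;
      return std::chrono::nanoseconds::zero();
    }
    const std::chrono::duration<double> wait((bytes - tokens_) / rate_);
    return std::chrono::duration_cast<std::chrono::nanoseconds>(wait);
  }

 private:
  double rate_;
  double capacity_;
  double tokens_;
};
```

A sender would pass the wait returned by `try_consume` to `sleep_interruptible`, call `refill` with the time actually slept, and retry. Chunking the sleep bounds how long a stop request can go unnoticed to roughly one chunk.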

Sequence Diagram

sequenceDiagram
    participant App as Application Thread
    participant TxRing as tx_ring (SPSC)
    participant RDMA as rdma_thread
    participant RxRing as rx_ring (SPSC)
    participant CQ as TX Completion Queue

    App->>App: create_burst_params() + rdma_set_header()
    App->>App: get_tx_packet_burst() [backpressure if pool empty]
    App->>App: set_packet_lengths() + send_tx_burst()
    App->>TxRing: "enqueue(BurstParams*)"
    RDMA->>TxRing: "dequeue(BurstParams*)"
    RDMA->>RDMA: "ibv_post_send() [signal every N or when map>=threshold]"
    RDMA->>RDMA: "outstanding_send_wr_ids[wr_id+p] = burst"
    CQ-->>RDMA: IBV_WC_SEND (signaled CQE)
    RDMA->>RDMA: complete_send_wrs(wr_id, status) upper_bound sweep
    RDMA->>RxRing: "enqueue each swept BurstParams*"
    App->>RxRing: get_rx_burst() completion
    App->>App: free_tx_burst(completion)
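
The tx_ring and rx_ring in the diagram are described as SPSC (single-producer/single-consumer). The sketch below shows a generic bounded SPSC ring to make that invariant concrete. It is a textbook construction and not the ring implementation DAQIRI uses; `SpscRing` is a hypothetical name.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Generic bounded SPSC ring. Correct only if exactly one thread calls
// enqueue() and exactly one (other) thread calls dequeue().
template <typename T, std::size_t Capacity>
class SpscRing {
  static_assert(Capacity >= 2 && (Capacity & (Capacity - 1)) == 0,
                "Capacity must be a power of two");

 public:
  // Producer thread only.
  bool enqueue(T value) {
    const std::size_t head = head_.load(std::memory_order_relaxed);
    const std::size_t tail = tail_.load(std::memory_order_acquire);
    if (head - tail == Capacity) return false;  // full
    slots_[head & (Capacity - 1)] = value;
    head_.store(head + 1, std::memory_order_release);  // publish the slot
    return true;
  }

  // Consumer thread only.
  bool dequeue(T& out) {
    const std::size_t tail = tail_.load(std::memory_order_relaxed);
    const std::size_t head = head_.load(std::memory_order_acquire);
    if (head == tail) return false;  // empty
    out = slots_[tail & (Capacity - 1)];
    tail_.store(tail + 1, std::memory_order_release);  // free the slot
    return true;
  }

 private:
  std::array<T, Capacity> slots_{};
  std::atomic<std::size_t> head_{0};  // written by the producer only
  std::atomic<std::size_t> tail_{0};  // written by the consumer only
};
```

Each index has a single writer, which is why no locks or compare-and-swap loops are needed. A second producer or consumer on the same ring would break that assumption, hence the value of documenting the invariant.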

Reviews (4): Last reviewed commit: "#15 - Retire RDMA send completions on CQ..."

Fix RDMA completion enqueue failure cleanup to avoid returning metadata twice, and document the single-producer/single-consumer invariant for RDMA connection rings. Signed-off-by: Cliff Burdick <cburdick@nvidia.com>

cliffburdick · 2026-05-28T16:43:43Z

@greptile review

Sweep outstanding SEND work requests through an errored CQE so selective signaling cannot orphan unsignaled bursts on TX errors. Signed-off-by: Cliff Burdick <cburdick@nvidia.com>

* #15 - Add DGX Spark sweep tooling and RDMA loopback prereqs

  Stacks on the bench-infra PR (#α) which provides --seconds, TokenBucketPacer, and the bench-output format these scripts parse.

  - examples/run_spark_bench.sh: Spark-tuned sweep driver with per-backend payload/batch matrices, CPU pins, drop-source dispatch (DPDK imissed/ierrors/nombuf, RDMA CQ errors, socket /proc/net/udp drops + nstat retrans), and one CSV row per cell into bench-results/.
  - scripts/spark_data_fill.sh: one-shot driver that runs the full bench matrix across DPDK / socket-UDP / socket-TCP, with hugepage pre-flight and orphan-hugepage cleanup between runs.
  - scripts/setup_spark_rdma_loopback.sh: idempotent host prereq pinning static ARP entries and source-based policy routing for the Spark single-NIC cross-cable RoCE loopback. Hardcodes the Spark CX-7 netdev names, MAC addresses, and 1.1.1.1 / 2.2.2.2 IPs from daqiri_bench_rdma_tx_rx_spark.yaml; the technique generalizes to any single-NIC RoCE loopback, the script does not. Renamed from the earlier setup_rdma_loopback.sh draft to make the platform scope obvious in the directory listing.
  - examples/rdma_bench.cpp: raise kMaxOutstanding 5→20 to match num_bufs in the YAML configs. Lifts small-payload pps 8–22× on Spark (4 KB: 4→39 msg/s, 8 KB: 4→88, 64 KB: 32→255) without affecting the 8 MB / 1 MB cells already saturated at depth 5. The new comment documents why the constant cannot exceed num_bufs (post_req / free_tx_burst ordering in the same loop iteration would deadlock instead of throttling). A follow-up tracks the deeper architectural fix (interleave drain with post, bulk tx_ring dequeue, configurable depth).
  - .gitignore: pcie_schematic.png (generated by tune_system.py).

  Includes fixes for two parsing bugs Greptile flagged on the original draft of run_spark_bench.sh:

  - /proc/net/udp drops column is decimal (%lu in net/ipv4/udp.c), not hex. Drop the strtonum("0x" ...) treatment that was silently multiplying drop counts whenever the column value contained any digit > 9.
  - Socket bench emits sent_packets / sent_bytes (not RDMA's send_completions / send_bytes), so the RDMA-keyed fallback was always returning empty for socket backends and producing zero-filled CSV rows. Dispatch the fallback on $BACKEND.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  Signed-off-by: rgurunathan <rgurunathan@nvidia.com>

* Updates to RDMA bench profiling (#103)

  * #15 - Add RDMA benchmark profiling knobs

    Instrument the RDMA benchmark and manager hot paths so SEND/RECV posting, CQ polling, ring traffic, and app refill behavior can be measured while debugging the perftest gap. Add RDMA tuning hooks for relaxed ordering, SEND signaling cadence, queue depth, SPSC rings, and RDMA buffer alignment, plus a server connection readiness guard.

    Signed-off-by: Cliff Burdick <cburdick@nvidia.com>

  * #15 - Remove RDMA benchmark debug instrumentation

    Drop benchmark-side timing/profile output and remove RDMA manager profiling timers while keeping the non-debug RDMA transport tuning changes.

    Signed-off-by: Cliff Burdick <cburdick@nvidia.com>

  * #15 - Avoid RDMA benchmark buffer starvation deadlock

    Treat TX buffer exhaustion as backpressure in the RDMA benchmark so the worker returns to completion draining before retrying. Also clean up RDMA TX packet burst allocations on partial failure.

    Signed-off-by: Cliff Burdick <cburdick@nvidia.com>

  * #15 - Address Greptile RDMA review comments

    Fix RDMA completion enqueue failure cleanup to avoid returning metadata twice, and document the single-producer/single-consumer invariant for RDMA connection rings.

    Signed-off-by: Cliff Burdick <cburdick@nvidia.com>

  * #15 - Retire RDMA send completions on CQ errors

    Sweep outstanding SEND work requests through an errored CQE so selective signaling cannot orphan unsignaled bursts on TX errors.

    Signed-off-by: Cliff Burdick <cburdick@nvidia.com>

  ---------

  Signed-off-by: Cliff Burdick <cburdick@nvidia.com>

---------

Signed-off-by: rgurunathan <rgurunathan@nvidia.com>
Signed-off-by: Cliff Burdick <cburdick@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cliff Burdick <30670611+cliffburdick@users.noreply.github.com>
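
The /proc/net/udp parsing bug described above comes down to reading a decimal field as hexadecimal. The original fix is in a shell/awk script; the sketch below only restates the arithmetic in C++ so the size of the error is visible. Both function names are hypothetical.

```cpp
#include <cstdint>
#include <string>

// Correct reading: the kernel prints the drops column with %lu (decimal).
inline std::uint64_t parse_drops_decimal(const std::string& field) {
  return std::stoull(field, nullptr, 10);
}

// The buggy reading: the same text interpreted as hexadecimal.
inline std::uint64_t parse_drops_as_hex(const std::string& field) {
  return std::stoull(field, nullptr, 16);
}
```

Single-digit values parse identically under both bases, which is one reason a bug like this can go unnoticed on lightly loaded runs; a field of "25" is 25 drops in decimal but would be reported as 37 if read as hex.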

Replay the RDMA benchmark profiling and tuning changes from PR #103 on top of main after the stacked branch was merged out of order. Keep the newer main fixes while resolving conflicts: use create_tx_burst_params and null guards in the benchmark, keep IBV_ACCESS_RELAXED_ORDERING guarded for older verbs headers, preserve RDMA metrics accounting, and avoid duplicate TX burst cleanup after send_tx_burst failure. Original PR #103 merge commit: 94a7d2e. Signed-off-by: Cliff Burdick <cburdick@nvidia.com> Signed-off-by: Denis Leshchev <dleshchev@nvidia.com>

dleshchev · 2026-06-03T17:51:06Z

Clarifying stack repair state as of 2026-06-03: this PR shows MERGED because it was merged into the stacked branch 15-bench-spark-tooling, not because merge commit 94a7d2efe294c8e8ccfa91ee5afd1d59841f0719 was the final mainline landing path. The main-bound RDMA profiling content has been rebuilt on top of current main in follow-up PR #118.

Cleanup note: once PR #118 lands, the stale stack branches under 15-bench-* / cburdick/* can be pruned. Keep backup/15-bench-infra-stacked-20260603 until #118 is safely landed.

Replay the RDMA benchmark profiling and tuning changes from PR #103 on top of main after the stacked branch was merged out of order. Keep the newer main fixes while resolving conflicts: use create_tx_burst_params and null guards in the benchmark, keep IBV_ACCESS_RELAXED_ORDERING guarded for older verbs headers, preserve RDMA metrics accounting, and avoid duplicate TX burst cleanup after send_tx_burst failure. Original PR #103 merge commit: 94a7d2e. Signed-off-by: Cliff Burdick <cburdick@nvidia.com> Signed-off-by: Denis Leshchev <dleshchev@nvidia.com> Co-authored-by: Cliff Burdick <30670611+cliffburdick@users.noreply.github.com>

cliffburdick added 3 commits May 27, 2026 21:10

#15 - Remove RDMA benchmark debug instrumentation

7766d27

Drop benchmark-side timing/profile output and remove RDMA manager profiling timers while keeping the non-debug RDMA transport tuning changes. Signed-off-by: Cliff Burdick <cburdick@nvidia.com>

greptile-apps Bot reviewed May 27, 2026


Comment thread src/managers/rdma/daqiri_rdma_mgr.cpp Outdated

Comment thread src/managers/rdma/daqiri_rdma_mgr.cpp

#15 - Address Greptile RDMA review comments

97c5a7c

Fix RDMA completion enqueue failure cleanup to avoid returning metadata twice, and document the single-producer/single-consumer invariant for RDMA connection rings. Signed-off-by: Cliff Burdick <cburdick@nvidia.com>

#15 - Retire RDMA send completions on CQ errors

6abd9b6

Sweep outstanding SEND work requests through an errored CQE so selective signaling cannot orphan unsignaled bursts on TX errors. Signed-off-by: Cliff Burdick <cburdick@nvidia.com>

RamyaGuru changed the base branch from main to 15-bench-spark-tooling June 3, 2026 14:57

dleshchev merged commit 94a7d2e into 15-bench-spark-tooling Jun 3, 2026
3 checks passed

dleshchev mentioned this pull request Jun 3, 2026

#15 - Reapply RDMA benchmark profiling updates #118

Merged

dleshchev merged 5 commits into 15-bench-spark-tooling from cburdick/rdma-bench-profiling-fixes
