#15 - Reapply RDMA benchmark profiling updates by dleshchev · Pull Request #118 · NVIDIA/daqiri

dleshchev · 2026-06-03T17:42:46Z

Rebuilds the already-merged PR #103 changes as a clean follow-up on main after PR #96 and PR #116 landed.

Summary:

Replays the RDMA benchmark backpressure and completion-drain behavior.
Carries RDMA manager tuning: larger CQ/WR pools, SPSC rings, relaxed-ordering fallback, SEND signaling cadence, and server connection readiness checks.
Documents the DAQIRI_RDMA_SEND_SIGNAL_EVERY runtime knob in the benchmark tutorial.
Preserves newer main fixes from the conflict resolution: create_tx_burst_params/null guards, guarded IBV_ACCESS_RELAXED_ORDERING, metrics accounting, and send_tx_burst ownership cleanup.

Verification:

git diff --check origin/main..HEAD
python3 scripts/check_doc_refs.py
env IMAGE_TAG=daqiri:pr103-rdma-bc7fbd4 BASE_TARGET=dpdk DAQIRI_MGR="dpdk socket rdma" DAQIRI_BUILD_PYTHON=OFF BUILD_SHARED_LIBS=ON DAQIRI_ENABLE_OTEL_METRICS=OFF scripts/build-container.sh

Note: the container build was run before the docs-only amend; the compiled C++ content is unchanged.

greptile-apps · 2026-06-03T17:49:05Z

Greptile Summary

This PR replays the RDMA benchmark and manager tuning from the previously-merged PR #103 on top of the current main after the stacked branch conflict. It carries the backpressure rework in rdma_bench.cpp (kMaxOutstanding 20→64, completion-drain loop, post_req returning bool) alongside the manager-side changes (SPSC rings, larger CQ/WR/pool sizes, relaxed-ordering fallback, SEND-signaling cadence env var, and server-connection readiness guard).

examples/rdma_bench.cpp: replaces the busy-retry allocation loop with a single non-blocking call; drain loop now exhausts all pending completions before each post cycle; the outer while(outstanding < kMax && post_req(...)) pattern cleanly interleaves backpressure with work posting.
src/managers/rdma/daqiri_rdma_mgr.cpp: the new complete_send_wrs lambda uses std::map::upper_bound to retire a cumulative range of WR IDs on each signaled completion — required for correctness when send_signal_every > 1; ring flags downgraded to SPSC, which is safe given the stated single-producer/single-consumer constraint documented in the new comment.
src/managers/rdma/daqiri_rdma_mgr.h: MAX_CQ and MAX_OUSTANDING_WR raised to 1024, consistent with the larger pool allocations.

Confidence Score: 5/5

Safe to merge; the core logic changes are sound and DCO sign-off is present on the single commit.

The cumulative-completion lambda, SPSC ring downgrade, and backpressure rework are all internally consistent. The relaxed-ordering fallback is correctly guarded with #ifdef. The server-readiness check in rdma_get_server_conn_id closes a real startup race. One non-blocking observation remains: the size() >= 15 safety valve silently limits the tunability of DAQIRI_RDMA_SEND_SIGNAL_EVERY for values greater than ~16, but this does not affect correctness.

No files require special attention; the suggestion on the signaling threshold in daqiri_rdma_mgr.cpp is advisory.

Important Files Changed

Filename	Overview
examples/rdma_bench.cpp	Backpressure rework: kMaxOutstanding raised 20→64, post_req now returns bool, completions drained in a tight loop before posting work; one pre-existing BurstParams leak on send_tx_burst failure (already flagged in a prior thread).
src/managers/rdma/daqiri_rdma_mgr.cpp	Adds SEND-signaling cadence (DAQIRI_RDMA_SEND_SIGNAL_EVERY), cumulative-completion lambda (complete_send_wrs via upper_bound), SPSC ring flags, larger pools, relaxed-ordering fallback, and a server-readiness guard in rdma_get_server_conn_id; safety-valve threshold (size()>=15) silently caps effective send_signal_every at ~16 for any larger env-var value.
src/managers/rdma/daqiri_rdma_mgr.h	MAX_CQ and MAX_OUSTANDING_WR bumped to 1024 to match the larger pool/CQ sizes in the .cpp; straightforward and consistent.
docs/tutorials/benchmarking_examples.md	New section documents DAQIRI_RDMA_SEND_SIGNAL_EVERY; prose is accurate for values ≤ ~16 but does not mention the internal safety-valve cap that limits effective cadence for larger values.

Sequence Diagram

sequenceDiagram
    participant App as rdma_bench (App thread)
    participant API as daqiri API
    participant Ring as SPSC tx_ring / rx_ring
    participant Thread as rdma_thread

    loop each worker iteration
        App->>API: get_rx_burst() — drain completions
        API->>Ring: rte_ring_dequeue(rx_ring)
        Ring-->>API: "completion BurstParams*"
        API-->>App: completion (or NO_PACKETS)
        App->>App: outstanding_send-- / outstanding_recv--
        App->>API: free_tx_burst(completion)

        App->>API: create_tx_burst_params()
        API-->>App: "msg*"
        App->>API: get_tx_packet_burst(msg) — backpressure on pool
        alt pool exhausted
            API-->>App: NO_FREE_BURST_BUFFERS → return false
            App->>App: sleep 100µs, retry next iteration
        else success
            App->>API: send_tx_burst(msg)
            API->>Ring: rte_ring_enqueue(tx_ring, msg)
        end
    end

    loop rdma_thread poll loop
        Thread->>Thread: ibv_poll_cq(rx_cq) — RECV completions
        Thread->>Thread: ibv_poll_cq(tx_cq) — SEND completions
        Note over Thread: complete_send_wrs(wr_id) upper_bound → retire cumulative range
        Thread->>Ring: rte_ring_enqueue(rx_ring, completion)
        Thread->>Ring: "rte_ring_dequeue(tx_ring, &burst)"
        Thread->>Thread: "ibv_post_send / ibv_post_recv (signal every N or size>=15)"
    end

_{Reviews (2): Last reviewed commit: "#15 - Reapply RDMA benchmark profiling u..." | Re-trigger Greptile}

greptile-apps · 2026-06-03T17:49:10Z

+        return false;
      }
-      if (daqiri::send_tx_burst(msg) != daqiri::Status::SUCCESS) { return; }
+      if (daqiri::send_tx_burst(msg) != daqiri::Status::SUCCESS) { return false; }


After get_tx_packet_burst succeeds, msg owns allocated packet buffers. If send_tx_burst fails, returning false without freeing msg silently drains the TX buffer pool — subsequent calls to get_tx_packet_burst will eventually return NO_FREE_BURST_BUFFERS, making the benchmark appear to stall without a clear error.

Suggested change

if (daqiri::send_tx_burst(msg) != daqiri::Status::SUCCESS) { return false; }

if (daqiri::send_tx_burst(msg) != daqiri::Status::SUCCESS) {

daqiri::free_tx_burst(msg);

return false;

}

Rule Used: BurstParams is a zero-copy batch — the caller MUST... (source)

greptile-apps · 2026-06-03T17:49:11Z

@@ -1356,7 +1438,7 @@ int RdmaMgr::setup_pools_and_rings() {
    ring_name = "TX_RING_" + std::to_string(i);
    DAQIRI_LOG_DEBUG("Setting up TX ring {}", ring_name);
    ring = rte_ring_create(
-        ring_name.c_str(), 2048, rte_socket_id(), RING_F_MC_RTS_DEQ | RING_F_MP_RTS_ENQ);
+        ring_name.c_str(), 2048, rte_socket_id(), RING_F_SP_ENQ | RING_F_SC_DEQ);
    if (ring == nullptr) {
      DAQIRI_LOG_CRITICAL("Failed to allocate TX ring {}!", ring_name);
      return -1;
@@ -1367,7 +1449,7 @@ int RdmaMgr::setup_pools_and_rings() {
  // Packet length buffers
  DAQIRI_LOG_DEBUG("Setting up RX meta pool");


Doc sync required for src/managers/rdma/ and examples/rdma_bench.cpp changes

This PR modifies src/managers/rdma/daqiri_rdma_mgr.cpp, daqiri_rdma_mgr.h, and examples/rdma_bench.cpp but does not update the matching docs. Per the doc-sync mapping, the following should be updated in the same PR:

docs/getting-started.md, docs/configuration.md, docs/tutorials/configuration-walkthrough.md, README.md (Backends section), AGENTS.md — for the RDMA manager changes (pool sizes, SPSC rings, CQ tuning, relaxed-ordering fallback, send-signaling cadence, server-readiness guard).

docs/tutorials/benchmarking_examples.md, docs/tutorials/configuration-walkthrough.md, AGENTS.md (benchmark table) — for the rdma_bench.cpp backpressure and completion-drain rework, and the new kMaxOutstanding=64 credit limit.

Rule Used: DAQIRI has no automated doc-sync gate beyond mkdoc... (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Replay the RDMA benchmark profiling and tuning changes from PR #103 on top of main after the stacked branch was merged out of order. Keep the newer main fixes while resolving conflicts: use create_tx_burst_params and null guards in the benchmark, keep IBV_ACCESS_RELAXED_ORDERING guarded for older verbs headers, preserve RDMA metrics accounting, and avoid duplicate TX burst cleanup after send_tx_burst failure. Original PR #103 merge commit: 94a7d2e. Signed-off-by: Cliff Burdick <cburdick@nvidia.com> Signed-off-by: Denis Leshchev <dleshchev@nvidia.com>

greptile-apps Bot reviewed Jun 3, 2026

View reviewed changes

dleshchev force-pushed the review/pr-103-rdma-profiling branch from edabbde to f7fe355 Compare June 3, 2026 17:50

This was referenced Jun 3, 2026

Updates to RDMA bench profiling #103

Merged

#15 - Add DGX Spark sweep tooling and RDMA loopback prereqs #97

Merged

dleshchev merged commit aa501bc into main Jun 3, 2026
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

#15 - Reapply RDMA benchmark profiling updates#118

#15 - Reapply RDMA benchmark profiling updates#118
dleshchev merged 1 commit into
mainfrom
review/pr-103-rdma-profiling

dleshchev commented Jun 3, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented Jun 3, 2026 •

edited

Loading

Greptile Summary

Uh oh!

greptile-apps Bot Jun 3, 2026

Uh oh!

greptile-apps Bot Jun 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dleshchev commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

greptile-apps Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dleshchev commented Jun 3, 2026 •

edited

Loading

greptile-apps Bot commented Jun 3, 2026 •

edited

Loading