Skip to content

#15 - Reapply RDMA benchmark profiling updates#118

Merged
dleshchev merged 1 commit into
mainfrom
review/pr-103-rdma-profiling
Jun 3, 2026
Merged

#15 - Reapply RDMA benchmark profiling updates#118
dleshchev merged 1 commit into
mainfrom
review/pr-103-rdma-profiling

Conversation

@dleshchev
Copy link
Copy Markdown
Collaborator

@dleshchev dleshchev commented Jun 3, 2026

Rebuilds the already-merged PR #103 changes as a clean follow-up on main after PR #96 and PR #116 landed.

Summary:

  • Replays the RDMA benchmark backpressure and completion-drain behavior.
  • Carries RDMA manager tuning: larger CQ/WR pools, SPSC rings, relaxed-ordering fallback, SEND signaling cadence, and server connection readiness checks.
  • Documents the DAQIRI_RDMA_SEND_SIGNAL_EVERY runtime knob in the benchmark tutorial.
  • Preserves newer main fixes from the conflict resolution: create_tx_burst_params/null guards, guarded IBV_ACCESS_RELAXED_ORDERING, metrics accounting, and send_tx_burst ownership cleanup.

Verification:

  • git diff --check origin/main..HEAD
  • python3 scripts/check_doc_refs.py
  • env IMAGE_TAG=daqiri:pr103-rdma-bc7fbd4 BASE_TARGET=dpdk DAQIRI_MGR="dpdk socket rdma" DAQIRI_BUILD_PYTHON=OFF BUILD_SHARED_LIBS=ON DAQIRI_ENABLE_OTEL_METRICS=OFF scripts/build-container.sh

Note: the container build was run before the docs-only amend; the compiled C++ content is unchanged.

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Jun 3, 2026

Greptile Summary

This PR replays the RDMA benchmark and manager tuning from the previously-merged PR #103 on top of the current main after the stacked branch conflict. It carries the backpressure rework in rdma_bench.cpp (kMaxOutstanding 20→64, completion-drain loop, post_req returning bool) alongside the manager-side changes (SPSC rings, larger CQ/WR/pool sizes, relaxed-ordering fallback, SEND-signaling cadence env var, and server-connection readiness guard).

  • examples/rdma_bench.cpp: replaces the busy-retry allocation loop with a single non-blocking call; drain loop now exhausts all pending completions before each post cycle; the outer while(outstanding < kMax && post_req(...)) pattern cleanly interleaves backpressure with work posting.
  • src/managers/rdma/daqiri_rdma_mgr.cpp: the new complete_send_wrs lambda uses std::map::upper_bound to retire a cumulative range of WR IDs on each signaled completion — required for correctness when send_signal_every > 1; ring flags downgraded to SPSC, which is safe given the stated single-producer/single-consumer constraint documented in the new comment.
  • src/managers/rdma/daqiri_rdma_mgr.h: MAX_CQ and MAX_OUSTANDING_WR raised to 1024, consistent with the larger pool allocations.

Confidence Score: 5/5

Safe to merge; the core logic changes are sound and DCO sign-off is present on the single commit.

The cumulative-completion lambda, SPSC ring downgrade, and backpressure rework are all internally consistent. The relaxed-ordering fallback is correctly guarded with #ifdef. The server-readiness check in rdma_get_server_conn_id closes a real startup race. One non-blocking observation remains: the size() >= 15 safety valve silently limits the tunability of DAQIRI_RDMA_SEND_SIGNAL_EVERY for values greater than ~16, but this does not affect correctness.

No files require special attention; the suggestion on the signaling threshold in daqiri_rdma_mgr.cpp is advisory.

Important Files Changed

Filename Overview
examples/rdma_bench.cpp Backpressure rework: kMaxOutstanding raised 20→64, post_req now returns bool, completions drained in a tight loop before posting work; one pre-existing BurstParams leak on send_tx_burst failure (already flagged in a prior thread).
src/managers/rdma/daqiri_rdma_mgr.cpp Adds SEND-signaling cadence (DAQIRI_RDMA_SEND_SIGNAL_EVERY), cumulative-completion lambda (complete_send_wrs via upper_bound), SPSC ring flags, larger pools, relaxed-ordering fallback, and a server-readiness guard in rdma_get_server_conn_id; safety-valve threshold (size()>=15) silently caps effective send_signal_every at ~16 for any larger env-var value.
src/managers/rdma/daqiri_rdma_mgr.h MAX_CQ and MAX_OUSTANDING_WR bumped to 1024 to match the larger pool/CQ sizes in the .cpp; straightforward and consistent.
docs/tutorials/benchmarking_examples.md New section documents DAQIRI_RDMA_SEND_SIGNAL_EVERY; prose is accurate for values ≤ ~16 but does not mention the internal safety-valve cap that limits effective cadence for larger values.

Sequence Diagram

sequenceDiagram
    participant App as rdma_bench (App thread)
    participant API as daqiri API
    participant Ring as SPSC tx_ring / rx_ring
    participant Thread as rdma_thread

    loop each worker iteration
        App->>API: get_rx_burst() — drain completions
        API->>Ring: rte_ring_dequeue(rx_ring)
        Ring-->>API: "completion BurstParams*"
        API-->>App: completion (or NO_PACKETS)
        App->>App: outstanding_send-- / outstanding_recv--
        App->>API: free_tx_burst(completion)

        App->>API: create_tx_burst_params()
        API-->>App: "msg*"
        App->>API: get_tx_packet_burst(msg) — backpressure on pool
        alt pool exhausted
            API-->>App: NO_FREE_BURST_BUFFERS → return false
            App->>App: sleep 100µs, retry next iteration
        else success
            App->>API: send_tx_burst(msg)
            API->>Ring: rte_ring_enqueue(tx_ring, msg)
        end
    end

    loop rdma_thread poll loop
        Thread->>Thread: ibv_poll_cq(rx_cq) — RECV completions
        Thread->>Thread: ibv_poll_cq(tx_cq) — SEND completions
        Note over Thread: complete_send_wrs(wr_id) upper_bound → retire cumulative range
        Thread->>Ring: rte_ring_enqueue(rx_ring, completion)
        Thread->>Ring: "rte_ring_dequeue(tx_ring, &burst)"
        Thread->>Thread: "ibv_post_send / ibv_post_recv (signal every N or size>=15)"
    end
Loading

Reviews (2): Last reviewed commit: "#15 - Reapply RDMA benchmark profiling u..." | Re-trigger Greptile

Comment thread examples/rdma_bench.cpp
return false;
}
if (daqiri::send_tx_burst(msg) != daqiri::Status::SUCCESS) { return; }
if (daqiri::send_tx_burst(msg) != daqiri::Status::SUCCESS) { return false; }
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 After get_tx_packet_burst succeeds, msg owns allocated packet buffers. If send_tx_burst fails, returning false without freeing msg silently drains the TX buffer pool — subsequent calls to get_tx_packet_burst will eventually return NO_FREE_BURST_BUFFERS, making the benchmark appear to stall without a clear error.

Suggested change
if (daqiri::send_tx_burst(msg) != daqiri::Status::SUCCESS) { return false; }
if (daqiri::send_tx_burst(msg) != daqiri::Status::SUCCESS) {
daqiri::free_tx_burst(msg);
return false;
}

Rule Used: BurstParams is a zero-copy batch — the caller MUST... (source)

Comment on lines 1420 to 1450
@@ -1356,7 +1438,7 @@ int RdmaMgr::setup_pools_and_rings() {
ring_name = "TX_RING_" + std::to_string(i);
DAQIRI_LOG_DEBUG("Setting up TX ring {}", ring_name);
ring = rte_ring_create(
ring_name.c_str(), 2048, rte_socket_id(), RING_F_MC_RTS_DEQ | RING_F_MP_RTS_ENQ);
ring_name.c_str(), 2048, rte_socket_id(), RING_F_SP_ENQ | RING_F_SC_DEQ);
if (ring == nullptr) {
DAQIRI_LOG_CRITICAL("Failed to allocate TX ring {}!", ring_name);
return -1;
@@ -1367,7 +1449,7 @@ int RdmaMgr::setup_pools_and_rings() {
// Packet length buffers
DAQIRI_LOG_DEBUG("Setting up RX meta pool");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Doc sync required for src/managers/rdma/ and examples/rdma_bench.cpp changes

This PR modifies src/managers/rdma/daqiri_rdma_mgr.cpp, daqiri_rdma_mgr.h, and examples/rdma_bench.cpp but does not update the matching docs. Per the doc-sync mapping, the following should be updated in the same PR:

  • docs/getting-started.md, docs/configuration.md, docs/tutorials/configuration-walkthrough.md, README.md (Backends section), AGENTS.md — for the RDMA manager changes (pool sizes, SPSC rings, CQ tuning, relaxed-ordering fallback, send-signaling cadence, server-readiness guard).
  • docs/tutorials/benchmarking_examples.md, docs/tutorials/configuration-walkthrough.md, AGENTS.md (benchmark table) — for the rdma_bench.cpp backpressure and completion-drain rework, and the new kMaxOutstanding=64 credit limit.

Rule Used: DAQIRI has no automated doc-sync gate beyond mkdoc... (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Replay the RDMA benchmark profiling and tuning changes from PR #103 on top of main after the stacked branch was merged out of order.

Keep the newer main fixes while resolving conflicts: use create_tx_burst_params and null guards in the benchmark, keep IBV_ACCESS_RELAXED_ORDERING guarded for older verbs headers, preserve RDMA metrics accounting, and avoid duplicate TX burst cleanup after send_tx_burst failure.

Original PR #103 merge commit: 94a7d2e.

Signed-off-by: Cliff Burdick <cburdick@nvidia.com>
Signed-off-by: Denis Leshchev <dleshchev@nvidia.com>
@dleshchev dleshchev force-pushed the review/pr-103-rdma-profiling branch from edabbde to f7fe355 Compare June 3, 2026 17:50
@dleshchev dleshchev merged commit aa501bc into main Jun 3, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants