#15 - Reapply RDMA benchmark profiling updates#118
Conversation
|
| Filename | Overview |
|---|---|
| examples/rdma_bench.cpp | Backpressure rework: kMaxOutstanding raised 20→64, post_req now returns bool, completions drained in a tight loop before posting work; one pre-existing BurstParams leak on send_tx_burst failure (already flagged in a prior thread). |
| src/managers/rdma/daqiri_rdma_mgr.cpp | Adds SEND-signaling cadence (DAQIRI_RDMA_SEND_SIGNAL_EVERY), cumulative-completion lambda (complete_send_wrs via upper_bound), SPSC ring flags, larger pools, relaxed-ordering fallback, and a server-readiness guard in rdma_get_server_conn_id; safety-valve threshold (size()>=15) silently caps effective send_signal_every at ~16 for any larger env-var value. |
| src/managers/rdma/daqiri_rdma_mgr.h | MAX_CQ and MAX_OUSTANDING_WR bumped to 1024 to match the larger pool/CQ sizes in the .cpp; straightforward and consistent. |
| docs/tutorials/benchmarking_examples.md | New section documents DAQIRI_RDMA_SEND_SIGNAL_EVERY; prose is accurate for values ≤ ~16 but does not mention the internal safety-valve cap that limits effective cadence for larger values. |
Sequence Diagram
sequenceDiagram
participant App as rdma_bench (App thread)
participant API as daqiri API
participant Ring as SPSC tx_ring / rx_ring
participant Thread as rdma_thread
loop each worker iteration
App->>API: get_rx_burst() — drain completions
API->>Ring: rte_ring_dequeue(rx_ring)
Ring-->>API: "completion BurstParams*"
API-->>App: completion (or NO_PACKETS)
App->>App: outstanding_send-- / outstanding_recv--
App->>API: free_tx_burst(completion)
App->>API: create_tx_burst_params()
API-->>App: "msg*"
App->>API: get_tx_packet_burst(msg) — backpressure on pool
alt pool exhausted
API-->>App: NO_FREE_BURST_BUFFERS → return false
App->>App: sleep 100µs, retry next iteration
else success
App->>API: send_tx_burst(msg)
API->>Ring: rte_ring_enqueue(tx_ring, msg)
end
end
loop rdma_thread poll loop
Thread->>Thread: ibv_poll_cq(rx_cq) — RECV completions
Thread->>Thread: ibv_poll_cq(tx_cq) — SEND completions
Note over Thread: complete_send_wrs(wr_id) upper_bound → retire cumulative range
Thread->>Ring: rte_ring_enqueue(rx_ring, completion)
Thread->>Ring: "rte_ring_dequeue(tx_ring, &burst)"
Thread->>Thread: "ibv_post_send / ibv_post_recv (signal every N or size>=15)"
end
Reviews (2): Last reviewed commit: "#15 - Reapply RDMA benchmark profiling u..." | Re-trigger Greptile
| return false; | ||
| } | ||
| if (daqiri::send_tx_burst(msg) != daqiri::Status::SUCCESS) { return; } | ||
| if (daqiri::send_tx_burst(msg) != daqiri::Status::SUCCESS) { return false; } |
There was a problem hiding this comment.
After
get_tx_packet_burst succeeds, msg owns allocated packet buffers. If send_tx_burst fails, returning false without freeing msg silently drains the TX buffer pool — subsequent calls to get_tx_packet_burst will eventually return NO_FREE_BURST_BUFFERS, making the benchmark appear to stall without a clear error.
| if (daqiri::send_tx_burst(msg) != daqiri::Status::SUCCESS) { return false; } | |
| if (daqiri::send_tx_burst(msg) != daqiri::Status::SUCCESS) { | |
| daqiri::free_tx_burst(msg); | |
| return false; | |
| } |
Rule Used: BurstParams is a zero-copy batch — the caller MUST... (source)
| @@ -1356,7 +1438,7 @@ int RdmaMgr::setup_pools_and_rings() { | |||
| ring_name = "TX_RING_" + std::to_string(i); | |||
| DAQIRI_LOG_DEBUG("Setting up TX ring {}", ring_name); | |||
| ring = rte_ring_create( | |||
| ring_name.c_str(), 2048, rte_socket_id(), RING_F_MC_RTS_DEQ | RING_F_MP_RTS_ENQ); | |||
| ring_name.c_str(), 2048, rte_socket_id(), RING_F_SP_ENQ | RING_F_SC_DEQ); | |||
| if (ring == nullptr) { | |||
| DAQIRI_LOG_CRITICAL("Failed to allocate TX ring {}!", ring_name); | |||
| return -1; | |||
| @@ -1367,7 +1449,7 @@ int RdmaMgr::setup_pools_and_rings() { | |||
| // Packet length buffers | |||
| DAQIRI_LOG_DEBUG("Setting up RX meta pool"); | |||
There was a problem hiding this comment.
Doc sync required for
src/managers/rdma/ and examples/rdma_bench.cpp changes
This PR modifies src/managers/rdma/daqiri_rdma_mgr.cpp, daqiri_rdma_mgr.h, and examples/rdma_bench.cpp but does not update the matching docs. Per the doc-sync mapping, the following should be updated in the same PR:
docs/getting-started.md,docs/configuration.md,docs/tutorials/configuration-walkthrough.md,README.md(Backends section),AGENTS.md— for the RDMA manager changes (pool sizes, SPSC rings, CQ tuning, relaxed-ordering fallback, send-signaling cadence, server-readiness guard).docs/tutorials/benchmarking_examples.md,docs/tutorials/configuration-walkthrough.md,AGENTS.md(benchmark table) — for therdma_bench.cppbackpressure and completion-drain rework, and the newkMaxOutstanding=64credit limit.
Rule Used: DAQIRI has no automated doc-sync gate beyond mkdoc... (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Replay the RDMA benchmark profiling and tuning changes from PR #103 on top of main after the stacked branch was merged out of order. Keep the newer main fixes while resolving conflicts: use create_tx_burst_params and null guards in the benchmark, keep IBV_ACCESS_RELAXED_ORDERING guarded for older verbs headers, preserve RDMA metrics accounting, and avoid duplicate TX burst cleanup after send_tx_burst failure. Original PR #103 merge commit: 94a7d2e. Signed-off-by: Cliff Burdick <cburdick@nvidia.com> Signed-off-by: Denis Leshchev <dleshchev@nvidia.com>
edabbde to
f7fe355
Compare
Rebuilds the already-merged PR #103 changes as a clean follow-up on main after PR #96 and PR #116 landed.
Summary:
DAQIRI_RDMA_SEND_SIGNAL_EVERYruntime knob in the benchmark tutorial.Verification:
Note: the container build was run before the docs-only amend; the compiled C++ content is unchanged.