Skip to content

#113 - Fix RDMA cross-host SEND timeout and shutdown deadlock#115

Open
chloecrozier wants to merge 2 commits into
mainfrom
feature/spark-xhost-configs
Open

#113 - Fix RDMA cross-host SEND timeout and shutdown deadlock#115
chloecrozier wants to merge 2 commits into
mainfrom
feature/spark-xhost-configs

Conversation

@chloecrozier
Copy link
Copy Markdown
Member

@chloecrozier chloecrozier commented Jun 2, 2026

Summary

Closes #113 (RDMA cross-host SEND fails with IBV_WC_RETRY_EXC_ERR) plus a graceful-shutdown deadlock the #113 fix exposes, and adds cross-host DGX Spark example YAMLs as the regression test.

Fixes (src/managers/rdma/daqiri_rdma_mgr.{h,cpp})

  1. Wrong interface index on the server ([BUG] RDMA cross-host SEND immediately fails with transport_retry_exceeded and the server never posts receives #113). When a client connects, the server picks which configured interface to use for that connection. It was using the wrong number: the hardware's port number from the InfiniBand driver, instead of the index into our own configured interface list. The two happen to look similar but mean different things, so the server thread immediately crashed trying to look up an interface that wasn't there. Because the server crashed before posting any receives, the client's first send had nothing to land in and timed out with RETRY_EXC_ERR. Fix: use the interface index we already saved when the listener was set up.

  2. Server hangs on graceful shutdown (only reachable once the first issue was fixed). When the client disconnects, the server tries to clean up its worker thread. It was doing this badly in three ways at once: it held a lock while waiting for the worker, the worker had no way to know it should stop (it only watched the global "everything is shutting down" flag), and the stats-printing code also wanted that same lock. Once issue 1 stopped the server from crashing early, every clean run ended up stuck in this three-way deadlock and had to be SIGKILLed. Fix: give each worker its own "please exit" flag, set it when the client disconnects, and release the lock before waiting for the worker to actually stop.

The Couldn't find server params for address … line that may show up once at startup is a benign timing race (the app polls slightly before the listener is registered), not the bug.

Regression test

Three new examples/ YAMLs exercising the cross-host data plane that no existing config reached:

  • daqiri_bench_raw_{tx,rx}_spark_xhost.yaml: split per-host (raw bench has no --mode).
  • daqiri_bench_rdma_tx_rx_spark_xhost.yaml: single config selected via --mode {client,server}, matching the existing one-file-per-workload pattern.

Reuses the 1.1.1.1 / 2.2.2.2 IPs and CPU pinning of the existing _spark.yaml configs, with each address bound to its host's p0 instead of both on one box.

Tests ran to verify

  • python scripts/check_doc_refs.py: 23 YAMLs, all covered.
  • mkdocs build --strict: clean.
  • python scripts/check_html_links.py site/: clean.

Hardware (two cross-cabled DGX Sparks, GB10, ConnectX-7 fw 28.45.4028, IGX OS / Ubuntu 24.04 ARM, kernel 6.14, CUDA 13.0, driver 580.95.05):

  • RDMA xhost: Server received: 21761, Client received: 15256 (was 0 / 0 pre-fix); zero transport retry counter exceeded; server reaches Removed leftover hugepage file … and exits ~3 s after its --seconds timer with no SIGKILL. Logs: logs/spark_rdma_{rx,tx}_post_fix3.log (vs. pre-fix logs/spark_rdma_{rx,tx}.log).
  • Raw GPUDirect xhost: lossless over 15 s. TX 24,657,920 / RX 24,657,920 / dropped 0 (~106 Gbps). Logs: logs/spark_raw_{rx,tx}_post_fix.log.
  • Single-host regression (daqiri_bench_rdma_tx_rx_spark.yaml): same code paths, unchanged behavior.

Signed-off-by: Chloe Crozier <chloecrozier@gmail.com>
@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Jun 2, 2026

Greptile Summary

This PR fixes two bugs in the RDMA cross-host data path and adds cross-host DGX Spark example configs as a regression test. All three issues flagged in the previous review round (atomic flag, ring-return ordering, AGENTS.md doc-sync) have been addressed.

  • Interface index bug ([BUG] RDMA cross-host SEND immediately fails with transport_retry_exceeded and the server never posts receives #113): The server-side CONNECT_REQUEST handler was assigning cm_event->id->port_num (IB hardware port, 1-based) to params.if_idx instead of the listener's saved config-array index. The fix correctly uses listen_iter->second.if_idx, the value stored when the listener was set up.
  • Shutdown deadlock: The DISCONNECTED handler now signals the worker via ready_to_exit, moves the thread handle out of worker_threads_ under threads_mutex_, releases the lock, then joins — eliminating the three-way deadlock between the CM event loop, threads_mutex_, and print_stats(). Ring pool return is deferred until after the join, preserving safe ordering.
  • Three new examples/ YAMLs cover cross-host raw GPUDirect and RDMA bench paths, with matching updates to docs/tutorials/benchmarking_examples.md, docs/tutorials/configuration-walkthrough.md, and AGENTS.md.

Confidence Score: 5/5

Safe to merge — the two targeted RDMA fixes are correct and all three issues from the previous review round are resolved.

The interface-index substitution is a one-line swap of the correct pre-saved listener value for the hardware port number. The deadlock fix correctly sequences signal → lock → move → unlock → join → ring return, matching the intended ordering. The atomic flag, try_emplace, and ring-return-after-join patterns are all sound. New YAML examples follow the established placeholder convention, and the required doc-sync targets (benchmarking_examples.md, configuration-walkthrough.md, AGENTS.md) are all updated.

No files require special attention.

Important Files Changed

Filename Overview
src/managers/rdma/daqiri_rdma_mgr.cpp Two targeted bug fixes: replaces cm_event->id->port_num with listen_iter->second.if_idx in CONNECT_REQUEST handler, and restructures DISCONNECTED handler to signal worker, release mutex, join, then return rings — eliminating the deadlock.
src/managers/rdma/daqiri_rdma_mgr.h Changes ready_to_exit from bool to std::atomic{false}, making rdma_thread_params non-copyable; run() switches to try_emplace for in-place construction to accommodate the non-movable atomic.
examples/daqiri_bench_rdma_tx_rx_spark_xhost.yaml New cross-host RDMA bench config; both interfaces and both bench roles defined in one file, selected by --mode, matching the existing single-host spark pattern. Memory regions and CPU pinning consistent with sibling configs.
examples/daqiri_bench_raw_tx_spark_xhost.yaml New TX-side raw bench config; eth_dst_addr correctly uses the placeholder convention requiring operator substitution.
examples/daqiri_bench_raw_rx_spark_xhost.yaml New RX-side raw bench config; flow match tuple (udp_src/dst 4096) consistent with the TX-side YAML.
AGENTS.md Benchmark table updated to include the three new _xhost YAMLs; addresses the doc-sync gap flagged in the previous review round.
docs/tutorials/benchmarking_examples.md Adds a Cross-host two-DGX-Spark loopback section with run instructions for both raw GPUDirect and RDMA xhost configs; benign-race note for the startup log line is accurate.
docs/tutorials/configuration-walkthrough.md Adds cross-host DGX Spark entries to the raw GPUDirect and RDMA config-selection lists with correct links to the new YAMLs and the new benchmarking section.

Sequence Diagram

sequenceDiagram
    participant App as Application
    participant CM as CM Event Loop
    participant W as Worker Thread
    participant Pool as Ring Pool

    Note over CM,W: Normal connection lifecycle (fixed)

    CM->>CM: "CONNECT_REQUEST<br/>params.if_idx = listen_iter->if_idx<br/>ready_to_exit.store(false)"
    CM->>CM: ESTABLISHED — spawn worker thread
    CM->>W: rdma_thread(tparams)
    W->>Pool: use tx_ring / rx_ring

    Note over CM,W: Graceful disconnect (deadlock fixed)

    CM->>W: ready_to_exit.store(true)
    Note over CM: lock threads_mutex_
    CM->>CM: move thread to worker_to_join
    CM->>CM: erase from worker_threads_
    Note over CM: unlock threads_mutex_
    CM->>W: worker_to_join.join()
    W-->>CM: thread exits
    CM->>Pool: tx_rings_.push(ring)
    CM->>Pool: rx_rings_.push(ring)
    CM->>CM: clear slot

    Note over CM,W: Force-quit path
    App->>CM: rdma_force_quit.store(true)
    CM->>CM: exit event loop
    W-->>CM: exits via rdma_force_quit check
    CM->>CM: join remaining worker_threads_
Loading

Reviews (2): Last reviewed commit: "#113 - Fix RDMA cross-host SEND timeout ..." | Re-trigger Greptile

Comment thread src/managers/rdma/daqiri_rdma_mgr.h Outdated
Comment thread src/managers/rdma/daqiri_rdma_mgr.cpp Outdated
Comment thread examples/daqiri_bench_rdma_tx_rx_spark_xhost.yaml
Signed-off-by: Chloe Crozier <chloecrozier@gmail.com>
@chloecrozier
Copy link
Copy Markdown
Member Author

Addressed Greptile's review:

ready_to_exit is now std::atomic (matching the existing rdma_force_quit pattern, so the worker actually observes the write under the C++ memory model), and the per-connection rings are returned to the pool only after worker_to_join.join() so the worker can't be observed touching a "freed" ring.

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Jun 2, 2026

Want your agent to iterate on Greptile's feedback? Try greploops.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] RDMA cross-host SEND immediately fails with transport_retry_exceeded and the server never posts receives

1 participant