RDMA between two DGX Sparks back-to-back dies on the first SEND with IBV_WC_RETRY_EXC_ERR. The RDMA-CM handshake finishes fine (both sides hit RDMA_CM_EVENT_ESTABLISHED), but right after the server creates its listener it logs [CRITICAL] Couldn't find server params for address 169.254.95.47 — so whatever posts the receive WRs after establish never runs, and the client's first SEND has nothing to land in.
Single-host RDMA works on the same build, and ib_send_bw over the same cable runs at ~212 Gb/s, so I don't think it's the wire.
Looks like an address-key mismatch between where server_params gets inserted (listener-create path around daqiri_rdma_mgr.cpp:1056) and where it gets looked up (:692) — probably string-vs-binary or INADDR_ANY-vs-explicit.
Steps/Code to reproduce bug
Two Sparks back-to-back on ConnectX-7 p0, server on 169.254.95.47, client on 169.254.100.253. Build per docs/tutorials/bare-metal-cmake-build.md. Start the server first:
# server (spark-201a)
sudo build/.../daqiri_bench_rdma examples/daqiri_bench_rdma_server_spark_xhost.yaml --mode server --seconds 30
# client (spark-960b)
sudo build/.../daqiri_bench_rdma examples/daqiri_bench_rdma_client_spark_xhost.yaml --mode client --seconds 30
Client tail:
[ERROR] daqiri_rdma_mgr.cpp:489: CQ error on client: transport retry counter exceeded (12) for WRID 4660 ...
Client received messages: 0
Server tail (one line after the listener gets created at :1056):
[CRITICAL] daqiri_rdma_mgr.cpp:692: Couldn't find server params for address 169.254.95.47
[CRITICAL] daqiri_rdma_mgr.cpp:718: Couldn't find an available queue ID for server 169.254.95.47:4096 ...
Full logs attached: logs/spark_rdma_tx.log, logs/spark_rdma_rx.log.
Expected behavior
Server posts receives at RDMA_CM_EVENT_ESTABLISHED, client SENDs land, both sides report non-zero throughput, no RETRY_EXC_ERR. Single-host daqiri_bench_rdma_tx_rx_spark.yaml keeps working.
Environment overview
- Environment location: Bare-metal
- Method of DAQIRI install: source, branch docs/bare-metal-cmake-build, configured with -DDAQIRI_MGR="dpdk socket rdma"
Environment details
- OS: IGX OS / Ubuntu 24.04 ARM (aarch64), kernel 6.14
- Hardware: 2x DGX Spark (GB10), ConnectX-7 fw 28.45.4028, back-to-back on enp1s0f0np0
- CUDA 13.0, driver 580.95.05, DPDK 25.11 patched with dmabuf.patch + dpdk.nvidia.patch, libibverbs-dev/librdmacm-dev 2510.0.11-1
- nvidia-peermem not loaded (dma-buf path); YAMLs use kind: host_pinned
Additional context
Raw GPUDirect cross-host works on the same hosts/YAMLs, so this is RDMA-specific. There's also a server-side segfault that shows up on stderr but didn't make it into my log capture — it seems like it's downstream of the missing receive-provisioning state, not a separate bug, but worth confirming once the lookup is fixed.