[BUG] RDMA cross-host SEND immediately fails with transport_retry_exceeded and the server never posts receives #113

@chloecrozier

Description

RDMA between two back-to-back DGX Sparks dies on the first SEND with IBV_WC_RETRY_EXC_ERR. The RDMA-CM handshake finishes fine (both sides hit RDMA_CM_EVENT_ESTABLISHED), but right after the server creates its listener it logs [CRITICAL] Couldn't find server params for address 169.254.95.47. So whatever posts the receive WRs after establish never runs, and the client's first SEND has nothing to land in.

Single-host RDMA works on the same build, and ib_send_bw over the same cable runs at ~212 Gb/s, so I don't think it's the wire.

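Looks like an address-key mismatch between where server_params gets inserted (listener-create path around daqiri_rdma_mgr.cpp:1056) and where it gets looked up (:692). Probably either the two sides use different key forms (string vs. binary address), or the listener is registered under INADDR_ANY while the lookup uses the explicit interface address.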
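To make the suspected mismatch concrete, here is a minimal sketch of a lookup that normalizes every key to the binary IPv4 address and falls back to the wildcard entry. It is not the DAQIRI code: `ServerParams`, `ServerParamsRegistry`, and `wildcard_lookup_example` are hypothetical names, and the INADDR_ANY fallback is an assumption about what the fix might need, not something confirmed from the source.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for whatever the manager stores per listener.
struct ServerParams {
  uint16_t port;
};

// Hypothetical registry keyed by the binary IPv4 address (network byte order),
// so insert and lookup cannot disagree on string formatting.
class ServerParamsRegistry {
 public:
  // Insert from a dotted-quad string; returns false if it is not valid IPv4.
  bool insert(const std::string& ip, const ServerParams& params) {
    in_addr addr{};
    if (inet_pton(AF_INET, ip.c_str(), &addr) != 1) return false;
    by_addr_[addr.s_addr] = params;
    return true;
  }

  // Look up by binary address (e.g. taken from a connection's sockaddr_in).
  // Assumption: a listener bound to INADDR_ANY should match any local address.
  const ServerParams* find(in_addr addr) const {
    auto it = by_addr_.find(addr.s_addr);
    if (it == by_addr_.end()) it = by_addr_.find(htonl(INADDR_ANY));
    return it == by_addr_.end() ? nullptr : &it->second;
  }

 private:
  std::unordered_map<uint32_t, ServerParams> by_addr_;
};

// Usage example: the listener is registered under 0.0.0.0, the lookup uses
// the concrete interface address from this report, and the entry is found.
inline void wildcard_lookup_example() {
  ServerParamsRegistry reg;
  reg.insert("0.0.0.0", ServerParams{4096});
  in_addr local{};
  inet_pton(AF_INET, "169.254.95.47", &local);
  const ServerParams* p = reg.find(local);
  assert(p != nullptr && p->port == 4096);
}
```

If the real map is keyed by a string on one side and by a binary address on the other, or has no wildcard fallback, the lookup at :692 would miss in the way the server log shows.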

Steps/Code to reproduce bug

Two Sparks back-to-back on ConnectX-7 p0, server on 169.254.95.47, client on 169.254.100.253. Build per docs/tutorials/bare-metal-cmake-build.md. Start the server first:

# server (spark-201a)
sudo build/.../daqiri_bench_rdma examples/daqiri_bench_rdma_server_spark_xhost.yaml --mode server --seconds 30

# client (spark-960b)
sudo build/.../daqiri_bench_rdma examples/daqiri_bench_rdma_client_spark_xhost.yaml --mode client --seconds 30

Client tail:

[ERROR] daqiri_rdma_mgr.cpp:489: CQ error on client: transport retry counter exceeded (12) for WRID 4660 ...
Client received messages: 0

Server tail (one line after the listener gets created at :1056):

[CRITICAL] daqiri_rdma_mgr.cpp:692: Couldn't find server params for address 169.254.95.47
[CRITICAL] daqiri_rdma_mgr.cpp:718: Couldn't find an available queue ID for server 169.254.95.47:4096 ...

Full logs attached: logs/spark_rdma_tx.log, logs/spark_rdma_rx.log.
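For triage, it may help to separate the two retry-exhausted completion statuses, since they point at different failures. The sketch below is a hypothetical helper (`triage_hint` is not part of DAQIRI). The numeric values are copied from `enum ibv_wc_status` in `<infiniband/verbs.h>` so that it builds without libibverbs. On a reliable connection, a SEND that finds no posted receive normally draws a receiver-not-ready NAK and ends in status 13, while the status 12 in the client log above usually means the peer never responded at all. That makes it worth checking whether the server QP was still alive when the SEND went out.

```cpp
#include <cassert>
#include <string>

// Values mirror enum ibv_wc_status in <infiniband/verbs.h>; redefined here
// only so the sketch compiles without libibverbs installed.
constexpr int kWcSuccess = 0;          // IBV_WC_SUCCESS
constexpr int kWcWrFlushErr = 5;       // IBV_WC_WR_FLUSH_ERR
constexpr int kWcRetryExcErr = 12;     // IBV_WC_RETRY_EXC_ERR
constexpr int kWcRnrRetryExcErr = 13;  // IBV_WC_RNR_RETRY_EXC_ERR

// Hypothetical helper: map a completion status to a first thing to check.
inline std::string triage_hint(int status) {
  switch (status) {
    case kWcSuccess:
      return "ok";
    case kWcRetryExcErr:
      return "peer never ACKed: check remote QP state, addressing, or a crashed peer";
    case kWcRnrRetryExcErr:
      return "peer reported receiver-not-ready: no receive WR was posted";
    case kWcWrFlushErr:
      return "flushed: QP already in error, find the first failing WR";
    default:
      return "other: see ibv_wc_status_str()";
  }
}

// Usage example: the two retry errors must not be treated as the same failure.
inline void triage_example() {
  assert(triage_hint(kWcRetryExcErr) != triage_hint(kWcRnrRetryExcErr));
}
```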

Expected behavior

Server posts receives at RDMA_CM_EVENT_ESTABLISHED, client SENDs land, both sides report non-zero throughput, no RETRY_EXC_ERR. Single-host daqiri_bench_rdma_tx_rx_spark.yaml keeps working.

Environment overview

  • Environment location: Bare-metal
  • Method of DAQIRI install: source, branch docs/bare-metal-cmake-build, configured with -DDAQIRI_MGR="dpdk socket rdma"

Environment details

  • OS: IGX OS / Ubuntu 24.04 ARM (aarch64), kernel 6.14
  • Hardware: 2x DGX Spark (GB10), ConnectX-7 fw 28.45.4028, back-to-back on enp1s0f0np0
  • CUDA 13.0, driver 580.95.05, DPDK 25.11 patched with dmabuf.patch + dpdk.nvidia.patch, libibverbs-dev/librdmacm-dev 2510.0.11-1
  • nvidia-peermem not loaded (dma-buf path); YAMLs use kind: host_pinned

Additional context

Raw GPUDirect cross-host works on the same hosts/YAMLs, so this is RDMA-specific. There's also a server-side segfault that shows up on stderr but didn't make it into my log capture. It looks like it's downstream of the missing receive-provisioning state rather than a separate bug, but that's worth confirming once the lookup is fixed.

Labels

bug (Something isn't working)
