NVIDIA · chloecrozier · Jun 2, 2026 · Jun 2, 2026
diff --git a/AGENTS.md b/AGENTS.md
@@ -30,11 +30,11 @@ There is no unit test suite. Verification is done via the benchmark executables
 
 | Executable | Source | Typical config |
 |---|---|---|
-| `daqiri_bench_raw_gpudirect` | `raw_gpudirect_bench.cpp` | `daqiri_bench_raw_tx_rx.yaml`, `daqiri_bench_raw_tx_rx_4q.yaml`, `daqiri_bench_raw_tx_rx_spark.yaml`, `daqiri_bench_raw_sw_loopback.yaml`, `daqiri_bench_raw_rx_multi_q.yaml` |
+| `daqiri_bench_raw_gpudirect` | `raw_gpudirect_bench.cpp` | `daqiri_bench_raw_tx_rx.yaml`, `daqiri_bench_raw_tx_rx_4q.yaml`, `daqiri_bench_raw_tx_rx_spark.yaml`, `daqiri_bench_raw_{tx,rx}_spark_xhost.yaml`, `daqiri_bench_raw_sw_loopback.yaml`, `daqiri_bench_raw_rx_multi_q.yaml` |
 | `daqiri_bench_raw_hds` | `raw_hds_bench.cpp` | `daqiri_bench_raw_tx_rx_hds.yaml` |
 | `daqiri_bench_raw_reorder_seq` | `raw_reorder_seq_bench.cpp` | `daqiri_bench_raw_tx_rx_reorder_seq_1024*.yaml`, `daqiri_bench_raw_rx_reorder_seq_*.yaml` |
 | `daqiri_bench_raw_reorder_quantize` | `raw_reorder_quantize_bench.cpp` | `daqiri_bench_raw_tx_rx_reorder_quantize_seq_batch.yaml` |
-| `daqiri_bench_rdma` | `rdma_bench.cpp` | `daqiri_bench_rdma_tx_rx.yaml`, `daqiri_bench_rdma_tx_rx_spark.yaml` |
+| `daqiri_bench_rdma` | `rdma_bench.cpp` | `daqiri_bench_rdma_tx_rx.yaml`, `daqiri_bench_rdma_tx_rx_spark.yaml`, `daqiri_bench_rdma_tx_rx_spark_xhost.yaml` |
 | `daqiri_bench_socket` | `socket_bench.cpp` | `daqiri_bench_socket_{udp,tcp}_tx_rx.yaml` |
 
 The four `raw_*` benches share `raw_bench_common.{cpp,h}` and accept `--seconds N`. `daqiri_bench_rdma` and `daqiri_bench_socket` also take `--mode {tx,rx,both}`.

diff --git a/docs/tutorials/benchmarking_examples.md b/docs/tutorials/benchmarking_examples.md
@@ -50,6 +50,34 @@ docker run --rm -it --privileged \
     - [`daqiri_bench_raw_tx_rx_spark.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_spark.yaml) for `daqiri_bench_raw_gpudirect` — still set `eth_dst_addr` to the RX MAC. The rx_port is `0002:01:00.1` (physical port p1), so read its MAC: `cat /sys/class/net/enP2p1s0f1np1/address`. This p0-to-p1 pairing is intentional for an over-the-wire single-machine loopback; using two PFs that map to the same physical port exercises the on-chip eswitch path instead.
     - [`daqiri_bench_rdma_tx_rx_spark.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_rdma_tx_rx_spark.yaml) for `daqiri_bench_rdma` — no further edits needed.
 
+#### Cross-host two-DGX-Spark loopback
+
+If you have two DGX Sparks cross-cabled p0↔p0 instead of a chassis QSFP loop on one machine, use the `_xhost` configs. Each host runs only its own role, so the YAML on each side configures one port instead of two. Both hosts must already be set up per the [DGX Spark profile](system_configuration.md#dgx-spark-profile), with one adjustment: the `daqiri-tx` (`1.1.1.1/24`) and `daqiri-rx` (`2.2.2.2/24`) nmcli profiles are *split across* the two hosts — bring up `daqiri-tx` on the TX host's p0 and `daqiri-rx` on the RX host's p0, instead of both on one box.
+
+**Raw GPUDirect.** Start the RX side first so the flow rule is installed before any traffic arrives:
+
+```bash
+# RX host
+sudo ./daqiri_bench_raw_gpudirect daqiri_bench_raw_rx_spark_xhost.yaml --seconds 30
+
+# TX host (set eth_dst_addr to the RX host p0's MAC first: cat /sys/class/net/enp1s0f0np0/address on the RX host)
+sudo ./daqiri_bench_raw_gpudirect daqiri_bench_raw_tx_spark_xhost.yaml --seconds 30
+```
+
+Verify both sides report non-zero packet counts and no `NO_FREE_BURST_BUFFERS` / `NO_FREE_PACKET_BUFFERS` errors.
+
+**RDMA.** Start the server side first:
+
+```bash
+# RX (server) host
+sudo ./daqiri_bench_rdma daqiri_bench_rdma_tx_rx_spark_xhost.yaml --mode server --seconds 30
+
+# TX (client) host
+sudo ./daqiri_bench_rdma daqiri_bench_rdma_tx_rx_spark_xhost.yaml --mode client --seconds 30
+```
+
+Verify both sides report non-zero send/receive completions. The server-side `Couldn't find server params for address …` log line that may appear once between the listener-create log and the "RDMA server successfully started" log is a benign startup race (the application thread polls for the listener before the CM thread finishes inserting it); subsequent lookups succeed.
+
 The benchmark executables and example YAML configurations are located at:
 
 | | Binaries | YAML configs |

diff --git a/docs/tutorials/configuration-walkthrough.md b/docs/tutorials/configuration-walkthrough.md
@@ -22,12 +22,14 @@ With a backend in mind, read down the questions below and stop at the first one
     - **Generic discrete GPU** (template — replace `<placeholders>`) — [`daqiri_bench_raw_tx_rx.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx.yaml). This is the file annotated line-by-line in the [walkthrough below](#annotated-walkthrough).
     - **Four queue closed-loop TX+RX** (template — replace `<placeholders>`) — [`daqiri_bench_raw_tx_rx_4q.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_4q.yaml). Uses one application worker per TX/RX queue, with each `bench_tx` entry sending a different UDP flow.
     - **DGX Spark / GB10** (prefilled) — [`daqiri_bench_raw_tx_rx_spark.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_spark.yaml). `kind: host_pinned` for the integrated GPU; cores, PCIe addresses, and IPs are prefilled. See the [Spark profile callout](benchmarking_examples.md#update-the-loopback-configuration) for run details.
+    - **DGX Spark cross-host** (prefilled, runs on two Sparks) — [`daqiri_bench_raw_tx_spark_xhost.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_spark_xhost.yaml) on the TX host and [`daqiri_bench_raw_rx_spark_xhost.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_rx_spark_xhost.yaml) on the RX host. Each host runs `daqiri_bench_raw_gpudirect` against its own half; cables connect p0↔p0 between the two boxes. See the [Cross-host two-DGX-Spark loopback](benchmarking_examples.md#cross-host-two-dgx-spark-loopback) section for run details.
     - **No physical NIC available** — [`daqiri_bench_raw_sw_loopback.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_sw_loopback.yaml). `loopback: "sw"`, no NIC required. Useful for first-time build verification, not representative of production performance.
 
     **RDMA / RoCE** — runs on `daqiri_bench_rdma` (use `--mode {tx,rx,both}`). Configs use `kind: host_pinned` regardless of platform.
 
     - **Generic** (template — replace IPs) — [`daqiri_bench_rdma_tx_rx.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_rdma_tx_rx.yaml).
     - **DGX Spark** (prefilled) — [`daqiri_bench_rdma_tx_rx_spark.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_rdma_tx_rx_spark.yaml). See the [Spark profile callout](benchmarking_examples.md#update-the-loopback-configuration) for run details.
+    - **DGX Spark cross-host** (prefilled, runs on two Sparks) — [`daqiri_bench_rdma_tx_rx_spark_xhost.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_rdma_tx_rx_spark_xhost.yaml). Run with `--mode server` on the RX host and `--mode client` on the TX host. See the [Cross-host two-DGX-Spark loopback](benchmarking_examples.md#cross-host-two-dgx-spark-loopback) section for run details.
 
     **Kernel TCP/UDP sockets** — runs on `daqiri_bench_socket`. Both bind to `127.0.0.1`.
 

diff --git a/examples/daqiri_bench_raw_rx_spark_xhost.yaml b/examples/daqiri_bench_raw_rx_spark_xhost.yaml
@@ -0,0 +1,54 @@
+# DGX Spark (GB10) cross-host RX-side config for daqiri_bench_raw_gpudirect.
+# Companion to daqiri_bench_raw_tx_spark_xhost.yaml on the peer TX host. Both
+# hosts must be configured per the DGX Spark profile in
+# docs/tutorials/system_configuration.md, with the chosen p0 port cross-cabled
+# host-to-host (no chassis QSFP loop).
+#
+# Spark substitutions baked in here:
+#   - rx_port BDF = 0000:01:00.0 (p0); change if your p0 sits elsewhere
+#   - kind: host_pinned (GB10 dma-buf path; nvidia-peermem N/A on Spark)
+#   - master_core: 8; cpu_core: 18 (isolated big-cluster X925 16-19)
+#   - flow match: udp_src/dst = 4096 -- same UDP tuple the TX side sends to
+#
+%YAML 1.2
+---
+daqiri:
+  cfg:
+    version: 1
+    stream_type: "raw"
+    master_core: 8
+    debug: false
+    log_level: "info"
+    loopback: ""
+
+    memory_regions:
+    - name: "Data_RX_GPU"
+      kind: "host_pinned"
+      affinity: 0
+      num_bufs: 51200
+      buf_size: 8064
+
+    interfaces:
+    - name: "rx_port"
+      address: 0000:01:00.0
+      rx:
+        flow_isolation: true
+        queues:
+        - name: "rq_q_0"
+          id: 0
+          cpu_core: 18
+          batch_size: 10240
+          memory_regions:
+            - "Data_RX_GPU"
+        flows:
+        - name: "flow_0"
+          id: 0
+          action:
+            type: queue
+            id: 0
+          match:
+            udp_src: 4096
+            udp_dst: 4096
+
+bench_rx:
+  interface_name: "rx_port"
diff --git a/examples/daqiri_bench_raw_tx_spark_xhost.yaml b/examples/daqiri_bench_raw_tx_spark_xhost.yaml
@@ -0,0 +1,58 @@
+# DGX Spark (GB10) cross-host TX-side config for daqiri_bench_raw_gpudirect.
+# Companion to daqiri_bench_raw_rx_spark_xhost.yaml on the peer RX host. Both
+# hosts must be configured per the DGX Spark profile in
+# docs/tutorials/system_configuration.md, with the chosen p0 port cross-cabled
+# host-to-host (no chassis QSFP loop).
+#
+# Spark substitutions baked in here:
+#   - tx_port BDF = 0000:01:00.0 (p0); change if your p0 sits elsewhere
+#   - kind: host_pinned (GB10 dma-buf path; nvidia-peermem N/A on Spark)
+#   - master_core: 8; cpu_core: 17 (isolated big-cluster X925 16-19)
+#   - eth_dst_addr is the *peer* RX port's MAC -- replace with your own:
+#       cat /sys/class/net/<peer rx ifname>/address  # on the RX host
+#   - ip_src/ip_dst: arbitrary 1.1.1.1 -> 2.2.2.2 (kernel stack bypassed by
+#     the DPDK PMD; only the UDP src/dst ports below are matched by the RX
+#     flow rule in daqiri_bench_raw_rx_spark_xhost.yaml)
+#
+%YAML 1.2
+---
+daqiri:
+  cfg:
+    version: 1
+    stream_type: "raw"
+    master_core: 8
+    debug: false
+    log_level: "info"
+    loopback: ""
+
+    memory_regions:
+    - name: "Data_TX_GPU"
+      kind: "host_pinned"
+      affinity: 0
+      num_bufs: 51200
+      buf_size: 8064
+
+    interfaces:
+    - name: "tx_port"
+      address: 0000:01:00.0
+      tx:
+        queues:
+        - name: "tx_q_0"
+          id: 0
+          batch_size: 10240
+          cpu_core: 17
+          memory_regions:
+            - "Data_TX_GPU"
+          offloads:
+            - "tx_eth_src"
+
+bench_tx:
+  interface_name: "tx_port"
+  batch_size: 10240
+  payload_size: 8000
+  header_size: 64
+  eth_dst_addr: <00:00:00:00:00:00>
+  ip_src_addr: 1.1.1.1
+  ip_dst_addr: 2.2.2.2
+  udp_src_port: 4096
+  udp_dst_port: 4096
diff --git a/examples/daqiri_bench_rdma_tx_rx_spark_xhost.yaml b/examples/daqiri_bench_rdma_tx_rx_spark_xhost.yaml
@@ -0,0 +1,112 @@
+# DGX Spark (GB10) cross-host config for daqiri_bench_rdma.
+# Adapts the single-host daqiri_bench_rdma_tx_rx_spark.yaml to a two-host
+# setup with each side's p0 cross-cabled to the peer's p0. Both hosts must
+# be configured per the DGX Spark profile in
+# docs/tutorials/system_configuration.md, except the IP assignment is
+# split across hosts: put 1.1.1.1/24 on the client host's p0 and 2.2.2.2/24
+# on the server host's p0 (instead of both addresses on one machine).
+#
+# Run with --mode client on the TX host and --mode server on the RX host:
+#   server (RX): sudo ./daqiri_bench_rdma <this.yaml> --mode server --seconds 30
+#   client (TX): sudo ./daqiri_bench_rdma <this.yaml> --mode client --seconds 30
+#
+# This config is the regression test for the RDMA cross-host receive-
+# provisioning bug fixed in issue #113. Before the fix, the server-side
+# worker thread crashed on launch and no receives were ever posted.
+#
+# Spark substitutions baked in here:
+#   - IPs: 1.1.1.1 (client/TX p0) and 2.2.2.2 (server/RX p0)
+#   - cpu_core values from isolated big-cluster X925 16-19; master_core: 8
+#   - kind: host_pinned (required upstream on GB10; peermem N/A, dma-buf used)
+#
+%YAML 1.2
+---
+daqiri:
+  cfg:
+    version: 1
+    stream_type: "socket"
+    protocol: "roce"
+    master_core: 8
+    debug: false
+    log_level: "info"
+
+    memory_regions:
+    - name: "DATA_RX_GPU_SERVER"
+      kind: "host_pinned"
+      affinity: 0
+      num_bufs: 20
+      buf_size: 9000000
+    - name: "DATA_TX_GPU_SERVER"
+      kind: "host_pinned"
+      affinity: 0
+      num_bufs: 20
+      buf_size: 9000000
+    - name: "DATA_TX_GPU_CLIENT"
+      kind: "host_pinned"
+      affinity: 0
+      num_bufs: 20
+      buf_size: 90000000
+    - name: "DATA_RX_GPU_CLIENT"
+      kind: "host_pinned"
+      affinity: 0
+      num_bufs: 20
+      buf_size: 90000000
+
+    interfaces:
+    - name: my_client
+      address: 1.1.1.1
+      socket_config:
+        mode: client
+        remote_ip: 2.2.2.2
+        remote_port: 4096
+      roce_config:
+        transport_mode: RC
+      tx:
+        queues:
+        - name: "Client_TX_Queue"
+          id: 0
+          batch_size: 1
+          cpu_core: 17
+      rx:
+        queues:
+        - name: "Client_RX_Queue"
+          id: 0
+          cpu_core: 18
+          batch_size: 1
+    - name: my_server
+      address: 2.2.2.2
+      socket_config:
+        mode: server
+        local_ip: 2.2.2.2
+        local_port: 4096
+      roce_config:
+        transport_mode: RC
+      rx:
+        queues:
+        - name: "Server_RX_Queue"
+          id: 0
+          cpu_core: 19
+          batch_size: 1
+      tx:
+        queues:
+        - name: "Server_TX_Queue"
+          id: 0
+          cpu_core: 16
+          batch_size: 1
+
+rdma_bench_server:
+  server_address: 2.2.2.2
+  server_port: 4096
+  message_size: 8000000
+  send: true
+  receive: true
+  server: true
+
+rdma_bench_client:
+  message_size: 8000000
+  client_address: 1.1.1.1
+  server_address: 2.2.2.2
+  server_port: 4096
+  receive: true
+  send: true
+  server: false