diff --git a/AGENTS.md b/AGENTS.md index effe7ff..64160ba 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -96,6 +96,7 @@ The web docs live in `docs/` and are built with [MkDocs Material](https://squidf - `docs/concepts.md` — terminology glossary (stream types and protocols, GPUDirect, packet/burst/segment, flow/queue, memory region, zero-copy ownership, RX reorder). Meant to be opened in parallel with the rest of the docs. - `docs/api-reference/index.md` — API guide (6-step application lifecycle, configuration-first model) - `docs/api-reference/configuration.md`, `docs/api-reference/cpp.md`, `docs/api-reference/python.md` — YAML schema, C++ API, and Python bindings docs +- `docs/performance-dgx-spark.md` — per-platform performance report for DGX Spark stream/protocol combinations - `docs/tutorials/` — tutorial walkthroughs (system config, config-file walkthrough) - `docs/tutorials/benchmarking_examples.md` — surfaced as a top-level "Benchmarks" nav entry in `mkdocs.yml` and `docs/index.html`; file kept at its original path for inbound-link stability - `docs/stylesheets/extra.css` — custom theme overrides @@ -105,9 +106,10 @@ The web docs live in `docs/` and are built with [MkDocs Material](https://squidf **Keeping docs in sync with code:** before committing changes, scan for the recurring drift hotspots: - **Stream-type list** (`src/managers/*/`) — README Backends table, `docs/getting-started.md`, `docs/concepts.md` (Stream Types section + Support and testing admonition), `docs/api-reference/configuration.md` - **CMake options / `DAQIRI_MGR` default** (`src/CMakeLists.txt:137`) — README Quick Start, `docs/getting-started.md`, this file's Build & run section -- **Benchmark binary or YAML names** (`examples/`) — the benchmark table above, `docs/tutorials/benchmarking_examples.md`, and the "Choosing an example config" decision tree in `docs/tutorials/configuration-walkthrough.md` (every YAML must have a leaf; CI's `scripts/check_doc_refs.py` enforces coverage) +- **Benchmark binary or YAML names** (`examples/`) — the benchmark table above, `docs/tutorials/benchmarking_examples.md`, the "Choosing an example config" decision tree in `docs/tutorials/configuration-walkthrough.md` (every YAML must have a leaf; CI's `scripts/check_doc_refs.py` enforces coverage), and per-platform performance docs (`docs/performance-*.md`) - **Public API include** (`#include `; source files under `include/daqiri/`) — `docs/api-reference/index.md`, `docs/api-reference/cpp.md`, `docs/api-reference/python.md`; if the change adds or renames a user-facing concept, also `docs/concepts.md` - **Python bindings** (`python/daqiri_common_pybind.cpp`) — `docs/api-reference/python.md` (function reference tables, enums/classes tables, GIL Behavior section) +- **Bench CLI flags or output format** (`examples/raw_bench_common.{h,cpp}`, `*_bench.cpp`) — per-platform performance docs' Methodology section, `examples/run_spark_bench.sh` parsing logic - **Doc reorganization** (any rename in `docs/`) — `docs/index.html` landing page, `mkdocs.yml` nav, README Documentation table The full mapping with rationale lives in the docs-sync agent rule. Internal-link, anchor, and nav drift is enforced by CI (`.github/workflows/docs.yml`); content drift (stale binary names, defaults) is still a manual check at commit time. diff --git a/README.md b/README.md index 44867bd..36db3d0 100644 --- a/README.md +++ b/README.md @@ -91,6 +91,7 @@ Reference material for the DAQIRI codebase: - [Configuration YAML Reference](https://nvidia.github.io/daqiri/api-reference/configuration/) — Full YAML config reference for all backends - [C++ API Usage](https://nvidia.github.io/daqiri/api-reference/cpp/) — C++ RX/TX workflows, buffer lifecycle, file writing, utilities, and status codes - [Python API Usage](https://nvidia.github.io/daqiri/api-reference/python/) — Python bindings, workflow examples, enums, config classes, and helper functions +- [Performance: DGX Spark](https://nvidia.github.io/daqiri/performance-dgx-spark/) — Per-platform throughput, drop, and utilization numbers for stream/protocol combinations on DGX Spark - [Contributing](CONTRIBUTING.md) — Contribution guidelines, coding standards, DCO sign-off ## Tutorials diff --git a/docs/index.html b/docs/index.html index f5b49e8..9e2f53a 100644 --- a/docs/index.html +++ b/docs/index.html @@ -684,6 +684,12 @@

News

+
+
Performance2026
+
DAQIRI Performance on DGX Spark
+
NVIDIA — Throughput, drops, and resource utilization for raw GPUDirect, Socket / RoCE, and Socket / UDP/TCP measured on a DGX Spark (GB10) workstation. First in a series of per-platform performance reports.
+ +
GitHub2025
DAQIRI Open-Sourced on GitHub
diff --git a/docs/performance-dgx-spark.md b/docs/performance-dgx-spark.md new file mode 100644 index 0000000..d7736fb --- /dev/null +++ b/docs/performance-dgx-spark.md @@ -0,0 +1,612 @@ +# Performance: DGX Spark + +This page reports DAQIRI throughput, drop, and resource-utilization numbers +measured on a DGX Spark (GB10) workstation. It is the first in a series of +per-platform performance reports; the same section layout will be reused for +IGX, x86-server, and other targets. + +The numbers below are reproducible — every cell is generated by +[`examples/run_spark_bench.sh`](https://github.com/nvidia/daqiri/blob/main/examples/run_spark_bench.sh) +against the YAML configs in `examples/`, with the system state captured by +[`examples/bench_capture_environment.sh`](https://github.com/nvidia/daqiri/blob/main/examples/bench_capture_environment.sh) +alongside each result set. + +**Contents** + +- [Summary](#summary) +- [Introduction](#introduction) — system under test, methodology +- [C++ Results](#c-results) — raw GPUDirect, Socket / RoCE, Socket / UDP and TCP, workload variants +- [Python Results](#python-results) +- [Reproduce these results](#reproduce-these-results) +- [TODO: Not Yet Implemented / Known Limitations](#todo-not-yet-implemented-known-limitations) + +## Summary + +Two headline tables. **Native-shape peak** reports each stream/protocol +combination at its best-case operation size. **Matched 8 KB** drives raw +GPUDirect, Socket / RoCE, and Socket / UDP at a common ~8 KB unit of work so +the comparison is apples-to-apples; TCP is omitted from that table since it +has no operation boundary. + +### Native-shape peak — max no-drop throughput (Gbps) + +| Stream / Protocol | C++ loopback | C++ + FFT | C++ + GEMM | Python loopback | Python + FFT | Python + GEMM | +| ------------------------------ | -------------- | ---------------- | ---------------- | ----------------- | ---------------- | ---------------- | +| Raw Ethernet / GPUDirect (8 KB) | **96.4** | _TBD (follow-up)_ | _TBD (follow-up)_ | _TBD (follow-up)_ | _TBD (follow-up)_ | _TBD (follow-up)_ | +| Socket / RoCE (8 MB SEND) | **83.4**[^1] | _TBD (follow-up)_ | _TBD (follow-up)_ | _TBD (follow-up)_ | _TBD (follow-up)_ | _TBD (follow-up)_ | +| Socket / UDP (MTU) | _deferred_[^2] | n/a | n/a | _TBD (follow-up)_ | n/a | n/a | +| Socket / TCP (stream) | _deferred_[^3] | n/a | n/a | _TBD (follow-up)_ | n/a | n/a | + +### Matched 8 KB operation — cross-transport Gbps + +| Stream / Protocol | C++ loopback | C++ + FFT | C++ + GEMM | Python loopback | Python + FFT | Python + GEMM | +| ------------------------------- | -------------- | ---------------- | ---------------- | ----------------- | ---------------- | ---------------- | +| Raw Ethernet / GPUDirect | **96.4** | _TBD (follow-up)_ | _TBD (follow-up)_ | _TBD (follow-up)_ | _TBD (follow-up)_ | _TBD (follow-up)_ | +| Socket / RoCE | 0.006[^1] | _TBD (follow-up)_ | _TBD (follow-up)_ | _TBD (follow-up)_ | _TBD (follow-up)_ | _TBD (follow-up)_ | +| Socket / UDP (1472 B, MTU) | _deferred_[^2] | n/a | n/a | _TBD (follow-up)_ | n/a | n/a | + +[^1]: RoCE single-host loopback on Spark requires host network prereqs before the bench can use the cable — see [Tuning prerequisites](#tuning-prerequisites). The native-shape 8 MB cell saturates at ~83 Gbps single-stream (single QP, batch 1); the matched 8 KB cell comes from the original Spark data set and should be treated as pending refresh rather than a final platform limit after the current RDMA bench changes. The aggregate bidirectional rate at native shape on the same run was ~175 Gbps (server TX 89.2 + client TX 83.4). +[^2]: Socket UDP rows are pending re-run of `./examples/run_spark_bench.sh socket-udp sweep` with the current wrapper-side fix that maps `sent_packets`/`sent_bytes` from `socket_bench` (the previous wrapper only knew the RDMA `send_completions`/`send_bytes` keys and zero-filled the CSV for socket runs). The earlier "peer-learning deadlock" diagnosis was a red herring — bench behaves correctly; the CSV was bogus because of the wrapper bug. Re-run tracked outside this PR. +[^3]: Socket TCP rows are pending re-run for the same reason as Socket UDP (see [^2]). The earlier "glibc heap-corruption" diagnosis was also a red herring; the bench exits cleanly under ASan across 20+ runs. + +!!! note "Why two tables" + A single Gbps number isn't enough to compare stream/protocol combinations + fairly. Raw GPUDirect at peak with 8 KB packets is doing very different + work than Socket / RoCE at peak with 8 MB messages, even when both report + the same Gbps. The matched 8 KB table makes the comparison honest; the + native-shape table shows each combination's design ceiling. + +## Introduction + +### System under test + +The reproducibility appendix has the full capture. Key fields: + +| Field | Value | +| ----------------- | ------------------------------------ | +| Platform | NVIDIA DGX Spark (GB10) | +| GPU | Blackwell (compute capability 12.1) | +| NIC | ConnectX-7 (two ports, tied / QSFP loopback) | +| Topology | `0000:01:00.0` (mlx5_0) → `0002:01:00.0` (mlx5_2), physical loopback cable | +| OS | _captured in `environment.txt`_ | +| Kernel | _captured in `environment.txt`_ | +| CUDA driver | _captured in `environment.txt`_ | +| DPDK | patched per `dpdk_patches/` (container build) | +| DAQIRI commit | _captured in `environment.txt`_ | + +### Methodology + +#### Bench commands + +Each stream/protocol combination has a dedicated bench executable in +`examples/`. The raw GPUDirect numbers in this report come from the first +command; the Socket / RoCE and Socket / UDP/TCP commands are listed here for +documentation and will be the basis for the future fill of those rows. + +```bash +# DPDK GPUDirect — physical loopback (used in this report) +./build/examples/daqiri_bench_raw_gpudirect \ + examples/daqiri_bench_raw_tx_rx_spark.yaml \ + --seconds 30 [--target-gbps G] + +# RoCE — same NIC, two ports cross-cabled. Host prereq: scripts/setup_spark_rdma_loopback.sh +./build/examples/daqiri_bench_rdma \ + examples/daqiri_bench_rdma_tx_rx_spark.yaml \ + --seconds 30 --mode both [--target-gbps G] + +# Socket UDP / TCP — localhost (deferred; see TODO / Known Limitations) +./build/examples/daqiri_bench_socket \ + examples/daqiri_bench_socket_udp_tx_rx.yaml \ + --seconds 30 --mode both [--target-gbps G] +``` + +The DPDK YAML expects `eth_dst_addr` filled from the RX iface MAC: + +```bash +ETH_DST_ADDR="$(cat /sys/class/net//address)" +``` + +#### Per stream/protocol sweep dimensions + +**Payload** is the user-data bytes in one packet (DPDK / Socket UDP), +one RDMA message, or one TCP send. **Batch** is how many packets DAQIRI +hands to (or pulls from) the NIC in one `rte_eth_tx_burst` / +`rte_eth_rx_burst` call — the burst size knob, not a packet-size knob. +Larger batches amortize doorbell and API overhead per packet; smaller +batches keep per-call latency lower. Batch only matters when the bench +is not yet at the link ceiling — at saturation, the NIC is the +bottleneck and batch size has near-zero effect. + +The "payload × batch" sweep doesn't map uniformly across stream/protocol +combinations. Each run target has its own sweep: + +| Stream / Protocol | Sweep dim 1 | Sweep dim 2 | Native-shape cell | Matched-size cell | +| ----------------------- | ------------------------------------------------------ | -------------------------------------- | -------------------- | ---------------------- | +| Raw Ethernet / GPUDirect | payload_size in {64, 256, 1024, 4096, 8000} B | batch_size in {256, 1024, 4096, 10240} | (8000, 10240) | (8000, 10240) — same | +| Socket / RoCE | message_size in {4 K, 64 K, 1 M, 8 M} B | batch_size fixed at 1 | (8 M, 1) | (8 K, 1) | +| Socket / UDP | payload_size in {64, 256, 1024, 1472} B (MTU-bound) | batch_size fixed at 1 | (1472, 1) | (1472, 1) — closest to 8 K under MTU cap | +| Socket / TCP | message_size in {1 K, 64 K, 1 M} B | n/a (single stream) | (64 K) | n/a | + +#### "No-drop" threshold + +A run is **drop-free** when reported `drops == 0` over a `--seconds 30` run. +The headline tables report the highest target rate at which a run was still +drop-free under this threshold. The methodology does not use a percentile cap +— either there are drops or there are not. + +#### Drop-curve sweep + +The drop curve sweeps `--target-gbps` while holding the native-shape cell +constant. The token-bucket pacer in the bench TX worker (`raw_bench_common`) +adds a software-paced sleep after each burst; accuracy is ~±5 % at high rates +due to OS sleep granularity and scheduler jitter. Hardware TX pacing (DPDK +`accurate_send`) is unused but would tighten DPDK-only precision; deferred to +a follow-up. + +#### Drop sources per transport + +- **DPDK** — `imissed + ierrors + rx_nombuf` parsed from `DAQIRI_LOG_INFO` + output (the bench's stderr). +- **RoCE** — count of `CQ error` lines in `DAQIRI_LOG_ERROR` output (RDMA + manager filters error completions but logs each one). +- **Socket UDP** — diff of the `drops` column in `/proc/net/udp` over the run. +- **Socket TCP** — `nstat -a` diff of `TcpExtTCPLostRetransmit` / + `TcpRetransSegs` / `TcpInErrs`. TCP has no clean "drops" semantic; this is + the closest proxy. + +#### External captures per run + +Each run records, alongside the bench: + +- `/proc/stat` snapshots before and after the bench process. The wrapper + computes per-core busy% for the master / TX / RX cores (cores 8 / 17 / 18 + on Spark) by delta over the run window. `mpstat` is not used — it is + often missing from minimal containers, and the per-core CPUs we care + about are pinned by the YAML so a targeted delta is more meaningful + than a system-wide average. +- `nvidia-smi dmon -s pucvmet -c ` — GPU SM%, mem-controller%, and + PCIe rxpci / txpci columns (the latter are reported as `-` on the + current Spark driver; SM% and mem% are near zero for plain GPUDirect + because the GPU is a DMA target, not a compute engine). + +Slow-moving state (kernel, OFED, NIC firmware, PCIe link, NUMA, +hugepages, GPU state, DAQIRI commit) is captured once per result set by +`bench_capture_environment.sh`. + +## C++ Results + +### DPDK GPUDirect + +Native shape on Spark is 8 KB payload, batch 10240 — the configuration the +raw GPUDirect path was designed around. All cells below ran for 30 s with +`drops == 0`. + +#### Drop curve at native shape + +Hold (payload=8000, batch=10240) constant; sweep `--target-gbps`. The +token-bucket pacer tracks target within ±0.02 Gbps until the link saturates +near 96 Gbps. Beyond that, target=100 and unpaced both report the +achievable ceiling. TX and RX cores spin in poll-mode regardless of target +rate (visible in the CPU table below). + +| target Gbps | achieved Gbps | RX pps | drops | TX core % | RX core % | +| ----------- | ------------- | --------- | ----- | --------- | --------- | +| 1 | 1.011 | 15,678 | 0 | 92.0 | 92.0 | +| 5 | 5.012 | 77,697 | 0 | 91.8 | 91.8 | +| 10 | 10.001 | 155,032 | 0 | 92.5 | 92.5 | +| 25 | 25.008 | 387,650 | 0 | 91.9 | 91.9 | +| 50 | 50.001 | 775,062 | 0 | 92.8 | 92.8 | +| 75 | 74.999 | 1,162,551 | 0 | 91.7 | 91.7 | +| 100 | 96.370 | 1,493,823 | 0 | 91.6 | 91.6 | +| unpaced | 95.897 | 1,486,498 | 0 | 91.6 | 90.5 | + +#### Payload × target_gbps matrix + +Holds batch=10240 constant; sweeps payload and `--target-gbps` +together. Each cell shows the achieved Gbps and drops over a 30 s +run. Coloring is relative to the **effective target** +`min(target_gbps, 96 Gbps link cap)`; the "unpaced" column uses the +link cap as its effective target: + +
+ green — no drops, achieved ≥ 95% of effective target + yellow — no drops, achieved ≥ 70% of effective target + red — drops, or achieved < 70% of effective target +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
payloadtarget Gbps
1510255075100unpaced
8000 B1.0 Gbps0 drops5.0 Gbps0 drops10.0 Gbps0 drops25.0 Gbps0 drops50.0 Gbps0 drops75.0 Gbps0 drops95.9 Gbps0 drops96.4 Gbps0 drops
4096 B1.0 Gbps0 drops5.0 Gbps0 drops10.0 Gbps0 drops25.0 Gbps0 drops50.0 Gbps0 drops75.0 Gbps0 drops100.0 Gbps0 drops102.7 Gbps0 drops
1024 B1.0 Gbps0 drops5.0 Gbps0 drops10.0 Gbps0 drops24.9 Gbps0 drops49.8 Gbps0 drops74.2 Gbps0 drops94.0 Gbps0 drops68.2 Gbps0 drops
256 B1.0 Gbps0 drops5.0 Gbps0 drops10.0 Gbps0 drops24.7 Gbps0 drops21.1 Gbps0 drops21.2 Gbps0 drops21.2 Gbps0 drops21.6 Gbps0 drops
64 B1.0 Gbps0 drops5.0 Gbps0 drops9.9 Gbps0 drops8.9 Gbps0 drops8.9 Gbps0 drops13.7 Gbps0 drops8.9 Gbps0 drops8.7 Gbps0 drops
+ +**Reading the matrix.** The token-bucket pacer tracks target within +sub-Gbps at payloads ≥ 1 KB right up to the link cap. The PPS +ceiling for each payload (~21 Gbps at 256 B, ~9 Gbps at 64 B) shows +up as a horizontal saturation band: once `target_gbps` crosses that +ceiling, increasing it further produces no additional throughput. +The 64 B / target=75 cell (13.7 Gbps) is a pacer transient when the +requested rate is well above achievable — the next cell at +target=100 falls back to the PPS ceiling. The 1024 B / unpaced cell +(68.2 Gbps, yellow) sits below its own paced cells; 1024 B is the +size where the master loop transitions from idle to fully busy +(`cpu_master_pct` jumps from ~3% to ~93% in the corresponding +unpaced run), and per-cell numbers in this regime carry roughly +±5 Gbps of run-to-run variance. + +#### Payload × batch matrix + +Each cell shows the achieved Gbps and drops over a 30 s unpaced run. +Coloring is relative to the global max across the matrix (here +**104.989 Gbps** at payload 4096 B, batch 4096): + +
+ green — no drops, Gbps ≥ 90% of max + yellow — no drops, Gbps ≥ 70% of max + red — drops, or Gbps < 70% of max +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
payloadbatch (packets per burst)
2561024409610240
8000 B96.4 Gbps0 drops96.4 Gbps0 drops96.4 Gbps0 drops95.9 Gbps0 drops
4096 B104.9 Gbps0 drops104.9 Gbps0 drops105.0 Gbps0 drops103.0 Gbps0 drops
1024 B64.9 Gbps0 drops86.2 Gbps0 drops71.0 Gbps0 drops72.7 Gbps0 drops
256 B24.5 Gbps0 drops24.8 Gbps0 drops29.3 Gbps0 drops21.1 Gbps0 drops
64 B10.1 Gbps0 drops10.3 Gbps0 drops9.8 Gbps0 drops10.3 Gbps0 drops
+ +**Reading the matrix.** At payload ≥ 4 KB the link saturates (~96–105 +Gbps) and batch size barely moves the number, so every cell is green. +The 1 KB row is the transition: pps and Gbps both matter, batch size +starts to influence which side dominates, and most cells fall under 70% +of the global max. At ≤ 256 B the bottleneck is packets-per-second +(~10 M pps ceiling at 64 B), so effective Gbps stays well below the +link ceiling regardless of batch. Run-to-run variance on the unpaced +cells is ~0.5 Gbps; row-internal Gbps differences smaller than that +should be treated as noise. + +#### CPU and GPU utilization (headline cell, payload 8000 B / batch 10240, unpaced) + +| Resource | Value | Note | +| -------------------- | ----- | ------------------------------------------ | +| Master core (CPU 8) | 3.3% | Mostly idle; orchestration only | +| TX core (CPU 17) | 91.4% | Poll-mode spin; rate-independent | +| RX core (CPU 18) | 91.4% | Poll-mode spin; rate-independent | +| GPU SM % | 0.0% | GPU is a DMA target, no compute | +| GPU mem-ctrl % | 0.0% | Payload writes traverse PCIe, not the GPU memory controller | + +The TX/RX cores stay at ~92% across every drop-curve step from 1 Gbps to +line rate — characteristic of DPDK's poll-mode driver, which spins waiting +for descriptors regardless of offered load. The master core handles +configuration only and idles below 5 % at the headline shape; at smaller +payload sizes (1 KB and below) it occasionally hits 90%+ as more bursts +flow through the orchestration path. That asymmetry is data, not a bug, +and is captured in the per-cell artifacts under `bench-results/`. + +### RoCE + +Native shape on Spark is an 8 MB SEND, batch 1, single QP — the +configuration `daqiri_bench_rdma_tx_rx_spark.yaml` is built around. The +bench runs `--mode both` (one client, one server) in a single process; both +endpoints are on the host, but RoCE bypasses kernel routing on the data +path so traffic actually crosses the QSFP loopback cable. The aggregate +rate at native shape was **~175 Gbps** bidirectional (server-TX 89.2 + +client-TX 83.4 single-direction). + +#### Payload sweep at native batch + +All cells: batch 1, unpaced, 30 s, `drops == 0`. These numbers are carried +forward from the original Spark data set and should be refreshed in a follow-up +hardware run after the current RDMA bench changes. + +| message_size | Pkts/s | Gbps | Notes | +| ------------ | -----: | -----: | ------------------------------------ | +| 8 MB | 1,303 | **83.4** | Native shape; saturates single QP. | +| 1 MB | 9,859 | 82.7 | Same wire ceiling, more pps. | +| 64 KB | 255 | 0.134 | Pending current-main refresh.[^1] | +| 8 KB | 88 | 0.006 | Matched cell; pending refresh.[^1] | +| 4 KB | 39 | 0.001 | Pending current-main refresh.[^1] | + +The 1 MB → 64 KB step is a 39× pps drop for a 16× smaller payload — well +out of line with what handshake-bound RC on a 200G link should do. The data +points are preserved here as historical Spark measurements, but the bench has +changed since they were captured. Treat the small-payload diagnosis as pending +refresh until the RoCE sweep is rerun on hardware with the current +`examples/rdma_bench.cpp` and RDMA manager. + +#### CPU and GPU utilization (headline cell, message 8 MB, batch 1, unpaced) + +| Channel | Busy% | Note | +| -------------------- | ----: | ------------------------------------------ | +| Master core (CPU 8) | 6.6% | Orchestration only | +| Client TX (CPU 17) | 90.3% | Post-and-poll spin, rate-independent | +| Server RX (CPU 18) | 2.6% | HCA writes directly to memory; CPU is idle for RDMA writes | +| GPU SM % | 0.0% | GPU is a DMA target, not a compute engine | +| GPU mem-ctrl % | 0.0% | Payload writes go through PCIe, not the GPU mem-controller | + +The RX-side core staying near idle is the expected RoCE RC signature — +the HCA places incoming data directly into registered memory without +CPU involvement. Note also that the `daqiri_bench_rdma_tx_rx_spark.yaml` +config nominally pins separate TX (16/17) and RX (18/19) cores per side, +but the current RDMA manager spawns only one thread per side and uses +the configured TX core; the RX cores are unused. + +### Socket + +**Data-fill pending.** Both UDP and TCP rows are blocked on a re-run of +`./examples/run_spark_bench.sh socket-{udp,tcp} sweep` with the current +wrapper parser — see footnotes [^2] / [^3] and the +[TODO / Known Limitations](#todo-not-yet-implemented-known-limitations) +section. The bench itself runs cleanly on Spark; the earlier "UDP +peer-learning deadlock" and "TCP glibc heap-corruption" diagnoses were +both closed as red herrings. + +### Workload variants (FFT, GEMM) + +The planned post-process layer ([issue #15](https://github.com/NVIDIA/daqiri/issues/15)) +adds a `--post-process {fft,gemm}` flag to the bench, runs `cuFFT` / +`cuBLAS` on the received GPU-resident payload, and reports the resulting +throughput delta and GPU utilization. + +**Sizes (representative):** + +- FFT: 1D complex-to-complex, length 1024. +- GEMM: fp32 square, N = 44 (largest tile that fits in an 8 KB payload). + +#### DPDK GPUDirect — FFT/GEMM + +_TBD (follow-up)._ + +#### RoCE — FFT/GEMM + +_TBD (follow-up)._ Note the unit-of-work mismatch when comparing across +transports: Socket / RoCE applies the post-process kernel once per ~8 MB SEND; +raw GPUDirect applies it once per packet. The throughput numbers are +comparable; "operations per burst" is not. + +## Python Results + +The planned Python benches ([issue #16](https://github.com/NVIDIA/daqiri/issues/16)) +mirror the C++ benches' CLI and stdout format, using the existing pybind11 +bindings. + +### Loopback + +_TBD (follow-up)._ + +### FFT / GEMM via pybind of the C++ post-process layer + +_TBD (follow-up)._ + +## Reproduce these results + +### Container + +All commands below assume execution inside the project container, as +required by [`AGENTS.md`](https://github.com/nvidia/daqiri/blob/main/AGENTS.md). +On the host, launch the container in privileged mode with all GPUs, +hugepage mounts, and `/sys` passed through, and bind-mount the repo at +`/workspace`: + +```bash +# RX-side NIC; auto-injects ETH_DST_ADDR for the DPDK bench wrappers. +RX_IFACE="${RX_IFACE:-enP2p1s0f0np0}" +sudo docker run --rm -it \ + --net host --ipc=host \ + --runtime=nvidia --gpus all \ + --privileged \ + --ulimit memlock=-1 --ulimit stack=67108864 \ + -v "$(pwd):/workspace" \ + -v /dev/hugepages:/dev/hugepages \ + -v /mnt/huge:/mnt/huge \ + -v /sys:/sys \ + -w /workspace \ + -e ETH_DST_ADDR="$(cat /sys/class/net/$RX_IFACE/address)" \ + -e RX_IFACE="$RX_IFACE" \ + daqiri:local \ + bash +``` + +### Build + +Inside the container: + +```bash +cmake -S . -B build -DBUILD_SHARED_LIBS=ON -DDAQIRI_BUILD_PYTHON=ON \ + -DDAQIRI_MGR="dpdk socket rdma" +cmake --build build -j +``` + +### One-shot driver + +The whole DPDK matrix (sweep + drop-curve) is driven by a single script +which handles pre-flight checks (hugepage availability, RX iface MAC, +link state), orphan-hugepage cleanup between cells, and a final summary of +per-target result directories: + +```bash +./scripts/spark_data_fill.sh dpdk +``` + +The script defaults to `dpdk socket-udp socket-tcp` if invoked with no +arguments. RDMA is currently rejected by the aggregate driver's pre-flight; +use `./examples/run_spark_bench.sh rdma ...` directly when refreshing RoCE +data. + +### Per-target wrapper invocations + +For ad-hoc runs of a single cell or a single mode: + +```bash +export DAQIRI_BUILD_DIR="$PWD/build" +export ETH_DST_ADDR="$(cat /sys/class/net//address)" # DPDK only + +./examples/run_spark_bench.sh dpdk smoke # native-shape headline cell +./examples/run_spark_bench.sh dpdk sweep # full payload × batch matrix +./examples/run_spark_bench.sh dpdk drop-curve # sweep --target-gbps +``` + +Each invocation emits `bench-results/-dpdk-/` containing +one subdirectory per cell (stdout / stderr / config / dmon / cpu_stat +captures), an `environment.txt` snapshot, and a `runs.csv` aggregating +the cell-level metrics. + +### Environment-only capture + +Useful for filing a bug report or comparing two Spark units without +running the bench: + +```bash +./examples/bench_capture_environment.sh /tmp/spark-env +``` + +### Tuning prerequisites + +System tuning is required before the numbers in this report are +reproducible. See +[`docs/tutorials/system_configuration.md`](tutorials/system_configuration.md) +for the DGX Spark tab — isolated cores, hugepages, governor, IRQ affinity. + +For RoCE specifically, the host needs source-based policy routing and +static ARP entries pinned before the bench can use the QSFP loopback +cable (single-NIC cross-cabled topology with both endpoint IPs local; +without pinning, `rdma_resolve_route`'s neighbour lookup is ambiguous): + +```bash +sudo ./scripts/setup_spark_rdma_loopback.sh +``` + +The script is idempotent and only modifies host state for the two +data-plane ports `enp1s0f0np0` (1.1.1.1) and `enP2p1s0f0np0` (2.2.2.2). + +## TODO: Not Yet Implemented / Known Limitations + +- **HDS (Header–Data Split) is deferred.** The generic HDS configuration uses + `kind: device` for GPU memory regions. Spark / GB10 cannot use device memory + for GPUDirect — `nvidia_peermem` does not load and DMA-BUF is unreachable — + so DAQIRI uses `host_pinned` instead. Under `host_pinned`, the HDS layout no + longer changes the memory path; it only changes the segment partition, + which makes "HDS vs. plain GPUDirect" a non-distinction on this platform. + HDS is characterized when this report extends to IGX and x86-server + platforms where device memory works. See + [issue #15](https://github.com/NVIDIA/daqiri/issues/15) for tracking. +- **RoCE small-payload pps needs a refresh on current main.** The carried-forward + Spark data saturates single-stream at ~83 Gbps for 8 MB and 1 MB messages, + but 64 KB and below collapse out of proportion to payload size (4 KB at + 0.001 Gbps). The bench and RDMA manager changed after those numbers were + captured, so the matched 8 KB RoCE cell is footnoted [^1] until a follow-up + hardware run refreshes the sweep. +- **Socket UDP / TCP results are pending re-run** with the wrapper-side + parse fix now on main. The previous sweep CSVs are zero-filled + because `run_spark_bench.sh`'s fallback only knew RDMA's + `send_completions`/`send_bytes` keys; `socket_bench` actually emits + `sent_packets`/`sent_bytes`. With that fixed, both socket transports are + expected to produce real numbers — neither of the earlier "deadlock" / "heap + corruption" diagnoses held up under investigation (both red herrings, + closed not-a-bug). See footnotes [^2] / [^3]. +- **p99/p999 latency is not in v1.** The bench output captures throughput, + drops, and resource utilization. Per-burst RX timestamping and percentile + aggregation are deferred to a follow-up issue. diff --git a/docs/stylesheets/extra.css b/docs/stylesheets/extra.css index afc366a..ce8f976 100644 --- a/docs/stylesheets/extra.css +++ b/docs/stylesheets/extra.css @@ -368,3 +368,64 @@ body[data-active-tab="igx-orin"] .md-nav--secondary li:has(> a[data-tab="dgx-spa body[data-active-tab="dgx-spark"] .md-nav--secondary li:has(> a[data-tab="igx-orin"]) { display: none; } + +/* ── Performance-report heatmap cells ───────────────────────────────── */ +/* Used by the payload×batch and payload×target_gbps matrices in + docs/performance-*.md. Threshold logic (vs. matrix-global max): + green = no drops AND Gbps ≥ 90% of max + yellow = no drops AND Gbps ≥ 70% of max + red = any drops OR Gbps < 70% of max */ +.md-typeset table.perf-matrix { + width: 100%; + table-layout: auto; + border-collapse: separate; + border-spacing: 5px; + font-size: 0.64rem; +} +.md-typeset table.perf-matrix th, +.md-typeset table.perf-matrix td { + text-align: center; + vertical-align: middle; + font-variant-numeric: tabular-nums; + padding: 0.55em 0.5em; + border-radius: 4px; + white-space: nowrap; +} +.md-typeset table.perf-matrix td small { + display: block; + opacity: 0.75; + font-size: 0.85em; + margin-top: 0.25em; +} +.md-typeset table.perf-matrix td.cell-green { + background-color: rgba(118, 185, 0, 0.28); + color: inherit; +} +.md-typeset table.perf-matrix td.cell-yellow { + background-color: rgba(255, 196, 0, 0.32); + color: inherit; +} +.md-typeset table.perf-matrix td.cell-red { + background-color: rgba(220, 60, 60, 0.32); + color: inherit; +} +.md-typeset table.perf-matrix th { + background-color: rgba(255, 255, 255, 0.05); + font-weight: 600; +} +/* Compact legend chips that pair with the matrix. */ +.md-typeset .perf-legend { + display: flex; + gap: 0.75em; + margin: 0.5em 0 1em 0; + font-size: 0.85em; + flex-wrap: wrap; +} +.md-typeset .perf-legend span { + padding: 0.1em 0.55em; + border-radius: 0.25em; + white-space: nowrap; +} +.md-typeset .perf-legend .cell-green { background-color: rgba(118, 185, 0, 0.28); } +.md-typeset .perf-legend .cell-yellow { background-color: rgba(255, 196, 0, 0.32); } +.md-typeset .perf-legend .cell-red { background-color: rgba(220, 60, 60, 0.32); } diff --git a/mkdocs.yml b/mkdocs.yml index 65d9e7d..8a10698 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -50,6 +50,8 @@ nav: - Getting Started: getting-started.md - Concepts: concepts.md - Benchmarks: tutorials/benchmarking_examples.md + - Performance: + - DGX Spark: performance-dgx-spark.md - API Reference: - API Guide: api-reference/index.md - Configuration YAML Reference: api-reference/configuration.md @@ -64,6 +66,7 @@ markdown_extensions: - admonition - attr_list - def_list + - footnotes - md_in_html - tables - toc: