diff --git a/AGENTS.md b/AGENTS.md
index effe7ff..64160ba 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -96,6 +96,7 @@ The web docs live in `docs/` and are built with [MkDocs Material](https://squidf
 - `docs/concepts.md` — terminology glossary (stream types and protocols, GPUDirect, packet/burst/segment, flow/queue, memory region, zero-copy ownership, RX reorder). Meant to be opened in parallel with the rest of the docs.
 - `docs/api-reference/index.md` — API guide (6-step application lifecycle, configuration-first model)
 - `docs/api-reference/configuration.md`, `docs/api-reference/cpp.md`, `docs/api-reference/python.md` — YAML schema, C++ API, and Python bindings docs
+- `docs/performance-dgx-spark.md` — per-platform performance report for DGX Spark stream/protocol combinations
 - `docs/tutorials/` — tutorial walkthroughs (system config, config-file walkthrough)
 - `docs/tutorials/benchmarking_examples.md` — surfaced as a top-level "Benchmarks" nav entry in `mkdocs.yml` and `docs/index.html`; file kept at its original path for inbound-link stability
 - `docs/stylesheets/extra.css` — custom theme overrides
@@ -105,9 +106,10 @@ The web docs live in `docs/` and are built with [MkDocs Material](https://squidf
 **Keeping docs in sync with code:** before committing changes, scan for the recurring drift hotspots:
 - **Stream-type list** (`src/managers/*/`) — README Backends table, `docs/getting-started.md`, `docs/concepts.md` (Stream Types section + Support and testing admonition), `docs/api-reference/configuration.md`
 - **CMake options / `DAQIRI_MGR` default** (`src/CMakeLists.txt:137`) — README Quick Start, `docs/getting-started.md`, this file's Build & run section
-- **Benchmark binary or YAML names** (`examples/`) — the benchmark table above, `docs/tutorials/benchmarking_examples.md`, and the "Choosing an example config" decision tree in `docs/tutorials/configuration-walkthrough.md` (every YAML must have a leaf; CI's `scripts/check_doc_refs.py` enforces coverage)
+- **Benchmark binary or YAML names** (`examples/`) — the benchmark table above, `docs/tutorials/benchmarking_examples.md`, the "Choosing an example config" decision tree in `docs/tutorials/configuration-walkthrough.md` (every YAML must have a leaf; CI's `scripts/check_doc_refs.py` enforces coverage), and per-platform performance docs (`docs/performance-*.md`)
 - **Public API include** (`#include <daqiri/daqiri.h>`; source files under `include/daqiri/`) — `docs/api-reference/index.md`, `docs/api-reference/cpp.md`, `docs/api-reference/python.md`; if the change adds or renames a user-facing concept, also `docs/concepts.md`
 - **Python bindings** (`python/daqiri_common_pybind.cpp`) — `docs/api-reference/python.md` (function reference tables, enums/classes tables, GIL Behavior section)
+- **Bench CLI flags or output format** (`examples/raw_bench_common.{h,cpp}`, `*_bench.cpp`) — per-platform performance docs' Methodology section, `examples/run_spark_bench.sh` parsing logic
 - **Doc reorganization** (any rename in `docs/`) — `docs/index.html` landing page, `mkdocs.yml` nav, README Documentation table
 
 The full mapping with rationale lives in the docs-sync agent rule. Internal-link, anchor, and nav drift is enforced by CI (`.github/workflows/docs.yml`); content drift (stale binary names, defaults) is still a manual check at commit time.
diff --git a/README.md b/README.md
index 44867bd..36db3d0 100644
--- a/README.md
+++ b/README.md
@@ -91,6 +91,7 @@ Reference material for the DAQIRI codebase:
 - [Configuration YAML Reference](https://nvidia.github.io/daqiri/api-reference/configuration/) — Full YAML config reference for all backends
 - [C++ API Usage](https://nvidia.github.io/daqiri/api-reference/cpp/) — C++ RX/TX workflows, buffer lifecycle, file writing, utilities, and status codes
 - [Python API Usage](https://nvidia.github.io/daqiri/api-reference/python/) — Python bindings, workflow examples, enums, config classes, and helper functions
+- [Performance: DGX Spark](https://nvidia.github.io/daqiri/performance-dgx-spark/) — Per-platform throughput, drop, and utilization numbers for stream/protocol combinations on DGX Spark
 - [Contributing](CONTRIBUTING.md) — Contribution guidelines, coding standards, DCO sign-off
 
 ## Tutorials
diff --git a/docs/index.html b/docs/index.html
index f5b49e8..9e2f53a 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -684,6 +684,12 @@ <h2 class="section-title">News</h2>
         </div>
       </div>
       <div class="pub-grid">
+        <div class="pub-card">
+          <div class="pub-venue"><span class="pub-badge">Performance</span><span class="pub-year">2026</span></div>
+          <div class="pub-title">DAQIRI Performance on DGX Spark</div>
+          <div class="pub-authors">NVIDIA — Throughput, drops, and resource utilization for raw GPUDirect, Socket / RoCE, and Socket / UDP/TCP measured on a DGX Spark (GB10) workstation. First in a series of per-platform performance reports.</div>
+          <div class="pub-links"><a href="performance-dgx-spark/" class="pub-link">Read report →</a></div>
+        </div>
         <div class="pub-card">
           <div class="pub-venue"><span class="pub-badge">GitHub</span><span class="pub-year">2025</span></div>
           <div class="pub-title">DAQIRI Open-Sourced on GitHub</div>
diff --git a/docs/performance-dgx-spark.md b/docs/performance-dgx-spark.md
new file mode 100644
index 0000000..d7736fb
--- /dev/null
+++ b/docs/performance-dgx-spark.md
@@ -0,0 +1,612 @@
+# Performance: DGX Spark
+
+This page reports DAQIRI throughput, drop, and resource-utilization numbers
+measured on a DGX Spark (GB10) workstation. It is the first in a series of
+per-platform performance reports; the same section layout will be reused for
+IGX, x86-server, and other targets.
+
+The numbers below are reproducible — every cell is generated by
+[`examples/run_spark_bench.sh`](https://github.com/nvidia/daqiri/blob/main/examples/run_spark_bench.sh)
+against the YAML configs in `examples/`, with the system state captured by
+[`examples/bench_capture_environment.sh`](https://github.com/nvidia/daqiri/blob/main/examples/bench_capture_environment.sh)
+alongside each result set.
+
+**Contents**
+
+- [Summary](#summary)
+- [Introduction](#introduction) — system under test, methodology
+- [C++ Results](#c-results) — raw GPUDirect, Socket / RoCE, Socket / UDP and TCP, workload variants
+- [Python Results](#python-results)
+- [Reproduce these results](#reproduce-these-results)
+- [TODO: Not Yet Implemented / Known Limitations](#todo-not-yet-implemented-known-limitations)
+
+## Summary
+
+Two headline tables. **Native-shape peak** reports each stream/protocol
+combination at its best-case operation size. **Matched 8 KB** drives raw
+GPUDirect, Socket / RoCE, and Socket / UDP at a common ~8 KB unit of work so
+the comparison is apples-to-apples; TCP is omitted from that table since it
+has no operation boundary.
+
+### Native-shape peak — max no-drop throughput (Gbps)
+
+| Stream / Protocol              | C++ loopback   | C++ + FFT        | C++ + GEMM       | Python loopback   | Python + FFT     | Python + GEMM    |
+| ------------------------------ | -------------- | ---------------- | ---------------- | ----------------- | ---------------- | ---------------- |
+| Raw Ethernet / GPUDirect (8 KB) | **96.4**       | _TBD (follow-up)_ | _TBD (follow-up)_ | _TBD (follow-up)_ | _TBD (follow-up)_ | _TBD (follow-up)_ |
+| Socket / RoCE (8 MB SEND)      | **83.4**[^1]   | _TBD (follow-up)_ | _TBD (follow-up)_ | _TBD (follow-up)_ | _TBD (follow-up)_ | _TBD (follow-up)_ |
+| Socket / UDP (MTU)             | _deferred_[^2] | n/a              | n/a              | _TBD (follow-up)_ | n/a              | n/a              |
+| Socket / TCP (stream)          | _deferred_[^3] | n/a              | n/a              | _TBD (follow-up)_ | n/a              | n/a              |
+
+### Matched 8 KB operation — cross-transport Gbps
+
+| Stream / Protocol               | C++ loopback   | C++ + FFT        | C++ + GEMM       | Python loopback   | Python + FFT     | Python + GEMM    |
+| ------------------------------- | -------------- | ---------------- | ---------------- | ----------------- | ---------------- | ---------------- |
+| Raw Ethernet / GPUDirect        | **96.4**       | _TBD (follow-up)_ | _TBD (follow-up)_ | _TBD (follow-up)_ | _TBD (follow-up)_ | _TBD (follow-up)_ |
+| Socket / RoCE                   | 0.006[^1]      | _TBD (follow-up)_ | _TBD (follow-up)_ | _TBD (follow-up)_ | _TBD (follow-up)_ | _TBD (follow-up)_ |
+| Socket / UDP (1472 B, MTU)      | _deferred_[^2] | n/a              | n/a              | _TBD (follow-up)_ | n/a              | n/a              |
+
+[^1]: RoCE single-host loopback on Spark requires host network prereqs before the bench can use the cable — see [Tuning prerequisites](#tuning-prerequisites). The native-shape 8 MB cell saturates at ~83 Gbps single-stream (single QP, batch 1); the matched 8 KB cell comes from the original Spark data set and should be treated as pending refresh rather than a final platform limit after the current RDMA bench changes. The aggregate bidirectional rate at native shape on the same run was ~175 Gbps (server TX 89.2 + client TX 83.4).
+[^2]: Socket UDP rows are pending re-run of `./examples/run_spark_bench.sh socket-udp sweep` with the current wrapper-side fix that maps `sent_packets`/`sent_bytes` from `socket_bench` (the previous wrapper only knew the RDMA `send_completions`/`send_bytes` keys and zero-filled the CSV for socket runs). The earlier "peer-learning deadlock" diagnosis was a red herring — bench behaves correctly; the CSV was bogus because of the wrapper bug. Re-run tracked outside this PR.
+[^3]: Socket TCP rows are pending re-run for the same reason as Socket UDP (see [^2]). The earlier "glibc heap-corruption" diagnosis was also a red herring; the bench exits cleanly under ASan across 20+ runs.
+
+!!! note "Why two tables"
+    A single Gbps number isn't enough to compare stream/protocol combinations
+    fairly. Raw GPUDirect at peak with 8 KB packets is doing very different
+    work than Socket / RoCE at peak with 8 MB messages, even when both report
+    the same Gbps. The matched 8 KB table makes the comparison honest; the
+    native-shape table shows each combination's design ceiling.
+
+## Introduction
+
+### System under test
+
+The reproducibility appendix has the full capture. Key fields:
+
+| Field             | Value                                |
+| ----------------- | ------------------------------------ |
+| Platform          | NVIDIA DGX Spark (GB10)              |
+| GPU               | Blackwell (compute capability 12.1)  |
+| NIC               | ConnectX-7 (two ports, tied / QSFP loopback) |
+| Topology          | `0000:01:00.0` (mlx5_0) → `0002:01:00.0` (mlx5_2), physical loopback cable |
+| OS                | _captured in `environment.txt`_      |
+| Kernel            | _captured in `environment.txt`_      |
+| CUDA driver       | _captured in `environment.txt`_      |
+| DPDK              | patched per `dpdk_patches/` (container build) |
+| DAQIRI commit     | _captured in `environment.txt`_      |
+
+### Methodology
+
+#### Bench commands
+
+Each stream/protocol combination has a dedicated bench executable in
+`examples/`. The raw GPUDirect numbers in this report come from the first
+command; the Socket / RoCE and Socket / UDP/TCP commands are listed here for
+documentation and will be the basis for the future fill of those rows.
+
+```bash
+# DPDK GPUDirect — physical loopback (used in this report)
+./build/examples/daqiri_bench_raw_gpudirect \
+    examples/daqiri_bench_raw_tx_rx_spark.yaml \
+    --seconds 30 [--target-gbps G]
+
+# RoCE — same NIC, two ports cross-cabled. Host prereq: scripts/setup_spark_rdma_loopback.sh
+./build/examples/daqiri_bench_rdma \
+    examples/daqiri_bench_rdma_tx_rx_spark.yaml \
+    --seconds 30 --mode both [--target-gbps G]
+
+# Socket UDP / TCP — localhost (deferred; see TODO / Known Limitations)
+./build/examples/daqiri_bench_socket \
+    examples/daqiri_bench_socket_udp_tx_rx.yaml \
+    --seconds 30 --mode both [--target-gbps G]
+```
+
+The DPDK YAML expects `eth_dst_addr` filled from the RX iface MAC:
+
+```bash
+ETH_DST_ADDR="$(cat /sys/class/net/<rx-iface>/address)"
+```
+
+#### Per stream/protocol sweep dimensions
+
+**Payload** is the user-data bytes in one packet (DPDK / Socket UDP),
+one RDMA message, or one TCP send. **Batch** is how many packets DAQIRI
+hands to (or pulls from) the NIC in one `rte_eth_tx_burst` /
+`rte_eth_rx_burst` call — the burst size knob, not a packet-size knob.
+Larger batches amortize doorbell and API overhead per packet; smaller
+batches keep per-call latency lower. Batch only matters when the bench
+is not yet at the link ceiling — at saturation, the NIC is the
+bottleneck and batch size has near-zero effect.
+
+The "payload × batch" sweep doesn't map uniformly across stream/protocol
+combinations. Each run target has its own sweep:
+
+| Stream / Protocol       | Sweep dim 1                                            | Sweep dim 2                            | Native-shape cell    | Matched-size cell      |
+| ----------------------- | ------------------------------------------------------ | -------------------------------------- | -------------------- | ---------------------- |
+| Raw Ethernet / GPUDirect | payload_size in {64, 256, 1024, 4096, 8000} B          | batch_size in {256, 1024, 4096, 10240} | (8000, 10240)        | (8000, 10240) — same   |
+| Socket / RoCE           | message_size in {4 K, 64 K, 1 M, 8 M} B                | batch_size fixed at 1                  | (8 M, 1)             | (8 K, 1)               |
+| Socket / UDP            | payload_size in {64, 256, 1024, 1472} B (MTU-bound)    | batch_size fixed at 1                  | (1472, 1)            | (1472, 1) — closest to 8 K under MTU cap |
+| Socket / TCP            | message_size in {1 K, 64 K, 1 M} B                     | n/a (single stream)                    | (64 K)               | n/a                    |
+
+#### "No-drop" threshold
+
+A run is **drop-free** when reported `drops == 0` over a `--seconds 30` run.
+The headline tables report the highest target rate at which a run was still
+drop-free under this threshold. The methodology does not use a percentile cap
+— either there are drops or there are not.
+
+#### Drop-curve sweep
+
+The drop curve sweeps `--target-gbps` while holding the native-shape cell
+constant. The token-bucket pacer in the bench TX worker (`raw_bench_common`)
+adds a software-paced sleep after each burst; accuracy is ~±5 % at high rates
+due to OS sleep granularity and scheduler jitter. Hardware TX pacing (DPDK
+`accurate_send`) is unused but would tighten DPDK-only precision; deferred to
+a follow-up.
+
+#### Drop sources per transport
+
+- **DPDK** — `imissed + ierrors + rx_nombuf` parsed from `DAQIRI_LOG_INFO`
+  output (the bench's stderr).
+- **RoCE** — count of `CQ error` lines in `DAQIRI_LOG_ERROR` output (RDMA
+  manager filters error completions but logs each one).
+- **Socket UDP** — diff of the `drops` column in `/proc/net/udp` over the run.
+- **Socket TCP** — `nstat -a` diff of `TcpExtTCPLostRetransmit` /
+  `TcpRetransSegs` / `TcpInErrs`. TCP has no clean "drops" semantic; this is
+  the closest proxy.
+
+#### External captures per run
+
+Each run records, alongside the bench:
+
+- `/proc/stat` snapshots before and after the bench process. The wrapper
+  computes per-core busy% for the master / TX / RX cores (cores 8 / 17 / 18
+  on Spark) by delta over the run window. `mpstat` is not used — it is
+  often missing from minimal containers, and the per-core CPUs we care
+  about are pinned by the YAML so a targeted delta is more meaningful
+  than a system-wide average.
+- `nvidia-smi dmon -s pucvmet -c <N>` — GPU SM%, mem-controller%, and
+  PCIe rxpci / txpci columns (the latter are reported as `-` on the
+  current Spark driver; SM% and mem% are near zero for plain GPUDirect
+  because the GPU is a DMA target, not a compute engine).
+
+Slow-moving state (kernel, OFED, NIC firmware, PCIe link, NUMA,
+hugepages, GPU state, DAQIRI commit) is captured once per result set by
+`bench_capture_environment.sh`.
+
+## C++ Results
+
+### DPDK GPUDirect
+
+Native shape on Spark is 8 KB payload, batch 10240 — the configuration the
+raw GPUDirect path was designed around. All cells below ran for 30 s with
+`drops == 0`.
+
+#### Drop curve at native shape
+
+Hold (payload=8000, batch=10240) constant; sweep `--target-gbps`. The
+token-bucket pacer tracks target within ±0.02 Gbps until the link saturates
+near 96 Gbps. Beyond that, target=100 and unpaced both report the
+achievable ceiling. TX and RX cores spin in poll-mode regardless of target
+rate (visible in the CPU table below).
+
+| target Gbps | achieved Gbps | RX pps    | drops | TX core % | RX core % |
+| ----------- | ------------- | --------- | ----- | --------- | --------- |
+| 1           | 1.011         |    15,678 | 0     | 92.0      | 92.0      |
+| 5           | 5.012         |    77,697 | 0     | 91.8      | 91.8      |
+| 10          | 10.001        |   155,032 | 0     | 92.5      | 92.5      |
+| 25          | 25.008        |   387,650 | 0     | 91.9      | 91.9      |
+| 50          | 50.001        |   775,062 | 0     | 92.8      | 92.8      |
+| 75          | 74.999        | 1,162,551 | 0     | 91.7      | 91.7      |
+| 100         | 96.370        | 1,493,823 | 0     | 91.6      | 91.6      |
+| unpaced     | 95.897        | 1,486,498 | 0     | 91.6      | 90.5      |
+
+#### Payload × target_gbps matrix
+
+Holds batch=10240 constant; sweeps payload and `--target-gbps`
+together. Each cell shows the achieved Gbps and drops over a 30 s
+run. Coloring is relative to the **effective target**
+`min(target_gbps, 96 Gbps link cap)`; the "unpaced" column uses the
+link cap as its effective target:
+
+<div class="perf-legend" markdown="0">
+  <span class="cell-green">green — no drops, achieved ≥ 95% of effective target</span>
+  <span class="cell-yellow">yellow — no drops, achieved ≥ 70% of effective target</span>
+  <span class="cell-red">red — drops, or achieved &lt; 70% of effective target</span>
+</div>
+
+<table class="perf-matrix" markdown="0">
+  <thead>
+    <tr>
+      <th rowspan="2">payload</th>
+      <th colspan="8">target Gbps</th>
+    </tr>
+    <tr>
+      <th>1</th><th>5</th><th>10</th><th>25</th><th>50</th><th>75</th><th>100</th><th>unpaced</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <th>8000 B</th>
+      <td class="cell-green">1.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">5.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">10.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">25.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">50.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">75.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">95.9 Gbps<small>0 drops</small></td>
+      <td class="cell-green">96.4 Gbps<small>0 drops</small></td>
+    </tr>
+    <tr>
+      <th>4096 B</th>
+      <td class="cell-green">1.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">5.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">10.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">25.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">50.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">75.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">100.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">102.7 Gbps<small>0 drops</small></td>
+    </tr>
+    <tr>
+      <th>1024 B</th>
+      <td class="cell-green">1.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">5.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">10.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">24.9 Gbps<small>0 drops</small></td>
+      <td class="cell-green">49.8 Gbps<small>0 drops</small></td>
+      <td class="cell-green">74.2 Gbps<small>0 drops</small></td>
+      <td class="cell-green">94.0 Gbps<small>0 drops</small></td>
+      <td class="cell-yellow">68.2 Gbps<small>0 drops</small></td>
+    </tr>
+    <tr>
+      <th>256 B</th>
+      <td class="cell-green">1.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">5.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">10.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">24.7 Gbps<small>0 drops</small></td>
+      <td class="cell-red">21.1 Gbps<small>0 drops</small></td>
+      <td class="cell-red">21.2 Gbps<small>0 drops</small></td>
+      <td class="cell-red">21.2 Gbps<small>0 drops</small></td>
+      <td class="cell-red">21.6 Gbps<small>0 drops</small></td>
+    </tr>
+    <tr>
+      <th>64 B</th>
+      <td class="cell-green">1.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">5.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">9.9 Gbps<small>0 drops</small></td>
+      <td class="cell-red">8.9 Gbps<small>0 drops</small></td>
+      <td class="cell-red">8.9 Gbps<small>0 drops</small></td>
+      <td class="cell-red">13.7 Gbps<small>0 drops</small></td>
+      <td class="cell-red">8.9 Gbps<small>0 drops</small></td>
+      <td class="cell-red">8.7 Gbps<small>0 drops</small></td>
+    </tr>
+  </tbody>
+</table>
+
+**Reading the matrix.** The token-bucket pacer tracks target within
+sub-Gbps at payloads ≥ 1 KB right up to the link cap. The PPS
+ceiling for each payload (~21 Gbps at 256 B, ~9 Gbps at 64 B) shows
+up as a horizontal saturation band: once `target_gbps` crosses that
+ceiling, increasing it further produces no additional throughput.
+The 64 B / target=75 cell (13.7 Gbps) is a pacer transient when the
+requested rate is well above achievable — the next cell at
+target=100 falls back to the PPS ceiling. The 1024 B / unpaced cell
+(68.2 Gbps, yellow) sits below its own paced cells; 1024 B is the
+size where the master loop transitions from idle to fully busy
+(`cpu_master_pct` jumps from ~3% to ~93% in the corresponding
+unpaced run), and per-cell numbers in this regime carry roughly
+±5 Gbps of run-to-run variance.
+
+#### Payload × batch matrix
+
+Each cell shows the achieved Gbps and drops over a 30 s unpaced run.
+Coloring is relative to the global max across the matrix (here
+**104.989 Gbps** at payload 4096 B, batch 4096):
+
+<div class="perf-legend" markdown="0">
+  <span class="cell-green">green — no drops, Gbps ≥ 90% of max</span>
+  <span class="cell-yellow">yellow — no drops, Gbps ≥ 70% of max</span>
+  <span class="cell-red">red — drops, or Gbps &lt; 70% of max</span>
+</div>
+
+<table class="perf-matrix" markdown="0">
+  <thead>
+    <tr>
+      <th rowspan="2">payload</th>
+      <th colspan="4">batch (packets per burst)</th>
+    </tr>
+    <tr>
+      <th>256</th><th>1024</th><th>4096</th><th>10240</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <th>8000 B</th>
+      <td class="cell-green">96.4 Gbps<small>0 drops</small></td>
+      <td class="cell-green">96.4 Gbps<small>0 drops</small></td>
+      <td class="cell-green">96.4 Gbps<small>0 drops</small></td>
+      <td class="cell-green">95.9 Gbps<small>0 drops</small></td>
+    </tr>
+    <tr>
+      <th>4096 B</th>
+      <td class="cell-green">104.9 Gbps<small>0 drops</small></td>
+      <td class="cell-green">104.9 Gbps<small>0 drops</small></td>
+      <td class="cell-green">105.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">103.0 Gbps<small>0 drops</small></td>
+    </tr>
+    <tr>
+      <th>1024 B</th>
+      <td class="cell-red">64.9 Gbps<small>0 drops</small></td>
+      <td class="cell-yellow">86.2 Gbps<small>0 drops</small></td>
+      <td class="cell-red">71.0 Gbps<small>0 drops</small></td>
+      <td class="cell-red">72.7 Gbps<small>0 drops</small></td>
+    </tr>
+    <tr>
+      <th>256 B</th>
+      <td class="cell-red">24.5 Gbps<small>0 drops</small></td>
+      <td class="cell-red">24.8 Gbps<small>0 drops</small></td>
+      <td class="cell-red">29.3 Gbps<small>0 drops</small></td>
+      <td class="cell-red">21.1 Gbps<small>0 drops</small></td>
+    </tr>
+    <tr>
+      <th>64 B</th>
+      <td class="cell-red">10.1 Gbps<small>0 drops</small></td>
+      <td class="cell-red">10.3 Gbps<small>0 drops</small></td>
+      <td class="cell-red">9.8 Gbps<small>0 drops</small></td>
+      <td class="cell-red">10.3 Gbps<small>0 drops</small></td>
+    </tr>
+  </tbody>
+</table>
+
+**Reading the matrix.** At payload ≥ 4 KB the link saturates (~96–105
+Gbps) and batch size barely moves the number, so every cell is green.
+The 1 KB row is the transition: pps and Gbps both matter, batch size
+starts to influence which side dominates, and most cells fall under 70%
+of the global max. At ≤ 256 B the bottleneck is packets-per-second
+(~10 M pps ceiling at 64 B), so effective Gbps stays well below the
+link ceiling regardless of batch. Run-to-run variance on the unpaced
+cells is ~0.5 Gbps; row-internal Gbps differences smaller than that
+should be treated as noise.
+
+#### CPU and GPU utilization (headline cell, payload 8000 B / batch 10240, unpaced)
+
+| Resource             | Value | Note                                       |
+| -------------------- | ----- | ------------------------------------------ |
+| Master core (CPU 8)  |  3.3% | Mostly idle; orchestration only            |
+| TX core (CPU 17)     | 91.4% | Poll-mode spin; rate-independent           |
+| RX core (CPU 18)     | 91.4% | Poll-mode spin; rate-independent           |
+| GPU SM %             |  0.0% | GPU is a DMA target, no compute            |
+| GPU mem-ctrl %       |  0.0% | Payload writes traverse PCIe, not the GPU memory controller |
+
+The TX/RX cores stay at ~92% across every drop-curve step from 1 Gbps to
+line rate — characteristic of DPDK's poll-mode driver, which spins waiting
+for descriptors regardless of offered load. The master core handles
+configuration only and idles below 5 % at the headline shape; at smaller
+payload sizes (1 KB and below) it occasionally hits 90%+ as more bursts
+flow through the orchestration path. That asymmetry is data, not a bug,
+and is captured in the per-cell artifacts under `bench-results/`.
+
+### RoCE
+
+Native shape on Spark is an 8 MB SEND, batch 1, single QP — the
+configuration `daqiri_bench_rdma_tx_rx_spark.yaml` is built around. The
+bench runs `--mode both` (one client, one server) in a single process; both
+endpoints are on the host, but RoCE bypasses kernel routing on the data
+path so traffic actually crosses the QSFP loopback cable. The aggregate
+rate at native shape was **~175 Gbps** bidirectional (server-TX 89.2 +
+client-TX 83.4 single-direction).
+
+#### Payload sweep at native batch
+
+All cells: batch 1, unpaced, 30 s, `drops == 0`. These numbers are carried
+forward from the original Spark data set and should be refreshed in a follow-up
+hardware run after the current RDMA bench changes.
+
+| message_size | Pkts/s | Gbps   | Notes                                |
+| ------------ | -----: | -----: | ------------------------------------ |
+| 8 MB         |  1,303 | **83.4** | Native shape; saturates single QP. |
+| 1 MB         |  9,859 |   82.7 | Same wire ceiling, more pps.         |
+| 64 KB        |    255 |  0.134 | Pending current-main refresh.[^1]    |
+| 8 KB         |     88 |  0.006 | Matched cell; pending refresh.[^1]   |
+| 4 KB         |     39 |  0.001 | Pending current-main refresh.[^1]    |
+
+The 1 MB → 64 KB step is a 39× pps drop for a 16× smaller payload — well
+out of line with what handshake-bound RC on a 200G link should do. The data
+points are preserved here as historical Spark measurements, but the bench has
+changed since they were captured. Treat the small-payload diagnosis as pending
+refresh until the RoCE sweep is rerun on hardware with the current
+`examples/rdma_bench.cpp` and RDMA manager.
+
+#### CPU and GPU utilization (headline cell, message 8 MB, batch 1, unpaced)
+
+| Channel              | Busy% | Note                                       |
+| -------------------- | ----: | ------------------------------------------ |
+| Master core (CPU 8)  |  6.6% | Orchestration only                         |
+| Client TX (CPU 17)   | 90.3% | Post-and-poll spin, rate-independent       |
+| Server RX (CPU 18)   |  2.6% | HCA writes directly to memory; CPU is idle for RDMA writes |
+| GPU SM %             |  0.0% | GPU is a DMA target, not a compute engine  |
+| GPU mem-ctrl %       |  0.0% | Payload writes go through PCIe, not the GPU mem-controller |
+
+The RX-side core staying near idle is the expected RoCE RC signature —
+the HCA places incoming data directly into registered memory without
+CPU involvement. Note also that the `daqiri_bench_rdma_tx_rx_spark.yaml`
+config nominally pins separate TX (16/17) and RX (18/19) cores per side,
+but the current RDMA manager spawns only one thread per side and uses
+the configured TX core; the RX cores are unused.
+
+### Socket
+
+**Data-fill pending.** Both UDP and TCP rows are blocked on a re-run of
+`./examples/run_spark_bench.sh socket-{udp,tcp} sweep` with the current
+wrapper parser — see footnotes [^2] / [^3] and the
+[TODO / Known Limitations](#todo-not-yet-implemented-known-limitations)
+section. The bench itself runs cleanly on Spark; the earlier "UDP
+peer-learning deadlock" and "TCP glibc heap-corruption" diagnoses were
+both closed as red herrings.
+
+### Workload variants (FFT, GEMM)
+
+The planned post-process layer ([issue #15](https://github.com/NVIDIA/daqiri/issues/15))
+adds a `--post-process {fft,gemm}` flag to the bench, runs `cuFFT` /
+`cuBLAS` on the received GPU-resident payload, and reports the resulting
+throughput delta and GPU utilization.
+
+**Sizes (representative):**
+
+- FFT: 1D complex-to-complex, length 1024.
+- GEMM: fp32 square, N = 44 (largest tile that fits in an 8 KB payload).
+
+#### DPDK GPUDirect — FFT/GEMM
+
+_TBD (follow-up)._
+
+#### RoCE — FFT/GEMM
+
+_TBD (follow-up)._ Note the unit-of-work mismatch when comparing across
+transports: Socket / RoCE applies the post-process kernel once per ~8 MB SEND;
+raw GPUDirect applies it once per packet. The throughput numbers are
+comparable; "operations per burst" is not.
+
+## Python Results
+
+The planned Python benches ([issue #16](https://github.com/NVIDIA/daqiri/issues/16))
+mirror the C++ benches' CLI and stdout format, using the existing pybind11
+bindings.
+
+### Loopback
+
+_TBD (follow-up)._
+
+### FFT / GEMM via pybind of the C++ post-process layer
+
+_TBD (follow-up)._
+
+## Reproduce these results
+
+### Container
+
+All commands below assume execution inside the project container, as
+required by [`AGENTS.md`](https://github.com/nvidia/daqiri/blob/main/AGENTS.md).
+On the host, launch the container in privileged mode with all GPUs,
+hugepage mounts, and `/sys` passed through, and bind-mount the repo at
+`/workspace`:
+
+```bash
+# RX-side NIC; auto-injects ETH_DST_ADDR for the DPDK bench wrappers.
+RX_IFACE="${RX_IFACE:-enP2p1s0f0np0}"
+sudo docker run --rm -it \
+  --net host --ipc=host \
+  --runtime=nvidia --gpus all \
+  --privileged \
+  --ulimit memlock=-1 --ulimit stack=67108864 \
+  -v "$(pwd):/workspace" \
+  -v /dev/hugepages:/dev/hugepages \
+  -v /mnt/huge:/mnt/huge \
+  -v /sys:/sys \
+  -w /workspace \
+  -e ETH_DST_ADDR="$(cat /sys/class/net/$RX_IFACE/address)" \
+  -e RX_IFACE="$RX_IFACE" \
+  daqiri:local \
+  bash
+```
+
+### Build
+
+Inside the container:
+
+```bash
+cmake -S . -B build -DBUILD_SHARED_LIBS=ON -DDAQIRI_BUILD_PYTHON=ON \
+    -DDAQIRI_MGR="dpdk socket rdma"
+cmake --build build -j
+```
+
+### One-shot driver
+
+The whole DPDK matrix (sweep + drop-curve) is driven by a single script
+which handles pre-flight checks (hugepage availability, RX iface MAC,
+link state), orphan-hugepage cleanup between cells, and a final summary of
+per-target result directories:
+
+```bash
+./scripts/spark_data_fill.sh dpdk
+```
+
+The script defaults to `dpdk socket-udp socket-tcp` if invoked with no
+arguments. RDMA is currently rejected by the aggregate driver's pre-flight;
+use `./examples/run_spark_bench.sh rdma ...` directly when refreshing RoCE
+data.
+
+### Per-target wrapper invocations
+
+For ad-hoc runs of a single cell or a single mode:
+
+```bash
+export DAQIRI_BUILD_DIR="$PWD/build"
+export ETH_DST_ADDR="$(cat /sys/class/net/<rx-iface>/address)"   # DPDK only
+
+./examples/run_spark_bench.sh dpdk smoke         # native-shape headline cell
+./examples/run_spark_bench.sh dpdk sweep         # full payload × batch matrix
+./examples/run_spark_bench.sh dpdk drop-curve    # sweep --target-gbps
+```
+
+Each invocation emits `bench-results/<timestamp>-dpdk-<mode>/` containing
+one subdirectory per cell (stdout / stderr / config / dmon / cpu_stat
+captures), an `environment.txt` snapshot, and a `runs.csv` aggregating
+the cell-level metrics.
+
+### Environment-only capture
+
+Useful for filing a bug report or comparing two Spark units without
+running the bench:
+
+```bash
+./examples/bench_capture_environment.sh /tmp/spark-env
+```
+
+### Tuning prerequisites
+
+System tuning is required before the numbers in this report are
+reproducible. See
+[`docs/tutorials/system_configuration.md`](tutorials/system_configuration.md)
+for the DGX Spark tab — isolated cores, hugepages, governor, IRQ affinity.
+
+For RoCE specifically, the host needs source-based policy routing and
+static ARP entries pinned before the bench can use the QSFP loopback
+cable (single-NIC cross-cabled topology with both endpoint IPs local;
+without pinning, `rdma_resolve_route`'s neighbour lookup is ambiguous):
+
+```bash
+sudo ./scripts/setup_spark_rdma_loopback.sh
+```
+
+The script is idempotent and only modifies host state for the two
+data-plane ports `enp1s0f0np0` (1.1.1.1) and `enP2p1s0f0np0` (2.2.2.2).
+
+## TODO: Not Yet Implemented / Known Limitations
+
+- **HDS (Header–Data Split) is deferred.** The generic HDS configuration uses
+  `kind: device` for GPU memory regions. Spark / GB10 cannot use device memory
+  for GPUDirect — `nvidia_peermem` does not load and DMA-BUF is unreachable —
+  so DAQIRI uses `host_pinned` instead. Under `host_pinned`, the HDS layout no
+  longer changes the memory path; it only changes the segment partition,
+  which makes "HDS vs. plain GPUDirect" a non-distinction on this platform.
+  HDS is characterized when this report extends to IGX and x86-server
+  platforms where device memory works. See
+  [issue #15](https://github.com/NVIDIA/daqiri/issues/15) for tracking.
+- **RoCE small-payload pps needs a refresh on current main.** The carried-forward
+  Spark data saturates single-stream at ~83 Gbps for 8 MB and 1 MB messages,
+  but 64 KB and below collapse out of proportion to payload size (4 KB at
+  0.001 Gbps). The bench and RDMA manager changed after those numbers were
+  captured, so the matched 8 KB RoCE cell is footnoted [^1] until a follow-up
+  hardware run refreshes the sweep.
+- **Socket UDP / TCP results are pending re-run** with the wrapper-side
+  parse fix now on main. The previous sweep CSVs are zero-filled
+  because `run_spark_bench.sh`'s fallback only knew RDMA's
+  `send_completions`/`send_bytes` keys; `socket_bench` actually emits
+  `sent_packets`/`sent_bytes`. With that fixed, both socket transports are
+  expected to produce real numbers — neither of the earlier "deadlock" / "heap
+  corruption" diagnoses held up under investigation (both red herrings,
+  closed not-a-bug). See footnotes [^2] / [^3].
+- **p99/p999 latency is not in v1.** The bench output captures throughput,
+  drops, and resource utilization. Per-burst RX timestamping and percentile
+  aggregation are deferred to a follow-up issue.
diff --git a/docs/stylesheets/extra.css b/docs/stylesheets/extra.css
index afc366a..ce8f976 100644
--- a/docs/stylesheets/extra.css
+++ b/docs/stylesheets/extra.css
@@ -368,3 +368,64 @@ body[data-active-tab="igx-orin"] .md-nav--secondary li:has(> a[data-tab="dgx-spa
 body[data-active-tab="dgx-spark"] .md-nav--secondary li:has(> a[data-tab="igx-orin"]) {
   display: none;
 }
+
+/* ── Performance-report heatmap cells ───────────────────────────────── */
+/* Used by the payload×batch and payload×target_gbps matrices in
+   docs/performance-*.md. Threshold logic (vs. matrix-global max):
+     green  = no drops AND Gbps ≥ 90% of max
+     yellow = no drops AND Gbps ≥ 70% of max
+     red    = any drops OR Gbps < 70% of max                            */
+.md-typeset table.perf-matrix {
+  width: 100%;
+  table-layout: auto;
+  border-collapse: separate;
+  border-spacing: 5px;
+  font-size: 0.64rem;
+}
+.md-typeset table.perf-matrix th,
+.md-typeset table.perf-matrix td {
+  text-align: center;
+  vertical-align: middle;
+  font-variant-numeric: tabular-nums;
+  padding: 0.55em 0.5em;
+  border-radius: 4px;
+  white-space: nowrap;
+}
+.md-typeset table.perf-matrix td small {
+  display: block;
+  opacity: 0.75;
+  font-size: 0.85em;
+  margin-top: 0.25em;
+}
+.md-typeset table.perf-matrix td.cell-green {
+  background-color: rgba(118, 185, 0, 0.28);
+  color: inherit;
+}
+.md-typeset table.perf-matrix td.cell-yellow {
+  background-color: rgba(255, 196, 0, 0.32);
+  color: inherit;
+}
+.md-typeset table.perf-matrix td.cell-red {
+  background-color: rgba(220, 60, 60, 0.32);
+  color: inherit;
+}
+.md-typeset table.perf-matrix th {
+  background-color: rgba(255, 255, 255, 0.05);
+  font-weight: 600;
+}
+/* Compact legend chips that pair with the matrix. */
+.md-typeset .perf-legend {
+  display: flex;
+  gap: 0.75em;
+  margin: 0.5em 0 1em 0;
+  font-size: 0.85em;
+  flex-wrap: wrap;
+}
+.md-typeset .perf-legend span {
+  padding: 0.1em 0.55em;
+  border-radius: 0.25em;
+  white-space: nowrap;
+}
+.md-typeset .perf-legend .cell-green  { background-color: rgba(118, 185, 0, 0.28); }
+.md-typeset .perf-legend .cell-yellow { background-color: rgba(255, 196, 0, 0.32); }
+.md-typeset .perf-legend .cell-red    { background-color: rgba(220, 60, 60, 0.32); }
diff --git a/mkdocs.yml b/mkdocs.yml
index 65d9e7d..8a10698 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -50,6 +50,8 @@ nav:
   - Getting Started: getting-started.md
   - Concepts: concepts.md
   - Benchmarks: tutorials/benchmarking_examples.md
+  - Performance:
+    - DGX Spark: performance-dgx-spark.md
   - API Reference:
     - API Guide: api-reference/index.md
     - Configuration YAML Reference: api-reference/configuration.md
@@ -64,6 +66,7 @@ markdown_extensions:
   - admonition
   - attr_list
   - def_list
+  - footnotes
   - md_in_html
   - tables
   - toc: