
#15 - Add DGX Spark sweep tooling and RDMA loopback prereqs #116

Merged
dleshchev merged 1 commit into main from review/pr-97-spark-tooling
Jun 3, 2026

Conversation

@dleshchev
Collaborator

Rebuilds the Spark tooling stack item as a clean follow-up after PR #96 landed on main.

Changes:
- Adds the Spark benchmark sweep wrapper and data-fill driver.
- Adds the Spark RDMA loopback setup helper with current p0-to-p1 defaults.
- Raises RDMA bench outstanding depth to match the YAML buffer depth.
- Keeps failed/malformed bench cells from being emitted as successful zero rows.

Validation:
- bash -n examples/run_spark_bench.sh
- bash -n scripts/spark_data_fill.sh
- bash -n scripts/setup_spark_rdma_loopback.sh
- python3 scripts/check_doc_refs.py
- git diff --check origin/main...HEAD

Local hardware benchmark/build not run in this shell; the repo requires compile/run inside the project container with the Spark NIC/GPU environment.

@greptile-apps

greptile-apps Bot commented Jun 3, 2026

Greptile Summary

Adds the DGX Spark benchmark sweep stack on top of the base infra merged in PR #96: a sweep wrapper (run_spark_bench.sh), a one-shot data-fill driver (spark_data_fill.sh), and an idempotent RDMA loopback setup script (setup_spark_rdma_loopback.sh), plus a kMaxOutstanding bump in rdma_bench.cpp to match the YAML buffer depth.

  • examples/run_spark_bench.sh: Sweeps payload × batch × target-gbps across DPDK, RDMA, and socket backends; emits one CSV row per successful cell and propagates failures without producing false zero rows.
  • scripts/spark_data_fill.sh: Drives the DPDK/socket sweep and drop-curve modes with hugepage pre-flight, inter-run cleanup, and live log streaming via tee.
  • scripts/setup_spark_rdma_loopback.sh: Sets up the p0↔p1 RDMA loopback with per-port routing tables and static ARP entries; safe to re-run.
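
The per-port routing and static ARP approach described for the loopback script can be pictured with a dry-run sketch. This is not the contents of setup_spark_rdma_loopback.sh: the interface names, IP addresses, table IDs, and the `mac_of` / `plan_port` helpers are all hypothetical, and the script only prints the `ip` commands it would run rather than executing them.

```shell
#!/usr/bin/env bash
# Dry-run sketch of an idempotent two-port loopback setup.
# Interface names, addresses, and table IDs are hypothetical defaults.
set -euo pipefail

P0_IF="${P0_IF:-p0}"; P0_IP="${P0_IP:-192.168.100.1}"; P0_TABLE="${P0_TABLE:-100}"
P1_IF="${P1_IF:-p1}"; P1_IP="${P1_IP:-192.168.101.1}"; P1_TABLE="${P1_TABLE:-101}"

# Resolve a MAC: explicit override first, then sysfs, then a placeholder.
mac_of() {
  local iface="$1" override="${2:-}"
  if [[ -n "$override" ]]; then
    echo "$override"
  elif [[ -r "/sys/class/net/$iface/address" ]]; then
    cat "/sys/class/net/$iface/address"
  else
    echo "00:00:00:00:00:00"   # placeholder when the interface is absent
  fi
}

# Print (not run) the commands for one port. Flushing the table and deleting
# the rule before re-adding is what makes a re-run safe.
plan_port() {
  local iface="$1" ip="$2" table="$3" peer_ip="$4" peer_mac="$5"
  echo "ip route flush table $table"
  echo "ip rule del from $ip table $table"
  echo "ip rule add from $ip table $table"
  echo "ip route replace $peer_ip/32 dev $iface src $ip table $table"
  echo "ip neigh replace $peer_ip lladdr $peer_mac dev $iface nud permanent"
}

PLAN="$(
  plan_port "$P0_IF" "$P0_IP" "$P0_TABLE" "$P1_IP" "$(mac_of "$P1_IF" "${P1_MAC:-}")"
  plan_port "$P1_IF" "$P1_IP" "$P1_TABLE" "$P0_IP" "$(mac_of "$P0_IF" "${P0_MAC:-}")"
)"
echo "$PLAN"
```

A real setup script would need root, would have to tolerate the `ip rule del` failing on a first run, and may need further host settings that this sketch does not cover.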

Confidence Score: 5/5

The changes are additive tooling scripts and a one-line constant bump; they do not touch the library core, the Manager vtable, or the BurstParams contract.

All five changed files are new shell scripts or a trivial constant change in an example binary. The kMaxOutstanding bump is well-documented with the pool-drain constraint inline. The scripts handle failure isolation correctly and do not emit false zero rows. DCO sign-offs and commit format are both present.

No files require special attention; the only finding is a misleading header comment in examples/run_spark_bench.sh.

Important Files Changed

| Filename | Overview |
| --- | --- |
| .gitignore | Adds bench-results/ output directory and pcie_schematic.png artifact to gitignore; clean and correct. |
| examples/rdma_bench.cpp | Raises kMaxOutstanding from 5 to 20 to match the YAML num_bufs, with an inline comment documenting the pool-drain constraint and the known structural limitation (interleaving drain/post is deferred). |
| examples/run_spark_bench.sh | New Spark sweep wrapper; failure cells correctly return nonzero instead of emitting false zero rows. Header comment for RX_IFACE is misleading: the variable is unused in this script and /proc/net/udp provides no per-interface filtering. |
| scripts/setup_spark_rdma_loopback.sh | New idempotent RDMA loopback setup script; correctly flushes routes/rules before re-adding, reads MACs from sysfs with env overrides, and uses per-port routing tables. |
| scripts/spark_data_fill.sh | One-shot data-fill driver with pre-flight hugepage checks, orphan-hugepage cleanup between runs, and correct PIPESTATUS capture for pipeline exit-code propagation. |
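
The PIPESTATUS point matters because piping a bench run through `tee` makes `$?` report the status of `tee`, not of the bench. A minimal sketch of the pattern, assuming a hypothetical `run_cell` stand-in that fails with status 3 (the real driver's function names and log paths are not shown in this PR page):

```shell
#!/usr/bin/env bash
# Sketch: keep a live log via tee without losing the producer's exit status.

LOG="$(mktemp)"

# Hypothetical stand-in for one bench invocation; prints output, then fails.
run_cell() {
  echo "stats line"
  return 3
}

run_cell 2>&1 | tee "$LOG"
rc=${PIPESTATUS[0]}   # status of run_cell; plain $? would be tee's status (0)

FAILURES=0
if (( rc != 0 )); then
  FAILURES=$((FAILURES + 1))
  echo "cell failed with status $rc; log kept at $LOG" >&2
fi
```

`PIPESTATUS` has to be read immediately after the pipeline, since the next command overwrites it.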

Sequence Diagram

sequenceDiagram
    participant User
    participant DataFill as spark_data_fill.sh
    participant Wrapper as run_spark_bench.sh
    participant EnvCapture as bench_capture_environment.sh
    participant Bench as BenchBinary
    participant CSV

    User->>DataFill: run with backends
    DataFill->>DataFill: preflight hugepages, MAC, carrier
    loop each backend x mode
        DataFill->>DataFill: clean_orphan_hugepages
        DataFill->>Wrapper: backend mode
        Wrapper->>EnvCapture: capture env state
        loop each cell payload x batch x target_gbps
            Wrapper->>Wrapper: snapshot udp/cpu/dmon
            Wrapper->>Bench: execute with generated YAML
            Bench-->>Wrapper: stdout stats + stderr drops
            alt cell succeeded
                Wrapper->>CSV: append row
            else cell failed
                Wrapper->>Wrapper: FAILURES++
            end
        end
        Wrapper-->>DataFill: exit status
        DataFill->>DataFill: clean_orphan_hugepages
    end
    DataFill-->>User: summary and result dirs

Reviews (2): Last reviewed commit: "#15 - Add DGX Spark sweep tooling and RD..."

Comment on lines +94 to +99
BASE_YAML="$SCRIPT_DIR/daqiri_bench_socket_tcp_tx_rx.yaml"
BENCH_BIN="$BUILD_DIR/examples/daqiri_bench_socket"
CPU_MASTER=8; CPU_TX=17; CPU_RX=18
;;
*) echo "Unknown backend: $BACKEND" >&2; exit 1 ;;
esac

P1 Socket-udp BATCHES_SWEEP never varies the config

BATCHES_SWEEP=(256 32 1) is declared for socket-udp, but generate_yaml for both socket-udp and socket-tcp only substitutes message_size — the queue-level batch_size: 1 in daqiri_bench_socket_udp_tx_rx.yaml is never updated. Every element of BATCHES_SWEEP therefore produces an identical YAML, so the CSV will contain three rows per payload with different batch values but indistinguishable bench configs and throughput numbers. Either add a batch_size substitution leg to generate_yaml for socket backends (matching the DPDK -e "s|^( *batch_size: ).*|\1$batch|" pattern), or collapse BATCHES_SWEEP=(1) for socket-udp to make the sweep intent explicit.
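
The first option (adding a batch_size substitution leg) could look roughly like the sketch below. It reuses the sed pattern quoted above; the function name `generate_yaml_sketch`, its argument order, and the sample YAML layout are assumptions, since the real generate_yaml and daqiri_bench_socket_udp_tx_rx.yaml are not reproduced here.

```shell
#!/usr/bin/env bash
# Sketch: substitute both message_size and batch_size when generating a
# per-cell YAML, so each BATCHES_SWEEP element yields a distinct config.
set -euo pipefail

# Hypothetical base config; the real YAML layout may differ.
BASE_YAML="$(mktemp)"
cat > "$BASE_YAML" <<'EOF'
queues:
  - name: tx_q
    batch_size: 1
    message_size: 1024
EOF

# generate_yaml_sketch <base> <payload> <batch>: emit the per-cell YAML on stdout.
generate_yaml_sketch() {
  local base="$1" payload="$2" batch="$3"
  sed -E \
    -e "s|^( *message_size: ).*|\1$payload|" \
    -e "s|^( *batch_size: ).*|\1$batch|" \
    "$base"
}

# Each batch value now produces a different batch_size line.
for batch in 256 32 1; do
  generate_yaml_sketch "$BASE_YAML" 8192 "$batch" | grep -E 'batch_size|message_size'
done

GENERATED="$(generate_yaml_sketch "$BASE_YAML" 8192 32)"
```

The anchored `^( *key: )` form only rewrites lines where the key starts the line after indentation, so a key that shares a line with a list dash would need a different pattern.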

Comment thread examples/rdma_bench.cpp
Comment on lines +70 to +75
// Matches the per-MR num_bufs in the YAML configs. Higher values deadlock
// the bench: post_req blocks in get_tx_packet_burst when the pool is empty,
// but free_tx_burst (which refills it) only runs later in the same loop
// iteration via get_rx_burst. Until the loop is refactored to interleave
// drain with post, this constant must stay <= num_bufs.
static constexpr int kMaxOutstanding = 20;
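
Since the constraint in that comment (kMaxOutstanding must stay <= num_bufs) is only enforced by convention, a small consistency check could catch drift between the source and the YAML. The following is a sketch under assumptions: it works on throwaway sample files, and the YAML key layout around num_bufs is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: verify that kMaxOutstanding does not exceed the smallest num_bufs.
set -euo pipefail

# Sample inputs standing in for examples/rdma_bench.cpp and an RDMA YAML config.
SRC="$(mktemp)"; YAML="$(mktemp)"
echo 'static constexpr int kMaxOutstanding = 20;' > "$SRC"
printf 'memory_regions:\n  - name: tx_mr\n    num_bufs: 20\n  - name: rx_mr\n    num_bufs: 20\n' > "$YAML"

# Extract the constant from the C++ source.
max_outstanding="$(sed -nE 's/.*kMaxOutstanding *= *([0-9]+);.*/\1/p' "$SRC")"

# Take the smallest num_bufs across all regions; that is the binding limit.
min_num_bufs="$(sed -nE 's/^ *num_bufs: *([0-9]+).*/\1/p' "$YAML" | sort -n | sed -n '1p')"

if (( max_outstanding <= min_num_bufs )); then
  GUARD="ok"
else
  GUARD="violation"
fi
echo "kMaxOutstanding=$max_outstanding min num_bufs=$min_num_bufs -> $GUARD"
```

Pointing `SRC` and `YAML` at the real files would turn this into a pre-commit or CI check, provided the patterns match the actual file contents.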

P2 Doc-sync gap: examples/rdma_bench.cpp change not reflected in docs

Per the project doc-sync rule, any change to examples/*.cpp requires updating docs/tutorials/benchmarking_examples.md, docs/tutorials/configuration-walkthrough.md, and the benchmark table in AGENTS.md in the same PR. The new kMaxOutstanding value and its deadlock constraint are meaningful to anyone tuning RDMA buffer depth, and run_spark_bench.sh adds a new benchmark entry-point that both docs should mention. None of those three files are touched in this PR.

Rule Used: DAQIRI has no automated doc-sync gate beyond mkdoc... (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Adds Spark-focused benchmark sweep wrappers and host RDMA loopback setup now that the reusable bench infrastructure is on main.

- examples/run_spark_bench.sh: Spark-tuned sweep driver with per-backend payload/batch matrices, CPU pins, drop-source dispatch, and one CSV row per successful cell into bench-results/. Bench failures or missing completion stats now keep artifacts but return nonzero instead of producing false zero rows.

- scripts/spark_data_fill.sh: one-shot driver for the DPDK / socket-UDP / socket-TCP bench matrix, with hugepage pre-flight, orphan-hugepage cleanup between runs, and aggregate failure propagation.

- scripts/setup_spark_rdma_loopback.sh: idempotent Spark host prereq for the p0-to-p1 RoCE loopback. Defaults match the Spark profile and MACs are read from sysfs unless explicitly overridden.

- examples/rdma_bench.cpp: raise kMaxOutstanding from 5 to 20 to match the Spark RDMA YAML buffer depth and improve small-payload throughput without exceeding num_bufs.

- .gitignore: ignore generated Spark bench artifacts.

Includes fixes for two parsing bugs Greptile flagged on the original draft of run_spark_bench.sh: /proc/net/udp drops are decimal, and socket bench uses sent_packets / sent_bytes rather than RDMA send_completions / send_bytes.
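
The first of those fixes is easy to get wrong because the address columns in /proc/net/udp are hexadecimal while the trailing drops column is decimal. A minimal sketch of summing drops, assuming a hypothetical `sum_udp_drops` helper and a sample file in place of the live /proc entry (this is not the parsing code from run_spark_bench.sh):

```shell
#!/usr/bin/env bash
# Sketch: sum the per-socket drops column from a /proc/net/udp style table.
set -euo pipefail

# Sample data in the /proc/net/udp layout; values are made up.
SAMPLE="$(mktemp)"
cat > "$SAMPLE" <<'EOF'
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
   0: 00000000:1F90 00000000:0000 07 00000000:00000000 00:00000000 00000000  1000        0 12345 2 0000000000000000 10
   1: 0100007F:0035 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 23456 2 0000000000000000 25
EOF

# Skip the header, then add up the last field as a decimal number.
sum_udp_drops() {
  awk 'NR > 1 { total += $NF } END { print total + 0 }' "$1"
}

DROPS_BEFORE=0
DROPS_AFTER="$(sum_udp_drops "$SAMPLE")"
echo "drops during cell: $((DROPS_AFTER - DROPS_BEFORE))"   # prints 35
```

Reading the same two values as hex would give 53 instead of 35, which is the kind of inflated drop count the fix avoids. As the review notes, the table is per socket, so this sum cannot be filtered by interface.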

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: rgurunathan <rgurunathan@nvidia.com>
Signed-off-by: Denis Leshchev <dleshchev@nvidia.com>
@dleshchev dleshchev force-pushed the review/pr-97-spark-tooling branch from ec36695 to 9837f99 Compare June 3, 2026 17:19
@dleshchev dleshchev merged commit 4cd7514 into main Jun 3, 2026
1 check passed
@dleshchev dleshchev deleted the review/pr-97-spark-tooling branch June 3, 2026 17:28