Skip to content

#15 - Add reusable C++ bench infrastructure for per-platform performance reports#96

Merged
dleshchev merged 1 commit into
mainfrom
15-bench-infra
Jun 3, 2026
Merged

#15 - Add reusable C++ bench infrastructure for per-platform performance reports#96
dleshchev merged 1 commit into
mainfrom
15-bench-infra

Conversation

@RamyaGuru
Copy link
Copy Markdown
Collaborator

Summary

First of a 3-PR stack replacing #72 (DGX Spark performance report). This PR contains only the platform-agnostic bench plumbing — no Spark-specific tooling, no perf doc.

  • TokenBucketPacer in examples/raw_bench_common.{cpp,h}, plus a --target-gbps G CLI flag that wires it into the raw / RDMA / socket benches.
  • --seconds N CLI flag for time-bounded runs (replaces the previous YAML iterations requirement for smoke runs).
  • Atomic stringstream prints so the multi-thread bench threads don't garble their RX complete: / TX complete: lines.
  • Drop-count propagation — DPDK manager xstats summary, RDMA CQ error lines, and socket bench sent_packets/sent_bytes keys are all emitted in a grep-friendly format that the upstream sweep wrapper consumes.
  • examples/bench_capture_environment.sh — one-shot capture of slow-moving host state (kernel, NIC firmware, IOMMU, hugepages, GPU info) into a results directory.

Used by the upcoming DGX Spark, IGX, and x86 platform performance reports (issues #15, #16, and follow-on platform PRs).

Stack

This PR → #β (Spark sweep tooling)#γ (Spark perf doc) — opening next once their test plans pass.

Replaces #72 (which had all three concerns bundled into one ~1,545-line draft).

Test plan

  • Clean build: cmake -S . -B build -DBUILD_SHARED_LIBS=ON -DDAQIRI_BUILD_PYTHON=OFF -DDAQIRI_MGR="dpdk socket rdma" && cmake --build build -j — no warnings.
  • --seconds honored on DPDK QSFP loopback (daqiri_bench_raw_gpudirect against daqiri_bench_raw_tx_rx_spark.yaml) — TX/RX complete lines print cleanly, RX packets > 0, exits in ~5 s.
  • TokenBucketPacer rate-limits: --target-gbps 10 measured within ±20% of the target on the same cable.
  • Socket bench output format (Server complete: sent_packets=... sent_bytes=...) — deferred to PR β's regression test, which exercises the same field-name path through the sweep wrapper.

🤖 Generated with Claude Code

@RamyaGuru
Copy link
Copy Markdown
Collaborator Author

Force-pushed to add a missing BUILD_RPATH on daqiri_bench_socket in examples/CMakeLists.txt. The original commit linked raw_bench_common.cpp + CUDA::cudart into the socket bench (matching what was done for daqiri_bench_rdma) but missed the matching BUILD_RPATH property. Without it, the binary runtime-links against /opt/daqiri/lib/libdaqiri.so — the installed snapshot, which can be older than the source tree (it was, in this case: missing Cliff's set_connection_id accessors from #88). Found while running PR #97's regression test, which crashed with an undefined-symbol error until I added the RPATH. daqiri_bench_rdma already had the same property, so this is just bringing socket in line.

@RamyaGuru RamyaGuru marked this pull request as ready for review May 27, 2026 13:46
@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented May 27, 2026

Greptile Summary

Adds platform-agnostic bench plumbing: TokenBucketPacer for --target-gbps rate limiting, --seconds time-bounded runs, atomic stats printing, drop-count propagation in structured output fields, and bench_capture_environment.sh for host/NIC/GPU state capture.

  • TokenBucketPacer (raw_bench_common.{h,cpp}): software token-bucket pacer wired into DPDK, RDMA, and socket bench TX workers; rdma/socket benches each give server and client their own pacer instance, but raw_gpudirect_bench.cpp shares a single tx_pacer across all TX threads — the mutex is held for the full sleep loop, which serializes parallel TX workers when --target-gbps is active and multiple queues are configured.
  • Structured output fields (seconds=, send_bytes=, recv_bytes=, etc.) are emitted consistently across all three benches in a grep-friendly name=value format suitable for the upstream sweep wrapper.
  • bench_capture_environment.sh: new one-shot script capturing kernel, NIC firmware, hugepages, IOMMU, and GPU state into a timestamped results directory.

Confidence Score: 4/5

Safe to merge for single-queue configs; multi-queue DPDK TX with --target-gbps will serialize TX threads due to the lock held during sleep.

The pacer mutex is held across the entire sleep loop in wait_for_bytes. In raw_gpudirect_bench.cpp a single tx_pacer is shared among all TX threads, so any multi-queue DPDK bench run with --target-gbps set will block all TX threads behind one sleeping thread rather than pacing them in parallel. The rdma and socket benches are unaffected because each role gets its own pacer. Fixing this requires releasing the mutex before the sleep loop.

examples/raw_bench_common.cpp (wait_for_bytes) and examples/raw_gpudirect_bench.cpp (shared tx_pacer)

Important Files Changed

Filename Overview
examples/raw_bench_common.cpp Adds TokenBucketPacer implementation; mutex is held across the sleep loop, serializing threads when a shared pacer is used.
examples/raw_bench_common.h Declares TokenBucketPacer and updates print_queue_stats signature to include elapsed seconds; clean.
examples/raw_gpudirect_bench.cpp Wires TokenBucketPacer (shared single instance across all TX threads) and per-thread elapsed timing into tx_worker; pacer serialization bug surfaces here under multi-queue configs.
examples/rdma_bench.cpp Adds per-role pacers (server/client each own a separate TokenBucketPacer), byte tracking, and structured output; no contention issue since pacers are not shared across threads.
examples/socket_bench.cpp Adds pacer, byte tracking, and structured output; pre-existing send_tx_burst failure leak and TX vs RX byte accounting mismatch are flagged in prior review threads.
examples/CMakeLists.txt Adds raw_bench_common.cpp and CUDA::cudart to rdma and socket bench targets; also adds missing BUILD_RPATH for socket bench.
examples/bench_capture_environment.sh New one-shot host/NIC/GPU state capture script; correctly handles missing commands and uses section headers for easy grepping.

Reviews (3): Last reviewed commit: "#15 - Add reusable C++ bench infrastruct..." | Re-trigger Greptile

Comment thread examples/socket_bench.cpp
Comment on lines 122 to +136
@@ -119,7 +131,9 @@ void socket_worker(const SocketBenchConfig& cfg, std::atomic<bool>& stop, Socket
daqiri::BurstParams* burst = nullptr;
if (daqiri::get_rx_burst(&burst, conn_id, cfg.server) == daqiri::Status::SUCCESS &&
burst != nullptr) {
stats.received_packets += static_cast<uint64_t>(daqiri::get_num_packets(burst));
const uint64_t rx_pkts = static_cast<uint64_t>(daqiri::get_num_packets(burst));
stats.received_packets += rx_pkts;
stats.received_bytes += daqiri::get_burst_tot_byte(burst);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 TX vs RX byte metrics use different accounting bases

sent_bytes is accumulated as cfg.message_size per packet (user-payload only), while received_bytes uses get_burst_tot_byte(burst) (actual bytes delivered by the DAQIRI burst, which may include framing). In a zero-loss loopback the "Server complete" line will show recv_bytes > sent_bytes, causing the upstream sweep wrapper — which parses both fields to compute efficiency — to report spurious byte loss or negative efficiency. rdma_bench.cpp consistently uses cfg.message_size for both sides; socket_bench should do the same, or switch both to the actual API measurement.

Comment on lines 30 to 60
namespace daqiri::bench {

// Software token-bucket pacer used by the bench TX workers. When
// target_gbps == 0 the wait_for_bytes() call is a no-op early return, so the
// pacer adds no overhead when --target-gbps is unset.
//
// Accuracy: ~5% at high rates due to Linux nanosleep granularity and scheduler
// jitter. Acceptable for drop-curve sweeps; tighter pacing would require
// hardware TX timestamping (DAQIRI's accurate_send YAML flag), deferred.
class TokenBucketPacer {
public:
TokenBucketPacer() = default;
explicit TokenBucketPacer(double target_gbps);

// Call after each TX burst. Sleeps in short slices until the pacer's notion
// of "time the configured target rate would have taken to send the
// accumulated bytes" catches up, OR `stop` flips true. Slicing keeps the
// bench responsive to --seconds expiry / Ctrl-C without truncating the total
// sleep (which would silently break pacing for low target rates).
void wait_for_bytes(size_t bytes, std::atomic<bool> &stop);

bool enabled() const { return target_bps_ > 0.0; }
double target_gbps() const { return target_bps_ / 1e9; }

private:
double target_bps_ = 0.0; // 0 means disabled
uint64_t total_bytes_ = 0;
std::chrono::steady_clock::time_point t0_;
};

struct RawBenchTxConfig {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Doc-sync required for new CLI flags and output format

This PR introduces the TokenBucketPacer class, the --target-gbps flag, and rewrites every bench's final summary line to use name=value fields (adding seconds=, send_bytes=, recv_bytes=, etc.). It also adds examples/bench_capture_environment.sh. Per the repo's doc-sync rule, changes under examples/*.cpp require updating docs/tutorials/benchmarking_examples.md, docs/tutorials/configuration-walkthrough.md, and the benchmark table in AGENTS.md. None of those appear to be updated in this PR.

Rule Used: DAQIRI has no automated doc-sync gate beyond mkdoc... (source)

…nce reports

Adds the platform-agnostic plumbing needed for upcoming performance reports.

Includes shared benchmark CLI parsing for --seconds and --target-gbps, TokenBucketPacer, environment capture, build-tree RPATH coverage for socket, and compatible raw/RDMA/socket benchmark output for sweep tooling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: rgurunathan <rgurunathan@nvidia.com>
Signed-off-by: Denis Leshchev <dleshchev@nvidia.com>
@dleshchev dleshchev merged commit 7aea2eb into main Jun 3, 2026
3 checks passed
@dleshchev dleshchev deleted the 15-bench-infra branch June 3, 2026 17:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants