#15 - Add reusable C++ bench infrastructure for per-platform performance reports by RamyaGuru · Pull Request #96 · NVIDIA/daqiri

RamyaGuru · 2026-05-26T20:14:37Z

Summary

First of a 3-PR stack replacing #72 (DGX Spark performance report). This PR contains only the platform-agnostic bench plumbing — no Spark-specific tooling, no perf doc.

TokenBucketPacer in examples/raw_bench_common.{cpp,h}, plus a --target-gbps G CLI flag that wires it into the raw / RDMA / socket benches.
--seconds N CLI flag for time-bounded runs (replaces the previous YAML iterations requirement for smoke runs).
Atomic stringstream prints so the multi-thread bench threads don't garble their RX complete: / TX complete: lines.
Drop-count propagation — DPDK manager xstats summary, RDMA CQ error lines, and socket bench sent_packets/sent_bytes keys are all emitted in a grep-friendly format that the upstream sweep wrapper consumes.
examples/bench_capture_environment.sh — one-shot capture of slow-moving host state (kernel, NIC firmware, IOMMU, hugepages, GPU info) into a results directory.

Used by the upcoming DGX Spark, IGX, and x86 platform performance reports (issues #15, #16, and follow-on platform PRs).

Stack

This PR → #β (Spark sweep tooling) → #γ (Spark perf doc) — opening next once their test plans pass.

Replaces #72 (which had all three concerns bundled into one ~1,545-line draft).

Test plan

Clean build: cmake -S . -B build -DBUILD_SHARED_LIBS=ON -DDAQIRI_BUILD_PYTHON=OFF -DDAQIRI_MGR="dpdk socket rdma" && cmake --build build -j — no warnings.
--seconds honored on DPDK QSFP loopback (daqiri_bench_raw_gpudirect against daqiri_bench_raw_tx_rx_spark.yaml) — TX/RX complete lines print cleanly, RX packets > 0, exits in ~5 s.
TokenBucketPacer rate-limits: --target-gbps 10 measured within ±20% of the target on the same cable.
Socket bench output format (Server complete: sent_packets=... sent_bytes=...) — deferred to PR β's regression test, which exercises the same field-name path through the sweep wrapper.

🤖 Generated with Claude Code

RamyaGuru · 2026-05-26T20:55:24Z

Force-pushed to add a missing BUILD_RPATH on daqiri_bench_socket in examples/CMakeLists.txt. The original commit linked raw_bench_common.cpp + CUDA::cudart into the socket bench (matching what was done for daqiri_bench_rdma) but missed the matching BUILD_RPATH property. Without it, the binary runtime-links against /opt/daqiri/lib/libdaqiri.so — the installed snapshot, which can be older than the source tree (it was, in this case: missing Cliff's set_connection_id accessors from #88). Found while running PR #97's regression test, which crashed with an undefined-symbol error until I added the RPATH. daqiri_bench_rdma already had the same property, so this is just bringing socket in line.

greptile-apps · 2026-05-27T13:53:22Z

Greptile Summary

Adds platform-agnostic bench plumbing: TokenBucketPacer for --target-gbps rate limiting, --seconds time-bounded runs, atomic stats printing, drop-count propagation in structured output fields, and bench_capture_environment.sh for host/NIC/GPU state capture.

TokenBucketPacer (raw_bench_common.{h,cpp}): software token-bucket pacer wired into DPDK, RDMA, and socket bench TX workers; rdma/socket benches each give server and client their own pacer instance, but raw_gpudirect_bench.cpp shares a single tx_pacer across all TX threads — the mutex is held for the full sleep loop, which serializes parallel TX workers when --target-gbps is active and multiple queues are configured.
Structured output fields (seconds=, send_bytes=, recv_bytes=, etc.) are emitted consistently across all three benches in a grep-friendly name=value format suitable for the upstream sweep wrapper.
bench_capture_environment.sh: new one-shot script capturing kernel, NIC firmware, hugepages, IOMMU, and GPU state into a timestamped results directory.

Confidence Score: 4/5

Safe to merge for single-queue configs; multi-queue DPDK TX with --target-gbps will serialize TX threads due to the lock held during sleep.

The pacer mutex is held across the entire sleep loop in wait_for_bytes. In raw_gpudirect_bench.cpp a single tx_pacer is shared among all TX threads, so any multi-queue DPDK bench run with --target-gbps set will block all TX threads behind one sleeping thread rather than pacing them in parallel. The rdma and socket benches are unaffected because each role gets its own pacer. Fixing this requires releasing the mutex before the sleep loop.

examples/raw_bench_common.cpp (wait_for_bytes) and examples/raw_gpudirect_bench.cpp (shared tx_pacer)

Important Files Changed

Filename	Overview
examples/raw_bench_common.cpp	Adds TokenBucketPacer implementation; mutex is held across the sleep loop, serializing threads when a shared pacer is used.
examples/raw_bench_common.h	Declares TokenBucketPacer and updates print_queue_stats signature to include elapsed seconds; clean.
examples/raw_gpudirect_bench.cpp	Wires TokenBucketPacer (shared single instance across all TX threads) and per-thread elapsed timing into tx_worker; pacer serialization bug surfaces here under multi-queue configs.
examples/rdma_bench.cpp	Adds per-role pacers (server/client each own a separate TokenBucketPacer), byte tracking, and structured output; no contention issue since pacers are not shared across threads.
examples/socket_bench.cpp	Adds pacer, byte tracking, and structured output; pre-existing send_tx_burst failure leak and TX vs RX byte accounting mismatch are flagged in prior review threads.
examples/CMakeLists.txt	Adds raw_bench_common.cpp and CUDA::cudart to rdma and socket bench targets; also adds missing BUILD_RPATH for socket bench.
examples/bench_capture_environment.sh	New one-shot host/NIC/GPU state capture script; correctly handles missing commands and uses section headers for easy grepping.

_{Reviews (3): Last reviewed commit: "#15 - Add reusable C++ bench infrastruct..." | Re-trigger Greptile}

greptile-apps · 2026-05-27T13:53:30Z

@@ -119,7 +131,9 @@ void socket_worker(const SocketBenchConfig& cfg, std::atomic<bool>& stop, Socket
      daqiri::BurstParams* burst = nullptr;
      if (daqiri::get_rx_burst(&burst, conn_id, cfg.server) == daqiri::Status::SUCCESS &&
          burst != nullptr) {
-        stats.received_packets += static_cast<uint64_t>(daqiri::get_num_packets(burst));
+        const uint64_t rx_pkts = static_cast<uint64_t>(daqiri::get_num_packets(burst));
+        stats.received_packets += rx_pkts;
+        stats.received_bytes += daqiri::get_burst_tot_byte(burst);


TX vs RX byte metrics use different accounting bases

sent_bytes is accumulated as cfg.message_size per packet (user-payload only), while received_bytes uses get_burst_tot_byte(burst) (actual bytes delivered by the DAQIRI burst, which may include framing). In a zero-loss loopback the "Server complete" line will show recv_bytes > sent_bytes, causing the upstream sweep wrapper — which parses both fields to compute efficiency — to report spurious byte loss or negative efficiency. rdma_bench.cpp consistently uses cfg.message_size for both sides; socket_bench should do the same, or switch both to the actual API measurement.

greptile-apps · 2026-05-27T13:53:31Z

 namespace daqiri::bench {

+// Software token-bucket pacer used by the bench TX workers. When
+// target_gbps == 0 the wait_for_bytes() call is a no-op early return, so the
+// pacer adds no overhead when --target-gbps is unset.
+//
+// Accuracy: ~5% at high rates due to Linux nanosleep granularity and scheduler
+// jitter. Acceptable for drop-curve sweeps; tighter pacing would require
+// hardware TX timestamping (DAQIRI's accurate_send YAML flag), deferred.
+class TokenBucketPacer {
+public:
+  TokenBucketPacer() = default;
+  explicit TokenBucketPacer(double target_gbps);
+
+  // Call after each TX burst. Sleeps in short slices until the pacer's notion
+  // of "time the configured target rate would have taken to send the
+  // accumulated bytes" catches up, OR `stop` flips true. Slicing keeps the
+  // bench responsive to --seconds expiry / Ctrl-C without truncating the total
+  // sleep (which would silently break pacing for low target rates).
+  void wait_for_bytes(size_t bytes, std::atomic<bool> &stop);
+
+  bool enabled() const { return target_bps_ > 0.0; }
+  double target_gbps() const { return target_bps_ / 1e9; }
+
+private:
+  double target_bps_ = 0.0;  // 0 means disabled
+  uint64_t total_bytes_ = 0;
+  std::chrono::steady_clock::time_point t0_;
+};
+
 struct RawBenchTxConfig {


Doc-sync required for new CLI flags and output format

This PR introduces the TokenBucketPacer class, the --target-gbps flag, and rewrites every bench's final summary line to use name=value fields (adding seconds=, send_bytes=, recv_bytes=, etc.). It also adds examples/bench_capture_environment.sh. Per the repo's doc-sync rule, changes under examples/*.cpp require updating docs/tutorials/benchmarking_examples.md, docs/tutorials/configuration-walkthrough.md, and the benchmark table in AGENTS.md. None of those appear to be updated in this PR.

Rule Used: DAQIRI has no automated doc-sync gate beyond mkdoc... (source)

…nce reports Adds the platform-agnostic plumbing needed for upcoming performance reports. Includes shared benchmark CLI parsing for --seconds and --target-gbps, TokenBucketPacer, environment capture, build-tree RPATH coverage for socket, and compatible raw/RDMA/socket benchmark output for sweep tooling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: rgurunathan <rgurunathan@nvidia.com> Signed-off-by: Denis Leshchev <dleshchev@nvidia.com>

RamyaGuru force-pushed the 15-bench-infra branch from ed92368 to 1344643 Compare May 26, 2026 20:51

RamyaGuru mentioned this pull request May 26, 2026

#15 - Add DGX Spark sweep tooling and RDMA loopback prereqs #97

Merged

5 tasks

RamyaGuru mentioned this pull request May 26, 2026

#15 - Add C++ bench infrastructure for DGX Spark performance report #72

Closed

9 tasks

RamyaGuru marked this pull request as ready for review May 27, 2026 13:46

greptile-apps Bot reviewed May 27, 2026

View reviewed changes

dleshchev force-pushed the 15-bench-infra branch from 73cabc9 to eb981c8 Compare June 3, 2026 17:02

dleshchev merged commit 7aea2eb into main Jun 3, 2026
3 checks passed

dleshchev deleted the 15-bench-infra branch June 3, 2026 17:03

This was referenced Jun 3, 2026

#15 - Add DGX Spark sweep tooling and RDMA loopback prereqs #116

Merged

#15 - Reapply RDMA benchmark profiling updates #118

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

#15 - Add reusable C++ bench infrastructure for per-platform performance reports#96

#15 - Add reusable C++ bench infrastructure for per-platform performance reports#96
dleshchev merged 1 commit into
mainfrom
15-bench-infra

RamyaGuru commented May 26, 2026

Uh oh!

RamyaGuru commented May 26, 2026

Uh oh!

greptile-apps Bot commented May 27, 2026 •

edited

Loading

Greptile Summary

Uh oh!

greptile-apps Bot May 27, 2026

Uh oh!

greptile-apps Bot May 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

RamyaGuru commented May 26, 2026

Summary

Stack

Test plan

Uh oh!

RamyaGuru commented May 26, 2026

Uh oh!

greptile-apps Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Uh oh!

greptile-apps Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

greptile-apps Bot commented May 27, 2026 •

edited

Loading