#15 - Add reusable C++ bench infrastructure for per-platform performance reports #96
Force-pushed to add a missing |
| Filename | Overview |
|---|---|
| examples/raw_bench_common.cpp | Adds TokenBucketPacer implementation; mutex is held across the sleep loop, serializing threads when a shared pacer is used. |
| examples/raw_bench_common.h | Declares TokenBucketPacer and updates print_queue_stats signature to include elapsed seconds; clean. |
| examples/raw_gpudirect_bench.cpp | Wires TokenBucketPacer (shared single instance across all TX threads) and per-thread elapsed timing into tx_worker; pacer serialization bug surfaces here under multi-queue configs. |
| examples/rdma_bench.cpp | Adds per-role pacers (server/client each own a separate TokenBucketPacer), byte tracking, and structured output; no contention issue since pacers are not shared across threads. |
| examples/socket_bench.cpp | Adds pacer, byte tracking, and structured output; pre-existing send_tx_burst failure leak and TX vs RX byte accounting mismatch are flagged in prior review threads. |
| examples/CMakeLists.txt | Adds raw_bench_common.cpp and CUDA::cudart to rdma and socket bench targets; also adds missing BUILD_RPATH for socket bench. |
| examples/bench_capture_environment.sh | New one-shot host/NIC/GPU state capture script; correctly handles missing commands and uses section headers for easy grepping. |
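
The serialization noted for the shared pacer can be avoided by holding the lock only while the byte counter is updated and the wake-up deadline is computed, then sleeping with the lock released. The following is a minimal sketch under stated assumptions: the `SharedPacer` class, its `reserve()` helper, and the 1 ms sleep slice are hypothetical and are not taken from the PR's `TokenBucketPacer` implementation.

```cpp
#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>

// Hypothetical sketch of a pacer that is safe to share across TX threads.
// Not the PR's TokenBucketPacer; names and the slice length are assumptions.
class SharedPacer {
 public:
  using Clock = std::chrono::steady_clock;

  explicit SharedPacer(double target_gbps)
      : target_bps_(target_gbps * 1e9), t0_(Clock::now()) {}

  // Call after each TX burst. Returns once the target rate has "caught up"
  // with the bytes reserved so far, or as soon as `stop` becomes true.
  void wait_for_bytes(std::size_t bytes, std::atomic<bool>& stop) {
    if (target_bps_ <= 0.0) return;  // pacing disabled: no-op

    // The mutex is held only inside reserve(); it is released before sleeping.
    const Clock::time_point deadline = reserve(bytes);

    const Clock::duration slice = std::chrono::milliseconds(1);
    while (!stop.load(std::memory_order_relaxed)) {
      const Clock::time_point now = Clock::now();
      if (now >= deadline) break;
      // Sleep in short slices so a stop request is noticed promptly.
      std::this_thread::sleep_for(std::min(slice, deadline - now));
    }
  }

  std::uint64_t total_bytes() {
    std::lock_guard<std::mutex> lock(mu_);
    return total_bytes_;
  }

 private:
  // Adds `bytes` to the shared counter and returns the time at which the
  // configured rate would have finished sending everything reserved so far.
  Clock::time_point reserve(std::size_t bytes) {
    std::lock_guard<std::mutex> lock(mu_);
    total_bytes_ += bytes;
    const std::chrono::duration<double> ideal(
        static_cast<double>(total_bytes_) * 8.0 / target_bps_);
    return t0_ + std::chrono::duration_cast<Clock::duration>(ideal);
  }

  double target_bps_ = 0.0;  // 0 means disabled
  std::uint64_t total_bytes_ = 0;
  Clock::time_point t0_;
  std::mutex mu_;
};
```

With this shape, each thread sleeps toward its own deadline, so one thread waiting does not block another thread from reserving bytes. The aggregate rate across threads is still bounded by the single shared counter.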
@@ -119,7 +131,9 @@ void socket_worker(const SocketBenchConfig& cfg, std::atomic<bool>& stop, Socket
     daqiri::BurstParams* burst = nullptr;
     if (daqiri::get_rx_burst(&burst, conn_id, cfg.server) == daqiri::Status::SUCCESS &&
         burst != nullptr) {
-      stats.received_packets += static_cast<uint64_t>(daqiri::get_num_packets(burst));
+      const uint64_t rx_pkts = static_cast<uint64_t>(daqiri::get_num_packets(burst));
+      stats.received_packets += rx_pkts;
+      stats.received_bytes += daqiri::get_burst_tot_byte(burst);
TX vs RX byte metrics use different accounting bases
sent_bytes is accumulated as cfg.message_size per packet (user-payload only), while received_bytes uses get_burst_tot_byte(burst) (actual bytes delivered by the DAQIRI burst, which may include framing). In a zero-loss loopback the "Server complete" line will show recv_bytes > sent_bytes, causing the upstream sweep wrapper — which parses both fields to compute efficiency — to report spurious byte loss or negative efficiency. rdma_bench.cpp consistently uses cfg.message_size for both sides; socket_bench should do the same, or switch both to the actual API measurement.
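
One way to keep both directions on the same basis is to route every counter update through helpers that share a single definition of "bytes per packet". The sketch below uses payload-only accounting on both sides, matching what the comment says `rdma_bench.cpp` does. The `BenchStats`, `record_tx`, `record_rx`, and `byte_efficiency` names are hypothetical and do not appear in the PR.

```cpp
#include <cstdint>

// Hypothetical counters; field names mirror the summary-line keys.
struct BenchStats {
  std::uint64_t sent_packets = 0;
  std::uint64_t sent_bytes = 0;
  std::uint64_t received_packets = 0;
  std::uint64_t received_bytes = 0;
};

// Payload-only accounting: both directions count packets * message_size,
// so framing overhead never enters either total.
inline void record_tx(BenchStats& s, std::uint64_t packets,
                      std::uint64_t message_size) {
  s.sent_packets += packets;
  s.sent_bytes += packets * message_size;
}

inline void record_rx(BenchStats& s, std::uint64_t packets,
                      std::uint64_t message_size) {
  s.received_packets += packets;
  s.received_bytes += packets * message_size;
}

// Received-to-sent byte ratio; 1.0 means no byte loss on a common basis.
inline double byte_efficiency(const BenchStats& s) {
  if (s.sent_bytes == 0) return 0.0;
  return static_cast<double>(s.received_bytes) /
         static_cast<double>(s.sent_bytes);
}
```

With a shared basis, a zero-loss loopback yields an efficiency of exactly 1.0. If the RX side instead added a per-burst total that includes framing, the same run would report a ratio above 1.0.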
namespace daqiri::bench {

// Software token-bucket pacer used by the bench TX workers. When
// target_gbps == 0 the wait_for_bytes() call is a no-op early return, so the
// pacer adds no overhead when --target-gbps is unset.
//
// Accuracy: ~5% at high rates due to Linux nanosleep granularity and scheduler
// jitter. Acceptable for drop-curve sweeps; tighter pacing would require
// hardware TX timestamping (DAQIRI's accurate_send YAML flag), deferred.
class TokenBucketPacer {
 public:
  TokenBucketPacer() = default;
  explicit TokenBucketPacer(double target_gbps);

  // Call after each TX burst. Sleeps in short slices until the pacer's notion
  // of "time the configured target rate would have taken to send the
  // accumulated bytes" catches up, OR `stop` flips true. Slicing keeps the
  // bench responsive to --seconds expiry / Ctrl-C without truncating the total
  // sleep (which would silently break pacing for low target rates).
  void wait_for_bytes(size_t bytes, std::atomic<bool> &stop);

  bool enabled() const { return target_bps_ > 0.0; }
  double target_gbps() const { return target_bps_ / 1e9; }

 private:
  double target_bps_ = 0.0;  // 0 means disabled
  uint64_t total_bytes_ = 0;
  std::chrono::steady_clock::time_point t0_;
};

struct RawBenchTxConfig {
Doc-sync required for new CLI flags and output format
This PR introduces the TokenBucketPacer class, the --target-gbps flag, and rewrites every bench's final summary line to use name=value fields (adding seconds=, send_bytes=, recv_bytes=, etc.). It also adds examples/bench_capture_environment.sh. Per the repo's doc-sync rule, changes under examples/*.cpp require updating docs/tutorials/benchmarking_examples.md, docs/tutorials/configuration-walkthrough.md, and the benchmark table in AGENTS.md. None of those appear to be updated in this PR.
Rule Used: DAQIRI has no automated doc-sync gate beyond mkdoc... (source)
…nce reports

Adds the platform-agnostic plumbing needed for upcoming performance reports. Includes shared benchmark CLI parsing for --seconds and --target-gbps, TokenBucketPacer, environment capture, build-tree RPATH coverage for socket, and compatible raw/RDMA/socket benchmark output for sweep tooling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rgurunathan <rgurunathan@nvidia.com>
Signed-off-by: Denis Leshchev <dleshchev@nvidia.com>
Summary
First of a 3-PR stack replacing #72 (DGX Spark performance report). This PR contains only the platform-agnostic bench plumbing — no Spark-specific tooling, no perf doc.
- `TokenBucketPacer` in `examples/raw_bench_common.{cpp,h}`, plus a `--target-gbps G` CLI flag that wires it into the raw / RDMA / socket benches.
- `--seconds N` CLI flag for time-bounded runs (replaces the previous YAML `iterations` requirement for smoke runs).
- `RX complete:` / `TX complete:` lines, `CQ error` lines, and socket bench `sent_packets` / `sent_bytes` keys are all emitted in a grep-friendly format that the upstream sweep wrapper consumes.
- `examples/bench_capture_environment.sh`: one-shot capture of slow-moving host state (kernel, NIC firmware, IOMMU, hugepages, GPU info) into a results directory.

Used by the upcoming DGX Spark, IGX, and x86 platform performance reports (issues #15, #16, and follow-on platform PRs).
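
Because the summary lines carry `name=value` fields, a consumer can split on whitespace and `=` instead of maintaining a regex per bench. The sketch below shows one way to do that. The `parse_summary_fields` function is hypothetical, and the sweep wrapper's actual parser is not part of this PR.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: extracts every name=value token from one summary line.
// Tokens without '=' (for example the "Server complete:" prefix) are skipped.
inline std::map<std::string, std::string> parse_summary_fields(
    const std::string& line) {
  std::map<std::string, std::string> fields;
  std::istringstream in(line);
  std::string token;
  while (in >> token) {
    const std::string::size_type eq = token.find('=');
    if (eq == std::string::npos || eq == 0) continue;  // not a name=value pair
    fields[token.substr(0, eq)] = token.substr(eq + 1);
  }
  return fields;
}
```

Values are returned as strings so the caller decides how to convert them, for example `std::stoull` for packet and byte counts and `std::stod` for `seconds`.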
Stack
This PR → #β (Spark sweep tooling) → #γ (Spark perf doc) — opening next once their test plans pass.
Replaces #72 (which had all three concerns bundled into one ~1,545-line draft).
Test plan
- `cmake -S . -B build -DBUILD_SHARED_LIBS=ON -DDAQIRI_BUILD_PYTHON=OFF -DDAQIRI_MGR="dpdk socket rdma" && cmake --build build -j`: no warnings.
- `--seconds` honored on DPDK QSFP loopback (`daqiri_bench_raw_gpudirect` against `daqiri_bench_raw_tx_rx_spark.yaml`): TX/RX complete lines print cleanly, RX packets > 0, exits in ~5 s.
- `TokenBucketPacer` rate-limits: `--target-gbps 10` measured within ±20% of the target on the same cable.
- Socket bench summary line (`Server complete: sent_packets=... sent_bytes=...`): deferred to PR β's regression test, which exercises the same field-name path through the sweep wrapper.

🤖 Generated with Claude Code