#112 - Add OpenTelemetry Metrics #111

Merged: cliffburdick merged 7 commits into main from cburdick/otel-metrics on Jun 2, 2026

@cliffburdick
Collaborator

This PR adds OpenTelemetry-compatible metrics that can be exported to tools such as Prometheus and Grafana. A working example based on the raw_tx_rx benchmark is provided.

@greptile-apps

greptile-apps Bot commented Jun 1, 2026

Greptile Summary

This PR adds opt-in OpenTelemetry metrics instrumentation to DAQIRI, gated behind a new DAQIRI_ENABLE_OTEL_METRICS CMake option (default OFF). A working Grafana/Prometheus example is provided under examples/grafana/, and all three backends (DPDK, RDMA, socket) are wired to report per-queue packet, byte, and drop counters through a central daqiri::metrics::Registry.

  • New src/metrics.h/src/metrics.cpp: Implements an observable-counter registry using the OTel C++ API. The previous deadlock concern in shutdown() (holding mutex_ while calling RemoveCallback) has been addressed — instruments are now moved into locals, state is cleared, the lock is released, and then RemoveCallback is called outside the lock.
  • Backend instrumentation: DPDK populates metrics from rte_eth_stats and per-queue xstats; RDMA increments counters on each work-completion event; socket backend hooks into the existing send/receive paths.
  • examples/grafana/: New compose stack with Prometheus scrape config, a pre-built Grafana dashboard JSON, and a helper script to launch the benchmark alongside the exporters.
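The lock-safe shutdown described in the first bullet can be sketched as follows. This is a minimal illustration, not the code in src/metrics.cpp: `FakeInstrument`, `SketchRegistry`, and the `on_remove` hook are hypothetical stand-ins, and the real OTel instrument type and `RemoveCallback` signature differ.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for an OTel observable instrument.
struct FakeInstrument {
  std::function<void()> on_remove;  // invoked by RemoveCallback()
  void RemoveCallback() {
    if (on_remove) on_remove();
  }
};

// Hypothetical registry showing only the shutdown ordering.
class SketchRegistry {
 public:
  void add(FakeInstrument inst) {
    std::lock_guard<std::mutex> lock(mutex_);
    instruments_.push_back(std::move(inst));
  }

  std::size_t size() {
    std::lock_guard<std::mutex> lock(mutex_);
    return instruments_.size();
  }

  void shutdown() {
    std::vector<FakeInstrument> local;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      local = std::move(instruments_);  // take ownership under the lock
      instruments_.clear();             // leave the registry empty
    }  // lock released here
    // mutex_ is no longer held, so a callback that re-enters the
    // registry cannot deadlock against shutdown().
    for (auto& inst : local) inst.RemoveCallback();
  }

 private:
  std::mutex mutex_;
  std::vector<FakeInstrument> instruments_;
};
```

If `RemoveCallback` were called while `mutex_` was still held, any callback that touched the registry (as `size()` does here) would block on a non-recursive mutex.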

Confidence Score: 5/5

Safe to merge; the metrics layer is entirely opt-in, all three backends wire up correctly, and the previous deadlock in shutdown() has been addressed.

The change is additive and off-by-default. The observable-counter registry is clean, the shutdown path is now lock-safe, BurstParams free contracts are preserved in every backend, and DCO sign-off is present on all commits. The only findings are minor style inconsistencies in CMake and the example files.

No files require special attention for correctness; src/CMakeLists.txt and examples/grafana/otel_prometheus.cpp have minor style issues noted above.

Important Files Changed

Filename Overview
src/metrics.h New header providing the daqiri::metrics API. Uses #if DAQIRI_ENABLE_OTEL_METRICS to switch between real declarations and inline no-op stubs; heavy implementation lives entirely in metrics.cpp which is only compiled when the CMake option is ON.
src/metrics.cpp New implementation of the OTel observable-counter registry. The previous deadlock concern (RemoveCallback called while holding mutex_) is resolved — shutdown now moves instruments to locals, clears state, releases the lock, then calls RemoveCallback outside the critical section.
src/managers/dpdk/daqiri_dpdk_stats.cpp Refactors xstat name parsing into a reusable parse_queue_xstat lambda and populates per-queue and per-port OTel counters from rte_eth_stats and queue-level xstats.
src/managers/rdma/daqiri_rdma_mgr.cpp Adds per-thread CounterSet handle captured at thread start; increments rx/tx on work completions and add_dropped on all error paths.
src/managers/socket/daqiri_socket_mgr.cpp Consolidates the pre-existing per-protocol tx_pkts_/tx_bytes_ update into a single post-send block and adds metrics::add_tx/add_dropped. No regressions to the free contract.
examples/grafana/otel_prometheus.cpp Wires the OTel Prometheus exporter from an environment variable. The entire implementation body is wrapped in #if defined(DAQIRI_GRAFANA_PROMETHEUS), which is redundant since CMakeLists.txt already handles file exclusion.
CMakeLists.txt Adds top-level option(DAQIRI_ENABLE_OTEL_METRICS ...) and propagates it to the pkg-config output. The same option() is duplicated in src/CMakeLists.txt.
src/CMakeLists.txt Conditionally compiles metrics.cpp and links opentelemetry-cpp::api when DAQIRI_ENABLE_OTEL_METRICS is ON. Duplicates the same option() already declared in the root CMakeLists.txt.
examples/CMakeLists.txt Correctly gates otel_prometheus.cpp on the Prometheus exporter target being available; falls back gracefully with a status message when the exporter is absent.
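The compile-time switch described for src/metrics.h can be illustrated with a small sketch. The namespace `sketch_metrics`, the `add_rx` signature, `kEnabled`, and `handle_burst` are assumptions made for illustration; only the macro name DAQIRI_ENABLE_OTEL_METRICS comes from the PR.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: the build system defines this to 1 when the CMake option is ON.
#ifndef DAQIRI_ENABLE_OTEL_METRICS
#define DAQIRI_ENABLE_OTEL_METRICS 0
#endif

namespace sketch_metrics {  // hypothetical namespace, not daqiri::metrics

#if DAQIRI_ENABLE_OTEL_METRICS
// Real declaration; the definition would live in a .cpp that is only
// compiled when the option is ON.
void add_rx(std::uint64_t packets, std::uint64_t bytes);
constexpr bool kEnabled = true;
#else
// Inline no-op stub: call sites compile unchanged and the optimizer can
// drop the call entirely.
inline void add_rx(std::uint64_t /*packets*/, std::uint64_t /*bytes*/) {}
constexpr bool kEnabled = false;
#endif

}  // namespace sketch_metrics

// A hypothetical call site that needs no #if of its own.
inline std::uint64_t handle_burst(std::uint64_t packets,
                                  std::uint64_t bytes_per_packet) {
  const std::uint64_t bytes = packets * bytes_per_packet;
  sketch_metrics::add_rx(packets, bytes);
  return bytes;
}
```

With this layout the backends call the metrics functions unconditionally, and the OFF build carries no OTel dependency.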

Sequence Diagram

sequenceDiagram
    participant App as Application
    participant Common as daqiri::shutdown()
    participant DPDK as DpdkStats::Run()
    participant RDMA as rdma_thread()
    participant Socket as SocketMgr TX/RX
    participant Reg as metrics::Registry
    participant CS as CounterSet (shared_ptr)
    participant OTel as OTel SDK (collection thread)
    participant Prom as Prometheus / Grafana

    App->>Common: daqiri_init()
    Common->>Reg: get_or_create_queue(backend, iface, port, queue)
    Reg-->>Common: "shared_ptr<CounterSet>"

    DPDK->>CS: set_rx_packets / set_tx_packets / set_dropped
    RDMA->>CS: add_rx / add_tx / add_dropped
    Socket->>CS: add_rx / add_tx / add_dropped

    OTel->>Reg: observe_rx_packets callback
    Reg->>Reg: snapshot_counters() [acquires mutex briefly]
    Reg->>CS: rx_packets.load()
    Reg-->>OTel: Observe(value, attrs)

    Prom->>OTel: HTTP GET /metrics
    OTel-->>Prom: "daqiri_rx_packets_total{...}"

    App->>Common: daqiri::shutdown()
    Common->>Reg: shutdown() [moves instruments, clears state, releases lock]
    Reg->>OTel: RemoveCallback() [outside lock — no deadlock]
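The collection path in the diagram (backends update a shared CounterSet, the registry snapshots its handles under the mutex, then reads the atomics) might look roughly like the sketch below. The class `CounterRegistry`, the string key, and the return type of `observe_rx_packets` are assumptions; the real registry keys on backend, interface, port, and queue and reports values through the OTel observer interface.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical counter block; member names follow the diagram.
struct CounterSet {
  std::atomic<std::uint64_t> rx_packets{0};
  std::atomic<std::uint64_t> rx_bytes{0};

  // Called from the data path; no lock is taken.
  void add_rx(std::uint64_t packets, std::uint64_t bytes) {
    rx_packets.fetch_add(packets, std::memory_order_relaxed);
    rx_bytes.fetch_add(bytes, std::memory_order_relaxed);
  }
};

class CounterRegistry {
 public:
  // Returns the existing CounterSet for a key, or creates one.
  std::shared_ptr<CounterSet> get_or_create_queue(const std::string& key) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto& slot = counters_[key];
    if (!slot) slot = std::make_shared<CounterSet>();
    return slot;
  }

  // Stand-in for the observe callback: copy the handles while holding the
  // lock, then read the atomics after releasing it.
  std::vector<std::pair<std::string, std::uint64_t>> observe_rx_packets() {
    std::vector<std::pair<std::string, std::shared_ptr<CounterSet>>> snapshot;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      snapshot.assign(counters_.begin(), counters_.end());
    }
    std::vector<std::pair<std::string, std::uint64_t>> out;
    for (const auto& entry : snapshot) {
      out.emplace_back(entry.first,
                       entry.second->rx_packets.load(std::memory_order_relaxed));
    }
    return out;
  }

 private:
  std::mutex mutex_;
  std::map<std::string, std::shared_ptr<CounterSet>> counters_;
};
```

Holding a `shared_ptr` per thread keeps the hot path to a single atomic add, while the mutex is only contended when a queue is registered or a scrape takes its snapshot.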

Reviews (5): Last reviewed commit: "#111 - Populate DPDK port metrics"

Comment thread examples/daqiri_bench_raw_tx_rx.yaml Outdated
Comment thread examples/daqiri_bench_raw_tx_rx.yaml Outdated
Comment thread examples/grafana/run-benchmark.sh Outdated
Comment thread src/metrics.cpp
Comment thread src/managers/dpdk/daqiri_dpdk_stats.cpp
@cliffburdick cliffburdick changed the title Add OpenTelemetry Metrics #112 - Add OpenTelemetry Metrics Jun 2, 2026
Signed-off-by: Cliff Burdick <cburdick@nvidia.com>
Signed-off-by: Cliff Burdick <cburdick@nvidia.com>
Signed-off-by: Cliff Burdick <cburdick@nvidia.com>
Signed-off-by: Cliff Burdick <cburdick@nvidia.com>
Signed-off-by: Cliff Burdick <cburdick@nvidia.com>
Signed-off-by: Cliff Burdick <cburdick@nvidia.com>
@cliffburdick cliffburdick force-pushed the cburdick/otel-metrics branch from 150478a to 5129d27 on June 2, 2026 01:04
Signed-off-by: Denis Leshchev <dleshchev@nvidia.com>
@dleshchev
Collaborator

dleshchev commented Jun 2, 2026

While running the Grafana example, we noticed the DPDK per-interface Prometheus series were staying at zero even though the queue-level series were moving. Specifically, the queue_id="all" counters for daqiri_rx_bytes_total, daqiri_tx_bytes_total, daqiri_rx_packets_total, and daqiri_tx_packets_total were not populated when queue xstats were available, while queue_id="0" had live values.

I pushed commit 134ad5e to always populate the DPDK queue_id="all" packet/byte metrics from rte_eth_stats_get(), while keeping queue-specific metrics sourced from queue xstats. I rebuilt and reran the dashboard stack; the queue_id="all" RX/TX byte counters are now nonzero and track the port-level traffic.
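A rough sketch of the behavior after this fix: the queue_id="all" series is always taken from the port totals, and per-queue series come from parsed xstat names. It does not call DPDK. `QueueXstat`, `PortTotals`, `Series`, and `build_rx_packet_series` are hypothetical, `parse_queue_xstat` here is a free function rather than the lambda in the PR, and the name format `rx_q<N>_packets` is an assumption about the queue xstats a driver reports.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical parse result for names assumed to look like "rx_q0_packets".
struct QueueXstat {
  bool is_rx = false;
  unsigned queue = 0;
  std::string field;  // e.g. "packets" or "bytes"
};

// Returns true and fills `out` if `name` matches (rx|tx)_q<digits>_<field>.
inline bool parse_queue_xstat(const std::string& name, QueueXstat& out) {
  if (name.size() < 6) return false;
  const std::string prefix = name.substr(0, 4);
  if (prefix != "rx_q" && prefix != "tx_q") return false;
  std::size_t pos = 4;
  unsigned queue = 0;
  while (pos < name.size() && name[pos] >= '0' && name[pos] <= '9') {
    queue = queue * 10 + static_cast<unsigned>(name[pos] - '0');
    ++pos;
  }
  // Require at least one digit, then '_' followed by a non-empty field.
  if (pos == 4 || pos + 1 >= name.size() || name[pos] != '_') return false;
  out.is_rx = (prefix[0] == 'r');
  out.queue = queue;
  out.field = name.substr(pos + 1);
  return true;
}

// Stand-in for the port-level fields read via rte_eth_stats_get().
struct PortTotals {
  std::uint64_t ipackets;
  std::uint64_t ibytes;
};

// Keyed by the queue_id label: "all" for the port, "0", "1", ... per queue.
using Series = std::map<std::string, std::uint64_t>;

inline Series build_rx_packet_series(
    const PortTotals& port, const std::map<std::string, std::uint64_t>& xstats) {
  Series series;
  series["all"] = port.ipackets;  // always set, even when queue xstats exist
  for (const auto& kv : xstats) {
    QueueXstat q;
    if (parse_queue_xstat(kv.first, q) && q.is_rx && q.field == "packets") {
      series[std::to_string(q.queue)] = kv.second;
    }
  }
  return series;
}
```

The point of the ordering is that the presence of queue xstats no longer decides whether the port-level series is written.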

@cliffburdick cliffburdick merged commit de4743e into main Jun 2, 2026
5 checks passed
@cliffburdick cliffburdick deleted the cburdick/otel-metrics branch June 2, 2026 22:14
2 participants