feat: implement observability metrics export (#20) #88

Merged: deepjoy merged 1 commit into main from metrics on Mar 24, 2026

@deepjoy (Owner) commented Mar 24, 2026

Summary

  • Add always-on internal AtomicU64 counters and a public MetricsSnapshot struct for consumers who don't use the metrics crate (CLIs, TUI dashboards)
  • Add optional metrics crate integration (behind the metrics Cargo feature) that emits ~30 counters, gauges, and histograms via the standard facade — consumers choose their exporter (Prometheus, StatsD, Datadog)
  • Add SchedulerBuilder methods for customizing metric names (metrics_prefix), global labels (metrics_label), and suppressing specific metrics (disable_metric)
  • Instrument all scheduler code paths: submit, dispatch, completion, failure, retry, dead-letter, gate denial, rate-limit throttle, group pause/resume, expiry, and dependency failure
  • Add duration tracking through CompletionMsg/FailureMsg for execution time histograms and queue wait histograms at dispatch time

Closes #20
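
The always-on counters and the snapshot struct described above can be pictured roughly as follows. This is a minimal sketch, not the code in `src/scheduler/counters.rs`: the three fields and the `record_*` helpers are assumptions chosen for illustration, and the real `MetricsSnapshot` also carries gauges.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Always-on counters. Field names are illustrative assumptions,
/// not the actual layout of the crate's `SchedulerCounters`.
#[derive(Default)]
pub struct SchedulerCounters {
    submitted: AtomicU64,
    completed: AtomicU64,
    failed: AtomicU64,
}

/// Plain-data copy of the counters, safe to hand to a CLI or TUI.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MetricsSnapshot {
    pub submitted: u64,
    pub completed: u64,
    pub failed: u64,
}

impl SchedulerCounters {
    // Relaxed is enough here: each counter is independent and only
    // needs its own increments to be atomic, not ordered against others.
    pub fn record_submitted(&self) {
        self.submitted.fetch_add(1, Ordering::Relaxed);
    }

    pub fn record_completed(&self) {
        self.completed.fetch_add(1, Ordering::Relaxed);
    }

    pub fn record_failed(&self) {
        self.failed.fetch_add(1, Ordering::Relaxed);
    }

    /// Reads each counter once. The fields are loaded one after another,
    /// so the result is not a single atomic cut across all counters.
    pub fn snapshot(&self) -> MetricsSnapshot {
        MetricsSnapshot {
            submitted: self.submitted.load(Ordering::Relaxed),
            completed: self.completed.load(Ordering::Relaxed),
            failed: self.failed.load(Ordering::Relaxed),
        }
    }
}
```

A consumer that does not use the metrics crate would poll `snapshot()` on a timer and diff consecutive snapshots to derive rates.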

New files

| File | Purpose |
| --- | --- |
| src/scheduler/counters.rs | SchedulerCounters (always-on atomics) + MetricsSnapshot public struct |
| src/scheduler/metrics_bridge.rs | MetricsEmitter: feature-gated metrics crate facade wrapper |
| tests/integration/metrics.rs | 8 integration tests for counter correctness |
| docs/metrics.md | User-facing guide: metric reference, dashboard layout, alert rules, builder API |

Modified files (13)

| File | Change |
| --- | --- |
| Cargo.toml | metrics = { version = "0.24", optional = true }, metrics feature flag |
| src/lib.rs | Re-export MetricsSnapshot, feature flag + metrics crate docs |
| src/scheduler/mod.rs | counters/metrics_bridge modules, MetricsConfig, duration on coalescing messages, new fields on SchedulerInner |
| src/scheduler/builder.rs | metrics_prefix(), metrics_label(), disable_metric() builder methods; describe_metrics() at build time |
| src/scheduler/queries.rs | Scheduler::metrics_snapshot() method |
| src/scheduler/gate.rs | counters in GateContext; gate_denials + rate_limit_throttles at each denial path |
| src/scheduler/run_loop.rs | Gauge updates in poll_and_dispatch(); expired counter; rate limit token gauges |
| src/scheduler/spawn.rs | Dispatch counter + queue wait histogram; duration capture; inline retry counters |
| src/scheduler/spawn/context.rs | counters + emitter in SpawnContext |
| src/scheduler/spawn/completion.rs | Completion counter + duration histogram |
| src/scheduler/spawn/failure.rs | Failure/retry/dead-letter/dependency counters + duration histogram |
| src/scheduler/submit.rs | Submit/supersede/batch counters + emitter calls |
| src/scheduler/control.rs | Group pause/resume counters + emitter calls |

Design decisions

  • Dual-emit: each instrumentation point increments an AtomicU64 (always) AND calls the MetricsEmitter (only with #[cfg(feature = "metrics")]). The atomics serve non-metrics consumers; the metrics crate adds labels, histograms, and gauge semantics.
  • Zero-cost when unused: all metrics::* calls are behind #[cfg(feature = "metrics")]. Internal counters cost a few cache lines of atomics with Relaxed ordering.
  • Bounded label cardinality: only type, module, group, and reason appear as labels. Never task_id, key, or user-provided tags.
  • Inline retry coverage: the zero-delay inline retry path in spawn.rs now increments failed, failed_retryable, and retried counters (previously this fast path bypassed failure accounting).
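
The dual-emit and zero-cost points above might look like this at a single instrumentation site. It is a sketch under assumptions: the `Instrumented` type, the `emit_dispatch` helper, and the metric name `scheduler_tasks_dispatched_total` are hypothetical, and only the `metrics::counter!` macro itself comes from the metrics crate. With the feature off, the facade call is compiled out and only the atomic increment remains.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for a struct that owns the internal counters.
pub struct Instrumented {
    dispatched: AtomicU64,
}

impl Instrumented {
    pub fn new() -> Self {
        Self { dispatched: AtomicU64::new(0) }
    }

    /// One instrumentation point: always bump the atomic, then forward to
    /// the facade helper (a no-op unless the `metrics` feature is enabled).
    pub fn record_dispatch(&self, task_type: &str, module: &str) {
        self.dispatched.fetch_add(1, Ordering::Relaxed);
        emit_dispatch(task_type, module);
    }

    pub fn dispatched(&self) -> u64 {
        self.dispatched.load(Ordering::Relaxed)
    }
}

/// Feature-gated path: emits through the `metrics` facade. Only
/// low-cardinality labels (type, module) are attached, never task IDs.
/// The metric name is an assumption made for this sketch.
#[cfg(feature = "metrics")]
fn emit_dispatch(task_type: &str, module: &str) {
    metrics::counter!(
        "scheduler_tasks_dispatched_total",
        "type" => task_type.to_owned(),
        "module" => module.to_owned()
    )
    .increment(1);
}

/// Fallback when the feature is off: nothing to emit.
#[cfg(not(feature = "metrics"))]
fn emit_dispatch(_task_type: &str, _module: &str) {}
```

Keeping the `cfg` split inside one small helper means call sites stay identical in both feature states, which is one way to keep clippy clean on both.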

Add always-on internal atomic counters and optional `metrics` crate
integration for production monitoring (Prometheus, StatsD, Datadog).

Phase 0 — SchedulerCounters + MetricsSnapshot:
- New `src/scheduler/counters.rs` with `SchedulerCounters` (AtomicU64)
  and public `MetricsSnapshot` struct
- `Scheduler::metrics_snapshot()` returns cumulative counters + gauges

Phase 1 — `metrics` crate integration:
- New `metrics` Cargo feature with `metrics = "0.24"` optional dep
- New `src/scheduler/metrics_bridge.rs` with `MetricsEmitter` that
  emits counters, gauges, and histograms via the `metrics` facade
- Metric descriptions registered once at build time

Phase 2 — Builder API:
- `metrics_prefix()`, `metrics_label()`, `disable_metric()` on
  SchedulerBuilder for customizing metric names and labels
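
As an illustration of what these three builder methods configure, here is a self-contained sketch. The method names come from this PR, but the signatures, the `MetricsConfigSketch` type, the `resolve` helper, the `_` separator between prefix and name, and the metric names in the usage note are all assumptions; see `docs/metrics.md` for the real API.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the configuration the builder collects.
#[derive(Debug, Default, Clone)]
pub struct MetricsConfigSketch {
    prefix: Option<String>,
    labels: Vec<(String, String)>,
    disabled: HashSet<String>,
}

impl MetricsConfigSketch {
    /// Prepend `prefix` to every emitted metric name.
    pub fn metrics_prefix(mut self, prefix: impl Into<String>) -> Self {
        self.prefix = Some(prefix.into());
        self
    }

    /// Attach a global label to every emitted metric.
    pub fn metrics_label(mut self, key: impl Into<String>, value: impl Into<String>) -> Self {
        self.labels.push((key.into(), value.into()));
        self
    }

    /// Suppress one metric by its unprefixed name.
    pub fn disable_metric(mut self, name: impl Into<String>) -> Self {
        self.disabled.insert(name.into());
        self
    }

    /// Full name to emit under, or `None` when the metric is suppressed.
    /// Joining with `_` is an assumption of this sketch.
    pub fn resolve(&self, name: &str) -> Option<String> {
        if self.disabled.contains(name) {
            return None;
        }
        Some(match &self.prefix {
            Some(p) => format!("{}_{}", p, name),
            None => name.to_owned(),
        })
    }

    pub fn labels(&self) -> &[(String, String)] {
        &self.labels
    }
}
```

Chaining `.metrics_prefix("myapp").disable_metric("queue_wait_seconds")` on this sketch makes `resolve("tasks_submitted_total")` yield `myapp_tasks_submitted_total` and `resolve("queue_wait_seconds")` yield `None`.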

Phase 3 — Instrumentation points:
- submit.rs: submitted, superseded, batch counters
- spawn.rs: dispatched counter, queue wait histogram, inline retry
  counters
- completion.rs: completed counter, duration histogram
- failure.rs: failed, retried, dead_lettered, dependency_failures
  counters + duration histogram
- gate.rs: gate_denials, rate_limit_throttles counters at each
  denial point
- control.rs: group_pauses, group_resumes counters
- run_loop.rs: expired counter, gauge updates (pending, running,
  blocked, paused, waiting, pressure, module running, rate limit
  tokens)

Phase 4 — Duration plumbing:
- Added `duration: Duration` to CompletionMsg and FailureMsg
- Captured via `started_at.elapsed()` in spawned task
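
The duration plumbing can be sketched like this. The `duration` field and the `started_at.elapsed()` capture are from this PR; everything else is an assumption for illustration. In particular, the sketch swaps in `std::thread` and `std::sync::mpsc` so it stays self-contained, the `task_id` field and `run_and_report` function are hypothetical, and the real `CompletionMsg` has other fields.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Simplified message; only `duration` mirrors the field added in this PR.
pub struct CompletionMsg {
    pub task_id: u64,
    pub duration: Duration,
}

/// Runs `work` on a separate thread, measures its wall-clock execution
/// time, and reports it to the receiver that owns the histograms.
pub fn run_and_report<F>(
    task_id: u64,
    work: F,
    tx: mpsc::Sender<CompletionMsg>,
) -> thread::JoinHandle<()>
where
    F: FnOnce() + Send + 'static,
{
    thread::spawn(move || {
        // Start the clock inside the spawned task so queue wait is excluded.
        let started_at = Instant::now();
        work();
        let duration = started_at.elapsed();
        // Ignore the send error: the receiver may already have shut down.
        let _ = tx.send(CompletionMsg { task_id, duration });
    })
}
```

Measuring inside the spawned task keeps execution time separate from queue wait, which the PR records in its own histogram at dispatch time.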

Docs:
- New `docs/metrics.md` with full metric reference, dashboard layout,
  alert rules, and builder API examples
- Cross-linked from quick-start, configuration, progress-and-events,
  query-apis, and io-and-backpressure docs
- Feature flag and builder methods documented in configuration.md
- Metrics section added to lib.rs crate docs

Tests:
- 8 new integration tests covering submit/dispatch/complete, failure/
  retry, dead-letter, batch, group pause/resume, gauges, and supersede
- All 377 existing tests continue to pass
- Zero clippy warnings on both feature flag states
@deepjoy deepjoy enabled auto-merge (squash) March 24, 2026 14:44
@deepjoy deepjoy merged commit 685f93f into main Mar 24, 2026
2 checks passed
@github-actions github-actions Bot mentioned this pull request Mar 24, 2026
@github-actions (Contributor)

Benchmark Comparison

group                                       main                                    pr
-----                                       ----                                    --
backoff_delay/constant                      1.01     44.2±0.22ns 431.4 MElem/sec    1.00     43.7±0.67ns 436.8 MElem/sec
backoff_delay/exponential                   1.02    190.8±0.42ns 100.0 MElem/sec    1.00    186.3±0.76ns 102.4 MElem/sec
backoff_delay/exponential_jitter            1.10    449.8±1.49ns 42.4 MElem/sec     1.00    407.6±1.49ns 46.8 MElem/sec
backoff_delay/linear                        1.00     75.9±0.09ns 251.4 MElem/sec    1.01     76.3±0.35ns 250.1 MElem/sec
batch_submit/1000                           1.04     33.5±2.06ms 29.1 KElem/sec     1.00     32.2±2.27ms 30.3 KElem/sec
byte_progress/byte_reporting_500            1.00    190.1±3.41ms  2.6 KElem/sec     1.02    194.8±4.30ms  2.5 KElem/sec
byte_progress/noop_500                      1.00    177.3±3.16ms  2.8 KElem/sec     1.09    193.3±4.11ms  2.5 KElem/sec
byte_progress_snapshot/100_tasks            1.00     80.8±2.27ms  1238 Elem/sec     1.03     83.3±2.83ms  1201 Elem/sec
concurrency_scaling/1                       1.00    367.2±3.86ms  1361 Elem/sec     1.60    586.7±6.52ms   852 Elem/sec
concurrency_scaling/2                       1.00    272.2±3.86ms  1837 Elem/sec     1.37    373.3±8.28ms  1339 Elem/sec
concurrency_scaling/4                       1.00    227.6±7.57ms  2.1 KElem/sec     1.01    229.8±4.46ms  2.1 KElem/sec
concurrency_scaling/8                       1.00    178.3±3.61ms  2.7 KElem/sec     1.08    192.2±3.44ms  2.5 KElem/sec
count_by_tags/100                           1.03    128.0±2.69µs  7.6 KElem/sec     1.00    124.5±2.97µs  7.8 KElem/sec
count_by_tags/1000                          1.01    215.7±3.09µs  4.5 KElem/sec     1.00    213.4±3.21µs  4.6 KElem/sec
count_by_tags/5000                          1.00    599.9±5.03µs  1667 Elem/sec     1.00    600.3±9.28µs  1665 Elem/sec
dep_chain_dispatch/10                       1.00     10.8±0.14ms   928 Elem/sec     1.40     15.0±0.15ms   665 Elem/sec
dep_chain_dispatch/25                       1.00     26.3±0.37ms   952 Elem/sec     1.42     37.2±0.75ms   671 Elem/sec
dep_chain_dispatch/50                       1.00     52.8±0.72ms   946 Elem/sec     1.42     74.7±1.26ms   668 Elem/sec
dep_chain_submit/10                         1.00      3.0±0.11ms  3.2 KElem/sec     1.00      3.0±0.10ms  3.3 KElem/sec
dep_chain_submit/200                        1.00     76.4±4.13ms  2.6 KElem/sec     1.01     77.1±3.91ms  2.5 KElem/sec
dep_chain_submit/50                         1.00     16.5±0.65ms  3.0 KElem/sec     1.02     16.8±0.95ms  2.9 KElem/sec
dep_fan_in_dispatch/10                      1.00      5.9±0.08ms  1877 Elem/sec     1.21      7.1±0.16ms  1557 Elem/sec
dep_fan_in_dispatch/100                     1.00     40.4±0.88ms  2.4 KElem/sec     1.06     42.8±0.97ms  2.3 KElem/sec
dep_fan_in_dispatch/50                      1.00     21.2±0.25ms  2.3 KElem/sec     1.10     23.3±0.45ms  2.1 KElem/sec
dispatch_and_complete/1000                  1.00    357.5±6.01ms  2.7 KElem/sec     1.09    389.4±5.55ms  2.5 KElem/sec
dispatch_group_scaling/1                    1.00    410.9±5.74ms  1216 Elem/sec     1.01    415.4±5.93ms  1203 Elem/sec
dispatch_group_scaling/10                   1.00    413.1±6.35ms  1210 Elem/sec     1.01    416.3±6.33ms  1200 Elem/sec
dispatch_group_scaling/100                  1.00    414.8±5.78ms  1205 Elem/sec     1.01    419.6±8.40ms  1191 Elem/sec
dispatch_group_scaling/50                   1.00    413.2±7.12ms  1209 Elem/sec     1.01    417.3±7.18ms  1198 Elem/sec
dispatch_no_groups/500                      1.00    178.8±4.44ms  2.7 KElem/sec     1.10    196.8±4.32ms  2.5 KElem/sec
dispatch_one_group/500                      1.00   410.7±17.43ms  1217 Elem/sec     1.02    418.2±8.52ms  1195 Elem/sec
dispatch_permanent_failure/500              1.00    345.8±3.85ms  1445 Elem/sec     1.00    345.4±5.64ms  1447 Elem/sec
history_by_type/100                         1.00    219.8±6.44µs  4.4 KElem/sec     1.01    221.8±6.97µs  4.4 KElem/sec
history_by_type/1000                        1.00   800.4±44.94µs  1249 Elem/sec     1.00   800.9±41.50µs  1248 Elem/sec
history_by_type/5000                        1.00   796.5±55.97µs  1255 Elem/sec     1.00   793.0±36.42µs  1260 Elem/sec
history_query/100                           1.00   428.4±21.36µs  2.3 KElem/sec     1.00   428.5±19.51µs  2.3 KElem/sec
history_query/1000                          1.00   425.0±17.02µs  2.3 KElem/sec     1.04   439.9±18.09µs  2.2 KElem/sec
history_query/5000                          1.01   429.1±26.79µs  2.3 KElem/sec     1.00   424.3±14.43µs  2.3 KElem/sec
history_stats/100                           1.03    125.7±0.86µs  7.8 KElem/sec     1.00    122.3±0.71µs  8.0 KElem/sec
history_stats/1000                          1.00    190.4±0.90µs  5.1 KElem/sec     1.00    189.5±0.84µs  5.2 KElem/sec
history_stats/5000                          1.00    478.9±3.26µs  2.0 KElem/sec     1.02    488.1±2.69µs  2.0 KElem/sec
mixed_priority_dispatch/500                 1.00    226.7±7.14ms  2.2 KElem/sec     1.01    230.0±3.96ms  2.1 KElem/sec
peek_next/100                               1.05    121.5±3.01µs  8.0 KElem/sec     1.00    115.9±3.00µs  8.4 KElem/sec
peek_next/1000                              1.04    118.3±2.94µs  8.3 KElem/sec     1.00    114.1±2.85µs  8.6 KElem/sec
peek_next/5000                              1.05    121.9±3.36µs  8.0 KElem/sec     1.00    116.3±5.91µs  8.4 KElem/sec
query_ids_by_tags/100                       1.05    185.0±2.03µs  5.3 KElem/sec     1.00    176.3±3.63µs  5.5 KElem/sec
query_ids_by_tags/1000                      1.01   818.2±10.04µs  1222 Elem/sec     1.00    811.0±6.13µs  1233 Elem/sec
query_ids_by_tags/5000                      1.04      3.6±0.03ms   274 Elem/sec     1.00      3.5±0.03ms   284 Elem/sec
retryable_dead_letter/constant              1.00    106.3±1.24ms   940 Elem/sec     1.00    106.6±1.27ms   937 Elem/sec
retryable_dead_letter/exponential           1.00    105.1±1.15ms   951 Elem/sec     1.02    107.6±1.41ms   929 Elem/sec
retryable_dead_letter/exponential_jitter    1.00    105.7±0.79ms   945 Elem/sec     1.02    108.1±1.75ms   925 Elem/sec
retryable_dead_letter/linear                1.00    106.2±1.16ms   941 Elem/sec     1.00    105.9±1.42ms   944 Elem/sec
submit_dedup_hit/1000                       1.02    207.0±5.50ms  4.7 KElem/sec     1.00    202.3±6.34ms  4.8 KElem/sec
submit_tasks/1000                           1.01    178.9±5.02ms  5.5 KElem/sec     1.00    176.8±5.18ms  5.5 KElem/sec
submit_with_tags/0                          1.02     89.4±2.65ms  5.5 KElem/sec     1.00     87.6±3.07ms  5.6 KElem/sec
submit_with_tags/10                         1.01    236.9±9.51ms  2.1 KElem/sec     1.00   234.4±11.35ms  2.1 KElem/sec
submit_with_tags/20                         1.01   386.1±15.26ms  1295 Elem/sec     1.00   382.0±17.78ms  1308 Elem/sec
submit_with_tags/5                          1.01    163.3±6.59ms  3.0 KElem/sec     1.00    161.3±7.75ms  3.0 KElem/sec
tag_values/100                              1.03    133.0±3.54µs  7.3 KElem/sec     1.00    129.6±3.11µs  7.5 KElem/sec
tag_values/1000                             1.00    193.2±3.09µs  5.1 KElem/sec     1.00    193.5±3.48µs  5.0 KElem/sec
tag_values/5000                             1.00    455.7±5.02µs  2.1 KElem/sec     1.01    459.8±6.40µs  2.1 KElem/sec

Development

Successfully merging this pull request may close these issues.

Feature: Observability metrics export (metrics crate integration)
