
feat(otel): reduce metric cardinality ~30-55% #11

Merged
dotwaffle merged 3 commits into main from otel-cardinality-reduction
Apr 14, 2026

Conversation

@dotwaffle (Owner)

Summary

Quick task 260414-2rc — cut OTel metric series cost by an estimated 30–55% without losing actionable signal. Four cardinality-reduction changes landed, guided by the inventory at .planning/quick/260414-2rc-reduce-otel-metric-cardinality-per-plan-/260414-2rc-PLAN.md:

  • Delete pdbplus.sync.type.duration — per-type sync histogram was redundant with the aggregate pdbplus.sync.duration, cost ~143 series.
  • Drop http.server.{request,response}.body.size via sdkmetric.AggregationDrop{} views — ~50–100 series.
  • Explicit 5-boundary bucket set on rpc.server.duration ({10ms, 50ms, 250ms, 1s, 5s}) replacing the SDK-default 14-boundary set — ~250–550 series.
  • Omit fly.machine_id from the metric resource (keep it on traces/logs). Eliminates per-VM fan-out multiplier across every retained metric.

Changes

  • internal/otel/metrics.go — remove SyncTypeDuration instrument + registration.
  • internal/sync/worker.go — remove three SyncTypeDuration.Record call sites (fetch/upsert/delete-error loops) and their paired stepStart := time.Now(). Span, typeAttr, and per-type counters preserved.
  • internal/otel/provider.go — three sdkmetric.WithView entries on the MeterProvider; new buildResourceFiltered shared impl with buildResource (traces/logs) and buildMetricResource (metrics, no fly.machine_id) as thin wrappers.
  • internal/otel/provider_test.go — TestBuildMetricResource_OmitsFlyMachineID + TestBuildResource_IncludesFlyMachineID lock in the resource split.
  • internal/otel/metrics_test.go + internal/sync/worker_test.go — assertions on the deleted metric removed.

Verification

  • go build ./... — clean
  • go vet ./... — clean
  • go test -race ./... — all packages pass, no data races
  • golangci-lint run — 0 issues
  • govulncheck ./... — no vulnerabilities
  • Grep confirms SyncTypeDuration / pdbplus.sync.type.duration are gone from the tree.

Deferred manual check

Task 3 of the plan is a console-exporter smoke test (recipe in 260414-2rc-SUMMARY.md under "Deferred Human Verification"): run the binary with OTEL_METRICS_EXPORTER=console, hit /ui/, /rest/v1/net, and a gRPC List, then confirm that the dropped metrics and fly.machine_id are absent from the metric export stream and that rpc.server.duration shows Bounds=[0.01, 0.05, 0.25, 1, 5].

Key decisions

  • Keep pdbplus.sync.duration aggregate; drop only the per-type variant.
  • Per-type counters (SyncTypeObjects, SyncTypeDeleted, SyncTypeFetchErrors, SyncTypeUpsertErrors, SyncTypeFallback) retained — they're cheap (13 series each) and actionable.
  • Bucket set {10ms, 50ms, 250ms, 1s, 5s} chosen to span typical RPC latency without over-sampling.
  • buildResourceFiltered is the shared impl so the trace/log resource continues to carry fly.machine_id unchanged.

Plan & summary

  • Plan: .planning/quick/260414-2rc-reduce-otel-metric-cardinality-per-plan-/260414-2rc-PLAN.md
  • Summary: .planning/quick/260414-2rc-reduce-otel-metric-cardinality-per-plan-/260414-2rc-SUMMARY.md

🤖 Generated with Claude Code

The per-type sync duration histogram produced ~143 series (7 buckets x
13 types x ~1.5 fan-out). Aggregate `pdbplus.sync.duration` already
covers overall sync latency, so per-type latency was redundant.

- Remove SyncTypeDuration var and registration from internal/otel/metrics.go
- Remove three pdbotel.SyncTypeDuration.Record call sites in internal/sync/worker.go
  (fetch, upsert, delete passes). stepStart removed since it was only
  referenced by the deleted Record calls; stepSpan, typeAttr, and
  error-attribution counters (SyncTypeFetchErrors, SyncTypeUpsertErrors,
  SyncTypeFallback, SyncTypeObjects, SyncTypeDeleted) are retained.
- Update internal/otel/metrics_test.go: remove SyncTypeDuration from the
  per-type instrument table, drop the Record call in the no-panic test,
  drop the duration assertion in the records-values test.
- Update internal/sync/worker_test.go: drop the per-type duration metric
  lookup in TestSyncRecordsSuccessMetrics.

Part of the OTel metric cardinality reduction plan (ethereal-petting-pelican).
MeterProvider now registers three views to curb cardinality:
  1. Drop http.server.request.body.size (low debugging value).
  2. Drop http.server.response.body.size (low debugging value).
  3. Override rpc.server.duration to explicit bucket boundaries
     {0.01, 0.05, 0.25, 1, 5} (5 boundaries -> 6 buckets) instead of
     the SDK default 14-boundary set.
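Assuming the standard view API of the OTel Go SDK (go.opentelemetry.io/otel/sdk/metric), the three registrations sketch out roughly as follows; reader and exporter wiring is elided, and instrument names are those listed above:

```go
views := []sdkmetric.View{
	// 1-2. Drop the HTTP body-size histograms entirely.
	sdkmetric.NewView(
		sdkmetric.Instrument{Name: "http.server.request.body.size"},
		sdkmetric.Stream{Aggregation: sdkmetric.AggregationDrop{}},
	),
	sdkmetric.NewView(
		sdkmetric.Instrument{Name: "http.server.response.body.size"},
		sdkmetric.Stream{Aggregation: sdkmetric.AggregationDrop{}},
	),
	// 3. Shrink rpc.server.duration to 5 boundaries (6 buckets), in seconds.
	sdkmetric.NewView(
		sdkmetric.Instrument{Name: "rpc.server.duration"},
		sdkmetric.Stream{Aggregation: sdkmetric.AggregationExplicitBucketHistogram{
			Boundaries: []float64{0.01, 0.05, 0.25, 1, 5},
		}},
	),
}

mp := sdkmetric.NewMeterProvider(
	sdkmetric.WithReader(reader), // reader construction elided
	sdkmetric.WithView(views...),
)
```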

Split resource attributes so MeterProvider receives a resource that
omits fly.machine_id (prevents per-VM metric fan-out across every Fly
replica), while TracerProvider and LoggerProvider continue to receive
the full resource including fly.machine_id for per-VM debugging of
traces and logs.

Implementation: buildResourceFiltered carries the shared body;
buildResource wraps it with includeMachineID=true, buildMetricResource
wraps it with includeMachineID=false. No new dependencies; uses
existing sdkmetric alias.

Tests:
  - TestBuildMetricResource_OmitsFlyMachineID asserts absence on metrics.
  - TestBuildResource_IncludesFlyMachineID asserts presence on traces/logs.

Expected impact (per approved plan ethereal-petting-pelican): 30-55%
metric series reduction.
@github-actions

Code Metrics Report

Coverage: 83.6%
Test Execution Time: 2m28s

Code coverage of files in pull request scope (83.1%)

Files — Coverage
internal/otel/metrics.go — 76.0%
internal/otel/provider.go — 92.3%
internal/sync/worker.go — 83.0%

Reported by octocov

dotwaffle merged commit 3e0e56b into main Apr 14, 2026
5 checks passed
dotwaffle deleted the otel-cardinality-reduction branch April 14, 2026 02:21
dotwaffle added a commit that referenced this pull request Apr 14, 2026

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>