feat(otel): reduce metric cardinality ~30-55% #11
Merged
The per-type sync duration histogram produced ~143 series (7 buckets x 13 types x ~1.5 fan-out). Aggregate `pdbplus.sync.duration` already covers overall sync latency, so per-type latency was redundant.
- Remove SyncTypeDuration var and registration from internal/otel/metrics.go
- Remove three pdbotel.SyncTypeDuration.Record call sites in internal/sync/worker.go (fetch, upsert, delete passes). stepStart removed since it was only referenced by the deleted Record calls; stepSpan, typeAttr, and error-attribution counters (SyncTypeFetchErrors, SyncTypeUpsertErrors, SyncTypeFallback, SyncTypeObjects, SyncTypeDeleted) are retained.
- Update internal/otel/metrics_test.go: remove SyncTypeDuration from the per-type instrument table, drop the Record call in the no-panic test, drop the duration assertion in the records-values test.
- Update internal/sync/worker_test.go: drop the per-type duration metric lookup in TestSyncRecordsSuccessMetrics.
Part of the OTel metric cardinality reduction plan (ethereal-petting-pelican).
MeterProvider now registers three views to curb cardinality:
1. Drop http.server.request.body.size (low debugging value).
2. Drop http.server.response.body.size (low debugging value).
3. Override rpc.server.duration to explicit bucket boundaries
{0.01, 0.05, 0.25, 1, 5} (5 boundaries -> 6 buckets) instead of
the SDK default 14-boundary set.
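The three views listed above could be registered roughly as follows. This is a minimal sketch, not the code from internal/otel/provider.go: the helper name `cardinalityViews` is hypothetical, and reader/exporter wiring is left out. It assumes a recent `go.opentelemetry.io/otel/sdk/metric` release in which `AggregationDrop` and `AggregationExplicitBucketHistogram` live in the `metric` package itself.

```go
package main

import (
	"context"
	"log"

	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// cardinalityViews is a hypothetical helper that returns the three views
// described in the commit message.
func cardinalityViews() []sdkmetric.View {
	return []sdkmetric.View{
		// Views 1 and 2: drop the HTTP body-size histograms entirely.
		sdkmetric.NewView(
			sdkmetric.Instrument{Name: "http.server.request.body.size"},
			sdkmetric.Stream{Aggregation: sdkmetric.AggregationDrop{}},
		),
		sdkmetric.NewView(
			sdkmetric.Instrument{Name: "http.server.response.body.size"},
			sdkmetric.Stream{Aggregation: sdkmetric.AggregationDrop{}},
		),
		// View 3: five explicit boundaries, which yields six buckets.
		sdkmetric.NewView(
			sdkmetric.Instrument{Name: "rpc.server.duration"},
			sdkmetric.Stream{Aggregation: sdkmetric.AggregationExplicitBucketHistogram{
				Boundaries: []float64{0.01, 0.05, 0.25, 1, 5},
			}},
		),
	}
}

func main() {
	// Assumption: a real setup would also pass sdkmetric.WithReader(...)
	// and sdkmetric.WithResource(...); both are omitted in this sketch.
	mp := sdkmetric.NewMeterProvider(
		sdkmetric.WithView(cardinalityViews()...),
	)
	defer func() {
		if err := mp.Shutdown(context.Background()); err != nil {
			log.Printf("meter provider shutdown: %v", err)
		}
	}()
}
```

Views match on instrument name here, so instruments with other names keep their default aggregation.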
Split resource attributes so MeterProvider receives a resource that
omits fly.machine_id (prevents per-VM metric fan-out across every Fly
replica), while TracerProvider and LoggerProvider continue to receive
the full resource including fly.machine_id for per-VM debugging of
traces and logs.
Implementation: buildResourceFiltered carries the shared body;
buildResource wraps it with includeMachineID=true, buildMetricResource
wraps it with includeMachineID=false. No new dependencies; uses
existing sdkmetric alias.
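The wrapper arrangement described above can be sketched as below. The function names come from the commit message, but their signatures, the `Attrs` map type (a stand-in for the SDK resource type so the sketch runs without dependencies), the `service.name` value, and reading the ID from the `FLY_MACHINE_ID` environment variable are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
)

// Attrs is a stand-in for the SDK resource type (assumption: the real
// code builds an OpenTelemetry resource rather than a plain map).
type Attrs map[string]string

// buildResourceFiltered carries the shared body. The machine ID is only
// attached when includeMachineID is true and an ID is available.
func buildResourceFiltered(machineID string, includeMachineID bool) Attrs {
	attrs := Attrs{"service.name": "pdbplus"} // hypothetical service name
	if includeMachineID && machineID != "" {
		attrs["fly.machine_id"] = machineID
	}
	return attrs
}

// buildResource is the full resource for traces and logs.
func buildResource(machineID string) Attrs {
	return buildResourceFiltered(machineID, true)
}

// buildMetricResource omits fly.machine_id to avoid per-VM metric fan-out.
func buildMetricResource(machineID string) Attrs {
	return buildResourceFiltered(machineID, false)
}

func main() {
	id := os.Getenv("FLY_MACHINE_ID") // assumption about where the ID comes from
	fmt.Println("traces/logs:", buildResource(id))
	fmt.Println("metrics:    ", buildMetricResource(id))
}
```

Keeping one shared body means any attribute added later reaches all three providers, and only the machine ID differs between them.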
Tests:
- TestBuildMetricResource_OmitsFlyMachineID asserts absence on metrics.
- TestBuildResource_IncludesFlyMachineID asserts presence on traces/logs.
Expected impact (per approved plan ethereal-petting-pelican): 30-55%
metric series reduction.
…eal-petting-pelican.md
Code Metrics Report
Code coverage of files in pull request scope (83.1%)
Reported by octocov
Summary
Quick task 260414-2rc: cut OTel metric series cost by an estimated 30–55% without losing actionable signal. Four cardinality-reduction changes landed, guided by the inventory at .planning/quick/260414-2rc-reduce-otel-metric-cardinality-per-plan-/260414-2rc-PLAN.md:
1. Remove `pdbplus.sync.type.duration`: the per-type sync histogram was redundant with the aggregate `pdbplus.sync.duration` and cost ~143 series.
2. Drop `http.server.{request,response}.body.size` via `sdkmetric.AggregationDrop{}` views: ~50–100 series.
3. Set explicit buckets on `rpc.server.duration` ({10ms, 50ms, 250ms, 1s, 5s}), replacing the SDK-default 14-bucket set: ~250–550 series.
4. Remove `fly.machine_id` from the metric resource (keep it on traces/logs). This eliminates the per-VM fan-out multiplier across every retained metric.
Changes
- `internal/otel/metrics.go`: remove `SyncTypeDuration` instrument + registration.
- `internal/sync/worker.go`: remove three `SyncTypeDuration.Record` call sites (fetch/upsert/delete-error loops) and their paired `stepStart := time.Now()`. Span, `typeAttr`, and per-type counters preserved.
- `internal/otel/provider.go`: three `sdkmetric.WithView` entries on the `MeterProvider`; new `buildResourceFiltered` shared impl with `buildResource` (traces/logs) and `buildMetricResource` (metrics, no `fly.machine_id`) as thin wrappers.
- `internal/otel/provider_test.go`: `TestBuildMetricResource_OmitsFlyMachineID` + `TestBuildResource_IncludesFlyMachineID` lock in the resource split.
- `internal/otel/metrics_test.go` + `internal/sync/worker_test.go`: assertions on the deleted metric removed.
Verification
- `go build ./...`: clean
- `go vet ./...`: clean
- `go test -race ./...`: all packages pass, no data races
- `golangci-lint run`: 0 issues
- `govulncheck ./...`: no vulnerabilities
- `SyncTypeDuration` / `pdbplus.sync.type.duration` are gone from the tree.
Deferred manual check
Task 3 of the plan is a console-exporter smoke test (recipe in 260414-2rc-SUMMARY.md under "Deferred Human Verification"): run the binary with `OTEL_METRICS_EXPORTER=console`, hit `/ui/`, `/rest/v1/net`, and a gRPC `List`, then confirm that the dropped metrics and `fly.machine_id` are absent from the metric export stream and that `rpc.server.duration` shows `Bounds=[0.01, 0.05, 0.25, 1, 5]`.
Key decisions
- Keep the `pdbplus.sync.duration` aggregate; drop only the per-type variant.
- Per-type counters (`SyncTypeObjects`, `SyncTypeDeleted`, `SyncTypeFetchErrors`, `SyncTypeUpsertErrors`, `SyncTypeFallback`) retained: they're cheap (13 series each) and actionable.
- Bucket boundaries {10ms, 50ms, 250ms, 1s, 5s} chosen to span typical RPC latency without over-sampling.
- `buildResourceFiltered` is the shared impl so the trace/log resource continues to carry `fly.machine_id` unchanged.
Plan & summary
- .planning/quick/260414-2rc-reduce-otel-metric-cardinality-per-plan-/260414-2rc-PLAN.md
- .planning/quick/260414-2rc-reduce-otel-metric-cardinality-per-plan-/260414-2rc-SUMMARY.md
🤖 Generated with Claude Code