Commit f7f7840

fix(otel): make service.instance.id unique per process (#4891)
All app replicas shared a hardcoded service.instance.id ("mothership-sim"), so OTel metrics from every process collapsed into one Prometheus series. Their independent cumulative counters then interleaved, producing phantom counter resets that corrupt rate()/increase(): staging hosted-key cost inflated to ~$0.72 from a few cents, while no-`key` metrics (cost_charged, throttled, queue_wait_*) were affected fleet-wide.

Append the hostname (the container id under ECS, unique per task) so each replica gets its own series and sum(rate(...)) / sum(increase(...)) aggregate correctly. The mothership-sim prefix is kept so Jaeger's clock-skew adjuster still separates Sim from Go.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
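
The phantom-reset mechanism described in the commit message can be illustrated with a small sketch. It assumes a simplified model of Prometheus reset handling (any drop in a cumulative value is treated as a counter reset and the post-reset value is counted in full), ignores timestamps and extrapolation, and uses made-up sample values. `increaseWithResets` is a hypothetical helper, not part of this commit.

```typescript
// Simplified increase(): sums deltas over cumulative samples and treats any
// drop as a counter reset, counting the post-reset value in full.
function increaseWithResets(samples: number[]): number {
  let total = 0
  let prev: number | undefined
  for (const curr of samples) {
    if (prev !== undefined) {
      total += curr >= prev ? curr - prev : curr
    }
    prev = curr
  }
  return total
}

// Two hypothetical replicas, each exporting its own cumulative counter.
const replicaA = [100, 101, 102] // true increase: 2
const replicaB = [5, 6, 7] // true increase: 2

// Shared instance id: both replicas land in one series, samples interleaved.
const collapsed = [100, 5, 101, 6, 102, 7]

// Unique instance ids: one series per replica, summed after the fact.
const perReplica = increaseWithResets(replicaA) + increaseWithResets(replicaB)

console.log(increaseWithResets(collapsed)) // 210: every drop looks like a reset
console.log(perReplica) // 4: matches the true total
```

In this toy model the collapsed series reports an increase of 210 against a true total of 4, which is the same kind of inflation the commit message describes for the staging cost metric.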
1 parent ce7ddd1 commit f7f7840

1 file changed

Lines changed: 7 additions & 4 deletions

apps/sim/instrumentation-node.ts

@@ -2,6 +2,7 @@
 // prefix (`sim-mothership:` / `go-mothership:`) to separate the two
 // halves of a mothership trace in the OTLP backend.
 
+import { hostname } from 'node:os'
 import type { Attributes, Context, Link, SpanKind } from '@opentelemetry/api'
 import { DiagConsoleLogger, DiagLogLevel, diag, TraceFlags, trace } from '@opentelemetry/api'
 import type {
@@ -259,10 +260,12 @@ async function initializeOpenTelemetry() {
     exportIntervalMillis: 60000,
   })
 
-  // Unique instance id per origin keeps Jaeger's clock-skew adjuster
-  // from grouping Sim+Go spans together (they'd see multi-second
-  // drift as intra-service and emit spurious warnings).
-  const serviceInstanceId = `${telemetryConfig.serviceName}-${SERVICE_INSTANCE_SLUG}`
+  // Must be unique per process: replicas sharing one instance id collapse
+  // into a single Prometheus series, so their independent cumulative
+  // counters interleave and corrupt rate()/increase(). The slug keeps Sim
+  // distinct from Go for Jaeger's clock-skew grouping; the hostname (the
+  // container id under ECS) makes each replica its own series.
+  const serviceInstanceId = `${telemetryConfig.serviceName}-${SERVICE_INSTANCE_SLUG}-${hostname()}`
   const resource = defaultResource().merge(
     resourceFromAttributes({
       [ATTR_SERVICE_NAME]: telemetryConfig.serviceName,
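
As a rough illustration of the change, the sketch below isolates the id construction so its effect can be checked in isolation. `buildServiceInstanceId` is a hypothetical helper (the commit builds the string inline), and the service name, slug, and host values are placeholders chosen to match the "mothership-sim" prefix mentioned in the commit message, not values confirmed by the diff.

```typescript
import { hostname } from 'node:os'

// Hypothetical helper mirroring the template string in the diff. The host
// defaults to os.hostname(), which the commit message says is the container
// id under ECS.
function buildServiceInstanceId(
  serviceName: string,
  slug: string,
  host: string = hostname()
): string {
  return `${serviceName}-${slug}-${host}`
}

// Placeholder values: two replicas of the same service on different hosts.
const idA = buildServiceInstanceId('mothership', 'sim', 'task-aaa')
const idB = buildServiceInstanceId('mothership', 'sim', 'task-bbb')

console.log(idA) // mothership-sim-task-aaa
console.log(idB) // mothership-sim-task-bbb
```

Both ids keep the shared prefix while differing per host, so each replica would map to its own series and the prefix remains available for separating Sim from Go.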
