Add distributed tracing with OpenTelemetry spans#283
Add distributed tracing with OpenTelemetry spans#283evantahler wants to merge 18 commits intomainfrom
Conversation
Instrument Keryx with OTel distributed tracing alongside existing metrics. Controlled independently via OTEL_TRACING_ENABLED, spans are exported via OTLP with W3C traceparent propagation across HTTP requests and background tasks. - Action execution wrapped in spans with duration and error recording - HTTP requests get parent SERVER spans with method/route/status attributes - W3C traceparent/tracestate extracted from incoming headers - Trace context propagated through task enqueue via _traceParent/_traceState params - DB queries (Drizzle) and Redis commands instrumented with CLIENT spans - Configurable sample rate and OTLP endpoint - Zero overhead when disabled (no-op tracer pattern matches existing metrics) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve version conflict (keep 0.17.0) and regenerate lockfile. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
evantahler
left a comment
There was a problem hiding this comment.
Review findings — 3 validated issues:
Issue 1: as any assertion in instrumentRedisTracing (CLAUDE.md rule)
packages/keryx/initializers/redis.ts line 84:
const result = originalSendCommand(command, ...rest) as any;CLAUDE.md requires: "Never use as any type assertions. Use @ts-expect-error with an explanatory comment when the type system can't express something." Use @ts-expect-error here instead.
Issue 2: DB query span captures zero duration
packages/keryx/initializers/db.ts lines 55–62:
const span = api.observability.tracing.tracer.startSpan("db.query", { ... });
span.end(); // ← ends synchronously, immediately after startSpan
logger.debug(message);Drizzle's LogWriter.write() is a post-execution log callback — it fires after the query completes. The span is created and immediately ended in the same synchronous call, so it records ~0ms duration and captures no errors. Every DB span in traces will appear as an instantaneous no-op, making them misleading rather than useful.
Issue 3: _traceParent / _traceState not stripped from params before connection.act()
packages/keryx/initializers/resque.ts lines 306–330:
_traceParent and _traceState are read from params on lines 306–307 but never deleted from plainParams before it's passed to connection.act() on line 330. These internal fields flow through to Zod validation on the action's input schema. Actions using .strict() schemas will fail validation unexpectedly. The same applies to the pre-existing _correlationId and _fanOutId fields — but this PR adds two new ones without stripping them.
- Replace `as any` with `@ts-expect-error` in redis.ts - Replace zero-duration db.query span with an event on the active span, since DrizzleLogger.write() fires after execution and can't wrap it - Strip _traceParent/_traceState from resque plainParams before act() - Update tracing test and docs to reflect event-based DB query recording Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Resolved conflicts: - Connection.ts: combined timeout/tracing (HEAD) with StreamingResponse guard in runAfter middleware (main) - web.ts: merged StreamingResponse early-return (main) into the tracing context.with block (HEAD), ending httpSpan before the streaming return Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| if (!serviceName) { | ||
| try { | ||
| const pkgPath = path.join(api.rootDir, "package.json"); | ||
| const pkg = await Bun.file(pkgPath).json(); |
There was a problem hiding this comment.
we just import the package.json directly
…iming Switches from event-based DB query logging (no duration data) to @kubiks/otel-drizzle which creates real child spans with accurate timing. Also fixes pre-existing TS errors in Connection.ts and redis.ts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Extract shared helpers into util/tracing.ts: finalizeSpan(), finalizeSpanOnPromise(), runWithSpan(), injectTraceToParams(), extractTraceFromParams() - Update to stable OTel semconv v1.20+ attribute names: http.method → http.request.method, http.status_code → http.response.status_code, db.system → db.system.name - Update HTTP span names to include route (e.g. "GET status") - Add BatchSpanProcessor tuning config (queue size, batch size, export delay) and shutdown timeout with graceful fallback - Update tests for new attribute names and add Redis span test - Update docs with new attribute names and config table Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflicts: integrate nested act() depth tracking and runAfter error parameter from main with tracing helpers (runWithSpan, finalizeSpan). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
evantahler
left a comment
There was a problem hiding this comment.
Version not bumped — Per CLAUDE.md: "Always bump the package version — Every PR must bump version in packages/keryx/package.json. Use patch for bug fixes, minor for new features." This PR adds distributed tracing (a new feature) but packages/keryx/package.json still shows 0.18.0. It should be bumped to at least 0.19.0.
| import { Initializer } from "../classes/Initializer"; | ||
| import { ErrorType, TypedError } from "../classes/TypedError"; | ||
| import { config } from "../config"; | ||
| import pkg from "../package.json"; |
There was a problem hiding this comment.
Service name regression (bug) — The previous code resolved the service name dynamically at runtime via path.join(api.rootDir, "package.json"), reading the application's package.json (e.g. "my-cool-api"). This import reads ../package.json, which is always the framework's package.json — always "keryx". When config.observability.serviceName is not explicitly set, all metrics and traces will incorrectly report service.name = "keryx" instead of the app's actual name. The old fallback chain was env var > app package.json > "keryx"; this change loses the middle step.
| */ | ||
| function createNoopTracer() { | ||
| return { | ||
| startSpan: (_name: string, _options?: any, _context?: any): Span => |
There was a problem hiding this comment.
any type annotations — Per CLAUDE.md: "No as any — Never use as any type assertions. Use @ts-expect-error with an explanatory comment when the type system can't express something, or add a proper type/interface." Multiple any types appear in new code: _options?: any, _context?: any, ...args: any[] here (lines 360–364), plus _metrics: any (line 377), resourceMetrics: any (line 390), Record<string, any> (line 442), dp: any (line 456). Replace with proper types or @ts-expect-error with explanatory comments.
| */ | ||
| private instrumentRedisTracing(client: RedisClient) { | ||
| const originalSendCommand = client.sendCommand.bind(client); | ||
| client.sendCommand = function (command: any, ...rest: any[]) { |
There was a problem hiding this comment.
any type annotations — Per CLAUDE.md: "No as any — Never use as any type assertions." command: any and ...rest: any[] should use proper ioredis types (e.g. Command from ioredis) or @ts-expect-error with an explanatory comment if the override signature can't be typed cleanly.
| * | ||
| * @param inputs - The task input object to enrich (mutated in place). | ||
| */ | ||
| export function injectTraceToParams(inputs: Record<string, any>): void { |
There was a problem hiding this comment.
any in type parameters — Per CLAUDE.md: "No as any — Never use as any type assertions." Record<string, any> uses any as a type argument. Consider Record<string, unknown> here and in the extractTraceFromParams signature (line 96), with explicit narrowing where needed.
| httpSpan!.spanContext().traceId, | ||
| ); | ||
| // Action's parent is the HTTP span (SDK v2 uses parentSpanContext) | ||
| const readableAction = actionSpan as any; |
There was a problem hiding this comment.
as any type assertion — Per CLAUDE.md: "No as any — Never use as any type assertions. Use @ts-expect-error with an explanatory comment when the type system can't express something." Replace with // @ts-expect-error -- OTel SDK v2 exposes parentSpanContext on ReadableSpan but it is not in the public type above the property access line.
- Restore dynamic service name resolution from api.rootDir/package.json (static import pkg from "../package.json" always resolved to "keryx") - Replace all `any` type annotations with proper types: SpanOptions/Context in createNoopTracer, ResourceMetrics/InstrumentType in prometheus serializers, Parameters<> spread in Redis instrumentation, Record<string, unknown> in tracing helpers - Fix otelTypeToPrometheus to use InstrumentType string enum (numeric constants were incorrect) - Replace `as any` in tracing test with @ts-expect-error + comment - Bump version 0.18.0 → 0.19.0 (new feature) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Resolve version conflict: bump to 0.20.0 (main was 0.19.1, branch added new OTel tracing feature warranting a minor bump). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Resolve version conflict: bump to 0.21.0 (main was 0.20.1, branch adds new OTel tracing feature warranting a minor bump). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move all heavy OTel SDK code out of the keryx core and into a new first-party plugin package at packages/plugins/observability/. Core changes: - packages/keryx/initializers/observability.ts stripped to a no-op stub that establishes api.observability via declare module augmentation; all SDK imports and start()/stop() removed - Remove 6 OTel SDK deps from keryx package.json (keep @opentelemetry/api since Connection.ts / util/tracing.ts / web.ts reference OTel types) - Bump keryx version 0.21.0 → 0.22.0 (minor: breaking for auto-loaded obs) - Update observability test to verify no-op behavior without the plugin New plugin package (@keryxjs/observability@0.1.0): - packages/plugins/observability/ — full OTel implementation - ObservabilityPlugin extends Initializer (loadPriority=60, after stub at 50) - Owns all SDK deps: context-async-hooks, exporter-trace-otlp-http, resources, sdk-metrics, sdk-trace-base, semantic-conventions, core - index.ts exports observabilityPlugin: KeryxPlugin - __tests__/observability.test.ts — integration tests with real Redis/Postgres - Mirrors keryx test infra: bunfig.toml, setup.ts, .env.example Example backend: - example/backend/config/plugins.ts — registers observabilityPlugin - Add @keryxjs/observability workspace dependency Infrastructure: - Root package.json: add workspace + lint/format/test scripts for plugin - .github/workflows/test.yaml: add test-plugin-observability CI job (sharded) - .github/workflows/publish.yaml: add publish-plugin-observability job; trigger on either package.json path change Docs: - docs/guide/observability.md: add Installation section with bun add + config - docs/guide/plugins.md: add First-Party Plugins table with @keryxjs/observability Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
globLoader auto-discovers all exported functions and tries to instantiate them as initializers. Exporting createNoopTracer caused it to be picked up as a fake initializer with no runModes, breaking api.start() across all test suites. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Conflicts: - packages/keryx/package.json: kept branch version 0.22.0 (ahead of main's 0.20.3 due to new OTel feature minor bumps) - bun.lockb: took main's binary, regenerated with bun install Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Conflicts: - packages/keryx/package.json: kept branch version 0.22.0 (ahead of main's 0.20.4 due to new OTel plugin feature bumps) - .github/workflows/test.yaml: merged needs lists — kept both test-plugin-observability (from branch) and benchmark (from main) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Conflicts: - packages/keryx/package.json: kept branch version 0.22.0 (ahead of main's 0.21.2) - .github/workflows/test.yaml: merged needs — added docker-build-backend and docker-build-frontend from main alongside test-plugin-observability from branch - bun.lockb: took main's binary, regenerated with bun install Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The root package.json declares packages/plugins/observability as a workspace. Both Dockerfiles copied individual package.json files for dependency caching but omitted the plugin workspace, causing 'bun install --frozen-lockfile' to fail with "Workspace not found". Also copy plugin source in the backend Dockerfile so the workspace package is available at runtime. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Closes #276. Adds distributed tracing support to Keryx via OpenTelemetry spans:
HTTP {METHOD}span with method, route, and status attributesaction:{name}child spans with duration and error tracking@kubiks/otel-drizzlefor proper child spans with accurate timing (not just events)redis.{command}child spanstraceparent/tracestateheaders are extracted from incoming requests and propagated through background tasksconfig.observability(tracing endpoint, sample rate, enable/disable independently from metrics)Test plan
bun run ci)🤖 Generated with Claude Code