Skip to content

Add distributed tracing with OpenTelemetry spans#283

Open
evantahler wants to merge 18 commits intomainfrom
evantahler/otel-tracing
Open

Add distributed tracing with OpenTelemetry spans#283
evantahler wants to merge 18 commits intomainfrom
evantahler/otel-tracing

Conversation

@evantahler
Copy link
Copy Markdown
Member

@evantahler evantahler commented Mar 22, 2026

Summary

Closes #276. Adds distributed tracing support to Keryx via OpenTelemetry spans:

  • HTTP requests get a parent HTTP {METHOD} span with method, route, and status attributes
  • Action execution creates action:{name} child spans with duration and error tracking
  • Database queries are instrumented via @kubiks/otel-drizzle for proper child spans with accurate timing (not just events)
  • Redis commands create redis.{command} child spans
  • W3C traceparent/tracestate headers are extracted from incoming requests and propagated through background tasks
  • Configurable via config.observability (tracing endpoint, sample rate, enable/disable independently from metrics)

Test plan

  • All 574 existing tests pass (bun run ci)
  • New tracing test suite covers span creation, parent-child hierarchy, W3C context extraction, DB spans, Redis spans, and context injection
  • Docs updated

🤖 Generated with Claude Code

evantahler and others added 2 commits March 21, 2026 23:12
Instrument Keryx with OTel distributed tracing alongside existing metrics.
Controlled independently via OTEL_TRACING_ENABLED, spans are exported via
OTLP with W3C traceparent propagation across HTTP requests and background tasks.

- Action execution wrapped in spans with duration and error recording
- HTTP requests get parent SERVER spans with method/route/status attributes
- W3C traceparent/tracestate extracted from incoming headers
- Trace context propagated through task enqueue via _traceParent/_traceState params
- DB queries (Drizzle) and Redis commands instrumented with CLIENT spans
- Configurable sample rate and OTLP endpoint
- Zero overhead when disabled (no-op tracer pattern matches existing metrics)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve version conflict (keep 0.17.0) and regenerate lockfile.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Member Author

@evantahler evantahler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review findings — 3 validated issues:


Issue 1: as any assertion in instrumentRedisTracing (CLAUDE.md rule)

packages/keryx/initializers/redis.ts line 84:

const result = originalSendCommand(command, ...rest) as any;

CLAUDE.md requires: "Never use as any type assertions. Use @ts-expect-error with an explanatory comment when the type system can't express something." Use @ts-expect-error here instead.


Issue 2: DB query span captures zero duration

packages/keryx/initializers/db.ts lines 55–62:

const span = api.observability.tracing.tracer.startSpan("db.query", { ... });
span.end();  // ← ends synchronously, immediately after startSpan
logger.debug(message);

Drizzle's LogWriter.write() is a post-execution log callback — it fires after the query completes. The span is created and immediately ended in the same synchronous call, so it records ~0ms duration and captures no errors. Every DB span in traces will appear as an instantaneous no-op, making them misleading rather than useful.


Issue 3: _traceParent / _traceState not stripped from params before connection.act()

packages/keryx/initializers/resque.ts lines 306–330:
_traceParent and _traceState are read from params on lines 306–307 but never deleted from plainParams before it's passed to connection.act() on line 330. These internal fields flow through to Zod validation on the action's input schema. Actions using .strict() schemas will fail validation unexpectedly. The same applies to the pre-existing _correlationId and _fanOutId fields — but this PR adds two new ones without stripping them.

evantahler and others added 2 commits March 22, 2026 10:41
- Replace `as any` with `@ts-expect-error` in redis.ts
- Replace zero-duration db.query span with an event on the active span,
  since DrizzleLogger.write() fires after execution and can't wrap it
- Strip _traceParent/_traceState from resque plainParams before act()
- Update tracing test and docs to reflect event-based DB query recording

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Resolved conflicts:
- Connection.ts: combined timeout/tracing (HEAD) with StreamingResponse guard
  in runAfter middleware (main)
- web.ts: merged StreamingResponse early-return (main) into the tracing
  context.with block (HEAD), ending httpSpan before the streaming return

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
if (!serviceName) {
try {
const pkgPath = path.join(api.rootDir, "package.json");
const pkg = await Bun.file(pkgPath).json();
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we just import the package.json directly

evantahler and others added 2 commits March 22, 2026 10:51
…iming

Switches from event-based DB query logging (no duration data) to
@kubiks/otel-drizzle which creates real child spans with accurate timing.
Also fixes pre-existing TS errors in Connection.ts and redis.ts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
evantahler and others added 2 commits March 22, 2026 11:42
- Extract shared helpers into util/tracing.ts: finalizeSpan(),
  finalizeSpanOnPromise(), runWithSpan(), injectTraceToParams(),
  extractTraceFromParams()
- Update to stable OTel semconv v1.20+ attribute names:
  http.method → http.request.method, http.status_code →
  http.response.status_code, db.system → db.system.name
- Update HTTP span names to include route (e.g. "GET status")
- Add BatchSpanProcessor tuning config (queue size, batch size,
  export delay) and shutdown timeout with graceful fallback
- Update tests for new attribute names and add Redis span test
- Update docs with new attribute names and config table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflicts: integrate nested act() depth tracking and runAfter
error parameter from main with tracing helpers (runWithSpan, finalizeSpan).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Member Author

@evantahler evantahler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Version not bumped — Per CLAUDE.md: "Always bump the package version — Every PR must bump version in packages/keryx/package.json. Use patch for bug fixes, minor for new features." This PR adds distributed tracing (a new feature) but packages/keryx/package.json still shows 0.18.0. It should be bumped to at least 0.19.0.

import { Initializer } from "../classes/Initializer";
import { ErrorType, TypedError } from "../classes/TypedError";
import { config } from "../config";
import pkg from "../package.json";
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Service name regression (bug) — The previous code resolved the service name dynamically at runtime via path.join(api.rootDir, "package.json"), reading the application's package.json (e.g. "my-cool-api"). This import reads ../package.json, which is always the framework's package.json — always "keryx". When config.observability.serviceName is not explicitly set, all metrics and traces will incorrectly report service.name = "keryx" instead of the app's actual name. The old fallback chain was env var > app package.json > "keryx"; this change loses the middle step.

*/
function createNoopTracer() {
return {
startSpan: (_name: string, _options?: any, _context?: any): Span =>
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any type annotations — Per CLAUDE.md: "No as any — Never use as any type assertions. Use @ts-expect-error with an explanatory comment when the type system can't express something, or add a proper type/interface." Multiple any types appear in new code: _options?: any, _context?: any, ...args: any[] here (lines 360–364), plus _metrics: any (line 377), resourceMetrics: any (line 390), Record<string, any> (line 442), dp: any (line 456). Replace with proper types or @ts-expect-error with explanatory comments.

Comment thread packages/keryx/initializers/redis.ts Outdated
*/
private instrumentRedisTracing(client: RedisClient) {
const originalSendCommand = client.sendCommand.bind(client);
client.sendCommand = function (command: any, ...rest: any[]) {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any type annotations — Per CLAUDE.md: "No as any — Never use as any type assertions." command: any and ...rest: any[] should use proper ioredis types (e.g. Command from ioredis) or @ts-expect-error with an explanatory comment if the override signature can't be typed cleanly.

Comment thread packages/keryx/util/tracing.ts Outdated
*
* @param inputs - The task input object to enrich (mutated in place).
*/
export function injectTraceToParams(inputs: Record<string, any>): void {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any in type parameters — Per CLAUDE.md: "No as any — Never use as any type assertions." Record<string, any> uses any as a type argument. Consider Record<string, unknown> here and in the extractTraceFromParams signature (line 96), with explicit narrowing where needed.

httpSpan!.spanContext().traceId,
);
// Action's parent is the HTTP span (SDK v2 uses parentSpanContext)
const readableAction = actionSpan as any;
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as any type assertion — Per CLAUDE.md: "No as any — Never use as any type assertions. Use @ts-expect-error with an explanatory comment when the type system can't express something." Replace with // @ts-expect-error -- OTel SDK v2 exposes parentSpanContext on ReadableSpan but it is not in the public type above the property access line.

evantahler and others added 10 commits March 22, 2026 12:01
- Restore dynamic service name resolution from api.rootDir/package.json
  (static import pkg from "../package.json" always resolved to "keryx")
- Replace all `any` type annotations with proper types: SpanOptions/Context
  in createNoopTracer, ResourceMetrics/InstrumentType in prometheus
  serializers, Parameters<> spread in Redis instrumentation,
  Record<string, unknown> in tracing helpers
- Fix otelTypeToPrometheus to use InstrumentType string enum (numeric
  constants were incorrect)
- Replace `as any` in tracing test with @ts-expect-error + comment
- Bump version 0.18.0 → 0.19.0 (new feature)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Resolve version conflict: bump to 0.20.0 (main was 0.19.1, branch added
new OTel tracing feature warranting a minor bump).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Resolve version conflict: bump to 0.21.0 (main was 0.20.1, branch adds
new OTel tracing feature warranting a minor bump).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move all heavy OTel SDK code out of the keryx core and into a new
first-party plugin package at packages/plugins/observability/.

Core changes:
- packages/keryx/initializers/observability.ts stripped to a no-op stub
  that establishes api.observability via declare module augmentation;
  all SDK imports and start()/stop() removed
- Remove 6 OTel SDK deps from keryx package.json (keep @opentelemetry/api
  since Connection.ts / util/tracing.ts / web.ts reference OTel types)
- Bump keryx version 0.21.0 → 0.22.0 (minor: breaking for auto-loaded obs)
- Update observability test to verify no-op behavior without the plugin

New plugin package (@keryxjs/observability@0.1.0):
- packages/plugins/observability/ — full OTel implementation
- ObservabilityPlugin extends Initializer (loadPriority=60, after stub at 50)
- Owns all SDK deps: context-async-hooks, exporter-trace-otlp-http, resources,
  sdk-metrics, sdk-trace-base, semantic-conventions, core
- index.ts exports observabilityPlugin: KeryxPlugin
- __tests__/observability.test.ts — integration tests with real Redis/Postgres
- Mirrors keryx test infra: bunfig.toml, setup.ts, .env.example

Example backend:
- example/backend/config/plugins.ts — registers observabilityPlugin
- Add @keryxjs/observability workspace dependency

Infrastructure:
- Root package.json: add workspace + lint/format/test scripts for plugin
- .github/workflows/test.yaml: add test-plugin-observability CI job (sharded)
- .github/workflows/publish.yaml: add publish-plugin-observability job;
  trigger on either package.json path change

Docs:
- docs/guide/observability.md: add Installation section with bun add + config
- docs/guide/plugins.md: add First-Party Plugins table with @keryxjs/observability

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
globLoader auto-discovers all exported functions and tries to instantiate
them as initializers. Exporting createNoopTracer caused it to be picked
up as a fake initializer with no runModes, breaking api.start() across all
test suites.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Conflicts:
- packages/keryx/package.json: kept branch version 0.22.0 (ahead of
  main's 0.20.3 due to new OTel feature minor bumps)
- bun.lockb: took main's binary, regenerated with bun install

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Conflicts:
- packages/keryx/package.json: kept branch version 0.22.0 (ahead of
  main's 0.20.4 due to new OTel plugin feature bumps)
- .github/workflows/test.yaml: merged needs lists — kept both
  test-plugin-observability (from branch) and benchmark (from main)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Conflicts:
- packages/keryx/package.json: kept branch version 0.22.0 (ahead of main's 0.21.2)
- .github/workflows/test.yaml: merged needs — added docker-build-backend
  and docker-build-frontend from main alongside test-plugin-observability
  from branch
- bun.lockb: took main's binary, regenerated with bun install

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The root package.json declares packages/plugins/observability as a
workspace. Both Dockerfiles copied individual package.json files for
dependency caching but omitted the plugin workspace, causing
'bun install --frozen-lockfile' to fail with "Workspace not found".

Also copy plugin source in the backend Dockerfile so the workspace
package is available at runtime.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add distributed tracing (OpenTelemetry spans)

1 participant