Design & implement observability — OpenTelemetry (§21) #123

@AlexChesser

Summary

Design and implement OpenTelemetry integration for production debugging and monitoring.

Parent issue: #105 — Tier 3, Priority #13

Why

The turn log serves auditing; OpenTelemetry serves real-time insight into latency, cost, and failure patterns. Production pipelines need distributed tracing, metrics (tokens used, step duration, error rates), and integration with existing observability stacks.

Design Decisions Needed

  • OTel SDK — which Rust crate? opentelemetry + tracing-opentelemetry?
  • What to instrument — spans per step? Per pipeline? Per runner call?
  • Attributes — step_id, runner, model, token_count, cost, error_type?
  • Export configuration — YAML observability: block? Environment variables? OTLP endpoint?
  • Metrics vs. traces vs. logs — which OTel signals to support?
  • Interaction with the existing tracing instrumentation in ail-core
  • Whether observability config is per-pipeline or global
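
To make the export-configuration question concrete, here is a minimal std-only sketch of one possible resolution order: an endpoint from a YAML `observability:` block wins, then the standard `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable, and otherwise observability is disabled. The `ObservabilityConfig` type, the `resolve` function, the precedence order, and the default service name `"ail"` are all assumptions for discussion, not existing ail-core APIs. It deliberately avoids the OTel crates so that it does not prejudge the SDK choice.

```rust
/// Hypothetical export settings; not part of ail-core today.
#[derive(Debug, Clone, PartialEq)]
pub enum ObservabilityConfig {
    /// No endpoint configured: emit nothing.
    Disabled,
    /// Export over OTLP to the given endpoint.
    Otlp { endpoint: String, service_name: String },
}

impl ObservabilityConfig {
    /// Resolve settings from an explicit value (for example, one parsed
    /// from a YAML `observability:` block), falling back to the standard
    /// OTel environment variables. The YAML-first precedence is an
    /// assumption. `env` is injected as a lookup function so the logic
    /// can be tested without touching the process environment.
    pub fn resolve<F>(yaml_endpoint: Option<&str>, env: F) -> Self
    where
        F: Fn(&str) -> Option<String>,
    {
        let endpoint = yaml_endpoint
            .map(str::to_owned)
            .or_else(|| env("OTEL_EXPORTER_OTLP_ENDPOINT"))
            .map(|e| e.trim().to_owned())
            // Treat an empty or whitespace-only value as "not configured".
            .filter(|e| !e.is_empty());

        match endpoint {
            None => ObservabilityConfig::Disabled,
            Some(endpoint) => ObservabilityConfig::Otlp {
                endpoint,
                // "ail" as the default service name is an assumption.
                service_name: env("OTEL_SERVICE_NAME")
                    .unwrap_or_else(|| "ail".to_owned()),
            },
        }
    }

    /// Convenience wrapper that reads the real process environment.
    pub fn from_process_env(yaml_endpoint: Option<&str>) -> Self {
        Self::resolve(yaml_endpoint, |key| std::env::var(key).ok())
    }
}
```

Modelling the unconfigured case as an explicit `Disabled` variant would let callers skip exporter setup entirely. Whether the setting is per-pipeline or global remains open; this sketch works either way, because the YAML value is just an argument.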

Spec Reference

  • Referenced in spec/core/s21*.md as an exploratory feature
  • ail-core already uses tracing — this builds on that foundation

Acceptance Criteria

  • Spec section authored
  • OTel traces emitted for pipeline and step execution
  • Key attributes (step_id, duration, model, tokens) attached to spans
  • Configurable export endpoint
  • Zero overhead when OTel is not configured
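
The span-attribute and overhead criteria above can be illustrated with a small std-only sketch. `StepSpan` and `Recorder` are hypothetical names, and the field types are assumptions. A real implementation would more likely build on `tracing` spans and an OTel layer. The point here is only the shape: when disabled, a step costs one branch, with no clock read and no allocation.

```rust
use std::time::{Duration, Instant};

/// Attributes named in the acceptance criteria; field types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct StepSpan {
    pub step_id: String,
    pub model: String,
    pub tokens: u64,
    pub duration: Duration,
}

/// Hypothetical recorder. `Disabled` holds no buffer, so nothing is allocated.
pub enum Recorder {
    Disabled,
    Enabled(Vec<StepSpan>),
}

impl Recorder {
    /// Run `step`, which returns its result plus the number of tokens used.
    /// A span is recorded only when the recorder is enabled.
    pub fn record_step<T>(
        &mut self,
        step_id: &str,
        model: &str,
        step: impl FnOnce() -> (T, u64),
    ) -> T {
        match self {
            // Disabled path: a single branch, no timing, no string copies.
            Recorder::Disabled => step().0,
            Recorder::Enabled(spans) => {
                let start = Instant::now();
                let (out, tokens) = step();
                spans.push(StepSpan {
                    step_id: step_id.to_owned(),
                    model: model.to_owned(),
                    tokens,
                    duration: start.elapsed(),
                });
                out
            }
        }
    }

    /// Spans collected so far (always empty when disabled).
    pub fn spans(&self) -> &[StepSpan] {
        match self {
            Recorder::Disabled => &[],
            Recorder::Enabled(spans) => spans.as_slice(),
        }
    }
}
```

For example, `Recorder::Enabled(Vec::new()).record_step("s1", "model-x", || ("ok", 128))` would return `"ok"` and leave one span with `tokens == 128`. The same call on `Recorder::Disabled` returns the value and records nothing. Strictly speaking, the disabled path is near-zero rather than zero cost, since the enum match remains.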
