
Add OpenTelemetry instrumentation for LLM calls, tool execution, and agent sessions #17

@abhi-bhat-lyzr

Description


Problem

There is no standardized observability for Gitclaw agent executions. Users have no visibility into:

  • What LLM calls are being made — which provider, model, token usage, cost, latency, finish reason
  • What tools are being executed — which tool, duration, success/failure
  • Overall session behavior — total cost, total tokens, number of LLM roundtrips vs tool calls

Without this, debugging agent behavior, optimizing cost, and monitoring production agents requires manual logging and guesswork.

Proposed Solution

Add OpenTelemetry-based instrumentation using a hybrid 3-layer approach:

Layer 1: HTTP-level interception (LLM calls)

  • Use @opentelemetry/instrumentation-undici to auto-instrument outbound HTTP calls to LLM providers (OpenAI, Anthropic, Google, Groq, Mistral, xAI, AWS Bedrock)
  • A custom SpanProcessor detects LLM provider URLs and enriches spans with gen_ai.* semantic conventions
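As a sketch of how the provider detection inside that SpanProcessor might work (the hostnames and gen_ai.system values below are illustrative assumptions, not the actual Gitclaw implementation):

```typescript
// Hypothetical helper: map an outbound request URL to a gen_ai.system value.
// The custom SpanProcessor would call something like this on spans produced
// by instrumentation-undici to decide whether the request targets an LLM
// provider and which gen_ai.* attributes to attach.
const LLM_PROVIDER_HOSTS: Record<string, string> = {
  "api.openai.com": "openai",
  "api.anthropic.com": "anthropic",
  "generativelanguage.googleapis.com": "gcp.gemini",
  "api.groq.com": "groq",
  "api.mistral.ai": "mistral",
  "api.x.ai": "xai",
};

function detectLlmProvider(url: string): string | undefined {
  try {
    const host = new URL(url).hostname;
    // Bedrock uses regional hosts such as bedrock-runtime.us-east-1.amazonaws.com
    if (host.startsWith("bedrock-runtime.")) return "aws.bedrock";
    return LLM_PROVIDER_HOSTS[host];
  } catch {
    return undefined; // not a parseable URL, so not an LLM call
  }
}
```

Non-matching hosts fall through to undefined, so ordinary HTTP spans pass through unenriched.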

Layer 2: Event-based enrichment (structured LLM data)

  • On each message_end event from the agent loop, create a gen_ai.chat span with:
    • gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.cost_usd
    • gen_ai.request.model, gen_ai.response.finish_reasons, gen_ai.system
  • This captures data that isn't available at the raw HTTP level (token counts, cost, stop reason)
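A minimal sketch of the Layer 2 enrichment, assuming a hypothetical message_end payload shape (the event field names are illustrative; the attribute keys follow the OpenTelemetry GenAI semantic conventions listed above):

```typescript
// Assumed shape of the agent loop's message_end payload; field names are
// illustrative, not the actual Gitclaw event schema.
interface MessageEndEvent {
  provider: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
  finishReason: string;
}

// Build the attribute set that would be attached to the gen_ai.chat span
// created for each message_end event.
function chatSpanAttributes(ev: MessageEndEvent): Record<string, string | number> {
  return {
    "gen_ai.system": ev.provider,
    "gen_ai.request.model": ev.model,
    "gen_ai.usage.input_tokens": ev.inputTokens,
    "gen_ai.usage.output_tokens": ev.outputTokens,
    "gen_ai.usage.cost_usd": ev.costUsd,
    "gen_ai.response.finish_reasons": ev.finishReason,
  };
}
```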

Layer 3: Tool call wrapping (application level)

  • Every tool execution (built-in, declarative, plugin, SDK-injected) is wrapped in a gitclaw.tool.execute span
  • Captures: tool.name, tool.call_id, tool.duration_ms, tool.status, tool.error_message

Trace shape

gitclaw.session (root)
├── gen_ai.chat (LLM call → Anthropic)
│     gen_ai.usage.input_tokens=1523, output_tokens=200, cost_usd=0.003
│     gen_ai.response.finish_reasons=tool_use
├── gitclaw.tool.execute (cli)
│     tool.name=cli, duration_ms=2340, status=success
├── gen_ai.chat (LLM call → Anthropic)
│     gen_ai.usage.input_tokens=2100, output_tokens=150
├── gitclaw.tool.execute (write)
│     tool.name=write, duration_ms=12, status=success
├── gen_ai.chat (LLM call → Anthropic)
│     gen_ai.response.finish_reasons=stop
└── session totals: tokens=7073, cost=$0.012, tool_calls=2, llm_calls=3
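The session totals on the root span could be derived by folding over the child spans; this is a sketch, and the record shape below is an illustrative assumption rather than Gitclaw's internal representation:

```typescript
// Illustrative per-child record: LLM calls carry usage data, tool calls don't.
interface ChildSpan {
  kind: "llm" | "tool";
  inputTokens?: number;
  outputTokens?: number;
  costUsd?: number;
}

// Aggregate the totals that gitclaw.session would report at the end of a run.
function sessionTotals(children: ChildSpan[]) {
  let tokens = 0, cost = 0, llmCalls = 0, toolCalls = 0;
  for (const c of children) {
    if (c.kind === "llm") {
      llmCalls++;
      tokens += (c.inputTokens ?? 0) + (c.outputTokens ?? 0);
      cost += c.costUsd ?? 0;
    } else {
      toolCalls++;
    }
  }
  return { tokens, cost, llmCalls, toolCalls };
}
```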

Key design decisions

  • Zero overhead when disabled — @opentelemetry/api returns no-op instances by default; there is no performance impact unless initTelemetry() is called
  • Opt-in SDK packages — Only @opentelemetry/api (~50KB) is a hard dependency; all SDK/exporter packages are optional peer dependencies
  • Backend agnostic — Exports via OTLP HTTP, compatible with Jaeger, Grafana Tempo, Datadog, Honeycomb, Axiom, or any OTel Collector
  • Plugin authors get access — tracer and meter are exposed on GitclawPluginApi so plugins can emit custom spans/metrics
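One way the opt-in SDK packages could be handled is a guarded dynamic import inside initTelemetry(): if an optional peer dependency is missing, telemetry simply stays in its no-op state. This is a sketch of the loading pattern only, not the actual initTelemetry() internals:

```typescript
// Attempt to load an optional peer dependency at runtime. Returns the module
// if installed, or undefined if not, so callers can degrade to no-op
// telemetry instead of crashing.
async function tryLoadSdk(pkg: string): Promise<unknown | undefined> {
  try {
    return await import(pkg);
  } catch {
    return undefined; // optional peer dependency not installed
  }
}
```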

Usage

import { initTelemetry, query } from "gitclaw";

await initTelemetry({
  serviceName: "my-agent",
  exporterEndpoint: "http://localhost:4318",
});

for await (const msg of query({ prompt: "Fix the bug" })) {
  // traces + metrics exported automatically
}

Metrics emitted

Metric                            Type       Description
gen_ai.client.token.usage         Counter    Token consumption by model and type
gen_ai.client.operation.duration  Histogram  LLM call latency
gitclaw.session.duration_ms       Histogram  End-to-end session duration
gitclaw.session.cost_usd          Counter    Session cost by agent and model
gitclaw.tool.calls                Counter    Tool invocations by name and status
gitclaw.tool.duration_ms          Histogram  Tool execution latency

Alternatives Considered

  1. Framework-level tracing (instrument every internal operation) — Rejected. Tracing manifest parsing, plugin loading, skill discovery, etc. would carry a high maintenance burden and provide data most users don't need. Internal debugging can use standard logging.

  2. Custom tracing abstraction — Rejected. OpenTelemetry is the industry standard, vendor-neutral, and already supported by every major observability platform. Building a custom solution would fragment the ecosystem.

  3. SDK-level wrapping (wrap pi-ai client calls) — Partially adopted. HTTP-level interception is cleaner and survives SDK swaps, but message_end event enrichment is needed for structured data (tokens, cost) that isn't in raw HTTP responses.

Additional Context

  • Follows OpenTelemetry GenAI Semantic Conventions
  • The underlying LLM library (@mariozechner/pi-ai) uses Undici as its HTTP client, which is why @opentelemetry/instrumentation-undici is used instead of instrumentation-http
  • Compatible with quick local testing via Jaeger all-in-one:
    docker run -p 16686:16686 -p 4318:4318 jaegertracing/all-in-one
