Description
Problem
There is no standardized observability for Gitclaw agent executions. Users have no visibility into:
- What LLM calls are being made — which provider, model, token usage, cost, latency, finish reason
- What tools are being executed — which tool, duration, success/failure
- Overall session behavior — total cost, total tokens, number of LLM roundtrips vs tool calls
Without this, debugging agent behavior, optimizing cost, and monitoring production agents requires manual logging and guesswork.
Proposed Solution
Add OpenTelemetry-based instrumentation using a hybrid 3-layer approach:
Layer 1: HTTP-level interception (LLM calls)
- Use `@opentelemetry/instrumentation-undici` to auto-instrument outbound HTTP calls to LLM providers (OpenAI, Anthropic, Google, Groq, Mistral, xAI, AWS Bedrock)
- A custom `SpanProcessor` detects LLM provider URLs and enriches spans with `gen_ai.*` semantic conventions
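To make Layer 1 concrete, here is a sketch of the URL-detection logic such a `SpanProcessor` might use. The provider host list and the returned attribute set are illustrative assumptions, not the actual implementation:

```typescript
// Sketch: map an outbound request URL to gen_ai.* span attributes.
// Host list is illustrative and not exhaustive (e.g. Bedrock uses
// region-specific hosts and would need pattern matching).
const LLM_PROVIDER_HOSTS: Record<string, string> = {
  "api.openai.com": "openai",
  "api.anthropic.com": "anthropic",
  "generativelanguage.googleapis.com": "gcp.gemini",
  "api.groq.com": "groq",
  "api.mistral.ai": "mistral",
  "api.x.ai": "xai",
};

function detectLlmProvider(url: string): Record<string, string> | null {
  const host = new URL(url).hostname;
  const system = LLM_PROVIDER_HOSTS[host];
  if (!system) return null; // not an LLM call — leave the span untouched
  return { "gen_ai.system": system, "gen_ai.operation.name": "chat" };
}
```

The processor would call something like this in `onEnd` and copy the returned attributes onto the HTTP span before export.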
Layer 2: Event-based enrichment (structured LLM data)
- On each `message_end` event from the agent loop, create a `gen_ai.chat` span with: `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.usage.cost_usd`, `gen_ai.request.model`, `gen_ai.response.finish_reasons`, `gen_ai.system`
- This captures data that isn't available at the raw HTTP level (token counts, cost, stop reason)
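A sketch of the Layer 2 enrichment, assuming a hypothetical shape for the `message_end` payload (the real event type may differ); the attribute keys are the GenAI semantic-convention names listed above:

```typescript
// Sketch: turn a message_end event payload into gen_ai.chat span attributes.
// MessageEndEvent is an assumed shape, not the actual agent-loop type.
interface MessageEndEvent {
  model: string;
  provider: string;
  usage: { inputTokens: number; outputTokens: number; costUsd: number };
  stopReason: string;
}

function chatSpanAttributes(ev: MessageEndEvent): Record<string, string | number> {
  return {
    "gen_ai.system": ev.provider,
    "gen_ai.request.model": ev.model,
    "gen_ai.usage.input_tokens": ev.usage.inputTokens,
    "gen_ai.usage.output_tokens": ev.usage.outputTokens,
    "gen_ai.usage.cost_usd": ev.usage.costUsd,
    "gen_ai.response.finish_reasons": ev.stopReason,
  };
}
```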
Layer 3: Tool call wrapping (application level)
- Every tool execution (built-in, declarative, plugin, SDK-injected) is wrapped in a `gitclaw.tool.execute` span
- Captures: `tool.name`, `tool.call_id`, `tool.duration_ms`, `tool.status`, `tool.error_message`
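The Layer 3 wrapper could look roughly like this; `record` stands in for setting attributes on a real OTel span, and the signature is an assumption for illustration:

```typescript
// Sketch: wrap a tool call and record the tool.* attributes described above.
type ToolAttrs = Record<string, string | number>;

async function traceToolExecute<T>(
  name: string,
  callId: string,
  fn: () => Promise<T>,
  record: (attrs: ToolAttrs) => void, // stand-in for span.setAttributes + span.end
): Promise<T> {
  const start = Date.now();
  const base: ToolAttrs = { "tool.name": name, "tool.call_id": callId };
  try {
    const result = await fn();
    record({ ...base, "tool.duration_ms": Date.now() - start, "tool.status": "success" });
    return result;
  } catch (err) {
    // failures still produce a span, with the error message attached
    record({
      ...base,
      "tool.duration_ms": Date.now() - start,
      "tool.status": "error",
      "tool.error_message": String(err),
    });
    throw err;
  }
}
```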
Trace shape
```
gitclaw.session (root)
├── gen_ai.chat (LLM call → Anthropic)
│     gen_ai.usage.input_tokens=1523, output_tokens=200, cost_usd=0.003
│     gen_ai.response.finish_reasons=tool_use
├── gitclaw.tool.execute (cli)
│     tool.name=cli, duration_ms=2340, status=success
├── gen_ai.chat (LLM call → Anthropic)
│     gen_ai.usage.input_tokens=2100, output_tokens=150
├── gitclaw.tool.execute (write)
│     tool.name=write, duration_ms=12, status=success
├── gen_ai.chat (LLM call → Anthropic)
│     gen_ai.response.finish_reasons=stop
└── session totals: tokens=7073, cost=$0.012, tool_calls=2, llm_calls=3
```
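The session totals in the last line could be rolled up from the exported child spans along these lines; `SpanData` is a simplified stand-in for real exported span attributes:

```typescript
// Sketch: aggregate per-span gen_ai.* data into session-level totals.
interface SpanData {
  name: string;
  attrs: Record<string, number | string>;
}

function sessionTotals(spans: SpanData[]) {
  let tokens = 0, cost = 0, toolCalls = 0, llmCalls = 0;
  for (const s of spans) {
    if (s.name === "gen_ai.chat") {
      llmCalls++;
      tokens += Number(s.attrs["gen_ai.usage.input_tokens"] ?? 0)
              + Number(s.attrs["gen_ai.usage.output_tokens"] ?? 0);
      cost += Number(s.attrs["gen_ai.usage.cost_usd"] ?? 0);
    } else if (s.name === "gitclaw.tool.execute") {
      toolCalls++;
    }
  }
  return { tokens, cost, toolCalls, llmCalls };
}
```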
Key design decisions
- Zero overhead when disabled — `@opentelemetry/api` returns no-op instances by default; no performance impact unless `initTelemetry()` is called
- Opt-in SDK packages — Only `@opentelemetry/api` (~50KB) is a hard dependency; all SDK/exporter packages are optional peer dependencies
- Backend agnostic — Exports via OTLP HTTP, compatible with Jaeger, Grafana Tempo, Datadog, Honeycomb, Axiom, or any OTel Collector
- Plugin authors get access — `tracer` and `meter` exposed on `GitclawPluginApi` so plugins can emit custom spans/metrics
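As a hypothetical example of the plugin-facing API, a plugin might emit a custom span like this; the `PluginApiLike` shape and the `myplugin.*` names are invented for illustration and are not the real `GitclawPluginApi` interface:

```typescript
// Sketch: a plugin emitting its own span via the exposed tracer.
// SpanLike/PluginApiLike are minimal stand-ins for the real OTel types.
interface SpanLike {
  setAttribute(key: string, value: string | number): void;
  end(): void;
}
interface PluginApiLike {
  tracer: { startSpan(name: string): SpanLike };
}

function emitPluginSpan(api: PluginApiLike, cacheHits: number): void {
  // e.g. a plugin reporting a cache lookup it performed
  const span = api.tracer.startSpan("myplugin.cache.lookup");
  span.setAttribute("myplugin.cache.hits", cacheHits);
  span.end();
}
```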
Usage
```typescript
import { initTelemetry, query } from "gitclaw";

await initTelemetry({
  serviceName: "my-agent",
  exporterEndpoint: "http://localhost:4318",
});

for await (const msg of query({ prompt: "Fix the bug" })) {
  // traces + metrics exported automatically
}
```

Metrics emitted
| Metric | Type | Description |
|---|---|---|
| `gen_ai.client.token.usage` | Counter | Token consumption by model and type |
| `gen_ai.client.operation.duration` | Histogram | LLM call latency |
| `gitclaw.session.duration_ms` | Histogram | End-to-end session duration |
| `gitclaw.session.cost_usd` | Counter | Session cost by agent and model |
| `gitclaw.tool.calls` | Counter | Tool invocations by name and status |
| `gitclaw.tool.duration_ms` | Histogram | Tool execution latency |
Alternatives Considered
- Framework-level tracing (instrument every internal operation) — Rejected. Tracing manifest parsing, plugin loading, skill discovery, etc. carries a high maintenance burden and provides data most users don't need. Internal debugging can use standard logging.
- Custom tracing abstraction — Rejected. OpenTelemetry is the industry standard, vendor-neutral, and already supported by every major observability platform. Building a custom solution would fragment the ecosystem.
- SDK-level wrapping (wrap pi-ai client calls) — Partially adopted. HTTP-level interception is cleaner and survives SDK swaps, but `message_end` event enrichment is needed for structured data (tokens, cost) that isn't in raw HTTP responses.
Additional Context
- Follows OpenTelemetry GenAI Semantic Conventions
- The underlying LLM library (`@mariozechner/pi-ai`) uses Undici as its HTTP client, which is why `@opentelemetry/instrumentation-undici` is used instead of `instrumentation-http`
- Compatible with quick local testing via Jaeger all-in-one: `docker run -p 16686:16686 -p 4318:4318 jaegertracing/all-in-one`