
Add OpenTelemetry instrumentation for LLM calls, tool execution, and agent sessions #17

@abhi-bhat-lyzr

Description


Problem

There is no standardized observability for Gitclaw agent executions. Users have no visibility into:

  • What LLM calls are being made — which provider, model, token usage, cost, latency, finish reason
  • What tools are being executed — which tool, duration, success/failure
  • Overall session behavior — total cost, total tokens, number of LLM roundtrips vs tool calls

Without this, debugging agent behavior, optimizing cost, and monitoring production agents requires manual logging and guesswork.

Proposed Solution

Add OpenTelemetry-based instrumentation using a hybrid 3-layer approach:

Layer 1: HTTP-level interception (LLM calls)

  • Use @opentelemetry/instrumentation-undici to auto-instrument outbound HTTP calls to LLM providers (OpenAI, Anthropic, Google, Groq, Mistral, xAI, AWS Bedrock)
  • A custom SpanProcessor detects LLM provider URLs and enriches spans with gen_ai.* semantic conventions
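As a sketch of how the provider detection inside that SpanProcessor might work (the hostnames and gen_ai.system values below are illustrative assumptions, not the actual Gitclaw implementation):

```typescript
// Hypothetical helper: map an outbound request URL to a gen_ai.system value.
// The custom SpanProcessor would call something like this on spans produced
// by instrumentation-undici to decide whether the request targets an LLM
// provider and which gen_ai.* attributes to attach.
const LLM_PROVIDER_HOSTS: Record<string, string> = {
  "api.openai.com": "openai",
  "api.anthropic.com": "anthropic",
  "generativelanguage.googleapis.com": "gcp.gemini",
  "api.groq.com": "groq",
  "api.mistral.ai": "mistral",
  "api.x.ai": "xai",
};

function detectLlmProvider(url: string): string | undefined {
  try {
    const host = new URL(url).hostname;
    // Bedrock uses regional hosts such as bedrock-runtime.us-east-1.amazonaws.com
    if (host.startsWith("bedrock-runtime.")) return "aws.bedrock";
    return LLM_PROVIDER_HOSTS[host];
  } catch {
    return undefined; // not a parseable URL, so not an LLM call
  }
}
```

Non-matching hosts fall through to undefined, so ordinary HTTP spans pass through unenriched.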

Layer 2: Event-based enrichment (structured LLM data)

  • On each message_end event from the agent loop, create a gen_ai.chat span with:
    • gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.cost_usd
    • gen_ai.request.model, gen_ai.response.finish_reasons, gen_ai.system
  • This captures data that isn't available at the raw HTTP level (token counts, cost, stop reason)
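A minimal sketch of the Layer 2 enrichment, assuming a hypothetical message_end payload shape (the event field names are illustrative; the attribute keys follow the OpenTelemetry GenAI semantic conventions listed above):

```typescript
// Assumed shape of the agent loop's message_end payload; field names are
// illustrative, not the actual Gitclaw event schema.
interface MessageEndEvent {
  provider: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
  finishReason: string;
}

// Build the attribute set that would be attached to the gen_ai.chat span
// created for each message_end event.
function chatSpanAttributes(ev: MessageEndEvent): Record<string, string | number> {
  return {
    "gen_ai.system": ev.provider,
    "gen_ai.request.model": ev.model,
    "gen_ai.usage.input_tokens": ev.inputTokens,
    "gen_ai.usage.output_tokens": ev.outputTokens,
    "gen_ai.usage.cost_usd": ev.costUsd,
    "gen_ai.response.finish_reasons": ev.finishReason,
  };
}
```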

Layer 3: Tool call wrapping (application level)

  • Every tool execution (built-in, declarative, plugin, SDK-injected) is wrapped in a gitclaw.tool.execute span
  • Captures: tool.name, tool.call_id, tool.duration_ms, tool.status, tool.error_message

Trace shape

gitclaw.session (root)
├── gen_ai.chat (LLM call → Anthropic)
│     gen_ai.usage.input_tokens=1523, output_tokens=200, cost_usd=0.003
│     gen_ai.response.finish_reasons=tool_use
├── gitclaw.tool.execute (cli)
│     tool.name=cli, duration_ms=2340, status=success
├── gen_ai.chat (LLM call → Anthropic)
│     gen_ai.usage.input_tokens=2100, output_tokens=150
├── gitclaw.tool.execute (write)
│     tool.name=write, duration_ms=12, status=success
├── gen_ai.chat (LLM call → Anthropic)
│     gen_ai.response.finish_reasons=stop
└── session totals: tokens=7073, cost=$0.012, tool_calls=2, llm_calls=3
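The session totals on the root span could be derived by folding over the child spans; this is a sketch, and the record shape below is an illustrative assumption rather than Gitclaw's internal representation:

```typescript
// Illustrative per-child record: LLM calls carry usage data, tool calls don't.
interface ChildSpan {
  kind: "llm" | "tool";
  inputTokens?: number;
  outputTokens?: number;
  costUsd?: number;
}

// Aggregate the totals that gitclaw.session would report at the end of a run.
function sessionTotals(children: ChildSpan[]) {
  let tokens = 0, cost = 0, llmCalls = 0, toolCalls = 0;
  for (const c of children) {
    if (c.kind === "llm") {
      llmCalls++;
      tokens += (c.inputTokens ?? 0) + (c.outputTokens ?? 0);
      cost += c.costUsd ?? 0;
    } else {
      toolCalls++;
    }
  }
  return { tokens, cost, llmCalls, toolCalls };
}
```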

Key design decisions

  • Zero overhead when disabled — @opentelemetry/api returns no-op instances by default; there is no performance impact unless initTelemetry() is called
  • Opt-in SDK packages — Only @opentelemetry/api (~50KB) is a hard dependency; all SDK/exporter packages are optional peer dependencies
  • Backend agnostic — Exports via OTLP HTTP, compatible with Jaeger, Grafana Tempo, Datadog, Honeycomb, Axiom, or any OTel Collector
  • Plugin authors get access — tracer and meter are exposed on GitclawPluginApi so plugins can emit custom spans/metrics
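One way the opt-in SDK packages could be handled is a guarded dynamic import inside initTelemetry(): if an optional peer dependency is missing, telemetry simply stays in its no-op state. This is a sketch of the loading pattern only, not the actual initTelemetry() internals:

```typescript
// Attempt to load an optional peer dependency at runtime. Returns the module
// if installed, or undefined if not, so callers can degrade to no-op
// telemetry instead of crashing.
async function tryLoadSdk(pkg: string): Promise<unknown | undefined> {
  try {
    return await import(pkg);
  } catch {
    return undefined; // optional peer dependency not installed
  }
}
```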

Usage

import { initTelemetry, query } from "gitclaw";

await initTelemetry({
  serviceName: "my-agent",
  exporterEndpoint: "http://localhost:4318",
});

for await (const msg of query({ prompt: "Fix the bug" })) {
  // traces + metrics exported automatically
}

Metrics emitted

Metric                            Type       Description
gen_ai.client.token.usage         Counter    Token consumption by model and type
gen_ai.client.operation.duration  Histogram  LLM call latency
gitclaw.session.duration_ms       Histogram  End-to-end session duration
gitclaw.session.cost_usd          Counter    Session cost by agent and model
gitclaw.tool.calls                Counter    Tool invocations by name and status
gitclaw.tool.duration_ms          Histogram  Tool execution latency

Alternatives Considered

  1. Framework-level tracing (instrument every internal operation) — Rejected. Tracing manifest parsing, plugin loading, skill discovery, etc. would carry a high maintenance burden and provide data most users don't need. Internal debugging can use standard logging.

  2. Custom tracing abstraction — Rejected. OpenTelemetry is the industry standard, vendor-neutral, and already supported by every major observability platform. Building a custom solution would fragment the ecosystem.

  3. SDK-level wrapping (wrap pi-ai client calls) — Partially adopted. HTTP-level interception is cleaner and survives SDK swaps, but message_end event enrichment is needed for structured data (tokens, cost) that isn't in raw HTTP responses.

Additional Context

  • Follows OpenTelemetry GenAI Semantic Conventions
  • The underlying LLM library (@mariozechner/pi-ai) uses Undici as its HTTP client, which is why @opentelemetry/instrumentation-undici is used instead of instrumentation-http
  • Compatible with quick local testing via Jaeger all-in-one:
    docker run -p 16686:16686 -p 4318:4318 jaegertracing/all-in-one
