# feat: transcript import pipeline — grade existing Claude/Codex/Copilot sessions offline (#872)

## Objective
Add an agentv import command that reads existing AI coding sessions from Claude Code, Codex CLI, and Copilot CLI, normalizes them into a tool-agnostic transcript format, and feeds them into the existing evaluator pipeline for grading. This enables comparing different clients and workspace setups from manually-run sessions without re-executing anything.
Bundled: rename `agentv trace` → `agentv inspect` to free "trace" for its industry-standard meaning (OTel spans) and avoid a terminology collision.
## Motivation
- Teams run the same task in Claude Code, Codex CLI, and Copilot CLI to compare quality — today there's no way to grade those sessions without re-running them
- Workspace setup experiments (different CLAUDE.md files, system prompts, skills) produce sessions that should be gradeable after the fact
- Grader iteration: import once, re-grade many times with different evaluators without burning API tokens
- Industry alignment: every major eval framework (DeepEval, Braintrust, LangSmith) separates data import from evaluation
## Design

### Architecture: two-step pipeline

Industry best practice separates import from evaluation: the data carries the output, not the provider.

1. `agentv import <source>` → transcript JSONL (`.agentv/transcripts/`)
2. `agentv eval <eval.yaml> --transcript <file>` → graded results (`.agentv/results/`)
Why not a provider modifier (e.g., `provider: claude-cli` plus a `session:` block)?

- It makes providers do two unrelated things (spawn a live agent vs parse static files)
- Every agent provider would need a transcript parser coupled to it
- Imported data couldn't be cheaply reused across multiple eval runs
- It's not how any major eval framework works (DeepEval uses pre-populated test cases, Braintrust uses dataset-from-traces, Promptfoo uses an echo provider)
## Terminology

| Term | Meaning | Context |
|---|---|---|
| session | Raw tool-specific file on disk | `~/.claude/.../*.jsonl`, `~/.codex/.../*.jsonl`, `~/.copilot/.../*.jsonl` |
| transcript | Normalized, tool-agnostic AgentV representation | Output of `agentv import`, input to `agentv eval --transcript` |
| result | Graded output from `agentv eval` | Existing concept, unchanged |
| trace | OTel spans for observability backends | Reserved for OTel export only |
| dataset | Eval test grouping in studio | Existing concept, unchanged |
## Session file locations (input)

| Tool | Path | Format |
|---|---|---|
| Claude Code | `~/.claude/projects/<encoded-path>/<uuid>.jsonl` | JSONL — `type` field: `user`, `assistant`, `progress`, `system` |
| Codex CLI | `~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl` | JSONL — `RolloutItem`: `ResponseItem`, `EventMsg`, `TurnContext`, `SessionMeta` |
| Copilot CLI | `~/.copilot/session-state/{id}/events.jsonl` | JSONL — events: `session.start`, `user.message`, `assistant.message`, `tool.execution_*` |
## Transcript JSONL format (output of import)

Each line is a self-contained test case with a pre-populated output:

```json
{
  "input": "Add JWT authentication to the Express API",
  "output": [
    {
      "role": "user",
      "content": "Add JWT authentication to the Express API"
    },
    {
      "role": "assistant",
      "content": "I'll add JWT auth...",
      "tool_calls": [
        {"tool": "Read", "input": {"file_path": "/src/index.ts"}, "output": "...", "duration_ms": 45},
        {"tool": "Edit", "input": {"file_path": "/src/auth.ts", "old_string": "...", "new_string": "..."}, "duration_ms": 120}
      ]
    }
  ],
  "token_usage": {"input": 15000, "output": 3200, "cached": 8000},
  "duration_ms": 45000,
  "cost_usd": 0.12,
  "source": {
    "provider": "claude-cli",
    "session_id": "0763061e-9c91-4dee-9373-3bd69c817fd4",
    "model": "claude-opus-4-6",
    "version": "2.1.62",
    "timestamp": "2026-03-29T21:10:11.969Z",
    "git_branch": "main",
    "cwd": "/home/user/projects/myapp"
  }
}
```

The `output` field uses the existing `Message[]` schema from `packages/core/src/evaluation/providers/types.ts` (snake_case wire format).
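As a type-level sketch, the entry shape implied by the JSON example might look like the following. The interface and field names here are assumptions mirroring the example (the plan places the real `TranscriptEntry` in `packages/core/src/import/types.ts`):

```typescript
// Hypothetical sketch of TranscriptEntry; field names mirror the JSON
// example above and are not the authoritative definition.
interface TranscriptToolCall {
  tool: string;
  input: Record<string, unknown>;
  output?: string;
  duration_ms?: number;
}

interface TranscriptMessage {
  role: "user" | "assistant";
  content: string;
  tool_calls?: TranscriptToolCall[];
}

interface TranscriptSource {
  provider: string;
  session_id: string;
  model?: string;
  version?: string;
  timestamp?: string;
  git_branch?: string;
  cwd?: string;
}

interface TranscriptEntry {
  input: string;
  output: TranscriptMessage[];
  token_usage?: { input: number; output: number; cached?: number };
  duration_ms?: number;
  cost_usd?: number | null;
  source: TranscriptSource;
}

// Minimal structural check a reader could apply to each parsed JSONL line.
function isTranscriptEntry(value: unknown): value is TranscriptEntry {
  const v = value as Partial<TranscriptEntry> | null;
  return (
    typeof v === "object" && v !== null &&
    typeof v.input === "string" &&
    Array.isArray(v.output) &&
    typeof v.source === "object" && v.source !== null &&
    typeof (v.source as TranscriptSource).provider === "string" &&
    typeof (v.source as TranscriptSource).session_id === "string"
  );
}
```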
## CLI: `agentv import`

```bash
# Import a specific Claude Code session
agentv import claude \
  --session-id 0763061e-9c91-4dee-9373-3bd69c817fd4 \
  --output .agentv/transcripts/claude-auth.jsonl

# Import the latest Claude Code session for a project
agentv import claude \
  --discover latest \
  --project-path /home/user/projects/myapp \
  --output .agentv/transcripts/claude-auth.jsonl

# Import the latest Codex CLI session
agentv import codex \
  --discover latest \
  --output .agentv/transcripts/codex-auth.jsonl

# Import a specific Codex session by date
agentv import codex \
  --date 2026-03-29 \
  --discover latest \
  --output .agentv/transcripts/codex-auth.jsonl

# Import a Copilot CLI session
agentv import copilot \
  --session-id abc-123 \
  --output .agentv/transcripts/copilot-auth.jsonl
```

Default output directory: `.agentv/transcripts/`
## Evaluation with transcripts

```bash
# Grade a transcript against an eval
agentv eval evals/compare-clients.yaml --transcript .agentv/transcripts/claude-auth.jsonl
agentv eval evals/compare-clients.yaml --transcript .agentv/transcripts/codex-auth.jsonl
agentv eval evals/compare-clients.yaml --transcript .agentv/transcripts/copilot-auth.jsonl

# Compare graded results
agentv compare \
  .agentv/results/runs/claude-auth-* \
  .agentv/results/runs/codex-auth-* \
  .agentv/results/runs/copilot-auth-*
```

Example eval YAML:

```yaml
description: Compare client implementations of auth feature
tests:
  - input: "Add JWT authentication to the Express API"
    assert:
      - type: llm-grader
        prompt: ./graders/auth-quality.md
      - type: tool-trajectory
        value: ["Read", "Edit", "Bash"]
      - type: code-grader
        value: ./graders/check-jwt-files.ts
      - type: cost
        budget: 0.50
      - type: execution-metrics
        max_tool_calls: 30
```

When `--transcript` is provided, the orchestrator skips provider invocation and uses the pre-populated `output` from the transcript as the `ProviderResponse`. Evaluators run identically to a live eval.
## Bundled rename: `agentv trace` → `agentv inspect`

The current `agentv trace` subcommands are all result-inspection tools, not OTel trace operations:

| Current | Renamed | Purpose |
|---|---|---|
| `agentv trace list` | `agentv inspect list` | List result files |
| `agentv trace show` | `agentv inspect show` | Show result details with execution tree |
| `agentv trace stats` | `agentv inspect stats` | Compute percentile stats |
| `agentv trace score` | `agentv inspect score` | Re-run evaluators post-hoc |

Keep `agentv trace` as a deprecated alias for one release cycle.
## Existing code to build on

| File | Relevance |
|---|---|
| `packages/core/src/evaluation/providers/copilot-log.ts` | Existing transcript reader — same pattern; migrate the parser into the importer |
| `packages/core/src/evaluation/providers/copilot-log-parser.ts` | Copilot JSONL parsing logic — reusable |
| `packages/core/src/evaluation/providers/copilot-session-discovery.ts` | Session discovery logic — reusable pattern for Claude/Codex |
| `packages/core/src/evaluation/providers/types.ts` | `Message[]`, `ToolCall`, `ProviderResponse` — the target format |
| `packages/core/src/evaluation/providers/cli.ts` | Generic replay via external scripts — reference pattern |
| `packages/core/src/evaluation/orchestrator.ts` | Needs the `--transcript` bypass (skip provider, use pre-populated output) |
| `apps/cli/src/commands/trace/` | All four subcommands to rename |
| `examples/showcase/offline-grader-benchmark/` | Existing offline grading example — validates the approach |
| `examples/features/copilot-log-eval/` | Existing Copilot transcript eval example |
## Implementation order

1. Claude importer (`agentv import claude`) — highest value, primary tool, sessions already on disk
2. Orchestrator `--transcript` flag — skip provider invocation, use pre-populated output
3. Codex importer (`agentv import codex`) — enables cross-client comparison
4. `agentv trace` → `agentv inspect` rename — with deprecated alias
5. Copilot importer (`agentv import copilot`) — migrate existing `copilot-log` parser logic
## Acceptance signals

- [ ] `agentv import claude --session-id <uuid>` produces valid transcript JSONL
- [ ] `agentv import claude --discover latest --project-path <path>` auto-discovers sessions
- [ ] `agentv import codex --discover latest` produces valid transcript JSONL
- [ ] `agentv import copilot --session-id <uuid>` produces valid transcript JSONL
- [ ] `agentv eval <file> --transcript <path>` grades pre-populated transcripts without provider invocation
- [ ] All existing evaluators (llm-grader, code-grader, rubrics, tool-trajectory, execution-metrics, cost, latency) work on transcripts
- [ ] `agentv compare` works across results from different transcript sources
- [ ] `agentv inspect list/show/stats/score` work identically to the current `agentv trace` subcommands
- [ ] `agentv trace` still works as a deprecated alias
- [ ] Transcript JSONL includes `source` metadata (provider, session_id, model, timestamp, git_branch)
- [ ] Token usage, cost, and duration are preserved from session data
## Non-goals

- Real-time session capture (sessions are read post-hoc from disk)
- Cursor/Windsurf/Aider support (can be added later as additional importers)
- Modifying the `copilot-log` provider (it keeps working; deprecate later)
- Session-to-test-case auto-matching (the user specifies which test a session corresponds to)
- Workspace state verification (git diff capture from sessions — future enhancement)
## Industry research

Research across six eval frameworks confirms this architecture:
| Framework | Pattern | How offline eval works |
|---|---|---|
| DeepEval | Data-driven | Pre-populated actual_output on LLMTestCase |
| Braintrust | Unified traces | Production traces converted to datasets, same evaluators |
| LangSmith | Dual-context | Datasets from traces, aevaluate() on pre-recorded data |
| Promptfoo | Echo provider | Separate echo provider returns cached data through provider interface |
| RunLedger | Cassette replay | JSONL cassettes replayed deterministically in CI |
| WTG Agent Evaluator | Event-stream-first | aeval chats to-tests converts sessions to eval test sets |
All separate import/transform from evaluation. None make the live provider double as a file reader.
Additional tools that parse these session formats:
- code-insights — parses Claude, Codex, Copilot, Cursor sessions into unified SQLite
- Agent Trace — emerging standard for AI code attribution (Cursor, Vercel, Google Jules)
- entireio/cli — git-integrated session capture with checkpoint/rewind
## Design clarifications

These resolve ambiguities an implementing agent would otherwise need to ask about.
### How transcripts map to eval test cases

A transcript is not matched to eval tests by input string. Instead, `--transcript` provides a pre-populated `ProviderResponse` that replaces provider invocation entirely:

- The eval YAML still defines tests with `input` and `assert` — these provide the grading criteria
- The transcript provides the `output` (`Message[]`) that would normally come from a live provider
- Matching is positional: transcript line 1 → test 1, transcript line 2 → test 2
- If the eval has 1 test and the transcript has 1 line, they pair 1:1 (the most common case)
- If the counts don't match, fail with a clear error message
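The positional pairing rule above can be sketched as follows; `EvalTest` and `TranscriptEntry` are simplified stand-ins, not the real orchestrator types:

```typescript
// Hypothetical simplified types; the real ones live in packages/core.
interface EvalTest { input: string }
interface TranscriptEntry { input: string; output: unknown[] }

function pairTestsWithTranscript(
  tests: EvalTest[],
  transcript: TranscriptEntry[],
): Array<[EvalTest, TranscriptEntry]> {
  // Counts must match exactly; anything else is a user error.
  if (tests.length !== transcript.length) {
    throw new Error(
      `Transcript has ${transcript.length} line(s) but the eval defines ` +
      `${tests.length} test(s); counts must match for positional pairing.`,
    );
  }
  // Positional: transcript line i grades test i.
  return tests.map((test, i) => [test, transcript[i]]);
}
```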
This means a typical workflow is:

1. Run a task manually in Claude Code
2. Import the session → produces a 1-line transcript (the whole session is one entry)
3. Write an eval YAML with 1 test that has the same input plus the assertions you want
4. `agentv eval <eval.yaml> --transcript <transcript.jsonl>`
### Multi-turn sessions → single transcript entry

A session with multiple user messages is one transcript line (one test case). The entire conversation is captured in the `output: Message[]` array. This matches how agent evals work: one task = one session, even if it spans many turns.

The `input` field in the transcript captures the first user message (the initial task). All subsequent messages (follow-ups, clarifications) are part of the `output` conversation.
### Orchestrator `--transcript` bypass mechanics

When `--transcript` is provided:

- `targets:` in the eval YAML is ignored (no provider is invoked)
- The `target` field in the result JSONL is set to `${source.provider}` from the transcript (e.g., `claude-cli`)
- `trials.count` is forced to 1 (replaying the same transcript multiple times is meaningless)
- `workspace_template` is ignored (no workspace is created)
- The orchestrator constructs a `ProviderResponse` from the transcript: `{ output: line.output, tokenUsage: line.token_usage, durationMs: line.duration_ms, costUsd: line.cost_usd, startTime: line.source.timestamp }`
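The snake_case → camelCase mapping in the last bullet can be sketched like this; both shapes are simplified assumptions (the real `ProviderResponse` lives in `packages/core/src/evaluation/providers/types.ts`):

```typescript
// Simplified stand-ins for the wire-format transcript line and the
// in-memory ProviderResponse; not the authoritative definitions.
interface TranscriptLine {
  output: unknown[];
  token_usage?: { input: number; output: number; cached?: number };
  duration_ms?: number;
  cost_usd?: number | null;
  source: { provider: string; timestamp?: string };
}

interface ProviderResponseLike {
  output: unknown[];
  tokenUsage?: { input: number; output: number; cached?: number };
  durationMs?: number;
  costUsd?: number | null;
  startTime?: string;
}

// The bypass: a pure field mapping, no provider process involved.
function responseFromTranscript(line: TranscriptLine): ProviderResponseLike {
  return {
    output: line.output,
    tokenUsage: line.token_usage,
    durationMs: line.duration_ms,
    costUsd: line.cost_usd,
    startTime: line.source.timestamp,
  };
}
```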
Multiple transcripts for comparison:

```bash
# Run the same eval three times with different transcripts
agentv eval evals/auth.yaml --transcript .agentv/transcripts/claude-auth.jsonl
agentv eval evals/auth.yaml --transcript .agentv/transcripts/codex-auth.jsonl
agentv eval evals/auth.yaml --transcript .agentv/transcripts/copilot-auth.jsonl
# Each produces a separate result run, then compare
```

### File placement in monorepo
```
packages/core/src/import/            # NEW — parser logic (reusable by CLI and SDK)
  claude-parser.ts                   # Parse Claude Code session JSONL → Message[]
  codex-parser.ts                    # Parse Codex CLI rollout JSONL → Message[]
  copilot-parser.ts                  # Wraps existing copilot-log-parser.ts
  session-discovery.ts               # Unified session discovery (find latest, by id, by project)
  types.ts                           # TranscriptEntry interface, ImportConfig
  index.ts                           # Public API

apps/cli/src/commands/import/        # NEW — CLI command
  index.ts                           # Subcommand registration (claude, codex, copilot)
  claude.ts                          # CLI handler for `agentv import claude`
  codex.ts                           # CLI handler for `agentv import codex`
  copilot.ts                         # CLI handler for `agentv import copilot`

apps/cli/src/commands/inspect/       # RENAMED from trace/
  index.ts                           # (rename trace → inspect)
  show.ts
  list.ts
  stats.ts
  score.ts
```

Parsers live in `packages/core` so they're usable from both the CLI and the programmatic SDK (`evaluate()` API).
### Default output naming

When `--output` is omitted:

```
.agentv/transcripts/<source>-<session-id-short>.jsonl

# Examples:
.agentv/transcripts/claude-0763061e.jsonl
.agentv/transcripts/codex-rollout-2026-03-29T14-22-01.jsonl
.agentv/transcripts/copilot-abc123.jsonl
```

The directory `.agentv/transcripts/` is created automatically if it doesn't exist.
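One possible reading of the naming rule, consistent with the examples above: UUID-style ids are shortened to their first segment, while other ids (e.g. Codex rollout timestamps) are kept whole. This interpretation is an assumption, not specified behavior:

```typescript
// Hypothetical default-naming helper; the shortening rule is inferred
// from the claude-0763061e / copilot-abc123 examples.
function defaultTranscriptPath(source: string, sessionId: string): string {
  const uuidLike = /^[0-9a-f]{8}-[0-9a-f]{4}/i.test(sessionId);
  const shortId = uuidLike ? sessionId.slice(0, 8) : sessionId;
  return `.agentv/transcripts/${source}-${shortId}.jsonl`;
}
```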
### Cost calculation from token counts

Claude Code sessions store token counts but not dollar costs. The importer should:

- Extract `usage.input_tokens`, `usage.output_tokens`, `usage.cache_creation_input_tokens`, and `usage.cache_read_input_tokens` from assistant messages
- Sum them across all messages in the session
- Set `cost_usd: null` (not computed) — cost calculation requires model pricing tables, which change frequently and are out of scope
- The `cost` evaluator will work if the user provides a `cost_usd` override in the eval YAML, or will skip/fail gracefully if null

The same approach applies to Codex CLI (which has `usage.input_tokens` / `usage.output_tokens` in `turn.completed` events).
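The summation step can be sketched as below. Folding both cache-creation and cache-read tokens into a single `cached` total is an assumption matching the `token_usage.cached` field in the transcript example; the importer may want to split them:

```typescript
// Usage block shape from Claude Code assistant messages, per the plan above.
interface UsageBlock {
  input_tokens?: number;
  output_tokens?: number;
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
}

// Sum usage blocks across all assistant messages in a session.
function aggregateUsage(
  blocks: UsageBlock[],
): { input: number; output: number; cached: number } {
  return blocks.reduce(
    (acc, u) => ({
      input: acc.input + (u.input_tokens ?? 0),
      output: acc.output + (u.output_tokens ?? 0),
      // Assumption: cache-creation and cache-read both count as "cached".
      cached:
        acc.cached +
        (u.cache_creation_input_tokens ?? 0) +
        (u.cache_read_input_tokens ?? 0),
    }),
    { input: 0, output: 0, cached: 0 },
  );
}
```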
### Claude Code session parsing specifics

Event type mapping:

| Session event type | Transcript handling |
|---|---|
| `type: "user"` | → `Message { role: "user", content }` |
| `type: "assistant"` | → `Message { role: "assistant", content, toolCalls }` — extract `tool_use`/`tool_result` from the content array |
| `type: "progress"` | Skip — hook/status events, not conversation |
| `type: "system"` | Skip — API errors and turn metrics (extract duration from here) |
| `type: "file-history-snapshot"` | Skip — file backup tracking |

Subagent sessions (`{uuid}/subagents/agent-{id}.jsonl`): skip for v1. Import only the main session; subagent import can be added later.

Token usage: aggregate `usage` blocks from all `type: "assistant"` messages. Each has `{ input_tokens, output_tokens, cache_creation_input_tokens, cache_read_input_tokens }`.

Duration: use the timestamp delta between the first and last event in the session.
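The per-line dispatch and duration rules above can be sketched as follows; the `SessionEvent` shape is a simplified assumption about the Claude Code JSONL, and tool-call extraction is omitted:

```typescript
// Simplified stand-in for a Claude Code session JSONL line.
interface SessionEvent {
  type: string;
  message?: { content: unknown };
  timestamp?: string;
}

interface SimpleMessage { role: "user" | "assistant"; content: unknown }

// Keep only user/assistant events; drop progress, system,
// and file-history-snapshot lines per the mapping table.
function toMessages(events: SessionEvent[]): SimpleMessage[] {
  const out: SimpleMessage[] = [];
  for (const ev of events) {
    if (ev.type !== "user" && ev.type !== "assistant") continue;
    if (!ev.message) continue;
    out.push({ role: ev.type, content: ev.message.content });
  }
  return out;
}

// Duration: delta between the first and last timestamped events.
function sessionDurationMs(events: SessionEvent[]): number | undefined {
  const stamps = events
    .map((e) => e.timestamp)
    .filter((t): t is string => typeof t === "string");
  if (stamps.length < 2) return undefined;
  return Date.parse(stamps[stamps.length - 1]) - Date.parse(stamps[0]);
}
```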
### Codex CLI session parsing specifics

| Rollout item type | Transcript handling |
|---|---|
| `ResponseItem` | → `Message[]` — map user messages, assistant messages, tool outputs |
| `EventMsg` (`TurnStarted`/`TurnComplete`) | Extract turn timing and token usage from `TurnComplete` |
| `TurnContext` | Extract model name, CWD, policies |
| `SessionMeta` | Extract thread_id, git info, CLI version → `source` metadata |
| `Compacted` | Skip — compaction metadata |

Tool calls: extract from `ExecCommandEnd` events within `EventMsg`. Note that stdout/stderr are sanitized (cleared) — only `aggregated_output` is available.