feat: transcript import pipeline — grade existing Claude/Codex/Copilot sessions offline #872

@christso

Objective

Add an agentv import command that reads existing AI coding sessions from Claude Code, Codex CLI, and Copilot CLI, normalizes them into a tool-agnostic transcript format, and feeds them into the existing evaluator pipeline for grading. This enables comparing different clients and workspace setups from manually-run sessions without re-executing anything.

Bundled: rename `agentv trace` → `agentv inspect` to free up "trace" for its industry-standard meaning (OTel spans) and avoid a terminology collision.

Motivation

  • Teams run the same task in Claude Code, Codex CLI, and Copilot CLI to compare quality — today there's no way to grade those sessions without re-running them
  • Workspace setup experiments (different CLAUDE.md files, system prompts, skills) produce sessions that should be gradeable after the fact
  • Grader iteration: import once, re-grade many times with different evaluators without burning API tokens
  • Industry alignment: every major eval framework (DeepEval, Braintrust, LangSmith) separates data import from evaluation

Design

Architecture: Two-step pipeline

Industry best practice separates import from evaluation. The data carries the output, not the provider.

Step 1: agentv import <source>  → transcript JSONL (.agentv/transcripts/)
Step 2: agentv eval <eval.yaml> --transcript <file>  → graded results (.agentv/results/)

Why not a provider modifier (e.g., provider: claude-cli + session: block)?

  • Makes providers do two unrelated things (spawn live agent vs parse static files)
  • Every agent provider would need a transcript parser coupled to it
  • Can't reuse imported data across multiple eval runs cheaply
  • Not how any major eval framework works (DeepEval uses pre-populated test cases, Braintrust uses dataset-from-traces, Promptfoo uses an echo provider)

Terminology

| Term | Meaning | Context |
|---|---|---|
| session | Raw tool-specific file on disk | `~/.claude/.../*.jsonl`, `~/.codex/.../*.jsonl`, `~/.copilot/.../*.jsonl` |
| transcript | Normalized, tool-agnostic AgentV representation | Output of `agentv import`, input to `agentv eval --transcript` |
| result | Graded output from `agentv eval` | Existing concept, unchanged |
| trace | OTel spans for observability backends | Reserved for OTel export only |
| dataset | Eval test grouping in studio | Existing concept, unchanged |

Session file locations (input)

| Tool | Path | Format |
|---|---|---|
| Claude Code | `~/.claude/projects/<encoded-path>/<uuid>.jsonl` | JSONL — `type` field: `user`, `assistant`, `progress`, `system` |
| Codex CLI | `~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl` | JSONL — `RolloutItem`: `ResponseItem`, `EventMsg`, `TurnContext`, `SessionMeta` |
| Copilot CLI | `~/.copilot/session-state/{id}/events.jsonl` | JSONL — events: `session.start`, `user.message`, `assistant.message`, `tool.execution_*` |
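For `--discover latest`, discovery across all three layouts reduces to "glob the tool's directory, pick the newest file". A minimal sketch — the `SessionFile` shape and `pickLatest` helper are illustrative, not existing code; the candidate list would come from globbing the paths in the table above:

```typescript
// Illustrative: pick the most recent session file from a candidate list.
// The list is passed in (rather than read from disk) to keep this testable.
interface SessionFile {
  path: string;
  mtimeMs: number; // filesystem modification time
}

function pickLatest(candidates: SessionFile[]): SessionFile {
  if (candidates.length === 0) {
    throw new Error("no session files found");
  }
  return candidates.reduce((a, b) => (b.mtimeMs > a.mtimeMs ? b : a));
}
```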

Transcript JSONL format (output of import)

Each line is a self-contained test case with pre-populated output:

{
  "input": "Add JWT authentication to the Express API",
  "output": [
    {
      "role": "user",
      "content": "Add JWT authentication to the Express API"
    },
    {
      "role": "assistant",
      "content": "I'll add JWT auth...",
      "tool_calls": [
        {"tool": "Read", "input": {"file_path": "/src/index.ts"}, "output": "...", "duration_ms": 45},
        {"tool": "Edit", "input": {"file_path": "/src/auth.ts", "old_string": "...", "new_string": "..."}, "duration_ms": 120}
      ]
    }
  ],
  "token_usage": {"input": 15000, "output": 3200, "cached": 8000},
  "duration_ms": 45000,
  "cost_usd": 0.12,
  "source": {
    "provider": "claude-cli",
    "session_id": "0763061e-9c91-4dee-9373-3bd69c817fd4",
    "model": "claude-opus-4-6",
    "version": "2.1.62",
    "timestamp": "2026-03-29T21:10:11.969Z",
    "git_branch": "main",
    "cwd": "/home/user/projects/myapp"
  }
}

The output field uses the existing Message[] schema from packages/core/src/evaluation/providers/types.ts (snake_case wire format).
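Assuming the JSON above is the complete wire shape, the importer's `TranscriptEntry` type could be sketched as follows. Field names mirror the example; the `isTranscriptEntry` guard is a hypothetical helper for validating parsed JSONL lines, not part of the codebase:

```typescript
// Sketch of the transcript wire format (snake_case, one entry per JSONL line).
interface TranscriptSource {
  provider: string;
  session_id: string;
  model?: string;
  version?: string;
  timestamp?: string;
  git_branch?: string;
  cwd?: string;
}

interface TranscriptEntry {
  input: string;
  output: unknown[]; // Message[] from providers/types.ts in the real code
  token_usage?: { input: number; output: number; cached?: number };
  duration_ms?: number;
  cost_usd?: number | null;
  source: TranscriptSource;
}

// Minimal runtime guard: checks only the required fields.
function isTranscriptEntry(v: unknown): v is TranscriptEntry {
  const e = v as TranscriptEntry;
  return (
    typeof e === "object" && e !== null &&
    typeof e.input === "string" &&
    Array.isArray(e.output) &&
    typeof e.source === "object" && e.source !== null &&
    typeof e.source.provider === "string" &&
    typeof e.source.session_id === "string"
  );
}
```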

CLI: agentv import

# Import a specific Claude Code session
agentv import claude \
  --session-id 0763061e-9c91-4dee-9373-3bd69c817fd4 \
  --output .agentv/transcripts/claude-auth.jsonl

# Import latest Claude Code session for a project
agentv import claude \
  --discover latest \
  --project-path /home/user/projects/myapp \
  --output .agentv/transcripts/claude-auth.jsonl

# Import latest Codex CLI session
agentv import codex \
  --discover latest \
  --output .agentv/transcripts/codex-auth.jsonl

# Import specific Codex session by date
agentv import codex \
  --date 2026-03-29 \
  --discover latest \
  --output .agentv/transcripts/codex-auth.jsonl

# Import Copilot CLI session
agentv import copilot \
  --session-id abc-123 \
  --output .agentv/transcripts/copilot-auth.jsonl

Default output directory: .agentv/transcripts/

Evaluation with transcripts

# Grade a transcript against an eval
agentv eval evals/compare-clients.yaml --transcript .agentv/transcripts/claude-auth.jsonl
agentv eval evals/compare-clients.yaml --transcript .agentv/transcripts/codex-auth.jsonl
agentv eval evals/compare-clients.yaml --transcript .agentv/transcripts/copilot-auth.jsonl

# Compare graded results
agentv compare \
  .agentv/results/runs/claude-auth-* \
  .agentv/results/runs/codex-auth-* \
  .agentv/results/runs/copilot-auth-*

Example eval YAML:

description: Compare client implementations of auth feature
tests:
  - input: "Add JWT authentication to the Express API"
    assert:
      - type: llm-grader
        prompt: ./graders/auth-quality.md
      - type: tool-trajectory
        value: ["Read", "Edit", "Bash"]
      - type: code-grader
        value: ./graders/check-jwt-files.ts
      - type: cost
        budget: 0.50
      - type: execution-metrics
        max_tool_calls: 30

When --transcript is provided, the orchestrator skips provider invocation and uses the pre-populated output from the transcript as the ProviderResponse. Evaluators run identically to live eval.

Bundled rename: `agentv trace` → `agentv inspect`

Current agentv trace subcommands are all result inspection tools, not OTel trace operations:

| Current | Renamed | Purpose |
|---|---|---|
| `agentv trace list` | `agentv inspect list` | List result files |
| `agentv trace show` | `agentv inspect show` | Show result details with execution tree |
| `agentv trace stats` | `agentv inspect stats` | Compute percentile stats |
| `agentv trace score` | `agentv inspect score` | Re-run evaluators post-hoc |

Keep agentv trace as a deprecated alias for one release cycle.

Existing code to build on

| File | Relevance |
|---|---|
| `packages/core/src/evaluation/providers/copilot-log.ts` | Existing transcript reader — same pattern; migrate parser to importer |
| `packages/core/src/evaluation/providers/copilot-log-parser.ts` | Copilot JSONL parsing logic — reusable |
| `packages/core/src/evaluation/providers/copilot-session-discovery.ts` | Session discovery logic — reusable pattern for Claude/Codex |
| `packages/core/src/evaluation/providers/types.ts` | `Message[]`, `ToolCall`, `ProviderResponse` — the target format |
| `packages/core/src/evaluation/providers/cli.ts` | Generic replay via external scripts — reference pattern |
| `packages/core/src/evaluation/orchestrator.ts` | Needs `--transcript` bypass (skip provider, use pre-populated output) |
| `apps/cli/src/commands/trace/` | All four subcommands to rename |
| `examples/showcase/offline-grader-benchmark/` | Existing offline grading example — validates the approach |
| `examples/features/copilot-log-eval/` | Existing Copilot transcript eval example |

Implementation order

  1. Claude importer (agentv import claude) — highest value, primary tool, sessions already on disk
  2. Orchestrator --transcript flag — skip provider invocation, use pre-populated output
  3. Codex importer (agentv import codex) — enables cross-client comparison
  4. `agentv trace` → `agentv inspect` rename — with deprecated alias
  5. Copilot importer (agentv import copilot) — migrate existing copilot-log parser logic

Acceptance signals

  • agentv import claude --session-id <uuid> produces valid transcript JSONL
  • agentv import claude --discover latest --project-path <path> auto-discovers sessions
  • agentv import codex --discover latest produces valid transcript JSONL
  • agentv import copilot --session-id <uuid> produces valid transcript JSONL
  • agentv eval <file> --transcript <path> grades pre-populated transcripts without provider invocation
  • All existing evaluators (llm-grader, code-grader, rubrics, tool-trajectory, execution-metrics, cost, latency) work on transcripts
  • agentv compare works across results from different transcript sources
  • agentv inspect list/show/stats/score work identically to current agentv trace subcommands
  • agentv trace still works as deprecated alias
  • Transcript JSONL includes source metadata (provider, session_id, model, timestamp, git_branch)
  • Token usage, cost, and duration are preserved from session data

Non-goals

  • Real-time session capture (sessions are read post-hoc from disk)
  • Cursor/Windsurf/Aider support (can be added later as additional importers)
  • Modifying the copilot-log provider (keep working, deprecate later)
  • Session-to-test-case auto-matching (user specifies which test a session corresponds to)
  • Workspace state verification (git diff capture from sessions — future enhancement)

Industry research

Research across 6 eval frameworks confirms this architecture:

| Framework | Pattern | How offline eval works |
|---|---|---|
| DeepEval | Data-driven | Pre-populated `actual_output` on `LLMTestCase` |
| Braintrust | Unified traces | Production traces converted to datasets, same evaluators |
| LangSmith | Dual-context | Datasets from traces, `aevaluate()` on pre-recorded data |
| Promptfoo | Echo provider | Separate echo provider returns cached data through the provider interface |
| RunLedger | Cassette replay | JSONL cassettes replayed deterministically in CI |
| WTG Agent Evaluator | Event-stream-first | `aeval chats to-tests` converts sessions to eval test sets |

All separate import/transform from evaluation. None make the live provider double as a file reader.

Additional tools that parse these session formats:

  • code-insights — parses Claude, Codex, Copilot, Cursor sessions into unified SQLite
  • Agent Trace — emerging standard for AI code attribution (Cursor, Vercel, Google Jules)
  • entireio/cli — git-integrated session capture with checkpoint/rewind

Design clarifications

These resolve ambiguities an implementing agent would otherwise need to ask about.

How transcripts map to eval test cases

A transcript is not matched to eval tests by input string. Instead, --transcript provides a pre-populated ProviderResponse that replaces provider invocation entirely:

  • The eval YAML still defines tests with input and assert — these provide the grading criteria
  • The transcript provides the output (Message[]) that would normally come from a live provider
  • Matching is positional: transcript line 1 → test 1, transcript line 2 → test 2
  • If the eval has 1 test and the transcript has 1 line, they pair 1:1 (most common case)
  • If counts don't match, error with a clear message
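The positional pairing rule above is mechanical enough to sketch directly — a zip with a count check. Names here (`EvalTest`, `pairPositionally`) are hypothetical:

```typescript
// Pair eval tests with transcript lines positionally: test i gets line i.
// Errors out when counts differ, per the rule above.
interface EvalTest { input: string }
interface Paired<T> { test: EvalTest; transcript: T }

function pairPositionally<T>(tests: EvalTest[], lines: T[]): Paired<T>[] {
  if (tests.length !== lines.length) {
    throw new Error(
      `transcript has ${lines.length} line(s) but eval defines ${tests.length} test(s)`,
    );
  }
  return tests.map((test, i) => ({ test, transcript: lines[i] }));
}
```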

This means a typical workflow is:

  1. Run a task manually in Claude Code
  2. Import the session → produces 1-line transcript (the whole session is one entry)
  3. Write an eval YAML with 1 test that has the same input + the assertions you want
  4. agentv eval <eval.yaml> --transcript <transcript.jsonl>

Multi-turn sessions → single transcript entry

A session with multiple user messages is one transcript line (one test case). The entire conversation is captured in the output: Message[] array. This matches how agent evals work: one task = one session, even if it spans many turns.

The input field in the transcript captures the first user message (the initial task). All subsequent messages (follow-ups, clarifications) are part of the output conversation.
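The "first user message becomes `input`" rule can be written out in a few lines (the `Msg` shape is simplified from the transcript example; `deriveInput` is a hypothetical helper):

```typescript
// Derive the transcript `input` field from a session's message list:
// the first user message is the initial task; the full conversation
// (including follow-ups) stays in `output`.
interface Msg { role: string; content: string }

function deriveInput(messages: Msg[]): string {
  const first = messages.find((m) => m.role === "user");
  if (!first) {
    throw new Error("session contains no user message");
  }
  return first.content;
}
```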

Orchestrator --transcript bypass mechanics

When --transcript is provided:

  • targets: in the eval YAML is ignored (no provider is invoked)
  • The target field in result JSONL is set to ${source.provider} from the transcript (e.g., claude-cli)
  • trials.count is forced to 1 (replaying the same transcript multiple times is meaningless)
  • workspace_template is ignored (no workspace is created)
  • The orchestrator constructs a ProviderResponse from the transcript: { output: line.output, tokenUsage: line.token_usage, durationMs: line.duration_ms, costUsd: line.cost_usd, startTime: line.source.timestamp }
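The last bullet can be sketched as a standalone mapping function. Treat this as illustrative, not the real orchestrator code — the camelCase `ProviderResponse` field names follow the bullet above, and the shapes are assumptions:

```typescript
// Build the ProviderResponse the orchestrator would otherwise get from a
// live provider, straight from one transcript line.
interface TranscriptLine {
  output: unknown[];
  token_usage?: { input: number; output: number; cached?: number };
  duration_ms?: number;
  cost_usd?: number | null;
  source: { timestamp?: string };
}

interface ProviderResponse {
  output: unknown[];
  tokenUsage?: { input: number; output: number; cached?: number };
  durationMs?: number;
  costUsd?: number | null;
  startTime?: string;
}

function responseFromTranscript(line: TranscriptLine): ProviderResponse {
  return {
    output: line.output,
    tokenUsage: line.token_usage,
    durationMs: line.duration_ms,
    costUsd: line.cost_usd,
    startTime: line.source.timestamp,
  };
}
```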

Multiple transcripts for comparison:

# Run the same eval three times with different transcripts
agentv eval evals/auth.yaml --transcript .agentv/transcripts/claude-auth.jsonl
agentv eval evals/auth.yaml --transcript .agentv/transcripts/codex-auth.jsonl
agentv eval evals/auth.yaml --transcript .agentv/transcripts/copilot-auth.jsonl
# Each produces a separate result run, then compare

File placement in monorepo

packages/core/src/import/           # NEW — parser logic (reusable by CLI and SDK)
  claude-parser.ts                  # Parse Claude Code session JSONL → Message[]
  codex-parser.ts                   # Parse Codex CLI rollout JSONL → Message[]
  copilot-parser.ts                 # Wraps existing copilot-log-parser.ts
  session-discovery.ts              # Unified session discovery (find latest, by id, by project)
  types.ts                          # TranscriptEntry interface, ImportConfig
  index.ts                          # Public API

apps/cli/src/commands/import/       # NEW — CLI command
  index.ts                          # Subcommand registration (claude, codex, copilot)
  claude.ts                         # CLI handler for `agentv import claude`
  codex.ts                          # CLI handler for `agentv import codex`
  copilot.ts                        # CLI handler for `agentv import copilot`

apps/cli/src/commands/inspect/      # RENAMED from trace/
  index.ts                          # (rename trace → inspect)
  show.ts
  list.ts
  stats.ts
  score.ts

Parsers in packages/core so they're usable from both CLI and the programmatic SDK (evaluate() API).

Default output naming

When --output is omitted:

.agentv/transcripts/<source>-<session-id-short>.jsonl

# Examples:
.agentv/transcripts/claude-0763061e.jsonl
.agentv/transcripts/codex-rollout-2026-03-29T14-22-01.jsonl
.agentv/transcripts/copilot-abc123.jsonl

The directory .agentv/transcripts/ is created automatically if it doesn't exist.
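The claude/copilot naming rule can be sketched as below. The short-id rule (first UUID segment) is inferred from the examples above and may need adjusting; Codex uses the rollout filename's timestamp instead, which is omitted here:

```typescript
// Default transcript path when --output is omitted (claude/copilot case).
// Short-id rule (first hyphen-delimited segment) inferred from the examples.
function defaultTranscriptPath(source: string, sessionId: string): string {
  const shortId = sessionId.split("-")[0]; // "0763061e-9c91-..." -> "0763061e"
  return `.agentv/transcripts/${source}-${shortId}.jsonl`;
}
```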

Cost calculation from token counts

Claude Code sessions store token counts but not dollar costs. The importer should:

  1. Extract usage.input_tokens, usage.output_tokens, usage.cache_creation_input_tokens, usage.cache_read_input_tokens from assistant messages
  2. Sum across all messages in the session
  3. Set cost_usd: null (not computed) — cost calculation requires model pricing tables which change frequently and are out of scope
  4. The cost evaluator will work if the user provides a cost_usd override in the eval YAML, or will skip/fail gracefully if null

Same approach for Codex CLI (has usage.input_tokens / usage.output_tokens in turn.completed events).
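Steps 1–2 are a fold over the assistant messages' `usage` blocks. A sketch, assuming the Claude Code field names listed above and assuming the transcript's `cached` bucket combines cache-creation and cache-read tokens (an interpretation, not confirmed by the format):

```typescript
// Aggregate token usage across a session's assistant messages.
// cost_usd stays null by design (step 3 above).
interface Usage {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
}

function sumUsage(usages: Usage[]) {
  return usages.reduce(
    (acc, u) => ({
      input: acc.input + u.input_tokens,
      output: acc.output + u.output_tokens,
      cached:
        acc.cached +
        (u.cache_creation_input_tokens ?? 0) +
        (u.cache_read_input_tokens ?? 0),
    }),
    { input: 0, output: 0, cached: 0 },
  );
}
```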

Claude Code session parsing specifics

Event type mapping:

| Session event type | Transcript handling |
|---|---|
| `type: "user"` | `Message { role: "user", content }` |
| `type: "assistant"` | `Message { role: "assistant", content, toolCalls }` — extract `tool_use`/`tool_result` from the content array |
| `type: "progress"` | Skip — hook/status events, not conversation |
| `type: "system"` | Skip — API errors, turn metrics (extract duration from here) |
| `type: "file-history-snapshot"` | Skip — file backup tracking |

Subagent sessions ({uuid}/subagents/agent-{id}.jsonl): skip for v1. Import only the main session. Subagent import can be added later.

Token usage: aggregate usage blocks from all type: "assistant" messages. Each has { input_tokens, output_tokens, cache_creation_input_tokens, cache_read_input_tokens }.

Duration: use timestamp delta between first and last event in the session.
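The keep/skip mapping and the duration rule can be sketched together. The event shape below is simplified to the fields the table mentions; a real parser would also unpack `tool_use`/`tool_result` blocks from the content array:

```typescript
// Filter a Claude Code session's events down to conversation messages and
// compute session duration from the first/last event timestamps.
interface SessionEvent {
  type: string; // "user" | "assistant" | "progress" | "system" | ...
  timestamp: string; // ISO 8601
  message?: { role: string; content: unknown };
}

function parseClaudeSession(events: SessionEvent[]) {
  const kept = events.filter((e) => e.type === "user" || e.type === "assistant");
  const messages = kept.map((e) => e.message);
  const duration_ms =
    events.length > 0
      ? Date.parse(events[events.length - 1].timestamp) -
        Date.parse(events[0].timestamp)
      : 0;
  return { messages, duration_ms };
}
```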

Codex CLI session parsing specifics

| Rollout item type | Transcript handling |
|---|---|
| `ResponseItem` | `Message[]` — map user messages, assistant messages, tool outputs |
| `EventMsg` (`TurnStarted`/`TurnComplete`) | Extract turn timing + token usage from `TurnComplete` |
| `TurnContext` | Extract model name, CWD, policies |
| `SessionMeta` | Extract `thread_id`, git info, CLI version → source metadata |
| `Compacted` | Skip — compaction metadata |

Tool calls: extract from ExecCommandEnd events within EventMsg. Note stdout/stderr are sanitized (cleared) — only aggregated_output is available.
