# feat: transcript import pipeline — grade existing Claude/Codex/Copilot sessions offline (#872)

## Objective
Add an agentv import command that reads existing AI coding sessions from Claude Code, Codex CLI, and Copilot CLI, normalizes them into a tool-agnostic transcript format, and feeds them into the existing evaluator pipeline for grading. This enables comparing different clients and workspace setups from manually-run sessions without re-executing anything.
Bundled: rename `agentv trace` → `agentv inspect` to free "trace" for its industry-standard meaning (OTel spans) and avoid a terminology collision.
## Motivation
- Teams run the same task in Claude Code, Codex CLI, and Copilot CLI to compare quality — today there's no way to grade those sessions without re-running them
- Workspace setup experiments (different CLAUDE.md files, system prompts, skills) produce sessions that should be gradeable after the fact
- Grader iteration: import once, re-grade many times with different evaluators without burning API tokens
- Industry alignment: every major eval framework (DeepEval, Braintrust, LangSmith) separates data import from evaluation
## Design

### Architecture: two-step pipeline

Industry best practice separates import from evaluation: the data carries the output, not the provider.

1. `agentv import <source>` → transcript JSONL (`.agentv/transcripts/`)
2. `agentv eval <eval.yaml> --transcript <file>` → graded results (`.agentv/results/`)
Why not a provider modifier (e.g., `provider: claude-cli` plus a `session:` block)?

- It makes providers do two unrelated things (spawn a live agent vs parse static files)
- Every agent provider would need a transcript parser coupled to it
- Imported data couldn't be cheaply reused across multiple eval runs
- It's not how any major eval framework works (DeepEval uses pre-populated test cases, Braintrust uses dataset-from-traces, Promptfoo uses an echo provider)
## Terminology

| Term | Meaning | Context |
|---|---|---|
| session | Raw tool-specific file on disk | `~/.claude/.../*.jsonl`, `~/.codex/.../*.jsonl`, `~/.copilot/.../*.jsonl` |
| transcript | Normalized, tool-agnostic AgentV representation | Output of `agentv import`, input to `agentv eval --transcript` |
| result | Graded output from `agentv eval` | Existing concept, unchanged |
| trace | OTel spans for observability backends | Reserved for OTel export only |
| dataset | Eval test grouping in studio | Existing concept, unchanged |
## Session file locations (input)

| Tool | Path | Format |
|---|---|---|
| Claude Code | `~/.claude/projects/<encoded-path>/<uuid>.jsonl` | JSONL — `type` field: `user`, `assistant`, `progress`, `system` |
| Codex CLI | `~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl` | JSONL — `RolloutItem`: `ResponseItem`, `EventMsg`, `TurnContext`, `SessionMeta` |
| Copilot CLI | `~/.copilot/session-state/{id}/events.jsonl` | JSONL — events: `session.start`, `user.message`, `assistant.message`, `tool.execution_*` |
## Transcript JSONL format (output of import)

Each line is a self-contained test case with a pre-populated output:

```json
{
  "input": "Add JWT authentication to the Express API",
  "output": [
    {
      "role": "user",
      "content": "Add JWT authentication to the Express API"
    },
    {
      "role": "assistant",
      "content": "I'll add JWT auth...",
      "tool_calls": [
        {"tool": "Read", "input": {"file_path": "/src/index.ts"}, "output": "...", "duration_ms": 45},
        {"tool": "Edit", "input": {"file_path": "/src/auth.ts", "old_string": "...", "new_string": "..."}, "duration_ms": 120}
      ]
    }
  ],
  "token_usage": {"input": 15000, "output": 3200, "cached": 8000},
  "duration_ms": 45000,
  "cost_usd": 0.12,
  "source": {
    "provider": "claude-cli",
    "session_id": "0763061e-9c91-4dee-9373-3bd69c817fd4",
    "model": "claude-opus-4-6",
    "version": "2.1.62",
    "timestamp": "2026-03-29T21:10:11.969Z",
    "git_branch": "main",
    "cwd": "/home/user/projects/myapp"
  }
}
```

The `output` field uses the existing `Message[]` schema from `packages/core/src/evaluation/providers/types.ts` (snake_case wire format).
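As a type-level sketch, the entry shape implied by the JSON example might look like the following. The interface and field names here are assumptions mirroring the example (the plan places the real `TranscriptEntry` in `packages/core/src/import/types.ts`):

```typescript
// Hypothetical sketch of TranscriptEntry; field names mirror the JSON
// example above and are not the authoritative definition.
interface TranscriptToolCall {
  tool: string;
  input: Record<string, unknown>;
  output?: string;
  duration_ms?: number;
}

interface TranscriptMessage {
  role: "user" | "assistant";
  content: string;
  tool_calls?: TranscriptToolCall[];
}

interface TranscriptSource {
  provider: string;
  session_id: string;
  model?: string;
  version?: string;
  timestamp?: string;
  git_branch?: string;
  cwd?: string;
}

interface TranscriptEntry {
  input: string;
  output: TranscriptMessage[];
  token_usage?: { input: number; output: number; cached?: number };
  duration_ms?: number;
  cost_usd?: number | null;
  source: TranscriptSource;
}

// Minimal structural check a reader could apply to each parsed JSONL line.
function isTranscriptEntry(value: unknown): value is TranscriptEntry {
  const v = value as Partial<TranscriptEntry> | null;
  return (
    typeof v === "object" && v !== null &&
    typeof v.input === "string" &&
    Array.isArray(v.output) &&
    typeof v.source === "object" && v.source !== null &&
    typeof (v.source as TranscriptSource).provider === "string" &&
    typeof (v.source as TranscriptSource).session_id === "string"
  );
}
```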
## CLI: `agentv import`

```bash
# Import a specific Claude Code session
agentv import claude \
  --session-id 0763061e-9c91-4dee-9373-3bd69c817fd4 \
  --output .agentv/transcripts/claude-auth.jsonl

# Import the latest Claude Code session for a project
agentv import claude \
  --discover latest \
  --project-path /home/user/projects/myapp \
  --output .agentv/transcripts/claude-auth.jsonl

# Import the latest Codex CLI session
agentv import codex \
  --discover latest \
  --output .agentv/transcripts/codex-auth.jsonl

# Import a specific Codex session by date
agentv import codex \
  --date 2026-03-29 \
  --discover latest \
  --output .agentv/transcripts/codex-auth.jsonl

# Import a Copilot CLI session
agentv import copilot \
  --session-id abc-123 \
  --output .agentv/transcripts/copilot-auth.jsonl
```

Default output directory: `.agentv/transcripts/`
## Evaluation with transcripts

```bash
# Grade a transcript against an eval
agentv eval evals/compare-clients.yaml --transcript .agentv/transcripts/claude-auth.jsonl
agentv eval evals/compare-clients.yaml --transcript .agentv/transcripts/codex-auth.jsonl
agentv eval evals/compare-clients.yaml --transcript .agentv/transcripts/copilot-auth.jsonl

# Compare graded results
agentv compare \
  .agentv/results/runs/claude-auth-* \
  .agentv/results/runs/codex-auth-* \
  .agentv/results/runs/copilot-auth-*
```

Example eval YAML:

```yaml
description: Compare client implementations of auth feature
tests:
  - input: "Add JWT authentication to the Express API"
    assert:
      - type: llm-grader
        prompt: ./graders/auth-quality.md
      - type: tool-trajectory
        value: ["Read", "Edit", "Bash"]
      - type: code-grader
        value: ./graders/check-jwt-files.ts
      - type: cost
        budget: 0.50
      - type: execution-metrics
        max_tool_calls: 30
```

When `--transcript` is provided, the orchestrator skips provider invocation and uses the pre-populated `output` from the transcript as the `ProviderResponse`. Evaluators run identically to a live eval.
## Bundled rename: `agentv trace` → `agentv inspect`

The current `agentv trace` subcommands are all result-inspection tools, not OTel trace operations:

| Current | Renamed | Purpose |
|---|---|---|
| `agentv trace list` | `agentv inspect list` | List result files |
| `agentv trace show` | `agentv inspect show` | Show result details with execution tree |
| `agentv trace stats` | `agentv inspect stats` | Compute percentile stats |
| `agentv trace score` | `agentv inspect score` | Re-run evaluators post-hoc |

Keep `agentv trace` as a deprecated alias for one release cycle.
## Existing code to build on

| File | Relevance |
|---|---|
| `packages/core/src/evaluation/providers/copilot-log.ts` | Existing transcript reader — same pattern; migrate the parser into the importer |
| `packages/core/src/evaluation/providers/copilot-log-parser.ts` | Copilot JSONL parsing logic — reusable |
| `packages/core/src/evaluation/providers/copilot-session-discovery.ts` | Session discovery logic — reusable pattern for Claude/Codex |
| `packages/core/src/evaluation/providers/types.ts` | `Message[]`, `ToolCall`, `ProviderResponse` — the target format |
| `packages/core/src/evaluation/providers/cli.ts` | Generic replay via external scripts — reference pattern |
| `packages/core/src/evaluation/orchestrator.ts` | Needs the `--transcript` bypass (skip provider, use pre-populated output) |
| `apps/cli/src/commands/trace/` | All four subcommands to rename |
| `examples/showcase/offline-grader-benchmark/` | Existing offline grading example — validates the approach |
| `examples/features/copilot-log-eval/` | Existing Copilot transcript eval example |
## Implementation order

1. Claude importer (`agentv import claude`) — highest value, primary tool, sessions already on disk
2. Orchestrator `--transcript` flag — skip provider invocation, use pre-populated output
3. Codex importer (`agentv import codex`) — enables cross-client comparison
4. `agentv trace` → `agentv inspect` rename — with deprecated alias
5. Copilot importer (`agentv import copilot`) — migrate existing `copilot-log` parser logic
## Acceptance signals

- [ ] `agentv import claude --session-id <uuid>` produces valid transcript JSONL
- [ ] `agentv import claude --discover latest --project-path <path>` auto-discovers sessions
- [ ] `agentv import codex --discover latest` produces valid transcript JSONL
- [ ] `agentv import copilot --session-id <uuid>` produces valid transcript JSONL
- [ ] `agentv eval <file> --transcript <path>` grades pre-populated transcripts without provider invocation
- [ ] All existing evaluators (llm-grader, code-grader, rubrics, tool-trajectory, execution-metrics, cost, latency) work on transcripts
- [ ] `agentv compare` works across results from different transcript sources
- [ ] `agentv inspect list/show/stats/score` work identically to the current `agentv trace` subcommands
- [ ] `agentv trace` still works as a deprecated alias
- [ ] Transcript JSONL includes `source` metadata (provider, session_id, model, timestamp, git_branch)
- [ ] Token usage, cost, and duration are preserved from session data
## Non-goals

- Real-time session capture (sessions are read post-hoc from disk)
- Cursor/Windsurf/Aider support (can be added later as additional importers)
- Modifying the `copilot-log` provider (it keeps working; deprecate later)
- Session-to-test-case auto-matching (the user specifies which test a session corresponds to)
- Workspace state verification (git diff capture from sessions — future enhancement)
## Industry research

Research across six eval frameworks confirms this architecture:
| Framework | Pattern | How offline eval works |
|---|---|---|
| DeepEval | Data-driven | Pre-populated actual_output on LLMTestCase |
| Braintrust | Unified traces | Production traces converted to datasets, same evaluators |
| LangSmith | Dual-context | Datasets from traces, aevaluate() on pre-recorded data |
| Promptfoo | Echo provider | Separate echo provider returns cached data through provider interface |
| RunLedger | Cassette replay | JSONL cassettes replayed deterministically in CI |
| WTG Agent Evaluator | Event-stream-first | aeval chats to-tests converts sessions to eval test sets |
All separate import/transform from evaluation. None make the live provider double as a file reader.
Additional tools that parse these session formats:
- code-insights — parses Claude, Codex, Copilot, Cursor sessions into unified SQLite
- Agent Trace — emerging standard for AI code attribution (Cursor, Vercel, Google Jules)
- entireio/cli — git-integrated session capture with checkpoint/rewind
## Design clarifications

These resolve ambiguities an implementing agent would otherwise need to ask about.
### How transcripts map to eval test cases

A transcript is not matched to eval tests by input string. Instead, `--transcript` provides a pre-populated `ProviderResponse` that replaces provider invocation entirely:

- The eval YAML still defines tests with `input` and `assert` — these provide the grading criteria
- The transcript provides the `output` (`Message[]`) that would normally come from a live provider
- Matching is positional: transcript line 1 → test 1, transcript line 2 → test 2
- If the eval has 1 test and the transcript has 1 line, they pair 1:1 (the most common case)
- If the counts don't match, fail with a clear error message
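The positional pairing rule above can be sketched as follows; `EvalTest` and `TranscriptEntry` are simplified stand-ins, not the real orchestrator types:

```typescript
// Hypothetical simplified types; the real ones live in packages/core.
interface EvalTest { input: string }
interface TranscriptEntry { input: string; output: unknown[] }

function pairTestsWithTranscript(
  tests: EvalTest[],
  transcript: TranscriptEntry[],
): Array<[EvalTest, TranscriptEntry]> {
  // Counts must match exactly; anything else is a user error.
  if (tests.length !== transcript.length) {
    throw new Error(
      `Transcript has ${transcript.length} line(s) but the eval defines ` +
      `${tests.length} test(s); counts must match for positional pairing.`,
    );
  }
  // Positional: transcript line i grades test i.
  return tests.map((test, i) => [test, transcript[i]]);
}
```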
This means a typical workflow is:

1. Run a task manually in Claude Code
2. Import the session → produces a 1-line transcript (the whole session is one entry)
3. Write an eval YAML with 1 test that has the same input plus the assertions you want
4. `agentv eval <eval.yaml> --transcript <transcript.jsonl>`
### Multi-turn sessions → single transcript entry

A session with multiple user messages is one transcript line (one test case). The entire conversation is captured in the `output: Message[]` array. This matches how agent evals work: one task = one session, even if it spans many turns.

The `input` field in the transcript captures the first user message (the initial task). All subsequent messages (follow-ups, clarifications) are part of the `output` conversation.
### Orchestrator `--transcript` bypass mechanics

When `--transcript` is provided:

- `targets:` in the eval YAML is ignored (no provider is invoked)
- The `target` field in the result JSONL is set to `${source.provider}` from the transcript (e.g., `claude-cli`)
- `trials.count` is forced to 1 (replaying the same transcript multiple times is meaningless)
- `workspace_template` is ignored (no workspace is created)
- The orchestrator constructs a `ProviderResponse` from the transcript: `{ output: line.output, tokenUsage: line.token_usage, durationMs: line.duration_ms, costUsd: line.cost_usd, startTime: line.source.timestamp }`
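The snake_case → camelCase mapping in the last bullet can be sketched like this; both shapes are simplified assumptions (the real `ProviderResponse` lives in `packages/core/src/evaluation/providers/types.ts`):

```typescript
// Simplified stand-ins for the wire-format transcript line and the
// in-memory ProviderResponse; not the authoritative definitions.
interface TranscriptLine {
  output: unknown[];
  token_usage?: { input: number; output: number; cached?: number };
  duration_ms?: number;
  cost_usd?: number | null;
  source: { provider: string; timestamp?: string };
}

interface ProviderResponseLike {
  output: unknown[];
  tokenUsage?: { input: number; output: number; cached?: number };
  durationMs?: number;
  costUsd?: number | null;
  startTime?: string;
}

// The bypass: a pure field mapping, no provider process involved.
function responseFromTranscript(line: TranscriptLine): ProviderResponseLike {
  return {
    output: line.output,
    tokenUsage: line.token_usage,
    durationMs: line.duration_ms,
    costUsd: line.cost_usd,
    startTime: line.source.timestamp,
  };
}
```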
Multiple transcripts for comparison:

```bash
# Run the same eval three times with different transcripts
agentv eval evals/auth.yaml --transcript .agentv/transcripts/claude-auth.jsonl
agentv eval evals/auth.yaml --transcript .agentv/transcripts/codex-auth.jsonl
agentv eval evals/auth.yaml --transcript .agentv/transcripts/copilot-auth.jsonl
# Each produces a separate result run, then compare
```

### File placement in monorepo
```
packages/core/src/import/            # NEW — parser logic (reusable by CLI and SDK)
  claude-parser.ts                   # Parse Claude Code session JSONL → Message[]
  codex-parser.ts                    # Parse Codex CLI rollout JSONL → Message[]
  copilot-parser.ts                  # Wraps existing copilot-log-parser.ts
  session-discovery.ts               # Unified session discovery (find latest, by id, by project)
  types.ts                           # TranscriptEntry interface, ImportConfig
  index.ts                           # Public API

apps/cli/src/commands/import/        # NEW — CLI command
  index.ts                           # Subcommand registration (claude, codex, copilot)
  claude.ts                          # CLI handler for `agentv import claude`
  codex.ts                           # CLI handler for `agentv import codex`
  copilot.ts                         # CLI handler for `agentv import copilot`

apps/cli/src/commands/inspect/       # RENAMED from trace/
  index.ts                           # (rename trace → inspect)
  show.ts
  list.ts
  stats.ts
  score.ts
```

Parsers live in `packages/core` so they're usable from both the CLI and the programmatic SDK (`evaluate()` API).
### Default output naming

When `--output` is omitted:

```
.agentv/transcripts/<source>-<session-id-short>.jsonl

# Examples:
.agentv/transcripts/claude-0763061e.jsonl
.agentv/transcripts/codex-rollout-2026-03-29T14-22-01.jsonl
.agentv/transcripts/copilot-abc123.jsonl
```

The directory `.agentv/transcripts/` is created automatically if it doesn't exist.
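One possible reading of the naming rule, consistent with the examples above: UUID-style ids are shortened to their first segment, while other ids (e.g. Codex rollout timestamps) are kept whole. This interpretation is an assumption, not specified behavior:

```typescript
// Hypothetical default-naming helper; the shortening rule is inferred
// from the claude-0763061e / copilot-abc123 examples.
function defaultTranscriptPath(source: string, sessionId: string): string {
  const uuidLike = /^[0-9a-f]{8}-[0-9a-f]{4}/i.test(sessionId);
  const shortId = uuidLike ? sessionId.slice(0, 8) : sessionId;
  return `.agentv/transcripts/${source}-${shortId}.jsonl`;
}
```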
### Cost calculation from token counts

Claude Code sessions store token counts but not dollar costs. The importer should:

- Extract `usage.input_tokens`, `usage.output_tokens`, `usage.cache_creation_input_tokens`, and `usage.cache_read_input_tokens` from assistant messages
- Sum them across all messages in the session
- Set `cost_usd: null` (not computed) — cost calculation requires model pricing tables, which change frequently and are out of scope
- The `cost` evaluator will work if the user provides a `cost_usd` override in the eval YAML, or will skip/fail gracefully if null

The same approach applies to Codex CLI (which has `usage.input_tokens` / `usage.output_tokens` in `turn.completed` events).
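The summation step can be sketched as below. Folding both cache-creation and cache-read tokens into a single `cached` total is an assumption matching the `token_usage.cached` field in the transcript example; the importer may want to split them:

```typescript
// Usage block shape from Claude Code assistant messages, per the plan above.
interface UsageBlock {
  input_tokens?: number;
  output_tokens?: number;
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
}

// Sum usage blocks across all assistant messages in a session.
function aggregateUsage(
  blocks: UsageBlock[],
): { input: number; output: number; cached: number } {
  return blocks.reduce(
    (acc, u) => ({
      input: acc.input + (u.input_tokens ?? 0),
      output: acc.output + (u.output_tokens ?? 0),
      // Assumption: cache-creation and cache-read both count as "cached".
      cached:
        acc.cached +
        (u.cache_creation_input_tokens ?? 0) +
        (u.cache_read_input_tokens ?? 0),
    }),
    { input: 0, output: 0, cached: 0 },
  );
}
```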
### Claude Code session parsing specifics

Event type mapping:

| Session event type | Transcript handling |
|---|---|
| `type: "user"` | → `Message { role: "user", content }` |
| `type: "assistant"` | → `Message { role: "assistant", content, toolCalls }` — extract `tool_use`/`tool_result` from the content array |
| `type: "progress"` | Skip — hook/status events, not conversation |
| `type: "system"` | Skip — API errors and turn metrics (extract duration from here) |
| `type: "file-history-snapshot"` | Skip — file backup tracking |

Subagent sessions (`{uuid}/subagents/agent-{id}.jsonl`): skip for v1. Import only the main session; subagent import can be added later.

Token usage: aggregate `usage` blocks from all `type: "assistant"` messages. Each has `{ input_tokens, output_tokens, cache_creation_input_tokens, cache_read_input_tokens }`.

Duration: use the timestamp delta between the first and last event in the session.
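The per-line dispatch and duration rules above can be sketched as follows; the `SessionEvent` shape is a simplified assumption about the Claude Code JSONL, and tool-call extraction is omitted:

```typescript
// Simplified stand-in for a Claude Code session JSONL line.
interface SessionEvent {
  type: string;
  message?: { content: unknown };
  timestamp?: string;
}

interface SimpleMessage { role: "user" | "assistant"; content: unknown }

// Keep only user/assistant events; drop progress, system,
// and file-history-snapshot lines per the mapping table.
function toMessages(events: SessionEvent[]): SimpleMessage[] {
  const out: SimpleMessage[] = [];
  for (const ev of events) {
    if (ev.type !== "user" && ev.type !== "assistant") continue;
    if (!ev.message) continue;
    out.push({ role: ev.type, content: ev.message.content });
  }
  return out;
}

// Duration: delta between the first and last timestamped events.
function sessionDurationMs(events: SessionEvent[]): number | undefined {
  const stamps = events
    .map((e) => e.timestamp)
    .filter((t): t is string => typeof t === "string");
  if (stamps.length < 2) return undefined;
  return Date.parse(stamps[stamps.length - 1]) - Date.parse(stamps[0]);
}
```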
### Codex CLI session parsing specifics

| Rollout item type | Transcript handling |
|---|---|
| `ResponseItem` | → `Message[]` — map user messages, assistant messages, tool outputs |
| `EventMsg` (`TurnStarted`/`TurnComplete`) | Extract turn timing and token usage from `TurnComplete` |
| `TurnContext` | Extract model name, CWD, policies |
| `SessionMeta` | Extract thread_id, git info, CLI version → `source` metadata |
| `Compacted` | Skip — compaction metadata |

Tool calls: extract from `ExecCommandEnd` events within `EventMsg`. Note that stdout/stderr are sanitized (cleared) — only `aggregated_output` is available.