diff --git a/README.md b/README.md
index bb10c1270..e74436061 100644
--- a/README.md
+++ b/README.md
@@ -1,314 +1,90 @@
 # AgentV
 
-**CLI-first AI agent evaluation. No server. No signup. No overhead.**
+**Evaluate AI agents from the terminal. No server. No signup.**
 
-AgentV evaluates your agents locally with multi-objective scoring (correctness, latency, cost, safety) from YAML specifications. Deterministic code graders + customizable LLM graders, all version-controlled in Git.
-
-## Installation
-
-### All Agents Plugin Manager
-
-**1. Add AgentV marketplace source:**
-```bash
-npx allagents plugin marketplace add EntityProcess/agentv
-```
-
-**2. Ask Claude to set up AgentV in your current repository**
-Example prompt:
-```text
-Set up AgentV in this repo.
-```
-
-The `agentv-onboarding` skill bootstraps setup automatically:
-- verifies `agentv` CLI availability
-- installs the CLI if needed
-- runs `agentv init`
-- verifies setup artifacts
-
-### CLI-Only Setup (Fallback)
-
-If you are not using Claude plugins, use the CLI directly.
-
-**1. Install:**
-```bash
-bun install -g agentv
-```
-
-Or with npm:
 ```bash
 npm install -g agentv
-```
-
-**2. Initialize your workspace:**
-```bash
 agentv init
+agentv eval evals/example.yaml
 ```
 
-**3. Configure environment variables:**
-- The init command creates a `.env.example` file in your project root
-- Copy `.env.example` to `.env` and fill in your API keys, endpoints, and other configuration values
-- Update the environment variable names in `.agentv/targets.yaml` to match those defined in your `.env` file
+That's it. Results in seconds, not minutes.
 
-**4. Create an eval** (`./evals/example.yaml`):
-```yaml
-description: Math problem solving evaluation
-execution:
-  target: default
+## What it does
 
+AgentV runs evaluation cases against your AI agents and scores them with deterministic code graders + customizable LLM graders. Everything lives in Git — YAML eval files, markdown judge prompts, JSONL results.
+
+```yaml
+# evals/math.yaml
+description: Math problem solving
 tests:
   - id: addition
-    criteria: Correctly calculates 15 + 27 = 42
-
     input: What is 15 + 27?
-
     expected_output: "42"
-
     assertions:
-      - name: math_check
-        type: code-grader
-        command: ./validators/check_math.py
+      - type: contains
+        value: "42"
 ```
 
-**5. Run the eval:**
 ```bash
-agentv eval ./evals/example.yaml
+agentv eval evals/math.yaml
 ```
 
-Results appear in `.agentv/results/eval_<timestamp>.jsonl` with scores, reasoning, and execution traces.
-
-Learn more in the [examples/](examples/README.md) directory. For a detailed comparison with other frameworks, see [docs/COMPARISON.md](docs/COMPARISON.md).
-
 ## Why AgentV?
 
-| Feature | AgentV | [LangWatch](https://github.com/langwatch/langwatch) | [LangSmith](https://github.com/langchain-ai/langsmith-sdk) | [LangFuse](https://github.com/langfuse/langfuse) |
-|---------|--------|-----------|-----------|----------|
-| **Setup** | `bun install -g agentv` | Cloud account + API key | Cloud account + API key | Cloud account + API key |
-| **Server** | None (local) | Managed cloud | Managed cloud | Managed cloud |
-| **Privacy** | All local | Cloud-hosted | Cloud-hosted | Cloud-hosted |
-| **CLI-first** | ✓ | ✗ | Limited | Limited |
-| **CI/CD ready** | ✓ | Requires API calls | Requires API calls | Requires API calls |
-| **Version control** | ✓ (YAML in Git) | ✗ | ✗ | ✗ |
-| **Evaluators** | Code + LLM + Custom | LLM only | LLM + Code | LLM only |
-
-**Best for:** Developers who want evaluation in their workflow, not a separate dashboard. Teams prioritizing privacy and reproducibility.
-
-## Features
+- **Local-first** — runs on your machine, no cloud accounts or API keys for eval infrastructure
+- **Version-controlled** — evals, judges, and results all live in Git
+- **Hybrid graders** — deterministic code checks + LLM-based subjective scoring
+- **CI/CD native** — exit codes, JSONL output, threshold flags for pipeline gating
+- **Any agent** — supports Claude, Codex, Copilot, VS Code, Pi, Azure OpenAI, or any CLI agent
 
-- **Multi-objective scoring**: Correctness, latency, cost, safety in one run
-- **Multiple evaluator types**: Code validators, LLM graders, custom Python/TypeScript
-- **Built-in targets**: VS Code Copilot, Codex CLI, Pi Coding Agent, Azure OpenAI, local CLI agents
-- **Structured evaluation**: Rubric-based grading with weights and requirements
-- **Batch evaluation**: Run hundreds of test cases in parallel
-- **Export**: JSON, JSONL, YAML formats
-- **Compare results**: Compute deltas between evaluation runs for A/B testing
-
-## Development
-
-Contributing to AgentV? Clone and set up the repository:
+## Quick start
 
+**1. Install and initialize:**
 ```bash
-git clone https://github.com/EntityProcess/agentv.git
-cd agentv
-
-# Install Bun if you don't have it
-curl -fsSL https://bun.sh/install | bash
-
-# Install dependencies and build
-bun install && bun run build
-
-# Run tests
-bun test
-```
-
-See [AGENTS.md](AGENTS.md) for development guidelines and design principles.
-
-### Releasing
-
-Version bump:
-
-```bash
-bun run release          # patch bump
-bun run release minor
-bun run release major
-```
-
-Canary rollout (recommended):
-
-```bash
-bun run publish:next         # publish current version to npm `next`
-bun run promote:latest       # promote same version to npm `latest`
-bun run tag:next 2.18.0      # point npm `next` to an explicit version
-bun run promote:latest 2.18.0 # point npm `latest` to an explicit version
-```
-
-Legacy prerelease flow (still available):
-
-```bash
-bun run release:next         # bump/increment `-next.N`
-bun run release:next major   # start new major prerelease line
+npm install -g agentv
+agentv init
 ```
 
-## Core Concepts
-
-**Evaluation files** (`.yaml` or `.jsonl`) define test cases with expected outcomes. **Targets** specify which agent/provider to evaluate. **Graders** (code or LLM) score results. **Results** are written as JSONL/YAML for analysis and comparison.
+**2. Configure targets** in `.agentv/targets.yaml` — point to your agent or LLM provider.
 
-### JSONL Format Support
-
-For large-scale evaluations, AgentV supports JSONL (JSON Lines) format as an alternative to YAML:
-
-```jsonl
-{"id": "test-1", "criteria": "Calculates correctly", "input": "What is 2+2?"}
-{"id": "test-2", "criteria": "Provides explanation", "input": "Explain variables"}
-```
-
-Optional sidecar YAML metadata file (`dataset.eval.yaml` alongside `dataset.jsonl`):
+**3. Create an eval** in `evals/`:
 ```yaml
-description: Math evaluation dataset
-name: math-tests
-execution:
-  target: azure-llm
-assertions:
-  - name: correctness
-    type: llm-grader
-    prompt: ./graders/correctness.md
+description: Code generation quality
+tests:
+  - id: fizzbuzz
+    criteria: Write a correct FizzBuzz implementation
+    input: Write FizzBuzz in Python
+    assertions:
+      - type: contains
+        value: "fizz"
+      - type: code-grader
+        command: ./validators/check_syntax.py
+      - type: llm-grader
+        prompt: ./graders/correctness.md
 ```
 
-Benefits: Streaming-friendly, Git-friendly diffs, programmatic generation, industry standard (DeepEval, LangWatch, Hugging Face).
-
-## Usage
-
-### Running Evaluations
-
+**4. Run it:**
 ```bash
-# Validate evals
-agentv validate evals/my-eval.yaml
-
-# Run an eval with default target (from eval file or targets.yaml)
 agentv eval evals/my-eval.yaml
-
-# Override target
-agentv eval --target azure-llm evals/**/*.yaml
-
-# Run specific test
-agentv eval --test-id case-123 evals/my-eval.yaml
-
-# Dry-run with mock provider
-agentv eval --dry-run evals/my-eval.yaml
 ```
 
-See `agentv eval --help` for all options: workers, timeouts, output formats, trace dumping, and more.
-
-#### Output Formats
-
-Write results to different formats using the `-o` flag (format auto-detected from extension):
-
+**5. Compare results across targets:**
 ```bash
-# Default run workspace (index.jsonl + benchmark/timing/per-test artifacts)
-agentv eval evals/my-eval.yaml
-
-# Self-contained HTML dashboard (opens in any browser, no server needed)
-agentv eval evals/my-eval.yaml -o report.html
-
-# Explicit JSONL output
-agentv eval evals/my-eval.yaml -o output.jsonl
-
-# Multiple formats simultaneously
-agentv eval evals/my-eval.yaml -o report.html
-
-# JUnit XML for CI/CD integration
-agentv eval evals/my-eval.yaml -o results.xml
+agentv compare .agentv/results/runs/eval_<timestamp>/index.jsonl
 ```
 
-The HTML report auto-refreshes every 2 seconds during a live run, then locks once the run completes.
-
-By default, `agentv eval` creates a run workspace under `.agentv/results/runs/<run>/`
-with `index.jsonl` as the machine-facing manifest.
-
-You can also convert an existing manifest to HTML after the fact:
+## Output formats
 
 ```bash
-agentv convert .agentv/results/runs/eval_<timestamp>/index.jsonl -o report.html
+agentv eval evals/my-eval.yaml                  # JSONL (default)
+agentv eval evals/my-eval.yaml -o report.html   # HTML dashboard
+agentv eval evals/my-eval.yaml -o results.xml   # JUnit XML for CI
 ```
 
-#### Timeouts
-
-AgentV does not apply a default top-level evaluation timeout. If you want one, set it explicitly
-with `--agent-timeout`, or set `execution.agentTimeoutMs` in your AgentV config to make it the
-default for your local runs.
-
-This top-level timeout is separate from provider- or tool-level timeouts. For example, an upstream
-agent or tool call may still time out even when AgentV's own top-level timeout is unset.
-
-### Create Custom Evaluators
-
-Write code graders in Python or TypeScript:
-
-```python
-# validators/check_answer.py
-import json, sys
-data = json.load(sys.stdin)
-answer = data.get("answer", "")
-
-assertions = []
-
-if "42" in answer:
-    assertions.append({"text": "Answer contains correct value (42)", "passed": True})
-else:
-    assertions.append({"text": "Answer does not contain expected value (42)", "passed": False})
-
-passed = sum(1 for a in assertions if a["passed"])
-score = 1.0 if passed == len(assertions) else 0.0
-
-print(json.dumps({
-    "score": score,
-    "assertions": assertions,
-}))
-```
-
-Reference evaluators in your eval file:
-
-```yaml
-assertions:
-  - name: my_validator
-    type: code-grader
-    command: ./validators/check_answer.py
-```
-
-For complete templates, examples, and evaluator patterns, see: [custom-evaluators](https://agentv.dev/evaluators/custom-evaluators/)
-
-### TypeScript SDK
-
-#### Custom Assertions with `defineAssertion()`
-
-Create custom assertion types in TypeScript using `@agentv/eval`:
-
-```typescript
-// .agentv/assertions/word-count.ts
-import { defineAssertion } from '@agentv/eval';
-
-export default defineAssertion(({ answer }) => {
-  const wordCount = answer.trim().split(/\s+/).length;
-  return {
-    pass: wordCount >= 3,
-    reasoning: `Output has ${wordCount} words`,
-  };
-});
-```
-
-Files in `.agentv/assertions/` are auto-discovered by filename — use directly in YAML:
-
-```yaml
-assertions:
-  - type: word-count    # matches word-count.ts
-  - type: contains
-    value: "Hello"
-```
-
-See the [sdk-custom-assertion example](examples/features/sdk-custom-assertion).
-
-#### Programmatic API with `evaluate()`
+## TypeScript SDK
 
-Use AgentV as a library — no YAML needed:
+Use AgentV programmatically:
 
 ```typescript
 import { evaluate } from '@agentv/core';
@@ -326,278 +102,28 @@ const { results, summary } = await evaluate({
 console.log(`${summary.passed}/${summary.total} passed`);
 ```
 
-Auto-discovers `default` target from `.agentv/targets.yaml` and `.env` credentials. See the [sdk-programmatic-api example](examples/features/sdk-programmatic-api).
+## Documentation
 
-#### Typed Configuration with `defineConfig()`
+Full docs at [agentv.dev/docs](https://agentv.dev/docs/getting-started/introduction/).
 
-Create `agentv.config.ts` at your project root for typed, validated configuration:
+- [Eval files](https://agentv.dev/docs/evaluation/eval-files/) — format and structure
+- [Custom evaluators](https://agentv.dev/docs/evaluators/custom-evaluators/) — code graders in any language
+- [Rubrics](https://agentv.dev/docs/evaluation/rubrics/) — structured criteria scoring
+- [Targets](https://agentv.dev/docs/targets/configuration/) — configure agents and providers
+- [Compare results](https://agentv.dev/docs/tools/compare/) — A/B testing and regression detection
+- [Comparison with other frameworks](https://agentv.dev/docs/reference/comparison/) — vs Braintrust, Langfuse, LangSmith, LangWatch
 
-```typescript
-import { defineConfig } from '@agentv/core';
-
-export default defineConfig({
-  execution: { workers: 5, maxRetries: 2 },
-  output: { format: 'jsonl', dir: './results' },
-  limits: { maxCostUsd: 10.0 },
-});
-```
-
-See the [sdk-config-file example](examples/features/sdk-config-file).
-
-#### Scaffold Commands
-
-Bootstrap new assertions and eval files:
-
-```bash
-agentv create assertion sentiment   # → .agentv/assertions/sentiment.ts
-agentv create eval my-eval          # → evals/my-eval.eval.yaml + .cases.jsonl
-```
-
-### Compare Evaluation Results
-
-Compare a combined results file across all targets (N-way matrix):
-
-```bash
-agentv compare .agentv/results/runs/eval_<timestamp>/index.jsonl
-```
-
-```
-Score Matrix
-
-  Test ID          gemini-3-flash-preview  gpt-4.1  gpt-5-mini
-  ───────────────  ──────────────────────  ───────  ──────────
-  code-generation                    0.70     0.80        0.75
-  greeting                           0.90     0.85        0.95
-  summarization                      0.85     0.90        0.80
-
-Pairwise Summary:
-  gemini-3-flash-preview → gpt-4.1:     1 win, 0 losses, 2 ties  (Δ +0.033)
-  gemini-3-flash-preview → gpt-5-mini:  0 wins, 0 losses, 3 ties  (Δ +0.017)
-  gpt-4.1 → gpt-5-mini:                 0 wins, 0 losses, 3 ties  (Δ -0.017)
-```
-
-Designate a baseline for CI regression gating, or compare two specific targets:
-
-```bash
-agentv compare .agentv/results/runs/eval_<timestamp>/index.jsonl --baseline gpt-4.1
-agentv compare .agentv/results/runs/eval_<timestamp>/index.jsonl --baseline gpt-4.1 --candidate gpt-5-mini
-agentv compare before.jsonl after.jsonl                                  # two-file pairwise
-```
-
-## Targets Configuration
-
-Define execution targets in `.agentv/targets.yaml` to decouple evals from providers:
-
-```yaml
-targets:
-  - name: azure-llm
-    provider: azure
-    endpoint: ${{ AZURE_OPENAI_ENDPOINT }}
-    api_key: ${{ AZURE_OPENAI_API_KEY }}
-    model: ${{ AZURE_DEPLOYMENT_NAME }}
-
-  - name: vscode_dev
-    provider: vscode
-    grader_target: azure-llm
-
-  - name: local_agent
-    provider: cli
-    command: 'python agent.py --prompt-file {PROMPT_FILE} --output {OUTPUT_FILE}'
-    grader_target: azure-llm
-```
-
-Supports: `azure`, `anthropic`, `gemini`, `codex`, `copilot`, `pi-coding-agent`, `claude`, `vscode`, `vscode-insiders`, `cli`, and `mock`.
-
-Workspace templates are configured at eval-level under `workspace.template` (not per-target `workspace_template`).
-
-Use `${{ VARIABLE_NAME }}` syntax to reference your `.env` file. See `.agentv/targets.yaml` after `agentv init` for detailed examples and all provider-specific fields.
-
-## Evaluation Features
-
-### Code Graders
-
-Write validators in any language (Python, TypeScript, Node, etc.):
-
-```bash
-# Input: stdin JSON with question, criteria, answer
-# Output: stdout JSON with score (0-1), hits, misses, reasoning
-```
-
-For complete examples and patterns, see:
-- [custom-evaluators](https://agentv.dev/evaluators/custom-evaluators/)
-- [code-grader-sdk example](examples/features/code-grader-sdk)
-
-### Deterministic Assertions
-
-Built-in assertion types for common text-matching patterns — no LLM grader or code_grader needed:
-
-| Type | Value | Behavior |
-|------|-------|----------|
-| `contains` | `string` | Pass if output includes the substring |
-| `contains_any` | `string[]` | Pass if output includes ANY of the strings |
-| `contains_all` | `string[]` | Pass if output includes ALL of the strings |
-| `icontains` | `string` | Case-insensitive `contains` |
-| `icontains_any` | `string[]` | Case-insensitive `contains_any` |
-| `icontains_all` | `string[]` | Case-insensitive `contains_all` |
-| `starts_with` | `string` | Pass if output starts with value (trimmed) |
-| `ends_with` | `string` | Pass if output ends with value (trimmed) |
-| `regex` | `string` | Pass if output matches regex (optional `flags: "i"`) |
-| `equals` | `string` | Pass if output exactly equals value (trimmed) |
-| `is_json` | — | Pass if output is valid JSON |
-
-All assertions support `weight`, `required`, and `negate` flags. Use `negate: true` to invert (no `not_` prefix needed).
-
-```yaml
-assertions:
-  # Case-insensitive matching for natural language variation
-  - type: icontains-any
-    value: ["missing rule code", "need rule code", "provide rule code"]
-    required: true
-
-  # Multiple required terms
-  - type: icontains-all
-    value: ["country code", "rule codes"]
-
-  # Case-insensitive regex
-  - type: regex
-    value: "[a-z]+@[a-z]+\\.[a-z]+"
-    flags: "i"
-```
-
-See the [assert-extended example](examples/features/assert-extended) for complete patterns.
-
-### Target Configuration: `grader_target`
-
-Agent provider targets (`codex`, `copilot`, `claude`, `vscode`) **must** specify `grader_target` (also accepts `judge_target` for backward compatibility) when using `llm_grader` or `rubrics` evaluators. Without it, AgentV errors at startup — agent providers cannot return structured JSON for grading.
-
-```yaml
-targets:
-  # Agent target — requires grader_target for LLM-based evaluation
-  - name: codex_local
-    provider: codex
-    grader_target: azure-llm  # Required: LLM provider for grading
-
-  # LLM target — no grader_target needed (grades itself)
-  - name: azure-llm
-    provider: azure
-```
-
-### Agentic Eval Patterns
-
-When agents respond via tool calls instead of text, use `tool_trajectory` instead of text assertions:
-
-- **Agent takes workspace actions** (creates files, runs commands) → `tool_trajectory` evaluator
-- **Agent responds in text** (answers questions, asks for info) → `contains`/`icontains_any`/`llm_grader`
-- **Agent does both** → `composite` evaluator combining both
-
-### LLM Graders
-
-Create markdown grader files with evaluation criteria and scoring guidelines:
-
-```yaml
-assertions:
-  - name: semantic_check
-    type: llm-grader
-    prompt: ./graders/correctness.md
-```
-
-Your grader prompt file defines criteria and scoring guidelines.
-
-### Rubric-Based Evaluation
-
-Define structured criteria directly in your test:
-
-```yaml
-tests:
-  - id: quicksort-explain
-    criteria: Explain how quicksort works
-
-    input: Explain quicksort algorithm
-
-    assertions:
-      - type: rubrics
-        criteria:
-          - Mentions divide-and-conquer approach
-          - Explains partition step
-          - States time complexity
-```
-
-Scoring: `(satisfied weights) / (total weights)` → verdicts: `pass` (≥0.8), `borderline` (≥0.6), `fail`
-
-Author assertions directly in your eval file. When you want help choosing between simple assertions, deterministic graders, and LLM-based graders, use the `agentv-eval-writer` skill.
-
-See [rubric evaluator](https://agentv.dev/evaluation/rubrics/) for detailed patterns.
-
-## Advanced Configuration
-
-### Retry Behavior
-
-Configure automatic retry with exponential backoff:
-
-```yaml
-targets:
-  - name: azure-llm
-    provider: azure
-    max_retries: 5
-    retry_initial_delay_ms: 2000
-    retry_max_delay_ms: 120000
-    retry_backoff_factor: 2
-    retry_status_codes: [500, 408, 429, 502, 503, 504]
-```
-
-Automatically retries on rate limits, transient 5xx errors, and network failures with jitter.
-
-## Documentation & Learning
-
-**Getting Started:**
-- Run `agentv init` to set up your first evaluation workspace
-- Check [examples/README.md](examples/README.md) for demos (math, code generation, tool use)
-- AI agents: Ask Claude Code to `/agentv-eval-builder` to create and iterate on evals
-
-**Detailed Guides:**
-- [Evaluation format and structure](https://agentv.dev/evaluation/eval-files/)
-- [Custom evaluators](https://agentv.dev/evaluators/custom-evaluators/)
-- [Rubric evaluator](https://agentv.dev/evaluation/rubrics/)
-- [Composite evaluator](https://agentv.dev/evaluators/composite/)
-- [Tool trajectory evaluator](https://agentv.dev/evaluators/tool-trajectory/)
-- [Structured data evaluators](https://agentv.dev/evaluators/structured-data/)
-- [Batch CLI evaluation](https://agentv.dev/evaluation/batch-cli/)
-- [Compare results](https://agentv.dev/tools/compare/)
-- [Example evaluations](https://agentv.dev/evaluation/examples/)
-
-**Reference:**
-- Monorepo structure: `packages/core/` (engine), `packages/eval/` (evaluation logic), `apps/cli/` (commands)
-
-## Troubleshooting
-
-### `EACCES` permission error on global install (npm)
-
-If you see `EACCES: permission denied` when running `npm install -g agentv`, switch to bun (recommended) or configure npm to use a user-owned directory:
-
-**Option 1 (recommended): Use bun instead**
-```bash
-bun install -g agentv
-```
-
-**Option 2: Fix npm permissions**
-```bash
-mkdir -p ~/.npm-global
-npm config set prefix ~/.npm-global --location=user
-```
-
-Then add the directory to your PATH. For bash (`~/.bashrc`) or zsh (`~/.zshrc`):
+## Development
 
 ```bash
-echo 'export PATH=~/.npm-global/bin:$PATH' >> ~/.bashrc
-source ~/.bashrc
+git clone https://github.com/EntityProcess/agentv.git
+cd agentv
+bun install && bun run build
+bun test
 ```
 
-After this, `npm install -g` will work without `sudo`.
-
-## Contributing
-
-See [AGENTS.md](AGENTS.md) for development guidelines, design principles, and quality assurance workflow.
+See [AGENTS.md](AGENTS.md) for development guidelines.
 
 ## License
 
-MIT License - see [LICENSE](LICENSE) for details.
+MIT
diff --git a/apps/web/astro.config.mjs b/apps/web/astro.config.mjs
index 28fe6f71a..507cc69a5 100644
--- a/apps/web/astro.config.mjs
+++ b/apps/web/astro.config.mjs
@@ -37,13 +37,14 @@ export default defineConfig({
         { icon: 'github', label: 'GitHub', href: 'https://github.com/EntityProcess/agentv' },
       ],
       sidebar: [
-        { label: 'Getting Started', autogenerate: { directory: 'docs/getting-started' } },
-        { label: 'Evaluation', autogenerate: { directory: 'docs/evaluation' } },
-        { label: 'Evaluators', autogenerate: { directory: 'docs/evaluators' } },
-        { label: 'Targets', autogenerate: { directory: 'docs/targets' } },
-        { label: 'Tools', autogenerate: { directory: 'docs/tools' } },
-        { label: 'Guides', autogenerate: { directory: 'docs/guides' } },
-        { label: 'Integrations', autogenerate: { directory: 'docs/integrations' } },
+        { label: 'Getting Started', autogenerate: { directory: 'getting-started' } },
+        { label: 'Evaluation', autogenerate: { directory: 'evaluation' } },
+        { label: 'Evaluators', autogenerate: { directory: 'evaluators' } },
+        { label: 'Targets', autogenerate: { directory: 'targets' } },
+        { label: 'Tools', autogenerate: { directory: 'tools' } },
+        { label: 'Guides', autogenerate: { directory: 'guides' } },
+        { label: 'Integrations', autogenerate: { directory: 'integrations' } },
+        { label: 'Reference', autogenerate: { directory: 'reference' } },
       ],
       editLink: {
         baseUrl: 'https://github.com/EntityProcess/agentv/edit/main/apps/web/',
diff --git a/apps/web/src/components/Lander.astro b/apps/web/src/components/Lander.astro
index 850e0b100..e303bd949 100644
--- a/apps/web/src/components/Lander.astro
+++ b/apps/web/src/components/Lander.astro
@@ -13,7 +13,7 @@ import type { Props } from '@astrojs/starlight/props';
       <span class="av-wordmark">agent<span class="av-wordmark-v">v</span></span>
     </a>
     <div class="av-nav-links">
-      <a href="/docs/getting-started/introduction/">Docs</a>
+      <a href="/docs/">Docs</a>
       <a href="https://github.com/EntityProcess/agentv" target="_blank" rel="noopener noreferrer">GitHub</a>
       <button class="av-nav-pill" data-command="npm install -g agentv">
         <code>npm install -g agentv</code>
@@ -39,7 +39,7 @@ import type { Props } from '@astrojs/starlight/props';
           Deterministic code judges + customizable LLM judges, version-controlled in Git.
         </p>
         <div class="av-hero-cta">
-          <a href="/docs/getting-started/introduction/" class="av-btn-primary">Get Started</a>
+          <a href="/docs/" class="av-btn-primary">Get Started</a>
           <a href="https://github.com/EntityProcess/agentv" class="av-btn-ghost" target="_blank" rel="noopener noreferrer">
             <svg xmlns="http://www.w3.org/2000/svg" width="18" height="18" viewBox="0 0 24 24" fill="currentColor"><path d="M12 0c-6.626 0-12 5.373-12 12 0 5.302 3.438 9.8 8.207 11.387.599.111.793-.261.793-.577v-2.234c-3.338.726-4.033-1.416-4.033-1.416-.546-1.387-1.333-1.756-1.333-1.756-1.089-.745.083-.729.083-.729 1.205.084 1.839 1.237 1.839 1.237 1.07 1.834 2.807 1.304 3.492.997.107-.775.418-1.305.762-1.604-2.665-.305-5.467-1.334-5.467-5.931 0-1.311.469-2.381 1.236-3.221-.124-.303-.535-1.524.117-3.176 0 0 1.008-.322 3.301 1.23.957-.266 1.983-.399 3.003-.404 1.02.005 2.047.138 3.006.404 2.291-1.552 3.297-1.23 3.297-1.23.653 1.653.242 2.874.118 3.176.77.84 1.235 1.911 1.235 3.221 0 4.609-2.807 5.624-5.479 5.921.43.372.823 1.102.823 2.222v3.293c0 .319.192.694.801.576 4.765-1.589 8.199-6.086 8.199-11.386 0-6.627-5.373-12-12-12z"/></svg>
             GitHub
@@ -268,7 +268,7 @@ tests:
       <h2 class="av-gradient-text">Start evaluating your agents</h2>
       <p class="av-footer-sub">Open source. Local-first. MIT Licensed.</p>
       <div class="av-footer-actions">
-        <a href="/docs/getting-started/introduction/" class="av-btn-primary">Read the docs</a>
+        <a href="/docs/" class="av-btn-primary">Read the docs</a>
         <button class="av-footer-install" data-command="npm install -g agentv">
           <code>$ npm install -g agentv</code>
           <span class="av-copy-icon">
diff --git a/apps/web/src/content/docs/docs/getting-started/introduction.mdx b/apps/web/src/content/docs/docs.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/getting-started/introduction.mdx
rename to apps/web/src/content/docs/docs.mdx
diff --git a/apps/web/src/content/docs/docs/evaluation/batch-cli.mdx b/apps/web/src/content/docs/evaluation/batch-cli.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/evaluation/batch-cli.mdx
rename to apps/web/src/content/docs/evaluation/batch-cli.mdx
diff --git a/apps/web/src/content/docs/docs/evaluation/eval-cases.mdx b/apps/web/src/content/docs/evaluation/eval-cases.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/evaluation/eval-cases.mdx
rename to apps/web/src/content/docs/evaluation/eval-cases.mdx
diff --git a/apps/web/src/content/docs/docs/evaluation/eval-files.mdx b/apps/web/src/content/docs/evaluation/eval-files.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/evaluation/eval-files.mdx
rename to apps/web/src/content/docs/evaluation/eval-files.mdx
diff --git a/apps/web/src/content/docs/docs/evaluation/examples.mdx b/apps/web/src/content/docs/evaluation/examples.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/evaluation/examples.mdx
rename to apps/web/src/content/docs/evaluation/examples.mdx
diff --git a/apps/web/src/content/docs/docs/evaluation/rubrics.mdx b/apps/web/src/content/docs/evaluation/rubrics.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/evaluation/rubrics.mdx
rename to apps/web/src/content/docs/evaluation/rubrics.mdx
diff --git a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx b/apps/web/src/content/docs/evaluation/running-evals.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/evaluation/running-evals.mdx
rename to apps/web/src/content/docs/evaluation/running-evals.mdx
diff --git a/apps/web/src/content/docs/docs/evaluation/sdk.mdx b/apps/web/src/content/docs/evaluation/sdk.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/evaluation/sdk.mdx
rename to apps/web/src/content/docs/evaluation/sdk.mdx
diff --git a/apps/web/src/content/docs/docs/evaluators/code-graders.mdx b/apps/web/src/content/docs/evaluators/code-graders.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/evaluators/code-graders.mdx
rename to apps/web/src/content/docs/evaluators/code-graders.mdx
diff --git a/apps/web/src/content/docs/docs/evaluators/composite.mdx b/apps/web/src/content/docs/evaluators/composite.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/evaluators/composite.mdx
rename to apps/web/src/content/docs/evaluators/composite.mdx
diff --git a/apps/web/src/content/docs/docs/evaluators/custom-assertions.mdx b/apps/web/src/content/docs/evaluators/custom-assertions.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/evaluators/custom-assertions.mdx
rename to apps/web/src/content/docs/evaluators/custom-assertions.mdx
diff --git a/apps/web/src/content/docs/docs/evaluators/custom-evaluators.mdx b/apps/web/src/content/docs/evaluators/custom-evaluators.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/evaluators/custom-evaluators.mdx
rename to apps/web/src/content/docs/evaluators/custom-evaluators.mdx
diff --git a/apps/web/src/content/docs/docs/evaluators/execution-metrics.mdx b/apps/web/src/content/docs/evaluators/execution-metrics.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/evaluators/execution-metrics.mdx
rename to apps/web/src/content/docs/evaluators/execution-metrics.mdx
diff --git a/apps/web/src/content/docs/docs/evaluators/llm-graders.mdx b/apps/web/src/content/docs/evaluators/llm-graders.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/evaluators/llm-graders.mdx
rename to apps/web/src/content/docs/evaluators/llm-graders.mdx
diff --git a/apps/web/src/content/docs/docs/evaluators/structured-data.mdx b/apps/web/src/content/docs/evaluators/structured-data.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/evaluators/structured-data.mdx
rename to apps/web/src/content/docs/evaluators/structured-data.mdx
diff --git a/apps/web/src/content/docs/docs/evaluators/tool-trajectory.mdx b/apps/web/src/content/docs/evaluators/tool-trajectory.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/evaluators/tool-trajectory.mdx
rename to apps/web/src/content/docs/evaluators/tool-trajectory.mdx
diff --git a/apps/web/src/content/docs/docs/getting-started/installation.mdx b/apps/web/src/content/docs/getting-started/installation.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/getting-started/installation.mdx
rename to apps/web/src/content/docs/getting-started/installation.mdx
diff --git a/apps/web/src/content/docs/getting-started/introduction.mdx b/apps/web/src/content/docs/getting-started/introduction.mdx
new file mode 100644
index 000000000..895d15adf
--- /dev/null
+++ b/apps/web/src/content/docs/getting-started/introduction.mdx
@@ -0,0 +1,53 @@
+---
+title: Introduction
+description: What AgentV is and why it exists
+sidebar:
+  order: 1
+---
+
+AgentV is a CLI-first AI agent evaluation framework. It evaluates your agents locally with multi-objective scoring (correctness, latency, cost, safety) from YAML specifications. It combines deterministic code graders with customizable LLM graders, all version-controlled in Git.
+
+## Why AgentV?
+
+**Best for:** Developers who want evaluation in their workflow, not a separate dashboard. Teams prioritizing privacy and reproducibility.
+
+- **No cloud dependency** — everything runs locally
+- **No server** — just install and run
+- **Version-controlled** — YAML evaluation files live in Git alongside your code
+- **CI/CD ready** — run evaluations in your pipeline without external API calls
+- **Multiple evaluator types** — code validators, LLM graders, custom Python/TypeScript
+
+## How AgentV Compares
+
+| Feature | AgentV | LangWatch | LangSmith | LangFuse |
+|---------|--------|-----------|-----------|----------|
+| **Setup** | `npm install -g agentv` | Cloud account + API key | Cloud account + API key | Cloud account + API key |
+| **Server** | None (local) | Managed cloud | Managed cloud | Managed cloud |
+| **Privacy** | All local | Cloud-hosted | Cloud-hosted | Cloud-hosted |
+| **CLI-first** | Yes | No | Limited | Limited |
+| **CI/CD ready** | Yes | Requires API calls | Requires API calls | Requires API calls |
+| **Version control** | Yes (YAML in Git) | No | No | No |
+| **Evaluators** | Code + LLM + Custom | LLM only | LLM + Code | LLM only |
+
+## Core Concepts
+
+**Evaluation files** (`.yaml` or `.jsonl`) define test cases with expected outcomes. **Targets** specify which agent or provider to evaluate. **Graders** (code or LLM) score results. **Results** are written as JSONL/YAML for analysis and comparison.
+
+### Key Components
+
+- **Eval files** — YAML or JSONL definitions of test cases
+- **Tests** — Individual test entries with input messages and expected outcomes
+- **Targets** — The agent or LLM provider being evaluated
+- **Evaluators** — Code graders (Python/TypeScript) or LLM graders that score responses
+- **Rubrics** — Structured criteria with weights for grading
+- **Results** — JSONL output with scores, reasoning, and execution traces
+
+## Features
+
+- **Multi-objective scoring**: Correctness, latency, cost, safety in one run
+- **Multiple evaluator types**: Code validators, LLM graders, custom Python/TypeScript
+- **Built-in targets**: VS Code Copilot, Codex CLI, Pi Coding Agent, Azure OpenAI, local CLI agents
+- **Structured evaluation**: Rubric-based grading with weights and requirements
+- **Batch evaluation**: Run hundreds of test cases in parallel
+- **Export**: JSON, JSONL, YAML formats
+- **Compare results**: Compute deltas between evaluation runs for A/B testing
diff --git a/apps/web/src/content/docs/docs/getting-started/quickstart.mdx b/apps/web/src/content/docs/getting-started/quickstart.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/getting-started/quickstart.mdx
rename to apps/web/src/content/docs/getting-started/quickstart.mdx
diff --git a/apps/web/src/content/docs/docs/guides/agent-eval-layers.mdx b/apps/web/src/content/docs/guides/agent-eval-layers.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/guides/agent-eval-layers.mdx
rename to apps/web/src/content/docs/guides/agent-eval-layers.mdx
diff --git a/apps/web/src/content/docs/docs/guides/agent-skills-evals.mdx b/apps/web/src/content/docs/guides/agent-skills-evals.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/guides/agent-skills-evals.mdx
rename to apps/web/src/content/docs/guides/agent-skills-evals.mdx
diff --git a/apps/web/src/content/docs/docs/guides/autoevals-integration.mdx b/apps/web/src/content/docs/guides/autoevals-integration.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/guides/autoevals-integration.mdx
rename to apps/web/src/content/docs/guides/autoevals-integration.mdx
diff --git a/apps/web/src/content/docs/docs/guides/eval-authoring.mdx b/apps/web/src/content/docs/guides/eval-authoring.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/guides/eval-authoring.mdx
rename to apps/web/src/content/docs/guides/eval-authoring.mdx
diff --git a/apps/web/src/content/docs/docs/guides/evaluation-types.mdx b/apps/web/src/content/docs/guides/evaluation-types.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/guides/evaluation-types.mdx
rename to apps/web/src/content/docs/guides/evaluation-types.mdx
diff --git a/apps/web/src/content/docs/docs/guides/git-cache-workspace.mdx b/apps/web/src/content/docs/guides/git-cache-workspace.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/guides/git-cache-workspace.mdx
rename to apps/web/src/content/docs/guides/git-cache-workspace.mdx
diff --git a/apps/web/src/content/docs/docs/guides/human-review.mdx b/apps/web/src/content/docs/guides/human-review.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/guides/human-review.mdx
rename to apps/web/src/content/docs/guides/human-review.mdx
diff --git a/apps/web/src/content/docs/docs/guides/skill-improvement-workflow.mdx b/apps/web/src/content/docs/guides/skill-improvement-workflow.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/guides/skill-improvement-workflow.mdx
rename to apps/web/src/content/docs/guides/skill-improvement-workflow.mdx
diff --git a/apps/web/src/content/docs/docs/guides/workspace-pool.mdx b/apps/web/src/content/docs/guides/workspace-pool.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/guides/workspace-pool.mdx
rename to apps/web/src/content/docs/guides/workspace-pool.mdx
diff --git a/apps/web/src/content/docs/docs/integrations/langfuse.mdx b/apps/web/src/content/docs/integrations/langfuse.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/integrations/langfuse.mdx
rename to apps/web/src/content/docs/integrations/langfuse.mdx
diff --git a/apps/web/src/content/docs/reference/comparison.mdx b/apps/web/src/content/docs/reference/comparison.mdx
new file mode 100644
index 000000000..d850dfd09
--- /dev/null
+++ b/apps/web/src/content/docs/reference/comparison.mdx
@@ -0,0 +1,126 @@
+---
+title: Comparison
+description: How AgentV compares to other evaluation frameworks.
+---
+
+## Quick Comparison
+
+| Aspect | **AgentV** | **Braintrust** | **Langfuse** | **LangSmith** | **LangWatch** | **Google ADK** | **Mastra** | **OpenCode Bench** |
+|--------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
+| **Primary Focus** | Agent evaluation & testing | Evaluation + logging | Observability + evaluation | Observability + evaluation | LLM ops & evaluation | Agent development | Agent/workflow development | Coding agent benchmarking |
+| **Language** | TypeScript/CLI | Python/TypeScript | Python/JavaScript | Python/JavaScript | Python/JavaScript | Python | TypeScript | Python/CLI |
+| **Deployment** | Local (CLI-first) | Cloud | Cloud/self-hosted | Cloud only | Cloud/self-hosted/hybrid | Local/Cloud Run | Local/server | Benchmarking service |
+| **Self-contained** | Yes | No (cloud) | No (requires server) | No (cloud-only) | No (requires server) | Yes | Yes (optional) | No (requires service) |
+| **Evaluation Focus** | Core feature | Core feature | Yes | Yes | Core feature | Minimal | Secondary | Core feature |
+| **Judge Types** | Code + LLM (custom prompts) | Code + LLM (custom) | LLM-as-judge only | LLM-based + custom | LLM + real-time | Built-in metrics | Built-in (minimal) | Multi-judge LLM (3 judges) |
+| **CLI-First** | Yes | No (SDK-first) | Dashboard-first | Dashboard-first | Dashboard-first | Code-first | Code-first | Service-based |
+| **Open Source** | MIT | Closed source | Apache 2.0 | Closed | Closed | Apache 2.0 | MIT | Open source |
+| **Setup Time** | &lt; 2 min | 5+ min | 15+ min | 10+ min | 20+ min | 30+ min | 10+ min | 5-10 min |
+
+## AgentV vs. Braintrust
+
+| Feature | AgentV | Braintrust |
+|---------|--------|-----------|
+| **Evaluation** | Code + LLM (custom prompts) | Code + LLM (Autoevals library) |
+| **Deployment** | Local (no server) | Cloud-only (managed) |
+| **Open source** | MIT | Closed source |
+| **Pricing** | Free | Free tier + paid plans |
+| **CLI-first** | Yes | SDK-first (Python/TS) |
+| **Custom judge prompts** | Markdown files (Git) | SDK-based |
+| **Observability** | No | Yes (logging, tracing) |
+| **Datasets** | YAML/JSONL in Git | Managed in platform |
+| **CI/CD** | Native (exit codes) | API-based |
+| **Collaboration** | Git-based | Web dashboard |
+
+**Choose AgentV if:** You want local-first, open-source evaluation with evals version-controlled in Git.
+**Choose Braintrust if:** You want a managed platform with built-in logging, datasets, and team collaboration.
+
+## AgentV vs. Langfuse
+
+| Feature | AgentV | Langfuse |
+|---------|--------|----------|
+| **Evaluation** | Code + LLM (custom prompts) | LLM only |
+| **Local execution** | Yes | No (requires server) |
+| **Speed** | Fast (no network) | Slower (API round-trips) |
+| **Setup** | `npm install` | Docker + database |
+| **Cost** | Free | Free + $299+/mo for production |
+| **Observability** | No | Full tracing |
+| **Custom judge prompts** | Version in Git | API-based |
+| **CI/CD ready** | Yes | Requires API calls |
+
+**Choose AgentV if:** You iterate locally on evals and need deterministic and subjective judges together.
+**Choose Langfuse if:** You need production observability + team dashboards.
+
+## AgentV vs. LangSmith
+
+| Feature | AgentV | LangSmith |
+|---------|--------|-----------|
+| **Evaluation** | Code + LLM custom | LLM-based (SDK) |
+| **Deployment** | Local (no server) | Cloud only |
+| **Framework lock-in** | None | LangChain ecosystem |
+| **Open source** | MIT | Closed |
+| **Local execution** | Yes | No (requires API calls) |
+| **Observability** | No | Full tracing |
+
+**Choose AgentV if:** You want local evaluation, deterministic judges, and open-source tooling.
+**Choose LangSmith if:** You're heavily invested in LangChain and need production tracing.
+
+## AgentV vs. LangWatch
+
+| Feature | AgentV | LangWatch |
+|---------|--------|-----------|
+| **Evaluation focus** | Development-first | Team collaboration first |
+| **Execution** | Local | Cloud/self-hosted server |
+| **Custom judge prompts** | Markdown files (Git) | UI-based |
+| **Code judges** | Yes | LLM-focused |
+| **Setup** | &lt; 2 min | 20+ min |
+| **Team features** | No | Annotation, roles, review |
+
+**Choose AgentV if:** You develop locally, want fast iteration, and prefer code judges.
+**Choose LangWatch if:** You need team collaboration, managed optimization, or on-prem deployment.
+
+## AgentV vs. Google ADK
+
+| Feature | AgentV | Google ADK |
+|---------|--------|-----------|
+| **Purpose** | Evaluation | Agent development |
+| **Evaluation capability** | Comprehensive | Built-in metrics only |
+| **Setup** | &lt; 2 min | 30+ min |
+| **Code-first** | YAML-first | Python-first |
+
+**Choose AgentV if:** You need to evaluate agents (not build them).
+**Choose Google ADK if:** You're building multi-agent systems.
+
+## AgentV vs. Mastra
+
+| Feature | AgentV | Mastra |
+|---------|--------|--------|
+| **Purpose** | Agent evaluation & testing | Agent/workflow development framework |
+| **Evaluation** | Core focus (code + LLM judges) | Secondary, built-in only |
+| **Agent Building** | No (tests agents) | Yes (builds agents with tools, workflows) |
+| **Open Source** | MIT | MIT |
+
+**Choose AgentV if:** You need to test/evaluate agents.
+**Choose Mastra if:** You're building TypeScript AI agents and need orchestration.
+
+## When to Use AgentV
+
+**Best for:** Individual developers and teams that evaluate locally before deploying and need custom evaluation criteria.
+
+**Use something else for:**
+- Production observability → Langfuse or LangWatch
+- Team dashboards → LangWatch, Langfuse, or Braintrust
+- Building agents → Mastra (TypeScript) or Google ADK (Python)
+- Standardized benchmarking → OpenCode Bench
+
+## Ecosystem Recommendation
+
+```
+Build agents (Mastra / Google ADK)
+    ↓
+Evaluate locally (AgentV)
+    ↓
+Block regressions in CI/CD (AgentV)
+    ↓
+Monitor in production (Langfuse / LangWatch / Braintrust)
+```
diff --git a/apps/web/src/content/docs/docs/targets/coding-agents.mdx b/apps/web/src/content/docs/targets/coding-agents.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/targets/coding-agents.mdx
rename to apps/web/src/content/docs/targets/coding-agents.mdx
diff --git a/apps/web/src/content/docs/docs/targets/configuration.mdx b/apps/web/src/content/docs/targets/configuration.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/targets/configuration.mdx
rename to apps/web/src/content/docs/targets/configuration.mdx
diff --git a/apps/web/src/content/docs/docs/targets/custom-providers.mdx b/apps/web/src/content/docs/targets/custom-providers.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/targets/custom-providers.mdx
rename to apps/web/src/content/docs/targets/custom-providers.mdx
diff --git a/apps/web/src/content/docs/docs/targets/llm-providers.mdx b/apps/web/src/content/docs/targets/llm-providers.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/targets/llm-providers.mdx
rename to apps/web/src/content/docs/targets/llm-providers.mdx
diff --git a/apps/web/src/content/docs/docs/targets/retry.mdx b/apps/web/src/content/docs/targets/retry.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/targets/retry.mdx
rename to apps/web/src/content/docs/targets/retry.mdx
diff --git a/apps/web/src/content/docs/docs/tools/compare.mdx b/apps/web/src/content/docs/tools/compare.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/tools/compare.mdx
rename to apps/web/src/content/docs/tools/compare.mdx
diff --git a/apps/web/src/content/docs/docs/tools/convert.mdx b/apps/web/src/content/docs/tools/convert.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/tools/convert.mdx
rename to apps/web/src/content/docs/tools/convert.mdx
diff --git a/apps/web/src/content/docs/docs/tools/trace.mdx b/apps/web/src/content/docs/tools/trace.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/tools/trace.mdx
rename to apps/web/src/content/docs/tools/trace.mdx
diff --git a/apps/web/src/content/docs/docs/tools/validate.mdx b/apps/web/src/content/docs/tools/validate.mdx
similarity index 100%
rename from apps/web/src/content/docs/docs/tools/validate.mdx
rename to apps/web/src/content/docs/tools/validate.mdx
diff --git a/docs/COMPARISON.md b/docs/COMPARISON.md
deleted file mode 100644
index dd7bfc73f..000000000
--- a/docs/COMPARISON.md
+++ /dev/null
@@ -1,397 +0,0 @@
-# AgentV vs. Related Frameworks
-
-## Quick Comparison
-
-| Aspect | **AgentV** | **Langfuse** | **LangSmith** | **LangWatch** | **Google ADK** | **Mastra** | **OpenCode Bench** |
-|--------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
-| **Primary Focus** | Agent evaluation & testing | Observability + evaluation | Observability + evaluation | LLM ops & evaluation | Agent development | Agent/workflow development | Coding agent benchmarking |
-| **Language** | TypeScript/CLI | Python/JavaScript | Python/JavaScript | Python/JavaScript | Python | TypeScript | Python/CLI |
-| **Deployment** | Local (CLI-first) | Cloud/self-hosted | Cloud only | Cloud/self-hosted/hybrid | Local/Cloud Run | Local/server | Benchmarking service |
-| **Self-contained** | ✓ Yes | ✗ Requires server | ✗ Cloud-only | ✗ Requires server | ✓ Yes | ✓ Yes (optional) | ✗ Requires service |
-| **Evaluation Focus** | ✓ Core feature | ✓ Yes | ✓ Yes | ✓ Core feature | ✗ Minimal | ✗ Secondary | ✓ Core feature |
-| **Judge Types** | Code + LLM (custom prompts) | LLM-as-judge only | LLM-based + custom | LLM + real-time | Built-in metrics | Built-in (minimal) | Multi-judge LLM (3 judges) |
-| **CLI-First** | ✓ Yes | ✓ Dashboard-first | ✓ Dashboard-first | ✓ Dashboard-first | ✓ Code-first | ✓ Code-first | ✓ Service-based |
-| **Open Source** | ✓ MIT | ✓ Apache 2.0 | ✓ Closed | ✓ Closed | ✓ Apache 2.0 | ✓ MIT | ✓ Open source |
-| **Setup Time** | < 2 min | 15+ min | 10+ min | 20+ min | 30+ min | 10+ min | 5-10 min (CLI) |
-| **Local Iteration Speed** | ✓ Instant (evals) | ✗ UI-mediated | ✗ API calls | ✗ UI-mediated | ✓ Instant (agents) | ✓ Instant (code) | ✗ 30+ min per run |
-| **Deterministic Evaluation** | ✓ Code judges | ✗ (LLM-biased) | ✗ (LLM-biased) | ✗ (LLM-biased) | ✓ Built-in | ~ (Custom code) | ✗ (LLM-based) |
-| **Real-World Tasks** | ~ (Your data) | ~ (Your data) | ~ (Your data) | ~ (Your data) | ~ (Your design) | N/A (agent building) | ✓ GitHub commits |
-
-## Technical Differences
-
-### How AgentV Works
-
-**1. Hybrid Judge System (Code + LLM with Custom Prompts)**
-```yaml
-assertions:
-  - name: format_check
-    type: code_judge           # Deterministic: checks concrete outputs
-    command: ./validators/check_format.py
-
-  - name: correctness
-    type: llm_judge            # Subjective: uses customizable judge prompt
-    prompt: ./judges/correctness.md  # Edit the prompt, not the code
-```
-
-This is more powerful than:
-- **Langfuse**: LLM judges only, limited prompt customization via API
-- **LangSmith**: LLM-biased, requires SDK modifications for custom logic
-- **LangWatch**: UI-driven prompt customization (not version-controlled)
-- **Google ADK**: Not focused on evaluation (agent development framework)
-
-**Why this matters:**
-- Code judges catch objective failures (syntax errors, missing fields, wrong format)
-- LLM judges handle subjective criteria (tone, helpfulness, reasoning quality)
-- Customizable prompts = iterate on eval criteria without code changes
-- All version-controlled in Git alongside your evals
-
-**2. Local-First Workflow**
-No network round-trips, no waiting for managed infrastructure:
-- Edit eval YAML → Run → Get results in seconds
-- Iteration speed: **Code judges (instant) + LLM judges (1-2 sec per case)**
-- Compare to Langfuse/LangWatch: UI clicks + backend processing
-
-**3. CLI-Native, Not UI-Native**
-```bash
-# AgentV workflow
-agentv eval evals/my-eval.yaml
-agentv eval evals/**/*.yaml --workers 10  # Parallel
-agentv compare .agentv/results/runs/eval_<timestamp>/index.jsonl
-agentv compare .agentv/results/runs/eval_<timestamp>/index.jsonl --baseline gpt-4.1
-agentv compare before.jsonl after.jsonl   # Two-file pairwise A/B testing
-```
-
-```bash
-# Langfuse/LangWatch workflow
-# 1. Log in to web UI
-# 2. Create evaluation in UI
-# 3. Configure judges in UI
-# 4. Run evaluation
-# 5. View results in dashboard
-```
-
-AgentV integrates into:
-- **CI/CD pipelines** (`agentv eval evals/` + `agentv compare .agentv/results/runs/eval_<timestamp>/index.jsonl`)
-- **Git hooks** (block PRs if eval scores drop)
-- **Scripts** (parse `index.jsonl`, `benchmark.json`)
-- **Notebooks** (iterate on eval logic)
-
-**4. Zero Infrastructure Overhead**
-```bash
-npm install -g agentv
-agentv init
-agentv eval evals/example.yaml
-# Done. No Docker, no K8s, no managed service.
-```
-
-vs Langfuse:
-```bash
-docker-compose up -d  # Spin up managed infrastructure
-# Configure database, API keys
-# Wait for services to start
-# Create evaluations in web UI
-# ...
-```
-
-## Practical Use Cases
-
-### Scenario: Iterating on Eval Criteria
-
-```markdown
-# judges/correctness.md (edit locally, version in Git)
-Evaluate if the answer is mathematically correct.
-
-## Scoring
-- 1.0: Correct answer with clear reasoning
-- 0.8: Correct answer, reasoning unclear
-- 0.5: Partially correct
-- 0.0: Wrong answer
-```
-
-Then re-run: `agentv eval evals/math.yaml`
-
-Alternative approaches:
-- Langfuse/LangWatch: Go to UI, modify prompt, save, re-run
-- LangSmith: Modify SDK code, redeploy
-- Google ADK: Modify Python code, rerun framework
-
-### Scenario: Deterministic + Subjective Evaluation
-
-```yaml
-assertions:
-  - name: syntax_check
-    type: code_judge
-    command: ["python", "check_syntax.py"]
-  - name: logic_check
-    type: code_judge
-    command: ["python", "check_logic.py"]
-  - name: explanation_quality
-    type: llm_judge
-    prompt: judges/explanation.md
-```
-
-Single eval run scores all three dimensions. Other approaches:
-- Langfuse: LLM judges only (no deterministic checks)
-- LangSmith: Requires custom evaluation SDK calls
-- LangWatch: UI judges only (mixing code + UI-driven)
-
-### Scenario: Reproducible Local Evals in CI/CD
-
-```yaml
-# .github/workflows/eval.yml
-- run: agentv eval evals/**/*.yaml
-- run: agentv compare .agentv/results/runs/eval_<timestamp>/index.jsonl --baseline gpt-4.1
-  # Exit 1 if any target regresses vs baseline (N-way matrix)
-- run: agentv compare baseline.jsonl results.jsonl --threshold 0.05
-  # Or two-file pairwise: fail if performance drops > 5%
-```
-
-Other tools face challenges here:
-- Langfuse/LangWatch: Require external service (not CI-friendly)
-- LangSmith: Cloud-only, no local execution
-- Google ADK: Not designed for evals
-
-### Scenario: Fast Iteration Feedback Loop
-
-```
-Edit eval → Save → agentv eval (1-2 sec) → Review results
-vs
-Edit in UI → Click Save → Wait for backend → Refresh dashboard (10-20 sec)
-```
-
-Other tools:
-- Langfuse: UI-mediated (slower feedback loop)
-- LangSmith: SDK calls + cloud latency
-- LangWatch: UI-mediated (slower)
-- Google ADK: Code change + rerun
-
-## Trade-offs and Alternatives
-
-### Production Monitoring & Observability
-**Use Langfuse, LangSmith, or LangWatch instead**
-
-AgentV evaluates static test cases. It doesn't:
-- ✗ Capture production traces
-- ✗ Monitor LLM call latency in production
-- ✗ Alert on failures in real-world usage
-- ✗ Track cost-per-request
-
-**Recommendation:** Use AgentV for development → Langfuse/LangWatch for production
-
-### Team Collaboration & Dashboards
-**Use LangWatch or Langfuse instead**
-
-AgentV uses Git-based collaboration (like code), not web dashboards:
-- ✓ Git version control (evals, judges, results)
-- ✓ PR reviews for eval changes
-- ✓ Branch-based experimentation
-- ✗ No real-time web dashboard
-- ✗ No in-app annotation/review UI
-- ✗ No role-based access control
-
-### Prompt Optimization
-
-**AgentV approach:**
-- ✓ Has a prompt optimization skill that leverages coding agents
-- ✓ Agents iteratively improve prompts based on eval results
-- ✓ Lightweight and integrated with your eval workflow
-
-**LangWatch approach:**
-- ✓ Built-in MIPROv2 automatic optimization
-- Requires team collaboration features and managed service
-
-### Prompt Version Control & Management
-**Use Langfuse instead**
-
-Langfuse has:
-- ✓ Centralized prompt versioning
-- ✓ A/B testing UI
-- ✓ Automatic caching
-
-AgentV approach: Store judge prompts in Git, manage manually
-
-## Direct Comparisons
-
-### AgentV vs. Langfuse
-
-| Feature | AgentV | Langfuse |
-|---------|--------|----------|
-| **Evaluation** | Code + LLM (custom prompts) | LLM only |
-| **Local execution** | ✓ Yes | ✗ (requires server) |
-| **Speed** | Fast (no network) | Slower (API round-trips) |
-| **Setup** | `npm install` | Docker + database |
-| **Cost** | Free | Free + $299+/mo for production |
-| **Observability** | ✗ No | ✓ Full tracing |
-| **Collaboration** | ✗ No | ✓ Team UI |
-| **Custom judge prompts** | ✓ Version in Git | ~ (API-based) |
-| **CI/CD ready** | ✓ Yes | ~ (Requires API calls) |
-
-**Choose AgentV if:** You iterate locally on evals, need deterministic + subjective judges together
-**Choose Langfuse if:** You need production observability + team dashboards
-
-### AgentV vs. LangWatch
-
-| Feature | AgentV | LangWatch |
-|---------|--------|-----------|
-| **Evaluation focus** | Development-first | Team collaboration first |
-| **Execution** | Local | Cloud/self-hosted server |
-| **Custom judge prompts** | ✓ Markdown files (Git) | ✓ UI-based |
-| **Code judges** | ✓ Yes | ✗ LLM-focused |
-| **Prompt optimization** | ✓ Via skill + agents | ✓ Built-in MIPROv2 |
-| **Setup** | < 2 min | 20+ min |
-| **Iteration speed** | ✓ Instant | ~ UI-mediated |
-| **Team features** | ✗ No | ✓ Annotation, roles, review |
-
-**Choose AgentV if:** You develop locally, want fast iteration, prefer code judges, need lightweight optimization
-**Choose LangWatch if:** You need team collaboration, managed optimization, on-prem deployment
-
-### AgentV vs. LangSmith
-
-| Feature | AgentV | LangSmith |
-|---------|--------|-----------|
-| **Evaluation** | Code + LLM custom | LLM-based (SDK) |
-| **Deployment** | Local (no server) | Cloud only |
-| **Framework lock-in** | None | LangChain ecosystem |
-| **Open source** | ✓ MIT | ✗ Closed |
-| **Setup** | Minimal | Requires API key + SDK setup |
-| **Local execution** | ✓ Yes | ✗ (requires API calls) |
-| **Observability** | ✗ No | ✓ Full tracing |
-| **Production ready** | ✗ (dev tool) | ✓ Yes |
-
-**Choose AgentV if:** You want local evaluation, deterministic judges, open source
-**Choose LangSmith if:** You're LangChain-heavy, need production tracing
-
-### AgentV vs. Google ADK
-
-| Feature | AgentV | Google ADK |
-|---------|--------|-----------|
-| **Purpose** | Evaluation | Agent development |
-| **Evaluation capability** | ✓ Comprehensive | ~ (Built-in metrics only) |
-| **Judge customization** | ✓ Code + LLM prompts | ✗ Limited |
-| **Setup** | < 2 min | 30+ min |
-| **Code-first** | ✗ YAML-first | ✓ Python-first |
-| **Learning curve** | Low | High |
-| **Multi-agent support** | ✗ (tests agents) | ✓ (builds agents) |
-| **Deployment options** | Local | Local + Cloud Run |
-
-**Choose AgentV if:** You need to evaluate agents (not build them)
-**Choose Google ADK if:** You're building multi-agent systems and need development framework
-
-### AgentV vs. Mastra
-
-| Feature | AgentV | Mastra |
-|---------|--------|--------|
-| **Purpose** | Agent evaluation & testing | Agent/workflow development framework |
-| **Language** | TypeScript (CLI-native) | TypeScript (code-native) |
-| **Evaluation** | ✓ Core focus (code + LLM judges) | ~ (Secondary, built-in only) |
-| **Judge Customization** | ✓ High (custom prompts, code judges) | ✗ Fixed built-in metrics |
-| **Agent Building** | ✗ (Tests agents) | ✓ (Builds agents with tools, workflows) |
-| **Workflow Orchestration** | ✗ No | ✓ Yes (`.then()`, `.branch()`, `.parallel()`) |
-| **Model Routing** | ✗ (External) | ✓ (40+ providers unified) |
-| **Context Management** | ✗ No | ✓ (Memory, RAG, history) |
-| **Setup Time** | < 2 min | 10+ min |
-| **Setup Complexity** | Minimal | Medium (npm + TypeScript) |
-| **Evaluation Iteration Speed** | ✓ Instant | ~ Code change + rerun |
-| **Open Source** | ✓ MIT | ✓ MIT |
-
-**Key Difference:**
-- **AgentV**: Specialized tool for evaluating agents (any language, any agent type)
-- **Mastra**: Full framework for building AI agents in TypeScript
-
-**Complementary Use:**
-```
-Mastra (build TypeScript agents)
-    ↓
-AgentV (evaluate your agents with custom criteria)
-    ↓
-Mastra (deploy agents in production)
-```
-
-**Choose AgentV if:** You need to test/evaluate agents, fast iteration on metrics, mix of deterministic + subjective scoring
-**Choose Mastra if:** You're building TypeScript AI agents and need orchestration, context management, multiple LLM providers
-
-### AgentV vs. OpenCode Bench
-
-| Feature | AgentV | OpenCode Bench |
-|---------|--------|---------|
-| **Purpose** | General agent evaluation (any task) | Benchmarking coding agents on real GitHub commits |
-| **Task Source** | You define tasks/expected outcomes | Pre-curated GitHub production commits |
-| **Judge Type** | Code + LLM (customizable) | Multi-judge LLM (3 judges, fixed) |
-| **Scoring Dimensions** | You define (custom rubrics) | 5 fixed: API compliance, logic, integration, tests, checks |
-| **Execution** | Local (seconds) | Remote (30+ min per run) |
-| **Variance Handling** | Single run | 3 runs per task (episode isolation) + variance penalties |
-| **Setup** | < 2 min | 5-10 min CLI setup |
-| **Customization** | High (custom judges, prompts, metrics) | Low (fixed benchmark) |
-| **Use Case** | Develop & iterate on evals | Compare agents against standard benchmark |
-
-**Key Difference:**
-- **AgentV**: Build custom evaluations for your specific needs, iterate quickly locally
-- **OpenCode Bench**: Standardized benchmark to rank coding agents against production GitHub tasks
-
-**Complementary Use:**
-```
-AgentV → Develop your agent → Evaluate locally with custom rubrics
-OpenCode Bench → When ready, submit to public benchmark for objective ranking
-```
-
-**Choose AgentV if:** You need custom evaluation criteria, fast iteration, control over tasks
-**Choose OpenCode Bench if:** You want standard benchmark ranking, reproducible comparison, real-world GitHub tasks
-
-## Key Characteristics
-
-AgentV is designed for developers who prefer working in code and version control over UI-driven workflows:
-
-- **Local-first execution**: Evaluations run entirely on your machine without external services
-- **Version-controlled criteria**: Judge prompts and evaluation configs live in Git alongside your code
-- **Hybrid evaluation**: Supports both deterministic code judges and LLM-based subjective judges
-- **CI/CD integration**: Designed to run in automated pipelines with exit codes and diff comparisons
-- **No infrastructure**: Single npm package, no databases or servers to manage
-- **MIT licensed**: Fork, modify, and distribute without restrictions
-
-This makes AgentV most useful during development and testing phases. For production observability and team collaboration, consider pairing it with tools like Langfuse or LangWatch that specialize in those areas.
-
-## When to Use AgentV
-
-**Don't use AgentV for:**
-- Production observability → Use Langfuse or LangWatch
-- Team collaboration dashboards → Use LangWatch or Langfuse
-- Building agents → Use Mastra (TypeScript) or Google ADK (Python)
-- Intricate production tracing → Use LangSmith
-- Standardized benchmarking → Use OpenCode Bench
-
-**Sweet spot:** Individual developers and teams that evaluate locally before deploying to production, and who need custom evaluation criteria tailored to their specific use case. Pairs naturally with Mastra and Google ADK for end-to-end development workflows.
-
-## Ecosystem Recommendation
-
-**Development to Production Pipeline:**
-
-```
-TypeScript Agents:
-  Mastra (build agents & workflows)
-      ↓
-  AgentV (test & iterate locally)
-      ↓
-  AgentV (CI/CD: block regressions)
-      ↓
-  Langfuse/LangWatch (production monitoring)
-
-Python Agents:
-  Google ADK (build multi-agent systems)
-      ↓
-  AgentV (test & iterate locally)
-      ↓
-  AgentV (CI/CD: block regressions)
-      ↓
-  Langfuse/LangWatch (production monitoring)
-
-Coding Agents (Optional):
-  AgentV (dev evals) → OpenCode Bench (public ranking) → production
-```
-
-**Role of Each Tool:**
-- **Mastra/Google ADK**: Build your agents
-- **AgentV**: Evaluate agents locally with custom criteria, block regressions in CI/CD
-- **OpenCode Bench**: Optional—submit coding agents to standardized public benchmark
-- **Langfuse/LangWatch**: Monitor agents in production, alerting and observability
-
-AgentV is the glue in your evaluation pipeline; it sits naturally between development frameworks and production monitoring.
diff --git a/docs/plans/2026-02-26-eval-schema-generation-design.md b/docs/plans/2026-02-26-eval-schema-generation-design.md
deleted file mode 100644
index 9d6047886..000000000
--- a/docs/plans/2026-02-26-eval-schema-generation-design.md
+++ /dev/null
@@ -1,646 +0,0 @@
-# Eval Schema Generation Implementation Plan
-
-> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
-
-**Goal:** Auto-generate `eval-schema.json` from a Zod schema and add a diff test to catch drift.
-
-**Architecture:** Create a comprehensive Zod schema (`eval-file.schema.ts`) that mirrors the eval YAML file structure. A generator script converts it to JSON Schema via `zod-to-json-schema`. A test regenerates and diffs against the committed file — if they diverge, it fails.
-
-**Tech Stack:** Zod, zod-to-json-schema, Vitest
-
----
-
-### Task 1: Add `zod-to-json-schema` dependency
-
-**Files:**
-- Modify: `packages/core/package.json`
-
-**Step 1: Install the dependency**
-
-Run: `cd /home/christso/projects/agentv && bun add -d zod-to-json-schema --cwd packages/core`
-
-**Step 2: Verify installation**
-
-Run: `grep zod-to-json-schema packages/core/package.json`
-Expected: `"zod-to-json-schema": "^3.x.x"` in devDependencies
-
-**Step 3: Commit**
-
-```bash
-git add packages/core/package.json bun.lock
-git commit -m "chore: add zod-to-json-schema dev dependency"
-```
-
----
-
-### Task 2: Create the eval file Zod schema
-
-**Files:**
-- Create: `packages/core/src/evaluation/validation/eval-file.schema.ts`
-
-**Context:** This schema represents the **YAML input format** (what users write), not the parsed runtime types. Key differences from runtime types:
-- Uses snake_case field names (YAML convention)
-- Includes shorthands (string input → message array)
-- Includes deprecated aliases (eval_cases, script, expected_outcome)
-- Uses `additionalProperties` / `.passthrough()` where custom config is allowed
-- Does NOT include resolved/computed fields (resolvedCwd, resolvedPromptPath, etc.)
-
-The schema should import `EVALUATOR_KIND_VALUES` from `types.ts` to stay in sync with the evaluator kind enum.
-
-**Step 1: Write the schema file**
-
-Create `packages/core/src/evaluation/validation/eval-file.schema.ts` with:
-
-```typescript
-/**
- * Zod schema for eval YAML file format.
- * Used to generate eval-schema.json for AI agent reference.
- *
- * IMPORTANT: This schema describes the YAML input format, not the parsed runtime types.
- * When adding new eval features, update this schema AND run `bun run generate:schema`
- * to regenerate eval-schema.json. The sync test will fail if they diverge.
- */
-import { z } from 'zod';
-
-// ---------------------------------------------------------------------------
-// Shared primitives
-// ---------------------------------------------------------------------------
-
-/** Message content: string or structured array */
-const ContentItemSchema = z.object({
-  type: z.enum(['text', 'file']),
-  value: z.string(),
-});
-
-const MessageContentSchema = z.union([
-  z.string(),
-  z.array(ContentItemSchema),
-]);
-
-const MessageSchema = z.object({
-  role: z.enum(['system', 'user', 'assistant', 'tool']),
-  content: MessageContentSchema,
-});
-
-/** Input: string shorthand or message array */
-const InputSchema = z.union([z.string(), z.array(MessageSchema)]);
-
-/** Expected output: string, object, or message array */
-const ExpectedOutputSchema = z.union([
-  z.string(),
-  z.record(z.unknown()),
-  z.array(MessageSchema),
-]);
-
-// ---------------------------------------------------------------------------
-// Evaluator schemas (YAML input format)
-// ---------------------------------------------------------------------------
-
-/** Common fields shared by all evaluators */
-const EvaluatorCommonSchema = z.object({
-  name: z.string().optional(),
-  weight: z.number().min(0).optional(),
-  required: z.union([z.boolean(), z.number().gt(0).lte(1)]).optional(),
-  negate: z.boolean().optional(),
-});
-
-/** Prompt: string (inline/file path) or executable script config */
-const PromptSchema = z.union([
-  z.string(),
-  z.object({
-    command: z.union([z.string(), z.array(z.string())]).optional(),
-    script: z.union([z.string(), z.array(z.string())]).optional(),
-    config: z.record(z.unknown()).optional(),
-  }),
-]);
-
-/** Score range for analytic rubrics */
-const ScoreRangeSchema = z.object({
-  score_range: z.tuple([z.number().int().min(0).max(10), z.number().int().min(0).max(10)]),
-  outcome: z.string().min(1),
-});
-
-/** Rubric item (checklist or score-range mode) */
-const RubricItemSchema = z.object({
-  id: z.string().optional(),
-  outcome: z.string().optional(),
-  weight: z.number().optional(),
-  required: z.boolean().optional(),
-  required_min_score: z.number().int().min(0).max(10).optional(),
-  score_ranges: z.array(ScoreRangeSchema).optional(),
-});
-
-// --- Type-specific evaluator schemas ---
-
-const CodeJudgeSchema = EvaluatorCommonSchema.extend({
-  type: z.literal('code_judge'),
-  command: z.union([z.string(), z.array(z.string())]),
-  script: z.union([z.string(), z.array(z.string())]).optional(),
-  cwd: z.string().optional(),
-  target: z.union([z.boolean(), z.object({ max_calls: z.number().optional() })]).optional(),
-  config: z.record(z.unknown()).optional(),
-});
-
-const LlmJudgeSchema = EvaluatorCommonSchema.extend({
-  type: z.literal('llm_judge'),
-  prompt: PromptSchema.optional(),
-  rubrics: z.array(RubricItemSchema).optional(),
-  model: z.string().optional(),
-  config: z.record(z.unknown()).optional(),
-});
-
-/** Aggregator configs for composite evaluator */
-const AggregatorSchema = z.discriminatedUnion('type', [
-  z.object({
-    type: z.literal('weighted_average'),
-    weights: z.record(z.number()).optional(),
-  }),
-  z.object({
-    type: z.literal('threshold'),
-    threshold: z.number().min(0).max(1),
-  }),
-  z.object({
-    type: z.literal('code_judge'),
-    path: z.string(),
-    cwd: z.string().optional(),
-  }),
-  z.object({
-    type: z.literal('llm_judge'),
-    prompt: z.string().optional(),
-    model: z.string().optional(),
-  }),
-]);
-
-// Use z.lazy for recursive composite evaluator
-const CompositeSchema: z.ZodType = z.lazy(() =>
-  EvaluatorCommonSchema.extend({
-    type: z.literal('composite'),
-    assertions: z.array(EvaluatorSchema).optional(),
-    evaluators: z.array(EvaluatorSchema).optional(),
-    aggregator: AggregatorSchema,
-  }),
-);
-
-const ArgsMatchSchema = z.union([
-  z.enum(['exact', 'ignore', 'subset', 'superset']),
-  z.array(z.string()),
-]);
-
-const ToolTrajectoryExpectedItemSchema = z.object({
-  tool: z.string(),
-  args: z.union([z.literal('any'), z.record(z.unknown())]).optional(),
-  max_duration_ms: z.number().min(0).optional(),
-  maxDurationMs: z.number().min(0).optional(),
-  args_match: ArgsMatchSchema.optional(),
-  argsMatch: ArgsMatchSchema.optional(),
-});
-
-const ToolTrajectorySchema = EvaluatorCommonSchema.extend({
-  type: z.literal('tool_trajectory'),
-  mode: z.enum(['any_order', 'in_order', 'exact', 'subset', 'superset']),
-  minimums: z.record(z.number().int().min(0)).optional(),
-  expected: z.array(ToolTrajectoryExpectedItemSchema).optional(),
-  args_match: ArgsMatchSchema.optional(),
-  argsMatch: ArgsMatchSchema.optional(),
-});
-
-const FieldConfigSchema = z.object({
-  path: z.string(),
-  match: z.enum(['exact', 'numeric_tolerance', 'date']),
-  required: z.boolean().optional(),
-  weight: z.number().optional(),
-  tolerance: z.number().min(0).optional(),
-  relative: z.boolean().optional(),
-  formats: z.array(z.string()).optional(),
-});
-
-const FieldAccuracySchema = EvaluatorCommonSchema.extend({
-  type: z.literal('field_accuracy'),
-  fields: z.array(FieldConfigSchema).min(1),
-  aggregation: z.enum(['weighted_average', 'all_or_nothing']).optional(),
-});
-
-const LatencySchema = EvaluatorCommonSchema.extend({
-  type: z.literal('latency'),
-  threshold: z.number().min(0),
-});
-
-const CostSchema = EvaluatorCommonSchema.extend({
-  type: z.literal('cost'),
-  budget: z.number().min(0),
-});
-
-const TokenUsageSchema = EvaluatorCommonSchema.extend({
-  type: z.literal('token_usage'),
-  max_total: z.number().min(0).optional(),
-  max_input: z.number().min(0).optional(),
-  max_output: z.number().min(0).optional(),
-});
-
-const ExecutionMetricsSchema = EvaluatorCommonSchema.extend({
-  type: z.literal('execution_metrics'),
-  max_tool_calls: z.number().min(0).optional(),
-  max_llm_calls: z.number().min(0).optional(),
-  max_tokens: z.number().min(0).optional(),
-  max_cost_usd: z.number().min(0).optional(),
-  max_duration_ms: z.number().min(0).optional(),
-  target_exploration_ratio: z.number().min(0).max(1).optional(),
-  exploration_tolerance: z.number().min(0).optional(),
-});
-
-// Note: agent_judge was removed; llm_judge now covers all judge use cases
-// including agentic behavior (auto-detected based on judge provider kind).
-// See LlmJudgeSchema above for the unified schema.
-
-const ContainsSchema = EvaluatorCommonSchema.extend({
-  type: z.literal('contains'),
-  value: z.string(),
-});
-
-const RegexSchema = EvaluatorCommonSchema.extend({
-  type: z.literal('regex'),
-  value: z.string(),
-});
-
-const IsJsonSchema = EvaluatorCommonSchema.extend({
-  type: z.literal('is_json'),
-});
-
-const EqualsSchema = EvaluatorCommonSchema.extend({
-  type: z.literal('equals'),
-  value: z.string(),
-});
-
-const RubricsSchema = EvaluatorCommonSchema.extend({
-  type: z.literal('rubrics'),
-  criteria: z.array(RubricItemSchema).min(1),
-});
-
-/** Union of all evaluator types */
-const EvaluatorSchema = z.union([
-  CodeJudgeSchema,
-  LlmJudgeSchema,
-  CompositeSchema,
-  ToolTrajectorySchema,
-  FieldAccuracySchema,
-  LatencySchema,
-  CostSchema,
-  TokenUsageSchema,
-  ExecutionMetricsSchema,
-  ContainsSchema,
-  RegexSchema,
-  IsJsonSchema,
-  EqualsSchema,
-  RubricsSchema,
-]);
-
-// ---------------------------------------------------------------------------
-// Workspace
-// ---------------------------------------------------------------------------
-
-const WorkspaceScriptSchema = z.object({
-  command: z.union([z.string(), z.array(z.string())]).optional(),
-  script: z.union([z.string(), z.array(z.string())]).optional(),
-  timeout_ms: z.number().min(0).optional(),
-  cwd: z.string().optional(),
-});
-
-const WorkspaceSchema = z.object({
-  template: z.string().optional(),
-  before_all: WorkspaceScriptSchema.optional(),
-  after_all: WorkspaceScriptSchema.optional(),
-  before_each: WorkspaceScriptSchema.optional(),
-  after_each: WorkspaceScriptSchema.optional(),
-});
-
-// ---------------------------------------------------------------------------
-// Execution block
-// ---------------------------------------------------------------------------
-
-const TrialsSchema = z.object({
-  count: z.number().int().min(1),
-  strategy: z.enum(['pass_at_k', 'mean', 'confidence_interval']).optional(),
-  cost_limit_usd: z.number().min(0).optional(),
-  costLimitUsd: z.number().min(0).optional(),
-});
-
-const ExecutionSchema = z.object({
-  target: z.string().optional(),
-  targets: z.array(z.string()).optional(),
-  assertions: z.array(EvaluatorSchema).optional(),
-  evaluators: z.array(EvaluatorSchema).optional(),
-  skip_defaults: z.boolean().optional(),
-  cache: z.boolean().optional(),
-  trials: TrialsSchema.optional(),
-  total_budget_usd: z.number().min(0).optional(),
-  totalBudgetUsd: z.number().min(0).optional(),
-});
-
-// ---------------------------------------------------------------------------
-// Test case
-// ---------------------------------------------------------------------------
-
-const EvalTestSchema = z.object({
-  id: z.string().min(1),
-  criteria: z.string().optional(),
-  expected_outcome: z.string().optional(),
-  input: InputSchema.optional(),
-  expected_output: ExpectedOutputSchema.optional(),
-  assertions: z.array(EvaluatorSchema).optional(),
-  evaluators: z.array(EvaluatorSchema).optional(),
-  execution: ExecutionSchema.optional(),
-  workspace: WorkspaceSchema.optional(),
-  metadata: z.record(z.unknown()).optional(),
-  conversation_id: z.string().optional(),
-  dataset: z.string().optional(),
-  note: z.string().optional(),
-});
-
-// ---------------------------------------------------------------------------
-// Top-level eval file
-// ---------------------------------------------------------------------------
-
-export const EvalFileSchema = z.object({
-  $schema: z.string().optional(),
-  // Metadata
-  name: z.string().regex(/^[a-z0-9-]+$/).optional(),
-  description: z.string().optional(),
-  version: z.string().optional(),
-  author: z.string().optional(),
-  tags: z.array(z.string()).optional(),
-  license: z.string().optional(),
-  requires: z.object({ agentv: z.string().optional() }).optional(),
-  // Suite-level input
-  input: InputSchema.optional(),
-  // Tests (array or external file path)
-  tests: z.union([z.array(EvalTestSchema), z.string()]),
-  // Deprecated aliases
-  eval_cases: z.union([z.array(EvalTestSchema), z.string()]).optional(),
-  // Target
-  target: z.string().optional(),
-  // Execution
-  execution: ExecutionSchema.optional(),
-  // Suite-level assertions
-  assertions: z.array(EvaluatorSchema).optional(),
-  // Workspace
-  workspace: WorkspaceSchema.optional(),
-});
-```
-
-**Step 2: Verify the file compiles**
-
-Run: `cd /home/christso/projects/agentv && bunx tsc --noEmit packages/core/src/evaluation/validation/eval-file.schema.ts --esModuleInterop --moduleResolution bundler --module esnext --target es2022 --strict`
-
-If tsc is fussy with standalone file checking, just run the full typecheck:
-Run: `bun run typecheck --filter @agentv/core`
-
-**Step 3: Commit**
-
-```bash
-git add packages/core/src/evaluation/validation/eval-file.schema.ts
-git commit -m "feat: add Zod schema for eval YAML file format"
-```
-
----
-
-### Task 3: Create the generator script
-
-**Files:**
-- Create: `packages/core/scripts/generate-eval-schema.ts`
-- Modify: `packages/core/package.json` (add script)
-
-**Step 1: Write the generator script**
-
-Create `packages/core/scripts/generate-eval-schema.ts`:
-
-```typescript
-#!/usr/bin/env bun
-/**
- * Generates eval-schema.json from the Zod schema.
- * Run: bun run generate:schema (from packages/core)
- * Or:  bun packages/core/scripts/generate-eval-schema.ts (from repo root)
- */
-import { zodToJsonSchema } from 'zod-to-json-schema';
-import { writeFile } from 'node:fs/promises';
-import path from 'node:path';
-import { EvalFileSchema } from '../src/evaluation/validation/eval-file.schema.js';
-
-const jsonSchema = zodToJsonSchema(EvalFileSchema, {
-  name: 'EvalFile',
-  $refStrategy: 'none',
-});
-
-// Add JSON Schema metadata
-const schema = {
-  $schema: 'http://json-schema.org/draft-07/schema#',
-  title: 'AgentV Eval File',
-  description: 'Schema for AgentV evaluation YAML files (.eval.yaml)',
-  ...jsonSchema,
-};
-
-const outputPath = path.resolve(
-  import.meta.dirname,
-  '../../../plugins/agentv-dev/skills/agentv-eval-builder/references/eval-schema.json',
-);
-
-await writeFile(outputPath, `${JSON.stringify(schema, null, 2)}\n`);
-console.log(`Generated: ${outputPath}`);
-```
-
-**Step 2: Add the script to package.json**
-
-Add to `packages/core/package.json` scripts:
-```json
-"generate:schema": "bun scripts/generate-eval-schema.ts"
-```
-
-**Step 3: Run the generator and verify output**
-
-Run: `cd /home/christso/projects/agentv/packages/core && bun run generate:schema`
-Expected: `Generated: .../plugins/agentv-dev/skills/agentv-eval-builder/references/eval-schema.json`
-
-Inspect the output:
-Run: `head -30 /home/christso/projects/agentv/plugins/agentv-dev/skills/agentv-eval-builder/references/eval-schema.json`
-Expected: Valid JSON with `$schema`, `title`, `properties` including `tests`, `execution`, `assertions`, etc.
-
-**Step 4: Run biome format on the generated file**
-
-Run: `cd /home/christso/projects/agentv && bunx biome format --write plugins/agentv-dev/skills/agentv-eval-builder/references/eval-schema.json`
-
-**Step 5: Commit**
-
-```bash
-git add packages/core/scripts/generate-eval-schema.ts packages/core/package.json
-git add plugins/agentv-dev/skills/agentv-eval-builder/references/eval-schema.json
-git commit -m "feat: add eval-schema.json generator from Zod schema"
-```
-
----
-
-### Task 4: Add the sync diff test
-
-**Files:**
-- Create: `packages/core/test/evaluation/validation/eval-schema-sync.test.ts`
-
-**Step 1: Write the sync test (it should pass, because the schema is already in sync from Task 3)**
-
-Create `packages/core/test/evaluation/validation/eval-schema-sync.test.ts`:
-
-```typescript
-import { describe, expect, it } from 'bun:test';
-import { readFile } from 'node:fs/promises';
-import path from 'node:path';
-import { zodToJsonSchema } from 'zod-to-json-schema';
-import { EvalFileSchema } from '../../../src/evaluation/validation/eval-file.schema.js';
-
-describe('eval-schema.json sync', () => {
-  it('matches the generated schema from Zod', async () => {
-    const repoRoot = path.resolve(import.meta.dirname, '../../../..');
-    const schemaPath = path.join(
-      repoRoot,
-      'plugins/agentv-dev/skills/agentv-eval-builder/references/eval-schema.json',
-    );
-
-    // Read committed schema
-    const committed = JSON.parse(await readFile(schemaPath, 'utf8'));
-
-    // Generate fresh schema from Zod
-    const generated = zodToJsonSchema(EvalFileSchema, {
-      name: 'EvalFile',
-      $refStrategy: 'none',
-    });
-
-    const expected = {
-      $schema: 'http://json-schema.org/draft-07/schema#',
-      title: 'AgentV Eval File',
-      description: 'Schema for AgentV evaluation YAML files (.eval.yaml)',
-      ...generated,
-    };
-
-    // Compare (ignoring formatting differences)
-    expect(JSON.parse(JSON.stringify(committed))).toEqual(
-      JSON.parse(JSON.stringify(expected)),
-    );
-  });
-});
-```
-
-**Step 2: Run the test to verify it passes**
-
-Run: `cd /home/christso/projects/agentv && bun test packages/core/test/evaluation/validation/eval-schema-sync.test.ts`
-Expected: PASS (since we just generated the schema in Task 3)
-
-**Step 3: Commit**
-
-```bash
-git add packages/core/test/evaluation/validation/eval-schema-sync.test.ts
-git commit -m "test: add eval-schema.json sync test"
-```
-
----
-
-### Task 5: Also copy generated schema to CLI dist templates
-
-**Context:** The schema is also bundled in `apps/cli/dist/templates/`. Check if this is done by the build or needs manual sync.
-
-**Step 1: Check how CLI templates reference the schema**
-
-Run: `diff plugins/agentv-dev/skills/agentv-eval-builder/references/eval-schema.json apps/cli/dist/templates/.claude/skills/agentv-eval-builder/references/eval-schema.json`
-
-If they differ, the CLI build should copy from the source. Check the CLI build process:
-Run: `grep -r "eval-schema" apps/cli/tsup.config.ts apps/cli/package.json 2>/dev/null`
-
-If no copy step exists, the template copies are stale artifacts. Either:
-- Add a copy step to the CLI build, or
-- Note this as out of scope (the CLI templates are created by `agentv create` and may have their own update cycle)
-
-**Step 2: Determine action and commit if needed**
-
-This step is investigative — commit only if a change is needed.
-
----
-
-### Task 6: Run full test suite and push
-
-**Step 1: Run all tests**
-
-Run: `cd /home/christso/projects/agentv && bun run test`
-Expected: All tests pass
-
-**Step 2: Run typecheck**
-
-Run: `cd /home/christso/projects/agentv && bun run typecheck`
-Expected: No errors
-
-**Step 3: Run lint**
-
-Run: `cd /home/christso/projects/agentv && bun run lint`
-Expected: No errors (fix any formatting issues from generated file)
-
-**Step 4: Push the branch**
-
-Run: `git push -u origin chore/update-eval-schema`
-
----
-
-### Task 7: Create PR and file follow-up issue
-
-**Step 1: Create PR**
-
-```bash
-gh pr create --title "chore: auto-generate eval-schema.json from Zod" --body "$(cat <<'EOF'
-## Summary
-- Adds a comprehensive Zod schema (`eval-file.schema.ts`) that describes the eval YAML file format
-- Generates `eval-schema.json` from this Zod schema via `zod-to-json-schema`
-- Adds a sync test that regenerates and diffs — fails if schema drifts from Zod
-
-## Motivation
-The JSON schema was manually maintained and had drifted significantly from the actual validation logic. This ensures the schema stays current as the codebase evolves.
-
-## How to update the schema
-When adding new eval features, update `eval-file.schema.ts` and run:
-```bash
-cd packages/core && bun run generate:schema
-```
-
-## Test plan
-- [ ] `bun test packages/core/test/evaluation/validation/eval-schema-sync.test.ts` passes
-- [ ] Full test suite passes
-- [ ] Schema validates against existing example eval files
-
-🤖 Generated with [Claude Code](https://claude.com/claude-code)
-EOF
-)"
-```
-
-**Step 2: File follow-up issue for Approach B**
-
-```bash
-gh issue create --title "refactor: migrate eval-validator.ts from procedural to Zod-based validation" --body "$(cat <<'EOF'
-## Context
-The eval file validation in `eval-validator.ts` uses procedural if/else logic (~500+ lines). A parallel Zod schema (`eval-file.schema.ts`) was added in #<PR_NUMBER> for JSON Schema generation, creating two sources of truth.
-
-## Proposal
-Refactor `eval-validator.ts` to use the Zod schema as the single source of truth for both:
-1. Runtime validation (Zod `.safeParse()`)
-2. JSON Schema generation (`zod-to-json-schema`)
-
-## Benefits
-- Single source of truth for eval file structure
-- Better error messages from Zod
-- Removes ~500 lines of manual validation code
-- Type-safe parsing (no type casts)
-
-## Considerations
-- The current procedural validator supports warnings (not just errors) — Zod only does pass/fail
-- Custom evaluator types use `.passthrough()` which needs careful handling
-- Backward-compatible aliases (eval_cases, script, expected_outcome) need Zod transforms
-- Extensive test coverage exists in `eval-validator.test.ts` — migration should preserve all test cases
-
-## Scope
-- `packages/core/src/evaluation/validation/eval-validator.ts` → refactor to use Zod
-- `packages/core/test/evaluation/validation/eval-validator.test.ts` → update test setup
-- Remove the separate `eval-file.schema.ts` once validator uses Zod natively
-EOF
-)"
-```
diff --git a/docs/plans/2026-03-17-github-copilot-plugin-compat-design.md b/docs/plans/2026-03-17-github-copilot-plugin-compat-design.md
deleted file mode 100644
index 5fac093b2..000000000
--- a/docs/plans/2026-03-17-github-copilot-plugin-compat-design.md
+++ /dev/null
@@ -1,79 +0,0 @@
-# GitHub Copilot Plugin Compatibility Design
-
-**Date:** 2026-03-17
-**Status:** Approved
-
-## Objective
-
-Make the agentv plugin discoverable by both Claude Code (existing) and VS Code GitHub Copilot by adding `.github/plugin/` structure alongside the existing `.claude-plugin/` layout.
-
-## Background
-
-GitHub Copilot (VS Code) discovers plugins via `.github/plugin/marketplace.json` and per-plugin `.github/plugin/plugin.json` manifests. Claude Code uses `.claude-plugin/marketplace.json`. The formats are similar but live in different directories.
-
-Reference repos:
-- `github/awesome-copilot` — community plugin marketplace
-- `WiseTechGlobal/cargowise-copilot` — production plugin example
-
-## Design Decisions
-
-1. **Dual-compatible:** Keep `.claude-plugin/` for Claude Code AND add `.github/plugin/` for VS Code Copilot.
-2. **No file moves:** Skills and agents stay nested inside `plugins/agentv-dev/`. The `.github/plugin/plugin.json` references them with relative paths.
-3. **No agent rename:** Agent files keep `.md` extension (no `.agent.md` rename).
-4. **Hooks plugin excluded:** `agentv-claude-trace` stays Claude Code-only. VS Code Copilot hooks use a different format.
-
-## New Files
-
-### `.github/plugin/marketplace.json`
-
-Root-level marketplace entry for GitHub Copilot discovery:
-
-```json
-{
-  "plugins": [
-    {
-      "name": "agentv-dev",
-      "source": "agentv-dev",
-      "description": "Development skills for building and optimizing AgentV evaluations",
-      "version": "1.0.0"
-    }
-  ]
-}
-```
-
-### `plugins/agentv-dev/.github/plugin/plugin.json`
-
-Per-plugin manifest in GitHub Copilot format:
-
-```json
-{
-  "name": "agentv-dev",
-  "description": "Development skills for building and optimizing AgentV evaluations",
-  "version": "1.0.0",
-  "author": { "name": "AgentV" },
-  "repository": "https://github.com/EntityProcess/agentv",
-  "license": "MIT",
-  "keywords": ["eval", "testing", "agent", "benchmarks"],
-  "agents": ["./agents"],
-  "skills": [
-    "./skills/agentv-bench",
-    "./skills/agentv-eval-analyzer",
-    "./skills/agentv-eval-writer",
-    "./skills/agentv-onboarding",
-    "./skills/agentv-trace-analyst"
-  ]
-}
-```
-
-## Unchanged Files
-
-- `.claude-plugin/marketplace.json` — untouched
-- All `SKILL.md` files — already compatible format
-- All agent `.md` files — no rename
-- `plugins/agentv-claude-trace/` — Claude Code only
-
-## Result
-
-The repo is discoverable by both:
-- **Claude Code** via `.claude-plugin/marketplace.json`
-- **VS Code GitHub Copilot** via `.github/plugin/marketplace.json`
diff --git a/docs/plans/2026-03-17-github-copilot-plugin-compat-plan.md b/docs/plans/2026-03-17-github-copilot-plugin-compat-plan.md
deleted file mode 100644
index bf97f4492..000000000
--- a/docs/plans/2026-03-17-github-copilot-plugin-compat-plan.md
+++ /dev/null
@@ -1,140 +0,0 @@
-# GitHub Copilot Plugin Compatibility — Implementation Plan
-
-> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
-
-**Goal:** Add `.github/plugin/` structure so the agentv plugin is discoverable by VS Code GitHub Copilot alongside existing Claude Code support.
-
-**Architecture:** Two new JSON files create the GitHub Copilot discovery layer. The root `.github/plugin/marketplace.json` lists available plugins. Each plugin gets a `.github/plugin/plugin.json` manifest. All existing files stay untouched.
-
-**Tech Stack:** JSON manifests, no code changes.
-
----
-
-### Task 1: Create root `.github/plugin/marketplace.json`
-
-**Files:**
-- Create: `.github/plugin/marketplace.json`
-
-**Step 1: Create the directory and file**
-
-```bash
-mkdir -p .github/plugin
-```
-
-Then create `.github/plugin/marketplace.json`:
-
-```json
-{
-  "plugins": [
-    {
-      "name": "agentv-dev",
-      "source": "agentv-dev",
-      "description": "Development skills for building and optimizing AgentV evaluations",
-      "version": "1.0.0"
-    }
-  ]
-}
-```
-
-**Step 2: Validate JSON is well-formed**
-
-Run: `python3 -c "import json; json.load(open('.github/plugin/marketplace.json')); print('OK')"`
-Expected: `OK`
-
-**Step 3: Commit**
-
-```bash
-git add .github/plugin/marketplace.json
-git commit -m "feat: add .github/plugin/marketplace.json for VS Code Copilot discovery"
-```
-
----
-
-### Task 2: Create per-plugin `.github/plugin/plugin.json`
-
-**Files:**
-- Create: `plugins/agentv-dev/.github/plugin/plugin.json`
-
-**Step 1: Create the directory and file**
-
-```bash
-mkdir -p plugins/agentv-dev/.github/plugin
-```
-
-Then create `plugins/agentv-dev/.github/plugin/plugin.json`:
-
-```json
-{
-  "name": "agentv-dev",
-  "description": "Development skills for building and optimizing AgentV evaluations",
-  "version": "1.0.0",
-  "author": {
-    "name": "AgentV"
-  },
-  "repository": "https://github.com/EntityProcess/agentv",
-  "license": "MIT",
-  "keywords": [
-    "eval",
-    "testing",
-    "agent",
-    "benchmarks"
-  ],
-  "agents": [
-    "./agents"
-  ],
-  "skills": [
-    "./skills/agentv-bench",
-    "./skills/agentv-eval-analyzer",
-    "./skills/agentv-eval-writer",
-    "./skills/agentv-onboarding",
-    "./skills/agentv-trace-analyst"
-  ]
-}
-```
-
-**Step 2: Validate JSON is well-formed**
-
-Run: `python3 -c "import json; json.load(open('plugins/agentv-dev/.github/plugin/plugin.json')); print('OK')"`
-Expected: `OK`
-
-**Step 3: Verify agent and skill paths resolve**
-
-Run: `ls plugins/agentv-dev/agents/ && ls -d plugins/agentv-dev/skills/agentv-bench plugins/agentv-dev/skills/agentv-eval-analyzer plugins/agentv-dev/skills/agentv-eval-writer plugins/agentv-dev/skills/agentv-onboarding plugins/agentv-dev/skills/agentv-trace-analyst`
-Expected: All paths exist, no errors.
-
-**Step 4: Commit**
-
-```bash
-git add plugins/agentv-dev/.github/plugin/plugin.json
-git commit -m "feat: add agentv-dev plugin.json for VS Code Copilot compatibility"
-```
-
----
-
-### Task 3: Verify dual compatibility
-
-**Step 1: Confirm Claude Code marketplace is untouched**
-
-Run: `cat .claude-plugin/marketplace.json | python3 -c "import sys,json; d=json.load(sys.stdin); assert len(d['plugins'])==2; print('Claude Code: OK')"`
-Expected: `Claude Code: OK`
-
-**Step 2: Confirm GitHub Copilot marketplace is valid**
-
-Run: `cat .github/plugin/marketplace.json | python3 -c "import sys,json; d=json.load(sys.stdin); assert d['plugins'][0]['name']=='agentv-dev'; print('GitHub Copilot: OK')"`
-Expected: `GitHub Copilot: OK`
-
-**Step 3: Confirm per-plugin manifest references resolve**
-
-Run: `python3 -c "
-import json, os
-p = json.load(open('plugins/agentv-dev/.github/plugin/plugin.json'))
-base = 'plugins/agentv-dev'
-for a in p['agents']:
-    path = os.path.join(base, a)
-    assert os.path.isdir(path), f'Missing: {path}'
-for s in p['skills']:
-    path = os.path.join(base, s, 'SKILL.md')
-    assert os.path.isfile(path), f'Missing: {path}'
-print('All paths resolve: OK')
-"`
-Expected: `All paths resolve: OK`
diff --git a/docs/plans/2026-03-25-threshold-flag-design.md b/docs/plans/2026-03-25-threshold-flag-design.md
deleted file mode 100644
index 29c6b5e74..000000000
--- a/docs/plans/2026-03-25-threshold-flag-design.md
+++ /dev/null
@@ -1,76 +0,0 @@
-# Design: `--threshold` flag for suite-level quality gates
-
-**Issue:** #698
-**Date:** 2026-03-25
-
-## Objective
-
-Add a `--threshold` CLI flag to `agentv eval` that fails (exit 1) if the mean score across all tests falls below the specified threshold. This enables CI/CD quality gating without needing `agentv compare --baseline`.
-
-## CLI Flag
-
-- `--threshold <number>` on `agentv eval run` (0–1 scale)
-- Optional — if omitted, no threshold check (current behavior preserved)
-- Overrides `execution.threshold` from YAML if both set
-
-## YAML Config
-
-Add `threshold` to the `execution` block in eval YAML files:
-
-```yaml
-execution:
-  threshold: 0.8
-```
-
-Both `threshold` and `execution.threshold` accepted (snake_case wire format convention).
-
-## Score Evaluation
-
-After all tests complete:
-
-1. Compute mean score from quality results only (excluding `execution_error` tests — same as existing `calculateEvaluationSummary()`)
-2. If mean score < threshold → exit code 1
-3. Execution errors fail independently via existing `fail_on_error` mechanism (separate concern)
-4. If no quality results exist (all execution errors), threshold check is skipped
-
-## Output
-
-When threshold is active, append a summary line after the existing result summary:
-
-```
-Suite score: 0.53 (threshold: 0.60) — FAIL
-```
-
-or:
-
-```
-Suite score: 0.85 (threshold: 0.60) — PASS
-```
-
-## JUnit Integration
-
-The JUnit writer uses the threshold for per-test pass/fail:
-
-- If threshold is set: `score < threshold` → `<failure>` element
-- If threshold is not set: `score < 0.5` (current hardcoded behavior preserved)
-
-## Exit Code
-
-- Exit 0: mean score >= threshold (or no threshold set)
-- Exit 1: mean score < threshold
-- Execution errors handled separately by `fail_on_error`
-
-## Files to Modify
-
-1. `packages/core/src/evaluation/validation/eval-file.schema.ts` — add `threshold` to ExecutionSchema
-2. `apps/cli/src/commands/eval/commands/run.ts` — add `--threshold` CLI flag
-3. `apps/cli/src/commands/eval/run-eval.ts` — pass threshold through, check after results
-4. `apps/cli/src/commands/eval/statistics.ts` — add threshold summary formatting
-5. `apps/cli/src/commands/eval/junit-writer.ts` — use threshold for pass/fail
-6. Tests for new behavior
-
-## Non-Goals
-
-- Per-test threshold override (use `required` for that)
-- Replacement for `agentv compare` regression gating
-- Severity levels (#334)
diff --git a/docs/plans/2026-03-25-threshold-flag-plan.md b/docs/plans/2026-03-25-threshold-flag-plan.md
deleted file mode 100644
index 57ba2eb53..000000000
--- a/docs/plans/2026-03-25-threshold-flag-plan.md
+++ /dev/null
@@ -1,562 +0,0 @@
-# `--threshold` Flag Implementation Plan
-
-> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
-
-**Goal:** Add a `--threshold` CLI flag and `execution.threshold` YAML field to `agentv eval` that exits 1 when mean quality score falls below the threshold.
-
-**Architecture:** The threshold value flows from CLI flag or YAML config through the existing options pipeline. After all tests complete, the summary is checked against the threshold. JUnit writer also uses the threshold for per-test pass/fail.
-
-**Tech Stack:** TypeScript, cmd-ts (CLI parsing), Zod (schema validation), Vitest (testing)
-
----
-
-### Task 1: Add `extractThreshold` to core config-loader
-
-**Files:**
-- Modify: `packages/core/src/evaluation/loaders/config-loader.ts:287` (after `extractTotalBudgetUsd`)
-- Test: `packages/core/test/evaluation/loaders/config-loader.test.ts`
-
-**Step 1: Write the failing tests**
-
-Add to `packages/core/test/evaluation/loaders/config-loader.test.ts` after the `extractFailOnError` describe block:
-
-```typescript
-describe('extractThreshold', () => {
-  it('returns undefined when no execution block', () => {
-    const suite: JsonObject = { tests: [] };
-    expect(extractThreshold(suite)).toBeUndefined();
-  });
-
-  it('returns undefined when threshold not set', () => {
-    const suite: JsonObject = { execution: { target: 'default' } };
-    expect(extractThreshold(suite)).toBeUndefined();
-  });
-
-  it('parses valid threshold', () => {
-    const suite: JsonObject = { execution: { threshold: 0.8 } };
-    expect(extractThreshold(suite)).toBe(0.8);
-  });
-
-  it('accepts 0 as threshold', () => {
-    const suite: JsonObject = { execution: { threshold: 0 } };
-    expect(extractThreshold(suite)).toBe(0);
-  });
-
-  it('accepts 1 as threshold', () => {
-    const suite: JsonObject = { execution: { threshold: 1 } };
-    expect(extractThreshold(suite)).toBe(1);
-  });
-
-  it('returns undefined for negative threshold', () => {
-    const suite: JsonObject = { execution: { threshold: -0.1 } };
-    expect(extractThreshold(suite)).toBeUndefined();
-  });
-
-  it('returns undefined for threshold > 1', () => {
-    const suite: JsonObject = { execution: { threshold: 1.5 } };
-    expect(extractThreshold(suite)).toBeUndefined();
-  });
-
-  it('returns undefined for non-number threshold', () => {
-    const suite: JsonObject = { execution: { threshold: 'high' } };
-    expect(extractThreshold(suite)).toBeUndefined();
-  });
-});
-```
-
-Also add `extractThreshold` to the import at the top of the test file.
-
-**Step 2: Run tests to verify they fail**
-
-Run: `bun test packages/core/test/evaluation/loaders/config-loader.test.ts`
-Expected: FAIL — `extractThreshold` not found
-
-**Step 3: Implement `extractThreshold`**
-
-Add to `packages/core/src/evaluation/loaders/config-loader.ts` after `extractTotalBudgetUsd` (after line ~308):
-
-```typescript
-/**
- * Extract `execution.threshold` from parsed eval suite.
- * Accepts a number in [0, 1] range.
- * Returns undefined when not specified.
- */
-export function extractThreshold(suite: JsonObject): number | undefined {
-  const execution = suite.execution;
-  if (!execution || typeof execution !== 'object' || Array.isArray(execution)) {
-    return undefined;
-  }
-
-  const executionObj = execution as Record<string, unknown>;
-  const raw = executionObj.threshold;
-
-  if (raw === undefined || raw === null) {
-    return undefined;
-  }
-
-  if (typeof raw === 'number' && raw >= 0 && raw <= 1) {
-    return raw;
-  }
-
-  logWarning(
-    `Invalid execution.threshold: ${raw}. Must be a number between 0 and 1. Ignoring.`,
-  );
-  return undefined;
-}
-```
-
-**Step 4: Run tests to verify they pass**
-
-Run: `bun test packages/core/test/evaluation/loaders/config-loader.test.ts`
-Expected: PASS
-
-**Step 5: Commit**
-
-```bash
-git add packages/core/src/evaluation/loaders/config-loader.ts packages/core/test/evaluation/loaders/config-loader.test.ts
-git commit -m "feat(core): add extractThreshold for execution.threshold YAML field (#698)"
-```
-
----
-
-### Task 2: Wire `extractThreshold` through YAML parser and schema
-
-**Files:**
-- Modify: `packages/core/src/evaluation/yaml-parser.ts:12` (imports), `:58` (re-exports), `:204` (loadTestSuite)
-- Modify: `packages/core/src/evaluation/yaml-parser.ts:168` (EvalSuiteResult type)
-- Modify: `packages/core/src/evaluation/validation/eval-file.schema.ts:317` (ExecutionSchema)
-
-**Step 1: Add `threshold` to ExecutionSchema in eval-file.schema.ts**
-
-In `packages/core/src/evaluation/validation/eval-file.schema.ts`, add to the `ExecutionSchema` object (after `failOnError` at line 330):
-
-```typescript
-  threshold: z.number().min(0).max(1).optional(),
-```
-
-**Step 2: Add to EvalSuiteResult type in yaml-parser.ts**
-
-In `packages/core/src/evaluation/yaml-parser.ts`, add to the `EvalSuiteResult` type (after `failOnError` at line 182):
-
-```typescript
-  /** Suite-level quality threshold (0-1) — suite fails if mean score is below */
-  readonly threshold?: number;
-```
-
-**Step 3: Import and re-export `extractThreshold` in yaml-parser.ts**
-
-Add `extractThreshold` to the import from `./loaders/config-loader.js` (line 12 area) and the re-export block (line 58 area).
-
-**Step 4: Use in `loadTestSuite`**
-
-In the `loadTestSuite` function (around line 203), extract and return threshold:
-
-```typescript
-  const threshold = extractThreshold(parsed);
-  return {
-    tests,
-    trials: extractTrialsConfig(parsed),
-    targets: extractTargetsFromSuite(parsed),
-    workers: extractWorkersFromSuite(parsed),
-    cacheConfig: extractCacheConfig(parsed),
-    totalBudgetUsd: extractTotalBudgetUsd(parsed),
-    ...(metadata !== undefined && { metadata }),
-    ...(failOnError !== undefined && { failOnError }),
-    ...(threshold !== undefined && { threshold }),
-  };
-```
-
-**Step 5: Regenerate the JSON schema**
-
-Run: `bun run generate:schema`
-
-**Step 6: Run core tests**
-
-Run: `bun test packages/core/test/evaluation/loaders/config-loader.test.ts`
-Expected: PASS
-
-**Step 7: Commit**
-
-```bash
-git add packages/core/src/evaluation/validation/eval-file.schema.ts packages/core/src/evaluation/yaml-parser.ts
-git commit -m "feat(core): wire extractThreshold through YAML parser and schema (#698)"
-```
-
----
-
-### Task 3: Add `--threshold` CLI flag and pass through to run-eval
-
-**Files:**
-- Modify: `apps/cli/src/commands/eval/commands/run.ts` (add CLI flag)
-- Modify: `apps/cli/src/commands/eval/run-eval.ts` (NormalizedOptions, normalizeOptions, handler return)
-
-**Step 1: Add CLI flag to run.ts**
-
-In `apps/cli/src/commands/eval/commands/run.ts`, add after the `model` option (around line 171):
-
-```typescript
-    threshold: option({
-      type: optional(number),
-      long: 'threshold',
-      description: 'Suite-level quality gate: exit 1 if mean score falls below this value (0-1)',
-    }),
-```
-
-And add `threshold: args.threshold` to the `rawOptions` object in the handler (around line 219).
-
-**Step 2: Add to NormalizedOptions in run-eval.ts**
-
-In `apps/cli/src/commands/eval/run-eval.ts`, add to the `NormalizedOptions` interface:
-
-```typescript
-  readonly threshold?: number;
-```
-
-**Step 3: Add to normalizeOptions**
-
-In the `normalizeOptions` function, add threshold resolution (CLI > YAML):
-
-```typescript
-  // Resolve threshold: CLI --threshold > YAML execution.threshold
-  const cliThreshold = normalizeOptionalNumber(rawOptions.threshold);
-```
-
-And in the return statement:
-
-```typescript
-    threshold: cliThreshold,
-```
-
-**Step 4: Wire YAML threshold into normalized options**
-
-In `runEvalCommand`, after `prepareEvalFile` returns, merge the YAML threshold if CLI didn't set one. In the loop over eval files (around the `prepareEvalFile` call), capture `suite.threshold` and pass it through.
-
-The cleanest approach: read the YAML threshold in `prepareEvalFile` and return it alongside the other fields. Then in the main `runEvalCommand`, resolve CLI vs YAML threshold.
-
-Add `threshold` to the `prepareEvalFile` return type (alongside `failOnError`):
-
-```typescript
-  readonly threshold?: number;
-```
-
-And in `prepareEvalFile`, add after `failOnError: suite.failOnError`:
-
-```typescript
-    threshold: suite.threshold,
-```
-
-**Step 5: Commit**
-
-```bash
-git add apps/cli/src/commands/eval/commands/run.ts apps/cli/src/commands/eval/run-eval.ts
-git commit -m "feat(cli): add --threshold flag and wire through options pipeline (#698)"
-```
-
----
-
-### Task 4: Add threshold check and summary output after eval completes
-
-**Files:**
-- Modify: `apps/cli/src/commands/eval/run-eval.ts` (after summary calculation ~line 1152)
-- Modify: `apps/cli/src/commands/eval/statistics.ts` (add `formatThresholdSummary`)
-- Test: `apps/cli/test/commands/eval/threshold.test.ts` (new)
-
-**Step 1: Write failing tests**
-
-Create `apps/cli/test/commands/eval/threshold.test.ts`:
-
-```typescript
-import { describe, expect, it } from 'bun:test';
-
-import type { EvaluationResult } from '@agentv/core';
-
-import { formatThresholdSummary } from '../../../src/commands/eval/statistics.js';
-
-function makeResult(overrides: Partial<EvaluationResult> = {}): EvaluationResult {
-  return {
-    timestamp: '2024-01-01T00:00:00Z',
-    testId: 'test-1',
-    score: 1.0,
-    assertions: [{ text: 'criterion-1', passed: true }],
-    output: [{ role: 'assistant' as const, content: 'answer' }],
-    target: 'default',
-    ...overrides,
-  };
-}
-
-describe('formatThresholdSummary', () => {
-  it('returns PASS when mean score meets threshold', () => {
-    const result = formatThresholdSummary(0.85, 0.6);
-    expect(result.passed).toBe(true);
-    expect(result.message).toContain('0.85');
-    expect(result.message).toContain('0.60');
-    expect(result.message).toContain('PASS');
-  });
-
-  it('returns FAIL when mean score is below threshold', () => {
-    const result = formatThresholdSummary(0.53, 0.6);
-    expect(result.passed).toBe(false);
-    expect(result.message).toContain('0.53');
-    expect(result.message).toContain('0.60');
-    expect(result.message).toContain('FAIL');
-  });
-
-  it('returns PASS when mean score exactly equals threshold', () => {
-    const result = formatThresholdSummary(0.6, 0.6);
-    expect(result.passed).toBe(true);
-  });
-
-  it('returns PASS for threshold 0 even when the score is 0', () => {
-    const result = formatThresholdSummary(0, 0);
-    expect(result.passed).toBe(true);
-  });
-});
-```
-
-**Step 2: Run tests to verify they fail**
-
-Run: `bun test apps/cli/test/commands/eval/threshold.test.ts`
-Expected: FAIL — `formatThresholdSummary` not found
-
-**Step 3: Implement `formatThresholdSummary` in statistics.ts**
-
-Add to `apps/cli/src/commands/eval/statistics.ts`:
-
-```typescript
-/**
- * Format a threshold check summary line.
- * Returns whether the threshold was met and the formatted message.
- */
-export function formatThresholdSummary(
-  meanScore: number,
-  threshold: number,
-): { passed: boolean; message: string } {
-  const passed = meanScore >= threshold;
-  const verdict = passed ? 'PASS' : 'FAIL';
-  const message = `Suite score: ${meanScore.toFixed(2)} (threshold: ${threshold.toFixed(2)}) — ${verdict}`;
-  return { passed, message };
-}
-```
-
-**Step 4: Run tests to verify they pass**
-
-Run: `bun test apps/cli/test/commands/eval/threshold.test.ts`
-Expected: PASS
-
-**Step 5: Wire the threshold check into run-eval.ts**
-
-In `apps/cli/src/commands/eval/run-eval.ts`, after the summary is printed (around line 1153), add:
-
-```typescript
-    // Threshold quality gate check
-    const resolvedThreshold = options.threshold ?? yamlThreshold;
-    if (resolvedThreshold !== undefined) {
-      // formatThresholdSummary comes from the static './statistics.js' import at the top of this file.
-      const thresholdResult = formatThresholdSummary(summary.mean, resolvedThreshold);
-      console.log(`\n${thresholdResult.message}`);
-      if (!thresholdResult.passed) {
-        process.exitCode = 1;
-      }
-    }
-```
-
-Note: `yamlThreshold` must be captured from the `prepareEvalFile` results. The CLI value always takes precedence; when it is not set and multiple eval files are run, use the first defined YAML threshold.
-
-Import `formatThresholdSummary` statically at the top (preferred over dynamic import since it's in the same package):
-
-```typescript
-import {
-  calculateEvaluationSummary,
-  formatEvaluationSummary,
-  formatMatrixSummary,
-  formatThresholdSummary,
-} from './statistics.js';
-```
-
-**Step 6: Commit**
-
-```bash
-git add apps/cli/src/commands/eval/statistics.ts apps/cli/src/commands/eval/run-eval.ts apps/cli/test/commands/eval/threshold.test.ts
-git commit -m "feat(cli): add threshold check with summary output after eval (#698)"
-```
-
----
-
-### Task 5: JUnit writer uses threshold for per-test pass/fail
-
-**Files:**
-- Modify: `apps/cli/src/commands/eval/junit-writer.ts`
-- Modify: `apps/cli/test/commands/eval/output-writers.test.ts` (add tests)
-
-**Step 1: Write failing tests**
-
-Add to `apps/cli/test/commands/eval/output-writers.test.ts` in the JUnit describe block:
-
-```typescript
-  it('uses custom threshold for pass/fail when provided', async () => {
-    const filePath = path.join(testDir, `junit-threshold-${Date.now()}.xml`);
-    const writer = await JunitWriter.open(filePath, { threshold: 0.8 });
-
-    await writer.append(makeResult({ testId: 'high', score: 0.9 }));
-    await writer.append(makeResult({ testId: 'mid', score: 0.6 }));
-    await writer.close();
-
-    const xml = await readFile(filePath, 'utf8');
-    expect(xml).not.toContain('<failure message="score=0.900"');
-    expect(xml).toContain('<failure message="score=0.600"');
-  });
-
-  it('defaults to 0.5 threshold when none provided', async () => {
-    const filePath = path.join(testDir, `junit-default-${Date.now()}.xml`);
-    const writer = await JunitWriter.open(filePath);
-
-    await writer.append(makeResult({ testId: 'pass', score: 0.6 }));
-    await writer.append(makeResult({ testId: 'fail', score: 0.3 }));
-    await writer.close();
-
-    const xml = await readFile(filePath, 'utf8');
-    expect(xml).not.toContain('<failure message="score=0.600"');
-    expect(xml).toContain('<failure message="score=0.300"');
-  });
-```
-
-**Step 2: Run tests to verify they fail**
-
-Run: `bun test apps/cli/test/commands/eval/output-writers.test.ts`
-Expected: FAIL — `JunitWriter.open` doesn't accept options
-
-**Step 3: Implement threshold support in JunitWriter**
-
-Modify `apps/cli/src/commands/eval/junit-writer.ts`:
-
-```typescript
-export interface JunitWriterOptions {
-  readonly threshold?: number;
-}
-
-export class JunitWriter {
-  private readonly filePath: string;
-  private readonly results: EvaluationResult[] = [];
-  private readonly threshold: number;
-  private closed = false;
-
-  private constructor(filePath: string, options?: JunitWriterOptions) {
-    this.filePath = filePath;
-    this.threshold = options?.threshold ?? 0.5;
-  }
-
-  static async open(filePath: string, options?: JunitWriterOptions): Promise<JunitWriter> {
-    await mkdir(path.dirname(filePath), { recursive: true });
-    return new JunitWriter(filePath, options);
-  }
-```
-
-Then replace all `r.score < 0.5` with `r.score < this.threshold` in the `close()` method.
-
-**Step 4: Pass threshold to JunitWriter in output-writer.ts**
-
-In `apps/cli/src/commands/eval/output-writer.ts`, where JunitWriter is created, pass the threshold. Check how output writers are created and thread the threshold through.
-
-**Step 5: Run tests to verify they pass**
-
-Run: `bun test apps/cli/test/commands/eval/output-writers.test.ts`
-Expected: PASS
-
-**Step 6: Commit**
-
-```bash
-git add apps/cli/src/commands/eval/junit-writer.ts apps/cli/src/commands/eval/output-writer.ts apps/cli/test/commands/eval/output-writers.test.ts
-git commit -m "feat(cli): JUnit writer uses --threshold for per-test pass/fail (#698)"
-```
-
----
-
-### Task 6: Add `threshold` to Zod schema and regenerate JSON schema
-
-**Files:**
-- Modify: `packages/core/src/evaluation/validation/eval-file.schema.ts` (already done in Task 2)
-- Run: `bun run generate:schema`
-
-**Step 1: Verify threshold is in ExecutionSchema**
-
-Read `packages/core/src/evaluation/validation/eval-file.schema.ts` and confirm `threshold` was added in Task 2.
-
-**Step 2: Regenerate JSON schema**
-
-Run: `bun run generate:schema`
-
-**Step 3: Run validate:examples to check existing YAML files still pass**
-
-Run: `bun run validate:examples`
-Expected: PASS (threshold is optional, so existing files are unaffected)
-
-**Step 4: Commit if schema file changed**
-
-```bash
-git add packages/core/
-git commit -m "chore: regenerate eval-schema.json with threshold field (#698)"
-```
-
----
-
-### Task 7: Run full test suite and verify
-
-**Step 1: Run all tests**
-
-Run: `bun run test`
-Expected: PASS (except any pre-existing known failures)
-
-**Step 2: Run typecheck**
-
-Run: `bun run typecheck`
-Expected: PASS
-
-**Step 3: Run lint**
-
-Run: `bun run lint`
-Expected: PASS
-
-**Step 4: Run build**
-
-Run: `bun run build`
-Expected: PASS
-
----
-
-### Task 8: Manual red/green UAT
-
-**Step 1: Red — verify no threshold behavior on main**
-
-Run an eval without --threshold:
-
-```bash
-bun apps/cli/src/cli.ts eval examples/features/rubric/evals/dataset.eval.yaml --test-id summary-1
-```
-
-Confirm: no "Suite score" line in output, exit code is 0.
-
-**Step 2: Green — verify --threshold works**
-
-Run with a threshold that should PASS:
-
-```bash
-bun apps/cli/src/cli.ts eval examples/features/rubric/evals/dataset.eval.yaml --test-id summary-1 --threshold 0.3
-```
-
-Confirm: "Suite score: X.XX (threshold: 0.30) — PASS" printed, exit code 0.
-
-Run with a threshold that should FAIL:
-
-```bash
-bun apps/cli/src/cli.ts eval examples/features/rubric/evals/dataset.eval.yaml --test-id summary-1 --threshold 0.99
-```
-
-Confirm: "Suite score: X.XX (threshold: 0.99) — FAIL" printed, exit code 1.
-
-**Step 3: Verify JUnit output uses threshold**
-
-```bash
-bun apps/cli/src/cli.ts eval examples/features/rubric/evals/dataset.eval.yaml --test-id summary-1 --threshold 0.9 -o /tmp/test-threshold.xml
-```
-
-Inspect the XML: tests with score < 0.9 should have `<failure>` elements.