
feat(evals): set default targets so all evals work out of the box #898

Draft
christso wants to merge 14 commits into main from feat/default-targets

Conversation


@christso christso commented Apr 1, 2026

Summary

  • Every eval file under examples/ and evals/ now declares its own target, so agentv eval run works without a global --target flag
  • Added copilot, vscode, and copilot-log targets to root .agentv/targets.yaml so matrix evals and specialized evals resolve correctly
  • Updated the CI workflow (evals.yml) to discover all eval files by default and make --target optional — each eval uses its own target

Details

  • 17 eval files gained target: default (LLM-only evals)
  • 1 eval file gained target: copilot-log (copilot transcript evaluation)
  • Fixed invalid name field in benchmark-tooling eval (spaces → kebab-case)
  • Workflow default patterns now cover evals/**/*.eval.yaml, examples/**/*.eval.yaml, examples/**/*.EVAL.yaml, examples/**/EVAL.yaml
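
For illustration, an LLM-only eval file after this change might look like the fragment below. The name and elided fields are placeholders; only the `target: default` line is what this PR adds:

```yaml
# Sketch of an eval file gaining an explicit target.
# `target: default` is the field added by this PR; everything
# else in the file is unchanged and elided here.
name: grading-example   # placeholder name
target: default
# ... prompts and assertions as before ...
```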

Test plan

  • bun run validate:examples — all 53 example evals valid
  • Dry-run all 278 tests across all eval files — no target resolution errors
  • Pre-push hooks pass (build, typecheck, lint, test, validate)

🤖 Generated with Claude Code

Every eval file under examples/ and evals/ now declares its own target,
so running `agentv eval run` no longer requires a global --target flag.
This lets the CI workflow run all evals without forcing a single target
(like copilot-cli) that may not suit every eval.

Changes:
- Add `target: default` to 17 eval files that were missing a target
- Add `target: copilot-log` to the copilot-log eval
- Add copilot, vscode, and copilot-log targets to root targets.yaml
- Update evals.yml workflow: default patterns cover all eval files,
  --target is now optional (each eval uses its own)
- Fix invalid name in benchmark-tooling eval (spaces → kebab-case)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Apr 1, 2026

Deploying agentv with Cloudflare Pages

Latest commit: 63bed93
Status: ✅  Deploy successful!
Preview URL: https://663553a4.agentv.pages.dev
Branch Preview URL: https://feat-default-targets.agentv.pages.dev


christso and others added 13 commits April 1, 2026 12:00
Every eval file now declares its own target:
- `target: default` — LLM-only evals (grading, text generation)
- `target: agent` — coding agent evals (env-var-driven via
  AGENT_PROVIDER + AGENT_MODEL, defaults to copilot-cli)
- Specialized targets (mock_agent, copilot-log, batch_cli, etc.)
  resolve via per-example .agentv/targets.yaml

Added env-var-driven `agent` target to root targets.yaml so CI and
local dev can control which coding agent runs without editing eval
files.

Tags:
- `tags: [agent]` on evals requiring a coding agent or infrastructure
- `tags: [multi-provider]` on multi-model-benchmark (excluded from CI)

Workflow changes:
- Default patterns discover all eval files across examples/ and evals/
- --target is now optional (each eval uses its own)
- AGENT_PROVIDER/AGENT_MODEL written to .env for agent target resolution
- Multi-model-benchmark excluded from default CI sweep
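
An illustrative `evals.yml` fragment for the `.env` step (the step name and layout are assumptions; only the AGENT_PROVIDER/AGENT_MODEL-to-`.env` behavior comes from this PR):

```yaml
# Hypothetical CI step sketch; not the literal workflow contents.
- name: Write agent selection to .env
  run: |
    echo "AGENT_PROVIDER=${AGENT_PROVIDER:-copilot-cli}" >> .env
    echo "AGENT_MODEL=${AGENT_MODEL}" >> .env
```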

Other fixes:
- Removed deprecated vscode target references
- Fixed invalid name in benchmark-tooling eval (spaces → kebab-case)
- Converted matrix-evaluation from multi-target to single agent target

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The `default` target in root targets.yaml now resolves via AGENT_PROVIDER
+ AGENT_MODEL env vars (defaults to copilot-cli in CI). Evals without an
explicit target automatically use default, so no target field is needed.
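
The env-var fallback described above could be sketched roughly as follows. The function name and return shape are illustrative, not the actual agentv implementation; only the `AGENT_PROVIDER`/`AGENT_MODEL` variables and the copilot-cli default come from this PR:

```typescript
// Hypothetical sketch of how the default target might resolve its
// provider and model from the environment.
function resolveAgentTarget(
  env: Record<string, string | undefined>,
): { provider: string; model?: string } {
  return {
    provider: env.AGENT_PROVIDER ?? "copilot-cli", // CI default
    model: env.AGENT_MODEL, // undefined means "use the provider's default"
  };
}
```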

Evals with specialized targets (copilot-log, batch_cli, mock_agent, etc.)
keep their explicit `execution.target` — these resolve via per-example
.agentv/targets.yaml files.
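
Concretely, a specialized eval keeps a fragment like this (surrounding keys elided; only the `execution.target` key is confirmed by this PR):

```yaml
# Fragment: an eval pinning its own specialized target, resolved
# via the nearest per-example .agentv/targets.yaml.
execution:
  target: copilot-log
```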

Tags:
- `tags: [agent]` on evals requiring a coding agent or infrastructure
- `tags: [multi-provider]` on multi-model-benchmark (excluded from CI)

Workflow:
- Default patterns discover all eval files
- --target is optional (each eval uses its own or falls back to default)
- AGENT_PROVIDER/AGENT_MODEL written to .env
- Only multi-model-benchmark excluded from default CI sweep

Other:
- Removed deprecated vscode target references
- Converted matrix-evaluation from multi-target to single default target
- Fixed invalid name in benchmark-tooling eval

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The CLI doesn't support !glob negation. List showcase subdirectories
explicitly, excluding only multi-model-benchmark.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Patterns prefixed with ! are now treated as exclusions, passed to
fast-glob's ignore option. This lets CI workflows exclude specific
eval directories:

  agentv eval run 'examples/**/*.eval.yaml' '!examples/showcase/multi-model-benchmark/**'

Updated the evals workflow to use this instead of explicit include lists.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
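
A minimal sketch of the `!` handling described in that commit. The helper below is hypothetical; fast-glob's real `ignore` option takes plain patterns without the `!` prefix, which is why the prefix is stripped:

```typescript
// Hypothetical helper: partition CLI glob arguments into fast-glob's
// positive patterns and its `ignore` list. Arguments starting with
// "!" are exclusions; the prefix is stripped before the glob call.
function splitPatterns(args: string[]): { patterns: string[]; ignore: string[] } {
  const patterns: string[] = [];
  const ignore: string[] = [];
  for (const arg of args) {
    if (arg.startsWith("!")) {
      ignore.push(arg.slice(1));
    } else {
      patterns.push(arg);
    }
  }
  return { patterns, ignore };
}
```

fast-glob would then be invoked roughly as `fg(patterns, { ignore })`.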
The explicit --targets flag forces the root targets.yaml and prevents
per-example targets (batch_cli, mock_agent, etc.) from being found.
Let the CLI auto-discover targets.yaml by walking up from each eval file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
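
The walk-up discovery could be sketched as below. This is an assumption about the mechanism, not the CLI's actual code; the real implementation presumably also checks which candidates exist on disk before loading one:

```typescript
import * as path from "node:path";

// Hypothetical sketch of targets.yaml discovery: starting from an
// eval file, walk parent directories and collect candidate
// .agentv/targets.yaml paths, nearest first, stopping at the repo root.
function candidateTargetFiles(evalFile: string, repoRoot: string): string[] {
  const candidates: string[] = [];
  let dir = path.dirname(evalFile);
  while (true) {
    candidates.push(path.join(dir, ".agentv", "targets.yaml"));
    if (dir === repoRoot) break;
    const parent = path.dirname(dir);
    if (parent === dir) break; // reached the filesystem root
    dir = parent;
  }
  return candidates;
}
```

This ordering lets a per-example `.agentv/targets.yaml` shadow the root one, which is why the explicit `--targets` flag broke resolution.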
The workspace_template field was removed from target definitions.
These mock targets relied on it but the eval files already define
workspace.template at the eval level.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The psychotherapy evals use `target: gemini-llm`, which requires
GOOGLE_GENERATIVE_AI_API_KEY and GEMINI_MODEL_NAME to be set.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Added `llm` target to root targets.yaml (GH Models, no agent binary)
- LLM-only evals now set `execution.target: llm`
- Agent evals omit target (falls back to default = copilot via env vars)
- export-screening uses its per-example mock target (no change needed)
- Added pi-cli install to CI workflow
- Added Gemini credentials to CI .env

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Changed agent-plugin-review from pi-cli to default target (copilot).
Added OPENROUTER credentials to CI .env for evals that need them.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
agent-skills-evals (missing echo.ts), batch-cli (custom runner script),
code-grader-sdk and local-cli (need uv + mock_cli.py) all require local
setup that isn't available on the CI runner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Created .agentv/providers/echo.ts for agent-skills-evals (was never
  committed — convention-based provider that echoes input back)
- Installed uv on CI runner so local-cli and code-grader-sdk evals
  can run their Python mock scripts
- Removed CI exclusions for local script evals (all deps now available)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Strengthened system prompts so assertions pass with gpt-5-mini:
- JSON evals: explicit "no markdown, no code blocks, raw JSON only"
- equals evals: "respond with ONLY the number, nothing else"
- starts-with evals: "you MUST start every response with X"
- icontains-all evals: system prompt lists required phrases
- Removed expected_output where it served no assertion purpose
- Changed azure-llm override in basic eval to llm target

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>