
feat(pipeline): agent-mode reliability improvements#793

Merged
christso merged 1 commit into main from feat/pipeline-agent-mode-improvements on Mar 27, 2026

Conversation

@christso
Collaborator

Summary

This PR addresses four issues identified from observing an agent struggle to run evals reliably via the agent-mode pipeline on Windows. All changes are in the pipeline commands and the bundled agentv-bench skill.

Issues addressed

#789 — Fix Windows subprocess in Python scripts

The bundled scripts (run_tests.py, run_code_graders.py, bench.py) invoked subprocess.run(["agentv", ...]), which fails on Windows because agentv resolves to a .ps1 wrapper that subprocess cannot launch directly. They now resolve the executable with shutil.which("agentv"), which returns a launchable path on all platforms.

Also added .env auto-loading to run_tests.py so target commands inherit required environment variables (e.g. SEARCH_SERVICE, OPENAI_API_KEY).
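A minimal .env loader along these lines would suffice (a sketch, not the exact implementation in run_tests.py; load_dotenv is a hypothetical name):

```python
import os
from pathlib import Path


def load_dotenv(path: Path) -> dict[str, str]:
    """Minimal .env parser: KEY=VALUE lines, '#' comments ignored.

    Parsed values are merged into os.environ with setdefault, so
    variables already set in the shell take precedence. Returns the
    key/value pairs read from the file.
    """
    loaded: dict[str, str] = {}
    if not path.is_file():
        return loaded
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("'\"")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded
```

Child processes spawned with subprocess then inherit os.environ, so the target command sees SEARCH_SERVICE, OPENAI_API_KEY, and friends without extra plumbing.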

#790 — Add --llm-scores flag to pipeline bench

pipeline bench previously only accepted LLM scores via stdin. In PowerShell, the < operator is reserved, so agentv pipeline bench <dir> < llm_scores.json fails with a parse error. Added --llm-scores <path> as an optional flag. Falls back to stdin when omitted (backward compatible).

#791 — Add pipeline run combined command

Added agentv pipeline run <eval> --out <dir> which combines input extraction, CLI target invocation (parallel), and code grading into a single command. Loads .env from the eval directory automatically.

Before this PR, agent-mode required 4 steps; after, it's 2 commands plus the agent's LLM grading step:

agentv pipeline run evals/my.eval.yaml --out .agentv/results/export/run-1
# ... agent grades responses, writes llm_scores.json ...
agentv pipeline bench .agentv/results/export/run-1 --llm-scores llm_scores.json

#792 — Update agentv-bench skill docs for cross-platform use

  • Changed the recommended workflow from Python wrapper scripts to direct CLI commands
  • Updated the quick-reference example to show the two-command pipeline run → pipeline bench flow
  • Added --llm-scores to all bench examples
  • Added guidance for non-subagent environments (VS Code Copilot, Codex) where grader subagent dispatch isn't available

Files changed

  • apps/cli/src/commands/pipeline/run.ts — New pipeline run command
  • apps/cli/src/commands/pipeline/bench.ts — Add --llm-scores flag
  • apps/cli/src/commands/pipeline/index.ts — Register run subcommand
  • plugins/agentv-dev/skills/agentv-bench/SKILL.md — Updated instructions
  • plugins/agentv-dev/skills/agentv-bench/scripts/run_tests.py — shutil.which + .env loading
  • plugins/agentv-dev/skills/agentv-bench/scripts/run_code_graders.py — shutil.which fix
  • plugins/agentv-dev/skills/agentv-bench/scripts/bench.py — shutil.which fix

Closes #789
Closes #790
Closes #791
Closes #792

- Add --llm-scores flag to pipeline bench (#790)
- Add pipeline run combined command (#791)
- Fix Windows subprocess in Python scripts (#789)
- Update agentv-bench skill docs for cross-platform use (#792)
@christso christso merged commit 5c583dc into main Mar 27, 2026
2 checks passed
@christso christso deleted the feat/pipeline-agent-mode-improvements branch March 27, 2026 09:29