feat(pipeline): agent-mode reliability improvements by christso · Pull Request #793 · EntityProcess/agentv

christso · 2026-03-27T07:48:57Z

Summary

This PR addresses four issues identified from observing an agent struggle to run evals reliably via the agent-mode pipeline on Windows. All changes are in the pipeline commands and the bundled agentv-bench skill.

Issues addressed

#789 — Fix Windows subprocess in Python scripts

The bundled scripts (run_tests.py, run_code_graders.py, bench.py) called subprocess.run(["agentv", ...]), which fails on Windows because agentv resolves to a .ps1 wrapper. They now resolve the executable path with shutil.which("agentv") before invoking it, which works on all platforms.
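The pattern described above can be sketched as follows. This is a minimal illustration, not the code from the bundled scripts; the helper names `resolve_executable` and `run_cli` are hypothetical. On Windows, `shutil.which` also consults `PATHEXT`, so it can find wrapper files that a bare command name passed to `subprocess.run` would miss.

```python
import shutil
import subprocess


def resolve_executable(name: str) -> str:
    """Return the full path for `name`, or raise if it cannot be found."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(f"{name!r} was not found on PATH")
    return path


def run_cli(name: str, *args: str) -> subprocess.CompletedProcess:
    """Run a CLI by its resolved path instead of its bare name.

    Hypothetical helper: the real scripts may pass different options.
    """
    return subprocess.run(
        [resolve_executable(name), *args],
        capture_output=True,
        text=True,
    )
```

A call such as `run_cli("agentv", "pipeline", "bench", ...)` would then fail early with a clear error when the CLI is not installed, rather than with a platform-specific launch error.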

Also added .env auto-loading to run_tests.py so target commands inherit required environment variables (e.g. SEARCH_SERVICE, OPENAI_API_KEY).
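One way to load a `.env` file without extra dependencies is sketched below. This is an assumption about the general approach, not the code added to run_tests.py; the function name `load_dotenv_file` and its `override` parameter are hypothetical, and the parser handles only plain `KEY=VALUE` lines, comments, and simple quoting.

```python
import os
from pathlib import Path


def load_dotenv_file(path: Path, override: bool = False) -> dict[str, str]:
    """Copy KEY=VALUE pairs from `path` into os.environ.

    Returns only the pairs that were actually set. Existing variables
    are left alone unless `override` is True.
    """
    loaded: dict[str, str] = {}
    if not path.is_file():
        return loaded
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")  # drop surrounding quotes
        if not key:
            continue
        if override or key not in os.environ:
            os.environ[key] = value
            loaded[key] = value
    return loaded
```

Because child processes started with `subprocess.run` inherit `os.environ` by default, variables loaded this way are visible to the target commands.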

#790 — Add `--llm-scores` flag to `pipeline bench`

pipeline bench previously accepted LLM scores only via stdin. In PowerShell, the < operator is reserved, so agentv pipeline bench <dir> < llm_scores.json fails with a parse error. Added --llm-scores <path> as an optional flag. When the flag is omitted, the command falls back to stdin, so existing usage keeps working.
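The flag-with-stdin-fallback behavior can be illustrated with a short Python sketch. The actual command is implemented in TypeScript (bench.ts), so this is only an analogy under stated assumptions: the function name `read_llm_scores` is hypothetical, and the scores file is assumed to be JSON without any particular schema.

```python
import argparse
import json
import sys
from typing import Any, Optional, TextIO


def read_llm_scores(argv: list[str], stdin: Optional[TextIO] = None) -> Any:
    """Read scores from --llm-scores if given, otherwise from stdin."""
    parser = argparse.ArgumentParser(prog="bench-sketch")
    parser.add_argument("export_dir")
    parser.add_argument("--llm-scores", dest="llm_scores", default=None)
    args = parser.parse_args(argv)

    if args.llm_scores is not None:
        # Explicit path: no shell redirection needed.
        with open(args.llm_scores, encoding="utf-8") as fh:
            return json.load(fh)

    # Backward-compatible path: scores piped or redirected into stdin.
    stream = stdin if stdin is not None else sys.stdin
    return json.load(stream)
```

Checking the flag first and falling back to stdin keeps existing piped invocations working while giving shells without `<` redirection a usable alternative.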

#791 — Add `pipeline run` combined command

Added agentv pipeline run <eval> --out <dir> which combines input extraction, CLI target invocation (parallel), and code grading into a single command. Loads .env from the eval directory automatically.
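The parallel target invocation mentioned above could look roughly like the sketch below. The real command lives in run.ts and is written in TypeScript; this Python version only illustrates the idea of fanning out CLI calls and collecting their output in input order. The function name `run_targets` and the `max_workers` default are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_targets(commands: list[list[str]], max_workers: int = 4) -> list[str]:
    """Run each command in a worker thread and return stdout per command.

    Results come back in the same order as `commands`, because
    Executor.map preserves input order.
    """

    def invoke(cmd: list[str]) -> str:
        result = subprocess.run(cmd, capture_output=True, text=True)
        return result.stdout

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(invoke, commands))
```

Threads are sufficient here because each worker spends its time waiting on a subprocess rather than running Python code.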

Before this PR, agent-mode required 4 steps; after, it's 2 commands plus the agent's LLM grading step:

```
agentv pipeline run evals/my.eval.yaml --out .agentv/results/export/run-1
# ... agent grades responses, writes llm_scores.json ...
agentv pipeline bench .agentv/results/export/run-1 --llm-scores llm_scores.json
```
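For callers that drive the two commands from a script, a minimal sketch might look like this. Only the subcommands and flags shown in the example above come from this PR; the function names `build_commands` and `run_pipeline` and the `grade` callback are hypothetical stand-ins, with `grade` representing the agent's LLM grading step.

```python
import shutil
import subprocess
from typing import Callable


def build_commands(
    exe: str, eval_file: str, out_dir: str, scores_file: str
) -> tuple[list[str], list[str]]:
    """Build the argument lists for the run and bench steps."""
    run_cmd = [exe, "pipeline", "run", eval_file, "--out", out_dir]
    bench_cmd = [exe, "pipeline", "bench", out_dir, "--llm-scores", scores_file]
    return run_cmd, bench_cmd


def run_pipeline(
    eval_file: str,
    out_dir: str,
    scores_file: str,
    grade: Callable[[str, str], None],
) -> None:
    """Run both commands, calling `grade(out_dir, scores_file)` in between."""
    exe = shutil.which("agentv")
    if exe is None:
        raise FileNotFoundError("agentv was not found on PATH")
    run_cmd, bench_cmd = build_commands(exe, eval_file, out_dir, scores_file)
    subprocess.run(run_cmd, check=True)
    grade(out_dir, scores_file)  # hypothetical: writes the scores file
    subprocess.run(bench_cmd, check=True)
```

Passing argument lists instead of a shell string avoids quoting and redirection differences between PowerShell and POSIX shells.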

#792 — Update agentv-bench skill docs for cross-platform use

- Changed the recommended workflow from Python wrapper scripts to direct CLI commands
- Updated the quick-reference example to show the 2-command `pipeline run` → `pipeline bench` flow
- Added `--llm-scores` to all bench examples
- Added guidance for non-subagent environments (VS Code Copilot, Codex) where grader subagent dispatch isn't available

Files changed

| File | Change |
| --- | --- |
| `apps/cli/src/commands/pipeline/run.ts` | New `pipeline run` command |
| `apps/cli/src/commands/pipeline/bench.ts` | Add `--llm-scores` flag |
| `apps/cli/src/commands/pipeline/index.ts` | Register `run` subcommand |
| `plugins/agentv-dev/skills/agentv-bench/SKILL.md` | Updated instructions |
| `plugins/agentv-dev/skills/agentv-bench/scripts/run_tests.py` | `shutil.which` + `.env` loading |
| `plugins/agentv-dev/skills/agentv-bench/scripts/run_code_graders.py` | `shutil.which` fix |
| `plugins/agentv-dev/skills/agentv-bench/scripts/bench.py` | `shutil.which` fix |

Closes #789
Closes #790
Closes #791
Closes #792

feat(pipeline): agent-mode reliability improvements

7173597

- Add --llm-scores flag to pipeline bench (#790)
- Add pipeline run combined command (#791)
- Fix Windows subprocess in Python scripts (#789)
- Update agentv-bench skill docs for cross-platform use (#792)

christso merged commit 5c583dc into main Mar 27, 2026
2 checks passed

christso deleted the feat/pipeline-agent-mode-improvements branch March 27, 2026 09:29

christso merged 1 commit into main from feat/pipeline-agent-mode-improvements
