refactor: streamline agentv-bench SKILL.md, fix grading bugs by christso · Pull Request #848 · EntityProcess/agentv

christso · 2026-03-29T08:00:20Z

Summary

Extract detailed procedural content from SKILL.md into 3 new reference files, reducing SKILL.md from 715 to 443 lines
Fix bug: the agent dispatched a single grader subagent for multiple tests (SKILL.md now explicitly requires one subagent per test × grader pair)
Fix bug: the agent stopped after the code grading phase without continuing to LLM grading (grading is now three clearly numbered phases with a "do not stop" instruction)
Deduplicate artifact schemas and evaluator type details (point to existing references)
Fix incorrect claim that Copilot/Codex lack skill systems
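
The one-subagent-per-pair rule amounts to iterating over the Cartesian product of tests and graders. A minimal sketch of that enumeration, assuming hypothetical test and grader names; in practice the dispatch is done by the agent following SKILL.md, not by code like this:

```python
from itertools import product


def plan_grader_dispatch(test_ids, grader_names):
    """Return one (test_id, grader_name) job per test x grader pair."""
    return [(test_id, grader) for test_id, grader in product(test_ids, grader_names)]


# Hypothetical names, for illustration only.
jobs = plan_grader_dispatch(["test-01", "test-02"], ["rubric", "tone"])
for test_id, grader in jobs:
    print(f"dispatch grader subagent: test={test_id} grader={grader}")
```

Two tests and two graders yield four separate jobs, never one job covering several tests.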

New reference files

references/subagent-pipeline.md — detailed subagent-mode CLI commands and output structure
references/description-optimization.md — skill description optimization workflow
references/environment-adaptation.md — provider-specific notes, CI/headless mode

Test plan

Run an eval in subagent mode — verify grader dispatches one subagent per (test × grader) pair (verified: instructions at SKILL.md:243-246 are unambiguous — "Do NOT dispatch a single grader for multiple tests")
Run an eval with LLM graders — verify agent completes all 3 grading phases (verified: SKILL.md:231 "All three are required — do not stop after phase 1", Phase 2 "do NOT skip this phase")
Verify all references/ pointers in SKILL.md resolve to existing files (verified: all 6 reference files exist)
Verify grader agent still finds references/eval-yaml-spec.md (verified: grader.md:31 references it, file exists)
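
The reference-pointer check in the test plan could also be automated. A rough sketch, assuming pointers look like `references/<name>.md` and resolve relative to the directory that holds SKILL.md (both are assumptions, and the function name is hypothetical):

```python
import re
from pathlib import Path

# Assumed pointer shape: references/<name>.md
REFERENCE_PATTERN = re.compile(r"references/[\w.-]+\.md")


def find_missing_references(skill_md: Path) -> list[str]:
    """List references/*.md pointers in skill_md that do not exist on disk."""
    text = skill_md.read_text(encoding="utf-8")
    pointers = sorted(set(REFERENCE_PATTERN.findall(text)))
    # Resolve each pointer relative to the directory containing skill_md.
    return [p for p in pointers if not (skill_md.parent / p).is_file()]
```

Calling `find_missing_references(Path("SKILL.md"))` from the skill directory returns an empty list when every pointer resolves.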

🤖 Generated with Claude Code

Move detailed procedural content to references:
- references/subagent-pipeline.md (CLI commands, output structure)
- references/description-optimization.md (trigger eval workflow)
- references/environment-adaptation.md (provider notes, CI mode)
Fix two behavioral bugs:
- Grading now explicitly requires one subagent per (test × grader) pair
- Three-phase grading with clear 'do not stop after phase 1' instruction
Deduplicate artifact schemas (point to references/schemas.md).
Deduplicate evaluator types (point to references/eval-yaml-spec.md).
Fix incorrect claim that Copilot/Codex lack skill systems.
SKILL.md reduced from 715 to 443 lines.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cloudflare-workers-and-pages · 2026-03-29T08:01:12Z

Deploying agentv with Cloudflare Pages

Latest commit:	`b6bd012`
Status:	✅ Deploy successful!
Preview URL:	https://f13cc706.agentv.pages.dev
Branch Preview URL:	https://refactor-bench-skill.agentv.pages.dev


- Add note that CLI mode handles grading end-to-end (three-phase flow is subagent-mode only)
- Restore grading.json field contract warning in schemas.md
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The `kind` field is auto-derived from the user's `provider` config; it is not something users see or configure. Replace it with plain language.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

All scripts were either pure pass-throughs to CLI commands or utilities the agent doesn't need:
- run_tests.py → agentv pipeline run
- run_code_graders.py → agentv pipeline grade
- bench.py → agentv pipeline bench
- agentv_cli.py → agent resolves CLI directly
- aggregate_benchmark.py → agentv pipeline bench
- quick_validate.py / package_skill.py → niche utilities
Remove scripts/ directory, CLI resolution section, and all script references from SKILL.md and subagent-pipeline.md.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Reduce description from 11 lines to 5 lines while preserving all trigger keywords and disambiguation rules.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Adds --grader-type option to `agentv pipeline run`:
- "code" (default): run code-graders after target invocation
- "none": skip code grading entirely (just extract + invoke)
Useful when the agent handles all grading via subagents and doesn't need the CLI to run code-graders as a separate step.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update SKILL.md and subagent-pipeline.md to mention the new --grader-type flag on pipeline run (code | none).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Make --grader-type opt-in rather than opt-out:
- Default: extract + invoke only (no grading)
- --grader-type code: also run code-graders inline
This is more intuitive: grading is a separate concern typically handled by subagents or `pipeline grade`.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
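
With the opt-in behaviour described above, a wrapper only adds the flag when inline code grading is wanted. A hedged sketch of building the command line: the `--grader-type code` flag and the `agentv pipeline run` subcommand come from the commit messages, while the positional eval path and the helper name are assumptions to check against the CLI's own help output:

```python
def pipeline_run_argv(eval_path, grader_type=None):
    """Build an argv list for `agentv pipeline run`.

    Passing eval_path as a positional argument is an assumption, not
    something the commit messages confirm.
    """
    argv = ["agentv", "pipeline", "run", eval_path]
    if grader_type is not None:
        # Opt in to inline grading, e.g. grader_type="code".
        argv += ["--grader-type", grader_type]
    return argv


# Hypothetical eval file path, for illustration only.
print(pipeline_run_argv("evals/example.yaml"))
print(pipeline_run_argv("evals/example.yaml", grader_type="code"))
```

The first call yields a command that only extracts and invokes; the second also requests code grading in the same run.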

Replace contains/not-contains with skill-trigger + negate, which is the correct assertion type for testing skill triggering.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Use `skill` (not `skill_name`) and `should_trigger: false` (not `negate: true`) to match the actual SkillTriggerEvaluator config.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
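
The key rename in this last fix can be illustrated as a small dictionary rewrite. This is a sketch only: the `skill` and `should_trigger` keys come from the commit message, while the `type` key, the helper name, and the mapping of `negate: false` to `should_trigger: true` are assumptions:

```python
def fix_skill_trigger(assertion: dict) -> dict:
    """Return a copy of a skill-trigger assertion using the corrected keys."""
    fixed = {}
    for key, value in assertion.items():
        if key == "skill_name":
            # skill_name -> skill
            fixed["skill"] = value
        elif key == "negate":
            # negate: true -> should_trigger: false
            fixed["should_trigger"] = not value
        else:
            fixed[key] = value
    return fixed


# The "type" key is assumed for illustration.
old = {"type": "skill-trigger", "skill_name": "agentv-bench", "negate": True}
print(fix_skill_trigger(old))
# prints {'type': 'skill-trigger', 'skill': 'agentv-bench', 'should_trigger': False}
```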

christso and others added 9 commits March 29, 2026 08:03

fix: address code review findings

6b8ed12


fix: remove internal kind: "agent" references from skill docs

de53c8c


refactor: condense SKILL.md frontmatter description

f6b689f


docs: document --grader-type flag in skill references

95970c9


fix: use skill-trigger assertion in description optimization examples

957ee35


fix: correct skill-trigger assertion syntax

b6bd012


christso merged commit 1c01113 into main Mar 29, 2026
2 checks passed

christso deleted the refactor-bench-skill branch March 29, 2026 12:35

christso merged 10 commits into main from refactor-bench-skill
