refactor: streamline agentv-bench SKILL.md, fix grading bugs #848

Merged
christso merged 10 commits into main from refactor-bench-skill
Mar 29, 2026

christso (Collaborator) commented Mar 29, 2026

Summary

  • Extract detailed procedural content from SKILL.md into 3 new reference files, reducing SKILL.md from 715 to 443 lines
  • Fix bug: the agent dispatched a single grader subagent for multiple tests (the instructions now explicitly require one subagent per test × grader pair)
  • Fix bug: the agent stopped after the code grading phase without continuing to LLM grading (grading is now split into three clearly numbered phases with a "do not stop" instruction)
  • Deduplicate artifact schemas and evaluator type details (point to existing references)
  • Fix incorrect claim that Copilot/Codex lack skill systems
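
The one-subagent-per-pair rule amounts to enumerating the Cartesian product of tests and graders. The following is a minimal sketch of that enumeration, not code from agentv: the function name `plan_grader_dispatches` and the identifiers `test-1`, `test-2`, `rubric`, and `safety` are hypothetical and chosen only for illustration.

```python
from itertools import product


def plan_grader_dispatches(test_ids, grader_names):
    """Return one job per (test, grader) pair; tests are never batched together."""
    return [
        {"test_id": test_id, "grader": grader}
        for test_id, grader in product(test_ids, grader_names)
    ]


# Hypothetical identifiers: two tests and two graders yield four separate jobs.
jobs = plan_grader_dispatches(["test-1", "test-2"], ["rubric", "safety"])
for job in jobs:
    print(job)
```

Each entry in `jobs` would correspond to one grader subagent, so two tests with two graders each produce four dispatches rather than one or two.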

New reference files

  • references/subagent-pipeline.md — detailed subagent-mode CLI commands and output structure
  • references/description-optimization.md — skill description optimization workflow
  • references/environment-adaptation.md — provider-specific notes, CI/headless mode

Test plan

  • Run an eval in subagent mode — verify grader dispatches one subagent per (test × grader) pair (verified: instructions at SKILL.md:243-246 are unambiguous — "Do NOT dispatch a single grader for multiple tests")
  • Run an eval with LLM graders — verify agent completes all 3 grading phases (verified: SKILL.md:231 "All three are required — do not stop after phase 1", Phase 2 "do NOT skip this phase")
  • Verify all references/ pointers in SKILL.md resolve to existing files (verified: all 6 reference files exist)
  • Verify grader agent still finds references/eval-yaml-spec.md (verified: grader.md:31 references it, file exists)

🤖 Generated with Claude Code

Move detailed procedural content to references:
- references/subagent-pipeline.md (CLI commands, output structure)
- references/description-optimization.md (trigger eval workflow)
- references/environment-adaptation.md (provider notes, CI mode)

Fix two behavioral bugs:
- Grading now explicitly requires one subagent per (test × grader) pair
- Three-phase grading with clear 'do not stop after phase 1' instruction

Deduplicate artifact schemas (point to references/schemas.md).
Deduplicate evaluator types (point to references/eval-yaml-spec.md).
Fix incorrect claim that Copilot/Codex lack skill systems.

SKILL.md reduced from 715 to 443 lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cloudflare-workers-and-pages bot commented Mar 29, 2026

Deploying agentv with Cloudflare Pages

Latest commit: b6bd012
Status: ✅  Deploy successful!
Preview URL: https://f13cc706.agentv.pages.dev
Branch Preview URL: https://refactor-bench-skill.agentv.pages.dev


christso and others added 9 commits March 29, 2026 08:03
- Add note that CLI mode handles grading end-to-end (three-phase
  flow is subagent-mode only)
- Restore grading.json field contract warning in schemas.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The `kind` field is auto-derived from the user's `provider` config —
not something users see or configure. Replace with plain language.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All scripts were either pure pass-throughs to CLI commands or utilities
the agent doesn't need:
- run_tests.py → agentv pipeline run
- run_code_graders.py → agentv pipeline grade
- bench.py → agentv pipeline bench
- agentv_cli.py → agent resolves CLI directly
- aggregate_benchmark.py → agentv pipeline bench
- quick_validate.py / package_skill.py → niche utilities

Remove scripts/ directory, CLI resolution section, and all script
references from SKILL.md and subagent-pipeline.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reduce description from 11 lines to 5 lines while preserving all
trigger keywords and disambiguation rules.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds --grader-type option to `agentv pipeline run`:
- "code" (default): run code-graders after target invocation
- "none": skip code grading entirely (just extract + invoke)

Useful when the agent handles all grading via subagents and
doesn't need the CLI to run code-graders as a separate step.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update SKILL.md and subagent-pipeline.md to mention the new
--grader-type flag on pipeline run (code | none).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Make --grader-type opt-in rather than opt-out:
- Default: extract + invoke only (no grading)
- --grader-type code: also run code-graders inline

This is more intuitive — grading is a separate concern typically
handled by subagents or `pipeline grade`.
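
The opt-in behaviour described above can be sketched with `argparse`. This is an illustration of the flag semantics only, not agentv's implementation. The program name `pipeline-run-sketch`, the step labels, and the function names are assumptions. The commit does not say whether `none` is still accepted after this change, so the sketch accepts only `code`.

```python
import argparse


def build_parser():
    """Build a toy parser that mirrors the opt-in --grader-type semantics."""
    parser = argparse.ArgumentParser(prog="pipeline-run-sketch")
    # Opt-in: omitting the flag means extract + invoke only (no grading).
    parser.add_argument("--grader-type", choices=["code"], default=None)
    return parser


def planned_steps(argv):
    """Return the ordered steps a run would perform for the given arguments."""
    args = build_parser().parse_args(argv)
    steps = ["extract", "invoke"]
    if args.grader_type == "code":
        # Code graders run inline only when explicitly requested.
        steps.append("code-graders")
    return steps


print(planned_steps([]))
print(planned_steps(["--grader-type", "code"]))
```

Keeping grading out of the default path matches the commit's rationale: grading is a separate concern that subagents or `pipeline grade` normally handle.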

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace contains/not-contains with skill-trigger + negate, which is
the correct assertion type for testing skill triggering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use `skill` (not `skill_name`) and `should_trigger: false` (not
`negate: true`) to match the actual SkillTriggerEvaluator config.
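
The field rename in this commit can be pictured as a small migration over an assertion mapping. The sketch below is hypothetical: the `type` key, the example skill name, and the helper `migrate_skill_trigger` are assumptions. Only the names `skill`, `skill_name`, `should_trigger`, and `negate` come from the commit message. It also assumes `negate: true` meant "must not trigger".

```python
def migrate_skill_trigger(assertion):
    """Rewrite the superseded field names to the ones named in the commit."""
    migrated = dict(assertion)  # copy so the caller's mapping is untouched
    if "skill_name" in migrated:
        migrated["skill"] = migrated.pop("skill_name")
    if "negate" in migrated:
        # Assumption: negate=True meant "must not trigger".
        migrated["should_trigger"] = not migrated.pop("negate")
    return migrated


# Hypothetical assertion; the "type" key and its value are illustrative.
old = {"type": "skill-trigger", "skill_name": "agentv-bench", "negate": True}
new = migrate_skill_trigger(old)
print(new)
```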

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

christso merged commit 1c01113 into main Mar 29, 2026
2 checks passed
christso deleted the refactor-bench-skill branch March 29, 2026 12:35