refactor: streamline agentv-bench SKILL.md, fix grading bugs #848
Merged
Move detailed procedural content to references:
- references/subagent-pipeline.md (CLI commands, output structure)
- references/description-optimization.md (trigger eval workflow)
- references/environment-adaptation.md (provider notes, CI mode)

Fix two behavioral bugs:
- Grading now explicitly requires one subagent per (test × grader) pair
- Three-phase grading with clear 'do not stop after phase 1' instruction

Deduplicate artifact schemas (point to references/schemas.md).
Deduplicate evaluator types (point to references/eval-yaml-spec.md).
Fix incorrect claim that Copilot/Codex lack skill systems.

SKILL.md reduced from 715 to 443 lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
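To make the "one subagent per (test × grader) pair" rule concrete, here is a minimal sketch of how the grading jobs could be enumerated. It is not taken from agentv: the function name and the test and grader identifiers are hypothetical and chosen only for illustration.

```python
from itertools import product


def grading_jobs(test_ids, grader_names):
    """Return one (test, grader) job per pair, in a stable order.

    Hypothetical helper: each returned tuple stands for one subagent
    dispatch, so no grader is silently skipped for any test.
    """
    return list(product(test_ids, grader_names))


# Hypothetical identifiers, for illustration only.
jobs = grading_jobs(["test-01", "test-02"], ["rubric", "tone", "accuracy"])
for test_id, grader in jobs:
    print(f"dispatch subagent: test={test_id} grader={grader}")
```

With 2 tests and 3 graders this yields 6 jobs, which is the count a reviewer can check against the number of grading results actually produced.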
Deploying agentv:

| Latest commit: | b6bd012 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://f13cc706.agentv.pages.dev |
| Branch Preview URL: | https://refactor-bench-skill.agentv.pages.dev |

- Add note that CLI mode handles grading end-to-end (three-phase flow is subagent-mode only)
- Restore grading.json field contract warning in schemas.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The `kind` field is auto-derived from the user's `provider` config; it is not something users see or configure. Replace with plain language.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

All scripts were either pure pass-throughs to CLI commands or utilities the agent doesn't need:
- run_tests.py → agentv pipeline run
- run_code_graders.py → agentv pipeline grade
- bench.py → agentv pipeline bench
- agentv_cli.py → agent resolves CLI directly
- aggregate_benchmark.py → agentv pipeline bench
- quick_validate.py / package_skill.py → niche utilities

Remove scripts/ directory, CLI resolution section, and all script references from SKILL.md and subagent-pipeline.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Reduce description from 11 lines to 5 lines while preserving all trigger keywords and disambiguation rules.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Adds --grader-type option to `agentv pipeline run`:
- "code" (default): run code-graders after target invocation
- "none": skip code grading entirely (just extract + invoke)

Useful when the agent handles all grading via subagents and doesn't need the CLI to run code-graders as a separate step.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update SKILL.md and subagent-pipeline.md to mention the new --grader-type flag on pipeline run (code | none).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Make --grader-type opt-in rather than opt-out:
- Default: extract + invoke only (no grading)
- --grader-type code: also run code-graders inline

This is more intuitive: grading is a separate concern typically handled by subagents or `pipeline grade`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
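The opt-in behaviour described in this commit can be sketched with a small argument parser. This is an illustration only, not agentv's implementation: the parser, the `planned_steps` helper, and the step names are hypothetical. The commit does not say whether `none` is still accepted after the change, so the sketch models only `code`.

```python
import argparse


def build_parser():
    # Hypothetical stand-in for the `pipeline run` argument parser.
    parser = argparse.ArgumentParser(prog="pipeline-run-sketch")
    parser.add_argument(
        "--grader-type",
        choices=["code"],
        default=None,  # opt-in: no grading unless the flag is passed
        help="also run code-graders inline after invoking the target",
    )
    return parser


def planned_steps(argv):
    """Return the steps the run would perform for the given arguments."""
    args = build_parser().parse_args(argv)
    steps = ["extract", "invoke"]
    if args.grader_type == "code":
        steps.append("code-graders")
    return steps


print(planned_steps([]))
print(planned_steps(["--grader-type", "code"]))
```

Leaving the default as `None` keeps the extract-and-invoke path free of grading side effects, which matches the commit's reasoning that grading is a separate concern.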

Replace contains/not-contains with skill-trigger + negate, which is the correct assertion type for testing skill triggering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Use `skill` (not `skill_name`) and `should_trigger: false` (not `negate: true`) to match the actual SkillTriggerEvaluator config.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
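For readers unfamiliar with the assertion, the following sketch shows how a config using `skill` and `should_trigger` might be evaluated. Only those two field names come from the commit message. The function, the `"type"` key, the default of `True` for `should_trigger`, and the idea of passing in a list of triggered skills are assumptions made for illustration, not the SkillTriggerEvaluator's actual code.

```python
def skill_trigger_passes(config, triggered_skills):
    """Hypothetical check: pass when the skill fired exactly as expected.

    Assumes `should_trigger` defaults to True when omitted.
    """
    expected = config.get("should_trigger", True)
    fired = config["skill"] in triggered_skills
    return fired == expected


# Negative case: the skill must NOT trigger for this prompt.
negative_case = {"type": "skill-trigger", "skill": "agentv-bench", "should_trigger": False}

print(skill_trigger_passes(negative_case, []))
print(skill_trigger_passes(negative_case, ["agentv-bench"]))
```

Expressing the negative case as `should_trigger: False` states the expectation directly, where a separate `negate` flag would invert a positive check.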
Summary

New reference files
- references/subagent-pipeline.md: detailed subagent-mode CLI commands and output structure
- references/description-optimization.md: skill description optimization workflow
- references/environment-adaptation.md: provider-specific notes, CI/headless mode

Test plan
- references/ pointers in SKILL.md resolve to existing files (verified: all 6 reference files exist)
- references/eval-yaml-spec.md (verified: grader.md:31 references it, file exists)

🤖 Generated with Claude Code