Improve agent-plugin-review skill to pass remaining 3 eval tests #779

@christso

Summary

The agent-plugin-review skill passes 6/9 eval tests against pi-cli (mean score 0.722). Three tests fail consistently:

| Test | Score | Issue |
|---|---|---|
| detect-relative-file-paths | 0.500 | Partially detected: skill mentions leading / but agent doesn't consistently flag it |
| detect-repeated-inputs | 0.000 | Missed: agent doesn't suggest top-level input for repeated file references |
| detect-missing-hard-gates | 0.000 | Missed: agent doesn't flag missing artifact existence checks between phases |
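
The reported mean can be sanity-checked against the table. The sketch below assumes that each of the six passing tests scores 1.0; the issue does not state this and lists only the failing scores.

```typescript
// Assumption: each of the six passing tests scores 1.0 (not stated in the issue).
const passingScores: number[] = Array(6).fill(1.0);
// Scores of the three failing tests, taken from the table.
const failingScores: number[] = [0.5, 0.0, 0.0];

const allScores: number[] = [...passingScores, ...failingScores];
const meanScore: number =
  allScores.reduce((sum, score) => sum + score, 0) / allScores.length;

console.log(meanScore.toFixed(3)); // 6.5 / 9, prints "0.722"
```

Under that assumption the numbers are consistent: fully fixing all three failing tests would raise the mean to 1.000.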

Approach

Use the agentv-bench eval-driven iteration loop:

  1. Analyze the failing test transcripts to understand what the agent does instead
  2. Identify which SKILL.md instructions are unclear or missing
  3. Make targeted edits to the skill
  4. Re-run evals to verify improvement
  5. Repeat until all 9 pass
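
The control flow of the loop above might be sketched roughly as follows. This is only an illustration: `EvalResult`, `runEvals`, `editSkill`, and `maxRounds` are hypothetical names, not part of agentv-bench, and in practice steps 1 to 3 are manual work rather than a function call.

```typescript
// Hypothetical shape of one eval result; not the actual agentv-bench output format.
interface EvalResult {
  test: string;
  passed: boolean;
}

// Returns the names of the tests that did not pass.
function failingTests(results: EvalResult[]): string[] {
  return results.filter((r) => !r.passed).map((r) => r.test);
}

// Edits the skill and re-runs the evals until every test passes
// or the round budget is exhausted. Both callbacks are placeholders.
function iterateUntilPassing(
  runEvals: () => EvalResult[],
  editSkill: (failing: string[]) => void,
  maxRounds: number,
): { rounds: number; remaining: string[] } {
  let failing = failingTests(runEvals()); // baseline run
  let rounds = 0;
  while (failing.length > 0 && rounds < maxRounds) {
    editSkill(failing); // steps 1-3: analyze transcripts, find gaps, edit the skill
    failing = failingTests(runEvals()); // step 4: re-run to verify
    rounds++;
  }
  return { rounds, remaining: failing };
}
```

Capping the number of rounds is a design choice made for the sketch; the issue itself only says to repeat until all 9 tests pass.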

Possible improvements

  • Relative file paths: Add an explicit checklist item about checking type: file values in eval YAML
  • Repeated inputs: Add guidance about the top-level input field from AgentV eval docs
  • Hard gates: Make the workflow-checklist.md more prescriptive about what to look for (artifact existence checks at the start of each phase skill)
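
To make the "hard gate" idea concrete, here is a minimal sketch of an artifact existence check at the start of a phase. In a phase skill this would normally be written as a prose instruction rather than code; the function name, the `exists` predicate, and the artifact names in the usage note are hypothetical and chosen only for illustration.

```typescript
// Predicate that reports whether an artifact exists.
// Injected by the caller so the sketch stays self-contained.
type Exists = (path: string) => boolean;

// Hard gate: refuse to start a phase unless every required artifact
// from the previous phase is present.
function assertArtifactsExist(
  phase: string,
  required: string[],
  exists: Exists,
): void {
  const missing = required.filter((path) => !exists(path));
  if (missing.length > 0) {
    throw new Error(`Cannot start ${phase}: missing ${missing.join(", ")}`);
  }
}
```

A caller could pass `existsSync` from `node:fs` as the predicate, for example `assertArtifactsExist("implement", ["plan.md"], existsSync)`. A workflow in which no phase performs a check of this kind is what the detect-missing-hard-gates test expects the reviewer to flag.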

Eval command

bun run --filter @agentv/core build && bun apps/cli/src/cli.ts eval evals/agentic-engineering/agent-plugin-review.eval.yaml --target pi-cli

Note: rebuild the @agentv/core dist before running the eval if core source was modified.

Metadata

Assignees: No one assigned
Labels: in-progress (Claimed by an agent; do not duplicate work)
Project status: Done