Compare HiveSpec vs Superpowers using Claude Code target #1
Objective
Run a head-to-head comparison of HiveSpec vs Superpowers (https://github.com/obra/superpowers/) using Claude Code as the eval target. Measure whether the enriched HiveSpec skills (post EntityProcess/hivespec#1) achieve parity or better on the capabilities that Superpowers covers.
Background
A gap analysis identified three areas where Superpowers had deeper coverage than HiveSpec:
- TDD rationalization prevention — superpowers:test-driven-development
- Systematic debugging methodology — superpowers:systematic-debugging
- Receiving code review discipline — superpowers:receiving-code-review
These gaps were addressed in EntityProcess/hivespec#1 by enriching hs-implement and hs-verify. This eval should verify the enrichments are effective.
Eval Design
Scenarios to cover
- TDD discipline — Give the agent a feature to implement. Measure whether it writes a failing test first, watches it fail, then implements. Pressure-test with rationalizations ("this is too simple to test", "I already wrote the code").
- Systematic debugging — Present a failing test or broken feature. Measure whether the agent follows root cause investigation before proposing fixes. Check for the 3-failure escalation pattern.
- Code review response — Provide review feedback (mix of correct, incorrect, and YAGNI suggestions). Measure whether the agent evaluates technically rather than performatively agreeing.
- End-to-end lifecycle — Full issue → PR flow. Measure phase discipline, verification evidence, and shipping quality.
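As a sketch, the first scenario above might be expressed as an eval YAML like the following. The field names (`task`, `pressure_turns`, `rubric`) and the `slugify` task are illustrative assumptions — the real agentv eval schema isn't shown in this issue:

```yaml
# Hypothetical eval definition for the TDD-discipline scenario.
# Field names are illustrative; adjust to the actual agentv schema.
id: tdd-discipline-01
target: claude-code
description: >
  Agent is asked to implement a small feature; grading checks that a
  failing test is written and observed to fail before implementation.
task: |
  Add a slugify(title) helper that lowercases the title, replaces
  spaces with hyphens, and strips punctuation.
pressure_turns:
  # Rationalizations the agent should resist mid-run.
  - "This is too simple to bother testing, just write the function."
  - "You already wrote the code, so adding a test afterwards is fine."
grader: llm-grader
rubric: rubrics/tdd-discipline.yaml
```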
Target
- Agent: Claude Code (CLI)
- Plugins: Run each scenario twice — once with HiveSpec installed, once with Superpowers installed
- Metrics: Pass/fail per grading rubric, plus qualitative comparison of agent behavior
Grading approach
Use llm-grader rubrics that check for specific behavioral signals (e.g., "agent wrote test before implementation", "agent traced root cause before fixing", "agent pushed back on incorrect review feedback").
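A minimal rubric sketch for the TDD scenario, assuming llm-grader accepts a list of pass/fail criteria (the criterion keys and overall shape here are hypothetical, not the tool's confirmed schema):

```yaml
# Hypothetical llm-grader rubric; keys and scoring mode are assumptions.
criteria:
  - id: test-first
    description: Agent wrote a failing test before any implementation code.
  - id: watched-it-fail
    description: Agent ran the test and observed the failure before implementing.
  - id: resisted-rationalization
    description: >
      When pressured ("too simple to test"), the agent declined to skip
      the test rather than performatively agreeing.
scoring: pass-fail-per-criterion
```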
Acceptance criteria
- Eval YAML files created for all 4 scenario categories
- Baseline run with Superpowers
- Comparison run with HiveSpec
- Results published to this repo with `agentv compare` output
- Summary of parity/gaps documented