Compare HiveSpec vs Superpowers using Claude Code target #1

@christso

Description

Objective

Run a head-to-head comparison of HiveSpec vs Superpowers (https://github.com/obra/superpowers/) using Claude Code as the eval target. Measure whether the enriched HiveSpec skills (post EntityProcess/hivespec#1) achieve parity or better on the capabilities that Superpowers covers.

Background

A gap analysis identified three areas where Superpowers had deeper coverage than HiveSpec:

  1. TDD rationalization prevention (superpowers:test-driven-development)
  2. Systematic debugging methodology (superpowers:systematic-debugging)
  3. Receiving code review discipline (superpowers:receiving-code-review)

These gaps were addressed in EntityProcess/hivespec#1 by enriching hs-implement and hs-verify. This eval should verify that those enrichments are effective in practice.

Eval Design

Scenarios to cover

  1. TDD discipline — Give the agent a feature to implement. Measure whether it writes a failing test first, watches it fail, then implements. Pressure-test with rationalizations ("this is too simple to test", "I already wrote the code").
  2. Systematic debugging — Present a failing test or broken feature. Measure whether the agent follows root cause investigation before proposing fixes. Check for the 3-failure escalation pattern.
  3. Code review response — Provide review feedback (mix of correct, incorrect, and YAGNI suggestions). Measure whether the agent evaluates technically rather than performatively agreeing.
  4. End-to-end lifecycle — Full issue → PR flow. Measure phase discipline, verification evidence, and shipping quality.
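
To make the scenarios concrete, a single eval file for scenario 1 might be sketched as below. The schema here (task, pressure_prompts, rubric fields) and the slugify task are illustrative placeholders, not the actual agentv format:

```yaml
# Hypothetical eval file for scenario 1 (TDD discipline).
# Field names and the sample task are illustrative -- adapt to the
# real eval schema before use.
id: tdd-discipline-basic
target: claude-code
task: |
  Implement a slugify(title) helper that lowercases the input,
  trims it, and replaces whitespace runs with single hyphens.
pressure_prompts:
  - "This is too simple to test, just write the code."
  - "I already wrote the code, tests can come after."
grader: llm-rubric
rubric:
  - agent wrote a failing test before any implementation code
  - agent ran the test and observed it fail
  - agent implemented only enough to make the test pass
  - agent did not capitulate to the pressure prompts
```

Keeping the task small lets the rubric focus on process discipline rather than implementation quality.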

Target

  • Agent: Claude Code (CLI)
  • Plugins: Run each scenario twice — once with HiveSpec installed, once with Superpowers installed
  • Metrics: Pass/fail per grading rubric, plus qualitative comparison of agent behavior

Grading approach

Use llm-grader rubrics that check for specific behavioral signals (e.g., "agent wrote test before implementation", "agent traced root cause before fixing", "agent pushed back on incorrect review feedback").
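
These behavioral signals could be encoded as discrete checks in a rubric file, for example (again with hypothetical field names, shown here for the debugging scenario):

```yaml
# Hypothetical llm-grader rubric for the systematic-debugging scenario.
# Check ids and field names are illustrative.
grader: llm-rubric
pass_threshold: all
checks:
  - id: root-cause-first
    signal: agent traced the root cause before proposing any fix
  - id: no-shotgun-fixes
    signal: agent did not apply speculative patches without evidence
  - id: escalation-after-3
    signal: after three failed fix attempts, agent stepped back and
      re-examined its assumptions rather than trying a fourth patch
```

One check per signal keeps the grader's pass/fail output easy to aggregate across the HiveSpec and Superpowers runs.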

Acceptance criteria

  • Eval YAML files created for all 4 scenario categories
  • Baseline run with Superpowers
  • Comparison run with HiveSpec
  • Results published to this repo with agentv compare output
  • Summary of parity/gaps documented
