
Integrate structured review output, subagent contracts, and retro accumulation from plugin research #7

@christso

Description


Objective

Integrate high-value patterns identified from researching affaan-m/everything-claude-code and openai/codex-plugin-cc into HiveSpec's delivery lifecycle skills, with evals proving each change improves agent behavior.

Full research: agentevals-research HIVESPEC_INTEGRATION.md

Design Latitude

  • The markdown snippets below are directional, not prescriptive — reshape to fit each file's existing voice and structure
  • Each change is a skill content update only (edit existing .md files) — no code, no infrastructure, no new dependencies
  • Evals go in the hivespec-evals repo, following AgentV EVAL.yaml format
  • The change proposed in #5 (feat(skills): add hs-retro phase for session retrospectives — foreground/background dispatch) is out of scope for this issue; it's an optimization that can be a follow-up

Changes

1. Structured review output schema

Files to edit:

  • skills/hs-implement/references/spec-reviewer-prompt.md
  • skills/hs-implement/references/code-quality-reviewer-prompt.md

What to add: A required output format section instructing reviewers to return findings as a structured table:

```markdown
| # | Severity | File:Line | Finding | Recommendation |
|---|----------|-----------|---------|----------------|
| 1 | high | src/foo.ts:42 | Missing null check on user input | Add guard clause |

**Verdict**: approve | needs-attention
**Next steps**: [specific actions if needs-attention]
```

Why: Currently reviewers return prose, making it hard for the parent agent to systematically track whether all findings were addressed before proceeding to hs-ship. Structured output makes findings enumerable and trackable.

Also update: skills/hs-verify/SKILL.md Step 4 — add a note that the parent agent should confirm every "needs-attention" finding was resolved before moving to Step 5.
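To make "enumerable and trackable" concrete: a table in the proposed format can be parsed mechanically by the parent agent. A minimal sketch (the table format is the one proposed above; the parsing code itself is illustrative, not part of the deliverable):

```python
# Parse finding rows out of a structured review in the proposed table format.
REVIEW = """\
| # | Severity | File:Line | Finding | Recommendation |
|---|----------|-----------|---------|----------------|
| 1 | high | src/foo.ts:42 | Missing null check on user input | Add guard clause |

**Verdict**: needs-attention
"""

def parse_findings(review: str) -> list[dict]:
    """Extract finding rows, skipping the header and separator rows."""
    findings = []
    for line in review.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # A finding row has exactly 5 cells and a numeric first cell.
        if len(cells) == 5 and cells[0].isdigit():
            keys = ["num", "severity", "location", "finding", "recommendation"]
            findings.append(dict(zip(keys, cells)))
    return findings

findings = parse_findings(REVIEW)
print(len(findings))            # 1
print(findings[0]["severity"])  # high
```

Each parsed finding can then carry a resolved/unresolved flag, which is exactly what the hs-verify Step 4 check needs.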

2. Subagent output contract

Files to edit:

  • skills/hs-implement/references/implementer-prompt.md

What to add: A required output contract section:

```markdown
## Output Contract

Every response must end with:

**Status**: DONE | DONE_WITH_CONCERNS | NEEDS_CONTEXT | BLOCKED
**Files changed**:
- `path/to/file.ts` — one-line description
**Tests added/modified**:
- `path/to/test.ts` — what it covers
**Unresolved concerns**:
- Any shared types/interfaces modified that other subagents might also touch
- Anything noticed but out of scope for this task
```

Also update: skills/hs-implement/SKILL.md Subagent Review Protocol — add an integration safety check: after collecting outputs from parallel subagents, verify no two subagents modified the same shared type/interface. If they did, reconcile before committing.

Why: Currently hs-implement defines subagent status codes (DONE, DONE_WITH_CONCERNS, etc.) but doesn't require structured output listing files and concerns. The parent agent has to infer what changed from prose, making integration conflicts between parallel subagents invisible until tests break.
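Once each subagent emits a "Files changed" list, the integration safety check reduces to a set intersection. A hypothetical sketch, assuming the lists have already been parsed out of the contract output (data structures here are illustrative):

```python
# Illustrative sketch of the integration safety check: flag any file
# touched by two or more parallel subagents before committing.
from collections import defaultdict

def find_shared_file_conflicts(subagent_files: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each file touched by 2+ subagents to the subagents that touched it."""
    touched_by = defaultdict(list)
    for agent, files in subagent_files.items():
        for path in files:
            touched_by[path].append(agent)
    return {path: agents for path, agents in touched_by.items() if len(agents) > 1}

conflicts = find_shared_file_conflicts({
    "subagent-a": ["src/types.ts", "src/a.ts"],
    "subagent-b": ["src/types.ts", "src/b.ts"],
})
print(conflicts)  # {'src/types.ts': ['subagent-a', 'subagent-b']}
```

Any non-empty result means the parent must reconcile before committing, rather than discovering the overlap when tests break.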

3. Confidence-based retro accumulation

Files to edit:

  • skills/hs-retro/SKILL.md

What to add: A new Step 3b "Check recurrence" between current Step 3 (Classify) and Step 4 (Apply fixes):

```markdown
### Step 3b: Check recurrence

Before applying a fix, assess whether this is a one-off or a pattern:

1. **First occurrence**: Apply the fix only if it's clearly a systematic gap (not a user preference or one-off situation). Otherwise, note it as a candidate for future confirmation.
2. **Recurs across 2+ sessions**: Apply with high confidence — this is a systematic gap.
3. **Recurs across 3+ sessions AND across different repos**: Consider promoting the fix from project-scoped to HiveSpec core skills.

Single-session over-fitting is how skills accumulate noise. Require recurrence before structural changes.
```

Why: Currently hs-retro treats every session independently — a one-time user preference gets the same treatment as a pattern that appears in every session. This leads to over-fitting skills to single-session noise.
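The thresholds above reduce to a small decision function. A hypothetical sketch (function and action names are illustrative, not part of the skill content):

```python
def retro_action(occurrences: int, repo_count: int, clearly_systematic: bool) -> str:
    """Decide what to do with a retro finding, per the recurrence thresholds above."""
    if occurrences >= 3 and repo_count >= 2:
        return "promote-to-core"        # recurs across repos: candidate for HiveSpec core
    if occurrences >= 2:
        return "apply-high-confidence"  # recurs across sessions: systematic gap
    if clearly_systematic:
        return "apply"                  # first occurrence, but clearly a systematic gap
    return "mark-candidate"             # note it; wait for recurrence

print(retro_action(1, 1, False))  # mark-candidate
print(retro_action(3, 2, False))  # promote-to-core
```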

4. Reliability check for non-deterministic features

Files to edit:

  • skills/hs-verify/SKILL.md

What to add: A new Step 2b after Step 2 (E2E red/green protocol):

```markdown
### Step 2b: Reliability check (non-deterministic features only)

If the feature involves LLM calls, external APIs, or historically flaky tests:

1. Run the relevant test suite **3 times** (pass^3 protocol)
2. All 3 must pass — any failure means the feature is unreliable, not "flaky"
3. For LLM-dependent features: use at least 2 different representative inputs

Skip this step for purely deterministic code paths (config parsing, data transformations, etc.) — one green run is sufficient.
```

Why: A single passing test run doesn't prove reliability for non-deterministic features. pass^3 catches intermittent failures that a single run misses, while the deterministic carve-out avoids wasting time re-running stable code.
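The pass^3 loop itself is trivial; what matters is the all-or-nothing semantics. A minimal sketch, assuming the test suite is invoked as a shell command (the `true` placeholder stands in for the real test command):

```python
# pass^n protocol: the suite must go green on every run, not just one.
import subprocess

def pass_n(cmd: list[str], runs: int = 3) -> bool:
    """Run cmd `runs` times; any non-zero exit means the feature is unreliable."""
    for i in range(1, runs + 1):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"run {i}/{runs} failed -- feature is unreliable, not 'flaky'")
            return False
    return True

# 'true' always exits 0, so all three runs pass; swap in the real test command.
print(pass_n(["true"]))  # True
```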


Evals

Each change needs A/B evals in hivespec-evals proving the updated skill produces measurably better agent behavior than the current skill. Use AgentV EVAL.yaml format.

Eval 1: Structured review output

| Test ID | What It Checks | Key Assertions |
|---------|----------------|----------------|
| review-output-is-structured | Reviewer returns tabular findings with severity, file:line, verdict | contains: "\| Severity \|"; contains-any: ["approve", "needs-attention"]; rubrics: findings have file refs and are actionable |
| review-findings-trackable | Parent agent can enumerate findings and check resolution status per finding | rubrics: all findings enumerated, each gets resolved/unresolved status |

Eval 2: Subagent output contract

| Test ID | What It Checks | Key Assertions |
|---------|----------------|----------------|
| subagent-returns-contract | Implementer subagent output contains status, files changed, tests added | contains-all: ["Status:", "Files changed:", "Tests added"]; rubrics: status is valid enum, concerns surfaced |
| integration-conflict-detected | Parent agent catches when two subagents modified the same shared type | rubrics: conflict identified, resolution proposed |

Eval 3: Reliability check

| Test ID | What It Checks | Key Assertions |
|---------|----------------|----------------|
| nondeterministic-gets-multiple-runs | Agent runs tests 3x for LLM-dependent features | rubrics: multiple runs, recognizes nondeterminism |
| deterministic-gets-single-run | Agent runs tests once for deterministic features (no wasted reruns) | rubrics: single run sufficient, no unnecessary reruns |

Eval 4: Retro accumulation

| Test ID | What It Checks | Key Assertions |
|---------|----------------|----------------|
| first-occurrence-marked-candidate | First-time intervention is noted, not immediately committed as a skill edit | rubrics: candidate marked, no premature skill edit |
| recurring-pattern-applied | Same intervention across 3 transcripts → fix applied with high confidence | rubrics: recurrence recognized, fix applied citing recurrence |

Running evals

```shell
# Baseline (current skills)
agentv eval --file evals/hivespec/structured-review.eval.yaml --target claude

# Candidate (updated skills in workspace)
agentv eval --file evals/hivespec/structured-review.eval.yaml --target claude

# Compare
agentv compare baseline.jsonl candidate.jsonl
```

Key metrics: pass_rate delta, token_usage delta, per-rubric score improvements.
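The pass_rate delta is straightforward to compute from the two result files. A sketch assuming one JSON object per line with a boolean `passed` field — the actual AgentV results schema may differ:

```python
# Compute pass_rate for a JSONL eval results file.
# ASSUMPTION: each line is a JSON object with a boolean "passed" field;
# adjust the field name to whatever AgentV actually emits.
import json

def pass_rate(path: str) -> float:
    """Fraction of eval results marked passed."""
    with open(path) as f:
        results = [json.loads(line) for line in f if line.strip()]
    return sum(r["passed"] for r in results) / len(results)

# delta = pass_rate("candidate.jsonl") - pass_rate("baseline.jsonl")
```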


Acceptance Signals

  • Reviewer prompts produce structured, tabular findings with verdict (eval 1 passes)
  • Implementer prompt specifies output contract; parent catches integration conflicts (eval 2 passes)
  • hs-retro distinguishes first-occurrence from recurring patterns (eval 4 passes)
  • hs-verify applies pass^3 for non-deterministic features, single run for deterministic (eval 3 passes)
  • All evals show pass_rate improvement over baseline (current skills)
  • All changes are skill content updates only (no code, no infrastructure)

Non-Goals

  • Hook profile system (minimal/standard/strict) — HiveSpec's Hard Gates are unconditional by design
  • Multi-model routing — belongs in target repo's CLAUDE.md, not HiveSpec
  • Session persistence — Claude Code handles this
  • Foreground/background dispatch heuristic — follow-up issue, not this one
  • DAG decomposition — hs-implement's parallel dispatch is sufficient
  • XML-structured prompts — Markdown skills are simpler and work
