Integrate structured review output, subagent contracts, and retro accumulation from plugin research #7
Description
Objective
Integrate high-value patterns identified from researching affaan-m/everything-claude-code and openai/codex-plugin-cc into HiveSpec's delivery lifecycle skills, with evals proving each change improves agent behavior.
Full research: agentevals-research HIVESPEC_INTEGRATION.md
Design Latitude
- The markdown snippets below are directional, not prescriptive — reshape to fit each file's existing voice and structure
- Each change is a skill content update only (edit existing `.md` files): no code, no infrastructure, no new dependencies
- Evals go in the hivespec-evals repo, following AgentV EVAL.yaml format
- Change #5 (foreground/background dispatch) is out of scope for this issue; it's an optimization that can be a follow-up
Changes
1. Structured review output schema
Files to edit:
- skills/hs-implement/references/spec-reviewer-prompt.md
- skills/hs-implement/references/code-quality-reviewer-prompt.md
What to add: A required output format section instructing reviewers to return findings as a structured table:
| # | Severity | File:Line | Finding | Recommendation |
|---|----------|-----------|---------|----------------|
| 1 | high | src/foo.ts:42 | Missing null check on user input | Add guard clause |
**Verdict**: approve | needs-attention
**Next steps**: [specific actions if needs-attention]
Why: Currently reviewers return prose, making it hard for the parent agent to systematically track whether all findings were addressed before proceeding to hs-ship. Structured output makes findings enumerable and trackable.
Also update: skills/hs-verify/SKILL.md Step 4 — add a note that the parent agent should confirm every "needs-attention" finding was resolved before moving to Step 5.
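To illustrate why the tabular format makes findings trackable, here is a minimal Python sketch of how a parent agent (or an eval harness) could enumerate findings and list the ones still open. The names `Finding`, `parse_review`, and `unresolved`, and the second sample row, are hypothetical and not part of HiveSpec.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    number: int
    severity: str
    location: str
    finding: str
    recommendation: str


def parse_review(text: str) -> tuple[list[Finding], Optional[str]]:
    """Extract the table rows and the verdict from a reviewer response."""
    findings = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Data rows have five cells and a numeric first column; this skips
        # the header row and the |---| separator row.
        if len(cells) == 5 and cells[0].isdigit():
            findings.append(Finding(int(cells[0]), *cells[1:]))
    match = re.search(r"\*\*Verdict\*\*:\s*(approve|needs-attention)", text)
    return findings, (match.group(1) if match else None)


def unresolved(findings: list[Finding], resolved_ids: set[int]) -> list[Finding]:
    """Return the findings the parent agent has not yet marked resolved."""
    return [f for f in findings if f.number not in resolved_ids]


# Sample reviewer output; row 2 is invented for illustration.
sample = """\
| # | Severity | File:Line | Finding | Recommendation |
|---|----------|-----------|---------|----------------|
| 1 | high | src/foo.ts:42 | Missing null check on user input | Add guard clause |
| 2 | low | src/bar.ts:7 | Unused import | Remove it |

**Verdict**: needs-attention
"""

findings, verdict = parse_review(sample)
open_items = unresolved(findings, resolved_ids={1})
```

With prose output, the equivalent check would depend on the parent agent's reading of free text; with the table, "every finding resolved" reduces to `open_items` being empty.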
2. Subagent output contract
Files to edit:
skills/hs-implement/references/implementer-prompt.md
What to add: A required output contract section:
## Output Contract
Every response must end with:
**Status**: DONE | DONE_WITH_CONCERNS | NEEDS_CONTEXT | BLOCKED
**Files changed**:
- `path/to/file.ts` — one-line description
**Tests added/modified**:
- `path/to/test.ts` — what it covers
**Unresolved concerns**:
- Any shared types/interfaces modified that other subagents might also touch
- Anything noticed but out of scope for this task
Also update: skills/hs-implement/SKILL.md Subagent Review Protocol. Add an integration safety check: after collecting outputs from parallel subagents, verify no two subagents modified the same shared type/interface. If they did, reconcile before committing.
Why: Currently hs-implement defines subagent status codes (DONE, DONE_WITH_CONCERNS, etc.) but doesn't require structured output listing files and concerns. The parent agent has to infer what changed from prose, making integration conflicts between parallel subagents invisible until tests break.
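The integration safety check can be sketched in Python as follows. This is an approximation under stated assumptions: it detects overlap at the file level (two subagents listing the same path under **Files changed**) rather than at the level of individual types or interfaces, and the agent names, paths, and helper functions are hypothetical.

```python
import re
from collections import defaultdict


def files_changed(output: str) -> set[str]:
    """Collect backticked paths listed under the '**Files changed**:' heading."""
    paths = set()
    in_section = False
    for line in output.splitlines():
        stripped = line.strip()
        if stripped.startswith("**Files changed**"):
            in_section = True
        elif stripped.startswith("**"):
            in_section = False  # any other bold heading ends the section
        elif in_section:
            # Only the backticked path is read; the description is ignored.
            match = re.match(r"-\s*`([^`]+)`", stripped)
            if match:
                paths.add(match.group(1))
    return paths


def find_conflicts(outputs: dict[str, str]) -> dict[str, list[str]]:
    """Map each path touched by more than one subagent to those subagents."""
    touched = defaultdict(list)
    for agent, output in outputs.items():
        for path in files_changed(output):
            touched[path].append(agent)
    return {p: sorted(a) for p, a in touched.items() if len(a) > 1}


# Hypothetical outputs from two parallel subagents.
outputs = {
    "agent-a": """\
**Status**: DONE
**Files changed**:
- `src/types/user.ts` - added email field
- `src/api/signup.ts` - validate email
**Tests added/modified**:
- `test/signup.test.ts` - covers validation
""",
    "agent-b": """\
**Status**: DONE_WITH_CONCERNS
**Files changed**:
- `src/types/user.ts` - renamed id field
**Unresolved concerns**:
- Modified shared User type
""",
}

conflicts = find_conflicts(outputs)
```

A non-empty `conflicts` result is the signal to reconcile before committing. Without the contract, the same information would have to be inferred from prose.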
3. Confidence-based retro accumulation
Files to edit:
skills/hs-retro/SKILL.md
What to add: A new Step 3b "Check recurrence" between current Step 3 (Classify) and Step 4 (Apply fixes):
### Step 3b: Check recurrence
Before applying a fix, assess whether this is a one-off or a pattern:
1. **First occurrence**: Apply the fix only if it's clearly a systematic gap (not a user preference or one-off situation). Otherwise, note it as a candidate for future confirmation.
2. **Recurs across 2+ sessions**: Apply with high confidence — this is a systematic gap.
3. **Recurs across 3+ sessions AND across different repos**: Consider promoting the fix from project-scoped to HiveSpec core skills.
Single-session over-fitting is how skills accumulate noise. Require recurrence before structural changes.
Why: Currently hs-retro treats every session independently: a one-time user preference gets the same treatment as a pattern that appears in every session. This leads to over-fitting skills to single-session noise.
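The recurrence rule in Step 3b amounts to a small decision function. The Python sketch below is one reading of it: `Action`, `decide`, and the `clearly_systematic` flag are hypothetical names, and "across different repos" is interpreted here as two or more distinct repos.

```python
from enum import Enum


class Action(Enum):
    NOTE_CANDIDATE = "note as candidate for future confirmation"
    APPLY = "apply fix"
    CONSIDER_PROMOTION = "apply fix and consider promoting to core skills"


def decide(sessions: int, repos: int, clearly_systematic: bool = False) -> Action:
    """Pick an action from how often, and how widely, an issue has recurred.

    sessions: number of sessions in which the issue was observed
    repos: number of distinct repos among those sessions (assumed threshold: 2)
    clearly_systematic: reviewer judgment that a first occurrence is a real gap
    """
    if sessions >= 3 and repos >= 2:
        return Action.CONSIDER_PROMOTION
    if sessions >= 2 or clearly_systematic:
        return Action.APPLY
    return Action.NOTE_CANDIDATE


first_time = decide(sessions=1, repos=1)
recurring = decide(sessions=2, repos=1)
widespread = decide(sessions=3, repos=2)
```

The thresholds are checked from strictest to loosest so that a pattern seen in three sessions within a single repo still falls through to `APPLY` rather than promotion.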
4. Reliability check for non-deterministic features
Files to edit:
skills/hs-verify/SKILL.md
What to add: A new Step 2b after Step 2 (E2E red/green protocol):
### Step 2b: Reliability check (non-deterministic features only)
If the feature involves LLM calls, external APIs, or historically flaky tests:
1. Run the relevant test suite **3 times** (pass^3 protocol)
2. All 3 must pass — any failure means the feature is unreliable, not "flaky"
3. For LLM-dependent features: use at least 2 different representative inputs
Skip this step for purely deterministic code paths (config parsing, data transformations, etc.): one green run is sufficient.
Why: A single passing test run doesn't prove reliability for non-deterministic features. pass^3 catches intermittent failures that a single run misses, while the deterministic carve-out avoids wasting time re-running stable code.
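As a rough illustration of the pass^3 protocol, the Python sketch below runs a test command a fixed number of times and succeeds only if every run exits with status 0. The function name `reliability_check` is hypothetical, and the demo commands are stand-ins for a real test suite invocation.

```python
import subprocess
import sys
from typing import Sequence


def reliability_check(command: Sequence[str], runs: int = 3) -> bool:
    """Run the command `runs` times; succeed only if every run exits 0."""
    for attempt in range(1, runs + 1):
        result = subprocess.run(command, capture_output=True)
        if result.returncode != 0:
            # A single failure is enough: the feature counts as unreliable.
            print(f"run {attempt}/{runs} failed: treat the feature as unreliable")
            return False
    return True


# Stand-in commands: one that always passes, one that always fails.
always_passes = [sys.executable, "-c", "import sys; sys.exit(0)"]
always_fails = [sys.executable, "-c", "import sys; sys.exit(1)"]

stable = reliability_check(always_passes)
unstable = reliability_check(always_fails)
```

The check stops at the first failure because the protocol treats any failure as disqualifying, so further runs add no information.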
Evals
Each change needs A/B evals in hivespec-evals proving the updated skill produces measurably better agent behavior than the current skill. Use AgentV EVAL.yaml format.
Eval 1: Structured review output
| Test ID | What It Checks | Key Assertions |
|---|---|---|
| review-output-is-structured | Reviewer returns tabular findings with severity, file:line, verdict | contains: "\| Severity \|", contains-any: ["approve", "needs-attention"], rubrics: findings have file refs + are actionable |
| review-findings-trackable | Parent agent can enumerate findings and check resolution status per finding | rubrics: all findings enumerated, each gets resolved/unresolved status |
Eval 2: Subagent output contract
| Test ID | What It Checks | Key Assertions |
|---|---|---|
| subagent-returns-contract | Implementer subagent output contains status, files changed, tests added | contains-all: `["**Status**:", "**Files changed**:", "**Tests added"]`, rubrics: status is valid enum, concerns surfaced |
| integration-conflict-detected | Parent agent catches when two subagents modified the same shared type | rubrics: conflict identified, resolution proposed |
Eval 3: Reliability check
| Test ID | What It Checks | Key Assertions |
|---|---|---|
| nondeterministic-gets-multiple-runs | Agent runs tests 3x for LLM-dependent features | rubrics: multiple runs, recognizes nondeterminism |
| deterministic-gets-single-run | Agent runs tests once for deterministic features (no wasted reruns) | rubrics: single run sufficient, no unnecessary reruns |
Eval 4: Retro accumulation
| Test ID | What It Checks | Key Assertions |
|---|---|---|
| first-occurrence-marked-candidate | First-time intervention is noted, not immediately committed as a skill edit | rubrics: candidate marked, no premature skill edit |
| recurring-pattern-applied | Same intervention across 3 transcripts → fix applied with high confidence | rubrics: recurrence recognized, fix applied citing recurrence |
Running evals
# Baseline (current skills)
agentv eval --file evals/hivespec/structured-review.eval.yaml --target claude
# Candidate (updated skills in workspace)
agentv eval --file evals/hivespec/structured-review.eval.yaml --target claude
# Compare
agentv compare baseline.jsonl candidate.jsonl
Key metrics: pass_rate delta, token_usage delta, per-rubric score improvements.
Acceptance Signals
- Reviewer prompts produce structured, tabular findings with verdict (eval 1 passes)
- Implementer prompt specifies output contract; parent catches integration conflicts (eval 2 passes)
- hs-retro distinguishes first-occurrence from recurring patterns (eval 4 passes)
- hs-verify applies pass^3 for non-deterministic features, single run for deterministic (eval 3 passes)
- All evals show pass_rate improvement over baseline (current skills)
- All changes are skill content updates only (no code, no infrastructure)
Non-Goals
- Hook profile system (minimal/standard/strict) — HiveSpec's Hard Gates are unconditional by design
- Multi-model routing — belongs in target repo's CLAUDE.md, not HiveSpec
- Session persistence — Claude Code handles this
- Foreground/background dispatch heuristic — follow-up issue, not this one
- DAG decomposition — hs-implement's parallel dispatch is sufficient
- XML-structured prompts — Markdown skills are simpler and work