Integrate structured review output, subagent contracts, and retro accumulation from plugin research

## Objective

Integrate high-value patterns identified from researching [affaan-m/everything-claude-code](https://github.com/affaan-m/everything-claude-code) and [openai/codex-plugin-cc](https://github.com/openai/codex-plugin-cc) into HiveSpec's delivery lifecycle skills, with evals proving each change improves agent behavior.

Full research: [agentevals-research HIVESPEC_INTEGRATION.md](https://github.com/agentevals/agentevals-research/blob/main/research/findings/everything-claude-code/HIVESPEC_INTEGRATION.md)

## Design Latitude

- The markdown snippets below are **directional, not prescriptive** — reshape to fit each file's existing voice and structure
- Each change is a **skill content update only** (edit existing `.md` files) — no code, no infrastructure, no new dependencies
- Evals go in the [hivespec-evals](https://github.com/EntityProcess/hivespec-evals) repo, following AgentV EVAL.yaml format
- Change #5 (foreground/background dispatch) is **out of scope** for this issue — it's an optimization that can be a follow-up

## Changes

### 1. Structured review output schema

**Files to edit:**
- `skills/hs-implement/references/spec-reviewer-prompt.md`
- `skills/hs-implement/references/code-quality-reviewer-prompt.md`

**What to add:** A required output format section instructing reviewers to return findings as a structured table:

```markdown
| # | Severity | File:Line | Finding | Recommendation |
|---|----------|-----------|---------|----------------|
| 1 | high | src/foo.ts:42 | Missing null check on user input | Add guard clause |

**Verdict**: approve | needs-attention
**Next steps**: [specific actions if needs-attention]
```

**Why:** Currently reviewers return prose, making it hard for the parent agent to systematically track whether all findings were addressed before proceeding to hs-ship. Structured output makes findings enumerable and trackable.

**Also update:** `skills/hs-verify/SKILL.md` Step 4 — add a note that the parent agent should confirm every "needs-attention" finding was resolved before moving to Step 5.

### 2. Subagent output contract

**Files to edit:**
- `skills/hs-implement/references/implementer-prompt.md`

**What to add:** A required output contract section:

```markdown
## Output Contract

Every response must end with:

**Status**: DONE | DONE_WITH_CONCERNS | NEEDS_CONTEXT | BLOCKED
**Files changed**:
- `path/to/file.ts` — one-line description
**Tests added/modified**:
- `path/to/test.ts` — what it covers
**Unresolved concerns**:
- Any shared types/interfaces modified that other subagents might also touch
- Anything noticed but out of scope for this task
```

**Also update:** `skills/hs-implement/SKILL.md` Subagent Review Protocol — add an **integration safety** check: after collecting outputs from parallel subagents, verify no two subagents modified the same shared type/interface. If they did, reconcile before committing.

**Why:** Currently hs-implement defines subagent status codes (DONE, DONE_WITH_CONCERNS, etc.) but doesn't require structured output listing files and concerns. The parent agent has to infer what changed from prose, making integration conflicts between parallel subagents invisible until tests break.

### 3. Confidence-based retro accumulation

**Files to edit:**
- `skills/hs-retro/SKILL.md`

**What to add:** A new Step 3b "Check recurrence" between current Step 3 (Classify) and Step 4 (Apply fixes):

```markdown
### Step 3b: Check recurrence

Before applying a fix, assess whether this is a one-off or a pattern:

1. **First occurrence**: Apply the fix only if it's clearly a systematic gap (not a user preference or one-off situation). Otherwise, note it as a candidate for future confirmation.
2. **Recurs across 2+ sessions**: Apply with high confidence — this is a systematic gap.
3. **Recurs across 3+ sessions AND across different repos**: Consider promoting the fix from project-scoped to HiveSpec core skills.

Single-session over-fitting is how skills accumulate noise. Require recurrence before structural changes.
```

**Why:** Currently hs-retro treats every session independently — a one-time user preference gets the same treatment as a pattern that appears in every session. This leads to over-fitting skills to single-session noise.

### 4. Reliability check for non-deterministic features

**Files to edit:**
- `skills/hs-verify/SKILL.md`

**What to add:** A new Step 2b after Step 2 (E2E red/green protocol):

```markdown
### Step 2b: Reliability check (non-deterministic features only)

If the feature involves LLM calls, external APIs, or historically flaky tests:

1. Run the relevant test suite **3 times** (pass^3 protocol)
2. All 3 must pass — any failure means the feature is unreliable, not "flaky"
3. For LLM-dependent features: use at least 2 different representative inputs

Skip this step for purely deterministic code paths (config parsing, data transformations, etc.) — one green run is sufficient.
```

**Why:** A single passing test run doesn't prove reliability for non-deterministic features. `pass^3` catches intermittent failures that a single run misses, while the deterministic carve-out avoids wasting time re-running stable code.

---

## Evals

Each change needs A/B evals in [hivespec-evals](https://github.com/EntityProcess/hivespec-evals) proving the updated skill produces measurably better agent behavior than the current skill. Use AgentV EVAL.yaml format.

### Eval 1: Structured review output

| Test ID | What It Checks | Key Assertions |
|---|---|---|
| `review-output-is-structured` | Reviewer returns tabular findings with severity, file:line, verdict | `contains: "\| Severity \|"`, `contains-any: ["approve", "needs-attention"]`, `rubrics: findings have file refs + are actionable` |
| `review-findings-trackable` | Parent agent can enumerate findings and check resolution status per finding | `rubrics: all findings enumerated, each gets resolved/unresolved status` |

### Eval 2: Subagent output contract

| Test ID | What It Checks | Key Assertions |
|---|---|---|
| `subagent-returns-contract` | Implementer subagent output contains status, files changed, tests added | `contains-all: ["Status:", "Files changed:", "Tests added"]`, `rubrics: status is valid enum, concerns surfaced` |
| `integration-conflict-detected` | Parent agent catches when two subagents modified the same shared type | `rubrics: conflict identified, resolution proposed` |

### Eval 3: Reliability check

| Test ID | What It Checks | Key Assertions |
|---|---|---|
| `nondeterministic-gets-multiple-runs` | Agent runs tests 3x for LLM-dependent features | `rubrics: multiple runs, recognizes nondeterminism` |
| `deterministic-gets-single-run` | Agent runs tests once for deterministic features (no wasted reruns) | `rubrics: single run sufficient, no unnecessary reruns` |

### Eval 4: Retro accumulation

| Test ID | What It Checks | Key Assertions |
|---|---|---|
| `first-occurrence-marked-candidate` | First-time intervention is noted, not immediately committed as a skill edit | `rubrics: candidate marked, no premature skill edit` |
| `recurring-pattern-applied` | Same intervention across 3 transcripts → fix applied with high confidence | `rubrics: recurrence recognized, fix applied citing recurrence` |

### Running evals

```bash
# Baseline (current skills)
agentv eval --file evals/hivespec/structured-review.eval.yaml --target claude

# Candidate (updated skills in workspace)
agentv eval --file evals/hivespec/structured-review.eval.yaml --target claude

# Compare
agentv compare baseline.jsonl candidate.jsonl
```

Key metrics: `pass_rate` delta, `token_usage` delta, per-rubric score improvements.

---

## Acceptance Signals

- [ ] Reviewer prompts produce structured, tabular findings with verdict (eval 1 passes)
- [ ] Implementer prompt specifies output contract; parent catches integration conflicts (eval 2 passes)
- [ ] hs-retro distinguishes first-occurrence from recurring patterns (eval 4 passes)
- [ ] hs-verify applies pass^3 for non-deterministic features, single run for deterministic (eval 3 passes)
- [ ] All evals show pass_rate improvement over baseline (current skills)
- [ ] All changes are skill content updates only (no code, no infrastructure)

## Non-Goals

- Hook profile system (minimal/standard/strict) — HiveSpec's Hard Gates are unconditional by design
- Multi-model routing — belongs in target repo's CLAUDE.md, not HiveSpec
- Session persistence — Claude Code handles this
- Foreground/background dispatch heuristic — follow-up issue, not this one
- DAG decomposition — hs-implement's parallel dispatch is sufficient
- XML-structured prompts — Markdown skills are simpler and work

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Integrate structured review output, subagent contracts, and retro accumulation from plugin research #7

Objective

Design Latitude

Changes

1. Structured review output schema

2. Subagent output contract

3. Confidence-based retro accumulation

4. Reliability check for non-deterministic features

Evals

Eval 1: Structured review output

Eval 2: Subagent output contract

Eval 3: Reliability check

Eval 4: Retro accumulation

Running evals

Acceptance Signals

Non-Goals

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Test ID	What It Checks	Key Assertions
`review-output-is-structured`	Reviewer returns tabular findings with severity, file:line, verdict	`contains: "\| Severity \|"`, `contains-any: ["approve", "needs-attention"]`, `rubrics: findings have file refs + are actionable`
`review-findings-trackable`	Parent agent can enumerate findings and check resolution status per finding	`rubrics: all findings enumerated, each gets resolved/unresolved status`

Test ID	What It Checks	Key Assertions
`subagent-returns-contract`	Implementer subagent output contains status, files changed, tests added	`contains-all: ["Status:", "Files changed:", "Tests added"]`, `rubrics: status is valid enum, concerns surfaced`
`integration-conflict-detected`	Parent agent catches when two subagents modified the same shared type	`rubrics: conflict identified, resolution proposed`

Test ID	What It Checks	Key Assertions
`nondeterministic-gets-multiple-runs`	Agent runs tests 3x for LLM-dependent features	`rubrics: multiple runs, recognizes nondeterminism`
`deterministic-gets-single-run`	Agent runs tests once for deterministic features (no wasted reruns)	`rubrics: single run sufficient, no unnecessary reruns`

Test ID	What It Checks	Key Assertions
`first-occurrence-marked-candidate`	First-time intervention is noted, not immediately committed as a skill edit	`rubrics: candidate marked, no premature skill edit`
`recurring-pattern-applied`	Same intervention across 3 transcripts → fix applied with high confidence	`rubrics: recurrence recognized, fix applied citing recurrence`

Integrate structured review output, subagent contracts, and retro accumulation from plugin research #7

Description

Objective

Design Latitude

Changes

1. Structured review output schema

2. Subagent output contract

3. Confidence-based retro accumulation

4. Reliability check for non-deterministic features

Evals

Eval 1: Structured review output

Eval 2: Subagent output contract

Eval 3: Reliability check

Eval 4: Retro accumulation

Running evals

Acceptance Signals

Non-Goals

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions