
epic: agent-reviewed contracts — LLM validation gates with rework feedback #697

@nextlevelshit

Vision

Transform Wave's contract system from mechanical format checks into real quality gates where a separate agent reviews work products, provides structured feedback, and drives rework loops. This closes the gap between "output matches schema" and "output is actually correct."

Today a pipeline step can pass all contracts (JSON schema valid, tests green) while producing a no-op PR, a wrong implementation, or a change that misses the issue entirely. The contract system validates shape, not substance.

Target state

handover:
  contracts:
    # Mechanical checks run first (fast, cheap)
    - type: test_suite
      command: "{{ project.contract_test_command }}"
      on_failure: retry

    # Agent review runs second (slower, costs tokens, but catches quality issues)
    - type: agent_review
      reviewer: navigator
      model: claude-haiku
      criteria_path: .wave/contracts/impl-review-criteria.md
      context:
        - artifact: assessment
        - artifact: impl-plan
        - source: git_diff
      on_failure: rework
      rework_step: fix-implement

The reviewer agent sees the git diff, upstream context (issue, plan), and review criteria. It outputs a structured verdict. If rework is needed, the feedback is injected as an artifact into the rework step — the implementer knows exactly what to fix.


Principles

  1. Separation of concerns — the agent that did the work never reviews its own work
  2. Cheap first, expensive second — mechanical checks (schema, tests) run before agent review to avoid wasting tokens on obviously broken output
  3. Feedback flows forward — review feedback is a first-class artifact, not a log message
  4. Invisible to simple pipelines — existing json_schema / test_suite contracts work unchanged. agent_review is opt-in per step
  5. Cost-bounded — agent reviews have token budgets and model pinning (haiku by default)

Child Issues

1. Expand ContractResult with structured feedback

Scope: internal/contract/

Extend ContractResult to carry rich review output alongside the existing pass/fail:

type ContractResult struct {
    Pass     bool
    Error    string
    Feedback *ReviewFeedback  // nil for mechanical contracts
}

type ReviewFeedback struct {
    Verdict     string   `json:"verdict"`      // "pass", "rework", "fail"  
    Issues      []Issue  `json:"issues"`       // specific problems
    Suggestions []string `json:"suggestions"`  // improvement ideas
    Confidence  float64  `json:"confidence"`   // 0.0-1.0
}

type Issue struct {
    Severity string `json:"severity"` // "critical", "major", "minor"
    File     string `json:"file,omitempty"`
    Detail   string `json:"detail"`
}

Backward-compatible — all existing validators return Feedback: nil. The executor checks Feedback only when non-nil.
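
A minimal sketch of how the executor could treat the optional field, assuming the struct definitions above. Describe is a hypothetical helper, not an existing Wave function; it only shows that mechanical results (Feedback == nil) take the old path and review results get the richer log line.

```go
package main

import "fmt"

type Issue struct {
	Severity string `json:"severity"` // "critical", "major", "minor"
	File     string `json:"file,omitempty"`
	Detail   string `json:"detail"`
}

type ReviewFeedback struct {
	Verdict     string   `json:"verdict"` // "pass", "rework", "fail"
	Issues      []Issue  `json:"issues"`
	Suggestions []string `json:"suggestions"`
	Confidence  float64  `json:"confidence"` // 0.0-1.0
}

type ContractResult struct {
	Pass     bool
	Error    string
	Feedback *ReviewFeedback // nil for mechanical contracts
}

// Describe (hypothetical) builds the log line an executor might emit.
// It reads Feedback only when it is non-nil, so existing validators
// that never set the field behave exactly as before.
func Describe(r ContractResult) string {
	status := "fail"
	if r.Pass {
		status = "pass"
	}
	if r.Feedback == nil {
		return "contract " + status
	}
	return fmt.Sprintf("contract %s (verdict=%s, confidence=%.2f, issues=%d)",
		status, r.Feedback.Verdict, r.Feedback.Confidence, len(r.Feedback.Issues))
}
```

The nil check is the whole compatibility story here: no existing validator has to change to keep compiling or to keep its current log output.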

Acceptance criteria:

  • ContractResult has optional Feedback field
  • Existing contract types unaffected (return nil feedback)
  • ReviewFeedback has JSON tags for serialization
  • Executor logs feedback when present
  • All existing tests pass unchanged

2. Implement agent_review contract validator

Scope: internal/contract/agent_review.go (new file)

New contract type that spawns a lightweight agent to review step output.

Input to the reviewer:

  • Git diff of the step's worktree changes
  • Step output artifacts
  • Upstream artifacts specified in context config
  • Review criteria from criteria_path
  • A system prompt enforcing structured JSON output

Output: ReviewFeedback JSON parsed from the agent's response.
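
One possible shape for the parsing step, as a sketch. ParseReviewFeedback is a hypothetical name, and the choice to take everything between the first "{" and the last "}" is an assumption about how the reviewer's reply might be wrapped in prose. It rejects unknown verdicts and out-of-range confidence instead of passing them downstream.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

type Issue struct {
	Severity string `json:"severity"`
	File     string `json:"file,omitempty"`
	Detail   string `json:"detail"`
}

type ReviewFeedback struct {
	Verdict     string   `json:"verdict"`
	Issues      []Issue  `json:"issues"`
	Suggestions []string `json:"suggestions"`
	Confidence  float64  `json:"confidence"`
}

// ParseReviewFeedback (hypothetical) extracts the JSON object from the
// reviewer's raw reply and validates the fields the executor relies on.
func ParseReviewFeedback(raw string) (*ReviewFeedback, error) {
	start := strings.Index(raw, "{")
	end := strings.LastIndex(raw, "}")
	if start < 0 || end < start {
		return nil, fmt.Errorf("no JSON object in reviewer output")
	}
	var fb ReviewFeedback
	if err := json.Unmarshal([]byte(raw[start:end+1]), &fb); err != nil {
		return nil, fmt.Errorf("decode review feedback: %w", err)
	}
	switch fb.Verdict {
	case "pass", "rework", "fail":
		// known verdicts
	default:
		return nil, fmt.Errorf("unknown verdict %q", fb.Verdict)
	}
	if fb.Confidence < 0 || fb.Confidence > 1 {
		return nil, fmt.Errorf("confidence %v outside [0, 1]", fb.Confidence)
	}
	return &fb, nil
}
```

The brace scan is deliberately naive: a reply whose surrounding prose contains stray braces would fail to decode and surface as an error, which the validator can then treat like any other unusable review.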

Configuration:

type: agent_review
reviewer: navigator           # persona name — MUST differ from step persona
model: claude-haiku           # model override (cheap by default)
criteria_path: .wave/contracts/review.md  # review prompt
context:                      # what the reviewer sees
  - artifact: assessment      # from named prior step artifact
  - artifact: impl-plan
  - source: git_diff          # automatic: worktree diff
max_tokens: 8192              # budget cap for the review
timeout: 120                  # seconds

Key constraint: The reviewer persona must be different from the step's persona. Validator should enforce this at parse time.
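
A sketch of the parse-time check, under assumptions: AgentReviewConfig and its Validate method are hypothetical names, and treating the example values 8192 and 120 as defaults is an assumption (only the haiku default is stated in the principles).

```go
package main

import (
	"errors"
	"fmt"
)

// AgentReviewConfig (hypothetical) mirrors the YAML keys shown above.
type AgentReviewConfig struct {
	Reviewer     string
	Model        string
	CriteriaPath string
	MaxTokens    int
	TimeoutSec   int
}

var ErrSelfReview = errors.New("agent_review: reviewer persona must differ from step persona")

// Validate enforces the self-review constraint and fills in defaults.
// stepPersona is the persona of the step whose output is being reviewed.
func (c *AgentReviewConfig) Validate(stepPersona string) error {
	if c.Reviewer == "" {
		return errors.New("agent_review: reviewer is required")
	}
	if c.Reviewer == stepPersona {
		return fmt.Errorf("%w (both are %q)", ErrSelfReview, stepPersona)
	}
	if c.MaxTokens < 0 || c.TimeoutSec < 0 {
		return errors.New("agent_review: max_tokens and timeout must not be negative")
	}
	if c.Model == "" {
		c.Model = "claude-haiku" // cheap by default
	}
	if c.MaxTokens == 0 {
		c.MaxTokens = 8192 // assumed default, taken from the example config
	}
	if c.TimeoutSec == 0 {
		c.TimeoutSec = 120 // assumed default, in seconds
	}
	return nil
}
```

Running this while the pipeline is loaded, before any step executes, means a misconfigured reviewer fails the run up front and never spends tokens.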

Acceptance criteria:

  • agent_review registered as a contract type
  • Reviewer sees git diff + configured context artifacts
  • Structured ReviewFeedback parsed from agent output
  • Token budget enforced (max_tokens)
  • Timeout enforced
  • Error if reviewer persona == step persona
  • Falls back to ContractResult{Pass: true} if agent is unavailable (fail-open configurable)
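
The timeout and fail-open criteria could be combined roughly as follows. This is a sketch: ReviewRunner, RunWithTimeout and fallbackResult are hypothetical names, and treating a timeout the same as an unavailable agent is an assumption the issue does not spell out.

```go
package main

import (
	"context"
	"errors"
	"time"
)

type ContractResult struct {
	Pass  bool
	Error string
}

// ReviewRunner is a hypothetical stand-in for whatever spawns the reviewer agent.
type ReviewRunner interface {
	RunReview(ctx context.Context, prompt string) (string, error)
}

var ErrReviewerUnavailable = errors.New("reviewer agent unavailable")

// fallbackResult decides what the contract reports when no verdict was produced.
// failOpen=true lets the step continue; failOpen=false blocks it.
func fallbackResult(err error, failOpen bool) ContractResult {
	if failOpen {
		return ContractResult{Pass: true}
	}
	return ContractResult{Error: "agent_review: " + err.Error()}
}

// RunWithTimeout returns the reviewer's raw output, or a non-nil fallback
// result when the reviewer is missing, errors out, or exceeds the timeout.
// Callers are expected to pass a positive timeout (e.g. 120 seconds).
func RunWithTimeout(ctx context.Context, r ReviewRunner, prompt string,
	timeout time.Duration, failOpen bool) (string, *ContractResult) {
	if r == nil {
		res := fallbackResult(ErrReviewerUnavailable, failOpen)
		return "", &res
	}
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	raw, err := r.RunReview(ctx, prompt)
	if err != nil { // includes context.DeadlineExceeded if the runner honors ctx
		res := fallbackResult(err, failOpen)
		return "", &res
	}
	return raw, nil
}
```

Timeout enforcement here depends on the runner honoring context cancellation; a runner that ignores ctx would need its subprocess killed by other means.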

3. Wire adapter runner into contract validation

Scope: internal/pipeline/executor.go, internal/contract/

Currently validateContract() is purely mechanical — no adapter access. The agent_review type needs to spawn a subprocess.

Change: Pass the AdapterRunner into the contract validation path as an optional dependency. Existing types ignore it.

type ValidatorContext struct {
    Runner       adapter.AdapterRunner  // for agent_review
    WorkspaceDir string                 // for git_diff
    ArtifactDir  string                 // for context injection
    Manifest     *manifest.Manifest     // for persona resolution
}
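
To illustrate "optional dependency, existing types ignore it", here is a sketch with a toy mechanical validator next to a review validator. The Validator interface, the single-method AdapterRunner, and both validator types are assumptions for illustration; the real adapter.AdapterRunner may look different, and the Manifest field is omitted to keep the example self-contained.

```go
package main

import "errors"

type ContractResult struct {
	Pass  bool
	Error string
}

// AdapterRunner is a hypothetical stand-in for adapter.AdapterRunner.
type AdapterRunner interface {
	Run(prompt string) (string, error)
}

type ValidatorContext struct {
	Runner       AdapterRunner // nil unless a contract needs to spawn an agent
	WorkspaceDir string        // for git_diff
	ArtifactDir  string        // for context injection
}

// Validator (hypothetical) is the common interface all contract types share.
type Validator interface {
	Validate(vc ValidatorContext, output []byte) (ContractResult, error)
}

var ErrNoRunner = errors.New("agent_review: no adapter runner in validator context")

// nonEmpty is a toy mechanical validator: it never touches vc.Runner.
type nonEmpty struct{}

func (nonEmpty) Validate(_ ValidatorContext, output []byte) (ContractResult, error) {
	if len(output) == 0 {
		return ContractResult{Error: "empty output"}, nil
	}
	return ContractResult{Pass: true}, nil
}

// agentReview is the only validator that requires the runner.
type agentReview struct{}

func (agentReview) Validate(vc ValidatorContext, output []byte) (ContractResult, error) {
	if vc.Runner == nil {
		return ContractResult{}, ErrNoRunner
	}
	reply, err := vc.Runner.Run("Review this output:\n" + string(output))
	if err != nil {
		return ContractResult{}, err
	}
	// Placeholder: a real validator would parse reply into ReviewFeedback.
	return ContractResult{Pass: reply == "pass"}, nil
}

// Compile-time checks that both types satisfy the interface.
var (
	_ Validator = nonEmpty{}
	_ Validator = agentReview{}
)
```

Because the context is passed by value with a nil Runner by default, the executor can construct it once per step without knowing which contract types are configured.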

Acceptance criteria:

  • Contract validators receive ValidatorContext with optional adapter
  • Existing validators (json_schema, test_suite, etc.) unchanged
  • agent_review validator uses the adapter to spawn review agent
  • Review agent runs in the step's worktree (read-only)
  • Review agent tokens tracked in cost ledger

4. Feed review feedback into rework steps

Scope: internal/pipeline/executor.go — rework step creation path

When on_failure: rework triggers after an agent_review failure, the rework step should receive the full ReviewFeedback as an injected artifact, not just an error string.

Change: In the rework step creation path, if ContractResult.Feedback is non-nil:

  1. Write ReviewFeedback to .wave/artifacts/review-feedback.json
  2. Inject it into the rework step's artifact context
  3. The rework step's prompt can reference specific issues

Rework step sees:

The implementation was reviewed by navigator and needs rework.

Review verdict: rework
Issues:
  - [critical] Missing error handling in handlers_compare.go:45
  - [major] Tests don't cover the empty-state path
Suggestions:
  - Add a table-driven test for the comparison edge cases

Full review: .wave/artifacts/review-feedback.json
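
The two pieces of that path (persisting the feedback, rendering the prompt section) could be sketched like this. WriteFeedback and RenderReworkPrompt are hypothetical names; appending the File field in parentheses when it is set is an assumption, since the example above carries the location inside Detail.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

type Issue struct {
	Severity string `json:"severity"`
	File     string `json:"file,omitempty"`
	Detail   string `json:"detail"`
}

type ReviewFeedback struct {
	Verdict     string   `json:"verdict"`
	Issues      []Issue  `json:"issues"`
	Suggestions []string `json:"suggestions"`
	Confidence  float64  `json:"confidence"`
}

// WriteFeedback (hypothetical) stores the feedback as review-feedback.json
// inside artifactDir and returns the path it wrote to.
func WriteFeedback(artifactDir string, fb ReviewFeedback) (string, error) {
	data, err := json.MarshalIndent(fb, "", "  ")
	if err != nil {
		return "", err
	}
	if err := os.MkdirAll(artifactDir, 0o755); err != nil {
		return "", err
	}
	path := filepath.Join(artifactDir, "review-feedback.json")
	return path, os.WriteFile(path, data, 0o644)
}

// RenderReworkPrompt (hypothetical) produces the text block the rework step sees.
func RenderReworkPrompt(reviewer, feedbackPath string, fb ReviewFeedback) string {
	var b strings.Builder
	fmt.Fprintf(&b, "The implementation was reviewed by %s and needs rework.\n\n", reviewer)
	fmt.Fprintf(&b, "Review verdict: %s\n", fb.Verdict)
	if len(fb.Issues) > 0 {
		b.WriteString("Issues:\n")
		for _, is := range fb.Issues {
			fmt.Fprintf(&b, "  - [%s] %s", is.Severity, is.Detail)
			if is.File != "" {
				fmt.Fprintf(&b, " (%s)", is.File)
			}
			b.WriteString("\n")
		}
	}
	if len(fb.Suggestions) > 0 {
		b.WriteString("Suggestions:\n")
		for _, s := range fb.Suggestions {
			fmt.Fprintf(&b, "  - %s\n", s)
		}
	}
	fmt.Fprintf(&b, "\nFull review: %s\n", feedbackPath)
	return b.String()
}
```

Keeping the renderer a pure function of ReviewFeedback makes the prompt template easy to test without touching the filesystem.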

Acceptance criteria:

  • Review feedback written to .wave/artifacts/review-feedback.json on rework
  • Rework step receives feedback as injected artifact
  • Rework prompt template includes review issues
  • Feedback artifact cleaned up after successful rework

5. Add git_diff as automatic context source

Scope: internal/workspace/, internal/contract/

The reviewer needs to see what the step changed. Add a workspace method to produce the git diff, and wire it as an automatic context source for agent_review.

// In workspace package
func (w *Workspace) Diff() (string, error) {
    // git diff HEAD in the worktree
}

When context includes source: git_diff, the contract validator calls workspace.Diff() and includes the output in the reviewer's context.

Acceptance criteria:

  • Workspace.Diff() returns the uncommitted diff in the worktree
  • Handles clean worktrees (returns empty diff, not error)
  • Diff truncated at configurable limit (default 50KB) to avoid blowing context
  • Available as source: git_diff in agent_review context config
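
A sketch of the diff and truncation behavior, written as free functions because the Workspace type's fields are not shown here. Diff, TruncateDiff and DefaultDiffLimit are hypothetical names; cutting at the last complete line and appending a marker is one possible truncation policy, not a stated requirement.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// DefaultDiffLimit is the 50KB default mentioned in the acceptance criteria.
const DefaultDiffLimit = 50 * 1024

// TruncateDiff keeps at most limit bytes of diff, cutting at the last complete
// line, and appends a marker so the reviewer knows the diff is partial.
// A non-positive limit disables truncation.
func TruncateDiff(diff string, limit int) string {
	if limit <= 0 || len(diff) <= limit {
		return diff
	}
	kept := diff[:limit]
	if i := strings.LastIndexByte(kept, '\n'); i >= 0 {
		kept = kept[:i+1] // drop the partial trailing line
	} else {
		kept += "\n" // a single over-long line: keep the prefix
	}
	return kept + fmt.Sprintf("[diff truncated to fit %d-byte limit; original was %d bytes]\n",
		limit, len(diff))
}

// Diff runs `git diff HEAD` in worktreeDir. A clean worktree yields an
// empty string and a nil error, because git exits 0 in that case.
func Diff(worktreeDir string) (string, error) {
	out, err := exec.Command("git", "-C", worktreeDir, "diff", "HEAD").Output()
	if err != nil {
		return "", fmt.Errorf("git diff HEAD in %s: %w", worktreeDir, err)
	}
	return TruncateDiff(string(out), DefaultDiffLimit), nil
}
```

Note that `git diff HEAD` does not include untracked files, so new files a step created but never staged would be invisible to the reviewer unless they are added to the index first.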

6. Contract composition — run multiple contracts in sequence

Scope: internal/contract/, internal/pipeline/executor.go

Today a step has one contract. For agent review to work properly, steps need multiple contracts that run in order: mechanical first, agent review second.

handover:
  contracts:  # plural — ordered list
    - type: test_suite
      command: "{{ project.contract_test_command }}"
      on_failure: retry
    - type: agent_review
      reviewer: navigator
      on_failure: rework

The executor runs contracts sequentially. If an early contract fails, later ones are skipped. Each contract can have its own on_failure policy.

Acceptance criteria:

  • handover.contracts (plural) accepted alongside existing handover.contract (singular)
  • Contracts run in definition order
  • Early failure skips remaining contracts
  • Each contract has independent on_failure policy
  • Singular contract still works (backward-compatible)
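
The sequencing rules above can be sketched as follows. Contract.Check, Handover.Ordered, Outcome and RunContracts are hypothetical names; letting the plural list win when both keys are present is an assumption, since the issue only says the two forms are accepted alongside each other.

```go
package main

type ContractResult struct {
	Pass  bool
	Error string
}

// Contract (hypothetical) pairs a check with its own failure policy.
type Contract struct {
	Type      string
	OnFailure string // e.g. "retry" or "rework"
	Check     func() ContractResult
}

// Handover accepts both the legacy singular form and the new ordered list.
type Handover struct {
	Contract  *Contract
	Contracts []Contract
}

// Ordered normalizes both forms into one list, in definition order.
func (h Handover) Ordered() []Contract {
	if len(h.Contracts) > 0 {
		return h.Contracts // assumption: plural wins if both are set
	}
	if h.Contract != nil {
		return []Contract{*h.Contract}
	}
	return nil
}

// Outcome reports which contract stopped the sequence, if any.
type Outcome struct {
	Passed    bool
	FailedAt  string // Type of the failing contract
	OnFailure string // that contract's own policy
	Error     string
	Ran       int // number of contracts actually executed
}

// RunContracts executes contracts in order and stops at the first failure,
// so an expensive agent_review never runs after a failed test_suite.
func RunContracts(cs []Contract) Outcome {
	for i, c := range cs {
		if res := c.Check(); !res.Pass {
			return Outcome{FailedAt: c.Type, OnFailure: c.OnFailure, Error: res.Error, Ran: i + 1}
		}
	}
	return Outcome{Passed: true, Ran: len(cs)}
}
```

Returning the failing contract's OnFailure lets the executor dispatch retry versus rework without re-reading the pipeline definition.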

7. Upgrade Wave's own pipelines with agent review

Scope: .wave/pipelines/, .wave/contracts/

After the infrastructure is in place, upgrade Wave's own pipelines to use agent_review:

Priority pipelines:

  • impl-issue — review the implement step's output (diff + tests)
  • impl-speckit — review at implement + create-pr steps
  • ops-pr-review — review the review output itself (meta-review)

Review criteria files to create:

  • .wave/contracts/impl-review-criteria.md — does the diff match the plan? tests adequate? no leaked files?
  • .wave/contracts/pr-review-criteria.md — is the PR description accurate? changes scoped correctly?

Acceptance criteria:

  • impl-issue implement step uses agent_review with navigator
  • impl-speckit implement step uses agent_review
  • Review criteria files created and tested
  • At least 3 successful pipeline runs with agent review active
  • False-positive rate < 20% (review doesn't block correct implementations)

8. Observability — review verdicts in dashboard and retros

Scope: internal/webui/, internal/retro/

Make agent reviews visible in the dashboard and retrospectives:

  • Run detail page: Show review verdict per step (pass/rework/fail with expandable issues)
  • Retros: Track review_rework as a friction point type (distinct from retry and contract_failure)
  • Analytics: Review pass rate per pipeline, average review tokens, rework-after-review rate

Acceptance criteria:

  • Run detail shows review verdicts inline with step cards
  • Retro friction points include review_rework type
  • Analytics tracks review token spend

Implementation Order

1. ContractResult expansion          (no dependencies, safe)
2. git_diff context source           (no dependencies, safe)
3. Contract composition              (depends on 1)
4. Wire adapter into validation      (depends on 1)
5. agent_review validator            (depends on 2, 3, 4)
6. Rework feedback injection         (depends on 5)
7. Upgrade Wave's pipelines          (depends on 5, 6)
8. Dashboard + retro observability   (depends on 5)

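Steps 1-2 have no dependencies and can be done in parallel; steps 3-4 can be parallelized once step 1 lands. Step 5 is the core. Steps 7-8 validate the system end-to-end. (Step numbers refer to this ordering, not to the child issue numbers above.)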


Cost Model

Agent reviews add token cost per step. With haiku at ~$0.25/MTok input:

Scenario               Context size   Review cost   Per-pipeline overhead (3 steps)
Small diff (< 5KB)     ~3K tokens     ~$0.001       ~$0.003
Medium diff (5-20KB)   ~10K tokens    ~$0.003       ~$0.009
Large diff (20-50KB)   ~25K tokens    ~$0.006       ~$0.018

At 100 pipeline runs/day with medium diffs: ~$0.90/day additional cost. Negligible compared to the implementation steps themselves.
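
The arithmetic behind these figures, as a small sketch. ReviewCostUSD and DailyOverheadUSD are hypothetical helpers, and the model counts input tokens only at the quoted ~$0.25/MTok; the reviewer's output tokens are not included.

```go
package main

// ReviewCostUSD returns the input-token cost of a review in US dollars.
func ReviewCostUSD(tokens int, usdPerMTok float64) float64 {
	return float64(tokens) * usdPerMTok / 1_000_000
}

// DailyOverheadUSD multiplies the token count out first, then prices the
// total, which avoids accumulating rounding error across many small reviews.
func DailyOverheadUSD(tokensPerReview, reviewsPerRun, runsPerDay int, usdPerMTok float64) float64 {
	return ReviewCostUSD(tokensPerReview*reviewsPerRun*runsPerDay, usdPerMTok)
}
```

For the medium-diff case, ReviewCostUSD(10000, 0.25) comes to $0.0025 per review, and DailyOverheadUSD(10000, 3, 100, 0.25) to $0.75 per day. The ~$0.90/day figure follows from rounding each review up to ~$0.003 before multiplying; either way the overhead stays under a dollar a day at this volume.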


Non-Goals

  • Replacing mechanical contracts: test_suite and json_schema remain. Agent review supplements, doesn't replace
  • Blocking on review for every step — agent review is opt-in per step, not global
  • Self-review — the step persona reviewing its own output. This is architecturally prevented
  • Human-in-the-loop review — that's the existing gate mechanism. Agent review is fully automated

Labels: epic (Multi-issue initiative)
