epic: agent-reviewed contracts — LLM validation gates with rework feedback #697
Description
Vision
Transform Wave's contract system from mechanical format checks into real quality gates where a separate agent reviews work products, provides structured feedback, and drives rework loops. This closes the gap between "output matches schema" and "output is actually correct."
Today a pipeline step can pass all contracts (JSON schema valid, tests green) while producing a no-op PR, wrong implementation, or missing the issue entirely. The contract system validates shape, not substance.
Target state
handover:
contract:
# Mechanical checks run first (fast, cheap)
- type: test_suite
command: "{{ project.contract_test_command }}"
on_failure: retry
# Agent review runs second (slower, costs tokens, but catches quality issues)
- type: agent_review
reviewer: navigator
model: claude-haiku
criteria_path: .wave/contracts/impl-review-criteria.md
context:
- artifact: assessment
- artifact: impl-plan
- source: git_diff
on_failure: rework
rework_step: fix-implement
The reviewer agent sees the git diff, upstream context (issue, plan), and review criteria. It outputs a structured verdict. If rework is needed, the feedback is injected as an artifact into the rework step — the implementer knows exactly what to fix.
Principles
- Separation of concerns — the agent that did the work never reviews its own work
- Cheap first, expensive second — mechanical checks (schema, tests) run before agent review to avoid wasting tokens on obviously broken output
- Feedback flows forward — review feedback is a first-class artifact, not a log message
- Invisible to simple pipelines — existing json_schema/test_suite contracts work unchanged; agent_review is opt-in per step
- Cost-bounded — agent reviews have token budgets and model pinning (haiku by default)
Child Issues
1. Expand ContractResult with structured feedback
Scope: internal/contract/
Extend ContractResult to carry rich review output alongside the existing pass/fail:
type ContractResult struct {
Pass bool
Error string
Feedback *ReviewFeedback // nil for mechanical contracts
}
type ReviewFeedback struct {
Verdict string `json:"verdict"` // "pass", "rework", "fail"
Issues []Issue `json:"issues"` // specific problems
Suggestions []string `json:"suggestions"` // improvement ideas
Confidence float64 `json:"confidence"` // 0.0-1.0
}
type Issue struct {
Severity string `json:"severity"` // "critical", "major", "minor"
File string `json:"file,omitempty"`
Detail string `json:"detail"`
}
Backward-compatible — all existing validators return Feedback: nil. The executor checks Feedback only when it is non-nil.
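A minimal sketch of that executor behavior, assuming the struct shapes above; needsRework is a hypothetical helper for illustration, not existing Wave code:

```go
package main

import "fmt"

// Types from the proposal (sketch).
type Issue struct {
	Severity string `json:"severity"`
	File     string `json:"file,omitempty"`
	Detail   string `json:"detail"`
}

type ReviewFeedback struct {
	Verdict     string   `json:"verdict"`
	Issues      []Issue  `json:"issues"`
	Suggestions []string `json:"suggestions"`
	Confidence  float64  `json:"confidence"`
}

type ContractResult struct {
	Pass     bool
	Error    string
	Feedback *ReviewFeedback // nil for mechanical contracts
}

// needsRework (hypothetical): the executor inspects Feedback only when
// it is non-nil, so mechanical contracts behave exactly as before.
func needsRework(r ContractResult) bool {
	if r.Pass {
		return false
	}
	return r.Feedback != nil && r.Feedback.Verdict == "rework"
}

func main() {
	mechanical := ContractResult{Pass: false, Error: "tests failed"}
	reviewed := ContractResult{Pass: false, Feedback: &ReviewFeedback{Verdict: "rework"}}
	fmt.Println(needsRework(mechanical)) // false: plain failure, no feedback
	fmt.Println(needsRework(reviewed))   // true: the agent asked for rework
}
```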
Acceptance criteria:
- ContractResult has an optional Feedback field
- Existing contract types unaffected (return nil feedback)
- ReviewFeedback has JSON tags for serialization
- Executor logs feedback when present
- All existing tests pass unchanged
2. Implement agent_review contract validator
Scope: internal/contract/agent_review.go (new file)
New contract type that spawns a lightweight agent to review step output.
Input to the reviewer:
- Git diff of the step's worktree changes
- Step output artifacts
- Upstream artifacts specified in the context config
- Review criteria from criteria_path
- A system prompt enforcing structured JSON output
Output: ReviewFeedback JSON parsed from the agent's response.
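A sketch of that parsing step, assuming the reviewer may wrap its JSON in prose; parseFeedback and the first-brace/last-brace heuristic are illustrative, not the actual implementation:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

type ReviewFeedback struct {
	Verdict     string   `json:"verdict"`
	Suggestions []string `json:"suggestions"`
	Confidence  float64  `json:"confidence"`
}

// parseFeedback (hypothetical): agents sometimes wrap JSON in prose,
// so take the substring from the first '{' to the last '}'.
func parseFeedback(raw string) (*ReviewFeedback, error) {
	start := strings.Index(raw, "{")
	end := strings.LastIndex(raw, "}")
	if start < 0 || end < start {
		return nil, fmt.Errorf("no JSON object in reviewer output")
	}
	var fb ReviewFeedback
	if err := json.Unmarshal([]byte(raw[start:end+1]), &fb); err != nil {
		return nil, err
	}
	return &fb, nil
}

func main() {
	raw := "Here is my review:\n{\"verdict\": \"rework\", \"confidence\": 0.8}\nDone."
	fb, err := parseFeedback(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(fb.Verdict, fb.Confidence) // rework 0.8
}
```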
Configuration:
type: agent_review
reviewer: navigator # persona name — MUST differ from step persona
model: claude-haiku # model override (cheap by default)
criteria_path: .wave/contracts/review.md # review prompt
context: # what the reviewer sees
- artifact: assessment # from named prior step artifact
- artifact: impl-plan
- source: git_diff # automatic: worktree diff
max_tokens: 8192 # budget cap for the review
timeout: 120 # seconds
Key constraint: The reviewer persona must be different from the step's persona. The validator should enforce this at parse time.
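The parse-time check could look like this; validatePersonas is a hypothetical helper name, not existing code:

```go
package main

import "fmt"

// validatePersonas (hypothetical): enforce the key constraint above —
// the reviewer persona must be set and must differ from the step's.
func validatePersonas(stepPersona, reviewer string) error {
	if reviewer == "" {
		return fmt.Errorf("agent_review: reviewer persona is required")
	}
	if reviewer == stepPersona {
		return fmt.Errorf("agent_review: reviewer %q must differ from step persona", reviewer)
	}
	return nil
}

func main() {
	fmt.Println(validatePersonas("implementer", "implementer")) // rejected at parse time
	fmt.Println(validatePersonas("implementer", "navigator"))   // ok
}
```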
Acceptance criteria:
- agent_review registered as a contract type
- Reviewer sees git diff + configured context artifacts
- Structured ReviewFeedback parsed from agent output
- Token budget enforced (max_tokens)
- Timeout enforced
- Error if reviewer persona == step persona
- Falls back to ContractResult{Pass: true} if the agent is unavailable (fail-open, configurable)
3. Wire adapter runner into contract validation
Scope: internal/pipeline/executor.go, internal/contract/
Currently validateContract() is purely mechanical — no adapter access. The agent_review type needs to spawn a subprocess.
Change: Pass the AdapterRunner into the contract validation path as an optional dependency. Existing types ignore it.
type ValidatorContext struct {
Runner adapter.AdapterRunner // for agent_review
WorkspaceDir string // for git_diff
ArtifactDir string // for context injection
Manifest *manifest.Manifest // for persona resolution
}
Acceptance criteria:
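One way the optional dependency stays non-breaking, sketched with a hypothetical Validator interface and an alwaysPass stand-in for an existing mechanical validator (names are illustrative):

```go
package main

import "fmt"

type ValidatorContext struct {
	WorkspaceDir string
	// Runner, ArtifactDir, Manifest elided for this sketch.
}

type ContractResult struct {
	Pass  bool
	Error string
}

// Validator (hypothetical): every contract type receives the context
// as an optional dependency; mechanical validators simply ignore it.
type Validator interface {
	Validate(ctx ValidatorContext) ContractResult
}

// alwaysPass stands in for an existing mechanical validator that
// never touches ValidatorContext, so wiring it in is non-breaking.
type alwaysPass struct{}

func (alwaysPass) Validate(_ ValidatorContext) ContractResult {
	return ContractResult{Pass: true}
}

func main() {
	var v Validator = alwaysPass{}
	res := v.Validate(ValidatorContext{WorkspaceDir: "/tmp/worktree"})
	fmt.Println(res.Pass) // true
}
```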
- Contract validators receive ValidatorContext with an optional adapter
- Existing validators (json_schema, test_suite, etc.) unchanged
- agent_review validator uses the adapter to spawn the review agent
- Review agent runs in the step's worktree (read-only)
- Review agent tokens tracked in the cost ledger
4. Feed review feedback into rework steps
Scope: internal/pipeline/executor.go — rework step creation path
When on_failure: rework triggers after an agent_review failure, the rework step should receive the full ReviewFeedback as an injected artifact, not just an error string.
Change: In the rework step creation path, if ContractResult.Feedback is non-nil:
- Write ReviewFeedback to .wave/artifacts/review-feedback.json
- Inject it into the rework step's artifact context
- The rework step's prompt can reference specific issues
Rework step sees:
The implementation was reviewed by navigator and needs rework.
Review verdict: rework
Issues:
- [critical] Missing error handling in handlers_compare.go:45
- [major] Tests don't cover the empty-state path
Suggestions:
- Add a table-driven test for the comparison edge cases
Full review: .wave/artifacts/review-feedback.json
Acceptance criteria:
- Review feedback written to .wave/artifacts/review-feedback.json on rework
- Rework step receives feedback as an injected artifact
- Rework prompt template includes review issues
- Feedback artifact cleaned up after successful rework
5. Add git_diff as automatic context source
Scope: internal/workspace/, internal/contract/
The reviewer needs to see what the step changed. Add a workspace method to produce the git diff, and wire it as an automatic context source for agent_review.
// In workspace package
func (w *Workspace) Diff() (string, error) {
// git diff HEAD in the worktree
}
When context includes source: git_diff, the contract validator calls workspace.Diff() and includes the output in the reviewer's context.
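A runnable sketch of the proposed method under the stated constraints (empty diff on a clean worktree, truncation at the cap); Diff and truncate here are free functions for illustration, not the actual Workspace method:

```go
package main

import (
	"fmt"
	"os/exec"
)

// maxDiffBytes mirrors the proposed default cap (50KB) on reviewer context.
const maxDiffBytes = 50 * 1024

// truncate caps the diff and appends a marker so the reviewer knows
// the output was cut; shorter input passes through untouched.
func truncate(b []byte, max int) []byte {
	if len(b) <= max {
		return b
	}
	return append(b[:max:max], []byte("\n[diff truncated]")...)
}

// Diff (sketch): run `git diff HEAD` in the worktree. A clean worktree
// yields an empty string, not an error.
func Diff(worktree string) (string, error) {
	cmd := exec.Command("git", "diff", "HEAD")
	cmd.Dir = worktree
	out, err := cmd.Output()
	if err != nil {
		return "", fmt.Errorf("git diff in %s: %w", worktree, err)
	}
	return string(truncate(out, maxDiffBytes)), nil
}

func main() {
	if d, err := Diff("."); err == nil {
		fmt.Printf("diff is %d bytes\n", len(d))
	} else {
		fmt.Println("not a git worktree here:", err)
	}
}
```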
Acceptance criteria:
- Workspace.Diff() returns the uncommitted diff in the worktree
- Handles clean worktrees (returns an empty diff, not an error)
- Diff truncated at a configurable limit (default 50KB) to avoid blowing the context window
- Available as source: git_diff in agent_review context config
6. Contract composition — run multiple contracts in sequence
Scope: internal/contract/, internal/pipeline/executor.go
Today a step has one contract. For agent review to work properly, steps need multiple contracts that run in order: mechanical first, agent review second.
handover:
contracts: # plural — ordered list
- type: test_suite
command: "{{ project.contract_test_command }}"
on_failure: retry
- type: agent_review
reviewer: navigator
on_failure: rework
The executor runs contracts sequentially. If an early contract fails, later ones are skipped. Each contract can have its own on_failure policy.
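The sequential semantics can be sketched as a simple loop; Contract and runContracts are illustrative shapes, not Wave's actual types:

```go
package main

import "fmt"

type ContractResult struct{ Pass bool }

type Contract struct {
	Name      string
	OnFailure string // "retry", "rework", "fail"
	Run       func() ContractResult
}

// runContracts (sketch): contracts run in definition order; the first
// failure stops the chain and surfaces that contract's own policy.
func runContracts(contracts []Contract) (ok bool, policy string) {
	for _, c := range contracts {
		if !c.Run().Pass {
			return false, c.OnFailure
		}
	}
	return true, ""
}

func main() {
	contracts := []Contract{
		{Name: "test_suite", OnFailure: "retry", Run: func() ContractResult { return ContractResult{Pass: true} }},
		{Name: "agent_review", OnFailure: "rework", Run: func() ContractResult { return ContractResult{Pass: false} }},
	}
	ok, policy := runContracts(contracts)
	fmt.Println(ok, policy) // false rework
}
```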
Acceptance criteria:
- handover.contracts (plural) accepted alongside the existing handover.contract (singular)
- Contracts run in definition order
- Early failure skips remaining contracts
- Each contract has an independent on_failure policy
- Singular contract still works (backward-compatible)
7. Upgrade Wave's own pipelines with agent review
Scope: .wave/pipelines/, .wave/contracts/
After the infrastructure is in place, upgrade Wave's own pipelines to use agent_review:
Priority pipelines:
- impl-issue — review the implement step's output (diff + tests)
- impl-speckit — review at the implement + create-pr steps
- ops-pr-review — review the review output itself (meta-review)
Review criteria files to create:
- .wave/contracts/impl-review-criteria.md — does the diff match the plan? are the tests adequate? no leaked files?
- .wave/contracts/pr-review-criteria.md — is the PR description accurate? are the changes scoped correctly?
Acceptance criteria:
- impl-issue implement step uses agent_review with navigator
- impl-speckit implement step uses agent_review
- Review criteria files created and tested
- At least 3 successful pipeline runs with agent review active
- False-positive rate < 20% (review doesn't block correct implementations)
8. Observability — review verdicts in dashboard and retros
Scope: internal/webui/, internal/retro/
Make agent reviews visible in the dashboard and retrospectives:
- Run detail page: Show review verdict per step (pass/rework/fail with expandable issues)
- Retros: Track review_rework as a friction-point type (distinct from retry and contract_failure)
- Analytics: Review pass rate per pipeline, average review tokens, rework-after-review rate
Acceptance criteria:
- Run detail shows review verdicts inline with step cards
- Retro friction points include the review_rework type
- Analytics tracks review token spend
Implementation Order
1. ContractResult expansion (no dependencies, safe)
2. git_diff context source (no dependencies, safe)
3. Contract composition (depends on 1)
4. Wire adapter into validation (depends on 1)
5. agent_review validator (depends on 2, 3, 4)
6. Rework feedback injection (depends on 5)
7. Upgrade Wave's pipelines (depends on 5, 6)
8. Dashboard + retro observability (depends on 5)
Issues 1-4 can be parallelized. Issue 5 is the core. Issues 7-8 validate the system end-to-end.
Cost Model
Agent reviews add token cost per step. With haiku at ~$0.25/MTok input:
| Scenario | Context size | Review cost | Per-pipeline overhead |
|---|---|---|---|
| Small diff (< 5KB) | ~3K tokens | ~$0.001 | ~$0.003 (3 steps) |
| Medium diff (5-20KB) | ~10K tokens | ~$0.003 | ~$0.009 |
| Large diff (20-50KB) | ~25K tokens | ~$0.006 | ~$0.018 |
At 100 pipeline runs/day with medium diffs: ~$0.90/day additional cost. Negligible compared to the implementation steps themselves.
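A quick sanity check of the arithmetic, assuming the $0.25/MTok input price; note the table rounds the exact per-review figure ($0.0025) up to $0.003:

```go
package main

import "fmt"

// reviewCost: back-of-envelope cost for a review's input context,
// assuming haiku input at $0.25 per million tokens.
func reviewCost(tokens int) float64 {
	return float64(tokens) * 0.25 / 1_000_000
}

func main() {
	perReview := reviewCost(10_000)       // medium diff, ~10K tokens
	fmt.Printf("%.4f\n", perReview)       // 0.0025 — the table rounds this to $0.003
	fmt.Printf("%.2f\n", perReview*3*100) // 0.75 — 3 steps x 100 runs/day, vs ~$0.90 with rounding
}
```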
Non-Goals
- Replacing mechanical contracts — test_suite and json_schema remain. Agent review supplements them, it doesn't replace them
- Blocking on review for every step — agent review is opt-in per step, not global
- Self-review — the step persona reviewing its own output. This is architecturally prevented
- Human-in-the-loop review — that's the existing gate mechanism. Agent review is fully automated