Add Auto Review proof metrics and dogfood diagnostics #330

## Summary

Add proof instrumentation so the durable Auto Review concept can be evaluated with data, not vibes.

## Scope

- Emit structured counters/events for run lifecycle, duplicate reuse/skips, supersede/cancel reasons, findings surfaced/inspected/applied/dismissed, ledger tokens, detail tokens, and token-spend estimates.
- Add a diagnostic surface such as `/review-stats` if useful.
- Build deterministic scanners or fixtures for stale/superseded/duplicate/fix-train signatures from logs/rollouts where appropriate.
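A deterministic scanner for duplicate signatures could look roughly like the sketch below. It assumes a JSON-lines log in which each record has hypothetical `event`, `run_id`, and `signature` fields; none of these names come from the actual Auto Review log format, so treat them as placeholders.

```python
import json
from collections import defaultdict

def scan_duplicate_reviews(log_lines):
    """Return {signature: [run_id, ...]} for signatures that started more than one review.

    Assumed (hypothetical) record shape:
    {"event": "review_started", "run_id": "...", "signature": "..."}
    """
    runs_by_signature = defaultdict(list)
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines so the scan stays deterministic
        if not isinstance(record, dict):
            continue
        if record.get("event") != "review_started":
            continue
        signature = record.get("signature")
        run_id = record.get("run_id")
        if signature is None or run_id is None:
            continue
        runs_by_signature[signature].append(run_id)
    # Sort by signature so repeated scans of the same log give identical output.
    return {
        sig: ids
        for sig, ids in sorted(runs_by_signature.items())
        if len(ids) > 1
    }

# A tiny fixture: "abc" is reviewed twice, "def" once, one line is garbage.
FIXTURE = [
    '{"event": "review_started", "run_id": "r1", "signature": "abc"}',
    '{"event": "review_finished", "run_id": "r1", "signature": "abc"}',
    '{"event": "review_started", "run_id": "r2", "signature": "abc"}',
    '{"event": "review_started", "run_id": "r3", "signature": "def"}',
    "not json",
]

duplicates = scan_duplicate_reviews(FIXTURE)
```

Because the scanner only reads lines and never touches a live session, the same fixture can back a unit test for duplicate avoidance without a TUI.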

## Acceptance Criteria

- [ ] Metrics include duplicate review rate, skipped/adopted/superseded/cancelled counts, unsurfaced terminal findings, ledger overhead, avoided token estimate, time to surface findings, and finding usefulness/disposition.
- [ ] Each Auto Review run records enough proof data to explain latency: model, reasoning effort, resolve model/effort, phase timing, follow-up count, token count when available, prompt token estimate, and terminal reason.
- [ ] Restart recovery and duplicate avoidance are testable without a live TUI where possible.
- [ ] Dogfood diagnostics can compare before/after behavior across real sessions and identify whether slowness came from first review pass, follow-up loops, worktree/lock contention, retries, or prompt bloat.
- [ ] Metrics do not inject bulky telemetry into normal assistant context; ordinary turns receive only bounded actionable review state.
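As an illustration of the per-run proof data and the derived metrics listed above, here is a minimal sketch. `RunRecord`, its field names, the phase names, and the terminal-reason strings are all hypothetical; the definition of duplicate review rate used here (share of runs that ended as a skipped duplicate) is also an assumption that the real metric may not share.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class RunRecord:
    """Hypothetical per-run proof record; not the real AutoReviewRun type."""
    run_id: str
    model: str
    reasoning_effort: str
    terminal_reason: str  # assumed values: "completed", "skipped_duplicate", ...
    phase_seconds: Dict[str, float] = field(default_factory=dict)
    follow_up_count: int = 0
    prompt_token_estimate: Optional[int] = None

def terminal_reason_counts(runs: List[RunRecord]) -> Dict[str, int]:
    """Count runs per terminal reason (skipped, superseded, cancelled, ...)."""
    return dict(Counter(run.terminal_reason for run in runs))

def duplicate_review_rate(runs: List[RunRecord]) -> float:
    """Share of runs that ended as a skipped duplicate (assumed definition)."""
    if not runs:
        return 0.0
    duplicates = sum(1 for run in runs if run.terminal_reason == "skipped_duplicate")
    return duplicates / len(runs)

def slowest_phase(run: RunRecord) -> Optional[str]:
    """Name the phase with the largest recorded duration, or None if untimed."""
    if not run.phase_seconds:
        return None
    return max(run.phase_seconds, key=run.phase_seconds.get)

runs = [
    RunRecord(
        run_id="r1",
        model="example-model",
        reasoning_effort="high",
        terminal_reason="completed",
        phase_seconds={"first_pass": 40.0, "follow_ups": 12.5, "lock_wait": 3.0},
        follow_up_count=2,
    ),
    RunRecord(
        run_id="r2",
        model="example-model",
        reasoning_effort="high",
        terminal_reason="skipped_duplicate",
    ),
]
```

With phase timings recorded per run, a question like "was the first pass or the follow-up loop slow?" reduces to a lookup such as `slowest_phase(runs[0])` rather than guesswork.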

## Relationships

Parent: #324
Depends on: #325, #327, #329
Related: #43, #50

## Finish Line

Every Code emits enough Auto Review metrics and diagnostics to prove duplicate review reduction, avoided token spend, surfaced findings, ledger overhead, restart recovery, and finding usefulness during dogfooding.

## Current Status

State: Planned after #329. This is the proof layer for the Auto Review love gate.

Recent evidence to carry forward:
- Current config can be correct while reviews still feel slow; diagnostics need to show which model/reasoning settings were actually used by each background Auto Review run.
- Lowering Auto Review follow-ups reduces worst-case loop count but does not speed the initial review pass, so phase timing matters.
- `prompt_token_estimate` exists on `AutoReviewRun` but is not currently populated, leaving prompt bloat mostly unprovable.
- Dogfooding needs before/after comparisons for duplicate review spend, stale review confusion, ledger overhead, and finding usefulness.
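One cheap way to start populating `prompt_token_estimate` is a character-count heuristic, sketched below. The four-characters-per-token ratio is a rough rule of thumb rather than a real tokenizer, and the function names and section labels are hypothetical.

```python
import math
from typing import Dict, Iterable

# Rough heuristic only; a real tokenizer would give different counts.
CHARS_PER_TOKEN = 4

def estimate_prompt_tokens(prompt_parts: Iterable[str]) -> int:
    """Estimate the token count of a prompt assembled from several text parts."""
    total_chars = sum(len(part) for part in prompt_parts)
    return math.ceil(total_chars / CHARS_PER_TOKEN)

def estimate_by_section(sections: Dict[str, str]) -> Dict[str, int]:
    """Estimate tokens per named prompt section to show where bloat comes from."""
    return {
        name: math.ceil(len(text) / CHARS_PER_TOKEN)
        for name, text in sections.items()
    }

# Hypothetical section labels for a review prompt.
breakdown = estimate_by_section({"diff": "x" * 400, "ledger": "y" * 10})
```

A per-section breakdown is what would make prompt bloat attributable: a large `ledger` share points at ledger overhead, while a large `diff` share points at the change under review.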

Next action after #329: populate prompt/token/timing/disposition fields and expose compact diagnostics for dogfood comparison without adding normal-turn context bulk.
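To illustrate "compact diagnostics without adding normal-turn context bulk", the sketch below caps both the number of findings and the total characters handed to an ordinary turn. The finding shape (`title`, `disposition`) and the default limits are assumptions made for illustration, not existing behavior.

```python
from typing import Dict, List

def compact_review_state(
    findings: List[Dict[str, str]],
    max_items: int = 3,
    max_chars: int = 400,
) -> str:
    """Render open findings as a bounded summary (assumes max_chars >= 3)."""
    open_findings = [f for f in findings if f.get("disposition") == "open"]
    lines = [f"- {f['title']}" for f in open_findings[:max_items]]
    hidden = len(open_findings) - len(lines)
    if hidden > 0:
        # Point at the remainder instead of inlining it into the turn.
        lines.append(f"(+{hidden} more)")
    text = "\n".join(lines)
    if len(text) > max_chars:
        # Hard cap so a long title cannot blow the budget.
        text = text[: max_chars - 3] + "..."
    return text

# Hypothetical findings: five open, one already dismissed.
findings = [{"title": f"Finding {i}", "disposition": "open"} for i in range(5)]
findings.append({"title": "Old finding", "disposition": "dismissed"})

summary = compact_review_state(findings)
```

Full run records and counters would stay in the diagnostic surface; only this bounded summary would reach the assistant context.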

Blocked by: #329 for dedupe outcome fields and policy events.

Last verified: 2026-06-02 during Auto Review latency/settings planning review.


