Add Auto Review proof metrics and dogfood diagnostics #330

@shiny-code-bot

Summary

Add proof instrumentation so the durable Auto Review concept can be evaluated with data, not vibes.

Scope

  • Emit structured counters/events for run lifecycle, duplicate reuse/skips, supersede/cancel reasons, findings surfaced/inspected/applied/dismissed, ledger tokens, detail tokens, and token-spend estimates.
  • Add a diagnostic surface such as /review-stats if useful.
  • Build deterministic scanners or fixtures for stale/superseded/duplicate/fix-train signatures from logs/rollouts where appropriate.
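A minimal sketch of how the structured events and a deterministic duplicate scanner from the scope above could fit together, so the scanner runs against log fixtures rather than a live TUI. Everything here is an assumption for illustration: the `ReviewEvent` shape, the event names, and the `target` fingerprint field are hypothetical and not taken from the existing Auto Review code.

```python
import json
from collections import Counter, defaultdict
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional, Sequence


@dataclass(frozen=True)
class ReviewEvent:
    """One structured lifecycle event (hypothetical shape)."""
    kind: str                     # e.g. "run_started", "run_superseded" (assumed names)
    run_id: str
    target: str                   # assumed fingerprint of the reviewed snapshot
    reason: Optional[str] = None  # supersede/cancel reason, when there is one


def to_log_line(event: ReviewEvent) -> str:
    """Serialize an event as one JSON line with stable key order."""
    return json.dumps(asdict(event), sort_keys=True)


def count_by_kind(lines: Sequence[str]) -> Counter:
    """Count events per kind from JSON-lines log content."""
    return Counter(json.loads(line)["kind"] for line in lines if line.strip())


def find_duplicate_targets(lines: Sequence[str]) -> Dict[str, List[str]]:
    """Return targets that had more than one run started, with their run ids."""
    started: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        if record["kind"] == "run_started":
            started[record["target"]].append(record["run_id"])
    return {target: ids for target, ids in started.items() if len(ids) > 1}


# Fixture: two runs started against the same snapshot, one of them superseded.
fixture = [
    to_log_line(event)
    for event in (
        ReviewEvent("run_started", "r1", "abc"),
        ReviewEvent("run_started", "r2", "abc"),
        ReviewEvent("run_superseded", "r1", "abc", reason="newer_snapshot"),
        ReviewEvent("run_started", "r3", "def"),
    )
]
print(count_by_kind(fixture))
print(find_duplicate_targets(fixture))
```

Because the scanner only reads serialized lines, the same function could be pointed at recorded rollouts or at hand-written fixtures in a unit test.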

Acceptance Criteria

  • Metrics include duplicate review rate, skipped/adopted/superseded/cancelled counts, unsurfaced terminal findings, ledger overhead, avoided token estimate, time to surface findings, and finding usefulness/disposition.
  • Each Auto Review run records enough proof data to explain latency: model, reasoning effort, resolve model/effort, phase timing, follow-up count, token count when available, prompt token estimate, and terminal reason.
  • Restart recovery and duplicate avoidance are testable without a live TUI where possible.
  • Dogfood diagnostics can compare before/after behavior across real sessions and identify whether slowness came from first review pass, follow-up loops, worktree/lock contention, retries, or prompt bloat.
  • Metrics do not inject bulky telemetry into normal assistant context; ordinary turns receive only bounded actionable review state.
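As a rough illustration of the per-run proof data in the criteria above, the sketch below records the listed fields and reports which phase dominated a run's latency. The `RunProof` class and the phase names (`first_pass`, `follow_ups`, `lock_wait`) are hypothetical; the real `AutoReviewRun` fields may be named and typed differently.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class RunProof:
    """Per-run proof data (hypothetical mirror of the fields the criteria list)."""
    model: str
    reasoning_effort: str
    resolve_model: str
    resolve_effort: str
    terminal_reason: str
    follow_up_count: int = 0
    token_count: Optional[int] = None            # only when the provider reports it
    prompt_token_estimate: Optional[int] = None
    phase_seconds: Dict[str, float] = field(default_factory=dict)


def total_seconds(run: RunProof) -> float:
    """Sum of all recorded phase durations."""
    return sum(run.phase_seconds.values())


def slowest_phase(run: RunProof) -> Optional[str]:
    """Name of the phase with the largest duration, or None if nothing was timed."""
    if not run.phase_seconds:
        return None
    return max(run.phase_seconds, key=lambda name: run.phase_seconds[name])


def explain_latency(run: RunProof) -> str:
    """One-line attribution of where the run's time went."""
    phase = slowest_phase(run)
    total = total_seconds(run)
    if phase is None or total <= 0:
        return "no phase timing recorded"
    share = run.phase_seconds[phase] / total
    return (
        f"{phase} took {share:.0%} of {total:.1f}s "
        f"({run.model}/{run.reasoning_effort}, "
        f"{run.follow_up_count} follow-ups, ended: {run.terminal_reason})"
    )


example = RunProof(
    model="model-a",            # placeholder model name
    reasoning_effort="high",
    resolve_model="model-b",    # placeholder model name
    resolve_effort="medium",
    terminal_reason="completed",
    follow_up_count=1,
    prompt_token_estimate=18_000,
    phase_seconds={"first_pass": 42.0, "follow_ups": 12.0, "lock_wait": 6.0},
)
print(explain_latency(example))
```

Keeping phase timings in a plain mapping means new phases such as retries can be added without changing the attribution logic.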

Relationships

Parent: #324
Depends on: #325, #327, #329
Related: #43, #50

Finish Line

Every Code emits enough Auto Review metrics and diagnostics to prove duplicate review reduction, avoided token spend, surfaced findings, ledger overhead, restart recovery, and finding usefulness during dogfooding.

Current Status

State: Planned after #329. This is the proof layer for the Auto Review love gate.

Recent evidence to carry forward:

  • Current config can be correct while reviews still feel slow; diagnostics need to show which model/reasoning settings were actually used by each background Auto Review run.
  • Lowering Auto Review follow-ups reduces worst-case loop count but does not speed the initial review pass, so phase timing matters.
  • prompt_token_estimate exists on AutoReviewRun but is not currently populated, leaving prompt bloat mostly unprovable.
  • Dogfooding needs before/after comparisons for duplicate review spend, stale review confusion, ledger overhead, and finding usefulness.

Next action after #329: populate prompt/token/timing/disposition fields and expose compact diagnostics for dogfood comparison without adding normal-turn context bulk.
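One possible shape for the compact before/after comparison, sketched under assumptions: `SessionStats` and its fields are hypothetical, and "duplicate review rate" is taken here to mean duplicate runs divided by total runs, which the issue does not define. The output is capped at a fixed length so it could be shown without growing normal-turn context.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SessionStats:
    """Aggregated dogfood numbers for a set of sessions (hypothetical shape)."""
    total_runs: int
    duplicate_runs: int
    seconds_to_first_finding: Sequence[float]


def duplicate_rate(stats: SessionStats) -> float:
    """Assumed definition: duplicate runs as a share of all runs."""
    return stats.duplicate_runs / stats.total_runs if stats.total_runs else 0.0


def average(values: Sequence[float]) -> float:
    """Arithmetic mean, or 0.0 when there are no samples."""
    return sum(values) / len(values) if values else 0.0


def compact_line(before: SessionStats, after: SessionStats, limit: int = 120) -> str:
    """Render a single bounded line comparing two sets of sessions."""
    line = (
        f"dup rate {duplicate_rate(before):.0%} -> {duplicate_rate(after):.0%}; "
        f"time to first finding {average(before.seconds_to_first_finding):.0f}s -> "
        f"{average(after.seconds_to_first_finding):.0f}s"
    )
    # Truncate rather than overflow the budget given by `limit`.
    return line if len(line) <= limit else line[: limit - 3] + "..."


before = SessionStats(total_runs=10, duplicate_runs=4, seconds_to_first_finding=[90.0, 110.0])
after = SessionStats(total_runs=10, duplicate_runs=1, seconds_to_first_finding=[40.0, 60.0])
print(compact_line(before, after))
```

The full per-run records would stay in the diagnostic surface; only a line like this one would need to reach an ordinary turn.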

Blocked by: #329 for dedupe outcome fields and policy events.

Last verified: 2026-06-02 during Auto Review latency/settings planning review.

Labels: plan (Durable planning issue), plan:active (Current active plan)
