Implement factory pipeline fix plan (Phases 1-7)#100

Merged
nickhamze merged 1 commit into main from agent/factory-pipeline-fix on Apr 29, 2026

Conversation

@nickhamze
Contributor

Summary

Seven guardrails that collectively prevent the all-day canary loop pattern before spending vision budget or opening noisy PRs.

  • Phase 1: Add content-preflight and snap-preflight phases to design.py; both run after snap and before vision review, blocking expensive LLM calls on already-red evidence.
  • Phase 2: scorecard-mass-failure classification in design-scorecard.py stops the repair loop when visual_distinctness ≤ 10, any category = 0, or weak_findings ≥ 40; grouped diagnostics replace the raw JSON blob.
  • Phase 3: snap.py creates a fresh browser context for every cart-empty route so session state from ?demo=cart cannot bleed through.
  • Phase 4: design_unblock.py stops immediately with no-op-repair when no source files changed and the only recipe would be a re-shoot.
  • Phase 5: Scorecard diagnostic packet surfaces grouped top_weak_findings in repair evidence.
  • Phase 6: Batch runner prints [batch] active child run=… status=…; top-level watcher parses it and writes a What To Watch Now block to STATUS.md so it no longer stays on "starting."
  • Phase 7: Failed exploratory canary PRs receive a factory-canary-failed label and a cleanup command comment.

Test plan

  • python3 -m pytest tests/tools/ tests/check_py/ — 1086 passed, 1 skipped
  • python3 bin/lint.py — all checks clean
  • Pre-commit and pre-push hooks both green
  • New regression tests: test_snap_cart_state, test_mass_failure_classification_groups_top_findings, test_scorecard_mass_failure_stops_without_recipes, test_parser_tracks_batch_child_status_path_and_phase, test_failed_canaries_get_lifecycle_label_and_cleanup_path, test_cheap_gates_run_before_vision_review

Made with Cursor

Seven guardrails that collectively prevent the all-day canary loop
pattern before spending vision budget or opening noisy PRs:

Phase 1 — Cheap gates before vision
  - Add `content-preflight` (check.py --phase content) and
    `snap-preflight` (snap.py report --strict) as new pipeline phases
    that run after snap and before snap-vision-review in both the flat
    and dress phase lists.  Vision now never starts on a red snap report.
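
The ordering described in Phase 1 can be sketched roughly as follows. This is a minimal illustration, not the code in design.py: the helper names `with_cheap_gates` and `run_phases` are hypothetical, and only the phase names `content-preflight`, `snap-preflight`, and `snap-vision-review` are taken from the text above.

```python
from typing import Callable, Dict, List

# Phase names come from the commit message; the helpers are illustrative only.
CHEAP_GATES = ["content-preflight", "snap-preflight"]
VISION_PHASE = "snap-vision-review"


def with_cheap_gates(phases: List[str]) -> List[str]:
    """Return a copy of `phases` with the cheap gates placed right before vision review."""
    base = [p for p in phases if p not in CHEAP_GATES]
    vision_at = base.index(VISION_PHASE)  # raises ValueError if vision review is absent
    return base[:vision_at] + CHEAP_GATES + base[vision_at:]


def run_phases(phases: List[str], runners: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run phases in order and stop after the first one that reports failure."""
    ran: List[str] = []
    for name in phases:
        ran.append(name)
        if not runners[name]():
            break
    return ran
```

Because the runner stops at the first failing phase, a red `snap-preflight` means the vision phase is never reached.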

Phase 2 — Stop catastrophic scorecard loops
  - design-scorecard.py: classify results as green / marginal /
    mass-failure; emit grouped top_weak_findings; stop with an explicit
    diagnostic packet when visual_distinctness ≤ 10, any category = 0,
    or weak_findings ≥ 40.
  - design_unblock.py: recognise `scorecard-mass-failure`; set
    human_boundary immediately; skip all recipes / json-llm / tool-rescue.
  - design-watch.py: detect mass-failure and skip the repair while-loop.
  - factory_rules.py: add scorecard-mass-failure prevention rule.
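
As a rough sketch of the three-way classification, assuming scores are integers keyed by category name. The mass-failure thresholds are the ones stated above; the `green_floor` cutoff separating green from marginal is an assumption, since the source does not give it, and `classify_scorecard` is a hypothetical name.

```python
from typing import Dict


def classify_scorecard(
    categories: Dict[str, int],
    weak_findings: int,
    green_floor: int = 80,  # ASSUMPTION: the green/marginal cutoff is not in the source
) -> str:
    """Classify a scorecard as 'green', 'marginal', or 'mass-failure'."""
    distinctness = categories.get("visual_distinctness")
    if (
        (distinctness is not None and distinctness <= 10)
        or any(score == 0 for score in categories.values())
        or weak_findings >= 40
    ):
        # Any one of the three stated conditions is enough to stop the repair loop.
        return "mass-failure"
    if categories and min(categories.values()) >= green_floor:
        return "green"
    return "marginal"
```

A caller would treat `"mass-failure"` as a stop signal and skip recipes entirely, rather than as one more low score to repair.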

Phase 3 — Fix cart route state
  - snap.py: extract _new_capture_context() helper; create a fresh
    anonymous browser context before every cart-empty route so session
    state from cart-filled (/?demo=cart) cannot bleed through.
  - Regression test: test_snap_cart_state.py.
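
The isolation idea can be sketched without committing to a browser library. The snippet assumes only that the browser object exposes a `new_context()` method and that contexts expose `close()`, as Playwright's browser contexts do; the source does not say which library snap.py uses, and `capture_context` and `FRESH_CONTEXT_ROUTES` are hypothetical names.

```python
from contextlib import contextmanager
from typing import Any, Iterator

# ASSUMPTION: route names as keys; the source only mentions cart-empty and cart-filled.
FRESH_CONTEXT_ROUTES = frozenset({"cart-empty"})


@contextmanager
def capture_context(browser: Any, shared_context: Any, route_name: str) -> Iterator[Any]:
    """Yield a brand-new context for state-sensitive routes, else the shared one."""
    if route_name not in FRESH_CONTEXT_ROUTES:
        yield shared_context
        return
    fresh = browser.new_context()  # starts with no cookies or storage from earlier routes
    try:
        yield fresh
    finally:
        fresh.close()  # the shared context is left open for the remaining routes
```

The cost is one extra context per cart-empty capture, which is small compared with a vision review spent on a screenshot that still shows a filled cart.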

Phase 4 — Remove no-op scorecard repair
  - design_unblock.py: when recommended_recipes is empty, return a
    stopped record immediately instead of running verification and
    re-proving the same failure.  Also: skip the verification ladder if
    no source files changed and the only recipes are snap_routes /
    design_scorecard.
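
One way to express the two early exits, as a sketch only. The recipe names `snap_routes` and `design_scorecard` and the `no-op-repair` reason come from this PR; the function name, the `no-recipes` reason string, and the argument shapes are assumptions.

```python
from typing import Iterable, Optional

# Recipes that only re-capture or re-score; they cannot change the outcome by themselves.
RESHOOT_ONLY_RECIPES = frozenset({"snap_routes", "design_scorecard"})


def early_stop_reason(
    recommended_recipes: Iterable[str],
    changed_source_files: Iterable[str],
) -> Optional[str]:
    """Return a stop reason when verification would only re-prove the same failure."""
    recipes = set(recommended_recipes)
    if not recipes:
        return "no-recipes"  # ASSUMPTION: hypothetical reason string
    if not list(changed_source_files) and recipes <= RESHOOT_ONLY_RECIPES:
        return "no-op-repair"
    return None  # a real repair is possible, so continue to verification
```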

Phase 5 — Improve repair evidence
  - design_unblock.py: add _latest_design_score() and
    _scorecard_top_findings_snippet(); surface the grouped diagnostic
    packet in source_snippets for both design-score-low and
    scorecard-mass-failure.
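
A possible shape for the grouping step, assuming each weak finding is a dict with a `category` key. That record shape is a guess, not something the PR shows, and `group_top_weak_findings` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, List


def group_top_weak_findings(findings: List[Dict[str, str]], limit: int = 3) -> List[Dict[str, object]]:
    """Collapse raw findings into the `limit` most frequent categories."""
    # ASSUMPTION: every finding carries a "category" key.
    counts = Counter(finding["category"] for finding in findings)
    return [
        {"category": category, "count": count}
        for category, count in counts.most_common(limit)
    ]
```

A short grouped list like this is easier to act on in repair evidence than the raw JSON blob it replaces.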

Phase 6 — Make status truthful
  - design-batch.py: print "[batch] active child run=… status=…" before
    each child subprocess so the top-level watcher can parse it.
  - design-watch.py: parse [batch] child announcements and child
    Working/Blocked status lines; update slug/phase/child fields in
    WatchState; write "## What To Watch Now" block to STATUS.md;
    surface active_child_run_id + active_child_status_path in summary.
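
A sketch of the parsing side. The line prefix and the "What To Watch Now" heading come from this PR, but the values after `run=` and `status=` are elided in the source, so treating them as whitespace-free tokens is an assumption, as are the function names and the bullet layout of the rendered block.

```python
import re
from typing import Dict, Optional

# ASSUMPTION: run and status values contain no whitespace.
BATCH_CHILD_RE = re.compile(
    r"^\[batch\] active child run=(?P<run>\S+) status=(?P<status>\S+)\s*$"
)


def parse_batch_child(line: str) -> Optional[Dict[str, str]]:
    """Return the run and status fields of a batch announcement, or None."""
    match = BATCH_CHILD_RE.match(line)
    return match.groupdict() if match else None


def render_watch_block(child: Dict[str, str]) -> str:
    """Render a STATUS.md section pointing at the active child run."""
    return "\n".join(
        [
            "## What To Watch Now",
            "",
            f"- active child run: {child['run']}",
            f"- child status: {child['status']}",
        ]
    )
```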

Phase 7 — Clean failed canary lifecycle
  - design-batch.py: add FAILED_CANARY_LABEL constant, _ensure_label(),
    _is_exploratory_canary(); label canary-failed PRs, attach cleanup
    command, create the GitHub label if absent.
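
If the labelling went through the GitHub CLI, the three steps could look roughly like this. The source does not say whether design-batch.py uses `gh` or the REST API, so this is an assumption; `failed_canary_commands` is a hypothetical helper and the cleanup command is passed in because the PR does not show it.

```python
from typing import List

FAILED_CANARY_LABEL = "factory-canary-failed"  # label name from the PR summary


def failed_canary_commands(pr_number: int, cleanup_command: str) -> List[List[str]]:
    """Build the gh invocations that label a failed canary PR and post cleanup help."""
    comment = f"Canary failed. To clean up, run:\n\n    {cleanup_command}"
    return [
        # --force keeps label creation from failing when the label already exists.
        ["gh", "label", "create", FAILED_CANARY_LABEL, "--force"],
        ["gh", "pr", "edit", str(pr_number), "--add-label", FAILED_CANARY_LABEL],
        ["gh", "pr", "comment", str(pr_number), "--body", comment],
    ]
```

Each returned list can be handed to `subprocess.run(cmd, check=True)`; building the commands separately from running them keeps the logic testable without network access.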

Tests: 1086 passed, 1 skipped (lint + pytest both clean).
Made-with: Cursor
@nickhamze nickhamze marked this pull request as ready for review April 29, 2026 23:29
@nickhamze nickhamze merged commit ca3d47c into main Apr 29, 2026
12 checks passed
@nickhamze nickhamze deleted the agent/factory-pipeline-fix branch April 29, 2026 23:29