Implement factory pipeline fix plan (Phases 1-7) #100
Merged
Seven guardrails that collectively prevent the all-day canary loop
pattern before spending vision budget or opening noisy PRs:
Phase 1 — Cheap gates before vision
- Add `content-preflight` (check.py --phase content) and
`snap-preflight` (snap.py report --strict) as new pipeline phases
that run after snap and before snap-vision-review in both the flat
and dress phase lists. Vision now never starts on a red snap report.
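
A minimal sketch of the gate ordering described in Phase 1, not the actual `design.py` code. The phase names `snap`, `content-preflight`, `snap-preflight`, and `snap-vision-review` come from this PR; the helpers `insert_cheap_gates` and `run_pipeline`, the `build` phase, and the `run_phase` callback are hypothetical names used only for illustration.

```python
CHEAP_GATES = ["content-preflight", "snap-preflight"]


def insert_cheap_gates(phases):
    """Return a copy of `phases` with the cheap gates placed right after 'snap'.

    Assumption: the phase list is a plain list of phase-name strings.
    """
    out = [p for p in phases if p not in CHEAP_GATES]
    idx = out.index("snap") + 1
    return out[:idx] + CHEAP_GATES + out[idx:]


def run_pipeline(phases, run_phase):
    """Run phases in order and stop at the first one that reports failure.

    `run_phase` is a hypothetical callback returning True on success.
    Returns the list of phases that were actually started.
    """
    started = []
    for phase in phases:
        started.append(phase)
        if not run_phase(phase):
            break
    return started


if __name__ == "__main__":
    phases = insert_cheap_gates(["build", "snap", "snap-vision-review"])
    # Simulate a red snap report: the vision phase is never started.
    print(run_pipeline(phases, lambda p: p != "snap-preflight"))
```

Because the gates sit between `snap` and `snap-vision-review`, a failing preflight ends the run before any vision call is made.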
Phase 2 — Stop catastrophic scorecard loops
- design-scorecard.py: classify results as green / marginal /
mass-failure; emit grouped top_weak_findings; stop with an explicit
diagnostic packet when visual_distinctness ≤ 10, any category = 0,
or weak_findings ≥ 40.
- design_unblock.py: recognise `scorecard-mass-failure`; set
human_boundary immediately; skip all recipes / json-llm / tool-rescue.
- design-watch.py: detect mass-failure and skip the repair while-loop.
- factory_rules.py: add scorecard-mass-failure prevention rule.
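
The mass-failure thresholds in Phase 2 can be sketched as a small classifier. This is an illustration under stated assumptions, not the code in `design-scorecard.py`: the function name, the dict-of-scores input shape, and the `green_floor` cutoff separating green from marginal are all hypothetical. Only the three mass-failure conditions are taken from this PR.

```python
def classify_scorecard(categories, weak_findings, green_floor=80):
    """Classify a scorecard as 'green', 'marginal', or 'mass-failure'.

    categories: dict mapping category name to a numeric score (assumed shape).
    weak_findings: number of weak findings reported by the scorecard.
    green_floor: hypothetical cutoff; the PR does not state the real one.
    Returns (classification, reasons).
    """
    reasons = []

    distinctness = categories.get("visual_distinctness")
    if distinctness is not None and distinctness <= 10:
        reasons.append("visual_distinctness <= 10")

    zeroed = sorted(name for name, score in categories.items() if score == 0)
    if zeroed:
        reasons.append("zero-score categories: " + ", ".join(zeroed))

    if weak_findings >= 40:
        reasons.append("weak_findings >= 40")

    if reasons:
        # Any single condition is enough to stop the repair loop.
        return "mass-failure", reasons
    if all(score >= green_floor for score in categories.values()):
        return "green", []
    return "marginal", []
```

Returning the list of triggered reasons alongside the label is one way to feed the explicit diagnostic packet mentioned above.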
Phase 3 — Fix cart route state
- snap.py: extract _new_capture_context() helper; create a fresh
anonymous browser context before every cart-empty route so session
state from cart-filled (/?demo=cart) cannot bleed through.
- Regression test: test_snap_cart_state.py.
Phase 4 — Remove no-op scorecard repair
- design_unblock.py: when recommended_recipes is empty, return a
stopped record immediately instead of running verification and
re-proving the same failure. Also: skip the verification ladder if
no source files changed and the only recipes are snap_routes /
design_scorecard.
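
The two short-circuits in Phase 4 amount to a decision made before any verification runs. The sketch below is an assumption-laden illustration rather than the `design_unblock.py` implementation: `plan_repair` and its return strings are hypothetical, while the recipe names `snap_routes` and `design_scorecard` and the `no-op-repair` label come from this PR.

```python
# Recipes that only re-shoot or re-score and never touch source files.
RESHOOT_ONLY = frozenset({"snap_routes", "design_scorecard"})


def plan_repair(recommended_recipes, changed_source_files):
    """Decide whether a repair attempt is worth verifying.

    Returns one of three hypothetical outcome strings:
      'stopped'      - nothing was recommended, so stop immediately
      'no-op-repair' - no source changed and only re-shoot recipes remain
      'verify'       - a real change happened, run the verification ladder
    """
    if not recommended_recipes:
        return "stopped"
    only_reshoot = set(recommended_recipes) <= RESHOOT_ONLY
    if not changed_source_files and only_reshoot:
        return "no-op-repair"
    return "verify"
```

Both early exits avoid re-proving a failure that nothing in the source tree could have changed.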
Phase 5 — Improve repair evidence
- design_unblock.py: add _latest_design_score() and
_scorecard_top_findings_snippet(); surface the grouped diagnostic
packet in source_snippets for both design-score-low and
scorecard-mass-failure.
Phase 6 — Make status truthful
- design-batch.py: print "[batch] active child run=… status=…" before
each child subprocess so the top-level watcher can parse it.
- design-watch.py: parse [batch] child announcements and child
Working/Blocked status lines; update slug/phase/child fields in
WatchState; write "## What To Watch Now" block to STATUS.md;
surface active_child_run_id + active_child_status_path in summary.
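
Parsing the child announcement from Phase 6 could look roughly like this. The line prefix `[batch] active child run=… status=…` is from this PR, but the actual values are elided there, so the sketch assumes both are whitespace-free tokens and that `status=` carries a path. `ChildAnnouncement`, `parse_child_announcement`, and the sample values in the usage lines are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumption: run and status values contain no whitespace.
_CHILD_RE = re.compile(
    r"^\[batch\] active child run=(?P<run>\S+) status=(?P<status>\S+)$"
)


@dataclass
class ChildAnnouncement:
    run_id: str
    status_path: str


def parse_child_announcement(line: str) -> Optional[ChildAnnouncement]:
    """Return the parsed announcement, or None for any other log line."""
    match = _CHILD_RE.match(line.strip())
    if match is None:
        return None
    return ChildAnnouncement(match.group("run"), match.group("status"))


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample = "[batch] active child run=demo-run-1 status=/tmp/demo/STATUS.md"
    print(parse_child_announcement(sample))
    print(parse_child_announcement("unrelated log line"))
```

Returning `None` for non-matching lines lets the watcher pass every line of child output through the same function without special-casing.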
Phase 7 — Clean failed canary lifecycle
- design-batch.py: add FAILED_CANARY_LABEL constant, _ensure_label(),
_is_exploratory_canary(); label canary-failed PRs, attach cleanup
command, create the GitHub label if absent.
Tests: 1086 passed, 1 skipped (lint + pytest both clean).
Made-with: Cursor
Summary
Seven guardrails that collectively prevent the all-day canary loop pattern before spending vision budget or opening noisy PRs.
- Adds `content-preflight` and `snap-preflight` phases to `design.py`; both run after snap and before vision review, blocking expensive LLM calls on already-red evidence.
- `scorecard-mass-failure` classification in `design-scorecard.py` stops the repair loop when `visual_distinctness ≤ 10`, any category = 0, or `weak_findings ≥ 40`; grouped diagnostics replace the raw JSON blob.
- `snap.py` creates a fresh browser context for every `cart-empty` route so session state from `?demo=cart` cannot bleed through.
- `design_unblock.py` stops immediately with `no-op-repair` when no source files changed and the only recipe would be a re-shoot.
- Surfaces grouped `top_weak_findings` in repair evidence.
- `design-batch.py` prints `[batch] active child run=… status=…`; the top-level watcher parses it and writes a What To Watch Now block to `STATUS.md` so it no longer stays on "starting."
- Failed canary PRs get a `factory-canary-failed` label and a cleanup command comment.
Test plan
- `python3 -m pytest tests/tools/ tests/check_py/`: 1086 passed, 1 skipped
- `python3 bin/lint.py`: all checks clean
- `test_snap_cart_state`, `test_mass_failure_classification_groups_top_findings`, `test_scorecard_mass_failure_stops_without_recipes`, `test_parser_tracks_batch_child_status_path_and_phase`, `test_failed_canaries_get_lifecycle_label_and_cleanup_path`, `test_cheap_gates_run_before_vision_review`
Made with Cursor