feat(test): Python browser scenario harness + RC-promotion gate by HydraOps-T-rav · Pull Request #8370 · T-rav/hydraflow

HydraOps-T-rav · 2026-04-20T15:32:17Z

Summary

Delivers the full plan from docs/superpowers/specs/2026-04-18-browser-scenario-harness-design.md: a Python-driven Playwright harness that runs the real HydraFlow dashboard against MockWorld fakes, plus a new scenario-browser job wired into rc-promotion-scenario.yml. Replaces the JS Playwright suite entirely.

46 new browser tests land (Tier-1 contract + Tier-2 workflow + Tier-3 scenario + smoke). Full scenario suite post-merge: 160/160 scenario, 150/150 fakes, 48/48+1 xfailed browser.

What's in here

| Phase | Content |
| --- | --- |
| 1 — Skeleton | `scenario_browser` marker, pytest-playwright + pytest-rerunfailures deps, browser test directory |
| 2 — Layer 0 helpers | `FakeClock.freeze`, `MockWorld.clock_start`, `MockWorld.add_repo`, `_wire_targets` (PipelineHarness ↔ ServiceRegistry unified), `FakeGitHub.add_pr` sync helper, `MockWorld.start_dashboard` (with + without orchestrator), `HydraFlowDashboard._uvicorn_server` for port introspection |
| 3 — Tier-1 contract | 15 snapshot tests (5 empty + 10 populated), seed helpers, page objects, `body[data-connected]` WS-ready signal, `_HarnessOrchestratorShim` for contract path |
| 4 — JS deletion | Removed `src/ui/e2e/`, `playwright.config.js`, `@playwright/test` npm deps, `__HYDRAFLOW_SEED_STATE__` code path, `visual-capture` CI job, `screenshot` Makefile targets |
| 5 — Tier-2 workflows | 5 tests: orchestrator start/stop, HITL skip, repo-register, PR merge-render, config autosave |
| 6 — Tier-3 scenarios | 24 tests across H1-H5, S1-S6, E1-E5, L1-L8 — render-after-run pattern |
| 7 — CI gate | New `scenario-browser` job in `rc-promotion-scenario.yml`, `make scenario-browser` target |
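
To illustrate one of the Layer 0 helpers: the commit that introduces `FakeClock.freeze` says it accepts a unix float or an ISO-8601 string. The sketch below shows one way such dual input could be normalized. Only the method name and the two input kinds come from this PR; the class body, the `now()` accessor, and the choice to treat naive timestamps as UTC are assumptions for illustration.

```python
from datetime import datetime, timezone
from typing import Union


class FakeClock:
    """Hypothetical stand-in; the real FakeClock may look quite different."""

    def __init__(self) -> None:
        self._now = 0.0

    def freeze(self, when: Union[float, str]) -> None:
        """Pin the clock to a unix timestamp or an ISO-8601 string."""
        if isinstance(when, str):
            parsed = datetime.fromisoformat(when)
            if parsed.tzinfo is None:
                # Assumption: naive timestamps are interpreted as UTC.
                parsed = parsed.replace(tzinfo=timezone.utc)
            self._now = parsed.timestamp()
        else:
            self._now = float(when)

    def now(self) -> float:
        """Return the frozen time as unix seconds."""
        return self._now
```

Normalizing to a single float internally keeps every consumer of the clock on one representation, whichever form a test happens to pass in.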

Architecture decisions worth review attention

Full E2E click-drive isn't viable. /api/control/start builds a fresh orchestrator and loses fake wiring. Tier-3 scenario tests use Python-side run_pipeline() + UI renders result instead. Workflow tests that exercise routes bypassing FakeGitHub (e.g. PRManager → real gh CLI) use page.route() interception.
_HarnessOrchestratorShim: Contract tests pass with_orchestrator=False which routes through a small shim that reports running=True, forwards issue_store to the harness, and returns empty-safe sentinels for everything else. Keeps the dashboard rendering populated state without booting a real orchestrator. Confined to MockWorld; not used in workflow/scenario paths.
Function-scoped browser fixture (not session-scoped) due to a pytest-asyncio cross-loop deadlock with asyncio_default_test_loop_scope=function. Documented inline in conftest.py. Tier-3 pays full Chromium startup per test (~3s). Addressing this is a worthwhile perf win but out of scope here.
Contract snapshot limitation. Populated differs from empty, but within each seed state all tabs render identically because the React SPA's tab state is internal (?tab= URL params are not honored). A follow-up should click tabs in test setup rather than URL-routing.
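
The shim idea described above (report `running=True`, forward `issue_store`, answer everything else with an empty-safe sentinel) could be sketched roughly as follows. This is not the code of `_HarnessOrchestratorShim`; the class names and the sentinel's exact behavior are assumptions chosen to show the pattern.

```python
class EmptySafe:
    """Hypothetical sentinel: absorbs attribute access and calls, iterates as empty."""

    def __getattr__(self, name: str):
        if name.startswith("__"):
            # Let copy/pickle/inspect probes fail normally.
            raise AttributeError(name)
        return self

    def __call__(self, *args, **kwargs):
        return self

    def __iter__(self):
        return iter(())

    def __len__(self) -> int:
        return 0

    def __bool__(self) -> bool:
        return False


class HarnessOrchestratorShim:
    """Hypothetical shim: looks like a running orchestrator to the dashboard."""

    running = True

    def __init__(self, issue_store) -> None:
        # The one piece of real state the dashboard needs to render issues.
        self.issue_store = issue_store

    def __getattr__(self, name: str):
        # Only reached for attributes not defined above.
        if name.startswith("__"):
            raise AttributeError(name)
        return EmptySafe()
```

With this shape, a route handler that does something like `list(orchestrator.workers.active())` gets an empty list instead of an `AttributeError`, which is what lets the dashboard render without a real orchestrator behind it.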

What this does NOT include (explicitly deferred)

Branch-protection flip. Per the plan, Browser Scenarios is added to CI but not marked required until it demonstrates stability on RC PRs. That's a one-click admin action after a soak window.
a11y / interactions JS-test replacement. Main added src/ui/e2e/{a11y,interactions}.spec.js in #8360 (test(scenarios): Tier 2 — FakeBeads + bead workflow + 5 caretaker loops) and #8361 (test(scenarios): Tier 3 — 10 caretaker loops + UI interactions + a11y + hook + fuzz) after this branch was cut. Those files are removed here as part of the planned JS suite deletion. Re-adding their coverage in the Python harness is a follow-up, deferred to keep this PR focused.
Tab-level snapshot differentiation (see limitation above).
Session-scoped browser fixture restoration (see limitation above).

Diff stats

34 commits, ~268 files changed (the large number is mostly wiki .md additions merged in from main)
+4K / −11K (most deletions are the JS suite removal + e2e .spec.js fixtures)

Test plan

make scenario — 160/160 pass
Fakes suite — 150/150 pass
make scenario-browser locally — 48 passed + 1 xfailed (L2 workspace_gc, matching the reference)
Contract snapshots regenerate cleanly
Pyright clean on all modified files
Pre-commit hooks pass without --no-verify
CI green on this PR
Manual review of merge-conflict resolutions in mock_world.py, fake_github.py, test_mock_world.py, package.json

Known non-issues

IDE pyright sometimes shows stale errors on tests/scenarios/fakes/ — caused by the lazy __getattr__ in tests/scenarios/fakes/__init__.py. CLI pyright is clean.
Pre-existing failure test_build_prompt_truncates_long_body on main is unrelated to this PR.
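
For context on the stale-pyright note above: a lazy module-level `__getattr__` (PEP 562) resolves exports only when they are first accessed, which is why an IDE's incremental analysis can lag behind the CLI. The sketch below reproduces the mechanism with a throwaway module object. The module name and the export table are invented for illustration and do not mirror the contents of `tests/scenarios/fakes/__init__.py`.

```python
import importlib
import types

# Hypothetical export table: public name -> module that actually defines it.
LAZY_EXPORTS = {"OrderedDict": "collections", "dumps": "json"}


def lazy_getattr(name: str):
    """Module-level __getattr__ hook: import the owning module on first access."""
    try:
        module_name = LAZY_EXPORTS[name]
    except KeyError:
        raise AttributeError(f"module 'fakes_demo' has no attribute {name!r}") from None
    return getattr(importlib.import_module(module_name), name)


# Stand-in for a package __init__; in a real package the hook would simply be
# a top-level function named __getattr__ inside __init__.py.
fakes = types.ModuleType("fakes_demo")
fakes.__getattr__ = lazy_getattr
```

Importing from the concrete submodule instead of the package, as one of the commits in this PR does for `MockWorld`, sidesteps the hook entirely and gives the type checker a direct definition to follow.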

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Raise the pytest-rerunfailures version floor from >=14.0 to >=16.0 to match the locked version (16.1) and avoid silent misbehaviour from breaking changes in 14→15→16. Add "(Tier C)" qualifier to the scenario_browser marker description for consistency with sibling markers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Use concrete import path for MockWorld (avoids __getattr__ union-type resolution that confused pyright) and guard teardown with getattr so future Task-7 methods don't cause pyright errors before they exist. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Extract _wire_targets(target) that accepts any duck-typed object exposing .prs/.triage_runner/.planners/.agents/.reviewers/.workspaces, enabling Task 9 to wire an orchestrator adapter without duplicating logic. The three original _wire_* methods are kept as thin backward-compatible wrappers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
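
The duck-typed wiring described here could look roughly like the sketch below. The attribute names come from the commit message; the `fakes` mapping, the return value, and the function body are assumptions, not the actual `_wire_targets` implementation.

```python
from types import SimpleNamespace
from typing import Callable, Dict, List

# Attribute names the commit message says a wiring target must expose.
WIRED_ATTRS = ("prs", "triage_runner", "planners", "agents", "reviewers", "workspaces")


def wire_targets(target, fakes: Dict[str, Dict[str, Callable]]) -> List[str]:
    """Patch fake methods onto each service of any object exposing WIRED_ATTRS.

    `fakes` maps attribute name -> {method name -> replacement callable}.
    Returns the dotted names that were patched, in wiring order.
    """
    wired = []
    for attr in WIRED_ATTRS:
        service = getattr(target, attr)  # AttributeError if the target is incomplete
        for method_name, fake in fakes.get(attr, {}).items():
            setattr(service, method_name, fake)
            wired.append(f"{attr}.{method_name}")
    return wired
```

Because the function only relies on attribute names, a harness, a registry, or an adapter around either can be passed in without sharing a base class.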

Implements the no-orchestrator path of start_dashboard(): boots HydraFlowDashboard in-process on an ephemeral port against a seeded EventBus/StateTracker, exposes the URL via dashboard_url, and provides idempotent start + graceful stop. Adds _uvicorn_server attribute to HydraFlowDashboard so MockWorld can poll for the bound port. The browser smoke test (intentionally failing since Task 2) now passes end-to-end. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Harden MockWorld.stop_dashboard to explicitly close uvicorn's bound listener sockets (asyncio.Server.close + wait_closed) before awaiting dashboard.stop(), ensuring the ephemeral port is released immediately rather than after uvicorn's graceful-shutdown delay. Adds regression test that re-binds the port after teardown to catch any recurrence. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
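
The close-then-rebind check can be reproduced with plain asyncio, independent of uvicorn. In the sketch below, `asyncio.start_server` stands in for the dashboard's listener, and the helper names are invented. It binds an ephemeral port, closes the listener with `close()` plus `wait_closed()`, and then verifies the port can be bound again.

```python
import asyncio
import socket


async def _handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    # No traffic is expected in this sketch; close any connection immediately.
    writer.close()


async def start_and_release() -> int:
    """Bind an ephemeral port, then release it explicitly and return its number."""
    server = await asyncio.start_server(_handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]  # port chosen by the OS
    server.close()                             # stop listening
    await server.wait_closed()                 # wait until the sockets are closed
    return port


def can_rebind(port: int) -> bool:
    """Return True if a fresh socket can bind the port, i.e. it was released."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind(("127.0.0.1", port))
        except OSError:
            return False
    return True
```

A regression test along these lines would call `can_rebind(port)` right after teardown and fail if the listener socket were still open.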

…to a real orchestrator Implements Task 9: _build_wired_orchestrator constructs a real HydraFlowOrchestrator (pipeline_enabled=False) and uses a _SvcAdapter to bridge the ServiceRegistry's `triage` field to the `triage_runner` name that _wire_targets expects, then patches all fake methods onto the live service registry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
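
The name-bridging role of the adapter can be sketched as below. Only the `triage` to `triage_runner` mapping is taken from the commit message; the class body and the pass-through behavior for other attributes are assumptions, and the real `_SvcAdapter` may be structured differently.

```python
from types import SimpleNamespace


class SvcAdapter:
    """Hypothetical adapter: presents a registry under the names wiring code expects."""

    def __init__(self, registry) -> None:
        self._registry = registry

    @property
    def triage_runner(self):
        # The registry calls this field `triage`; the wiring code asks for `triage_runner`.
        return self._registry.triage

    def __getattr__(self, name: str):
        # Only reached for names not defined on the adapter; pass them through.
        return getattr(self._registry, name)
```

For example, `SvcAdapter(SimpleNamespace(triage=t, prs=p)).triage_runner` returns `t`, while `.prs` falls through to the registry unchanged.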

Ports seed data from src/ui/e2e/fixtures/seed-state.js into Python via seed_populated_pipeline / seed_empty_pipeline helpers, backed by a new sync FakeGitHub.add_pr method. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds a useEffect in HydraFlowProvider that mirrors state.connected onto document.body[data-connected] so Playwright's wait_for_ws_ready() helper can detect WebSocket readiness via a stable DOM attribute. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
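
On the Python side, a readiness helper built on that attribute might look like the sketch below. It relies on Playwright's `page.wait_for_selector(selector, state=..., timeout=...)`. The attribute value `"true"`, the default timeout, and the `RecordingPage` stub are assumptions; the stub exists only so the snippet runs without a browser.

```python
# Assumed attribute value; the commit only says state.connected is mirrored
# onto document.body[data-connected].
WS_READY_SELECTOR = 'body[data-connected="true"]'


def wait_for_ws_ready(page, timeout_ms: float = 5_000) -> None:
    """Block until the dashboard marks its WebSocket as connected.

    `page` is expected to be a Playwright sync-API Page (or anything with a
    compatible wait_for_selector method).
    """
    page.wait_for_selector(WS_READY_SELECTOR, state="attached", timeout=timeout_ms)


class RecordingPage:
    """Hypothetical stub that records calls instead of driving a browser."""

    def __init__(self) -> None:
        self.calls = []

    def wait_for_selector(self, selector, *, state=None, timeout=None):
        self.calls.append((selector, state, timeout))
```

Waiting on a DOM attribute rather than a fixed sleep ties the test to the same signal the UI itself uses.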

Ports 10 populated + 5 empty captures from JS screenshots.spec.js to Python Playwright. Adds a stdlib-only PNG pixel-diff snapshot fixture (no Pillow/numpy) with --update-snapshots support. Fixes a pytest-asyncio session/function loop deadlock by making browser_context function-scoped, and patches PLAYWRIGHT_BROWSERS_PATH at import time to survive the hermetic HOME=/tmp/hydraflow-test environment. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
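
A stdlib-only pixel diff is feasible because PNG decoding needs only `struct` and `zlib`. The sketch below is not the PR's snapshot fixture. It is a minimal illustration that handles 8-bit, non-interlaced images only; the function names and the tiny encoder (included so the example can generate its own inputs) are assumptions.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
CHANNELS = {0: 1, 2: 3, 4: 2, 6: 4}  # PNG colour type -> samples per pixel


def _chunk(kind: bytes, body: bytes) -> bytes:
    crc = zlib.crc32(kind + body)
    return struct.pack(">I", len(body)) + kind + body + struct.pack(">I", crc)


def encode_rgb(width: int, height: int, pixels: bytes) -> bytes:
    """Write an 8-bit RGB PNG using filter type 0 on every scanline."""
    stride = width * 3
    rows = b"".join(b"\x00" + pixels[y * stride:(y + 1) * stride] for y in range(height))
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (PNG_SIGNATURE + _chunk(b"IHDR", ihdr)
            + _chunk(b"IDAT", zlib.compress(rows)) + _chunk(b"IEND", b""))


def _paeth(a: int, b: int, c: int) -> int:
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    return b if pb <= pc else c


def decode_png(data: bytes):
    """Return (width, height, bytes_per_pixel, pixel_bytes) for a supported PNG."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos, idat, header = 8, bytearray(), None
    while pos < len(data):
        length, kind = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        pos += 12 + length  # length + type + body + CRC
        if kind == b"IHDR":
            header = struct.unpack(">IIBBBBB", body)
        elif kind == b"IDAT":
            idat += body
        elif kind == b"IEND":
            break
    if header is None:
        raise ValueError("missing IHDR chunk")
    width, height, depth, colour, _, _, interlace = header
    if depth != 8 or interlace or colour not in CHANNELS:
        raise ValueError("only 8-bit non-interlaced PNGs are supported here")
    bpp = CHANNELS[colour]
    stride = width * bpp
    raw = zlib.decompress(bytes(idat))
    out, prev = bytearray(), bytearray(stride)
    for row in range(height):
        start = row * (stride + 1)
        ftype = raw[start]  # per-scanline filter type
        line = bytearray(raw[start + 1:start + 1 + stride])
        for i in range(stride):
            a = line[i - bpp] if i >= bpp else 0  # left (already unfiltered)
            b = prev[i]                           # above
            c = prev[i - bpp] if i >= bpp else 0  # above-left
            if ftype == 1:
                line[i] = (line[i] + a) & 0xFF
            elif ftype == 2:
                line[i] = (line[i] + b) & 0xFF
            elif ftype == 3:
                line[i] = (line[i] + (a + b) // 2) & 0xFF
            elif ftype == 4:
                line[i] = (line[i] + _paeth(a, b, c)) & 0xFF
        out += line
        prev = line
    return width, height, bpp, bytes(out)


def diff_ratio(png_a: bytes, png_b: bytes) -> float:
    """Fraction of pixels that differ; 1.0 if dimensions or channels differ."""
    wa, ha, ca, pa = decode_png(png_a)
    wb, hb, cb, pb = decode_png(png_b)
    if (wa, ha, ca) != (wb, hb, cb):
        return 1.0
    changed = sum(1 for i in range(0, len(pa), ca) if pa[i:i + ca] != pb[i:i + ca])
    return changed / (wa * ha)
```

A snapshot fixture could compare `diff_ratio(baseline, actual)` against a small tolerance and overwrite the baseline when an update flag such as `--update-snapshots` is passed.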

Wire start_dashboard() to the harness's existing EventBus and IssueStore via _HarnessOrchestratorShim; update seed_populated_pipeline() to call harness.seed_issue() with correct stage names ('find'/'ready' not 'triage'/'implement'); regenerate all 15 contract baselines so populated and empty snapshots are visibly distinct (80220 vs 81561 bytes). Document why session-scoped Chromium fixtures are not used: the cross-loop deadlock requires asyncio_default_test_loop_scope="session" globally, which is high-risk for the existing unit suite. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add data-testid="orchestrator-status" hidden span to Header.jsx and two Tier-2 workflow tests verifying that clicking the dashboard's Start/Stop buttons drives the real HydraFlowOrchestrator's running state. Pyright clean; all 1034 vitest tests and both new browser tests pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Update HitlPage page object to match real HITLTable.jsx testids:
  - open() navigates to / and clicks the HITL tab (React ignores ?tab= URL param)
  - item() → hitl-row-{N} (was hitl-item-{N})
  - correction_input() → hitl-textarea-{N} (was hitl-correction-input-{N})
  - submit_button() → hitl-retry-{N} (was hitl-submit-{N})
  - add detail(), skip_button(), close_button() helpers
- Add test_hitl_roundtrip.py: Playwright route-intercepts /api/control/status (to report running) and /api/hitl (to serve issue 208), clicks Skip, asserts the row disappears after the intercepted /api/hitl/208/skip returns ok.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Drive the dashboard to register a repo via the UI, verifying the registration endpoint is called with the correct path and the repo appears in the repo-selector dropdown list. Uses Playwright route interception for GET /api/repos and POST /api/repos/add since MockWorld.start_dashboard() does not wire register_repo_cb or repo_store (making the native backend return 503 and can_register: false). Adds data-testid="repo-register-input" to the filesystem-path input in RegisterRepoDialog for reliable selection. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
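
Route interception of this kind is typically a few lines with Playwright's `page.route(url, handler)` and `route.fulfill(status=..., content_type=..., body=...)`. The helper below is a sketch: `stub_json` and the two recording stubs are invented names, and any payload passed to it (beyond the `can_register` field this commit mentions) would be a guess at the real response shape.

```python
import json


def stub_json(page, url_pattern: str, payload, status: int = 200) -> None:
    """Answer every request matching url_pattern with a canned JSON body."""
    body = json.dumps(payload)

    def handler(route) -> None:
        route.fulfill(status=status, content_type="application/json", body=body)

    page.route(url_pattern, handler)


class RecordingRoute:
    """Hypothetical stand-in for a Playwright Route."""

    def __init__(self) -> None:
        self.fulfilled = None

    def fulfill(self, *, status, content_type, body) -> None:
        self.fulfilled = (status, content_type, body)


class RecordingPage:
    """Hypothetical stand-in for a Playwright Page; stores registered handlers."""

    def __init__(self) -> None:
        self.handlers = {}

    def route(self, url_pattern, handler) -> None:
        self.handlers[url_pattern] = handler
```

In a real test, `stub_json(page, "**/api/repos", {...})` would be called before `page.goto(...)` so the first fetch already sees the stubbed response.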

Adds a Tier-2 browser scenario that verifies the automated merge_update WebSocket event is propagated from the EventBus through the WS handler to the React frontend, where the Outcomes panel reflects the merged PR state ("#301 (merged)") in the expanded issue row. Uses route interception for /api/prs and /api/issues/history (same seam as Tasks 20/21) and world._harness.bus.publish() to inject the merge_update event that simulates what the review loop emits when PRManager.merge_pr() succeeds. Documents in the module docstring that PR approval in HydraFlow is fully automated with no user-facing Approve button. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds Task 23 — edits rc_cadence_hours in the staging_promotion worker card (StagingPromotionSettingsPanel) and asserts PATCH /api/control/config is fired with the correct payload on blur. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Port H1 from tests/scenarios/test_happy.py to a browser-driven scenario. Establishes the implementation pattern for the remaining 23 Tier-3 ports.

Pattern chosen: Python-trigger + UI-assert (not full E2E via UI Start button). Full E2E is unworkable because clicking Start in the dashboard creates a brand-new HydraFlowOrchestrator (dashboard_routes/_control_routes.py line 227) that replaces the pre-wired orchestrator, losing all FakeLLM / FakeGitHub wiring. This is a confirmed Phase 5 finding.

The viable pattern:

1. IssueBuilder seeds the world (identical to Python-only test_happy.py H1).
2. world.run_pipeline() drives all four phases through wired fakes.
3. world._harness.store.mark_merged(1) syncs the IssueStore — needed because PipelineHarness builds PostMergeHandler(store=None) so mark_merged is not called during run_pipeline(); FakeGitHub.pr.merged=True but the store is not updated automatically.
4. Dashboard starts with with_orchestrator=False (lightweight shim exposes harness IssueStore; _is_pipeline_active returns True because shim.running).
5. Route interceptor sets control/status to "running" for future multi-tab use.
6. UI asserts: stage-header-merged contains "1 merged"; flow-dot-1 is visible.
7. Python asserts: world.github.pr_for_issue(1).merged is True.

This pattern scales to H2–L8: seed via IssueBuilder / set_phase_result, run_pipeline(), mark_merged for merged issues, assert stage-header-{stage}.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
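
The reason a UI-driven Start loses the fakes can be shown with a toy model. Every class below is a hypothetical stand-in that mirrors only the behavior this commit describes: the start route constructs a fresh orchestrator rather than reusing the pre-wired one.

```python
class Orchestrator:
    """Toy orchestrator; `llm` records which backend it was built with."""

    def __init__(self, llm: str = "real-llm") -> None:
        self.llm = llm


class Dashboard:
    """Toy dashboard whose start route rebuilds the orchestrator from scratch."""

    def __init__(self, orchestrator: Orchestrator) -> None:
        self.orchestrator = orchestrator

    def control_start(self) -> None:
        # Mirrors the finding: a brand-new instance replaces the wired one.
        self.orchestrator = Orchestrator()


# A test wires fakes, then the Start click silently discards them.
wired = Orchestrator(llm="fake-llm")
dashboard = Dashboard(wired)
dashboard.control_start()
```

After `control_start()`, `dashboard.orchestrator` is no longer `wired`, so any assertion that depends on fake behavior would be talking to real services. Running the pipeline from Python and letting the UI render the result avoids that replacement step.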

Adds four new scenario_browser tests porting H2-H5 from tests/scenarios/test_happy.py using the render-after-run pattern established by the H1 pilot.

Key patterns discovered:

- H2: mark_merged must be called for all N issues before start_dashboard; stage-header-merged asserts "N merged".
- H3 (failed implement): a failed issue is consumed from the implement queue but never re-queued or merged, so it has no flow-dot in the pipeline snapshot. DOM assertion is that stage-header-merged is absent of "1 merged" and stage-section-implement is present.
- H4: identical seed to H1 but additionally asserts review_result is not None before the merged-stage UI check.
- H5 (sub-issues): NewIssueSpec bodies shorter than 50 chars are skipped by plan_phase, so sub-issues never reach GitHub. Parent issue #1 still proceeds to merge; DOM asserts "1 merged" for the parent; Python asserts plan_result.new_issues has 2 entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Port all six sad-path scenarios from tests/scenarios/test_sad.py to browser-rendered assertions. Two behavioural findings recorded in test docstrings: S1 stalls at plan (harness consumes the first failure result and does not auto-retry), and S6 now merges (FakeLLM wiring has evolved since the reference comment was written). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Ports five edge-case scenarios (duplicate issues, on_phase hook, stale worktree GC, epic child ordering, zero-diff implement) from tests/scenarios/test_edge.py to Playwright browser tests following the render-after-run pattern established by H1-H5 and S1-S6. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Ports all 8 background-loop scenarios to browser tests in test_loops_browser.py. Each test seeds world state matching the reference test_loops.py, runs the loop via run_with_loops, asserts Python-side state, then boots the dashboard shim and applies a negative DOM assertion confirming no crash and all pipeline sections are visible. L2 carries the same xfail as the reference (workspace_gc calls gh api via run_subprocess which is not stubbed in the harness). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Resolves conflicts in 5 files:

- src/ui/package.json + package-lock.json: keep HEAD (no @playwright/test, no @axe-core/playwright). Main's new e2e/{a11y,interactions}.spec.js deleted — JS suite is gone per ADR-0042 plan; a11y + interactions coverage must be re-added to the Python harness as a follow-up.
- tests/scenarios/fakes/fake_github.py: keep both add_pr (HEAD) and add_alerts (main).
- tests/scenarios/fakes/mock_world.py: keep both clock_start (HEAD) and wiki_store/beads_manager (main) ctor kwargs.
- tests/scenarios/fakes/test_mock_world.py: keep both test suites.

T-rav-Hydra-Ops and others added 30 commits April 18, 2026 16:58

test: register scenario_browser marker and pytest-playwright deps

6250c5b

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test(browser): skeleton conftest + failing smoke test

b1a3c3f

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(scenarios): FakeClock.freeze accepts unix float or ISO-8601

38f656a

feat(scenarios): MockWorld accepts clock_start for deterministic time

30fa755

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(scenarios): MockWorld.add_repo seeds the repo registry store

cbf8cb0

test(browser): smoke-test orchestrator path as well

bace279

test(browser): Python seed helpers mirroring JS fixture

14ac7b5


test(browser): page objects for dashboard tabs

43c155a

chore(ui): remove JS Playwright suite (superseded by Python harness)

d0a4a24

chore(ui): remove @playwright/test and screenshot npm scripts

a41eaf8

refactor(ui): remove __HYDRAFLOW_SEED_STATE__ code path

ceda0ef

chore(ci): remove visual-capture job and screenshot make targets

d0a1f1a

test(browser): regenerate contract baselines after Phase 5 UI touches

a4ffe96

T-rav-Hydra-Ops and others added 5 commits April 19, 2026 18:50

ci: add scenario-browser gate to rc-promotion workflow + make target

8983b60

fix(make): drop double 'run' in scenario-browser target

bb7887c

HydraOps-T-rav marked this pull request as ready for review April 20, 2026 16:55

HydraOps-T-rav merged commit b55809f into main Apr 20, 2026
24 checks passed

HydraOps-T-rav deleted the browser-scenario-harness branch April 20, 2026 19:22

HydraOps-T-rav merged 35 commits into main from browser-scenario-harness
