Remove DOM-validation chain (PR #586), add ContextModule prompt registry + recovery fixes#591
Merged
Conversation
…y, fix recovery + VIABLE-prefix gaps This pulls in two distinct lines of work that landed together through a long debugging session against the boattrader by-owner scrape: 1) Remove the PR #586 DOM-validation chain (CUA violation per feedback_cua_no_dom_access.md) - Delete env.cdp_element_matches_selector (xdotool_env.py): DOM-derived decision about whether to dispatch a vision-picked click is exactly the "querySelectorAll-to-derive-target" pattern the architecture bans. window.scrollY-style post-action verification stays; element-at-point pre-action gating goes. - Delete SiteConfig.listing_card_css_selector field + the boattrader default. Per-site CSS selectors are themselves the architectural anti-pattern. - Delete the PRE-CLICK REJECT block in step_handlers/click.py and the scan-phase pre-filter / wipeout-driven OFF_CARD halt I had layered on top to work around it. - Delete the OFF_CARD branch in step_recovery.py and the _off_card_reject_count reset in run_executor._handle_success. - Delete tests/test_dom_aware_click_validation_586.py and the matching OFF_CARD cases in tests/test_step_recovery_policy.py. Failure mode this leaves behind (banner-ad / wrong-target misclick) is handled by the existing CUA-pure path: expect_url_contains / expect_url_excludes hints + _maybe_demote_wrong_target. 2) Keep the genuine recovery + extraction fixes uncovered along the way - Failure-history plumbing (run_executor.py): * _record_failure_for_retry no longer early-returns when _last_submit_target is absent — records bare-coord entries so the critic-frontier's failure_count actually advances. The prior guard made wrong_target loops invisible to escalation and produced 30-minute time_cap halts with zero leads. * Centralized fallback call in _handle_failure for any failure_class in REWRITE_TRIGGERING_CLASSES, with a step_result stamp guard so demote sites that already recorded don't double-count. - Critic-frontier (critic.py): * Reads _step_failure_history under step_result.step_index, not state.step_index — the recovery policy can advance state.step_index past the failed step before _consult_critic runs, so the prior key-by-current-state lookup missed forever. * Growth-gated re-fires replace the binary "fire-once" guard. After firing at failure_count=N, the next fire requires failure_count >= N + threshold. Bounded by the existing per-step / per-run recovery budget caps so total Claude spend stays the same. * Instrumentation: budget consumption logged at WARNING when each site bumps _recovery_attempts_per_step. - Cost-meter (cost_meter.py): * sync_gpu_seconds_from_time_meter fires unconditionally inside record_step (not gated on steps_used > 0) so steps with no brain inference still absorb between-step think growth into their per-step delta. - Cost back-allocation (run_executor.py): * _back_allocate_residual_to_last_step patches the final step's total_usd so the per-step COST column sums to the run total. Catches GPU/think growth captured by the terminal sync after the last per-step delta was computed. - VIABLE-prefix fix (workflow_runner.py): * Priority-1 branch in _extract_data now prepends "VIABLE | " when missing. ListingDedup.successful_lead_data filters on the prefix — without this fix, every lead the brain extracted in the happy-path "Year: 2024 | Make: X | ..." shape got silently dropped from the count, producing 0-leads reports despite real extraction. - Scroll budget cap (step_handlers/holo3.py): * DEFAULT_BRAIN_BUDGET_CAPS["scroll"] bumped 3 -> 8. 3 actions cover ~1800px; modern detail pages are 4000-6000px, so the cap was producing brain_loop_exhausted on plans the decomposer thought it had budget for. 3) ContextModule registry + recovery-analysis prompt sections - New module src/mantis_agent/context_modules.py: * ContextModule dataclass (applies_when predicate + prompt_section + step_hint + handler_hints) + push_step_context context-var so handlers publish dispatch context without signature plumbing through GymRunner / brain layers. * MODULES registry with two starter modules — BLOCKING_OVERLAY (scoped to extraction-phase steps; carousel / cookie / paywall / newsletter dismiss protocol) and KEYBOARD_SCROLL (scoped to scroll steps; large-amount scroll() preferred, no image clicks, no within-page-nav link clicks). * compose_system_prompt + applicable_step_hints + applicable_handler_hints — three composition helpers covering all three contribution channels. - brain_holo3._build_messages composes the system prompt per dispatch via compose_system_prompt(SYSTEM_PROMPT). - gym/step_handlers/holo3.py wraps gym_runner.run in push_step_context({step_type, step_section, step_intent}). - prompts/files/recovery_analysis.txt adds two highest-priority diagnoses ahead of the existing decode table: * SCROLL STEP EXHAUSTED MID-PAGE — when scroll_y advanced but the page is still partially traversed, return add_hint("continue scrolling with larger amount=15") rather than halt. Earlier the analyzer's halt-bias killed scroll steps that just needed another 2-3 iterations. * BLOCKING OVERLAY DETECTED — fullscreen carousel / cookie banner / paywall / signup modal -> add_hint with explicit dismiss-then-retry text. Site-agnostic; works through the existing recovery_hints channel. Test coverage - tests/test_context_modules.py (NEW, 23 tests) — context-var semantics (push/pop/nesting/copy), predicate firing matrix per starter module, prompt composition (sections include/exclude per context), handler hints surface, defensive predicate-error handling, registry sanity. - tests/test_workflow_runner_extract_viable_prefix.py (NEW, 4 tests) — pins the VIABLE prefix is added in Priority-1, left intact when already present, holds across URL backfill, and that Priority-2's existing prefix injection still works. - tests/test_execution_critic.py — added growth-gated re-fire test and step_result.step_index keying test; updated existing fire-once test to the new growth-gated semantic. - tests/test_agentic_retry_feedback.py — updated to reflect that _record_failure_for_retry now always appends (bare record when no target), plus stamp-guard test. - tests/test_cost_meter.py — added "sync even when steps_used == 0" test. - tests/test_observability_augur_518.py — added back-allocation happy path + noise-floor + zero-emitted-steps cases. - tests/test_dom_aware_click_validation_586.py — deleted entirely (the feature it tested is gone). 387 tests pass across the affected suites. Memory note added at ~/.claude/projects/-Users-barada-Sandbox-Mason-mantis/memory/project_pr586_cua_violation_removed_2026_05_22.md so the DOM-gate pattern doesn't get reintroduced. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
cdp_element_matches_selector+listing_card_css_selector+ PRE-CLICK REJECT + OFF_CARD recovery. Perfeedback_cua_no_dom_access.mdthis is the "querySelectorAll-to-derive-target" pattern the architecture bans. Banner-ad misclick recovery now flows through the existing CUA-pure path (expect_url_contains+_maybe_demote_wrong_target).src/mantis_agent/context_modules.py) — modular per-step prompt sections + handler hints, composed at brain dispatch time via a context-var. Two starter modules:BLOCKING_OVERLAY(extraction-phase carousel/modal/paywall dismiss) andKEYBOARD_SCROLL(scroll-step image-misclick defense). Future patterns add a singleContextModuleentry, not global prompt edits._record_failure_for_retryearly-returned on missing target), critic-frontier reading the wrong step_index key, binary fire-once guard replaced with growth-gated re-fires.VIABLE |prefix bug inWorkflowRunner._extract_dataPriority-1 (silent 0-lead reports), scroll brain budget cap 3→8 (was producingbrain_loop_exhaustedon 4000+ pixel pages), recovery analyzer prompt teaches "scroll exhausted mid-page" →add_hintnothalt.sync_gpu_seconds_from_time_meterfires unconditionally (per-step deltas absorb between-step think growth); back-allocate finalize residual to last step so per-step COST column sums to header total.Why the DOM-validation chain was removed
cdp_element_matches_selector(x, y, css_selector)queried the DOM at a vision-picked coordinate, walked ancestors against a CSS selector, and gated whether to dispatch the click. That's using DOM to derive the action decision, not to verify a completed action. The memory notefeedback_cua_no_dom_access.mdexplicitly bans this; PR #586 had drifted past it.A 12-hour debugging arc shipped 8 progressive runner-side fixes plumbing information around this gate's failure mode. Live DOM inspection via Chrome MCP showed the selector matched 0 elements on the current boattrader page (BoatTrader had renamed the attribute the gate looked for). Updating the selector would have worked, but the existence of the per-site selector was itself the architectural bug. This PR removes the gate and keeps only the genuinely useful recovery-plumbing fixes from that arc.
Architectural rule for future modules
ContextModule.applies_whenpredicates are the analysis hook. Today's predicates are rule-based (step_type == "scroll"); the signatureCallable[[dict], bool]accepts any analysis — a vision module that runs Claude against the recent screenshot is one entry inMODULES. Adding a new behavior for a newly observed failure pattern means adding ONE module entry, not editing global prompts.Pre-existing always-on guidance (FORM FILLING / NAVIGATION) stays in the base
holo3_system.txtprompt — the registry's job is augmentation, not replacement. Pinned intest_registry_does_not_modulerize_base_prompt_defaults.Test plan
tests/test_context_modules.py(23 new tests) — context-var semantics, predicate firing matrix, prompt composition, handler hints, defensive predicate-error handlingtests/test_workflow_runner_extract_viable_prefix.py(4 new tests) — VIABLE prefix added in Priority-1, no double-prefix, holds across URL backfill, Priority-2 synthetic path unchangedtests/test_execution_critic.py— growth-gated re-fire test + step_result.step_index keying testtests/test_agentic_retry_feedback.py— bare-record-on-missing-target + stamp-guardtests/test_cost_meter.py— sync even whensteps_used == 0tests/test_observability_augur_518.py— back-allocation happy path + noise floor + zero-emitted-stepstests/test_step_recovery_policy.py— OFF_CARD cases removed (feature deleted)tests/test_dom_aware_click_validation_586.py— file deletedWhat this PR does NOT include
plans/boattrader_scrape_urlnavwas rewritten locally (phone-optional viability + carousel hint in step 4 prose) butplans/is not under version control — plan changes should land separately when the team chooses how to version plans.scripts/run_boattrader_urlnav_with_proxy.pyhas pre-session edits per the standing rule against adhoc submit scripts in the repo; not extended in this PR.PlanDecomposerto preserve "DO NOT X" prohibitions from plan prose into MicroIntent.intent orhints.prohibitions— currently they're dropped during decomposition.🤖 Generated with Claude Code