Skip to content

Remove DOM-validation chain (PR #586), add ContextModule prompt registry + recovery fixes#591

Merged
mercurialsolo merged 1 commit into
mainfrom
fix/cua-pure-recovery-and-context-modules
May 22, 2026
Merged

Remove DOM-validation chain (PR #586), add ContextModule prompt registry + recovery fixes#591
mercurialsolo merged 1 commit into
mainfrom
fix/cua-pure-recovery-and-context-modules

Conversation

@mercurialsolo
Copy link
Copy Markdown
Owner

Summary

  • Revert PR click handler: DOM-aware grounding — query element at click coords, reject non-card targets #586 DOM-validation chaincdp_element_matches_selector + listing_card_css_selector + PRE-CLICK REJECT + OFF_CARD recovery. Per feedback_cua_no_dom_access.md this is the "querySelectorAll-to-derive-target" pattern the architecture bans. Banner-ad misclick recovery now flows through the existing CUA-pure path (expect_url_contains + _maybe_demote_wrong_target).
  • Add ContextModule registry (src/mantis_agent/context_modules.py) — modular per-step prompt sections + handler hints, composed at brain dispatch time via a context-var. Two starter modules: BLOCKING_OVERLAY (extraction-phase carousel/modal/paywall dismiss) and KEYBOARD_SCROLL (scroll-step image-misclick defense). Future patterns add a single ContextModule entry, not global prompt edits.
  • Recovery-plumbing fixes uncovered through 12+ hours of run debugging — failure-history bug (_record_failure_for_retry early-returned on missing target), critic-frontier reading the wrong step_index key, binary fire-once guard replaced with growth-gated re-fires.
  • Extraction-quality fixesVIABLE | prefix bug in WorkflowRunner._extract_data Priority-1 (silent 0-lead reports), scroll brain budget cap 3→8 (was producing brain_loop_exhausted on 4000+ pixel pages), recovery analyzer prompt teaches "scroll exhausted mid-page" → add_hint not halt.
  • Cost-meter fixessync_gpu_seconds_from_time_meter fires unconditionally (per-step deltas absorb between-step think growth); back-allocate finalize residual to last step so per-step COST column sums to header total.

Why the DOM-validation chain was removed

cdp_element_matches_selector(x, y, css_selector) queried the DOM at a vision-picked coordinate, walked ancestors against a CSS selector, and gated whether to dispatch the click. That's using DOM to derive the action decision, not to verify a completed action. The memory note feedback_cua_no_dom_access.md explicitly bans this; PR #586 had drifted past it.

A 12-hour debugging arc shipped 8 progressive runner-side fixes plumbing information around this gate's failure mode. Live DOM inspection via Chrome MCP showed the selector matched 0 elements on the current boattrader page (BoatTrader had renamed the attribute the gate looked for). Updating the selector would have worked, but the existence of the per-site selector was itself the architectural bug. This PR removes the gate and keeps only the genuinely useful recovery-plumbing fixes from that arc.

Architectural rule for future modules

ContextModule.applies_when predicates are the analysis hook. Today's predicates are rule-based (step_type == "scroll"); the signature Callable[[dict], bool] accepts any analysis — a vision module that runs Claude against the recent screenshot is one entry in MODULES. Adding a new behavior for a newly observed failure pattern means adding ONE module entry, not editing global prompts.

Pre-existing always-on guidance (FORM FILLING / NAVIGATION) stays in the base holo3_system.txt prompt — the registry's job is augmentation, not replacement. Pinned in test_registry_does_not_modulerize_base_prompt_defaults.

Test plan

  • tests/test_context_modules.py (23 new tests) — context-var semantics, predicate firing matrix, prompt composition, handler hints, defensive predicate-error handling
  • tests/test_workflow_runner_extract_viable_prefix.py (4 new tests) — VIABLE prefix added in Priority-1, no double-prefix, holds across URL backfill, Priority-2 synthetic path unchanged
  • tests/test_execution_critic.py — growth-gated re-fire test + step_result.step_index keying test
  • tests/test_agentic_retry_feedback.py — bare-record-on-missing-target + stamp-guard
  • tests/test_cost_meter.py — sync even when steps_used == 0
  • tests/test_observability_augur_518.py — back-allocation happy path + noise floor + zero-emitted-steps
  • tests/test_step_recovery_policy.py — OFF_CARD cases removed (feature deleted)
  • tests/test_dom_aware_click_validation_586.py — file deleted
  • 387 tests pass across all affected suites

What this PR does NOT include

  • The plan file plans/boattrader_scrape_urlnav was rewritten locally (phone-optional viability + carousel hint in step 4 prose) but plans/ is not under version control — plan changes should land separately when the team chooses how to version plans.
  • scripts/run_boattrader_urlnav_with_proxy.py has pre-session edits per the standing rule against adhoc submit scripts in the repo; not extended in this PR.
  • A follow-up should teach the PlanDecomposer to preserve "DO NOT X" prohibitions from plan prose into MicroIntent.intent or hints.prohibitions — currently they're dropped during decomposition.

🤖 Generated with Claude Code

…y, fix recovery + VIABLE-prefix gaps

This pulls in two distinct lines of work that landed together through a long
debugging session against the boattrader by-owner scrape:

1) Remove the PR #586 DOM-validation chain (CUA violation per feedback_cua_no_dom_access.md)

   - Delete env.cdp_element_matches_selector (xdotool_env.py): DOM-derived
     decision about whether to dispatch a vision-picked click is exactly the
     "querySelectorAll-to-derive-target" pattern the architecture bans.
     window.scrollY-style post-action verification stays; element-at-point
     pre-action gating goes.
   - Delete SiteConfig.listing_card_css_selector field + the boattrader
     default. Per-site CSS selectors are themselves the architectural
     anti-pattern.
   - Delete the PRE-CLICK REJECT block in step_handlers/click.py and the
     scan-phase pre-filter / wipeout-driven OFF_CARD halt I had layered on
     top to work around it.
   - Delete the OFF_CARD branch in step_recovery.py and the
     _off_card_reject_count reset in run_executor._handle_success.
   - Delete tests/test_dom_aware_click_validation_586.py and the matching
     OFF_CARD cases in tests/test_step_recovery_policy.py.

   Failure mode this leaves behind (banner-ad / wrong-target misclick) is
   handled by the existing CUA-pure path: expect_url_contains /
   expect_url_excludes hints + _maybe_demote_wrong_target.

2) Keep the genuine recovery + extraction fixes uncovered along the way

   - Failure-history plumbing (run_executor.py):
     * _record_failure_for_retry no longer early-returns when
       _last_submit_target is absent — records bare-coord entries so the
       critic-frontier's failure_count actually advances. The prior guard
       made wrong_target loops invisible to escalation and produced
       30-minute time_cap halts with zero leads.
     * Centralized fallback call in _handle_failure for any failure_class in
       REWRITE_TRIGGERING_CLASSES, with a step_result stamp guard so demote
       sites that already recorded don't double-count.
   - Critic-frontier (critic.py):
     * Reads _step_failure_history under step_result.step_index, not
       state.step_index — the recovery policy can advance state.step_index
       past the failed step before _consult_critic runs, so the prior
       key-by-current-state lookup missed forever.
     * Growth-gated re-fires replace the binary "fire-once" guard. After
       firing at failure_count=N, the next fire requires failure_count >=
       N + threshold. Bounded by the existing per-step / per-run recovery
       budget caps so total Claude spend stays the same.
     * Instrumentation: budget consumption logged at WARNING when each
       site bumps _recovery_attempts_per_step.
   - Cost-meter (cost_meter.py):
     * sync_gpu_seconds_from_time_meter fires unconditionally inside
       record_step (not gated on steps_used > 0) so steps with no brain
       inference still absorb between-step think growth into their
       per-step delta.
   - Cost back-allocation (run_executor.py):
     * _back_allocate_residual_to_last_step patches the final step's
       total_usd so the per-step COST column sums to the run total.
       Catches GPU/think growth captured by the terminal sync after the
       last per-step delta was computed.
   - VIABLE-prefix fix (workflow_runner.py):
     * Priority-1 branch in _extract_data now prepends "VIABLE | " when
       missing. ListingDedup.successful_lead_data filters on the prefix —
       without this fix, every lead the brain extracted in the
       happy-path "Year: 2024 | Make: X | ..." shape got silently dropped
       from the count, producing 0-leads reports despite real extraction.
   - Scroll budget cap (step_handlers/holo3.py):
     * DEFAULT_BRAIN_BUDGET_CAPS["scroll"] bumped 3 -> 8. 3 actions cover
       ~1800px; modern detail pages are 4000-6000px, so the cap was
       producing brain_loop_exhausted on plans the decomposer thought it
       had budget for.

3) ContextModule registry + recovery-analysis prompt sections

   - New module src/mantis_agent/context_modules.py:
     * ContextModule dataclass (applies_when predicate + prompt_section +
       step_hint + handler_hints) + push_step_context context-var so
       handlers publish dispatch context without signature plumbing
       through GymRunner / brain layers.
     * MODULES registry with two starter modules — BLOCKING_OVERLAY
       (scoped to extraction-phase steps; carousel / cookie / paywall /
       newsletter dismiss protocol) and KEYBOARD_SCROLL (scoped to
       scroll steps; large-amount scroll() preferred, no image clicks,
       no within-page-nav link clicks).
     * compose_system_prompt + applicable_step_hints +
       applicable_handler_hints — three composition helpers covering all
       three contribution channels.
   - brain_holo3._build_messages composes the system prompt per dispatch
     via compose_system_prompt(SYSTEM_PROMPT).
   - gym/step_handlers/holo3.py wraps gym_runner.run in
     push_step_context({step_type, step_section, step_intent}).
   - prompts/files/recovery_analysis.txt adds two highest-priority
     diagnoses ahead of the existing decode table:
     * SCROLL STEP EXHAUSTED MID-PAGE — when scroll_y advanced but the
       page is still partially traversed, return add_hint("continue
       scrolling with larger amount=15") rather than halt. Earlier the
       analyzer's halt-bias killed scroll steps that just needed another
       2-3 iterations.
     * BLOCKING OVERLAY DETECTED — fullscreen carousel / cookie banner /
       paywall / signup modal -> add_hint with explicit dismiss-then-retry
       text. Site-agnostic; works through the existing recovery_hints
       channel.

Test coverage

- tests/test_context_modules.py (NEW, 23 tests) — context-var semantics
  (push/pop/nesting/copy), predicate firing matrix per starter module,
  prompt composition (sections include/exclude per context), handler
  hints surface, defensive predicate-error handling, registry sanity.
- tests/test_workflow_runner_extract_viable_prefix.py (NEW, 4 tests) —
  pins the VIABLE prefix is added in Priority-1, left intact when
  already present, holds across URL backfill, and that Priority-2's
  existing prefix injection still works.
- tests/test_execution_critic.py — added growth-gated re-fire test and
  step_result.step_index keying test; updated existing fire-once test
  to the new growth-gated semantic.
- tests/test_agentic_retry_feedback.py — updated to reflect that
  _record_failure_for_retry now always appends (bare record when no
  target), plus stamp-guard test.
- tests/test_cost_meter.py — added "sync even when steps_used == 0" test.
- tests/test_observability_augur_518.py — added back-allocation happy
  path + noise-floor + zero-emitted-steps cases.
- tests/test_dom_aware_click_validation_586.py — deleted entirely (the
  feature it tested is gone).

387 tests pass across the affected suites.

Memory note added at
~/.claude/projects/-Users-barada-Sandbox-Mason-mantis/memory/project_pr586_cua_violation_removed_2026_05_22.md
so the DOM-gate pattern doesn't get reintroduced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant