gym/learning_runner: oracle-path verifier swap for sim-env runs (#594) #607

Merged
mercurialsolo merged 1 commit into main from feat/learning-runner-oracle-594 on May 23, 2026

@mercurialsolo
Owner

Summary

Closes the third item of the cheap-verifier epic (#592 sim env → #593 reward primitives → this PR). When a LearningRunner has a sim-env session attached, the per-step StepVerifier calls short-circuit to cheap server-side probes instead of Claude-vision API calls.

Scope of the swap

Five verifier callsites in src/mantis_agent/gym/learning_runner.py — four replaced, one intentionally kept on vision:

| Site | Was | Now (oracle path) |
| --- | --- | --- |
| L186 verify_step (filter click) | screenshot before/after | _oracle_verify_url_signals against plan-author hint tokens |
| L193 verify_filter (post-filter signals) | image-text check | same (single signal source on the oracle path) |
| L236 verify_filter (retry) | image-text check | _oracle_verify_url_signals |
| L267 verify_on_correct_page (pre-iteration) | image content check | _oracle_verify_url_signals on results-page URL fragments |
| L297 verify_step (extraction iteration) | screenshot before/after | stays on vision: extraction reads the page and produces no mutation; vision still catches gallery traps / dealer drift / popups |

Vision path is unchanged for callers that don't pass oracle_session= — the constructor parameter defaults to None.
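The per-callsite choice described above can be pictured roughly as follows. This is a minimal sketch, not the code in learning_runner.py: `RunnerSketch`, its two injected callables, and the stand-in `VerificationResult` are hypothetical names used only for illustration. Only the `oracle_session` parameter and its `None` default come from the PR.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VerificationResult:
    # Hypothetical stand-in for the project's result type.
    verified: bool
    issue: str = ""


class RunnerSketch:
    """Illustrative only: picks the oracle or vision verifier per callsite."""

    def __init__(
        self,
        vision_verify: Callable[[], VerificationResult],
        oracle_verify: Callable[[], VerificationResult],
        oracle_session: Optional[object] = None,
    ) -> None:
        self._vision_verify = vision_verify
        self._oracle_verify = oracle_verify
        # Default None keeps every callsite on the vision path.
        self.oracle_session = oracle_session

    def verify_filter_step(self) -> VerificationResult:
        # Non-extraction sites use the oracle when a session is attached.
        if self.oracle_session is not None:
            return self._oracle_verify()
        return self._vision_verify()

    def verify_extraction_step(self) -> VerificationResult:
        # Extraction stays on vision whether or not a session is attached.
        return self._vision_verify()
```

Because both branches return the same result type, a caller of `verify_filter_step` does not need to know which path produced the answer.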

New helpers

  • _env_current_url() — best-effort URL accessor that tolerates current_url (callable or attribute), url attribute, or get_url() method. Returns "" when none match; swallows callable errors so the verifier never crashes the run.
  • _oracle_verify_url_signals(signals) — case-insensitive substring check, returns VerificationResult with the same shape as the vision path so callers don't branch on result inspection.
  • _oracle_verify_state_change(expected_ops=None) — fetches /__env__/mutations?since=<cursor> and advances the cursor. Verified on matching operation (or any mutation when no filter set). no_state_change on empty delta; oracle_unreachable on transport failure. Cursor is monotonic — a response with id < current_cursor does not regress it.
  • _sync_mutation_cursor() — phase-start helper called at the top of _learn_setup so the first filter step's verifier doesn't see the env_reset mutation stamped by an earlier reset.
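As a rough illustration of the first two helpers, the sketch below shows one way a tolerant URL accessor and a case-insensitive substring check could look. It is written as free functions with a stand-in `VerificationResult`; the issue strings (`no_url`, `no_signals`, `url_signal_mismatch`) and the choice to treat an empty signal list as unverified are assumptions, not taken from the PR.

```python
from dataclasses import dataclass
from typing import Any, Iterable


@dataclass
class VerificationResult:
    # Hypothetical stand-in for the project's result type.
    verified: bool
    issue: str = ""


def env_current_url(env: Any) -> str:
    """Best-effort URL lookup across the adapter shapes named above."""
    if env is None:
        return ""
    for name in ("current_url", "url", "get_url"):
        value = getattr(env, name, None)
        if callable(value):
            try:
                value = value()
            except Exception:
                # Swallow accessor errors; assumption: fall through to the next shape.
                continue
        if isinstance(value, str) and value:
            return value
    return ""


def oracle_verify_url_signals(url: str, signals: Iterable[str]) -> VerificationResult:
    """Case-insensitive substring match of any signal against the URL."""
    if not url:
        return VerificationResult(False, "no_url")
    haystack = url.lower()
    # Drop empty strings: "" is a substring of every URL and would always match.
    tokens = [s.lower() for s in signals if s]
    if not tokens:
        return VerificationResult(False, "no_signals")  # assumed behaviour
    if any(token in haystack for token in tokens):
        return VerificationResult(True)
    return VerificationResult(False, "url_signal_mismatch")
```

Filtering out empty tokens before matching matters because an empty string is a substring of any URL, which would otherwise turn every check into a pass.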

Tests

tests/test_learning_runner_oracle_path.py — 25 tests:

  • URL accessor under each adapter shape (callable, attribute, get_url(), none, throwing, env=None).
  • _oracle_verify_url_signals on substring hit, case-insensitive match, no match, missing URL, empty signal list, empty strings in signal list.
  • _oracle_verify_state_change on any-mutation / matching-op / non-matching-op / empty-delta / fetch-error paths; cursor monotonicity under out-of-order responses; since_id plumbing.
  • _sync_mutation_cursor advances past the tail / no-ops without session / no-ops on empty log.
  • Constructor wiring: default None preserves vision-only path; explicit session is stored.

```shell
uv run --extra server --extra dev pytest tests/test_learning_runner_oracle_path.py
# → 25 passed

uv run --extra server --extra dev pytest tests/ -q -k "learning_runner or oracle or rewards or extractor"
# → 235 passed (no regressions)

ruff check .
# → clean
```

Out of scope

  • No changes to GymRunner or the reward-fn info/state.extras wiring (a separate concern; the reward fn from #593, "rewards: oracle-driven per-step + terminal reward primitives (#590 part 1)", reads from gym_result.info, which the gym loop populates, not the LearningRunner).
  • Real-site runs (oracle_session=None) — the vision path stays untouched.
  • gym/critic.py / gym/step_recovery.py — CUA execution-time, not RL-reward.

Test plan

  • 25 new tests pass locally.
  • 235 tests pass across reward/oracle/learning_runner/extractor suites (no regressions).
  • ruff check . clean.
  • CI green on this PR.
  • After merge: smoke-test a boattrader sim-env training run with oracle_session= passed in, verify StepVerifier API call count drops ≥80%.

Closes #594.

🤖 Generated with Claude Code

Wires the per-step oracle primitives from #590 / PR #593 into the
``LearningRunner``. When a sim-env session is attached, the four
non-extraction ``StepVerifier`` calls short-circuit to cheap
server-side probes:

* Filter steps (``_learn_setup``) — ``verify_step`` + ``verify_filter``
  + retry ``verify_filter`` replaced with ``_oracle_verify_url_signals``.
  Filter clicks change the URL, not the env's mutation log, so the
  oracle path verifies via URL substring against the plan-author's
  hint tokens.
* Pre-iteration page check (``_learn_extraction``) — ``verify_on_correct_page``
  replaced with ``_oracle_verify_url_signals`` keyed on
  ``["/boats", "/listings", "/search", "boats/"]``.
* Extraction iteration (``_learn_extraction:297``) — stays on vision
  unconditionally. Extraction reads the page and produces no mutation,
  so the oracle can't tell success from failure here; vision still
  catches gallery traps / dealer drift / popups.

Four new private helpers:

* ``_env_current_url`` — best-effort accessor that tolerates
  ``current_url`` (callable or attribute), ``url`` attribute, or
  ``get_url()`` method. Returns ``""`` when none match; swallows
  callable errors so the verifier never crashes the run.
* ``_oracle_verify_url_signals(signals)`` — case-insensitive substring
  check, returns ``VerificationResult`` matching the vision path's
  shape so callers don't branch on result inspection.
* ``_oracle_verify_state_change(expected_ops=None)`` — fetches
  ``/__env__/mutations?since=<cursor>`` and advances the cursor.
  Returns ``verified=True`` on a matching ``operation`` (or any
  mutation when no filter set); ``verified=False`` with
  ``issue=no_state_change`` on empty delta; ``oracle_unreachable``
  on transport failure. Cursor is monotonic — a response with
  ``id < current_cursor`` does not regress it.
* ``_sync_mutation_cursor()`` — phase-start helper. Called at the
  top of ``_learn_setup`` so the first filter step's verifier
  doesn't see the ``env_reset`` mutation stamped by an earlier reset.
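The cursor handling in the last two helpers can be sketched as below. This is a simplified illustration under stated assumptions: ``MutationOracle`` and its injected ``fetch`` callable (standing in for the HTTP call to ``/__env__/mutations?since=<cursor>``) are hypothetical, as is the ``unexpected_operation`` issue label for the non-matching case. The ``id`` and ``operation`` fields and the ``no_state_change`` / ``oracle_unreachable`` issues come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class VerificationResult:
    # Hypothetical stand-in for the project's result type.
    verified: bool
    issue: str = ""


class MutationOracle:
    """Illustrative cursor-based state-change check."""

    def __init__(self, fetch: Callable[[int], List[Dict]]) -> None:
        self._fetch = fetch  # hypothetical: returns mutations newer than `since`
        self.cursor = 0

    def _advance(self, mutations: List[Dict]) -> None:
        # Monotonic: an id below the current cursor never moves it backwards.
        for mutation in mutations:
            self.cursor = max(self.cursor, int(mutation.get("id", 0)))

    def sync(self) -> None:
        """Phase-start helper: skip past mutations already in the log."""
        try:
            self._advance(self._fetch(self.cursor))
        except Exception:
            pass  # assumption: a failed sync leaves the cursor untouched

    def verify_state_change(
        self, expected_ops: Optional[Iterable[str]] = None
    ) -> VerificationResult:
        try:
            mutations = self._fetch(self.cursor)
        except Exception:
            return VerificationResult(False, "oracle_unreachable")
        self._advance(mutations)
        if not mutations:
            return VerificationResult(False, "no_state_change")
        if expected_ops is None:
            return VerificationResult(True)  # any mutation counts
        wanted = set(expected_ops)
        if any(m.get("operation") in wanted for m in mutations):
            return VerificationResult(True)
        return VerificationResult(False, "unexpected_operation")  # assumed label
```

Calling ``sync()`` once before the first verification is what keeps an earlier ``env_reset`` entry from being counted as the result of the first filter step.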

The constructor gains an optional ``oracle_session: EnvSession | None``
parameter; default ``None`` preserves the existing vision-only path
for real-site runs unchanged.

Tests under ``tests/test_learning_runner_oracle_path.py`` (25 tests):
URL accessor under each adapter shape, oracle URL-signals, state-change
verifier on matching/missing/empty/error paths, cursor monotonicity,
sync helper, constructor wiring. No regressions across the broader
reward/oracle/extractor suite (235 passed).

Closes #594.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mercurialsolo merged commit 8196f4d into main on May 23, 2026
4 checks passed

Development

Successfully merging this pull request may close these issues.

gym/learning_runner: wire oracle primitives, replace vision StepVerifier on sim-env runs (follow-up to #590)
