gym/learning_runner: oracle-path verifier swap for sim-env runs (#594) #607
Merged
Wires the per-step oracle primitives from #590 / PR #593 into the ``LearningRunner``. When a sim-env session is attached, the four non-extraction ``StepVerifier`` calls short-circuit to cheap server-side probes:

* Filter steps (``_learn_setup``): ``verify_step``, ``verify_filter``, and the retry ``verify_filter`` are replaced with ``_oracle_verify_url_signals``. Filter clicks change the URL, not the env's mutation log, so the oracle path verifies by URL substring match against the plan author's hint tokens.
* Pre-iteration page check (``_learn_extraction``): ``verify_on_correct_page`` is replaced with ``_oracle_verify_url_signals`` keyed on ``["/boats", "/listings", "/search", "boats/"]``.
* Extraction iteration (``_learn_extraction:297``): stays on vision unconditionally. Extraction reads the page and produces no mutation, so the oracle can't distinguish success from failure here; vision still catches gallery traps, dealer drift, and popups.

Four new private helpers:

* ``_env_current_url``: best-effort accessor that tolerates ``current_url`` (callable or attribute), a ``url`` attribute, or a ``get_url()`` method. Returns ``""`` when none match and swallows callable errors, so the verifier never crashes the run.
* ``_oracle_verify_url_signals(signals)``: case-insensitive substring check; returns a ``VerificationResult`` with the same shape as the vision path's, so callers don't need to branch when inspecting results.
* ``_oracle_verify_state_change(expected_ops=None)``: fetches ``/__env__/mutations?since=<cursor>`` and advances the cursor. Returns ``verified=True`` on a matching ``operation`` (or on any mutation when no filter is set); ``verified=False`` with ``issue=no_state_change`` on an empty delta; ``oracle_unreachable`` on transport failure. The cursor is monotonic: a response with ``id < current_cursor`` does not regress it.
* ``_sync_mutation_cursor()``: phase-start helper. Called at the top of ``_learn_setup`` so the first filter step's verifier doesn't see the ``env_reset`` mutation stamped by an earlier reset.
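A minimal sketch of the two URL helpers, for readers who want to see the fallback order and the matching rule concretely. It assumes a simple `VerificationResult` dataclass with `verified`, `issue`, and `details` fields; the runner's real type may differ, and the issue tags `no_url` and `url_signal_mismatch` are hypothetical. The functions are standalone stand-ins for the private methods, not the PR's code. Treating an empty or all-blank signal list as unverified is also an assumption (a blank token would otherwise match every URL).

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Sequence


@dataclass
class VerificationResult:
    # Hypothetical stand-in for the runner's result type; field names are assumptions.
    verified: bool
    issue: str = ""
    details: dict[str, Any] = field(default_factory=dict)


def env_current_url(env: Any) -> str:
    """Best-effort URL lookup: current_url (callable or attribute), url, get_url()."""
    if env is None:
        return ""
    for name in ("current_url", "url", "get_url"):
        candidate = getattr(env, name, None)
        if candidate is None:
            continue
        if callable(candidate):
            try:
                candidate = candidate()
            except Exception:
                # Swallow accessor errors so verification never crashes the run.
                continue
        if isinstance(candidate, str) and candidate:
            return candidate
    return ""


def oracle_verify_url_signals(env: Any, signals: Sequence[str]) -> VerificationResult:
    """Verified if any non-blank signal is a case-insensitive substring of the URL."""
    url = env_current_url(env)
    if not url:
        return VerificationResult(False, issue="no_url")  # hypothetical issue tag
    # Assumption: blank tokens are dropped, since "" is a substring of every URL.
    tokens = [s.lower() for s in signals if s]
    hits = [t for t in tokens if t in url.lower()]
    if hits:
        return VerificationResult(True, details={"url": url, "matched": hits})
    return VerificationResult(False, issue="url_signal_mismatch", details={"url": url})


# Usage with the results-page fragments from the pre-iteration check.
RESULTS_PAGE_SIGNALS = ["/boats", "/listings", "/search", "boats/"]


class FakeEnv:
    current_url = "https://sim.example/Boats/for-sale?page=2"


print(oracle_verify_url_signals(FakeEnv(), RESULTS_PAGE_SIGNALS).verified)  # True
```

Returning the same result shape on every branch mirrors the PR's stated goal: call sites can read `verified`/`issue` without caring which path produced the result.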
The constructor gains an optional ``oracle_session: EnvSession | None`` parameter; the default ``None`` leaves the existing vision-only path for real-site runs unchanged.

Tests under ``tests/test_learning_runner_oracle_path.py`` (25 tests): URL accessor under each adapter shape, oracle URL signals, state-change verifier on matching/missing/empty/error paths, cursor monotonicity, sync helper, constructor wiring. No regressions across the broader reward/oracle/extractor suite (235 passed).

Closes #594.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
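The cursor-based state-change verifier and the phase-start sync described in the helper list above can be sketched as follows. This is a hedged illustration, not the runner's implementation. Assumptions: the mutations endpoint yields a list of dicts with an integer `id` and a string `operation`; a hypothetical `fetch_mutations(since)` callable stands in for the GET to `/__env__/mutations?since=<cursor>`; `OracleSketch` is a hypothetical class whose `fetch_mutations=None` default mirrors `oracle_session=None`. The `no_matching_operation` tag and the `save_listing` operation name are made up for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, Optional

# Hypothetical transport: returns mutations with id > since; raises on transport failure.
FetchMutations = Callable[[int], list]


@dataclass
class VerificationResult:
    verified: bool
    issue: str = ""
    details: dict[str, Any] = field(default_factory=dict)


class OracleSketch:
    def __init__(self, fetch_mutations: Optional[FetchMutations] = None) -> None:
        # None mirrors oracle_session=None: callers stay on the vision path.
        self.fetch_mutations = fetch_mutations
        self.mutation_cursor = 0

    def _advance(self, mutations: list) -> None:
        # Monotonic: an out-of-order (older) response never moves the cursor back.
        ids = [int(m["id"]) for m in mutations]
        if ids:
            self.mutation_cursor = max(self.mutation_cursor, max(ids))

    def sync_mutation_cursor(self) -> None:
        """Phase start: skip mutations already logged (e.g. env_reset)."""
        if self.fetch_mutations is None:
            return  # no session: no-op
        try:
            self._advance(self.fetch_mutations(self.mutation_cursor))
        except Exception:
            pass  # best effort; the next verifier call reports unreachability

    def verify_state_change(self, expected_ops: Optional[Iterable[str]] = None) -> VerificationResult:
        if self.fetch_mutations is None:
            # Assumption: without a session there is nothing to probe.
            return VerificationResult(False, issue="oracle_unreachable")
        try:
            delta = self.fetch_mutations(self.mutation_cursor)
        except Exception as exc:
            return VerificationResult(False, issue="oracle_unreachable", details={"error": str(exc)})
        self._advance(delta)  # the cursor advances whether or not anything matches
        if not delta:
            return VerificationResult(False, issue="no_state_change")
        wanted = set(expected_ops) if expected_ops is not None else None
        matches = [m for m in delta if wanted is None or m.get("operation") in wanted]
        if matches:
            return VerificationResult(True, details={"matched": matches})
        return VerificationResult(False, issue="no_matching_operation", details={"delta": delta})


# Usage against an in-memory mutation log.
log = [{"id": 1, "operation": "env_reset"}]


def fetch(since: int) -> list:
    return [m for m in log if m["id"] > since]


runner = OracleSketch(fetch)
runner.sync_mutation_cursor()                  # cursor -> 1, env_reset is skipped
print(runner.verify_state_change().issue)      # no_state_change
log.append({"id": 2, "operation": "save_listing"})
print(runner.verify_state_change(["save_listing"]).verified)  # True
```

Taking `max` over the returned ids, rather than assigning the last id seen, is what keeps a stale or reordered response from rewinding the cursor and re-reporting old mutations.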
Summary
Closes the third item of the cheap-verifier epic (#592 sim env → #593 reward primitives → this PR). When a `LearningRunner` has a sim-env session attached, the per-step `StepVerifier` calls short-circuit to cheap server-side probes instead of Claude vision API calls.

Scope of the swap
Five verifier callsites in `src/mantis_agent/gym/learning_runner.py`: four replaced, one intentionally kept on vision.

| Callsite | Oracle path |
|---|---|
| `verify_step` (filter click) | `_oracle_verify_url_signals` against plan-author hint tokens |
| `verify_filter` (post-filter signals) | `_oracle_verify_url_signals` |
| `verify_filter` (retry) | `_oracle_verify_url_signals` |
| `verify_on_correct_page` (pre-iteration) | `_oracle_verify_url_signals` on results-page URL fragments |
| `verify_step` (extraction iteration) | Unchanged; stays on vision |

The vision path is unchanged for callers that don't pass `oracle_session=`; the constructor parameter defaults to `None`.

New helpers
- `_env_current_url()`: best-effort URL accessor that tolerates `current_url` (callable or attribute), a `url` attribute, or a `get_url()` method. Returns `""` when none match; swallows callable errors so the verifier never crashes the run.
- `_oracle_verify_url_signals(signals)`: case-insensitive substring check; returns a `VerificationResult` with the same shape as the vision path's, so callers don't branch on result inspection.
- `_oracle_verify_state_change(expected_ops=None)`: fetches `/__env__/mutations?since=<cursor>` and advances the cursor. Verified on a matching `operation` (or on any mutation when no filter is set); `no_state_change` on an empty delta; `oracle_unreachable` on transport failure. The cursor is monotonic: a response with `id < current_cursor` does not regress it.
- `_sync_mutation_cursor()`: phase-start helper called at the top of `_learn_setup` so the first filter step's verifier doesn't see the `env_reset` mutation stamped by an earlier reset.

Tests
`tests/test_learning_runner_oracle_path.py`: 25 tests.

- URL accessor under each adapter shape (…, `get_url()`, none, throwing, `env=None`).
- `_oracle_verify_url_signals` on substring hit, case-insensitive match, no match, missing URL, empty signal list, and empty strings in the signal list.
- `_oracle_verify_state_change` on any-mutation / matching-op / non-matching-op / empty-delta / fetch-error paths; cursor monotonicity under out-of-order responses; `since_id` plumbing.
- `_sync_mutation_cursor` advances past the tail / no-ops without a session / no-ops on an empty log.
- Constructor wiring: … `None` preserves the vision-only path; an explicit session is stored.

Out of scope
- … `GymRunner` or the reward-fn `info`/`state.extras` wiring (a separate concern; the reward fn from #593, "rewards: oracle-driven per-step + terminal reward primitives (#590 part 1)", reads from `gym_result.info`, which the gym loop populates, not the `LearningRunner`).
- … (`oracle_session=None`): the vision path stays untouched.
- `gym/critic.py` / `gym/step_recovery.py`: CUA execution-time, not RL reward.

Test plan
- `ruff check .` clean.
- … `oracle_session=` passed in, verify `StepVerifier` API call count drops ≥80%.

Closes #594.
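For the manual call-count check in the test plan, one option is to wrap the verifier in a counting proxy and compare a vision-only run against an oracle run. This is a hedged sketch: `CountingProxy`, `FakeVerifier`, and `drop_ratio` are hypothetical helpers, not part of the PR, and the counts in the usage lines are made up purely to show the arithmetic.

```python
from collections import Counter
from typing import Any


class CountingProxy:
    """Forwards attribute access to `inner`, counting calls to its methods by name."""

    def __init__(self, inner: Any) -> None:
        self._inner = inner
        self.calls = Counter()

    def __getattr__(self, name: str) -> Any:
        # Only invoked for names not found on the proxy itself.
        attr = getattr(self._inner, name)
        if not callable(attr):
            return attr

        def counted(*args: Any, **kwargs: Any) -> Any:
            self.calls[name] += 1
            return attr(*args, **kwargs)

        return counted


def drop_ratio(vision_calls: int, oracle_calls: int) -> float:
    """Fraction of verifier API calls removed by the oracle path."""
    if vision_calls == 0:
        return 0.0
    return 1 - oracle_calls / vision_calls


class FakeVerifier:
    def verify_step(self, *args: Any) -> bool:
        return True


v = CountingProxy(FakeVerifier())
v.verify_step("click")
v.verify_step("extract")
print(v.calls["verify_step"])         # 2
print(round(drop_ratio(50, 8), 2))    # 0.84, i.e. above the 80% target
```

Summing `calls.values()` for each run gives the two totals to feed into `drop_ratio`.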
🤖 Generated with Claude Code