rewards: oracle-driven per-step + terminal reward primitives (#590 part 1) #593
Merged
…rt 1)

Lands the cheap-verifier primitives that let sim-env training swap Claude-vision ``StepVerifier`` calls for server-side ground truth. This PR ships the primitives only. The ``LearningRunner`` wiring that actually replaces the vision calls is deferred to a follow-up so we can land the reward + HTTP-client + reward-fn surface in isolation, with full unit-test coverage, ahead of the larger behavior change.

* ``mantis_agent.sim_envs.oracle_client.fetch_mutations(url, token, *, since_id)``: companion to ``gym.grading.grade_run``; pulls ``GET /__env__/mutations[?since=<id>]`` and returns the audit-log tail, which the caller diffs against the previous step. The ``last_mutation_id`` helper advances the cursor between polls. Never raises; failures populate an ``error`` key.
* ``mantis_agent.rewards.components.oracle_step_reward(delta, expected_ops, value=0.1)``: pure function awarding ``value`` per matching mutation ``operation`` in the delta. Falls back to "no signal" on empty input, so it's safe to call every step regardless of whether a sim env is in the loop.
* ``mantis_agent.recipes.marketplace_listings.rewards.SyntheticEnvReward``: drop-in subclass of ``MarketplaceListingReward`` that reads ``gym_result.info["oracle_mutations_delta"]`` for per-step credit and ``state.extras["oracle_terminal"]`` for the F1 + pass bonus at episode end. The reward stays pure: the caller's loop populates the info dict from ``fetch_mutations`` + ``grade_run``.

Default op→step-kind table ships with mantis-boattrader operations (``lead_submitted``, ``phone_revealed``, ``consent_set``). Recipes targeting a different sim env pass their own table at construction.

Pre-existing latent circular import (the ``rewards.boattrader`` deprecated alias subclasses ``MarketplaceListingReward`` at import time) is sidestepped in the test file by importing ``mantis_agent.rewards`` before the recipe module. Worth cleaning up separately but out of scope here.

Tests: ``tests/test_oracle_client.py`` (15 tests, urllib mocked) + ``tests/test_oracle_step_reward.py`` (20 tests covering the pure function, ``SyntheticEnvReward.step``, ``.episode``, terminal-override semantics, and the default-table shape). 35 passed, all green.

Follow-up: wire these into ``gym/learning_runner.py`` so the existing ``verify_filter`` / ``verify_on_correct_page`` / ``verify_step`` callsites short-circuit when a sim env is present. That change touches the runner directly and deserves its own PR.

Closes #590.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
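The per-step primitive described in the commit message can be pictured with a small sketch. This is not the code from the PR: the name ``sketch_oracle_step_reward`` is hypothetical, and it assumes that each mutation in the delta is a dict carrying an ``operation`` string and that "no signal" means a reward of ``0.0``. The ``page_viewed`` operation in the example data is also made up for illustration.

```python
from typing import Any, Iterable, Optional


def sketch_oracle_step_reward(
    delta: Optional[Iterable[Any]],
    expected_ops: Optional[Iterable[str]],
    value: float = 0.1,
) -> float:
    """Hypothetical stand-in: award `value` per mutation whose operation is expected."""
    expected = set(expected_ops or ())
    if not expected:
        # Nothing is expected for this step, so there is no signal to award.
        return 0.0

    total = 0.0
    for entry in delta or ():
        # Assumption: a well-formed mutation is a dict with a string "operation".
        if not isinstance(entry, dict):
            continue  # skip malformed entries instead of raising
        operation = entry.get("operation")
        if isinstance(operation, str) and operation in expected:
            total += value
    return total


# Example delta: two expected operations, one unrelated, one malformed entry.
example_delta = [
    {"operation": "lead_submitted"},
    {"operation": "page_viewed"},  # hypothetical op, not in the expected set
    "not-a-dict",
    {"operation": "phone_revealed"},
]
example_reward = sketch_oracle_step_reward(
    example_delta, {"lead_submitted", "phone_revealed"}
)
```

Because the function only reads its arguments, calling it with an empty or missing delta is harmless, which is the property that makes it safe to invoke on every step.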
This was referenced May 22, 2026
Summary
Lands the cheap-verifier primitives that let sim-env training swap Claude-vision ``StepVerifier`` calls for server-side ground truth. This PR ships the primitives only. The ``LearningRunner`` wiring that actually replaces the vision calls is deferred to a follow-up so we can land the reward + HTTP-client + reward-fn surface in isolation, with full unit-test coverage, ahead of the larger behavior change.

The boattrader sim env + oracle harness landed in #592 (closed #588, #589). This PR is the agent-side consumer of that env's ``/__env__/mutations`` + ``/__env__/oracle`` endpoints.

What's in
``mantis_agent.sim_envs.oracle_client``

New module, companion to ``gym.grading``, which already wraps the terminal ``GET /__env__/oracle?task_id=<id>`` call.

* ``fetch_mutations(url, admin_token, *, since_id=0, timeout_s=10) -> dict``: pulls ``GET /__env__/mutations[?since=<id>]`` and returns the audit-log tail since the cursor. Never raises; network/HTTP/parse failures populate an ``error`` key on the returned dict.
* ``last_mutation_id(mutations) -> int``: helper to advance the cursor between polls.

``mantis_agent.rewards.components.oracle_step_reward``

Pure function ``oracle_step_reward(delta, expected_ops, value=0.1)``:
Awards ``value`` per matching ``operation`` in the delta. Falls back to "no signal" on empty input, so it's safe to call every step regardless of whether a sim env is in the loop.

``mantis_agent.recipes.marketplace_listings.rewards.SyntheticEnvReward``

Drop-in subclass of
``MarketplaceListingReward``:

* ``step()``: adds an ``oracle_step`` component when ``gym_result.info["oracle_mutations_delta"]`` contains a mutation whose ``operation`` matches the expected set for the current step (looked up via ``info["oracle_step_kind"]`` against ``expected_ops_by_step_kind``).
* ``episode()``: when ``state.extras["oracle_terminal"]`` (the ``GradingResult.to_dict()``) is present, replaces the parent's done-summary gate with the oracle's F1 score plus a pass bonus.

The reward stays pure: the caller's loop populates ``info`` from ``fetch_mutations`` and ``state.extras`` from ``grade_run``. That keeps the reward fn deterministic given inputs (testable as a pure function) and lets the same fetcher feed multiple reward implementations.

Default op→step-kind table ships with mantis-boattrader operations (``lead_submitted``, ``phone_revealed``, ``consent_set``). Recipes targeting a different sim env pass their own table at construction.

Pre-existing latent circular import: sidestepped, not fixed
``rewards.boattrader`` (deprecated alias) subclasses ``MarketplaceListingReward`` at import time, so the import order matters when a test imports ``recipes.marketplace_listings.rewards`` first. The new test file imports ``mantis_agent.rewards`` ahead of the recipe to sidestep it. Worth cleaning up separately; flagging here so reviewers know the noqa isn't a workaround for new code.

Tests
* ``tests/test_oracle_client.py`` (15 tests, ``unittest.mock.patch`` over ``urlopen``): happy path + since param + every failure mode (HTTP error, network error, non-JSON, non-dict payload, missing key, empty url/token).
* ``tests/test_oracle_step_reward.py`` (20 tests):
  * ``oracle_step_reward`` pure-function coverage including iterable kinds, malformed entries, custom values.
  * ``SyntheticEnvReward.step`` with/without delta, step-kind tagging, empty expected set, cumulative ``oracle_step_total``.
  * ``SyntheticEnvReward.episode`` terminal override, passed/failed branches, custom weight.

Follow-up (not in this PR)
Wire these primitives into ``gym/learning_runner.py`` so the existing ``verify_filter`` / ``verify_on_correct_page`` / ``verify_step`` callsites short-circuit to the oracle when a sim env is present. That change touches the runner directly and deserves its own PR; I'll file the follow-up issue once this lands.

Test plan
* ``pytest tests/test_oracle_client.py tests/test_oracle_step_reward.py``: 35 passed
* ``pytest tests/ -q -k "reward or oracle"``: 131 passed (existing 96 + new 35)
* ``ruff check .``: clean

Closes #590.
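For readers wiring up the caller's loop, the cursor-advancing poll that ``fetch_mutations`` and ``last_mutation_id`` are meant to support might look roughly like the sketch below. Everything here is illustrative: ``sketch_last_mutation_id``, ``poll_delta``, and ``fake_fetch`` are hypothetical names, the fetcher is an in-memory stand-in rather than an HTTP call, and the payload shape (a ``mutations`` list of dicts with integer ``id`` fields) is an assumption not confirmed by this PR. Only the "never raises, failures populate an ``error`` key" contract is taken from the description above.

```python
from typing import Any, Callable, Dict, List, Tuple

# A fetcher takes the current cursor and returns a payload dict.
Fetcher = Callable[[int], Dict[str, Any]]


def sketch_last_mutation_id(mutations: List[Any], default: int = 0) -> int:
    """Hypothetical helper: highest integer id in the batch, or `default`."""
    ids = [
        m["id"]
        for m in mutations
        if isinstance(m, dict) and isinstance(m.get("id"), int)
    ]
    return max(ids, default=default)


def poll_delta(fetch: Fetcher, cursor: int) -> Tuple[List[Any], int]:
    """Fetch mutations newer than `cursor`; return (delta, new_cursor)."""
    payload = fetch(cursor)
    if payload.get("error"):
        # Failure is reported in-band, so keep the cursor and emit no delta.
        return [], cursor
    delta = payload.get("mutations") or []  # assumed key name
    new_cursor = max(cursor, sketch_last_mutation_id(delta, default=cursor))
    return delta, new_cursor


# In-memory stand-in for the env's audit log (assumed record shape).
AUDIT_LOG = [
    {"id": 1, "operation": "consent_set"},
    {"id": 2, "operation": "lead_submitted"},
]


def fake_fetch(since_id: int) -> Dict[str, Any]:
    """Stand-in fetcher: return entries with an id above the cursor."""
    return {"mutations": [m for m in AUDIT_LOG if m["id"] > since_id]}


def failing_fetch(since_id: int) -> Dict[str, Any]:
    """Stand-in fetcher that reports a failure through the `error` key."""
    return {"error": "network unreachable"}


first_delta, cursor_after_first = poll_delta(fake_fetch, 0)
second_delta, cursor_after_second = poll_delta(fake_fetch, cursor_after_first)
failed_delta, cursor_after_failure = poll_delta(failing_fetch, cursor_after_second)
```

In a real loop the fetcher would presumably be a thin wrapper such as ``lambda since: fetch_mutations(url, admin_token, since_id=since)``, with the resulting delta placed into ``gym_result.info["oracle_mutations_delta"]`` before the reward's ``step()`` runs.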
🤖 Generated with Claude Code