rewards: oracle-driven per-step + terminal reward primitives (#590 part 1) #593

Merged: mercurialsolo merged 1 commit into main from feat/oracle-step-reward-590 on May 22, 2026
@mercurialsolo
Owner

Summary

Lands the cheap-verifier primitives that let sim-env training swap Claude-vision StepVerifier calls for server-side ground truth. This PR ships the primitives only. The LearningRunner wiring that actually replaces the vision calls is deferred to a follow-up, so the reward component, HTTP client, and reward-fn surface can land in isolation, with full unit-test coverage, ahead of the larger behavior change.

The boattrader sim env + oracle harness landed in #592 (closed #588, #589). This PR is the agent-side consumer of that env's /__env__/mutations + /__env__/oracle endpoints.

What's in

mantis_agent.sim_envs.oracle_client

New module: a companion to gym.grading, which already wraps the terminal GET /__env__/oracle?task_id=<id> call.

  • fetch_mutations(url, admin_token, *, since_id=0, timeout_s=10) -> dict — pulls GET /__env__/mutations[?since=<id>] and returns the audit-log tail since the cursor. Never raises; network/HTTP/parse failures populate an error key on the returned dict.
  • last_mutation_id(mutations) -> int — helper to advance the cursor between polls.
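
For reviewers who want the contract in concrete terms, here is a minimal sketch of a never-raising fetcher and cursor helper. It is not the module's actual code. The function names, parameters, endpoint path, and `error` key come from the description above; the `X-Admin-Token` header name, the `mutations` payload key, the per-entry integer `id` field, and `last_mutation_id` taking a list are all assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request


def fetch_mutations(url, admin_token, *, since_id=0, timeout_s=10):
    """Return the audit-log tail as a dict. Never raises; failures set "error"."""
    if not url or not admin_token:
        return {"mutations": [], "error": "missing url or admin token"}

    endpoint = url.rstrip("/") + "/__env__/mutations"
    if since_id:
        endpoint += "?" + urllib.parse.urlencode({"since": since_id})

    # ASSUMPTION: the real auth header name is not stated in the PR.
    request = urllib.request.Request(endpoint, headers={"X-Admin-Token": admin_token})
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            payload = json.loads(response.read().decode("utf-8"))
    except Exception as exc:  # deliberately broad: the contract is "never raises"
        return {"mutations": [], "error": f"{type(exc).__name__}: {exc}"}

    # ASSUMPTION: a well-formed payload is a dict with a "mutations" list.
    if not isinstance(payload, dict) or not isinstance(payload.get("mutations"), list):
        return {"mutations": [], "error": "unexpected payload shape"}
    return payload


def last_mutation_id(mutations):
    """Return the highest integer id in the list, or 0 if there is none."""
    ids = [m.get("id") for m in mutations if isinstance(m, dict)]
    return max((i for i in ids if isinstance(i, int)), default=0)
```

A polling loop under these assumptions would call `fetch_mutations(..., since_id=cursor)` each step, skip the step's oracle credit if `"error"` is present, and otherwise advance `cursor` with `last_mutation_id`.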

mantis_agent.rewards.components.oracle_step_reward

Pure function:

def oracle_step_reward(mutations_delta, expected_ops, value=0.1) -> float

Awards value per matching operation in the delta. Falls back to "no signal" on empty input, so it's safe to call every step regardless of whether a sim env is in the loop.
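
As a rough illustration of those semantics, the sketch below follows the signature above. It assumes that "no signal" means a reward of 0.0, that each mutation is a dict carrying an `operation` string, and that malformed entries are skipped silently. None of those details is confirmed by this description.

```python
from __future__ import annotations

from typing import Any, Iterable


def oracle_step_reward(
    mutations_delta: Iterable[Any] | None,
    expected_ops: Iterable[str] | None,
    value: float = 0.1,
) -> float:
    """Award `value` for each mutation whose operation is in `expected_ops`."""
    expected = set(expected_ops or ())
    if not mutations_delta or not expected:
        return 0.0  # ASSUMPTION: "no signal" is modelled as zero reward

    matches = 0
    for entry in mutations_delta:
        # Skip malformed entries instead of raising, so the call stays safe.
        if not isinstance(entry, dict):
            continue
        operation = entry.get("operation")
        if isinstance(operation, str) and operation in expected:
            matches += 1
    return value * matches
```

Because the function only reads its arguments, the same delta and expected set always produce the same reward, which is what makes it straightforward to unit-test.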

mantis_agent.recipes.marketplace_listings.rewards.SyntheticEnvReward

Drop-in subclass of MarketplaceListingReward:

  • step() — adds an oracle_step component when gym_result.info["oracle_mutations_delta"] contains a mutation whose operation matches the expected set for the current step (looked up via info["oracle_step_kind"] against expected_ops_by_step_kind).
  • episode() — when state.extras["oracle_terminal"] (the GradingResult.to_dict()) is present, replaces the parent's done-summary gate with the oracle's F1 score plus a pass bonus.
  • Stays pure — the caller's loop populates info from fetch_mutations and state.extras from grade_run. That keeps the reward fn deterministic given inputs (testable as a pure function) and lets the same fetcher feed multiple reward implementations.

Default op→step-kind table ships with mantis-boattrader operations (lead_submitted, phone_revealed, consent_set). Recipes targeting a different sim env pass their own table at construction.
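
To make the data flow concrete, here is a sketch of the two lookups as free functions rather than as the real subclass. The `info` and `extras` keys and the operation names come from the description above. The step-kind names, the `f1` and `passed` keys of the terminal dict, the `pass_bonus` parameter, and the fallback to a parent score are hypothetical.

```python
from __future__ import annotations

from typing import Any, Mapping

# Operation names are from the PR text; the step-kind keys are hypothetical.
EXPECTED_OPS_BY_STEP_KIND: dict[str, frozenset[str]] = {
    "submit_lead": frozenset({"lead_submitted"}),
    "reveal_phone": frozenset({"phone_revealed"}),
    "set_consent": frozenset({"consent_set"}),
}


def oracle_step_component(
    info: Mapping[str, Any],
    table: Mapping[str, frozenset[str]] = EXPECTED_OPS_BY_STEP_KIND,
    value: float = 0.1,
) -> float:
    """Per-step credit: count delta mutations expected for this step kind."""
    kind = info.get("oracle_step_kind")
    expected = table.get(kind, frozenset()) if isinstance(kind, str) else frozenset()
    delta = info.get("oracle_mutations_delta") or []
    hits = sum(
        1
        for m in delta
        if isinstance(m, dict)
        and isinstance(m.get("operation"), str)
        and m["operation"] in expected
    )
    return value * hits


def oracle_terminal_component(
    extras: Mapping[str, Any],
    parent_score: float,
    pass_bonus: float = 1.0,
) -> float:
    """Terminal score: oracle F1 plus a pass bonus, else the parent's score."""
    terminal = extras.get("oracle_terminal")
    if not isinstance(terminal, dict):
        return parent_score  # no oracle result: keep the parent's behaviour
    # ASSUMPTION: the grading dict exposes "f1" and "passed".
    score = float(terminal.get("f1", 0.0))
    if terminal.get("passed"):
        score += pass_bonus
    return score
```

Keeping both functions free of I/O mirrors the design choice described above: the caller's loop does the fetching and grading, and the reward only reads what it is handed.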

Pre-existing latent circular import — sidestepped, not fixed

rewards.boattrader (deprecated alias) subclasses MarketplaceListingReward at import time, so the import order matters when a test imports recipes.marketplace_listings.rewards first. The new test file imports mantis_agent.rewards ahead of the recipe to sidestep it. Worth cleaning up separately — flagging here so reviewers know the noqa isn't a workaround for new code.

Tests

  • tests/test_oracle_client.py (15 tests, unittest.mock.patch over urlopen): happy path + since param + every failure mode (HTTP error, network error, non-JSON, non-dict payload, missing key, empty url/token).
  • tests/test_oracle_step_reward.py (20 tests):
    • oracle_step_reward pure-function coverage including iterable kinds, malformed entries, custom values.
    • SyntheticEnvReward.step with/without delta, step-kind tagging, empty expected set, cumulative oracle_step_total.
    • SyntheticEnvReward.episode terminal override, passed/failed branches, custom weight.
    • Default config sanity + custom-table override.
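
The failure-mode tests rely on patching `urlopen`, a pattern that can be sketched as follows. `fetch_json` is a hypothetical stand-in for the client under test, not the real `fetch_mutations`, and the test names are illustrative only.

```python
import json
import unittest
import urllib.error
import urllib.request
from unittest.mock import patch


def fetch_json(url, timeout_s=10):
    """Hypothetical stand-in for the client: returns a dict, never raises."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return json.loads(response.read())
    except (OSError, ValueError) as exc:  # URLError/HTTPError are OSError subclasses
        return {"error": str(exc)}


class FetchFailureModes(unittest.TestCase):
    def test_network_error_becomes_error_key(self):
        # side_effect makes the patched urlopen raise instead of returning.
        with patch("urllib.request.urlopen", side_effect=urllib.error.URLError("down")):
            self.assertIn("error", fetch_json("http://sim.invalid"))

    def test_non_json_body_becomes_error_key(self):
        with patch("urllib.request.urlopen") as mocked:
            # The mock acts as a context manager whose read() yields HTML.
            mocked.return_value.__enter__.return_value.read.return_value = b"<html>"
            self.assertIn("error", fetch_json("http://sim.invalid"))
```

Patching at `urllib.request.urlopen` works here because the stand-in looks the function up on the module at call time. A client that does `from urllib.request import urlopen` would need the patch target to be the client module's own name instead.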
uv run --extra server --extra dev pytest tests/test_oracle_client.py tests/test_oracle_step_reward.py
# → 35 passed
uv run --extra server --extra dev pytest tests/ -q -k 'reward or oracle'
# → 131 passed (no regressions)
ruff check .
# → clean

Follow-up (not in this PR)

Wire these primitives into gym/learning_runner.py so the existing verify_filter / verify_on_correct_page / verify_step callsites short-circuit to the oracle when a sim env is present. That change touches the runner directly and deserves its own PR — I'll file the follow-up issue once this lands.

Test plan

  • pytest tests/test_oracle_client.py tests/test_oracle_step_reward.py — 35 passed
  • pytest tests/ -q -k "reward or oracle" — 131 passed (existing 96 + new 35)
  • ruff check . — clean
  • CI green on this PR

Closes #590.

🤖 Generated with Claude Code

…rt 1)

Lands the cheap-verifier primitives that let sim-env training swap
Claude-vision ``StepVerifier`` calls for server-side ground truth.
This PR ships the primitives only — the ``LearningRunner`` wiring that
actually replaces the vision calls is deferred to a follow-up so we can
land the reward + HTTP-client + reward-fn surface in isolation, with
full unit-test coverage, ahead of the larger behavior change.

* ``mantis_agent.sim_envs.oracle_client.fetch_mutations(url, token, *, since_id)``
  — companion to ``gym.grading.grade_run``; pulls
  ``GET /__env__/mutations[?since=<id>]`` and returns the audit-log tail
  the caller diffs against the previous step. ``last_mutation_id``
  helper to advance the cursor between polls. Never raises; failures
  populate an ``error`` key.

* ``mantis_agent.rewards.components.oracle_step_reward(delta, expected_ops, value=0.1)``
  — pure function awarding ``value`` per matching mutation
  ``operation`` in the delta. Caps at "no signal" on empty input so
  it's safe to call every step regardless of whether a sim env is in
  the loop.

* ``mantis_agent.recipes.marketplace_listings.rewards.SyntheticEnvReward``
  — drop-in subclass of ``MarketplaceListingReward`` that reads
  ``gym_result.info["oracle_mutations_delta"]`` for per-step credit
  and ``state.extras["oracle_terminal"]`` for the F1 + pass bonus at
  episode end. The reward stays pure — the caller's loop populates
  the info dict from ``fetch_mutations`` + ``grade_run``.

  Default op→step-kind table ships with mantis-boattrader operations
  (``lead_submitted``, ``phone_revealed``, ``consent_set``). Recipes
  targeting a different sim env pass their own table at construction.

Pre-existing latent circular import (``rewards.boattrader`` deprecated
alias subclasses ``MarketplaceListingReward`` at import time) is sidestepped in the
test file by importing ``mantis_agent.rewards`` before the recipe
module. Worth cleaning up separately but out of scope here.

Tests: ``tests/test_oracle_client.py`` (15 tests, urllib mocked) +
``tests/test_oracle_step_reward.py`` (20 tests covering the pure
function, ``SyntheticEnvReward.step``, ``.episode``, terminal-override
semantics, and the default-table shape). 35 passed, all green.

Follow-up: wire these into ``gym/learning_runner.py`` so the existing
``verify_filter`` / ``verify_on_correct_page`` / ``verify_step``
callsites short-circuit when a sim env is present. That change touches
the runner directly and deserves its own PR.

Closes #590.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>