rewards: oracle-driven per-step + terminal reward primitives (#590 part 1) by mercurialsolo · Pull Request #593 · mercurialsolo/mantis

mercurialsolo · 2026-05-22T20:06:51Z

Summary

Lands the cheap-verifier primitives that let sim-env training swap Claude-vision StepVerifier calls for server-side ground truth. This PR ships the primitives only — the LearningRunner wiring that actually replaces the vision calls is deferred to a follow-up so we can land the reward + HTTP-client + reward-fn surface in isolation, with full unit-test coverage, ahead of the larger behavior change.

The boattrader sim env + oracle harness landed in #592 (closed #588, #589). This PR is the agent-side consumer of that env's /__env__/mutations + /__env__/oracle endpoints.

What's in

`mantis_agent.sim_envs.oracle_client`

New module, a companion to gym.grading, which already wraps the terminal GET /__env__/oracle?task_id=<id> call.

fetch_mutations(url, admin_token, *, since_id=0, timeout_s=10) -> dict — pulls GET /__env__/mutations[?since=<id>] and returns the audit-log tail since the cursor. Never raises; network/HTTP/parse failures populate an error key on the returned dict.
last_mutation_id(mutations) -> int — helper to advance the cursor between polls.
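The never-raises contract and the cursor helper described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the module's code: the `X-Admin-Token` header name, the `mutations` payload key, the per-entry integer `id` field, the injectable `opener` parameter, and the `_sketch` function names are all hypothetical.

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def fetch_mutations_sketch(url, admin_token, *, since_id=0, timeout_s=10,
                           opener=urllib.request.urlopen):
    """Return the audit-log tail as a dict; on any failure, set an "error" key."""
    if not url or not admin_token:
        return {"mutations": [], "error": "missing url or admin token"}
    endpoint = url.rstrip("/") + "/__env__/mutations"
    if since_id:
        endpoint += "?" + urllib.parse.urlencode({"since": since_id})
    try:
        # Header name is an assumption; the real env may authenticate differently.
        request = urllib.request.Request(
            endpoint, headers={"X-Admin-Token": admin_token}
        )
        with opener(request, timeout=timeout_s) as response:
            payload = json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError) as exc:
        # HTTPError is a URLError; JSON and Unicode decode errors are ValueErrors.
        return {"mutations": [], "error": f"{type(exc).__name__}: {exc}"}
    if not isinstance(payload, dict) or not isinstance(payload.get("mutations"), list):
        return {"mutations": [], "error": "unexpected payload shape"}
    return payload


def last_mutation_id_sketch(mutations, default=0):
    """Highest integer id in the tail, used as the next poll's cursor."""
    ids = [m["id"] for m in mutations
           if isinstance(m, dict) and isinstance(m.get("id"), int)]
    return max(ids, default=default)
```

A caller would typically keep one cursor per episode, pass it as `since_id` on each poll, and advance it with the helper only when the returned dict has no `error` key.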

`mantis_agent.rewards.components.oracle_step_reward`

Pure function:

def oracle_step_reward(mutations_delta, expected_ops, value=0.1) -> float

Awards value per matching operation in the delta. Returns "no signal" on empty input, so it is safe to call every step regardless of whether a sim env is in the loop.
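A minimal sketch of this behavior, written from the description rather than from the merged code. It assumes each mutation is a dict carrying an `operation` string, that every matching entry (not every distinct operation) earns `value`, and that "no signal" means `0.0`; the `_sketch` name is hypothetical.

```python
def oracle_step_reward_sketch(mutations_delta, expected_ops, value=0.1):
    """Award `value` for each mutation in the delta whose operation is expected."""
    if not mutations_delta or not expected_ops:
        return 0.0  # empty input: no signal
    # Accept a single operation name or any iterable of names.
    expected = {expected_ops} if isinstance(expected_ops, str) else set(expected_ops)
    matches = 0
    for mutation in mutations_delta:
        # Skip malformed entries instead of raising.
        if not isinstance(mutation, dict):
            continue
        operation = mutation.get("operation")
        if isinstance(operation, str) and operation in expected:
            matches += 1
    return value * matches
```

Because the function depends only on its arguments, it can be unit-tested without a sim env or any network access.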

`mantis_agent.recipes.marketplace_listings.rewards.SyntheticEnvReward`

Drop-in subclass of MarketplaceListingReward:

step() — adds an oracle_step component when gym_result.info["oracle_mutations_delta"] contains a mutation whose operation matches the expected set for the current step (looked up via info["oracle_step_kind"] against expected_ops_by_step_kind).
episode() — when state.extras["oracle_terminal"] (the GradingResult.to_dict()) is present, replaces the parent's done-summary gate with the oracle's F1 score plus a pass bonus.
Stays pure — the caller's loop populates info from fetch_mutations and state.extras from grade_run. That keeps the reward fn deterministic given inputs (testable as a pure function) and lets the same fetcher feed multiple reward implementations.

Default op→step-kind table ships with mantis-boattrader operations (lead_submitted, phone_revealed, consent_set). Recipes targeting a different sim env pass their own table at construction.
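The step lookup and the terminal override might look roughly like the sketch below. Only the three operation names and the `info` / `extras` keys come from this PR; the step-kind names, the `f1` and `passed` keys read from the terminal dict, and the weight and bonus defaults are placeholders invented for illustration.

```python
# Step-kind names here are hypothetical; the operations are the boattrader ones.
HYPOTHETICAL_EXPECTED_OPS_BY_STEP_KIND = {
    "submit_lead": {"lead_submitted"},
    "reveal_phone": {"phone_revealed"},
    "set_consent": {"consent_set"},
}


def oracle_step_component(info, table=HYPOTHETICAL_EXPECTED_OPS_BY_STEP_KIND, value=0.1):
    """Per-step credit: count delta entries expected for the current step kind."""
    delta = info.get("oracle_mutations_delta") or []
    expected = table.get(info.get("oracle_step_kind"), set())
    hits = 0
    for mutation in delta:
        operation = mutation.get("operation") if isinstance(mutation, dict) else None
        if isinstance(operation, str) and operation in expected:
            hits += 1
    return value * hits


def oracle_terminal_component(extras, parent_score, f1_weight=1.0, pass_bonus=0.5):
    """Episode score: oracle F1 plus a pass bonus when a terminal result exists."""
    terminal = extras.get("oracle_terminal")
    if not terminal:
        return parent_score  # no oracle in the loop: keep the parent's score
    score = f1_weight * float(terminal.get("f1", 0.0))
    if terminal.get("passed"):
        score += pass_bonus
    return score
```

Both helpers read only from the dicts they are given, which mirrors the design choice above: the caller's loop does the I/O and the reward stays a deterministic function of its inputs.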

Pre-existing latent circular import — sidestepped, not fixed

rewards.boattrader (deprecated alias) subclasses MarketplaceListingReward at import time, so the import order matters when a test imports recipes.marketplace_listings.rewards first. The new test file imports mantis_agent.rewards ahead of the recipe to sidestep it. Worth cleaning up separately — flagging here so reviewers know the noqa isn't a workaround for new code.

Tests

tests/test_oracle_client.py (15 tests, unittest.mock.patch over urlopen): happy path + since param + every failure mode (HTTP error, network error, non-JSON, non-dict payload, missing key, empty url/token).
tests/test_oracle_step_reward.py (20 tests):
- oracle_step_reward pure-function coverage including iterable kinds, malformed entries, custom values.
- SyntheticEnvReward.step with/without delta, step-kind tagging, empty expected set, cumulative oracle_step_total.
- SyntheticEnvReward.episode terminal override, passed/failed branches, custom weight.
- Default config sanity + custom-table override.

uv run --extra server --extra dev pytest tests/test_oracle_client.py tests/test_oracle_step_reward.py
# → 35 passed
uv run --extra server --extra dev pytest tests/ -q -k 'reward or oracle'
# → 131 passed (no regressions)
ruff check .
# → clean

Follow-up (not in this PR)

Wire these primitives into gym/learning_runner.py so the existing verify_filter / verify_on_correct_page / verify_step callsites short-circuit to the oracle when a sim env is present. That change touches the runner directly and deserves its own PR — I'll file the follow-up issue once this lands.

Test plan

pytest tests/test_oracle_client.py tests/test_oracle_step_reward.py — 35 passed
pytest tests/ -q -k "reward or oracle" — 131 passed (existing 96 + new 35)
ruff check . — clean
CI green on this PR

Closes #590.

🤖 Generated with Claude Code

…rt 1)

Lands the cheap-verifier primitives that let sim-env training swap Claude-vision ``StepVerifier`` calls for server-side ground truth. This PR ships the primitives only; the ``LearningRunner`` wiring that actually replaces the vision calls is deferred to a follow-up so we can land the reward + HTTP-client + reward-fn surface in isolation, with full unit-test coverage, ahead of the larger behavior change.

* ``mantis_agent.sim_envs.oracle_client.fetch_mutations(url, token, *, since_id)``: companion to ``gym.grading.grade_run``; pulls ``GET /__env__/mutations[?since=<id>]`` and returns the audit-log tail the caller diffs against the previous step. ``last_mutation_id`` helper to advance the cursor between polls. Never raises; failures populate an ``error`` key.
* ``mantis_agent.rewards.components.oracle_step_reward(delta, expected_ops, value=0.1)``: pure function awarding ``value`` per matching mutation ``operation`` in the delta. Caps at "no signal" on empty input so it's safe to call every step regardless of whether a sim env is in the loop.
* ``mantis_agent.recipes.marketplace_listings.rewards.SyntheticEnvReward``: drop-in subclass of ``MarketplaceListingReward`` that reads ``gym_result.info["oracle_mutations_delta"]`` for per-step credit and ``state.extras["oracle_terminal"]`` for the F1 + pass bonus at episode end. The reward stays pure; the caller's loop populates the info dict from ``fetch_mutations`` + ``grade_run``.

Default op→step-kind table ships with mantis-boattrader operations (``lead_submitted``, ``phone_revealed``, ``consent_set``). Recipes targeting a different sim env pass their own table at construction.

Pre-existing latent circular import (``rewards.boattrader`` deprecated alias subclasses ``MarketplaceListingReward`` at import time) is sidestepped in the test file by importing ``mantis_agent.rewards`` before the recipe module. Worth cleaning up separately but out of scope here.

Tests: ``tests/test_oracle_client.py`` (15 tests, urllib mocked) + ``tests/test_oracle_step_reward.py`` (20 tests covering the pure function, ``SyntheticEnvReward.step``, ``.episode``, terminal-override semantics, and the default-table shape). 35 passed, all green.

Follow-up: wire these into ``gym/learning_runner.py`` so the existing ``verify_filter`` / ``verify_on_correct_page`` / ``verify_step`` callsites short-circuit when a sim env is present. That change touches the runner directly and deserves its own PR.

Closes #590.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mercurialsolo merged commit 56e9d57 into main May 22, 2026
4 checks passed

mercurialsolo merged 1 commit into main from feat/oracle-step-reward-590
