sim-envs: mantis-boattrader env + oracle harness (BT01 grader)#592
Merged
High-fidelity boattrader.com clone served by FastAPI with deterministic
catalog (seed=42 → 600 boats, 25 makes, dealer cards, ads, lead capture)
plus the oracle harness pattern already proven for mantis-crm / shop /
helpdesk:
* In-memory ``Store.mutations`` audit log; every state-changing public
route (consent, lead submission, phone reveal, env reset) stamps an
entry so graders can reconstruct what the agent did.
* ``GET /__env__/oracle?task_id=<id>`` returns the canonical
``{passed, score, reasons, diff, task_id}`` shape, gated on
``X-Env-Admin``. ``GET /__env__/mutations[?since=<id>]`` exposes the
raw audit log.
* ``app/oracles/__init__.py`` dispatcher; new graders register a file +
table entry. Determinism contract enforced by
``tests/sim_envs/mantis_boattrader/test_oracle_determinism.py``.
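
The three harness pieces above (audit log, ``?since=<id>`` filtering, dispatcher table) can be sketched roughly as follows. This is a minimal illustration, not the code in ``app/db.py`` or ``app/oracles/__init__.py``: the entry fields (``id``, ``kind``, ``payload``), the ``mutations_since`` and ``run_oracle`` helpers, and the ``DEMO_any_lead`` toy grader are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Store:
    """In-memory state; `mutations` is the append-only audit log."""

    mutations: list[dict[str, Any]] = field(default_factory=list)

    def emit_mutation(self, kind: str, **payload: Any) -> dict[str, Any]:
        # Assumption: ids are a simple increasing counter. Avoiding
        # timestamps and UUIDs keeps the log identical across runs.
        entry = {"id": len(self.mutations) + 1, "kind": kind, "payload": payload}
        self.mutations.append(entry)
        return entry

    def mutations_since(self, since: int = 0) -> list[dict[str, Any]]:
        # Mirrors the `?since=<id>` query parameter: entries with id > since.
        return [m for m in self.mutations if m["id"] > since]


Oracle = Callable[[Store], dict[str, Any]]


def any_lead_submitted(store: Store) -> dict[str, Any]:
    """Toy grader (hypothetical): passes once any lead has been submitted."""
    leads = [m for m in store.mutations if m["kind"] == "lead_submitted"]
    passed = bool(leads)
    return {
        "passed": passed,
        "score": 1.0 if passed else 0.0,
        "reasons": [] if passed else ["no lead_submitted mutation found"],
        "diff": {"lead_ids": [m["id"] for m in leads]},
    }


# Dispatcher table: registering a grader means adding one entry here.
ORACLES: dict[str, Oracle] = {"DEMO_any_lead": any_lead_submitted}


def run_oracle(store: Store, task_id: str) -> dict[str, Any]:
    """Look up the grader for `task_id` and return the canonical verdict shape."""
    grader = ORACLES.get(task_id)
    if grader is None:
        return {
            "passed": False,
            "score": 0.0,
            "reasons": [f"unknown task_id: {task_id}"],
            "diff": {},
            "task_id": task_id,
        }
    return {**grader(store), "task_id": task_id}
```

Because a grader in this sketch is a pure function of the store, calling ``run_oracle`` twice on the same state returns equal verdicts, which is the property a determinism test would check.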
First concrete grader: ``BT01_lead_capture_filtered_search``. The
agent filters listings to ``condition=used`` + ``make=Sea Ray`` +
``price_max=200000``, clicks a matching boat, submits the contact form.
Oracle scores F1 over (hit leads on qualifying boats, miss leads on
non-qualifying / malformed payloads); pass = ≥1 hit, 0 misses.
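
A rough sketch of how such a grader could score leads is shown below. The commit message does not spell out the F1 formulation, so this example assumes precision = hits / all leads and recall = 1 once at least one hit exists (a single qualifying lead satisfies the task). The record fields (``slug``, ``condition``, ``make``, ``price``, ``name``, ``email``), the treatment of a missing price as non-qualifying, and the function names are likewise assumptions, not the contents of the real grader.

```python
from typing import Any, Optional


def qualifies(boat: dict[str, Any]) -> bool:
    """True if the boat matches the BT01 filter (used, Sea Ray, price <= 200000)."""
    price: Optional[int] = boat.get("price")  # assumption: None = no listed price
    return (
        boat.get("condition") == "used"
        and boat.get("make") == "Sea Ray"
        and price is not None
        and price <= 200_000
    )


def well_formed(lead: dict[str, Any]) -> bool:
    """A lead payload needs a non-empty name and email."""
    name = str(lead.get("name") or "").strip()
    email = str(lead.get("email") or "").strip()
    return bool(name) and bool(email)


def grade_bt01(catalog: dict[str, dict[str, Any]], leads: list[dict[str, Any]]) -> dict[str, Any]:
    hits: list[Any] = []
    misses: list[Any] = []
    for lead in leads:
        boat = catalog.get(lead.get("slug"))
        ok = boat is not None and qualifies(boat) and well_formed(lead)
        (hits if ok else misses).append(lead.get("slug"))

    # Assumed F1 formulation; see the lead-in above.
    precision = len(hits) / len(leads) if leads else 0.0
    recall = 1.0 if hits else 0.0
    denom = precision + recall
    score = 2 * precision * recall / denom if denom else 0.0

    reasons = []
    if not hits:
        reasons.append("no lead on a qualifying boat")
    if misses:
        reasons.append(f"{len(misses)} lead(s) on non-qualifying boats or with malformed payloads")

    return {
        "passed": bool(hits) and not misses,  # pass = >=1 hit, 0 misses
        "score": score,
        "reasons": reasons,
        "diff": {"hits": hits, "misses": misses},
    }
```

Under these assumptions a single clean lead scores 1.0, while one hit plus one collateral lead scores 2/3 and fails, since any miss blocks the pass.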
The agent-side oracle client (``src/mantis_agent/gym/grading.py``) is
already generic over task_id and needs no changes.
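
For orientation, a client for the oracle endpoint might look roughly like the sketch below. It is not the code in ``grading.py``; the function names, the standard-library HTTP client, and the idea that the admin token is sent as the value of the ``X-Env-Admin`` header are assumptions for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Keys of the canonical verdict shape returned by the oracle endpoint.
VERDICT_KEYS = {"passed", "score", "reasons", "diff", "task_id"}


def build_oracle_request(base_url: str, task_id: str, admin_token: str) -> Request:
    """Build (but do not send) the GET request for a task's verdict."""
    query = urlencode({"task_id": task_id})
    url = f"{base_url.rstrip('/')}/__env__/oracle?{query}"
    # Assumption: the token travels as the X-Env-Admin header value.
    return Request(url, headers={"X-Env-Admin": admin_token}, method="GET")


def parse_verdict(body: bytes) -> dict:
    """Decode the response body and check it has the canonical keys."""
    verdict = json.loads(body)
    missing = VERDICT_KEYS - verdict.keys()
    if missing:
        raise ValueError(f"oracle response missing keys: {sorted(missing)}")
    return verdict


def fetch_verdict(base_url: str, task_id: str, admin_token: str, timeout: float = 10.0) -> dict:
    """Send the request to a running env and return the parsed verdict."""
    request = build_oracle_request(base_url, task_id, admin_token)
    with urlopen(request, timeout=timeout) as response:
        return parse_verdict(response.read())
```

Nothing in the client depends on which grader runs server-side: only ``task_id`` changes, which is why a new grader needs no agent-side work.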
Closes #588.
Closes #589.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ``app/seed.py``: ``listing_type`` was read before its assignment in the
  POA-price branch (was previously surviving on stale loop-scope state).
  Move the listing-type rng draw above the POA branch so the read is
  well-defined on iteration 0. Drop unused ``timedelta`` / ``timezone``
  imports.
* ``app/db.py``: drop unused ``Iterable`` import.
* ``tests/.../test_oracle_determinism.py``: drop unused ``seed as seed_mod``
  import in the ``seeded_store`` fixture.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Lands the mantis-boattrader simulated env (high-fidelity boattrader.com clone, FastAPI, seed=42 → 600 boats, 25 makes, dealer cards, ad rotation, lead capture, cookie consent, configurable latency) plus the oracle harness pattern already shipped for mantis-crm (#332), mantis-shop (#334), and mantis-helpdesk (#333).
The oracle harness is what enables cheap, deterministic verification of agent runs against this env, replacing per-step Claude-vision step-verifier calls (≈$0.0003/call × dozens × thousands of episodes) with a server-side ground-truth probe.
What's in the PR
Env app (``deploy/sim_envs/mantis_boattrader/``)
* ``app/main.py``, ``app/db.py``, ``app/seed.py``, ``app/templates/*``, ``app/static/*``: the env itself (FastAPI + Jinja2; in-memory deterministic ``Store`` keyed off ``SEED``).
* ``Dockerfile`` + ``requirements.txt`` for container builds.
* ``deploy/sim_envs/daytona_mantis_boattrader.py`` + ``_daytona_patch.py`` + ``_daytona_restart.py``: Daytona deploy + iterate helpers.
Oracle harness (#588)
* ``Store.mutations: list[dict]`` + ``emit_mutation(...)`` helper.
* ``lead_submitted`` ← ``POST /boat/<slug>/contact``
* ``consent_set`` ← ``POST /__site/consent``
* ``phone_revealed`` ← ``POST /boat/<slug>/show-phone``
* ``env_reset`` ← ``POST /__env__/reset``
* ``GET /__env__/oracle?task_id=<id>`` returns ``{passed, score, reasons, diff, task_id}``, gated on ``X-Env-Admin``.
* ``GET /__env__/mutations[?since=<id>]`` exposes the raw audit log.
* ``GET /__env__/state`` extended with mutation count + last 50.
* ``app/oracles/__init__.py`` dispatcher; new graders register a file + table entry.
First concrete grader (#589): ``BT01_lead_capture_filtered_search``
The agent filters ``/boats/`` to ``condition=used`` + ``make=Sea Ray`` + ``price_max=200000`` (seed=42 catalog has 9 matching boats), clicks a matching detail page, submits the dealer contact form with a non-empty name + email.
Grader (``app/oracles/bt01_lead_capture_filtered_search.py``): F1 over (hit leads on qualifying boats, miss leads on non-qualifying / malformed-payload). Pass = ≥1 hit, 0 misses.
Agent side
No changes needed: ``src/mantis_agent/gym/grading.py:grade_run`` is already generic over ``task_id`` and hits ``GET /__env__/oracle?task_id=<id>`` with the admin token.
Tests
``tests/sim_envs/mantis_boattrader/`` (21 tests, all green):
* ``test_app_smoke.py`` (14): harness gating, oracle endpoint shape, mutation emission from each mutating route, reset boundary, end-to-end BT01 round-trip (``/boats/?…`` → ``/boat/<slug>/contact`` → ``/__env__/oracle`` returns passed=true).
* ``test_oracle_determinism.py`` (7): dispatcher round-trip, BT01 fail-on-seed / pass-on-qualifying / fail-on-collateral / reject-malformed, determinism parametrize.
Full ``tests/sim_envs/`` regression: 194 passed (no regression in CRM/shop/helpdesk).
Test plan
* ``pytest tests/sim_envs/mantis_boattrader/`` passes locally (21 tests).
* ``pytest tests/sim_envs/`` regression passes (194 tests, no breakage in sibling envs).
* ``--extra dev --extra server --extra orchestrator --extra metrics``: smoke tests requiring jinja2 will skip cleanly, determinism tests run).
* ``daytona_mantis_boattrader.py`` deploy round-trip + ``curl /__env__/oracle?task_id=BT01_lead_capture_filtered_search`` against the live preview.
Follow-up
#590: per-step oracle reward in gym/training (replace ``StepVerifier.verify_filter`` for sim-env runs). Depends on this PR merging.
Tracked under epic #331.
Closes #588.
Closes #589.
🤖 Generated with Claude Code