sim-envs: mantis-boattrader env + oracle harness (BT01 grader) #592

Merged: mercurialsolo merged 2 commits into main from worktree-env-boattrader on May 22, 2026

Summary

Lands the mantis-boattrader simulated env (high-fidelity boattrader.com clone, FastAPI, seed=42 → 600 boats, 25 makes, dealer cards, ad rotation, lead capture, cookie consent, configurable latency) plus the oracle harness pattern already shipped for mantis-crm (#332), mantis-shop (#334), and mantis-helpdesk (#333).

The oracle harness is what enables cheap, deterministic verification of agent runs against this env — replacing per-step Claude-vision step-verifier calls (≈$0.0003/call × dozens × thousands of episodes) with a server-side ground-truth probe.

What's in the PR

Env app (deploy/sim_envs/mantis_boattrader/)

  • app/main.py, app/db.py, app/seed.py, app/templates/*, app/static/* — the env itself (FastAPI + Jinja2; in-memory deterministic Store keyed off SEED).
  • Dockerfile + requirements.txt for container builds.
  • deploy/sim_envs/daytona_mantis_boattrader.py + _daytona_patch.py + _daytona_restart.py — Daytona deploy + iterate helpers.

Oracle harness (#588)

  • Store.mutations: list[dict] + emit_mutation(...) helper.
  • Every state-changing public route stamps an entry:
    • lead_submitted: POST /boat/<slug>/contact
    • consent_set: POST /__site/consent
    • phone_revealed: POST /boat/<slug>/show-phone
    • env_reset: POST /__env__/reset
  • GET /__env__/oracle?task_id=<id> returns {passed, score, reasons, diff, task_id}, gated on X-Env-Admin.
  • GET /__env__/mutations[?since=<id>] exposes the raw audit log.
  • GET /__env__/state extended with mutation count + last 50.
  • app/oracles/__init__.py dispatcher; new graders register a file + table entry.
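
To make the audit-log idea concrete, here is a minimal sketch of a mutation log with a `since` filter. It is not the code in app/db.py: the entry fields (`id`, `kind`, `payload`), the `mutations_since` helper, and the choice to make `emit_mutation` a method are all assumptions for illustration. Only `Store.mutations: list[dict]` and the mutation names come from this PR.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Store:
    """Hypothetical in-memory store; only the audit-log part is sketched."""

    mutations: list[dict[str, Any]] = field(default_factory=list)

    def emit_mutation(self, kind: str, **payload: Any) -> dict[str, Any]:
        # Sequential ids (assumed) let a client poll with ?since=<id>.
        # No wall-clock timestamp is stored, so replays stay deterministic.
        entry = {"id": len(self.mutations) + 1, "kind": kind, "payload": payload}
        self.mutations.append(entry)
        return entry

    def mutations_since(self, since: int = 0) -> list[dict[str, Any]]:
        # Return entries whose id is strictly greater than `since`.
        return [m for m in self.mutations if m["id"] > since]


store = Store()
store.emit_mutation("consent_set", choice="accept")
store.emit_mutation(
    "lead_submitted", slug="example-boat", name="Ada", email="ada@example.com"
)
recent = store.mutations_since(1)  # only the lead submission
```

Keeping the log append-only with monotonically increasing ids is what would let a grader reconstruct the agent's actions in order.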

First concrete grader (#589) — BT01_lead_capture_filtered_search

The agent filters /boats/ to condition=used + make=Sea Ray + price_max=200000 (the seed=42 catalog has 9 matching boats), clicks through to a matching detail page, and submits the dealer contact form with a non-empty name and email.

Grader (app/oracles/bt01_lead_capture_filtered_search.py): F1 over hits (leads on qualifying boats) and misses (leads on non-qualifying boats or with malformed payloads). Pass = ≥1 hit, 0 misses.
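
The scoring rule can be sketched roughly as follows. This is an illustration, not the shipped grader: the slugs in `QUALIFYING_SLUGS` are placeholders, the mutation entry layout is assumed, and the recall definition (1.0 once any qualifying lead exists, since the task asks for a single lead) is a guess at how F1 is applied here. The contents of `diff` are likewise assumed.

```python
from typing import Any

TASK_ID = "BT01_lead_capture_filtered_search"

# Placeholder slugs standing in for the boats that satisfy the BT01 filter.
QUALIFYING_SLUGS = {"sea-ray-a", "sea-ray-b"}


def grade(mutations: list[dict[str, Any]]) -> dict[str, Any]:
    hits = 0
    misses = 0
    reasons: list[str] = []
    for m in mutations:
        if m.get("kind") != "lead_submitted":
            continue
        p = m.get("payload", {})
        # A lead is well-formed when name and email are both non-empty.
        well_formed = bool(str(p.get("name", "")).strip()) and bool(
            str(p.get("email", "")).strip()
        )
        if well_formed and p.get("slug") in QUALIFYING_SLUGS:
            hits += 1
        else:
            misses += 1
            reasons.append(f"miss: lead on {p.get('slug')!r}")
    if hits == 0:
        reasons.append("no lead on a qualifying boat")

    total = hits + misses
    precision = hits / total if total else 0.0
    recall = 1.0 if hits else 0.0  # assumption: one qualifying lead is enough
    denom = precision + recall
    score = 2 * precision * recall / denom if denom else 0.0
    return {
        "passed": hits >= 1 and misses == 0,
        "score": score,
        "reasons": reasons,
        "diff": {"hits": hits, "misses": misses},
        "task_id": TASK_ID,
    }
```

Under these assumptions a freshly seeded store fails (no hits), a single qualifying lead scores 1.0 and passes, and any collateral lead lowers precision and fails the run even when a hit is present.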

Agent side

No changes needed — src/mantis_agent/gym/grading.py:grade_run is already generic over task_id and hits GET /__env__/oracle?task_id=<id> with the admin token.
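
For readers unfamiliar with the client side, a call to the oracle might look like the sketch below. It is not the real `grade_run`: the function names, the base URL, and the idea that the admin token is sent verbatim as the `X-Env-Admin` header value are assumptions. Only the endpoint path, the `task_id` query parameter, and the header name come from this PR.

```python
import json
import urllib.parse
import urllib.request
from typing import Any


def build_oracle_request(
    base_url: str, task_id: str, admin_token: str
) -> urllib.request.Request:
    # Build GET /__env__/oracle?task_id=<id> with the admin header attached.
    query = urllib.parse.urlencode({"task_id": task_id})
    url = f"{base_url.rstrip('/')}/__env__/oracle?{query}"
    return urllib.request.Request(
        url, headers={"X-Env-Admin": admin_token}, method="GET"
    )


def fetch_verdict(req: urllib.request.Request, timeout: float = 10.0) -> dict[str, Any]:
    # Expected shape per the PR: {passed, score, reasons, diff, task_id}.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


req = build_oracle_request(
    "http://localhost:8000",  # hypothetical local env address
    "BT01_lead_capture_filtered_search",
    "secret-token",  # hypothetical admin token
)
```

Because the request depends only on `task_id`, the same client would work unchanged for any grader registered later.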

Tests

tests/sim_envs/mantis_boattrader/ — 21 tests, all green:

  • test_app_smoke.py (14): harness gating, oracle endpoint shape, mutation emission from each mutating route, reset boundary, end-to-end BT01 round-trip (/boats/?… → /boat/<slug>/contact → /__env__/oracle returns passed=true).
  • test_oracle_determinism.py (7): dispatcher round-trip, BT01 fail-on-seed / pass-on-qualifying / fail-on-collateral / reject-malformed, determinism parametrize.

Full tests/sim_envs/ regression: 194 passed (no regression in CRM/shop/helpdesk).

uv sync --extra dev --extra server
uv pip install jinja2
uv run pytest tests/sim_envs/mantis_boattrader/ -q
# → 21 passed

Test plan

  • pytest tests/sim_envs/mantis_boattrader/ passes locally (21 tests).
  • pytest tests/sim_envs/ regression passes (194 tests, no breakage in sibling envs).
  • CI green on this PR (CI installs --extra dev --extra server --extra orchestrator --extra metrics — smoke tests requiring jinja2 will skip cleanly, determinism tests run).
  • After merge: daytona_mantis_boattrader.py deploy round-trip + curl /__env__/oracle?task_id=BT01_lead_capture_filtered_search against the live preview.

Follow-up

#590 — per-step oracle reward in gym/training (replace StepVerifier.verify_filter for sim-env runs). Depends on this PR merging.

Tracked under epic #331.

Closes #588.
Closes #589.

🤖 Generated with Claude Code

mercurialsolo and others added 2 commits May 22, 2026 12:17
High-fidelity boattrader.com clone served by FastAPI with deterministic
catalog (seed=42 → 600 boats, 25 makes, dealer cards, ads, lead capture)
plus the oracle harness pattern already proven for mantis-crm / shop /
helpdesk:

* In-memory ``Store.mutations`` audit log; every state-changing public
  route (consent, lead submission, phone reveal, env reset) stamps an
  entry so graders can reconstruct what the agent did.
* ``GET /__env__/oracle?task_id=<id>`` returns the canonical
  ``{passed, score, reasons, diff, task_id}`` shape, gated on
  ``X-Env-Admin``. ``GET /__env__/mutations[?since=<id>]`` exposes the
  raw audit log.
* ``app/oracles/__init__.py`` dispatcher; new graders register a file +
  table entry. Determinism contract enforced by
  ``tests/sim_envs/mantis_boattrader/test_oracle_determinism.py``.

First concrete grader: ``BT01_lead_capture_filtered_search``. The
agent filters listings to ``condition=used`` + ``make=Sea Ray`` +
``price_max=200000``, clicks a matching boat, submits the contact form.
Oracle scores F1 over (hit leads on qualifying boats, miss leads on
non-qualifying / malformed payloads); pass = ≥1 hit, 0 misses.

The agent-side oracle client (``src/mantis_agent/gym/grading.py``) is
already generic over task_id and needs no changes.

Closes #588.
Closes #589.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ``app/seed.py``: ``listing_type`` was read before its assignment in
  the POA-price branch (was previously surviving on stale loop-scope
  state). Move the listing-type rng draw above the POA branch so the
  read is well-defined on iteration 0. Drop unused ``timedelta`` /
  ``timezone`` imports.
* ``app/db.py``: drop unused ``Iterable`` import.
* ``tests/.../test_oracle_determinism.py``: drop unused
  ``seed as seed_mod`` import in the ``seeded_store`` fixture.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>