sim-envs: mantis-boattrader env + oracle harness (BT01 grader) by mercurialsolo · Pull Request #592 · mercurialsolo/mantis

mercurialsolo · 2026-05-22T19:18:06Z

Summary

Lands the mantis-boattrader simulated env (high-fidelity boattrader.com clone, FastAPI, seed=42 → 600 boats, 25 makes, dealer cards, ad rotation, lead capture, cookie consent, configurable latency) plus the oracle harness pattern already shipped for mantis-crm (#332), mantis-shop (#334), and mantis-helpdesk (#333).

The oracle harness is what enables cheap, deterministic verification of agent runs against this env — replacing per-step Claude-vision step-verifier calls (≈$0.0003/call × dozens × thousands of episodes) with a server-side ground-truth probe.

What's in the PR

Env app (`deploy/sim_envs/mantis_boattrader/`)

- `app/main.py`, `app/db.py`, `app/seed.py`, `app/templates/*`, `app/static/*`: the env itself (FastAPI + Jinja2; in-memory deterministic Store keyed off SEED).
- `Dockerfile` + `requirements.txt` for container builds.
- `deploy/sim_envs/daytona_mantis_boattrader.py` + `_daytona_patch.py` + `_daytona_restart.py`: Daytona deploy + iterate helpers.
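
For reviewers unfamiliar with the pattern, here is a minimal sketch of what "deterministic Store keyed off SEED" can look like. It is not the code in `app/seed.py`: the `Boat` fields, the `build_catalog` name, the three-item make list and the price range are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass

# Hypothetical stand-in: the real env ships 25 makes.
MAKES = ["Sea Ray", "Boston Whaler", "Bayliner"]


@dataclass(frozen=True)
class Boat:
    slug: str
    make: str
    condition: str
    price: int


def build_catalog(seed: int = 42, count: int = 600) -> list[Boat]:
    """Return the same catalog for the same seed."""
    # A private generator keeps the catalog independent of any other code
    # that touches the module-level random state.
    rng = random.Random(seed)
    boats = []
    for i in range(count):
        make = rng.choice(MAKES)
        condition = rng.choice(["new", "used"])
        price = rng.randrange(10_000, 500_000, 500)
        slug = f"{make.lower().replace(' ', '-')}-{i:04d}"
        boats.append(Boat(slug, make, condition, price))
    return boats
```

Because every draw comes from one seeded `random.Random` instance in a fixed order, two builds with the same seed compare equal, which is the property the graders rely on.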

Oracle harness (#588)

- `Store.mutations: list[dict]` + `emit_mutation(...)` helper.
- Every state-changing public route stamps an entry:
  - `lead_submitted` ← `POST /boat/<slug>/contact`
  - `consent_set` ← `POST /__site/consent`
  - `phone_revealed` ← `POST /boat/<slug>/show-phone`
  - `env_reset` ← `POST /__env__/reset`
- `GET /__env__/oracle?task_id=<id>` returns `{passed, score, reasons, diff, task_id}`, gated on `X-Env-Admin`.
- `GET /__env__/mutations[?since=<id>]` exposes the raw audit log.
- `GET /__env__/state` extended with mutation count + last 50.
- `app/oracles/__init__.py` dispatcher; new graders register a file + table entry.
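
A rough sketch of the audit-log idea, for orientation only. The entry layout (`id`, `kind`, `payload`), the `mutations_since` helper, and the choice to make `emit_mutation` a method are assumptions; the PR only states that `Store.mutations` is a `list[dict]` and that an `emit_mutation(...)` helper exists.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Store:
    # Append-only audit log; plain dicts so entries serialize straight to JSON.
    mutations: list[dict[str, Any]] = field(default_factory=list)

    def emit_mutation(self, kind: str, **payload: Any) -> dict[str, Any]:
        """Append one entry with a monotonically increasing id (assumed layout)."""
        entry = {"id": len(self.mutations) + 1, "kind": kind, "payload": payload}
        self.mutations.append(entry)
        return entry

    def mutations_since(self, since: int = 0) -> list[dict[str, Any]]:
        """Entries with id > since; assumes ?since=<id> is exclusive."""
        return [m for m in self.mutations if m["id"] > since]
```

A route handler would call something like `store.emit_mutation("lead_submitted", slug=slug, name=name, email=email)` after it has applied the change, so a grader can replay the log without inspecting page state.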

First concrete grader (#589) — `BT01_lead_capture_filtered_search`

The agent filters /boats/ to condition=used + make=Sea Ray + price_max=200000 (the seed=42 catalog has 9 matching boats), clicks through to a matching detail page, and submits the dealer contact form with a non-empty name and email.

Grader (app/oracles/bt01_lead_capture_filtered_search.py): the score is F1 over hits (leads on qualifying boats) and misses (leads on non-qualifying boats or with malformed payloads). Pass = ≥1 hit and 0 misses.
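
The hit/miss rule above can be sketched as follows. This is an illustration, not the shipped grader: the lead and boat dict layouts, the inclusive price bound, the contents of `diff`, and in particular the recall definition (1.0 once at least one hit exists) are assumptions, since the PR does not spell out how F1 is computed.

```python
from typing import Any

TASK_ID = "BT01_lead_capture_filtered_search"


def qualifies(boat: dict[str, Any]) -> bool:
    """Used Sea Ray priced at or below 200000 (inclusive bound is assumed)."""
    price = boat.get("price")
    return (
        boat.get("condition") == "used"
        and boat.get("make") == "Sea Ray"
        and price is not None
        and price <= 200_000
    )


def well_formed(lead: dict[str, Any]) -> bool:
    """Name and email must both be non-empty after stripping whitespace."""
    return all(str(lead.get(k) or "").strip() for k in ("name", "email"))


def grade_bt01(leads: list[dict], boats_by_slug: dict[str, dict]) -> dict[str, Any]:
    hits, misses, reasons = 0, 0, []
    for lead in leads:
        boat = boats_by_slug.get(lead.get("slug"))
        if boat is not None and qualifies(boat) and well_formed(lead):
            hits += 1
        else:
            misses += 1
            reasons.append(f"miss: lead on {lead.get('slug')!r}")
    total = hits + misses
    precision = hits / total if total else 0.0
    recall = 1.0 if hits else 0.0  # assumption: one qualifying lead completes the task
    denom = precision + recall
    score = 2 * precision * recall / denom if denom else 0.0
    return {
        "passed": hits >= 1 and misses == 0,
        "score": score,
        "reasons": reasons,
        "diff": {"hits": hits, "misses": misses},  # assumed shape
        "task_id": TASK_ID,
    }
```

Under these assumptions a single clean lead scores 1.0 and passes, while one hit plus one collateral lead still produces a non-zero score but fails the pass rule.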

Agent side

No changes needed — src/mantis_agent/gym/grading.py:grade_run is already generic over task_id and hits GET /__env__/oracle?task_id=<id> with the admin token.
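
For anyone probing the endpoint by hand, a stdlib-only sketch of the request that `grade_run` is described as making. The function names are hypothetical, and sending the admin token as the value of the `X-Env-Admin` header is an assumption based on the gating described above; this is not the code in `grading.py`.

```python
import json
import urllib.parse
import urllib.request


def build_oracle_request(base_url: str, task_id: str, admin_token: str) -> urllib.request.Request:
    """Build GET /__env__/oracle?task_id=<id> with the admin header attached."""
    query = urllib.parse.urlencode({"task_id": task_id})
    url = f"{base_url.rstrip('/')}/__env__/oracle?{query}"
    return urllib.request.Request(url, headers={"X-Env-Admin": admin_token}, method="GET")


def fetch_verdict(base_url: str, task_id: str, admin_token: str, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON verdict."""
    req = build_oracle_request(base_url, task_id, admin_token)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

Usage would look like `fetch_verdict("http://localhost:8000", "BT01_lead_capture_filtered_search", token)["passed"]`, where the host and port are placeholders for wherever the env is running.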

Tests

tests/sim_envs/mantis_boattrader/ — 21 tests, all green:

- `test_app_smoke.py` (14): harness gating, oracle endpoint shape, mutation emission from each mutating route, reset boundary, end-to-end BT01 round-trip (`/boats/?…` → `/boat/<slug>/contact` → `/__env__/oracle` returns passed=true).
- `test_oracle_determinism.py` (7): dispatcher round-trip, BT01 fail-on-seed / pass-on-qualifying / fail-on-collateral / reject-malformed, determinism parametrize.

Full tests/sim_envs/ regression: 194 passed (no regression in CRM/shop/helpdesk).

uv sync --extra dev --extra server
uv pip install jinja2
uv run pytest tests/sim_envs/mantis_boattrader/ -q
# → 21 passed

Test plan

- `pytest tests/sim_envs/mantis_boattrader/` passes locally (21 tests).
- `pytest tests/sim_envs/` regression passes (194 tests, no breakage in sibling envs).
- CI green on this PR (CI installs `--extra dev --extra server --extra orchestrator --extra metrics`; smoke tests requiring jinja2 will skip cleanly, determinism tests run).
- After merge: `daytona_mantis_boattrader.py` deploy round-trip + `curl /__env__/oracle?task_id=BT01_lead_capture_filtered_search` against the live preview.

Follow-up

#590 — per-step oracle reward in gym/training (replace StepVerifier.verify_filter for sim-env runs). Depends on this PR merging.

Tracked under epic #331.

Closes #588.
Closes #589.

🤖 Generated with Claude Code

High-fidelity boattrader.com clone served by FastAPI with deterministic catalog (seed=42 → 600 boats, 25 makes, dealer cards, ads, lead capture) plus the oracle harness pattern already proven for mantis-crm / shop / helpdesk:

* In-memory ``Store.mutations`` audit log; every state-changing public route (consent, lead submission, phone reveal, env reset) stamps an entry so graders can reconstruct what the agent did.
* ``GET /__env__/oracle?task_id=<id>`` returns the canonical ``{passed, score, reasons, diff, task_id}`` shape, gated on ``X-Env-Admin``. ``GET /__env__/mutations[?since=<id>]`` exposes the raw audit log.
* ``app/oracles/__init__.py`` dispatcher; new graders register a file + table entry.

Determinism contract enforced by ``tests/sim_envs/mantis_boattrader/test_oracle_determinism.py``.

First concrete grader: ``BT01_lead_capture_filtered_search``. The agent filters listings to ``condition=used`` + ``make=Sea Ray`` + ``price_max=200000``, clicks a matching boat, submits the contact form. Oracle scores F1 over (hit leads on qualifying boats, miss leads on non-qualifying / malformed payloads); pass = ≥1 hit, 0 misses.

The agent-side oracle client (``src/mantis_agent/gym/grading.py``) is already generic over task_id and needs no changes.

Closes #588. Closes #589.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ``app/seed.py``: ``listing_type`` was read before its assignment in the POA-price branch (was previously surviving on stale loop-scope state). Move the listing-type rng draw above the POA branch so the read is well-defined on iteration 0. Drop unused ``timedelta`` / ``timezone`` imports.
* ``app/db.py``: drop unused ``Iterable`` import.
* ``tests/.../test_oracle_determinism.py``: drop unused ``seed as seed_mod`` import in the ``seeded_store`` fixture.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
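
The ``listing_type`` fix follows a common pattern, sketched here with invented names. Nothing below is taken from ``app/seed.py``: the field names, the 10% POA rate, the listing-type values and the reading of POA as "price on application" are all assumptions.

```python
import random


def seed_listings(seed: int = 42, count: int = 5) -> list[dict]:
    rng = random.Random(seed)
    listings = []
    for i in range(count):
        # Draw listing_type before any branch reads it. If this draw sat
        # below the POA branch, iteration 0 would hit an unbound name and
        # later iterations would silently reuse the previous boat's value.
        listing_type = rng.choice(["standard", "premium"])
        if rng.random() < 0.1:
            # POA branch: no numeric price, but it still reads listing_type.
            price = None
            badge = f"POA-{listing_type}"
        else:
            price = rng.randrange(10_000, 500_000, 500)
            badge = listing_type
        listings.append(
            {"id": i, "listing_type": listing_type, "price": price, "badge": badge}
        )
    return listings
```

Moving a draw changes the order in which the generator is consumed, so a fix like this can change the values produced for a given seed, which is worth keeping in mind when fixtures pin exact counts.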

mercurialsolo and others added 2 commits May 22, 2026 12:17

mercurialsolo merged commit 3405898 into main May 22, 2026
4 checks passed

mercurialsolo merged 2 commits into main from worktree-env-boattrader
