HeadlessRim Docker + real benchmarks + leaderboard infrastructure (#13)#15
HeadlessRim Docker + real benchmarks + leaderboard infrastructure (#13)#15
Conversation
Reproducible installs via uv.lock, faster CI, modern Python target. Dev environment targets 3.14 via .python-version. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…olations - Add type args to all bare dict/list annotations across 12 files - Fix callable → Callable, object → GameState, missing TYPE_CHECKING imports - Add types-PyYAML stub and mypy overrides for optional deps (wandb, huggingface_hub) - Ruff auto-fixed import sorting from stricter py313 target Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents decision to replace mock benchmarks with Docker HeadlessRim, expand scoring to 10 metrics, add bootstrap CIs, OpenRouter cost tracking, and structured event logging. Includes stdlib-only stats rationale. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dockerfile (debian:bookworm-slim + Xvfb), docker-compose with volume mounts for game files/mods/saves, entrypoint with RIMAPI healthcheck. Game files mounted at runtime to avoid distributing copyrighted content. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DockerGameServer manages HeadlessRim container via async subprocess. Shared wait_for_rimapi() utility replaces ad-hoc polling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
BootstrapCI pydantic model + bootstrap_ci() and bootstrap_paired_delta() using random.choices(). 18 tests covering CIs, reproducibility, edge cases. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CostTracker accumulates token usage, fetches per-token pricing from OpenRouter's public API (no auth). Graceful fallback to $0.00. 16 tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add coordination and communication_efficiency metrics. Extend MetricContext with conflict/message tracking fields. Redistribute DEFAULT_WEIGHTS across 10 metrics (process metrics get 20% combined). Update all 6 scenario YAMLs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
EventLog appends JSONL per run: deliberations, conflicts, action executions, scores, errors. RunSummary aggregates for CI artifacts. Context manager support. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LeaderboardEntry model, Leaderboard class with from_history(), to_markdown() model×scenario matrix, to_csv(), and pareto_frontier(). Reuses _std/_t_to_p from delta.py. 12 tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ci.yml: ruff + mypy + pytest + smoke-test on push/PR. benchmark.yml: manual dispatch + weekly schedule with Docker template. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lta, cost tracking hooks - run_benchmark.py: --docker, --smoke-test (deprecate --dry-run), --ablation stub, N>=4 enforcement - run_scenario.py: replace ad-hoc polling with shared wait_for_rimapi() - base_role.py: capture _last_usage and _last_raw_output from provider calls - delta.py: add agent_ci/delta_ci properties using bootstrap module - config.py: add docker_image, docker_port fields - metadata.py: add docker_mode, random_seed placeholders - scoring/__init__.py: export bootstrap types - Fix ruff violations and test assertions for 10-metric weights Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Code ReviewApprove with follow-up items. Infrastructure is solid, 326 tests passing, well-documented ADR. Issues to fix before merge
Nits
All being fixed in follow-up commit. |
…rfile - run_scenario.py: restore colonist_count polling after wait_for_rimapi() - run_benchmark.py: Docker stop in try/except so container doesn't leak - delta.py: cached_property on agent_ci/delta_ci to avoid recomputing 10K bootstrap iterations - Dockerfile: add actual HeadlessRimPatch v1.0.0 download via curl + unzip Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
All review issues addressed in
326 tests, ruff clean, mypy strict clean. |
There was a problem hiding this comment.
PR Review — HeadlessRim Docker + Benchmarks + Leaderboard
Nice scope here @jkbennitt — this is the right infrastructure for making RLE a credible benchmark. CI is green, 326 tests passing. A few things to address before merge:
Blocking
1. .dockerignore won't be read
It's at docker/.dockerignore, but docker-compose.yml sets context: .. (repo root). Docker reads .dockerignore from the build context root. Move it to the repo root or it's silently ignored — .git, __pycache__, .env all get copied into the build layer.
2. RIMAPI IPv6 loopback vs Docker port forwarding
RIMAPI binds to [::1]:8765 (IPv6 loopback). Docker port forwarding maps 0.0.0.0:HOST_PORT → container_ip:8765, but won't reach ::1 inside the container. The healthcheck may work (localhost resolves to ::1 inside), but the host benchmark runner can't connect. Needs a RIMAPI config override to bind 0.0.0.0 inside the container.
3. Entrypoint writes to :ro mount
The entrypoint runs mkdir -p "$MODS_DIR" and ln -sf under /opt/game, which is mounted :ro in docker-compose. Those writes will fail. Either remove :ro or use a writable overlay for the Mods directory.
4. .python-version says 3.14
pyproject.toml says >=3.13, CI uses 3.13, and Caleb's machine runs 3.12.3 (CLAUDE.md explicitly warns against 3.14 due to PyTorch ROCm incompatibility). This will break uv for anyone not on 3.14. Should be 3.13 or removed.
5. Leaderboard CI uses broken t-approximation for small N
t_crit = 1.96 if n > 30 else 2.0 — for n=2 (two runs), the real t(0.025, df=1) is 12.7, not 2.0. This produces misleadingly tight confidence intervals for 2-5 runs. Since bootstrap_ci already exists and is correct, use it here instead of this approximation.
6. New metrics never wired up
coordination and communication_efficiency (20% combined weight) depend on MetricContext.conflicts_total, conflicts_resolved, messages_sent, messages_acted_on — but the game loop never populates these counters. Both metrics default to 1.0, silently inflating every score. Either wire them up in this PR or set their weight to 0.0 until they are.
Should Fix
7. Docker cleanup not in finally — In run_benchmark.py, the docker stop/cleanup at ~L688 sits after the main block. If an exception occurs, the container leaks. Use try/finally or the existing DockerGameServer.__aexit__ context manager.
8. No --rm on docker run — DockerGameServer.start() doesn't pass --rm. If stop() is never called (crash, Ctrl+C), orphaned containers accumulate. At minimum, clean up stale containers with the same name on start.
9. Container runs as root — No USER directive in the Dockerfile. Low priority for a local dev tool, but worth adding.
10. DEFAULT_WEIGHTS is a breaking change — Went from 8 to 10 metrics. CLAUDE.md still says "8 metrics, weighted composite" — needs updating. Existing tooling that consumes score snapshots may not expect the new fields.
11. Inconsistent std deviation (ddof=0 vs ddof=1) — bootstrap_ci() uses population std (ddof=0), while PairedResult uses sample std (ddof=1). The CI bounds are from percentile resampling so they're unaffected, but the .std field on BootstrapCI will confuse anyone comparing it against PairedResult.agent_std.
Nits
--ablationis a no-op stub — either document it as WIP in help text or remove from parser until implemented.- Late import
from rle.docker import wait_for_rimapiinsiderun_scenario.pyat the call site — move to top-level imports. EventLogopens a file handle outside awithblock (# noqa: SIM115). If init fails after open, handle leaks.
…ap CI, metric weights 1. Move .dockerignore to repo root (build context is ..) 2. Pre-seed RIMAPI config with serverIP=0.0.0.0 (fixes IPv6 loopback in Docker) 3. Writable mods overlay (/opt/mods-merged) instead of writing to :ro game mount 4. Python 3.14 across pyproject.toml, ruff, mypy, CI workflows 5. Replace broken t-approximation in leaderboard with bootstrap_ci() 6. Zero process metric weights until game loop wires MetricContext counters 7. try/finally for Docker cleanup in run_benchmark.py 8. --rm + stale container cleanup on docker run 9. Non-root USER in Dockerfile 10. CLAUDE.md updated for 10 metrics 11. Consistent sample std (ddof=1) in bootstrap.py 12. Nits: --ablation WIP label, top-level import, test assertions Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
All 12 review items addressed in Blocking (all fixed):
Should fix (all fixed): Nits (all fixed):
326 tests, ruff clean, mypy strict clean, smoke-test green. |
README/CONTRIBUTING/CLAUDE.md: Python 3.14, 10 metrics, Docker section, smoke-test commands, CI/CD, complete package tree, uv conventions. code-style.md: mypy strict, no scipy, from __future__ import annotations. benchmark SKILL: use .env config instead of hardcoded model, new flags. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Reviewed all 12 items, verified against the code. Everything checks out:
326 tests pass, ruff clean, mypy strict clean on both 3.12 and 3.14. Speaking of which -- we verified that PyTorch ROCm nightly ( One thing to add before merge: update CLAUDE.md to note that PyTorch ROCm nightly is confirmed working on 3.14 (torch 2.12.0.dev+rocm7.2). The old "ROCm incompatible with 3.14" caveat can go. LGTM. @jkbennitt you are hereby granted permission to smash that merge button with the mass and velocity of a mass driver shell. Send it into orbit. Launch the ship. May your CI be green and your containers never leak. Godspeed, you beautiful code-slinging maniac. |
Summary
Implements #13 — the infrastructure for real automated benchmarks and a multi-model leaderboard. Replaces the fake
--dry-runmock benchmarks with Docker-containerized HeadlessRim, expands scoring to 10 metrics, and adds the statistical rigor and observability needed for a credible AGI benchmark.What's new
DockerGameServermanages container start/stop/restart via async subprocess. Sharedwait_for_rimapi()utility.BootstrapCImodel +bootstrap_ci()/bootstrap_paired_delta(). stdlib-only (no scipy). 18 tests.CostTrackerwith real-time pricing from OpenRouter's public API. Graceful fallback. 16 tests.coordinationandcommunication_efficiencyprocess metrics (20% combined weight). All 6 scenario YAMLs updated.RunSummaryfor CI artifacts.ci.yml(ruff + mypy + pytest + smoke-test on push/PR),benchmark.yml(manual dispatch + weekly Docker template).--dockerflag,--smoke-test(deprecates--dry-run),--ablationstub, N≥4 enforcement for HF push, bootstrap CIs inPairedResult, usage capture inbase_role.py.Stats
Test plan
uv run ruff check src/ tests/ scripts/— all checks passeduv run mypy src/rle/— 42 files, no issues (strict)uv run pytest— 326 passedpython scripts/run_benchmark.py --dry-run --ticks 3— smoke test cleanCloses #13
🤖 Generated with Claude Code