HeadlessRim Docker: real automated benchmarks + leaderboard infrastructure #13

@jkbennitt

Problem

The current --dry-run benchmarks are fake. They run against a hardcoded static mock state (3 colonists, 8000 wealth, never changes), and the mock LLM returns no_action every tick, so the metrics are flat. This tests JSON parse rate and pipeline plumbing, not colony management.
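
For illustration, a hypothetical sketch of the mock that the --dry-run path exercises. The class and field names are assumptions; only the figures (3 colonists, 8000 wealth) and the no_action behavior come from the description above:

```python
# Hypothetical mock transport: class and field names are illustrative.
MOCK_STATE = {
    "colonists": 3,   # never changes
    "wealth": 8000,   # never changes
    "threats": [],
}

class MockTransport:
    """Returns the same frozen state every tick, so every metric is flat."""

    def get_state(self) -> dict:
        return MOCK_STATE

    def send_action(self, action: dict) -> dict:
        # The mock LLM always answers no_action, so nothing is ever applied.
        return {"status": "ok", "applied": False}
```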

Real benchmarks require a live RimWorld instance — and the leaderboard (#8) requires running 6 scenarios × N models × 4+ runs = hundreds of game sessions with statistical rigor. That's not possible manually.

Solution

Run HeadlessRim + HeadlessRimPatch + RIMAPI in Docker. Ilya confirmed that the RIMAPI endpoints work headlessly (HeadlessRim#1) and published HeadlessRimPatch the same day. Save files load via RIMAPI endpoints plus a Docker mount.

Our existing scoring, evaluation, and paired-comparison code works unchanged: just swap the mock transport for real HTTP to the containerized game.
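
A minimal sketch of that swap, assuming a transport interface like the hypothetical one above; the endpoint paths are placeholders, not confirmed RIMAPI routes:

```python
import requests

class DockerTransport:
    """Drop-in replacement for the mock: same interface, real HTTP."""

    def __init__(self, base_url: str = "http://localhost:8765"):
        self.base_url = base_url

    def get_state(self) -> dict:
        # Placeholder path: substitute the actual RIMAPI state route.
        resp = requests.get(f"{self.base_url}/api/state", timeout=30)
        resp.raise_for_status()
        return resp.json()

    def send_action(self, action: dict) -> dict:
        # Placeholder path: substitute the actual RIMAPI action route.
        resp = requests.post(f"{self.base_url}/api/action", json=action, timeout=30)
        resp.raise_for_status()
        return resp.json()
```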

Why this is the leaderboard blocker

The leaderboard vision from #8 ("Chatbot Arena for RimWorld") requires:

| Requirement | Without Docker | With Docker |
|---|---|---|
| Run 96+ game sessions (6 scenarios × 4 models × 4 runs) | Human babysitting a PC for hours | Scripted, unattended |
| Statistical significance (N=4+ per model) | Impractical manually | Automated with parallel containers |
| Reproducibility | Different random events per manual session | Same Docker image + save = same start |
| New model drops, update leaderboard | Redo everything manually | CI triggers full matrix automatically |
| Community submissions | "Trust us, we ran it" | Publish Docker image, anyone can replicate |

Without HeadlessRim, the leaderboard is a one-off manual effort. With it, it's a living automated pipeline.

Architecture

```
Docker container (HeadlessRim + HeadlessRimPatch + RIMAPI on :8765)
    ↕ REST API + SSE
RLE Python orchestrator (run_benchmark.py)
    ↕ hub-spoke
7 agents (Felix Agent SDK) + any LLM provider
    ↕
Score → paired stats → leaderboard update
```
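
For the SSE half of that link, a rough sketch assuming the container emits standard text/event-stream messages; the /api/events path is an assumption, not a documented RIMAPI route:

```python
import json
import requests

def stream_events(base_url: str = "http://localhost:8765"):
    """Yield parsed SSE events from the containerized game."""
    with requests.get(f"{base_url}/api/events", stream=True, timeout=None) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            # Standard SSE framing: payload lines start with "data:".
            if line and line.startswith("data:"):
                yield json.loads(line[len("data:"):].strip())
```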

Leaderboard pipeline:

```
for model in [claude, gpt-4o, nemotron, llama, ...]:
  for scenario in [crashlanded, first_winter, ...]:
    for run in range(4):
      docker start → load save via RIMAPI
      run_benchmark.py --model $model
      docker stop
    compute paired stats (agent vs baseline, Cohen's d, p-value)
publish leaderboard
```
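
A runnable sketch of that loop, driving the Docker CLI via subprocess. The image tag, the convention of passing the scenario as the container argument, and the --docker flag (proposed below) are all assumptions:

```python
import itertools
import subprocess

MODELS = ["claude", "gpt-4o", "nemotron", "llama"]
SCENARIOS = ["crashlanded", "first_winter"]  # extend to all 6 scenarios
IMAGE = "headlessrim-bench:latest"  # assumed image tag

for model, scenario in itertools.product(MODELS, SCENARIOS):
    for run in range(4):
        # Start the container; the scenario save is assumed to load
        # via RIMAPI on startup.
        cid = subprocess.check_output(
            ["docker", "run", "-d", "-p", "8765:8765", IMAGE, scenario],
            text=True,
        ).strip()
        try:
            subprocess.run(
                ["python", "run_benchmark.py", "--docker", "--model", model],
                check=True,
            )
        finally:
            subprocess.run(["docker", "rm", "-f", cid], check=True)
    # Paired stats (agent vs baseline, Cohen's d, p-value) and the
    # leaderboard publish step reuse the existing scoring code.
```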

Tasks

Docker image

  • Clone HeadlessRimPatch, review Harmony patches
  • Build Docker image: HeadlessRim + HeadlessRimPatch + RIMAPI + scenario save files
  • Test single scenario run against containerized game
  • Benchmark tick speed — can we push speed 4 headlessly? (see the sketch after this list)
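
A sketch of the tick-speed check, assuming the state payload exposes a monotonically increasing tick counter; both the endpoint and the "tick" field name are placeholders:

```python
import time
import requests

def measure_tps(base_url: str = "http://localhost:8765", window: float = 10.0) -> float:
    """Estimate ticks per second by sampling the tick counter twice."""
    def tick() -> int:
        return requests.get(f"{base_url}/api/state", timeout=30).json()["tick"]

    start = tick()
    time.sleep(window)
    return (tick() - start) / window
```

Higher speed settings should show up as roughly clean multiples of the normal-speed tick rate; if they don't, the headless build is likely CPU-bound.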

Benchmark tooling

  • Rename --dry-run to --smoke-test (honest about what it measures)
  • Add --docker flag to run_benchmark.py (connect to containerized game)
  • Multi-container parallel runs for N=4+ statistical significance (see the sketch after this list)
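
A sketch of the parallel fan-out, one container per run on its own host port; the image tag and the --port flag on run_benchmark.py are assumptions:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

IMAGE = "headlessrim-bench:latest"  # assumed image tag

def run_one(model: str, scenario: str, run_idx: int) -> None:
    port = 8765 + run_idx  # each parallel run gets its own host port
    cid = subprocess.check_output(
        ["docker", "run", "-d", "-p", f"{port}:8765", IMAGE, scenario],
        text=True,
    ).strip()
    try:
        subprocess.run(
            ["python", "run_benchmark.py", "--docker",
             "--port", str(port), "--model", model],  # --port is assumed
            check=True,
        )
    finally:
        subprocess.run(["docker", "rm", "-f", cid], check=True)

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(run_one, "claude", "crashlanded", i) for i in range(4)]
    for f in futures:
        f.result()  # surface any per-run failure
```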

CI / leaderboard

  • GitHub Actions workflow: spin up container → run full benchmark suite → collect results
  • Leaderboard generation from benchmark_history.jsonl (see the sketch after this list)
  • Publish to HuggingFace Hub (existing infra in run_benchmark.py)
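
A sketch of the leaderboard step, assuming each line of benchmark_history.jsonl is a JSON object with at least "model" and "score" fields (adapt the keys to the real schema):

```python
import json
from collections import defaultdict
from statistics import mean, stdev

def build_leaderboard(path: str = "benchmark_history.jsonl") -> list[dict]:
    """Aggregate per-model scores into a ranked table."""
    scores = defaultdict(list)
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            scores[rec["model"]].append(rec["score"])

    rows = [
        {"model": m, "n": len(v), "mean": mean(v),
         "stdev": stdev(v) if len(v) > 1 else 0.0}
        for m, v in scores.items()
    ]
    return sorted(rows, key=lambda r: r["mean"], reverse=True)
```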
