HeadlessRim Docker: real automated benchmarks + leaderboard infrastructure #13

@jkbennitt

Problem

The current --dry-run benchmarks are fake. They run against a hardcoded static mock state (3 colonists, 8000 wealth, never changes), and the mock LLM returns no_action every tick, so the metrics are flat. This tests JSON parse rate and pipeline plumbing, not colony management.
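
For illustration, a hypothetical sketch of the mock that the --dry-run path exercises. The class and field names are assumptions; only the figures (3 colonists, 8000 wealth) and the no_action behavior come from the description above:

```python
# Hypothetical mock transport: class and field names are illustrative.
MOCK_STATE = {
    "colonists": 3,   # never changes
    "wealth": 8000,   # never changes
    "threats": [],
}

class MockTransport:
    """Returns the same frozen state every tick, so every metric is flat."""

    def get_state(self) -> dict:
        return MOCK_STATE

    def send_action(self, action: dict) -> dict:
        # The mock LLM always answers no_action, so nothing is ever applied.
        return {"status": "ok", "applied": False}
```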

Real benchmarks require a live RimWorld instance — and the leaderboard (#8) requires running 6 scenarios × N models × 4+ runs = hundreds of game sessions with statistical rigor. That's not possible manually.

Solution

Run HeadlessRim + HeadlessRimPatch + RIMAPI in Docker. Ilya confirmed that the RIMAPI endpoints work headlessly (HeadlessRim#1) and published HeadlessRimPatch the same day. Save files load via RIMAPI endpoints plus a Docker mount.

Our existing scoring, evaluation, and paired-comparison code works unchanged: just swap the mock transport for real HTTP to the containerized game.
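
A minimal sketch of that swap, assuming a transport interface like the hypothetical one above; the endpoint paths are placeholders, not confirmed RIMAPI routes:

```python
import requests

class DockerTransport:
    """Drop-in replacement for the mock: same interface, real HTTP."""

    def __init__(self, base_url: str = "http://localhost:8765"):
        self.base_url = base_url

    def get_state(self) -> dict:
        # Placeholder path: substitute the actual RIMAPI state route.
        resp = requests.get(f"{self.base_url}/api/state", timeout=30)
        resp.raise_for_status()
        return resp.json()

    def send_action(self, action: dict) -> dict:
        # Placeholder path: substitute the actual RIMAPI action route.
        resp = requests.post(f"{self.base_url}/api/action", json=action, timeout=30)
        resp.raise_for_status()
        return resp.json()
```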

Why this is the leaderboard blocker

The leaderboard vision from #8 ("Chatbot Arena for RimWorld") requires:

| Requirement | Without Docker | With Docker |
|---|---|---|
| Run 96+ game sessions (6 scenarios × 4 models × 4 runs) | Human babysitting a PC for hours | Scripted, unattended |
| Statistical significance (N=4+ per model) | Impractical manually | Automated with parallel containers |
| Reproducibility | Different random events per manual session | Same Docker image + save = same start |
| New model drops, update leaderboard | Redo everything manually | CI triggers full matrix automatically |
| Community submissions | "Trust us, we ran it" | Publish Docker image, anyone can replicate |

Without HeadlessRim, the leaderboard is a one-off manual effort. With it, it's a living automated pipeline.

Architecture

```
Docker container (HeadlessRim + HeadlessRimPatch + RIMAPI on :8765)
    ↕ REST API + SSE
RLE Python orchestrator (run_benchmark.py)
    ↕ hub-spoke
7 agents (Felix Agent SDK) + any LLM provider
    ↕
Score → paired stats → leaderboard update
```
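
For the SSE half of that link, a rough sketch assuming the container emits standard text/event-stream messages; the /api/events path is an assumption, not a documented RIMAPI route:

```python
import json
import requests

def stream_events(base_url: str = "http://localhost:8765"):
    """Yield parsed SSE events from the containerized game."""
    with requests.get(f"{base_url}/api/events", stream=True, timeout=None) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            # Standard SSE framing: payload lines start with "data:".
            if line and line.startswith("data:"):
                yield json.loads(line[len("data:"):].strip())
```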

Leaderboard pipeline:

```
for model in [claude, gpt-4o, nemotron, llama, ...]:
  for scenario in [crashlanded, first_winter, ...]:
    for run in range(4):
      docker start → load save via RIMAPI
      run_benchmark.py --model $model
      docker stop
    compute paired stats (agent vs baseline, Cohen's d, p-value)
publish leaderboard
```
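
A runnable sketch of that loop, driving the Docker CLI via subprocess. The image tag, the convention of passing the scenario as the container argument, and the --docker flag (proposed below) are all assumptions:

```python
import itertools
import subprocess

MODELS = ["claude", "gpt-4o", "nemotron", "llama"]
SCENARIOS = ["crashlanded", "first_winter"]  # extend to all 6 scenarios
IMAGE = "headlessrim-bench:latest"  # assumed image tag

for model, scenario in itertools.product(MODELS, SCENARIOS):
    for run in range(4):
        # Start the container; the scenario save is assumed to load
        # via RIMAPI on startup.
        cid = subprocess.check_output(
            ["docker", "run", "-d", "-p", "8765:8765", IMAGE, scenario],
            text=True,
        ).strip()
        try:
            subprocess.run(
                ["python", "run_benchmark.py", "--docker", "--model", model],
                check=True,
            )
        finally:
            subprocess.run(["docker", "rm", "-f", cid], check=True)
    # Paired stats (agent vs baseline, Cohen's d, p-value) and the
    # leaderboard publish step reuse the existing scoring code.
```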

Tasks

Docker image

  • Clone HeadlessRimPatch, review Harmony patches
  • Build Docker image: HeadlessRim + HeadlessRimPatch + RIMAPI + scenario save files
  • Test single scenario run against containerized game
  • Benchmark tick speed — can we push speed 4 headlessly? (see the sketch after this list)
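
A sketch of the tick-speed check, assuming the state payload exposes a monotonically increasing tick counter; both the endpoint and the "tick" field name are placeholders:

```python
import time
import requests

def measure_tps(base_url: str = "http://localhost:8765", window: float = 10.0) -> float:
    """Estimate ticks per second by sampling the tick counter twice."""
    def tick() -> int:
        return requests.get(f"{base_url}/api/state", timeout=30).json()["tick"]

    start = tick()
    time.sleep(window)
    return (tick() - start) / window
```

Higher speed settings should show up as roughly clean multiples of the normal-speed tick rate; if they don't, the headless build is likely CPU-bound.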

Benchmark tooling

  • Rename --dry-run to --smoke-test (honest about what it measures)
  • Add --docker flag to run_benchmark.py (connect to containerized game)
  • Multi-container parallel runs for N=4+ statistical significance (see the sketch after this list)
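
A sketch of the parallel fan-out, one container per run on its own host port; the image tag and the --port flag on run_benchmark.py are assumptions:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

IMAGE = "headlessrim-bench:latest"  # assumed image tag

def run_one(model: str, scenario: str, run_idx: int) -> None:
    port = 8765 + run_idx  # each parallel run gets its own host port
    cid = subprocess.check_output(
        ["docker", "run", "-d", "-p", f"{port}:8765", IMAGE, scenario],
        text=True,
    ).strip()
    try:
        subprocess.run(
            ["python", "run_benchmark.py", "--docker",
             "--port", str(port), "--model", model],  # --port is assumed
            check=True,
        )
    finally:
        subprocess.run(["docker", "rm", "-f", cid], check=True)

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(run_one, "claude", "crashlanded", i) for i in range(4)]
    for f in futures:
        f.result()  # surface any per-run failure
```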

CI / leaderboard

  • GitHub Actions workflow: spin up container → run full benchmark suite → collect results
  • Leaderboard generation from benchmark_history.jsonl (see the sketch after this list)
  • Publish to HuggingFace Hub (existing infra in run_benchmark.py)
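
A sketch of the leaderboard step, assuming each line of benchmark_history.jsonl is a JSON object with at least "model" and "score" fields (adapt the keys to the real schema):

```python
import json
from collections import defaultdict
from statistics import mean, stdev

def build_leaderboard(path: str = "benchmark_history.jsonl") -> list[dict]:
    """Aggregate per-model scores into a ranked table."""
    scores = defaultdict(list)
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            scores[rec["model"]].append(rec["score"])

    rows = [
        {"model": m, "n": len(v), "mean": mean(v),
         "stdev": stdev(v) if len(v) > 1 else 0.0}
        for m, v in scores.items()
    ]
    return sorted(rows, key=lambda r: r["mean"], reverse=True)
```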
