RLE v1.0: Multi-model colony management leaderboard

## Vision

RLE is two things:

1. **A rigorous multi-agent game benchmark** modeled after FLE (Factorio Learning Environment, NeurIPS 2025) — but for multi-agent coordination under uncertainty instead of single-agent factory optimization
2. **Chatbot Arena for RimWorld** — a public leaderboard where different LLMs compete at managing a colony through 7 specialized agents

The leaderboard is the product. FLE's methodology is the credibility.

**The clip:** A clean results table showing Claude vs GPT vs Nemotron vs Llama on colony survival. "Claude keeps 5/5 alive through a raid, GPT loses 2." That's what gets shared.

**Three audiences, one dataset:**
- AI/ML researchers → FLE-style paper with rigorous methodology, baselines, p-values
- Dev community → Felix SDK showcase, livestream demo with dashboard
- RimWorld/gaming community → AI colonies, mod potential, entertaining failures

## How RLE Differs from FLE

| | FLE | RLE |
|---|---|---|
| Game | Factorio (deterministic) | RimWorld (stochastic) |
| Agents | Single agent | 7 role-specialized, hub-spoke coordination |
| Communication | None | CentralPost with phase/score broadcasts |
| Environment | Deterministic (fixed seeds) | Stochastic (raids, disease, mood, weather) |
| Task structure | 24 lab-play + open-play | 6 scenarios + paired agent-vs-baseline |
| Scoring | Binary pass + Production Score | 10-metric composite + delta over baseline |
| Model comparison | 6 frontier models | Local (4B) to cloud (120B), any provider |
| Baseline | None (gap in FLE) | Unmanaged colony (RimWorld built-in AI) |
| Human baseline | None (gap in FLE) | Planned (RimWorld has large player base) |
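
To make the "composite + delta over baseline" scoring row concrete, here is a minimal sketch. The metric names, the equal weighting, and the example values are all assumptions for illustration; the actual 10 metrics and their weights are not listed in this issue.

```python
def composite_score(metrics, weights=None):
    """Weighted average of metrics that are already normalized to [0, 1].

    `metrics` maps metric name -> value. Equal weights are assumed when
    `weights` is omitted (an assumption, not the RLE weighting).
    """
    if weights is None:
        weights = {name: 1.0 for name in metrics}
    total_weight = sum(weights[name] for name in metrics)
    return sum(metrics[name] * weights[name] for name in metrics) / total_weight


# Hypothetical metric names and made-up values, for illustration only.
agent = {"survival": 1.0, "mood": 0.7, "food": 0.8}
baseline = {"survival": 0.8, "mood": 0.6, "food": 0.7}

# Delta over baseline: positive means the agents helped.
delta = composite_score(agent) - composite_score(baseline)
print(f"delta over baseline: {delta:+.3f}")
```

Reporting the delta rather than the raw composite keeps scenarios of different difficulty comparable, since each scenario is judged against its own unmanaged run.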

## FLE Patterns We're Following

- **Fixed-seed reproducibility**: Save/load same colony state for every run
- **Multiple runs per model**: N=4+ with mean ± std, report median for skewed distributions
- **Binary + continuous metrics**: Victory/failure conditions AND composite score
- **Difficulty progression**: Easy (Crashlanded) → Extreme (Ship Launch)
- **Comparative results table**: Model × scenario matrix
- **Paired evaluation**: Agent score vs. unmanaged-baseline score on the same save (FLE has no baseline, so this goes beyond its protocol)
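
The "multiple runs per model" reporting could be summarized along these lines. This is a sketch using only the standard library; the function name and the scores are made up for illustration.

```python
from statistics import mean, median, stdev


def summarize_runs(scores):
    """Summarize composite scores from repeated runs of one model on one scenario."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return {
        "n": len(scores),
        "mean": mean(scores),
        "std": stdev(scores),      # sample standard deviation (n - 1 denominator)
        "median": median(scores),  # more robust when a run ends in colony collapse
    }


runs = [0.78, 0.81, 0.74, 0.79]  # made-up scores, N=4
summary = summarize_runs(runs)
print(
    f"{summary['mean']:.3f} ± {summary['std']:.3f} "
    f"(median {summary['median']:.3f}, N={summary['n']})"
)
```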

## FLE Patterns We're Adding

- **Stochastic robustness**: Different random events per run (RimWorld storyteller varies), requiring more runs for significance
- **Multi-agent coordination**: 7 agents must coordinate without conflicting, measured by conflict resolution stats
- **Ablation**: Remove one agent at a time to measure per-agent contribution
- **Human baseline**: Expert RimWorld players on same scenarios (FLE acknowledged this gap)
- **Real-time visualization**: Helix + dashboard overlay for qualitative assessment
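
The ablation idea (remove one agent at a time) can be sketched as a leave-one-out loop. The agent names and the `fake_benchmark` stub below are hypothetical; a real run would call the paired benchmark and average over several runs per configuration, since single runs are noisy.

```python
def ablation_contributions(agents, run_benchmark):
    """Estimate each agent's contribution as full score minus leave-one-out score.

    `run_benchmark` is any callable that takes a list of agent names and
    returns a composite score (assumed interface, not the RLE API).
    """
    full_score = run_benchmark(agents)
    contributions = {}
    for removed in agents:
        remaining = [a for a in agents if a != removed]
        contributions[removed] = full_score - run_benchmark(remaining)
    return contributions


# Stand-in benchmark with made-up, purely additive per-agent effects.
AGENTS = ["agent_a", "agent_b", "agent_c"]
_FAKE_VALUE = {"agent_a": 0.05, "agent_b": 0.02, "agent_c": 0.0}


def fake_benchmark(agents):
    return 0.70 + sum(_FAKE_VALUE[a] for a in agents)


contrib = ablation_contributions(AGENTS, fake_benchmark)
for name, value in sorted(contrib.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:+.3f}")
```

A contribution near zero suggests the agent is not pulling its weight on that scenario; a negative one suggests it is actively hurting.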

## Current State

**Infrastructure: DONE**
- [x] 7 role agents with CentralPost hub-spoke communication
- [x] Parallel deliberation (7 agents concurrently)
- [x] SSE events wired into agent decisions
- [x] Helix phase adaptation (exploration → analysis → synthesis)
- [x] Paired benchmark (agent vs unmanaged baseline with save/load)
- [x] Delta scoring with statistical tests (Cohen's d, Welch's t-test)
- [x] Detailed colonist data (skills, traits, current job, needs)
- [x] Dashboard with 5 RLE widgets
- [x] Terminal helix visualizer
- [x] Benchmark tracking (JSONL history, baselines, W&B, HuggingFace Hub)
- [x] Provider-agnostic (local 4B, cloud 120B, Anthropic, OpenAI)
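
For reference, the two statistics named above (Cohen's d, Welch's t-test) can be computed from the per-run scores roughly as follows. This is a standard-library sketch, not the project's implementation, and the scores are made up. Turning `t` and `df` into a p-value needs a t-distribution CDF, for example `scipy.stats.ttest_ind(agent, baseline, equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(agent, baseline):
    """Effect size using the pooled sample standard deviation."""
    n1, n2 = len(agent), len(baseline)
    pooled_var = (
        (n1 - 1) * variance(agent) + (n2 - 1) * variance(baseline)
    ) / (n1 + n2 - 2)
    return (mean(agent) - mean(baseline)) / sqrt(pooled_var)


def welch_t(agent, baseline):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n1, n2 = len(agent), len(baseline)
    se1 = variance(agent) / n1        # squared standard error, agent runs
    se2 = variance(baseline) / n2     # squared standard error, baseline runs
    t = (mean(agent) - mean(baseline)) / sqrt(se1 + se2)
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


# Made-up composite scores for N=4 agent runs and N=4 baseline runs.
agent_scores = [0.80, 0.84, 0.78, 0.82]
baseline_scores = [0.76, 0.79, 0.74, 0.77]

d = cohens_d(agent_scores, baseline_scores)
t, df = welch_t(agent_scores, baseline_scores)
print(f"Cohen's d = {d:.2f}, Welch t = {t:.2f}, df = {df:.1f}")
```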

**Agent quality: IN PROGRESS**
- [x] Agents beat baseline for the first time (+0.018, N=2, not yet significant)
- [x] N=4 run with detailed colonist data for statistical significance
- [ ] Harder scenario saves where baseline struggles more (#7)
- [ ] Ablation (remove one agent, measure contribution)

**Multi-model comparison: NOT STARTED**
- [ ] Run same scenario with 4+ different models
- [ ] Build the leaderboard table
- [ ] Identify which models are best at which scenarios

## Milestones

### M1: Agents consistently beat baseline (#6)
- N=4 paired runs on Crashlanded with positive delta (p < 0.05)
- Detailed colonist data improving agent decisions
- Target: Agent 0.85 vs Baseline 0.75 → delta +0.10

### M2: Multi-scenario benchmark suite (#7)
- 4+ scenario save files created (Crashlanded, First Winter, Raid Defense, Plague Response)
- Paired results across scenarios showing agents help more on harder scenarios
- FLE parallel: like their 24 lab-play tasks but with RimWorld's stochastic challenge progression
- Target: positive delta on at least 4/6 scenarios

### M3: Multi-model leaderboard
- Run the benchmark suite with 4+ models:
  - Nemotron Nano 4B (local, free)
  - Nemotron Super 120B (OpenRouter, ~$0.09/run)
  - Claude Sonnet (Anthropic)
  - GPT-4o (OpenAI)
  - Qwen3.5-9B (local, free)
- FLE parallel: their 6-model comparison table but with delta scores instead of pass rates
- Publish: model × scenario matrix with delta, p-value, effect size
- Target: Clear differentiation between models
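
One possible way to render the model × scenario matrix as a Markdown table is sketched below. The record layout (`delta`, `p`, `d` keys) is an assumption, and the model names and numbers are placeholders, not results.

```python
def leaderboard_table(results, scenarios):
    """Render {model: {scenario: {"delta", "p", "d"}}} as a Markdown table."""
    header = "| Model | " + " | ".join(scenarios) + " |"
    separator = "|---|" + "---|" * len(scenarios)
    rows = []
    for model, by_scenario in results.items():
        cells = []
        for scenario in scenarios:
            r = by_scenario.get(scenario)
            if r is None:
                cells.append("n/a")  # scenario not run for this model yet
            else:
                cells.append(f"{r['delta']:+.3f} (p={r['p']:.2f}, d={r['d']:.2f})")
        rows.append(f"| {model} | " + " | ".join(cells) + " |")
    return "\n".join([header, separator, *rows])


# Placeholder values only; no real benchmark results are shown here.
scenarios = ["Crashlanded", "First Winter"]
results = {
    "model-a": {
        "Crashlanded": {"delta": 0.10, "p": 0.03, "d": 1.2},
        "First Winter": {"delta": -0.02, "p": 0.61, "d": -0.2},
    },
    "model-b": {
        "Crashlanded": {"delta": 0.04, "p": 0.20, "d": 0.5},
    },
}

table = leaderboard_table(results, scenarios)
print(table)
```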

### M4: Public release
- Results on HuggingFace Hub (appsprout/rle-benchmarks)
- Blog post / Twitter thread with the leaderboard
- Dashboard recording / livestream VOD
- README with full reproduction instructions
- FLE parallel: their open-source release with reproducible evaluation

### M5: Paper
- FLE-style methodology: environment description, agent architecture, evaluation protocol
- Results table: model × scenario × metric
- Ablation study: per-agent contribution
- Human baseline comparison (3-5 expert RimWorld players)
- Novel contributions over FLE: multi-agent coordination, stochastic environment, paired baseline, human comparison
- Target venue: NeurIPS workshop, AAAI, or standalone arXiv

## Success Criteria

1. The leaderboard exists with 4+ models showing statistically significant differences
2. At least one scenario where agents demonstrably save a colony that would otherwise fail
3. People share the results because it's a fun, intuitive way to compare LLM capabilities
4. The methodology is rigorous enough that researchers take it seriously

## Timeline

No fixed date. Quality over speed. Momentum-dependent.
