Vision
RLE is two things:
A rigorous multi-agent game benchmark modeled after FLE (Factorio Learning Environment, NeurIPS 2025), but for multi-agent coordination under uncertainty instead of single-agent factory optimization
Chatbot Arena for RimWorld: a public leaderboard where different LLMs compete at managing a colony through 7 specialized agents
The leaderboard is the product. FLE's methodology is the credibility.
The clip: A clean results table showing Claude vs GPT vs Nemotron vs Llama on colony survival. "Claude keeps 5/5 alive through a raid, GPT loses 2." That's what gets shared.
Three audiences, one dataset:
AI/ML researchers → FLE-style paper with rigorous methodology, baselines, p-values
Dev community → Felix SDK showcase, livestream demo with dashboard
RimWorld/gaming community → AI colonies, mod potential, entertaining failures
How RLE Differs from FLE
| Aspect | FLE | RLE |
| --- | --- | --- |
| Game | Factorio (deterministic) | RimWorld (stochastic) |
| Agents | Single agent | 6 role-specialized, hub-spoke coordination |
| Communication | None | CentralPost with phase/score broadcasts |
| Environment | Deterministic (fixed seeds) | Stochastic (raids, disease, mood, weather) |
| Task structure | 24 lab-play + open-play | 6 scenarios + paired agent-vs-baseline |
| Scoring | Binary pass + Production Score | 10-metric composite + delta over baseline |
| Model comparison | 6 frontier models | Local (4B) to cloud (120B), any provider |
| Baseline | None (gap in FLE) | Unmanaged colony (RimWorld built-in AI) |
| Human baseline | None (gap in FLE) | Planned (RimWorld has large player base) |
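To make the "composite + delta over baseline" scoring idea concrete, here is a minimal sketch. The metric names, the weights, and the assumption that each metric is already normalized to [0, 1] are all hypothetical; the comparison above only states that the composite has 10 metrics and that agent runs are paired with an unmanaged-colony baseline.

```python
# Hypothetical subset of metrics with illustrative weights (assumption:
# the real composite uses 10 metrics whose names and weights are not given here).
WEIGHTS = {"survival": 0.5, "mood": 0.25, "wealth": 0.25}


def composite(metrics: dict[str, float]) -> float:
    """Weighted sum of metrics, each assumed to be normalized to [0, 1]."""
    return sum(weight * metrics[name] for name, weight in WEIGHTS.items())


def delta_over_baseline(agent: dict[str, float], baseline: dict[str, float]) -> float:
    """Composite score of the agent-managed run minus the paired baseline run."""
    return composite(agent) - composite(baseline)


# One paired comparison: same starting colony, agent-managed vs unmanaged.
agent_run = {"survival": 1.0, "mood": 0.6, "wealth": 0.4}
baseline_run = {"survival": 0.6, "mood": 0.5, "wealth": 0.4}

delta = delta_over_baseline(agent_run, baseline_run)
print(f"delta over baseline: {delta:+.3f}")
```

A positive delta means the agents did better than leaving the colony to the built-in AI from the same starting state; pairing the runs is what makes the difference attributable to the agents rather than to the scenario.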
FLE Patterns We're Following
Fixed-seed reproducibility: Save/load same colony state for every run
Multiple runs per model: at least 4 runs (N ≥ 4), reported as mean ± std; report the median as well when the distribution is skewed
Binary + continuous metrics: Victory/failure conditions AND composite score
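The reporting patterns above (several runs per model, mean ± std, median, and a binary outcome next to a continuous score) could be aggregated roughly as follows. This is a sketch only: the function name, the field names, and the sample numbers are made up for illustration and are not benchmark results.

```python
import statistics


def summarize(scores: list[float], survived: list[bool]) -> dict[str, float]:
    """Aggregate repeated runs of one model on one scenario.

    scores   -- continuous composite score per run
    survived -- binary victory/failure outcome per run
    """
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),      # sample std, needs n >= 2
        "median": statistics.median(scores),  # more robust for skewed runs
        "pass_rate": sum(survived) / len(survived),
    }


# Illustrative values for N=4 runs (not real measurements).
scores = [0.5, 0.75, 0.25, 0.5]
survived = [True, True, False, True]

summary = summarize(scores, survived)
print(f"{summary['mean']:.2f} ± {summary['std']:.2f} "
      f"(median {summary['median']:.2f}, pass rate {summary['pass_rate']:.0%})")
```

Reporting the pass rate alongside the score keeps a single catastrophic run (a colony wipe) visible even when the mean looks acceptable.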
FLE Patterns We're Adding
Current State
Infrastructure: DONE
Agent quality: IN PROGRESS
Multi-model comparison: NOT STARTED
Milestones
M1: Agents consistently beat baseline (#6)
M2: Multi-scenario benchmark suite (#7)
M3: Multi-model leaderboard
M4: Public release
M5: Paper
Success Criteria
Timeline
No fixed date. Quality over speed. Momentum-dependent.