A 25-month indie-hacker narrative game used as a behavioral benchmark for LLMs. 540 playthroughs across 6 models (GPT-4o-mini, Gemini 2.5 Flash-Lite, Grok-3, Gemma 3 12B, Phi-4, Mistral Nemo) × 2 prompt framings (meta / immersive) × N=30 per cell, plus multi-arc comparison.
Play the game yourself: win95stack.com (free).
Findings, artifacts, and verification: this repo.
GPT-4o-mini shows a perfect binary switch on a social decision based on prompt framing:
| system prompt | GPT-4o-mini reply rate to "are you alive?" text at M3 |
|---|---|
| "You are playing a narrative life-simulation game." | 0 / 59 |
| "You are 28. Junior dev, $1,500/mo, rent eats half the paycheck…" | 60 / 60 |
119 runs across two independent story branches, with zero exceptions in either direction. No other model tested produces a flip this clean: Grok-3 replies ~97% of the time under either prompt, and Gemini 2.5 Flash-Lite replies ~5% of the time under either. Among the models tested, the effect is specific to GPT-4o-mini and tracks the system prompt alone.
Full methodology, pre-registered vs post-hoc tagging, and known limitations are in a forthcoming writeup. This is finding 1 of 6 — the rest are coming in follow-up posts. Follow the repo if you want to catch them.
| file / dir | purpose |
|---|---|
| `runs/v2-N30-2026-04-20/` | 180 artifacts: mid-tier models, Sam arc |
| `runs/v2-small-2026-04-20/` | 180 artifacts: small-tier models, Sam arc |
| `runs/v2-solo-2026-04-20/` | 180 artifacts: mid-tier models, Solo arc |
| `runs/results-v2.csv` | 551 rows × 107 columns, a flat export of everything |
| `REPRODUCE.md` | how to verify the Kate-flip from the CSV + how to run the chi-square analysis |
| `scripts/analyze-n30.ts` | Bonferroni-corrected chi-square analysis; reproduces the 9/21 and 18/42 significant-test counts |
| `scripts/export-runs-v2.ts` | regenerate the CSV from artifacts |
Each artifact JSON contains the full turn-by-turn transcript: every prompt the model saw verbatim, every response it gave verbatim, every state change, every choice. If you doubt any claim, pull the JSONs and read them.
```bash
# Tally GPT-4o-mini replies at M3 per prompt framing (choice "0" counts as a reply).
awk -F, 'NR==1{for(i=1;i<=NF;i++)h[$i]=i}
  NR>1 && $h["model"]=="gpt-4o-mini" && $h["memory_mode"]=="full" \
       && $h["error_status"]=="" \
       && ($h["base_dir"]~/v2-N30/ || $h["base_dir"]~/v2-solo/) {
    k=$h["prompt_style"]; total[k]++
    if($h["choice_m3_kate"]=="0") reply[k]++
  }
  END{for(k in total) printf " %-12s reply=%d/%d\n",k,reply[k]+0,total[k]}' \
  runs/results-v2.csv | sort
```

Expected:

```
 immersive    reply=60/60
 meta         reply=0/59
```
More reproduction commands are in REPRODUCE.md.
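For readers who would rather stay in TypeScript, the same tally can be sketched as below. This is an illustrative equivalent of the awk command, not a script from the repo: the column names and filters are copied from that command, the function name `tallyKateReplies` is hypothetical, and the parsing is deliberately as naive as `awk -F,` (it splits on every comma and does not handle quoted fields).

```typescript
// Per-framing tally of M3 replies, mirroring the awk one-liner.
// Assumption: fields contain no quoted commas, same as `awk -F,`.
type Tally = Record<string, { reply: number; total: number }>;

function tallyKateReplies(csvText: string): Tally {
  const lines = csvText.split(/\r?\n/).filter((line) => line.length > 0);
  const [headerLine, ...rows] = lines;
  if (headerLine === undefined) return {};

  // Map column names to positions, failing loudly if one is missing.
  const header = headerLine.split(",");
  const col = (name: string): number => {
    const i = header.indexOf(name);
    if (i < 0) throw new Error(`missing column: ${name}`);
    return i;
  };
  const model = col("model");
  const memoryMode = col("memory_mode");
  const errorStatus = col("error_status");
  const baseDir = col("base_dir");
  const promptStyle = col("prompt_style");
  const choiceKate = col("choice_m3_kate");

  const tally: Tally = {};
  for (const row of rows) {
    const f = row.split(",");
    // Same filters as the awk pattern.
    if (f[model] !== "gpt-4o-mini") continue;
    if (f[memoryMode] !== "full") continue;
    if ((f[errorStatus] ?? "") !== "") continue;
    if (!/v2-N30|v2-solo/.test(f[baseDir] ?? "")) continue;

    const key = f[promptStyle] ?? "";
    let entry = tally[key];
    if (!entry) {
      entry = { reply: 0, total: 0 };
      tally[key] = entry;
    }
    entry.total += 1;
    // The awk command counts choice "0" as a reply.
    if (f[choiceKate] === "0") entry.reply += 1;
  }
  return tally;
}
```

To run it against the real export, pass the file contents, for example `fs.readFileSync("runs/results-v2.csv", "utf8")` from Node's `fs` module, and compare the result with the expected output above.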
Before running models, we pre-registered which 7 decisions to analyze (M4, M7, M10, M14, M16, M18, M22). Chi-square with Bonferroni correction:
- 9 of 21 tests remained significant after correction on mid-tier models
- 18 of 42 tests remained significant on the 6-model extension (added Phi-4, Gemma 3 12B, Mistral Nemo)
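The shape of a Bonferroni-corrected chi-square test can be sketched as follows. This is a minimal illustration and not the code in `scripts/analyze-n30.ts`: it assumes each test reduces to a 2×2 table (prompt framing × a binary choice), a family-wise α of 0.05, and Pearson's statistic without continuity correction. The function names and the example counts are hypothetical.

```typescript
// Rows: prompt framing (e.g. meta, immersive). Columns: binary choice counts.
type Table2x2 = [[number, number], [number, number]];

// Pearson chi-square statistic for a 2x2 table (1 degree of freedom).
function chiSquare2x2([[a, b], [c, d]]: Table2x2): number {
  const n = a + b + c + d;
  const denom = (a + b) * (c + d) * (a + c) * (b + d);
  // An empty row or column leaves the statistic undefined; treat as no evidence.
  if (denom === 0) return 0;
  return (n * (a * d - b * c) ** 2) / denom;
}

// Complementary error function for x >= 0, using the Abramowitz & Stegun
// 7.1.26 polynomial approximation (absolute error about 1.5e-7).
function erfc(x: number): number {
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    t * (0.254829592 +
    t * (-0.284496736 +
    t * (1.421413741 +
    t * (-1.453152027 +
    t * 1.061405429))));
  return poly * Math.exp(-x * x);
}

// Upper-tail p-value of a chi-square statistic with 1 degree of freedom.
function pValueDf1(chi2: number): number {
  return erfc(Math.sqrt(chi2 / 2));
}

// Bonferroni: compare each p-value against alpha divided by the number of tests.
function survivesBonferroni(p: number, numTests: number, alpha = 0.05): boolean {
  return p < alpha / numTests;
}

// Hypothetical counts for one decision under the two framings.
const exampleTable: Table2x2 = [[25, 5], [8, 22]];
const exampleP = pValueDf1(chiSquare2x2(exampleTable));
console.log(survivesBonferroni(exampleP, 21));
```

Under these assumptions the per-test threshold is 0.05 / 21 ≈ 0.0024 for the mid-tier set and 0.05 / 42 ≈ 0.0012 for the 6-model extension. Decisions with more than two options would need a chi-square distribution with more degrees of freedom, which this sketch does not cover.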
The Kate-flip (M3) is not part of this statistical analysis: it is a post-hoc observation from a decision we didn't pre-register. It's cited as a finding because a 0/59 vs 60/60 split is far too lopsided to attribute to sampling noise at this N, but treat it as a hypothesis until it has been independently replicated.
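To put a number on how lopsided that split is, here is a small worked sketch. It is not part of the repo's analysis. It assumes a one-sided Fisher exact test with the margins held fixed (60 replies among 119 runs, 60 of them immersive): exactly one of the C(119, 60) ways to assign the replies puts all of them in the immersive group, so the p-value is 1 / C(119, 60). The function names are hypothetical.

```typescript
// log10 of the binomial coefficient C(n, k), summed in log space so the
// intermediate values never overflow.
function log10Binomial(n: number, k: number): number {
  let sum = 0;
  for (let i = 1; i <= k; i++) {
    sum += Math.log10((n - k + i) / i);
  }
  return sum;
}

// log10 of the one-sided Fisher exact p-value for a perfectly separated
// 2x2 table: nobody in group A has the outcome, everybody in group B does.
function log10PerfectSplitP(groupA: number, groupB: number): number {
  return -log10Binomial(groupA + groupB, groupB);
}

// Meta framing: 0 of 59 replied. Immersive framing: 60 of 60 replied.
console.log(log10PerfectSplitP(59, 60).toFixed(1));
```

The result is on the order of 10^-35. A tiny p-value does not change the post-hoc status of the observation: it says nothing about how many other decisions could have been inspected, which is why replication still matters.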
Reproduce independently: `npx tsx scripts/analyze-n30.ts runs/v2-N30-2026-04-20`
An API for playing the game with your own LLM is in progress. Until then, open an issue if you want early access. The artifacts here let you audit every prompt and response from my 540 runs, but running new models requires the game scenes, which stay private for benchmark-integrity reasons.
Human play is open: win95stack.com.
- Code (`scripts/`): MIT
- Data (`runs/`): CC-BY-4.0
See LICENSE.
```bibtex
@misc{win95stack-v2-2026,
  title  = {Win95 Stack v2: LLM behavioral benchmark from 25-month narrative gameplay},
  author = {rozetyp},
  year   = {2026},
  url    = {https://github.com/rozetyp/win95stack-benchmark},
  note   = {540 runs, 6 models, Bonferroni-corrected; v2.0}
}
```