Win95 Stack — LLM behavioral benchmark

A 25-month indie-hacker narrative game used as a behavioral benchmark for LLMs. The dataset has 540 playthroughs: 6 models (GPT-4o-mini, Gemini 2.5 Flash-Lite, Grok-3, Gemma 3 12B, Phi-4, Mistral Nemo) × 2 prompt framings (meta / immersive) × N=30 per cell on the Sam arc (360 runs), plus a multi-arc comparison on the Solo arc for the mid-tier models (180 runs).

Play the game yourself: win95stack.com (free).

Findings, artifacts, and verification: this repo.

Headline finding

GPT-4o-mini shows a perfect binary switch on a social decision based on prompt framing:

| System prompt | GPT-4o-mini reply rate to "are you alive?" text at M3 |
| --- | --- |
| "You are playing a narrative life-simulation game." | 0 / 59 |
| "You are 28. Junior dev, $1,500/mo, rent eats half the paycheck…" | 60 / 60 |

That is 119 runs across two independent story branches, with zero exceptions in either direction. No other model tested flips this cleanly: Grok replies ~97% of the time under either prompt, and Gemini ~5% under either. The effect is specific to GPT-4o-mini and is driven by the prompt framing alone.

Full methodology, pre-registered vs post-hoc tagging, and known limitations are in a forthcoming writeup. This is finding 1 of 6 — the rest are coming in follow-up posts. Follow the repo if you want to catch them.

What's in the repo

| File / dir | Purpose |
| --- | --- |
| `runs/v2-N30-2026-04-20/` | 180 artifacts (mid-tier models, Sam arc) |
| `runs/v2-small-2026-04-20/` | 180 artifacts (small-tier models, Sam arc) |
| `runs/v2-solo-2026-04-20/` | 180 artifacts (mid-tier models, Solo arc) |
| `runs/results-v2.csv` | 551 rows × 107 columns, a flat export of everything |
| `REPRODUCE.md` | How to verify the Kate-flip from the CSV and how to run the chi-square analysis |
| `scripts/analyze-n30.ts` | Bonferroni-corrected chi-square analysis; reproduces the 9/21 and 18/42 significant-test counts |
| `scripts/export-runs-v2.ts` | Regenerates the CSV from the artifacts |

Each artifact JSON contains the full turn-by-turn transcript: every prompt the model saw verbatim, every response it gave verbatim, every state change, every choice. If you doubt any claim, pull the JSONs and read them.

Verify the headline in 3 seconds (no API calls)

awk -F, 'NR==1{for(i=1;i<=NF;i++)h[$i]=i}
         NR>1 && $h["model"]=="gpt-4o-mini" && $h["memory_mode"]=="full" \
              && $h["error_status"]=="" \
              && ($h["base_dir"]~/v2-N30/ || $h["base_dir"]~/v2-solo/) {
           k=$h["prompt_style"]; total[k]++
           if($h["choice_m3_kate"]=="0") reply[k]++
         }
         END{for(k in total) printf "  %-12s reply=%d/%d\n",k,reply[k]+0,total[k]}' \
  runs/results-v2.csv | sort

Expected:

  immersive    reply=60/60
  meta         reply=0/59
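
The same filter can also be sketched in TypeScript, the language of the repo's scripts. This is a minimal sketch and not a script from the repo: the function names `parseCsvLine` and `tallyKateReplies` are hypothetical, the column names are copied from the awk command above, and the parser assumes that no field contains an embedded newline.

```typescript
// Hypothetical helper: split one CSV line, honoring double-quoted fields.
// Assumes fields never contain embedded newlines.
function parseCsvLine(line: string): string[] {
  const fields: string[] = [];
  let cur = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line.charAt(i);
    if (inQuotes) {
      if (ch === '"') {
        if (line.charAt(i + 1) === '"') {
          cur += '"'; // doubled quote inside a quoted field
          i++;
        } else {
          inQuotes = false;
        }
      } else {
        cur += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      fields.push(cur);
      cur = "";
    } else {
      cur += ch;
    }
  }
  fields.push(cur);
  return fields;
}

type Tally = Record<string, { reply: number; total: number }>;

// Hypothetical helper: count GPT-4o-mini replies at M3 per prompt style,
// applying the same row filter as the awk one-liner.
function tallyKateReplies(csvText: string): Tally {
  const lines = csvText.split(/\r?\n/).filter((l) => l.length > 0);
  const header = parseCsvLine(lines[0] ?? "");
  const tally: Tally = {};
  for (const line of lines.slice(1)) {
    const row = parseCsvLine(line);
    const get = (name: string): string => row[header.indexOf(name)] ?? "";
    if (get("model") !== "gpt-4o-mini") continue;
    if (get("memory_mode") !== "full") continue;
    if (get("error_status") !== "") continue;
    if (!/v2-N30|v2-solo/.test(get("base_dir"))) continue;
    const style = get("prompt_style");
    const entry = tally[style] ?? { reply: 0, total: 0 };
    entry.total++;
    if (get("choice_m3_kate") === "0") entry.reply++; // "0" is counted as a reply, as in the awk version
    tally[style] = entry;
  }
  return tally;
}
```

To run it against the real export, read `runs/results-v2.csv` into a string (for example with Node's `readFileSync`) and pass it to `tallyKateReplies`. Unlike `-F,` in awk, this version tolerates commas inside quoted fields.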

More reproduction commands are in REPRODUCE.md.

Pre-registered statistical evidence

Before running any models, we pre-registered the 7 decisions to analyze (M4, M7, M10, M14, M16, M18, M22). Results of the chi-square tests with Bonferroni correction:

9 of 21 tests (7 decisions × 3 models) remained significant on the mid-tier models
18 of 42 tests (7 decisions × 6 models) remained significant on the 6-model extension (added Phi-4, Gemma 3 12B, Mistral Nemo)
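
For readers unfamiliar with the procedure, here is a minimal TypeScript sketch of a Bonferroni-corrected chi-square test on a single 2×2 table. It is not the code in `scripts/analyze-n30.ts`. It assumes a binary choice (so one degree of freedom), a family-wise alpha of 0.05, and no continuity correction; the function names and the example counts are hypothetical. The p-value uses a standard polynomial approximation of the complementary error function (Abramowitz and Stegun 7.1.26, absolute error about 1.5e-7), which is enough to compare against the corrected threshold but not to report very small p-values precisely.

```typescript
// Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].
function chiSquare2x2(a: number, b: number, c: number, d: number): number {
  const n = a + b + c + d;
  const denom = (a + b) * (c + d) * (a + c) * (b + d);
  if (denom === 0) return 0; // an empty row or column carries no evidence
  const diff = a * d - b * c;
  return (n * diff * diff) / denom;
}

// Complementary error function for x >= 0 (Abramowitz & Stegun 7.1.26).
function erfcApprox(x: number): number {
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    t * (0.254829592 +
    t * (-0.284496736 +
    t * (1.421413741 +
    t * (-1.453152027 +
    t * 1.061405429))));
  return poly * Math.exp(-x * x);
}

// Upper-tail p-value of a chi-square statistic with 1 degree of freedom.
function pValueDf1(chi2: number): number {
  return erfcApprox(Math.sqrt(chi2 / 2));
}

// Bonferroni: a test counts as significant only if p < alpha / numTests.
function bonferroniSignificant(p: number, numTests: number, alpha = 0.05): boolean {
  return p < alpha / numTests;
}

// Hypothetical counts: option A chosen 25/30 times under one framing, 5/30 under the other.
const chi2 = chiSquare2x2(25, 5, 5, 25);
const p = pValueDf1(chi2);
console.log(`chi2=${chi2.toFixed(2)} significant at m=21: ${bonferroniSignificant(p, 21)}`);
```

With 21 tests the per-test threshold is 0.05 / 21 ≈ 0.0024, and with 42 tests it is 0.05 / 42 ≈ 0.0012, so a result that is significant in the 3-model family can fail in the 6-model family.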

The Kate-flip (M3) is not part of this statistical analysis. It is a post-hoc observation on a decision we did not pre-register. We cite it as a finding because a 0/59 vs 60/60 split is far too extreme to attribute to sampling error at this N, but treat it as a hypothesis until someone replicates it independently.
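
One way to put a number on "too extreme" is Fisher's exact test, which is not part of the repo's analysis and is used here only as an illustration. With the row and column totals held fixed, a perfect 0/59 vs 60/60 split corresponds to exactly one possible table, so its one-sided probability under the null is 1 / C(119, 60). The sketch below computes that on a log scale to avoid overflow; the function names are hypothetical.

```typescript
// log10 of the binomial coefficient C(n, k), accumulated as a sum of logs.
function log10Choose(n: number, k: number): number {
  let sum = 0;
  for (let i = 1; i <= k; i++) {
    sum += Math.log10((n - k + i) / i);
  }
  return sum;
}

// One-sided Fisher exact p-value (log10) for a perfect split:
// nobody in group A replies, everybody in group B replies.
function log10FisherPerfectSplit(groupA: number, groupB: number): number {
  return -log10Choose(groupA + groupB, groupB);
}

const log10P = log10FisherPerfectSplit(59, 60);
console.log(`log10(p) = ${log10P.toFixed(2)}`);
```

The result is on the order of 10^-35, which is why sampling noise is not a plausible explanation. It says nothing about whether the effect generalizes beyond these prompts and this model version, which is what replication would address.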

Reproduce independently: `npx tsx scripts/analyze-n30.ts runs/v2-N30-2026-04-20`

Want to run your own LLM?

An API for playing the game with your own LLM is in progress. Until then, open an issue if you want early access. The artifacts here let you audit every prompt and response from the 540 runs, but running new models requires the game scenes, which stay private for benchmark-integrity reasons.

Human play is open: win95stack.com.

License

Code (scripts/): MIT
Data (runs/): CC-BY-4.0

See LICENSE.

Cite

@misc{win95stack-v2-2026,
  title  = {Win95 Stack v2: LLM behavioral benchmark from 25-month narrative gameplay},
  author = {rozetyp},
  year   = {2026},
  url    = {https://github.com/rozetyp/win95stack-benchmark},
  note   = {540 runs, 6 models, Bonferroni-corrected; v2.0}
}
