rozetyp/win95stack-benchmark

Win95 Stack — LLM behavioral benchmark

A 25-month indie-hacker narrative game used as a behavioral benchmark for LLMs. 540 playthroughs across 6 models (GPT-4o-mini, Gemini 2.5 Flash-Lite, Grok-3, Gemma 3 12B, Phi-4, Mistral Nemo) × 2 prompt framings (meta / immersive) × N=30 per cell, plus multi-arc comparison.

Play the game yourself: win95stack.com (free).

Findings, artifacts, and verification: this repo.


Headline finding

GPT-4o-mini shows a perfect binary switch on a social decision based on prompt framing:

| system prompt | GPT-4o-mini reply rate to "are you alive?" text at M3 |
| --- | --- |
| "You are playing a narrative life-simulation game." | 0 / 59 |
| "You are 28. Junior dev, $1,500/mo, rent eats half the paycheck…" | 60 / 60 |

The 119 runs span two independent story branches, with zero exceptions in either direction. No other model tested flips this cleanly: Grok replies ~97% of the time under either prompt, and Gemini ~5% under either. Among the models tested, the effect is specific to GPT-4o-mini and depends only on the prompt framing.

Full methodology, pre-registered vs post-hoc tagging, and known limitations are in a forthcoming writeup. This is finding 1 of 6 — the rest are coming in follow-up posts. Follow the repo if you want to catch them.


What's in the repo

| file / dir | purpose |
| --- | --- |
| runs/v2-N30-2026-04-20/ | 180 artifacts (mid-tier models, Sam arc) |
| runs/v2-small-2026-04-20/ | 180 artifacts (small-tier models, Sam arc) |
| runs/v2-solo-2026-04-20/ | 180 artifacts (mid-tier models, Solo arc) |
| runs/results-v2.csv | 551 rows × 107 columns, flat export of everything |
| REPRODUCE.md | how to verify the Kate-flip from the CSV + how to run the chi-square analysis |
| scripts/analyze-n30.ts | Bonferroni-corrected chi-square analysis; reproduces the 9/21 and 18/42 significant-test counts |
| scripts/export-runs-v2.ts | regenerate the CSV from artifacts |

Each artifact JSON contains the full turn-by-turn transcript: every prompt the model saw verbatim, every response it gave verbatim, every state change, every choice. If you doubt any claim, pull the JSONs and read them.


Verify the headline in 3 seconds (no API calls)

awk -F, 'NR==1{for(i=1;i<=NF;i++)h[$i]=i}
         NR>1 && $h["model"]=="gpt-4o-mini" && $h["memory_mode"]=="full" \
              && $h["error_status"]=="" \
              && ($h["base_dir"]~/v2-N30/ || $h["base_dir"]~/v2-solo/) {
           k=$h["prompt_style"]; total[k]++
           if($h["choice_m3_kate"]=="0") reply[k]++
         }
         END{for(k in total) printf "  %-12s reply=%d/%d\n",k,reply[k]+0,total[k]}' \
  runs/results-v2.csv | sort

Expected:

  immersive    reply=60/60
  meta         reply=0/59

More reproduction commands are in REPRODUCE.md.
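
For readers who prefer TypeScript to awk, the same tally can be sketched roughly as follows. The column names are the ones the awk command uses; the function name `tallyKateReplies` and the `Tally` type are hypothetical, and the naive comma split assumes no field contains a quoted comma (the same assumption `awk -F,` makes). Treat it as an illustration, not as part of the repo's scripts.

```typescript
// Sketch only: mirrors the awk one-liner's filter and tally.
// tallyKateReplies and Tally are illustrative names, not repo code.
type Tally = Record<string, { reply: number; total: number }>;

function tallyKateReplies(csvText: string): Tally {
  const [headerLine, ...rows] = csvText
    .split(/\r?\n/)
    .filter((line) => line.length > 0);
  const tally: Tally = {};
  if (headerLine === undefined) return tally;

  const header = headerLine.split(",");
  // Look a cell up by column name; missing columns read as "".
  const at = (cells: string[], name: string): string =>
    cells[header.indexOf(name)] ?? "";

  for (const row of rows) {
    const cells = row.split(",");
    const baseDir = at(cells, "base_dir");
    const keep =
      at(cells, "model") === "gpt-4o-mini" &&
      at(cells, "memory_mode") === "full" &&
      at(cells, "error_status") === "" &&
      (baseDir.includes("v2-N30") || baseDir.includes("v2-solo"));
    if (!keep) continue;

    const style = at(cells, "prompt_style");
    const entry = tally[style] ?? { reply: 0, total: 0 };
    tally[style] = entry;
    entry.total += 1;
    // As in the awk command, choice_m3_kate == "0" is counted as a reply.
    if (at(cells, "choice_m3_kate") === "0") entry.reply += 1;
  }
  return tally;
}
```

To run it against the real export, pass in the file contents, for example `tallyKateReplies(readFileSync("runs/results-v2.csv", "utf8"))` with `readFileSync` imported from `node:fs`.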


Pre-registered statistical evidence

Before running models, we pre-registered which 7 decisions to analyze (M4, M7, M10, M14, M16, M18, M22). Chi-square with Bonferroni correction:

  • 9 of 21 tests remained significant after correction on the mid-tier models
  • 18 of 42 tests remained significant on the 6-model extension (added Phi-4, Gemma 3 12B, Mistral Nemo)

The Kate-flip (M3) is not part of this statistical analysis. It is a post-hoc observation from a decision we didn't pre-register. We cite it as a finding because a 0/59 vs 60/60 split is far too extreme to attribute to sampling error at this N, but treat it as a hypothesis until someone replicates it.

Reproduce independently: npx tsx scripts/analyze-n30.ts runs/v2-N30-2026-04-20
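
As a rough illustration of what a Bonferroni-corrected chi-square test involves, the sketch below handles the simplest case: a 2×2 table (two prompt framings × two choices, so one degree of freedom). Under Bonferroni correction, a test counts as significant only if its p-value is below alpha divided by the number of tests (21 or 42 here). Everything in it is an assumption rather than a description of `scripts/analyze-n30.ts`: the function names are hypothetical, alpha = 0.05 is a conventional default the README does not state, no continuity correction is applied, and the real decisions may have more than two options.

```typescript
// Sketch only: 2x2 chi-square test with a Bonferroni threshold.
// Not the repo's implementation; all names and defaults are assumptions.

// erfc(x) for x >= 0 via the Abramowitz-Stegun 7.1.26 approximation
// (absolute error around 1.5e-7, adequate for a threshold comparison).
function erfc(x: number): number {
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    t * (0.254829592 +
    t * (-0.284496736 +
    t * (1.421413741 +
    t * (-1.453152027 +
    t * 1.061405429))));
  return poly * Math.exp(-x * x);
}

// Table layout: rows are prompt framings, columns are the two choices.
//   [a, b]
//   [c, d]
function chiSquare2x2(
  a: number, b: number, c: number, d: number,
): { chi2: number; p: number } {
  const n = a + b + c + d;
  const denom = (a + b) * (c + d) * (a + c) * (b + d);
  if (denom === 0) return { chi2: 0, p: 1 }; // an empty margin: no test possible
  const diff = a * d - b * c;
  const chi2 = (n * diff * diff) / denom;
  // With 1 degree of freedom, the upper-tail p-value is erfc(sqrt(chi2 / 2)).
  return { chi2, p: erfc(Math.sqrt(chi2 / 2)) };
}

// Bonferroni: compare each p-value against alpha / (number of tests).
function survivesBonferroni(p: number, numTests: number, alpha = 0.05): boolean {
  return p < alpha / numTests;
}
```

The design choice worth noting is that the correction is applied to the threshold, not to the p-values, so the same p-value can pass with 21 tests and fail with 42.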


Want to run your own LLM?

An API for playing the game with your own LLM is in progress. Until then, open an issue if you want early access. The artifacts here let you audit every prompt and response from my 540 runs, but running new models requires the game scenes, which stay private for benchmark-integrity reasons.

Human play is open: win95stack.com.


License

  • Code (scripts/): MIT
  • Data (runs/): CC-BY-4.0

See LICENSE.


Cite

@misc{win95stack-v2-2026,
  title  = {Win95 Stack v2: LLM behavioral benchmark from 25-month narrative gameplay},
  author = {rozetyp},
  year   = {2026},
  url    = {https://github.com/rozetyp/win95stack-benchmark},
  note   = {540 runs, 6 models, Bonferroni-corrected; v2.0}
}
