A full-stack 3D chess game where you journey through twenty ages of human history — from the age of dinosaurs to transcendent cosmic realms — powered by a custom Rust chess engine compiled to WebAssembly.
- Playable right now — open this link and you're in a 3D chess game. No install, no account, no loading screen.
- Chess engine runs entirely in your browser: a custom Rust engine compiled to WebAssembly (~5M positions/sec). AI games put no load on the backend, so they cost nothing to serve.
- Real-time multiplayer with persistence — Socket.io WebSocket server, JWT auth, guest play, ELO matchmaking, game rooms, reconnection handling, Prisma/SQLite storage.
- 854 tests across 3 languages, production-hardened — Vitest + cargo test + Playwright E2E (4 suites) + 3 k6 load test suites. Rate limiting, Helmet.js security headers, graceful shutdown, crash recovery.
Stack: TypeScript · Three.js · Rust · WebAssembly · Node.js · Express · Socket.io · Prisma · SQLite · Zod · Playwright · Vitest · k6 · Docker · Fly.io · Vercel
| Claim | Proof |
|---|---|
| Playable game | 🎮 Play Now |
| Server is running | 📈 Health Check · 📊 Prometheus Metrics (public — METRICS_TOKEN not set in this deployment) |
| 420 frontend unit tests | npm test |
| 218 Rust engine tests | cd rust-engine && cargo test |
| 168 server tests | cd server && npm test |
| 48 E2E browser tests (4 suites) | npx playwright test |
| k6 load testing | k6 run load-tests/http-load-test.js — methodology ↓ |
| Perft correctness | Depth 5 = 4,865,609 nodes ✅ — cargo test perft |
| Security hardening | Security Posture ↓ |
- Perft-validated move generation — engine matches all standard node counts through depth 5 (`cargo test perft`)
- E2E Playwright tests that play real games — automated agent makes legal moves, verifies board state, checks for crashes (`npx playwright test`)
- Prometheus metrics + health check — 16 custom metrics, live `/health` and `/metrics` endpoints, k6 SLO validation
- Offline / PWA support — installable on mobile, service worker caching, Android hybrid via Capacitor
- Zod protocol validation — every WebSocket message is schema-validated with version enforcement (`v: 1`)
I'm the sole maintainer and take responsibility for correctness, security, and performance. I use AI-assisted tooling where helpful, but I review every change, write tests, and validate behavior with E2E and benchmarks.
- Role: Solo owner — design, implementation, testing, deployment
- Standard: No change lands without tests passing, E2E green, TypeScript strict mode clean
- AI policy: AI-assisted code is allowed. I review, refactor, and verify. I can explain and extend every component.
- Proof hooks: `window.__GAME__` and `window.__RENDERER__` are exposed for E2E test automation — Playwright tests use these to make real moves and inspect board state
🎮 Play it live — loads in under 2 seconds, no install required
3D Staunton pieces · 20 historical eras · Stockfish AI · real-time multiplayer
Screenshot/GIF coming soon — in the meantime, the live link above is the best demo.
| What you want | Where to find it | Time |
|---|---|---|
| See the game running | 🎮 Play Now | 10 sec |
| Stack + resume bullets | Impact ↑ + Stack ↑ | 30 sec |
| Proof (tests, metrics, links) | Evidence ↑ | 1 min |
| Talking points for interview | Why It's Interesting ↓ | 1 min |
| Interview drill questions | Interview Drill ↓ | 2 min |
| What you want | Where to find it | Time |
|---|---|---|
| Architecture + data boundaries | Architecture ↓ | 1 min |
| Engine internals (bitboards, search) | Section B ↓ / Section C ↓ | 5–30 min |
| Multiplayer protocol (Zod schemas) | B11 ↓ + Protocol ↓ | 3 min |
| Security posture + threat model | Security Posture ↓ | 2 min |
| Performance numbers (reproducible) | Performance ↓ | 2 min |
| System invariants | Invariants ↓ | 1 min |
| Testing strategy | D10 ↓ | 2 min |
| Load testing + SLOs | D14 ↓ | 3 min |
| Deploy + operations | A7 ↓ / Section F ↓ | 5 min |
| What you want | Where to find it | Time |
|---|---|---|
| Clone + play in 2 minutes | Quick Start ↓ | 2 min |
| Full IKEA-style setup guide | Section A ↓ | 10 min |
| Rebuild WASM engine from source | A6 ↓ | 5 min |
Each part is also available as a standalone document:
| Part | Standalone |
|---|---|
| Summary | docs/PART1_SUMMARY.md |
| Tech Stack | docs/PART2_TECH_STACK.md |
| Quick Start | docs/PART3_QUICK_START.md |
| Full Tutorial | docs/PART4_FULL_TUTORIAL.md |
30 seconds. What this is, what it does, why it matters.
A chess game that combines:
- Custom Rust chess engine compiled to WebAssembly (bitboards, magic bitboards, alpha-beta search, transposition tables)
- 3D rendering with Three.js — 20 procedurally generated era environments with procedural skyboxes, L-system trees, Lorenz attractor particles, and dynamic lighting
- 24 piece styles (7 3D + 17 2D canvas-drawn including Art Deco, Steampunk, and Tribal) and 12 board visual styles with per-style theme-aware highlights
- 8 UI themes (Newspaper, Obsidian, Arctic, Ember, Jade, Dusk, Ivory, Cobalt) with full CSS variable theming via `themeSystem.ts`
- Welcome Dashboard — newspaper-themed landing screen with game mode buttons, difficulty/GFX preferences, and a live stats ribbon (ELO, wins, streak, level). Every pre-game option in one glance.
- Classic Mode — one-button toggle to a chess.com / lichess-style dark UI, hides newspaper chrome, perfect for mobile stealth play
- Graphics Quality presets — Low / Medium / High with per-preset control over shadows, particles, skybox, environment, and render scale
- AI Aggression system — 20-level slider controlling bonus pieces, board rearrangement, and pawn upgrades
- Real-time multiplayer via Socket.io with ELO matchmaking, JWT auth, guest play, and game persistence
- Progressive Web App — installable on mobile, offline-capable, with Android hybrid build via Capacitor
- Stability hardening — click debounce, input lock, RAF coalescing, WebGL context-loss toast, Three.js disposal
| Talking Point | Detail |
|---|---|
| Systems programming | Rust engine: bitboard move gen, magic bitboard lookups, Zobrist hashing — all compiled to WASM |
| Full-stack ownership | Frontend (TS + Three.js), backend (Node + Express + Prisma), engine (Rust), infra (Docker + Fly.io) |
| Testing discipline | 854 tests across 4 test suites: 218 Rust (cargo test) + 420 frontend (Vitest) + 168 server (Vitest) + 48 E2E Playwright browser tests |
| Performance engineering | Engine does ~5M positions/sec in WASM. Magic bitboards replace per-square ray-casting loops for sliding pieces with an O(1) table lookup |
| Graceful degradation | Triple AI fallback: Rust WASM → Stockfish.js Worker → TypeScript minimax. Game always works. |
| Production resilience | Rate limiting (HTTP + WS), graceful shutdown, crash recovery, Helmet.js security headers, k6 load testing |
| UI / UX polish | 8 full themes, Classic Mode stealth toggle, 3-tier GFX quality, stability hardening (debounce, RAF coalescing, WebGL recovery) |
| Large-scale AI experimentation | 1-million-player tournament runner with Swiss pairing, A/B testing, rayon parallelism, SQLite analytics |
| Metric | Value |
|---|---|
| Rust engine source | 12 files, ~7,000 lines (includes 866-line tournament runner) |
| Frontend source | 40+ files, TypeScript (renderer3d.ts alone is 5,000+ lines) |
| Server source | 10+ files, 1,020-line main server + resilience module |
| Load test scripts | 3 k6 scripts (HTTP, WebSocket, stress) |
| Perft correctness | Matches all standard values through depth 5 (4,865,609 nodes) |
| WASM binary | ~170 KB gzipped |
| Piece styles | 24 total — 7 3D + 17 2D canvas-drawn (Art Deco, Steampunk, Tribal, Celtic, Gothic, Pixel, and more) |
| Board styles | 12 with per-style theme-aware highlight colors |
| UI themes | 8 full themes (Newspaper, Obsidian, Arctic, Ember, Jade, Dusk, Ivory, Cobalt) |
| Classic Mode | One-button dark chess.com-style UI — hides newspaper chrome |
| Graphics Quality | 3 presets (Low / Med / High) — shadows, particles, skybox, render scale |
| Era environments | 20 with procedural skyboxes, dynamic lighting, L-system trees, and particle systems |
| Test count | 806 unit + 48 E2E Playwright (854 total) across 3 languages |
| Prometheus metrics | 16 custom metrics + Node.js defaults |
1 minute. What's used, how it fits together, and the key design decisions.
| Layer | Technology | Why |
|---|---|---|
| Frontend | TypeScript, Three.js, Vite | WebGL 3D rendering, zero-framework for canvas-heavy app |
| Chess Engine | Rust → WebAssembly (wasm-bindgen) | 10–100× faster than JS, runs client-side for zero server cost |
| Multiplayer | Node.js, Express, Socket.io | Real-time WebSocket with HTTP long-polling fallback |
| Database | Prisma ORM, SQLite (dev/prod) | Type-safe queries, zero-config dev, persistent volume in prod |
| Auth | JWT + bcryptjs | Stateless auth, guest accounts with optional registration |
| Security | Helmet.js, express-rate-limit, CORS | Security headers, brute-force protection, origin whitelisting |
| Metrics | Prometheus (prom-client) | 16 custom metrics + Node.js defaults, /metrics endpoint |
| Load Testing | k6 (Grafana) | HTTP, WebSocket, and stress test scripts with SLO thresholds |
| AI Tournament | Rust (rayon, clap, rusqlite) | 1M-player Swiss tournament with A/B testing and parallel execution |
| Testing | Vitest + cargo test + Playwright | Unit, integration, E2E across all 3 languages |
| Deploy | Vercel (frontend), Docker + Fly.io (server) | Edge CDN for static, persistent VM for WebSocket server |
```
┌──────────────────────────────────────────────────────────────┐
│                           Browser                            │
│                                                              │
│  ┌──────────┐   ┌─────────────┐   ┌───────────────┐          │
│  │ Three.js │   │    Game     │   │   Socket.io   │          │
│  │ Renderer │◄──┤ Controller  ├──►│    Client     │          │
│  └──────────┘   └──────┬──────┘   └───────┬───────┘          │
│  scene graph,          │                  │                  │
│  piece meshes,         │                  │ JSON messages:   │
│  highlights            │                  │ {type, v:1, ...} │
│              ┌─────────▼─────────┐        │                  │
│              │   Engine Bridge   │        │ · create_table   │
│              │   (TypeScript)    │        │ · join_table     │
│              └─────────┬─────────┘        │ · make_move      │
│                        │                  │ · resign         │
│               FEN string + depth          │ · reconnect      │
│                ────────▼────────          │                  │
│              ┌───────────────────┐        │                  │
│              │    Rust Engine    │        │                  │
│              │      (WASM)       │        │                  │
│              └───────────────────┘        │                  │
│        SAN move string ▲                  │                  │
└───────────────────────────────────────────┼──────────────────┘
                                            │ WebSocket (wss://)
                                            │ JWT in handshake
                                  ┌─────────▼──────────┐
                                  │    Chess Server    │
                                  │    Express + WS    │
                                  ├────────────────────┤
                                  │ Zod validation     │ ← all inbound
                                  │ Rate limiting      │ ← per-IP + per-socket
                                  │ Helmet.js headers  │ ← all responses
                                  ├────────────────────┤
                                  │ TableManager       │ open tables model
                                  │ GameRoom           │ chess.js validation
                                  │ ELO calculator     │ K=32 standard
                                  ├────────────────────┤
                                  │ Prisma + SQLite    │
                                  │ users, games,      │
                                  │ ELO history        │
                                  └────────────────────┘
```
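The diagram lists an ELO calculator with K=32. As a rough illustration, the standard Elo update with that K-factor can be sketched as below. The function names are hypothetical, and whether the server rounds ratings or applies floors is not stated here, so this sketch keeps plain floating-point values.

```rust
/// Expected score of a player rated `rating` against `opponent`
/// (the standard logistic Elo curve with a 400-point scale).
fn expected_score(rating: f64, opponent: f64) -> f64 {
    1.0 / (1.0 + 10f64.powf((opponent - rating) / 400.0))
}

/// New rating after one game.
/// `score` is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
fn updated_rating(rating: f64, opponent: f64, score: f64) -> f64 {
    const K: f64 = 32.0; // K-factor named in the architecture diagram
    rating + K * (score - expected_score(rating, opponent))
}
```

Between two equally rated players the expected score is 0.5, so a win is worth 32 × 0.5 = 16 points and a draw changes nothing. The winner's gain always equals the loser's loss because both sides use the same K.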
The engine runs in the browser, not on the server. Three engines cascade for 100% availability:
```
Request → Rust WASM (~1M+ NPS)
              ↓ if WASM fails to load
          Stockfish.js Worker (~200K NPS, skill 0-20)
              ↓ if Worker fails
          TypeScript minimax (~10K NPS, always works)
```
| Decision | Rationale |
|---|---|
| Engine in browser, not server | Zero latency for single-player, zero server cost for AI, AI games scale with zero backend load |
| Vanilla TS, no React | App is 80% canvas. React's virtual DOM adds overhead for <canvas> updates |
| SQLite in production | Portfolio-scale traffic. Persistent Fly.io volume. Avoids Postgres complexity |
| Bitboard representation | O(1) attack lookups via magic bitboards. Industry standard for chess engines |
| 16-bit move encoding | 2 bytes per move. 256-move list fits in 512 bytes (L1 cache) |
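To make the 16-bit move encoding concrete, here is a minimal sketch. The bit layout (6 bits origin, 6 bits destination, 4 bits flags) is an assumption chosen for illustration, as are the type and method names; the engine's actual layout may differ.

```rust
/// A move packed into 16 bits.
/// Assumed layout: bits 0-5 = origin, bits 6-11 = destination, bits 12-15 = flags.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Move(u16);

impl Move {
    fn new(from: u8, to: u8, flags: u8) -> Self {
        debug_assert!(from < 64 && to < 64 && flags < 16);
        Move((from as u16) | ((to as u16) << 6) | ((flags as u16) << 12))
    }
    fn from_sq(self) -> u8 { (self.0 & 0x3F) as u8 }
    fn to_sq(self) -> u8 { ((self.0 >> 6) & 0x3F) as u8 }
    fn flags(self) -> u8 { (self.0 >> 12) as u8 }
}

/// Fixed-capacity move list that lives on the stack.
/// The `moves` array alone is 256 * 2 = 512 bytes.
struct MoveList {
    moves: [Move; 256],
    len: usize,
}

impl MoveList {
    fn new() -> Self {
        MoveList { moves: [Move(0); 256], len: 0 }
    }
    /// Appends a move; indexing panics if more than 256 moves are pushed.
    fn push(&mut self, m: Move) {
        self.moves[self.len] = m;
        self.len += 1;
    }
}
```

With square 0 = a1, `Move::new(12, 28, 1)` would represent e2 to e4 with flag value 1, and the three accessors recover those fields by masking and shifting.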
| Boundary | Threat | Mitigation | Status |
|---|---|---|---|
| HTTP API | Brute force / DDoS | `express-rate-limit` — 100 req/min per IP | ✅ Enforced |
| WebSocket | Message flood | Per-socket rate limit — 20 msg/sec sliding window (`resilience.ts`) | ✅ Enforced |
| WebSocket | Connection flood | Per-IP connection cap — max 10 concurrent (`trackConnection`) | ✅ Enforced |
| Auth | No account required | Guest tokens — play immediately, register optionally | ✅ Enforced |
| Auth | Token theft | JWT (HS256) + bcrypt password hashing. Stateless — no server-side revocation (see trade-offs) | |
| Game moves | Illegal moves | Server-side chess.js validation — rejects and returns error | ✅ Enforced |
| Game moves | Wrong turn | Server checks `playerColor === currentTurn` before accepting | ✅ Enforced |
| Protocol | Malformed messages | Zod schema validation on every inbound WebSocket message | ✅ Enforced |
| Protocol | Version mismatch | `v: 1` literal in every schema — unknown versions rejected | ✅ Enforced |
| Headers | XSS / clickjack / sniffing | Helmet.js — HSTS, X-Frame-Options, nosniff, referrer-policy. CSP enforced via `<meta>` tag + Vercel `vercel.json` headers (not Helmet — disabled to avoid conflicts with WASM/Socket.io) | ✅ Enforced |
| CORS | Origin spoofing | Allowlist: Vercel domain + localhost dev only | ✅ Enforced |
| Rooms | Memory exhaustion | Max 500 active rooms (`canCreateRoom`) | ✅ Enforced |
| Secrets | Key exposure | `JWT_SECRET` set via Fly.io secrets (never in code); `.env.example` documents required vars without real values; rotate secrets on each deploy | ✅ Enforced |
| Supply chain | Dependency vulnerabilities | `npm audit` run before each release; Dependabot enabled on GitHub; lockfile committed | ✅ Enforced |
| Game moves | Engine-assisted cheating | Server validates legality only — no move-quality analysis | |
| Anti-cheat | Statistical detection | Time-per-move / move-quality correlation analysis | 🔲 Planned |
| Server | Horizontal scaling | Single Fly.io VM — no clustering yet | 🔲 Planned |
Honesty note: Anti-cheat beyond legality checking is not implemented. JWT auth is purely stateless — no server-side revocation, no refresh tokens (leaked tokens are valid until 1-day expiry). For ranked multiplayer at scale, the server would need move-quality analysis and token rotation. Current scope: portfolio project with honest, real security hardening for every boundary that IS protected.
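The per-socket limit in the table above (20 messages per second, sliding window) can be illustrated with a small sketch. This is not the code in `resilience.ts`, which runs in Node.js; it is a Rust rendering of the same idea, with hypothetical names, that keeps one timestamp per accepted message and drops those older than the window.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Sliding-window limiter: at most `max_events` accepted within any `window`.
struct SlidingWindowLimiter {
    max_events: usize,
    window: Duration,
    timestamps: VecDeque<Instant>, // accepted events, oldest first
}

impl SlidingWindowLimiter {
    fn new(max_events: usize, window: Duration) -> Self {
        Self { max_events, window, timestamps: VecDeque::new() }
    }

    /// Returns true and records the event if it fits in the window ending at `now`.
    fn allow(&mut self, now: Instant) -> bool {
        // Evict timestamps that have aged out of the window.
        while let Some(&oldest) = self.timestamps.front() {
            if now.duration_since(oldest) >= self.window {
                self.timestamps.pop_front();
            } else {
                break;
            }
        }
        if self.timestamps.len() < self.max_events {
            self.timestamps.push_back(now);
            true
        } else {
            false
        }
    }
}
```

A limiter created with `SlidingWindowLimiter::new(20, Duration::from_secs(1))` accepts the first 20 messages in a burst, rejects the 21st, and starts accepting again as the earliest timestamps fall out of the one-second window. Passing `now` explicitly keeps the logic testable without sleeping.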
All numbers are reproducible. Commands included.
Engine (Rust → WASM)
| Benchmark | Desktop (Chrome) | Mobile (Pixel 7) | How to reproduce |
|---|---|---|---|
| Perft depth 5 (starting pos) | 4,865,609 nodes ✅ | — | cargo test perft |
| Move generation throughput | ~5M positions/sec | ~2M positions/sec | benchmarks/perft.html |
| Depth 5 search | ~300ms | ~700ms | In-game AI response |
| WASM binary size | ~170 KB gzipped | — | ls -la public/wasm/ |
| WASM cold-start init | ~50–100ms | ~150ms | First initEngine() call |
| JS fallback (TypeScript minimax) | ~10K positions/sec | ~5K positions/sec | Automatic if WASM fails |
Definitions:
- positions/sec = perft leaf nodes (fully legal move generation, no bulk-counting shortcuts)
- NPS (in AI Fallback Chain) = search nodes visited including static evaluation + transposition table lookups + move ordering
- Measured on AMD Ryzen 5 5600X, Chrome 131, WASM built with `wasm-pack build --release`. Mobile numbers from Pixel 7, Chrome 131.
Server (Node.js + Express + Socket.io)
| Metric | SLO Target | How to reproduce |
|---|---|---|
| HTTP P95 latency | < 500ms | k6 run load-tests/http-load-test.js |
| HTTP P99 latency | < 1,000ms | Same |
| HTTP error rate | < 5% | Same |
| WS connection P95 | < 2,000ms | k6 run load-tests/websocket-load-test.js |
| WS message P95 | < 500ms | Same |
| WS connection success | > 90% | Same |
| Health check P95 | < 200ms | Same (HTTP test, health scenario) |
| Guest auth P95 | < 800ms | Same (HTTP test, auth scenario) |
| Stress test peak | 500 RPS + 250 concurrent WS | k6 run load-tests/stress-test.js |
Test Suites
| Suite | Language | Count | Command |
|---|---|---|---|
| Frontend unit | TypeScript (Vitest) | 420 | npm test |
| Rust engine | Rust (cargo test) | 218 | cd rust-engine && cargo test |
| Server | TypeScript (Vitest) | 168 | cd server && npm test |
| E2E browser (4 suites) | TypeScript (Playwright) | 48 | npx playwright test |
| k6 HTTP load | JavaScript (k6) | 6 scenarios | k6 run load-tests/http-load-test.js |
| k6 WebSocket load | JavaScript (k6) | ramp to 200 VUs | k6 run load-tests/websocket-load-test.js |
| k6 stress (breaking point) | JavaScript (k6) | 500 RPS / 250 WS | k6 run load-tests/stress-test.js |
| Total | 3 languages | 854 + 3 k6 | |
All messages are JSON over WebSocket, validated with Zod schemas. Protocol version v: 1.
Client → Server
| Message | Key Fields | Validation |
|---|---|---|
| `create_table` | `playerName` (1–20 chars), `elo` (0–4000), `pieceBank` | Zod: string length, int range, optional bank |
| `join_table` | `tableId`, `playerName`, `elo` | Zod: required tableId string |
| `list_tables` | (none) | Zod: type + version only |
| `leave_table` | (none) | Zod: type + version only |
| `make_move` | `gameId` (UUID), `move` (2–6 chars, SAN or UCI) | Zod: UUID format, string length |
| `resign` | `gameId` (UUID) | Zod: UUID format |
| `offer_draw` | `gameId` (UUID) | Zod: UUID format |
| `accept_draw` / `decline_draw` | `gameId` (UUID) | Zod: UUID format |
| `reconnect` | `playerToken`, `gameId` (UUID) | Zod: token string, UUID |
Server → Client
| Message | Key Fields |
|---|---|
| `tables_list` | `tables[]` — id, host name, host ELO, created time |
| `table_created` | `tableId` |
| `game_found` | `gameId`, `color`, `opponent` (name + ELO), `timeControl`, `fen`, piece banks |
| `move_ack` | `gameId`, `move` (SAN), `fen`, clock times |
| `opponent_move` | `gameId`, `move` (UCI), `fen`, clock times |
| `game_over` | `gameId`, `result`, `reason`, ELO change |
| `draw_offer` / `draw_declined` | `gameId`, `from` (opponent name) |
| `error` | `code`, `message` |
Prometheus Metrics (16 custom)
| Metric | Type | What it measures |
|---|---|---|
| `chess_connected_players` | Gauge | Current WebSocket connections |
| `chess_active_games` | Gauge | Games currently in progress |
| `chess_games_started_total` | Counter | Lifetime games started |
| `chess_games_completed_total` | Counter | Completed games (labeled: result, reason) |
| `chess_queue_length` | Gauge | Players waiting for match |
| `chess_queue_wait_seconds` | Histogram | Time in queue before match found |
| `chess_moves_total` | Counter | Total moves across all games |
| `chess_move_processing_seconds` | Histogram | Move validation + execution time |
| `chess_auth_total` | Counter | Auth attempts (labeled: type, result) |
| `chess_errors_total` | Counter | Errors by code |
| `chess_db_query_seconds` | Histogram | Database query duration (labeled: operation) |
| `chess_shutdown_in_progress` | Gauge | 1 during graceful shutdown drain |
| `chess_rate_limit_hits_total` | Counter | HTTP rate limit rejections |
| `chess_ws_rate_limit_total` | Counter | WebSocket rate limit rejections |
| `chess_process_crashes_total` | Counter | Uncaught exceptions / unhandled rejections |
| + Node.js defaults | Various | CPU, memory, event loop lag, GC, handles |
Guarantees the system makes — auditable in source:
- Server is authoritative for multiplayer game state. Clients submit moves; server validates legality via chess.js before broadcasting. Invalid moves are rejected with an error message. (source: `GameRoom.makeMove()`)
- AI always works. Triple fallback chain: Rust WASM → Stockfish.js Worker → TypeScript minimax. If one engine fails to load, the next takes over silently. The user always gets a working opponent. (source: `aiService.ts`)
- Every WebSocket message is schema-validated. Zod discriminated union parses all inbound messages. Unknown types, wrong versions, and malformed fields are rejected before reaching game logic. (source: `protocol.ts`, `ClientMessageSchema`)
- Game state is recoverable. Single-player: save/load via localStorage + JSON file export. Multiplayer: player token reconnection within 30-second grace period + game persistence in SQLite. (source: `saveSystem.ts`, `GameRoom.DISCONNECT_GRACE_MS`)
- Rendering never blocks game logic. Game controller is synchronous; renderer updates are RAF-coalesced and decoupled. Scene transitions don't freeze the game state machine. (source: `main-3d.ts` RAF loop)
- WebGL failure is non-fatal. Context-loss triggers a toast notification; game logic continues; renderer attempts automatic recovery. (source: `renderer3d.ts` context-loss handler)
- Clock integrity in multiplayer. Server tracks wall-clock elapsed time per move. Clocks are updated server-side before broadcasting — clients display but don't control time. (source: `GameRoom.makeMove()` clock logic)
- Graceful shutdown preserves connections. SIGTERM/SIGINT triggers: stop accepting new connections → notify all clients → drain timeout → force disconnect. Fly.io deploys don't orphan games. (source: `resilience.ts`)
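The clock-integrity invariant above comes down to the server charging its own measured time against the mover's clock. A minimal sketch of that bookkeeping follows, with hypothetical names and no increment or delay handling. The real logic lives in the Node.js server, so this is only an illustration of the idea.

```rust
use std::time::Duration;

#[derive(Clone, Copy, PartialEq, Debug)]
enum Side {
    White,
    Black,
}

/// Remaining time per side, owned by the server.
struct ServerClock {
    remaining: [Duration; 2], // index 0 = White, 1 = Black
}

impl ServerClock {
    fn new(initial: Duration) -> Self {
        Self { remaining: [initial, initial] }
    }

    /// Deducts server-measured `elapsed` from the mover's clock.
    /// Returns false if the mover has run out of time.
    fn charge(&mut self, mover: Side, elapsed: Duration) -> bool {
        let i = mover as usize;
        self.remaining[i] = self.remaining[i].saturating_sub(elapsed);
        !self.remaining[i].is_zero()
    }
}
```

In use, `elapsed` would be the difference between two timestamps taken on the server (when the turn started and when the move arrived), so nothing the client reports can add time back.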
Questions a senior engineer will ask, with honest 1-sentence answers and deep-dive links:
| Question | Short Answer | Deep Dive |
|---|---|---|
| Why Rust WASM instead of a server-side engine? | Zero latency for single-player, zero server cost for AI, AI games scale with zero backend load — server only needed for multiplayer. | B2 ↓ |
| How do you prevent cheating in multiplayer? | Server validates every move via chess.js. Statistical move-quality detection is planned but not built — I'm honest about that. | D2 ↓ |
| What's the engine interface boundary? | FEN string + depth in → SAN move string out, via wasm-bindgen. Not UCI — custom bridge optimized for browser context. | B9 ↓ |
| How do you manage Three.js memory / GC pressure? | Explicit `dispose()` on every geometry, material, and texture during scene transitions. WebGL context-loss handler for recovery. No circular references. | B10 ↓ |
| What are the biggest perf bottlenecks? | Magic bitboard init (~50ms cold start), Three.js scene transitions (~200ms), Stockfish Worker init (~500ms). Measured via `performance.now()` instrumentation. | D7 ↓ |
| How do you validate the engine is correct? | Perft test: depth 5 starting position = 4,865,609 nodes, matching published values. 218 Rust tests cover edge cases (en passant, castling, promotion, pins). | C13 ↓ |
| Why vanilla TS instead of React? | 80% of the app is `<canvas>`. React's virtual DOM adds overhead for canvas-driven rendering. Game state is a single chess position — no component tree needed. | D5 ↓ |
| How does the multiplayer protocol handle reconnection? | Player gets a unique token at game start. On disconnect, server holds the seat for 30 seconds. Client sends `reconnect` with token + gameId to resume. | B11 ↓ |
| What would you do differently? | Solid.js for non-canvas UI panels, ECS for 3D scene management, PostgreSQL from day one, tapered eval in the engine. | D9 ↓ |
| How do you test a 3D game with no visible output in CI? | Mock Three.js (no GPU), test game logic via exposed `window.__GAME__` API, E2E Playwright tests with real browser + canvas interaction. | D10 ↓ |
Multiplayer status note: The multiplayer infrastructure is built and deployed (auth, open tables, game rooms, reconnect, ELO, draw/resign, Zod protocol). It has not been stress-tested with real concurrent human players beyond k6 simulations. The WebSocket server runs on a single Fly.io VM. Treat as "works in demo, not battle-tested at scale."
2 minutes. Clone, install, play.
- Node.js 18+
- Rust + wasm-pack (only if rebuilding the WASM engine — pre-built binary included)
```bash
git clone https://github.com/beautifulplanet/Promotion-Variant-Chess.git
cd Promotion-Variant-Chess
npm install
npm run dev
```

Open http://localhost:5173. That's it.

```bash
cd server
npm install
cp .env.example .env
npx prisma migrate dev
npm run dev
```

Server starts on http://localhost:3001.

```bash
# Run from the repository root; subshells keep the working directory unchanged
npm test                          # 420 frontend tests
(cd server && npm test)           # 168 server tests
(cd rust-engine && cargo test)    # 218 Rust engine tests
```

```bash
npm run build    # TypeScript check + Vite → dist/
```

```bash
cd rust-engine
wasm-pack build --target web --release --out-dir ../public/wasm
```

Need more detail? See Part 4: Full Tutorial for step-by-step setup with explanations, or the standalone tutorial doc.
The IKEA manual. Step-by-step setup, complete engine reference, system design Q&A. Everything you need to understand, modify, or rebuild any part of this project.
This section is large. Use the table of contents below to jump to what you need. It's also available as a standalone document → docs/PART4_FULL_TUTORIAL.md with its own table of contents.
- A1. System Requirements
- A2. Clone & Install (Frontend)
- A3. Run the Game Locally
- A4. Set Up the Multiplayer Server
- A5. Run All Tests
- A6. Rebuild the Rust Engine from Source
- A7. Deploy to Production
- B1. System Overview in 60 Seconds
- B2. The AI Engine Fallback Chain
- B3. Bitboard Representation
- B4. Magic Bitboards for Sliding Pieces
- B5. Move Generation
- B6. Search Algorithm
- B7. Position Evaluation
- B8. Zobrist Hashing & Transposition Tables
- B9. WASM Bridge Architecture
- B10. Rendering Pipeline
- B11. Multiplayer Architecture
- C1. Board Representation from First Principles
- C2. Types and Move Encoding
- C3. The Position Struct
- C4. Attack Tables — Knights, Kings, and Pawns
- C5. Magic Bitboards — The Complete Theory
- C6. Move Generation — Pseudolegal to Legal
- C7. Position Evaluation — Material and Piece-Square Tables
- C8. Zobrist Hashing — Incremental Position Fingerprinting
- C9. Transposition Table — Caching Search Results
- C10. Search — Minimax, Alpha-Beta, and Beyond
- C11. WASM Compilation and the TypeScript Bridge
- C12. GameState — Full Game Lifecycle in Rust
- C13. Testing and Correctness — Perft
- D1. How would you scale to 10 billion users?
- D2. How do you detect and handle cheating?
- D3. Why Three.js instead of native mobile rendering?
- D4. Why do you have multiple AI engines?
- D5. Why vanilla TypeScript instead of React/Vue/Svelte?
- D6. How does the WASM binary get loaded in the browser?
- D7. What are the performance characteristics on mobile?
- D8. How does the ELO system work?
- D9. What would you do differently if you started over?
- D10. How do you test a 3D game?
- D11. What is the AI Tournament System?
- D12. What metrics do you capture and why?
- D13. What is your production resilience strategy?
- D14. What are your load testing methodology and SLOs?
- F1. Bottleneck Analysis by User Scale
- F2. Scaling Roadmap: 100 to 10 Billion Users
- F3. Statistics Captured and How They Drive Decisions
- F4. Documentation Index
| Tool | Version | Required? | What it's for |
|---|---|---|---|
| Node.js | 18+ | Yes | Frontend dev server, server runtime |
| npm | 9+ | Yes | Package management (comes with Node) |
| Rust | 1.70+ | Only for engine rebuild | Compiles the WASM chess engine |
| wasm-pack | 0.12+ | Only for engine rebuild | Rust → WASM build tool |
| Docker | 20+ | Only for server deploy | Containerized server deployment |
| Git | 2.30+ | Yes | Clone the repo |
Don't have Rust? That's fine. The pre-built WASM binary is included in public/wasm/. You only need Rust if you want to modify the chess engine.
Step 1: Clone

```bash
git clone https://github.com/beautifulplanet/Promotion-Variant-Chess.git
cd Promotion-Variant-Chess
```

Step 2: Install dependencies

```bash
npm install
```

This installs: Three.js (3D rendering), chess.js (move validation fallback), Vite (dev server & bundler), Vitest (testing), Playwright (E2E tests), TypeScript, and Socket.io client.
Done. Two commands.
```bash
npm run dev
```

Open http://localhost:5173 in your browser.
You should see:
- The Welcome Dashboard — a newspaper-themed landing screen with your stats (ELO, wins, streak), game mode buttons (Play AI, Multiplayer, Classic Mode), and difficulty/GFX preferences
- Click Play to enter the game: a 3D chess board with the starting position
- A sidebar with game controls (difficulty, undo, settings)
- Era-themed environment (starts at Stone Age for new players)
Play against AI: Click a white piece to see legal moves highlighted with theme-aware colors (each board style has its own highlight palette). Click a destination to move. The AI responds in <1 second.
Controls:
| Input | Action |
|---|---|
| Click/Tap | Select piece, make move |
| Scroll/Pinch | Zoom in/out |
| Drag | Orbit camera around the board |
Step 1: Navigate

```bash
cd server
```

Step 2: Install

```bash
npm install
```

Step 3: Configure

```bash
cp .env.example .env
```

The defaults work out of the box (port 3001, SQLite, dev JWT secret).

Step 4: Initialize database

```bash
npx prisma migrate dev
```

Creates `prisma/dev.db` with Player and Game tables.

Step 5: Start

```bash
npm run dev
```

Server runs on http://localhost:3001.
| Endpoint | What |
|---|---|
| `GET /health` | Status + DB connectivity |
| `GET /metrics` | Prometheus metrics (optionally protected — set `METRICS_TOKEN` env var to require Bearer auth) |
| `POST /api/auth/register` | Create account |
| `POST /api/auth/login` | Get JWT token |
| WebSocket `/` | Real-time gameplay |
Step 6: Test multiplayer — Open two browser tabs. Both connect and enter the matchmaking queue automatically.
```bash
# Run from the repository root; subshells keep the working directory unchanged

# Frontend (420 tests, ~5s)
npm test

# Server (168 tests, ~8s)
(cd server && npm test)

# Rust engine (218 tests, ~2s)
(cd rust-engine && cargo test)

# E2E browser tests (48 tests across 4 suites)
npx playwright install chromium   # First time only
npm run e2e
```

Total: 806 unit/integration tests + 48 E2E Playwright tests (854 total)
| Suite | Count | Covers |
|---|---|---|
| Rust engine | 218 | Bitboards, attacks, magic bitboards, move gen, search, eval, TT, Zobrist, perft, game state, tournament |
| Frontend | 420 | Game controller, ELO, era system, save system, chess engine, performance, AI aggression |
| Server | 168 | Auth, API, database CRUD, matchmaker, game rooms, metrics, protocol, CORS |
| E2E — playtest | 13 | Gameplay (8 turns, undo, new-game, PGN), visual correctness (flip, turn indicator, board state), stress (rapid clicks, mobile viewport, UI buttons, console audit) |
| E2E — welcome dashboard | 18 | Dashboard visibility, beta badge, date display, stats ribbon, button navigation (Play AI, Classic Mode, Multiplayer, Explore), dismiss/return, preference persistence |
| E2E — classic mode | 12 | Classic layout toggle, dark theme rendering, board sizing, overlay hidden, scrollable articles, Explore mode |
| E2E — smoke | 5 | Page load, AI response, save/load, console error audit |
Only needed if you modify files in rust-engine/src/.
```bash
# Install Rust (skip if you have it)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Add WASM target
rustup target add wasm32-unknown-unknown

# Install wasm-pack
cargo install wasm-pack

# Build
cd rust-engine
wasm-pack build --target web --release --out-dir ../public/wasm

# Verify
cargo test   # All 218 tests
```

Output goes to `public/wasm/` — a `.wasm` binary (~170 KB gzipped) + JavaScript glue code.
Push to main. Vercel auto-deploys.

```bash
git push origin main
```

```bash
cd server

# Install CLI + auth (one-time)
# Windows: irm https://fly.io/install.ps1 | iex
# Mac/Linux: curl -L https://fly.io/install.sh | sh
fly auth login

# Create app (detects fly.toml)
fly launch --no-deploy

# Create persistent volume for SQLite
fly volumes create chess_data --region iad --size 1

# Set secrets
fly secrets set JWT_SECRET=$(openssl rand -hex 32)

# Deploy
fly deploy

# Verify
curl https://chess-server-falling-lake-2071.fly.dev/health
curl https://chess-server-falling-lake-2071.fly.dev/metrics
# Note: /metrics is public when METRICS_TOKEN is unset.
# To protect: fly secrets set METRICS_TOKEN=$(openssl rand -hex 16)
# Then: curl -H "Authorization: Bearer <token>" .../metrics
```

Three independently deployable components:
- Frontend (TypeScript + Three.js + Vite) — SPA with WebGL 3D chessboard, 20 era environments, mouse/touch input
- Rust Chess Engine (WASM) — Bitboard engine in the browser. Move gen, eval, alpha-beta search. 10–100× faster than JavaScript.
- Multiplayer Server (Node.js + Express + Socket.io + Prisma) — Matchmaking, game rooms, ELO, JWT auth, SQLite persistence
Key insight: Engine runs in the browser. Zero latency for single-player. Zero server cost for AI. Server only coordinates multiplayer.
```
┌─────────────────────────────────────────────────────┐
│                   AI Move Request                   │
│                                                     │
│  1. Rust WASM Engine (fastest, ~1M+ NPS)            │
│     └─ if WASM fails to load ─────────────────┐     │
│                                               │     │
│  2. Stockfish.js Web Worker (strongest,       │     │
│     skill 0-20)                               │     │
│     └─ if Worker fails ────────────────────┐  │     │
│                                            │  │     │
│  3. TypeScript Engine (always works,       │  │     │
│     chess.js + minimax)                    │  │     │
└────────────────────────────────────────────┴──┴─────┘
```
// aiService.ts — simplified
if (this.rustEngineReady) {
move = rustEngine.getBestMove(fen, depth);
} else if (this.workerReady) {
move = await this.requestFromWorker(board, turn, elo);
} else {
move = this.fallbackEngine.getBestMove(board, turn, depth);
}
Why 3 engines? WASM can fail (old browsers, CSP). Workers can fail (Safari bugs). TypeScript always works. User never sees a broken AI.
64 squares → 64-bit integer. One bit per square. Bit 0 = a1, bit 63 = h8.
pub struct Position {
pieces: [[Bitboard; 6]; 2], // 12 bitboards: (color, piece_type)
occupied_by_color: [Bitboard; 2], // All white, all black
occupied_all: Bitboard, // Combined
}
| Operation | Bitboard | Array |
|---|---|---|
| "Piece on e4?" | 1 AND | 1 array access |
| "Count pieces" | 1 POPCNT | Loop 64 |
| "Knight moves from e4" | 1 lookup | 8 bounds checks |
| "Rook moves from e4" | 1 mul + shift + lookup | Ray-cast loop |
Directional shifts: north = << 8, east = (<< 1) & NOT_FILE_A.
Sliding pieces (rook, bishop, queen) attack depends on blockers. Magic bitboards: O(1) lookup.
- Precompute relevant occupancy mask per square (excluding edges)
- Enumerate all 2^N blocker configs
- Find magic number M: `(blockers × M) >> (64 - N)` = unique index
- Store attack bitboard per index
Runtime: 5 operations total (AND + multiply + shift + 2 lookups). Memory: ~840 KB tables.
Phase 1 — Pseudolegal: All moves obeying piece rules (ignoring check). Pawns, knights/kings (table lookup), sliding pieces (magic lookup), castling.
Phase 2 — Legal: Make each move, check if king in check, unmake if illegal.
~5M legal positions/sec in WASM. Stack-allocated MoveList (512 bytes, L1-cache-friendly).
Perft verified: depth 5 = 4,865,609 nodes ✅
Negamax alpha-beta with iterative deepening, enhanced with:
| Technique | Effect |
|---|---|
| Transposition Table | Cache results by Zobrist hash (~2× speedup) |
| Null Move Pruning | Skip turn — if still winning, prune (~3×) |
| Late Move Reductions | Later moves at reduced depth (~2×) |
| Killer Moves | Prioritize quiet moves that caused cutoffs (~1.5×) |
| MVV-LVA Ordering | Best captures first (~2×) |
| Quiescence Search | Resolve captures at leaf nodes |
Move ordering: TT best → Captures (MVV-LVA) → Promotions → Killers → Quiet. Reduces branching factor from ~35 to ~6.
Centipawns (100 = 1 pawn). Components:
- Material: P=100, N=320, B=330, R=500, Q=900
- Piece-Square Tables: Positional bonuses (center, castled king, advanced pawns)
- Bishop Pair: +30cp
- Phase Detection: <2000cp non-king material → endgame king PST
Simple eval + deep search (via WASM speed) > complex eval + shallow search.
64-bit position fingerprint via XOR of random keys. 781 keys generated at compile time (const fn PRNG). O(1) incremental update per move.
TT: 262,144 entries (~5 MB). Stores hash, depth, score, flag (Exact/Lower/Upper), best move. Depth-preferred replacement. Mate score adjustment for correct distance.
wasm-pack build --target web --release → .wasm (~170 KB gzipped) + JS glue.
Bridge (rustEngine.ts): blob URL dynamic import (Vite-compatible), try/catch every call, pos.free() after every use, cross-platform time via #[cfg(target_arch)].
Three.js WebGL renderer (5,000+ lines in renderer3d.ts) with a deep visual customization system.
Modular boundaries inside `renderer3d.ts`: While still a single file, the code is organized into clearly separated responsibility zones:
| Zone | Approx. lines | Responsibility |
|---|---|---|
| Scene lifecycle | ~200 | init, dispose, resize, context-loss recovery |
| Asset management | ~400 | texture loading, geometry caching, material pools |
| Piece mesh / material factory | ~1,200 | 7 3D + 17 2D piece style constructors, color mapping |
| Board construction & highlights | ~600 | square meshes, selection rings, legal-move dots |
| Input handling | ~300 | raycasting, click debounce, `screenToBoard` coord flip |
| Camera & controls | ~200 | orbit setup, flip-board view rotation |
| Post-processing & lighting | ~400 | shadow mapping, environment maps, bloom |
| State sync (`updatePieces`/`updateState`) | ~500 | diff-based piece add/remove/move, animation |
| Era environment generation | ~800 | 20 themed worlds, skyboxes, particles, trees |
Extracting these into separate modules (or an ECS architecture) is the top refactor target; see "What would you do differently".
Board & Piece Visuals:
- 24 piece styles — 7 3D geometry sets (Staunton, Lewis, Modern, Crystal, Neon, Marble, Wood) + 17 2D canvas-drawn styles (Classic, Staunton 2D, Modern, Symbols, Newspaper, Editorial, Outline, Figurine, Pixel Art, Gothic, Minimalist, Celtic, Sketch, Pharaoh sprite, Art Deco, Steampunk, Tribal)
- 12 board styles: Classic Wood, Tournament Green, Walnut & Maple, Ebony & Ivory, Italian Marble, Ancient Stone, Crystal Glass, Neon Grid, Newspaper Print, Ocean Depths, Forest Grove, Royal Purple, each with unique `selectedSquareColor` and `legalMoveColor` for theme-aware highlights
Environment Generation:
- Procedural skyboxes (`proceduralSkybox.ts`): per-era sky colors, gradients, star fields with configurable density, and atmospheric effects
- L-system trees (`assetMutator.ts`, 1,200 lines): 3 grammar presets (Oak, Pine, Willow) generate procedural 3D trees via recursive string rewriting with configurable depth, branch angles, and leaf density
- Lorenz attractor particles (`eraWorlds.ts`): the Digital era features a chaotic attractor particle system using ODE integration (σ=10, ρ=28, β=8/3) rendered as animated point clouds
- Dynamic lighting (`dynamicLighting.ts`, 1,100+ lines): per-era ambient, directional, and point light configurations with real-time shadow mapping
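The Lorenz particle system in the list above can be sketched as a single integration step. This is a minimal illustration, written in Rust for consistency with the engine examples (the project's version lives in `eraWorlds.ts`). The explicit Euler integrator, the `Particle` type, and the function names are assumptions; only the parameters σ=10, ρ=28, β=8/3 come from the text.

```rust
/// Lorenz system parameters quoted in the text.
const SIGMA: f64 = 10.0;
const RHO: f64 = 28.0;
const BETA: f64 = 8.0 / 3.0;

/// One particle position in attractor space (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Particle {
    pub x: f64,
    pub y: f64,
    pub z: f64,
}

/// Advance one particle by a single explicit Euler step of size `dt`.
/// Assumption: the real code may use a different integrator, such as RK4.
pub fn lorenz_step(p: Particle, dt: f64) -> Particle {
    let dx = SIGMA * (p.y - p.x);
    let dy = p.x * (RHO - p.z) - p.y;
    let dz = p.x * p.y - BETA * p.z;
    Particle {
        x: p.x + dx * dt,
        y: p.y + dy * dt,
        z: p.z + dz * dt,
    }
}

/// Integrate `steps` Euler steps and return the trajectory, start point included.
pub fn lorenz_trajectory(start: Particle, dt: f64, steps: usize) -> Vec<Particle> {
    let mut points = Vec::with_capacity(steps + 1);
    let mut p = start;
    points.push(p);
    for _ in 0..steps {
        p = lorenz_step(p, dt);
        points.push(p);
    }
    points
}
```

A renderer would typically advance every particle by one or more steps per animation frame and upload the positions to the point cloud; a smaller `dt` trades speed for accuracy.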
Performance:
- Shadow mapping, orbit controls, 20 era environments with procedural skyboxes, themed materials, dynamic lighting, and particle systems
- Mobile adaptive: auto-detect → disable shadows/antialias, cap DPR at 2.0
- Debounced resize (150ms)
- Stability hardening: click debounce (100ms), `_processingClick` reentrance guard, RAF coalescing for rapid state updates, non-blocking DOM toast on WebGL context loss, Three.js geometry/material disposal on piece removal
Socket.io (WebSocket + HTTP long-polling fallback):
- Auth: JWT in socket handshake
- Matchmaking: Ranked queue, expanding ELO range
- Game Rooms: Server-side chess.js validation, state broadcast, reconnect handling
- ELO: Standard formula (K=32), persisted via Prisma
- State: In-memory Map — appropriate for portfolio scale
Answer "what are the legal moves?" millions of times per second. Board representation determines speed.
Finding rook attacks = loop through 7 squares × 4 directions with bounds checks. O(28) per rook. Branchy.
u64 where each bit = one square:
White pawns starting position:
8 . . . . . . . . Hex: 0x000000000000FF00
2 X X X X X X X X ← bits 8-15 set
1 . . . . . . . .
| Chess Op | CPU Instruction |
|---|---|
| "Piece on e4?" | AND |
| "Empty squares" | NOT |
| "Pawns north" | SHIFT |
| "Count pieces" | POPCNT |
| "Find first" | TZCNT |
| "Pop first" | AND + SUB |
#[derive(Clone, Copy, PartialEq, Eq, Default)]
pub struct Bitboard(pub u64);
// Directional shifts with edge masking
pub const fn east(self) -> Bitboard {
Bitboard((self.0 << 1) & NOT_FILE_A.0)
}
pub const fn north(self) -> Bitboard {
Bitboard(self.0 << 8)
}
// Iteration: Kernighan's bit-pop
pub fn pop_lsb(&mut self) -> Option<Square> {
if self.0 == 0 { return None; }
let sq = Square(self.0.trailing_zeros() as u8);
self.0 &= self.0 - 1;
Some(sq)
}
#[repr(u8)]
pub enum PieceType { Pawn = 0, Knight = 1, Bishop = 2, Rook = 3, Queen = 4, King = 5 }
#[repr(u8)]
pub enum Color { White = 0, Black = 1 }
pub struct Square(pub u8); // 0-63
pub struct Move(pub u16); // 16-bit packed
// Bits 0-5: from, 6-11: to, 12-13: promotion, 14-15: flags
16-bit encoding: 2 bytes per move. MoveList (256 max) = 512 bytes, small enough to stay in L1 cache.
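As a rough sketch of the bit layout described above, packing and unpacking could look like the following. The type name `PackedMove`, the accessor names, and the meaning of the promotion and flag values are assumptions for illustration; only the field positions come from the comment in the source.

```rust
/// 16-bit packed move using the documented layout:
/// bits 0-5 = from, 6-11 = to, 12-13 = promotion piece, 14-15 = flags.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct PackedMove(pub u16);

impl PackedMove {
    /// Pack the four fields. Out-of-range inputs are masked to their field width.
    pub const fn new(from: u8, to: u8, promotion: u8, flags: u8) -> Self {
        PackedMove(
            (from as u16 & 0x3F)
                | ((to as u16 & 0x3F) << 6)
                | ((promotion as u16 & 0x3) << 12)
                | ((flags as u16 & 0x3) << 14),
        )
    }

    /// Origin square, 0-63.
    pub const fn from_sq(self) -> u8 {
        (self.0 & 0x3F) as u8
    }

    /// Destination square, 0-63.
    pub const fn to_sq(self) -> u8 {
        ((self.0 >> 6) & 0x3F) as u8
    }

    /// Promotion piece code, 0-3 (encoding is an assumption).
    pub const fn promotion(self) -> u8 {
        ((self.0 >> 12) & 0x3) as u8
    }

    /// Move flags, 0-3 (encoding is an assumption).
    pub const fn flags(self) -> u8 {
        ((self.0 >> 14) & 0x3) as u8
    }
}
```

For example, `PackedMove::new(12, 28, 0, 0)` would represent e2 to e4 with a1 = 0 numbering, and the whole value occupies two bytes.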
pub struct Position {
pieces: [[Bitboard; 6]; 2],
occupied_by_color: [Bitboard; 2],
occupied_all: Bitboard,
side_to_move: Color,
castling: CastlingRights, // 4-bit mask: KQkq
en_passant: Option<Square>,
halfmove_clock: u8,
fullmove_number: u16,
hash: u64, // Zobrist, incrementally updated
}
Make/Unmake: Save undo info → apply move → update castling/EP/hash → check king safety → return None if illegal. Millions of calls during search. unmake reverses using saved UndoInfo.
Fixed patterns. Precomputed at compile time (Rust const eval). Each table is 64 × 8 bytes = 512 bytes baked into the binary.
pub static KNIGHT_ATTACKS: [Bitboard; 64] = { /* 8 L-shapes, bounds-checked */ };
pub static KING_ATTACKS: [Bitboard; 64] = { /* 8 adjacent */ };
pub static WHITE_PAWN_ATTACKS: [Bitboard; 64] = { /* NW, NE */ };
pub static BLACK_PAWN_ATTACKS: [Bitboard; 64] = { /* SW, SE */ };
Usage: `KNIGHT_ATTACKS[sq.index()]` (one memory read).
Bishop on d4, blocker on f6 → can't see g7/h8. Attack set depends on blockers. Mask has N relevant bits → 2^N configs. Need O(1) lookup.
index = (blockers × magic_number) >> (64 - N)
Multiplication "gathers" relevant bits into top N bits. Magic found by brute-force search.
for sq in 0..64 {
let mask = rook_mask(sq);
let mut blockers = Bitboard::EMPTY;
loop {
let attacks = rook_attacks_slow(sq, blockers); // Ray-cast
let index = (blockers * MAGIC) >> (64 - bits);
table[sq][index] = attacks;
blockers = (blockers.wrapping_sub(mask)) & mask; // Carry-Rippler
if blockers == 0 { break; }
}
}
fn rook_attacks(sq: Square, occupied: Bitboard) -> Bitboard {
let blockers = occupied & ROOK_MASKS[sq]; // AND
let index = (blockers * MAGIC) >> shift; // MUL + SHIFT
ROOK_TABLE[sq][index] // LOOKUP
}
Queen = rook | bishop. Two lookups + OR.
Memory: rook ~800 KB + bishop ~40 KB. OnceLock lazy init.
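The blocker enumeration and the rook table size can be checked with a small standalone sketch. It works on plain `u64` values instead of the engine's `Bitboard` type, and the function names are hypothetical; it only illustrates the relevant-occupancy mask, the Carry-Rippler subset walk, and why the rook table comes to roughly 800 KB.

```rust
/// Relevant-occupancy mask for a rook on `sq` (0 = a1, 63 = h8):
/// squares on its rank and file, excluding board edges and `sq` itself.
pub fn rook_mask(sq: u8) -> u64 {
    let rank = (sq / 8) as u32;
    let file = (sq % 8) as u32;
    let mut mask = 0u64;
    for r in (rank + 1)..7 {
        mask |= 1u64 << (r * 8 + file); // north, stop before rank 8
    }
    for r in 1..rank {
        mask |= 1u64 << (r * 8 + file); // south, stop before rank 1
    }
    for f in (file + 1)..7 {
        mask |= 1u64 << (rank * 8 + f); // east, stop before file h
    }
    for f in 1..file {
        mask |= 1u64 << (rank * 8 + f); // west, stop before file a
    }
    mask
}

/// Enumerate every subset of `mask` with the Carry-Rippler trick.
/// Intended for small masks (a rook mask has at most 12 bits).
pub fn blocker_subsets(mask: u64) -> Vec<u64> {
    let mut subsets = Vec::new();
    let mut subset = 0u64;
    loop {
        subsets.push(subset);
        subset = subset.wrapping_sub(mask) & mask; // next subset of mask
        if subset == 0 {
            break; // wrapped around to the empty set
        }
    }
    subsets
}

/// Total rook table entries when each square gets 2^N slots (N = mask bits).
pub fn rook_table_entries() -> usize {
    (0..64u8).map(|sq| 1usize << rook_mask(sq).count_ones()).sum()
}
```

With 8-byte entries, the 102,400 slots this produces add up to 819,200 bytes, which matches the ~800 KB figure quoted for the rook table.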
pub struct MoveList {
moves: [Move; 256], // No heap
count: usize,
}
- Single push: `pawns.north() & empty` (all pawns at once)
- Double push: `(singles & RANK_3).north() & empty`
- Captures: per-pawn `pawn_attacks(from) & enemies`
pawn_attacks(from) & enemies - Promotions: rank 8 moves → 4 variants (Q/R/B/N)
- En passant
Rights exist + not in check + path empty + king doesn't cross attacked squares.
for m in pseudo_legal.iter() {
if let Some(undo) = pos.make_move(*m) {
legal.push(*m);
pos.unmake_move(*m, &undo);
}
}
evaluate(pos) → Score (centipawns, side-to-move perspective).
Material: P=100, N=320, B=330, R=500, Q=900, K=20000
PST highlights:
| Piece | Good square | Bonus | Bad square | Penalty |
|---|---|---|---|---|
| Pawn | d4/e4 (center) | +25 | a3/h3 (flank) | -20 |
| Pawn | rank 7 | +50 | — | — |
| Knight | center | +20 | rim | -50 |
| King (midgame) | g1 (castled) | +30 | e1 (center) | -50 |
| King (endgame) | center | +40 | — | — |
Bishop pair: +30. Phase: <2000cp non-king → endgame. Black mirroring: sq ^ 56.
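The `sq ^ 56` mirroring can be illustrated with a short sketch. It assumes piece-square tables are written from White's point of view with a1 = index 0; the function names and the table contents are hypothetical.

```rust
/// Flip a square index vertically (a1 <-> a8, e4 <-> e5) by XOR-ing with 56.
/// With 0 = a1 .. 63 = h8, this inverts the three rank bits and keeps the file.
pub const fn mirror(sq: u8) -> u8 {
    sq ^ 56
}

/// Look up a piece-square bonus from a table written from White's perspective.
/// Black pieces read the vertically mirrored square.
pub fn pst_bonus(table: &[i32; 64], sq: u8, is_white: bool) -> i32 {
    let idx = if is_white { sq } else { mirror(sq) };
    table[idx as usize]
}
```

With this convention a single table per piece type serves both colors: a Black pawn on e5 reads the same entry as a White pawn on e4.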
XOR random keys for each (piece, square) + side + castling + EP. 781 keys via compile-time const fn xorshift64.
Incremental update (O(1)): XOR is self-inverse. Move piece: hash ^= key(from); hash ^= key(to).
Collision: two distinct positions share a hash with probability ~1 in 2^64 ≈ 1.8×10^19 per pair. Negligible at these search sizes.
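A minimal sketch of compile-time key generation and the incremental update follows. It covers only the 768 piece-square keys (side-to-move, castling, and en passant keys are omitted), and the seed, the 13/7/17 shift triple, and all names are assumptions, not the engine's actual values.

```rust
/// One xorshift64 step, usable in const context.
const fn xorshift64(mut x: u64) -> u64 {
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    x
}

/// 768 piece-square keys: 2 colors x 6 piece types x 64 squares.
const fn piece_keys() -> [u64; 768] {
    let mut keys = [0u64; 768];
    let mut state = 0x9E37_79B9_7F4A_7C15u64; // arbitrary non-zero seed (assumption)
    let mut i = 0;
    while i < 768 {
        state = xorshift64(state);
        keys[i] = state;
        i += 1;
    }
    keys
}

/// Evaluated at compile time and baked into the binary.
pub static PIECE_KEYS: [u64; 768] = piece_keys();

const fn key_index(color: usize, piece: usize, sq: usize) -> usize {
    (color * 6 + piece) * 64 + sq
}

/// Full recomputation: XOR the key of every (color, piece, square) on the board.
pub fn hash_from_scratch(pieces: &[(usize, usize, usize)]) -> u64 {
    pieces
        .iter()
        .fold(0, |h, &(c, p, s)| h ^ PIECE_KEYS[key_index(c, p, s)])
}

/// Incremental update for a quiet move: XOR the piece out of `from` and into `to`.
pub fn update_quiet_move(hash: u64, color: usize, piece: usize, from: usize, to: usize) -> u64 {
    hash ^ PIECE_KEYS[key_index(color, piece, from)] ^ PIECE_KEYS[key_index(color, piece, to)]
}
```

Because XOR is its own inverse, applying `update_quiet_move` with `from` and `to` swapped restores the previous hash, which is what makes unmake cheap.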
pub struct TTEntry {
hash: u64, depth: u8, score: Score,
flag: TTFlag, // Exact | LowerBound | UpperBound
best_move: Option<Move>,
}
262,144 entries (~5 MB). Depth-preferred replacement.
Mate adjustment: Store as node-relative (score + ply), read as root-relative (score - ply).
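The mate adjustment can be sketched as a pair of conversion helpers. The constants `MATE` and `MATE_THRESHOLD`, the `i32` score type, and the function names are assumptions for illustration; the engine's actual values may differ.

```rust
pub type Score = i32;

/// Hypothetical convention: "mate in n plies from the root" scores MATE - n.
pub const MATE: Score = 30_000;
/// Scores beyond this magnitude are treated as mate scores.
pub const MATE_THRESHOLD: Score = MATE - 1_000;

/// Convert a root-relative score to node-relative before storing it in the TT.
pub fn score_to_tt(score: Score, ply: i32) -> Score {
    if score >= MATE_THRESHOLD {
        score + ply
    } else if score <= -MATE_THRESHOLD {
        score - ply
    } else {
        score
    }
}

/// Convert a stored node-relative score back to root-relative at the probing ply.
pub fn score_from_tt(score: Score, ply: i32) -> Score {
    if score >= MATE_THRESHOLD {
        score - ply
    } else if score <= -MATE_THRESHOLD {
        score + ply
    } else {
        score
    }
}
```

Without this conversion, a mate found 3 plies below a node at ply 4 would be reported with the wrong distance when the same position is reached via a transposition at ply 6.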
Depth 1 → 2 → 3 → ... TT shared across iterations. Previous depth's best move searched first.
Skip our turn; if a reduced-depth search still scores ≥ beta even though the opponent got two moves in a row, prune. Conditions: not in check, not root, has pieces. Reduction: 2 plies.
After first 4 moves, search later moves at depth-1. Re-search at full depth if promising. Skip reduction for captures, promotions, killers, checks.
At depth 0, search all captures until the position is "quiet." Stand-pat: static eval as baseline. Mitigates the horizon effect.
TT best (+100K) → Captures MVV-LVA (+10K) → Promotions (+9K) → Killers (+5K) → Quiet (0)
MVV-LVA: victim × 10 - attacker. QxP(100) < PxQ(8900).
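The ordering scheme can be sketched as a scoring function. The piece values and class offsets come from the text; the `Piece` and `MoveClass` types, and the choice to add the MVV-LVA score on top of the capture offset, are assumptions for illustration.

```rust
/// Piece values in centipawns, as listed in the evaluation section.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Piece {
    Pawn = 100,
    Knight = 320,
    Bishop = 330,
    Rook = 500,
    Queen = 900,
}

/// MVV-LVA: most valuable victim first, least valuable attacker as tie-break.
pub fn mvv_lva(victim: Piece, attacker: Piece) -> i32 {
    victim as i32 * 10 - attacker as i32
}

/// Hypothetical move classes matching the ordering tiers in the text.
pub enum MoveClass {
    TtBest,
    Capture { victim: Piece, attacker: Piece },
    Promotion,
    Killer,
    Quiet,
}

/// Ordering score: higher scores are searched first.
pub fn order_score(class: &MoveClass) -> i32 {
    match class {
        MoveClass::TtBest => 100_000,
        MoveClass::Capture { victim, attacker } => 10_000 + mvv_lva(*victim, *attacker),
        MoveClass::Promotion => 9_000,
        MoveClass::Killer => 5_000,
        MoveClass::Quiet => 0,
    }
}
```

Sorting a move list by descending `order_score` puts the transposition-table move first and quiet moves last, which is what lets alpha-beta cut off early.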
wasm-pack build --target web --release --out-dir ../public/wasm
wasm-bindgen generates bindings: `#[wasm_bindgen] pub fn get_best_move(...)` → callable from JS.
const jsCode = await fetch('./wasm/chess_engine.js').then(r => r.text());
const blob = new Blob([jsCode], { type: 'application/javascript' });
const wasm = await import(URL.createObjectURL(blob));
await wasm.default('./wasm/chess_engine_bg.wasm');
pos.free() after every use. Every call try/caught. Cross-platform time: js_sys::Date::now() in WASM, SystemTime in native.
pub struct GameState {
position: Position,
hash_history: Vec<u64>, // Threefold repetition
move_history: Vec<(Move, UndoInfo)>, // Undo support
uci_history: Vec<String>, // Human-readable
}
Status: Checkmate → Stalemate → Insufficient material → 50-move → Threefold → Playing.
Undo: Pop from all three vectors, unmake move.
Board JSON: 8×8 array for TypeScript rendering.
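The threefold check over `hash_history` can be sketched in a few lines. This is a simplified illustration with a hypothetical function name: it scans the whole history, whereas an engine could stop at the last irreversible move (capture or pawn move).

```rust
/// Return true when the most recent position hash occurs at least three times
/// in the history (the current position counts as one occurrence).
pub fn is_threefold_repetition(hash_history: &[u64]) -> bool {
    match hash_history.last() {
        Some(&current) => hash_history.iter().filter(|&&h| h == current).count() >= 3,
        None => false,
    }
}
```

This relies on the Zobrist hash including side to move, castling rights, and en passant state, so that equal hashes correspond to positions that are equal under the repetition rule.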
Count all leaf nodes at depth N. Standard correctness benchmark.
pub fn perft(pos: &mut Position, depth: u32) -> u64 {
if depth == 0 { return 1; }
let moves = generate_legal_moves(pos);
if depth == 1 { return moves.len() as u64; }
moves.iter().map(|m| {
if let Some(undo) = pos.make_move(*m) {
let n = perft(pos, depth - 1);
pos.unmake_move(*m, &undo);
n
} else { 0 }
}).sum()
}
| Position | Depth | Nodes | Status |
|---|---|---|---|
| Starting | 5 | 4,865,609 | ✅ |
| Kiwipete | 4 | 4,085,603 | ✅ |
218 Rust tests: bitboards, attacks, magic validation, move gen, make/unmake, search, TT, Zobrist, game state, perft, tournament runner.
This project is designed with a scaling roadmap from portfolio-scale to planetary-scale. Each tier identifies the bottleneck, the fix, and the infrastructure change.
Framing note: This is a prepared system-design answer demonstrating architectural thinking at each scale boundary. The current build is intentionally Tier 0 to stay shippable as a one-person portfolio project — over-engineering the infrastructure would be the wrong trade-off at this stage.
Current Production (Tier 0 — up to ~100 concurrent):
Single Node.js process on Fly.io shared-cpu-1x (256MB). In-memory Map for game rooms. SQLite on a 1GB persistent volume. All AI runs client-side (WASM). Rate-limited: 100 req/min HTTP, 20 msg/sec WebSocket, 10 connections/IP, 500 room cap. Graceful shutdown with 15-second drain.
Tier 1 (100–1K concurrent):
Bottleneck: Memory exhaustion from 500+ game rooms in Map. SQLite write lock contention.
Fix: Scale to shared-cpu-2x 512MB. Add WAL mode to SQLite. Optimize Map cleanup. Deploy Litestream for continuous DB backup to S3.
Tier 2 (1K–10K concurrent):
Bottleneck: Single-threaded event loop saturates at ~200 WebSocket messages/sec sustained. Single machine = single point of failure.
Fix:
Load Balancer (sticky sessions via cookie)
┌──────────┬──────────┬──────────┐
▼ ▼ ▼ ▼
Server 1 Server 2 Server 3 Server 4
└──────────┴──────────┴──────────┘
│
Redis Pub/Sub (Socket.io adapter)
│
PostgreSQL (write) + Read Replica
Migrate to PostgreSQL with connection pooling (PgBouncer). Redis Pub/Sub for cross-server Socket.io. Separate matchmaker service. CDN for all static assets. Horizontal auto-scale 2–10 machines.
Tier 3 (10K–100K concurrent):
Bottleneck: Matchmaker becomes hot path. PostgreSQL single-writer bottleneck. WebSocket connection distribution uneven across regions.
Fix: Dedicated matchmaker microservice with Redis Streams work queue. Multi-region deployment (US-East, EU-West, APAC). PostgreSQL with Citus for sharding. Game state in Redis (TTL-based expiry). API Gateway for WebSocket routing. Health-check-driven auto-scaling with custom Prometheus alerting.
Tier 4 (100K–10M concurrent):
Bottleneck: Monolithic game server can't specialize. Redis single-instance limits. ELO calculations become bottleneck with millions of concurrent rating updates.
Fix:
Global Load Balancer (GeoDNS)
┌──────────────────────────┐
│ │ │
US-East EU-West APAC
┌────┐ ┌────┐ ┌────┐
│ K8s│ │ K8s│ │ K8s│
└──┬─┘ └──┬─┘ └──┬─┘
│ │ │
┌──▼──────────▼───────────────▼──┐
│ Redis Cluster (sharded) │
└──────────────┬─────────────────┘
│
┌──────────────▼─────────────────┐
│ CockroachDB / Spanner (global) │
└────────────────────────────────┘
Kubernetes with horizontal pod autoscaling. Redis Cluster (16+ shards). ELO updates batched via Apache Kafka event stream → async workers. Game replay storage in object store (S3). Dedicated services: Auth, Matchmaker, GameRoom, ELO, Replay, Analytics. gRPC between services. Circuit breakers (Istio service mesh).
Tier 5 (10M–1B concurrent):
Bottleneck: Database writes at billions of game records/day. Global latency for real-time moves. Cost of always-on infrastructure.
Fix: Event sourcing: games stored as move streams in Kafka, materialized views for queries. CRDT-based game state for conflict-free multi-region writes. Edge compute (Cloudflare Workers / Fly.io Machines) for move validation close to players. Tiered storage: hot (Redis) → warm (PostgreSQL) → cold (S3 Parquet). Cost optimization: spot instances for AI tournament workloads, reserved instances for stateful services.
Tier 6 (1B–10B total registered users):
Bottleneck: You're now operating at planetary scale. The challenge is no longer technical: it's organizational, economic, and regulatory.
Fix: This is the Meta/Google tier. User table sharded by region. Data sovereignty compliance (GDPR, CCPA, etc.). Multi-cloud (AWS + GCP + Azure) for resilience. Custom CDN. Dedicated SRE team. The interesting architectural note: because our AI engine runs client-side in WASM, the compute cost for AI games is always zero regardless of user count. Only multiplayer games cost server resources, and even at 10B users, the concurrent player count is a fraction (typically 1–5%). This means the real scaling target for the server is ~100M–500M concurrent connections, which is achievable with Tier 5 architecture.
Full scaling analysis → docs/PRODUCTION_RESILIENCE.md
Load test methodology → docs/LOAD_TEST_PLAN.md
Bottleneck analysis → Section F1
Now: Server-side move validation, rate limiting.
At scale: Time-per-move analysis (engines are suspiciously consistent), move quality correlation (>90% top-3 match = flagged), ELO volatility (800→2200 in one session = flagged), browser fingerprinting, behavioral analysis (tab-switching, no mouse movement).
Progressive: warning → temp ban → permanent ban.
Pro: One codebase, zero install friction (link → play), 30-second deploys, 97%+ WebGL support, WASM for compute.
Con: 25–40% render penalty vs Metal/Vulkan, higher memory, no native APIs, Safari limitations.
Mitigations: Adaptive quality, PWA, full touch controls. If funded: native renderers sharing Rust engine via static lib/JNI.
| Engine | Role | Strength |
|---|---|---|
| Rust WASM | Primary (fastest) | ~1800 ELO depth 5 |
| Stockfish.js | Strongest backup | ~800–2800 ELO |
| TypeScript | Always works | ~1200 ELO depth 4 |
| Learning AI | Experimental | Varies |
Graceful degradation. User always gets a working AI.
- Three.js IS the framework. 80% canvas. React adds virtual DOM overhead for canvas updates.
- Simple state. One chess position. No nested component rerenders.
- Performance. Direct scene graph updates. O(1) piece moves.
- Bundle. ~400 KB total. React alone = +45 KB.
If UI grew: Solid.js for non-canvas panels. Canvas stays vanilla.
- `initEngine()` at startup
- Fetch JS glue code → blob URL → dynamic import
- `wasm.default(path)` → `WebAssembly.instantiateStreaming` (compile while downloading)
- ~50–100ms load. ~170 KB gzipped.
- If fails → fallback to Stockfish → TypeScript
| Metric | Desktop | Mobile (Pixel 7) | Budget |
|---|---|---|---|
| FPS (mobile mode) | 60 | 50–60 | 30–40 |
| Move gen (WASM) | ~5M pos/s | ~2M pos/s | — |
| Depth 5 search | ~300ms | ~700ms | ~5000ms (JS) |
| Memory | ~80 MB | ~50 MB | ~50 MB |
WASM on mobile runs at roughly 40% of desktop speed (~2M vs ~5M pos/s in the table above). JS fallback = ~10× slower.
R_new = R_old + K × (S - E) where K=32, E = 1/(1 + 10^((R_opp - R)/400))
1200 beats 1500 → expected 15% → new rating: 1227 (+27). Starting ELO: 400. ELO ranges map to 20 eras.
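The worked example above can be reproduced with a short sketch of the standard Elo update. It is written in Rust for consistency with the other examples (the server itself is TypeScript), and the function names are hypothetical.

```rust
/// Expected score of a player rated `rating` against `opponent`.
pub fn expected_score(rating: f64, opponent: f64) -> f64 {
    1.0 / (1.0 + 10f64.powf((opponent - rating) / 400.0))
}

/// New rating after one game.
/// `score` is 1.0 for a win, 0.5 for a draw, 0.0 for a loss; `k` is the K-factor.
pub fn updated_rating(rating: f64, opponent: f64, score: f64, k: f64) -> f64 {
    rating + k * (score - expected_score(rating, opponent))
}
```

With K=32, `updated_rating(1200.0, 1500.0, 1.0, 32.0)` rounds to 1227, and the 1500-rated loser drops by the same 27 points because the two expected scores sum to 1.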
Keep: Rust WASM, bitboards, Three.js, Vite, Socket.io.
Change: Lightweight UI framework (Solid.js), split renderer into SceneManager/CameraController/PieceRenderer, ECS pattern for 3D, type-safe WebSocket messages (tRPC/Zod), PostgreSQL from day one, tapered evaluation.
| Layer | Tool | Count |
|---|---|---|
| Engine | cargo test | 218 |
| Frontend | Vitest | 420 |
| Server | Vitest | 168 |
| E2E (4 suites) | Playwright | 48 |
| Load (HTTP) | k6 | 6 scenarios |
| Load (WebSocket) | k6 | ramp to 200 VUs |
| Stress | k6 | 500 RPS / 250 WS |
Mocked: Three.js (no GPU), chess.js, Socket.io, localStorage.
E2E suites (48 tests):
| Suite | Tests | Focus |
|---|---|---|
| playtest | 13 | Gameplay, visual correctness, stress |
| welcome-dashboard | 18 | Dashboard UI, buttons, stats, dismiss/return |
| classic-mode | 12 | Classic layout, Explore mode, sizing |
| smoke | 5 | Load, AI response, save/load, console audit |
Load testing: 3 k6 scripts validate SLOs under pressure — HTTP API (P95 < 500ms, <5% error rate), WebSocket gameplay simulation (200 concurrent, <2s connect), and stress/breaking point discovery (500 RPS, 250 concurrent WS). See D14 for full methodology.
Priority: Correctness (engine) > Functionality (game) > Reliability (server) > Load (capacity) > Appearance (renderer).
The project includes a standalone 1-million-player AI tournament runner (rust-engine/src/bin/tournament.rs, 866 lines) that exercises the chess engine at scale for statistical analysis and A/B testing.
CLI (clap) → Generate AI Personas → Swiss Pairing → Parallel Games (rayon) → SQLite Results
↑ repeat for N rounds ↓
Each AI player has unique personality traits generated from a seeded RNG:
| Trait | Range | Effect |
|---|---|---|
| `search_depth` | 1–6 | How many plies deep the engine searches |
| `aggression` | 0.0–1.0 | Preference for captures and forward moves |
| `opening_style` | 5 types | First move preference: King's Pawn (e4), Queen's Pawn (d4), English (c4), Réti (Nf3), or Random |
| `blunder_rate` | 0.0–0.15 | Probability of playing a random move instead of the best move |
Standard Swiss-system tournament: players with similar scores are paired each round. This produces statistically meaningful ELO distributions without requiring a full round-robin (which would be O(N²) games for N players).
| Players | Rounds | Total Games | Time (est.) |
|---|---|---|---|
| 1,000 | 10 | 5,000 | ~2 minutes |
| 100,000 | 15 | 750,000 | ~30 minutes |
| 1,000,000 | 20 | 10,000,000 | ~5 hours |
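The game counts in the table follow from one game per pair per round. The sketch below checks that arithmetic and shows a simplified pairing step; it is not the tournament runner's actual algorithm, and it ignores rematch avoidance and color balancing. All names are hypothetical, and scores are stored in half-points to avoid floating-point comparison.

```rust
/// Games needed for a Swiss tournament: one game per pair per round.
pub fn swiss_total_games(players: u64, rounds: u64) -> u64 {
    (players / 2) * rounds
}

/// Games needed for a single round-robin: every pair meets once.
pub fn round_robin_games(players: u64) -> u64 {
    players * players.saturating_sub(1) / 2
}

/// Pair one round: sort by score (descending, ties broken by id) and pair neighbours.
/// Input is (player_id, score_in_half_points). With an odd field, the
/// lowest-ranked player receives a bye, returned as the second element.
pub fn pair_round(mut standings: Vec<(u32, u32)>) -> (Vec<(u32, u32)>, Option<u32>) {
    standings.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    let bye = if standings.len() % 2 == 1 {
        standings.pop().map(|p| p.0)
    } else {
        None
    };
    let pairs = standings.chunks(2).map(|c| (c[0].0, c[1].0)).collect();
    (pairs, bye)
}
```

For 1,000,000 players, 20 Swiss rounds need 10,000,000 games, while a single round-robin would need 499,999,500,000.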
Players are split into two groups:
- Group A (Control): Standard search with no modifications
- Group B (Treatment): Receives "reward bonuses" — evaluation score adjustments that incentivize certain play patterns
Hypothesis: Do reward bonuses produce stronger or weaker players over many games?
Metrics captured per group:
- Mean ELO after N rounds
- Win/loss/draw ratios
- Average game length (moves)
- Blunder frequency
- Opening style effectiveness (win rate by first move)
- Score variance and standard deviation
Statistical analysis: The tournament outputs to SQLite, enabling post-hoc SQL queries:
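-- Compare mean ELO by group (SQLite has no built-in STDDEV; report variance and take the square root in the client)
SELECT group_name, AVG(elo), AVG(elo * elo) - AVG(elo) * AVG(elo) AS elo_variance, COUNT(*) FROM players GROUP BY group_name;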
-- Win rate by opening style
SELECT opening_style,
SUM(wins) * 1.0 / (SUM(wins) + SUM(losses) + SUM(draws)) as win_rate
FROM players GROUP BY opening_style;
-- Search depth vs ELO correlation
SELECT search_depth, AVG(elo) FROM players GROUP BY search_depth ORDER BY search_depth;
cd rust-engine
# Quick test (1K players, ~2 min)
cargo run --release --bin tournament -- --players 1000 --rounds 10
# Full run (1M players, ~5 hours, all cores)
cargo run --release --bin tournament -- --players 1000000 --rounds 20 --threads 0
# With custom seed for reproducibility
cargo run --release --bin tournament -- --players 10000 --rounds 12 --seed 12345 --output results.db
The tournament runner answers questions that direct database and infrastructure design:
- ELO distribution shape → Determines shard key ranges for user partitioning at scale
- Game length distribution → Informs timeout policies and memory budgets per game room
- Blunder rate vs depth → Guides adaptive AI difficulty (how to set difficulty for 10B users with varying skill)
- Opening diversity → Validates that the engine produces interesting games (player retention)
- A/B test methodology → Proves the framework works before testing on real users
Every metric is chosen to answer a specific operational question.
| Metric | Type | Question It Answers |
|---|---|---|
| `chess_connected_players` | Gauge | How many users are online right now? |
| `chess_active_games` | Gauge | How many game rooms are consuming memory? |
| `chess_games_started_total` | Counter | What's our game creation rate? |
| `chess_games_completed_total` | Counter | What's the completion rate? (labeled by result + reason) |
| `chess_queue_length` | Gauge | Are players waiting too long for matches? |
| `chess_queue_wait_seconds` | Histogram | P50/P95/P99 matchmaking wait time |
| `chess_moves_total` | Counter | Total move throughput across all games |
| `chess_move_processing_seconds` | Histogram | Is move validation creating latency? |
| `chess_auth_total` | Counter | Auth attempt rate by type (guest/register/login) and result |
| `chess_errors_total` | Counter | Error rate by code (used for alerting thresholds) |
| `chess_db_query_seconds` | Histogram | Is SQLite becoming a bottleneck? |
| `chess_rate_limit_hits_total` | Counter | Are legitimate users being rate-limited? |
| `chess_ws_rate_limit_total` | Counter | WebSocket abuse detection rate |
| `chess_shutdown_in_progress` | Gauge | Is the server currently draining? (deploy awareness) |
| `chess_process_crashes_total` | Counter | Crash frequency; any value > 0 needs investigation |
| `chess_*` (default) | Various | Node.js process: CPU, memory, event loop lag, GC pause |
chess_connected_players > 150 → Warning: approaching Tier 1 capacity
chess_active_games > 300 → Warning: approaching room limit (500)
chess_db_query_seconds P95 > 1s → SQLite contention: migrate to PostgreSQL
chess_queue_wait_seconds P95 > 30s → Matchmaker bottleneck: needs dedicated service
chess_move_processing_seconds P95 > 500ms → CPU saturation: scale horizontally
event_loop_lag_seconds > 0.1 → Event loop blocking: profile and optimize
| Table | Columns | Purpose |
|---|---|---|
| `players` | id, name, elo, depth, aggression, opening, blunder_rate, group, wins, losses, draws, total_moves, blunders | Per-AI final state and personality |
| `rounds` | round_num, total_games, avg_elo_change, duration_ms | Per-round tournament health |
| `games` | white_id, black_id, result, moves, duration_ms | Individual game replay data |
| `ab_results` | group, mean_elo, stddev, win_rate, avg_game_length | A/B test aggregate statistics |
Seven layers of defense, each protecting against a specific failure class.
Layer 1: Fly.io Edge → TLS termination, DDoS protection, auto-start
Layer 2: Helmet.js → Security headers (HSTS, X-Frame-Options, nosniff). CSP via <meta> + vercel.json
Layer 3: Rate Limiting → 100 req/min HTTP, 20 msg/sec WS, 10 conn/IP
Layer 4: Input Validation → Zod schemas, chess.js move validation, size limits
Layer 5: Resource Protection → 500 room cap, stale cleanup, 16KB body limit
Layer 6: Observability → 16 Prometheus metrics, health check with DB test
Layer 7: Recovery → Graceful shutdown (15s drain), crash handlers, memory alerts
When Fly.io sends SIGTERM (during deploy or scale-down):
- Set `shutdownInProgress = true`: reject new connections with 503
- Send `server_shutdown` event to all connected WebSocket clients
- Wait up to 15 seconds for active connections to drain naturally
- Force-disconnect any remaining sockets
- Run cleanup: clear intervals, disconnect Prisma, clear rate-limit maps
- Exit with code 0
This ensures players get a "server restarting" message instead of a silent disconnect.
- `uncaughtException`: Log full stack trace, increment `chess_process_crashes_total`, exit(1) → Fly.io auto-restarts the container
- `unhandledRejection`: Log warning, increment counter, continue running (non-fatal)
- Memory warning: At 85% heap utilization, log warning for proactive investigation
| Scope | Limit | Window | Action on Exceed |
|---|---|---|---|
| Global HTTP API | 100 requests | 1 minute | 429 + RATE_LIMITED error |
| Auth endpoints | 10 requests | 1 minute | 429 + AUTH_RATE_LIMITED error |
| WebSocket messages | 20 messages | 1 second | Disconnect with RATE_LIMITED |
| Connections per IP | 10 sockets | — | Reject with CONNECTION_LIMIT |
| Game rooms | 500 total | — | Reject with SERVER_FULL |
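One way to picture the per-client limits in the table is a fixed-window counter. This sketch is written in Rust for consistency with the other examples (the server itself is TypeScript), and whether the real limiter uses fixed windows, sliding windows, or a token bucket is an assumption; the type and method names are hypothetical.

```rust
use std::collections::HashMap;

/// Fixed-window rate limiter keyed by client identifier (for example an IP).
pub struct FixedWindowLimiter {
    limit: u32,
    window_ms: u64,
    /// client key -> (window start in ms, requests counted in that window)
    counters: HashMap<String, (u64, u32)>,
}

impl FixedWindowLimiter {
    pub fn new(limit: u32, window_ms: u64) -> Self {
        Self { limit, window_ms, counters: HashMap::new() }
    }

    /// Returns true if a request from `key` at time `now_ms` is allowed.
    /// A rejected request would map to the 429 response in the table.
    pub fn allow(&mut self, key: &str, now_ms: u64) -> bool {
        let entry = self.counters.entry(key.to_string()).or_insert((now_ms, 0));
        if now_ms.saturating_sub(entry.0) >= self.window_ms {
            *entry = (now_ms, 0); // window expired: start a fresh one
        }
        if entry.1 < self.limit {
            entry.1 += 1;
            true
        } else {
            false
        }
    }
}
```

`FixedWindowLimiter::new(100, 60_000)` would model the global HTTP row and `new(20, 1_000)` the WebSocket row. Passing the clock in as `now_ms` keeps the logic deterministic and easy to test.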
Full resilience documentation → docs/PRODUCTION_RESILIENCE.md
Incident response runbook → docs/INCIDENT_RESPONSE.md
| Category | Metric | Target |
|---|---|---|
| Availability | Uptime | 99.5% (monthly) |
| HTTP Latency | P95 | < 500ms |
| HTTP Latency | P99 | < 1,000ms |
| HTTP Errors | Error rate | < 5% |
| WebSocket Connect | P95 | < 2,000ms |
| WebSocket Message | P95 | < 500ms |
| WS Connection Success | Rate | > 90% |
| Script | Pattern | Peak Load | Duration |
|---|---|---|---|
| `http-load-test.js` | Ramp 10→50→100 VUs | 100 concurrent | 5 min |
| `websocket-load-test.js` | Ramp 10→50→200 VUs | 200 concurrent WS | 4 min |
| `stress-test.js` | Arrival rate 10→500 RPS + 250 WS | 500 RPS | 5 min |
HTTP Load Test: 6 scenarios — health check, root endpoint, guest auth, leaderboard, Prometheus metrics, rate limiter verification. Confirms the API stays within SLO under normal traffic.
WebSocket Load Test: Simulates real gameplay — connect, join queue, handle matchmaking, make moves, handle opponent moves. Validates the full game lifecycle under concurrent load.
Stress Test: Pushes past the breaking point. Discovers where the first failure occurs (VU count), measures maximum sustainable RPS, and verifies rate limiters engage correctly under extreme load.
# Install k6 (one-time)
winget install k6 # Windows
brew install k6 # macOS
# Run against production
k6 run load-tests/http-load-test.js
k6 run load-tests/websocket-load-test.js
k6 run load-tests/stress-test.js
# Run against local dev server
BASE_URL=http://localhost:3001 k6 run load-tests/http-load-test.js
WS_URL=ws://localhost:3001 k6 run load-tests/websocket-load-test.js
Full methodology → docs/LOAD_TEST_PLAN.md
├── src/ # Frontend TypeScript (40+ files)
│ ├── main-3d.ts # Entry point, DOM wiring (1,626 lines)
│ ├── gameController.ts # Core game logic (1,900+ lines)
│ ├── renderer3d.ts # Three.js 3D rendering (5,000+ lines)
│ ├── classicMode.ts # Classic Mode toggle + GFX quality presets (117 lines)
│ ├── themeSystem.ts # 8 UI themes, CSS variable theming (283 lines)
│ ├── pieceStyles.ts # 24 piece style definitions (7 3D + 17 2D)
│ ├── boardStyles.ts # 12 board styles with theme-aware highlights
│ ├── eraSystem.ts # ELO → era progression (20 eras)
│ ├── eraWorlds.ts # 3D environment builder + Lorenz particles (1,157 lines)
│ ├── assetMutator.ts # L-system procedural tree generator (1,204 lines)
│ ├── dynamicLighting.ts # Per-era lighting configs (1,149 lines)
│ ├── proceduralSkybox.ts # Procedural sky, stars, gradients (460 lines)
│ ├── chessEngine.ts # chess.js wrapper engine
│ ├── rustEngine.ts # WASM bridge to Rust
│ ├── stockfishEngine.ts # Stockfish.js Worker wrapper
│ ├── aiService.ts # AI fallback chain orchestrator
│ ├── overlayRenderer.ts # Overlay bar UI controls
│ ├── moveListUI.ts # Move history panel
│ ├── moveQualityAnalyzer.ts # Move quality evaluation
│ ├── multiplayerClient.ts # Socket.io client wrapper
│ ├── multiplayerUI.ts # Multiplayer + guest play UI
│ ├── eras/ # 10 era-specific world definitions
│ └── ... # Sound, save, stats, themes, newspaper articles
│
├── rust-engine/ # Rust chess engine → WASM
│ └── src/
│ ├── lib.rs # WASM entry points + GameState
│ ├── search.rs # Alpha-beta with TT, NMP, LMR
│ ├── movegen.rs # Legal move generation
│ ├── eval.rs # Material + PST evaluation
│ ├── magic.rs # Magic bitboard tables
│ ├── attacks.rs # Precomputed attack tables
│ ├── bitboard.rs # 64-bit board representation
│ ├── position.rs # Board state + make/unmake
│ ├── types.rs # Piece, Square, Move encoding
│ └── bin/
│ └── tournament.rs # 1M AI tournament runner (866 lines)
│
├── server/ # Multiplayer backend
│ ├── src/
│ │ ├── index.ts # Express + Socket.io (1,020 lines)
│ │ ├── resilience.ts # Graceful shutdown, crash recovery, rate limiting
│ │ ├── metrics.ts # 16 Prometheus metrics
│ │ ├── GameRoom.ts # Game session management
│ │ ├── Matchmaker.ts # Ranked queue + pairing
│ │ ├── auth.ts # JWT authentication
│ │ ├── database.ts # Prisma service layer
│ │ └── protocol.ts # Zod message schemas
│ ├── prisma/schema.prisma # Player + Game models
│ ├── Dockerfile # Multi-stage production build
│ └── fly.toml # Fly.io deployment config
│
├── load-tests/ # k6 load testing suite
│ ├── http-load-test.js # HTTP API: 6 scenarios, ramp to 100 VUs
│ ├── websocket-load-test.js # WebSocket: gameplay sim, 200 concurrent
│ └── stress-test.js # Breaking point: 500 RPS, 250 WS connections
│
├── tests/ # Frontend test suite (420 tests)
├── e2e/ # Playwright E2E tests (48 tests, 4 suites)
│ ├── playtest.spec.ts # Gameplay + visual correctness + stress (13 tests)
│ ├── welcome-dashboard.spec.ts # Dashboard UI, buttons, stats, dismiss (18 tests)
│ ├── classic-mode.spec.ts # Classic layout toggle, Explore mode (12 tests)
│ └── smoke.spec.ts # Load, AI, save/load, console audit (5 tests)
├── public/wasm/ # Pre-built WASM binary
├── docs/ # Documentation
│ ├── PART1_SUMMARY.md # Standalone Part 1
│ ├── PART2_TECH_STACK.md # Standalone Part 2
│ ├── PART3_QUICK_START.md # Standalone Part 3
│ ├── PART4_FULL_TUTORIAL.md # Standalone Part 4
│ ├── SCOPE.md # MVP definition, non-goals, invariants, perf floors
│ ├── REQUIREMENTS.md # MUST/SHOULD/MAY requirements (RFC 2119)
│ ├── ACCEPTANCE_TESTS.md # Requirements → verification mapping (42 tests)
│ ├── DEFINITION_OF_DONE.md # Per-change quality checklist
│ ├── RELEASE_CHECKLIST.md # Pre-deploy verification steps
│ ├── INCIDENT_RESPONSE.md # P0-P3 incident runbook
│ ├── LOAD_TEST_PLAN.md # k6 methodology, SLOs, capacity planning
│ ├── PRODUCTION_RESILIENCE.md # Defense-in-depth, failure modes, SLOs
│ ├── ARCHITECTURE_FAQ.md # "Why X over Y?" for every decision
│ ├── adr/ # Architecture Decision Records
│ └── blog/ # Blog post drafts
├── TESTING.md # Playtest agent docs — bugs found, 13 tests, architecture
├── CHANGELOG.md # Keep-a-Changelog format — all releases and unreleased changes
├── ANDROID_RELEASE.md # Google Play Store release guide (Capacitor)
├── .github/ISSUE_TEMPLATE/ # Scope-first change template (scope, acceptance, rollback)
└── index.html # Single-page app entry (2,200+ lines)
Every system has a bottleneck at every scale. The goal is to know what breaks next before it breaks.
| Concurrent Users | First Bottleneck | Second Bottleneck | Symptom | Detection Metric |
|---|---|---|---|---|
| 50–100 | Memory (256MB) | JS event loop | Slow responses, OOM | process_resident_memory_bytes |
| 100–500 | SQLite write lock | Game room Map growth | Auth/leaderboard timeout | chess_db_query_seconds P95 |
| 500–2K | Single-core CPU | WebSocket throughput | Event loop lag > 100ms | nodejs_eventloop_lag_seconds |
| 2K–10K | Single machine | No failover | Total outage on crash | chess_process_crashes_total |
| 10K–100K | Matchmaker latency | PostgreSQL connections | Queue wait > 30s | chess_queue_wait_seconds P95 |
| 100K–1M | Redis memory | Cross-region latency | Stale game state | Redis used_memory, RTT |
| 1M–100M | DB write throughput | Global consistency | Write conflicts | Kafka consumer lag |
| 100M–10B | Organizational complexity | Regulatory compliance | Feature velocity drops | Deployment frequency |
Interviewers ask "how would you scale this?" The correct answer isn't just "add more servers." It's:
- Identify the bottleneck at the current scale
- Explain what metric tells you it's happening
- Describe the fix and what it costs (complexity, money, latency)
- Predict the next bottleneck after the fix
This table is that answer, pre-computed.
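One way to read the table is as a lookup from current load to the bottleneck to watch and the one that follows it. The sketch below transcribes the rows into TypeScript; the `BottleneckRow` type and the `bottleneckAt` function are illustrative names and are not part of the repository.

```typescript
// Rows transcribed from the bottleneck table above.
// BottleneckRow and bottleneckAt are hypothetical names used for illustration.
interface BottleneckRow {
  maxUsers: number;        // upper bound of the concurrent-user range
  first: string;           // first bottleneck at this scale
  detectionMetric: string; // metric that reveals it
}

const BOTTLENECKS: BottleneckRow[] = [
  { maxUsers: 1e2, first: "Memory (256MB)", detectionMetric: "process_resident_memory_bytes" },
  { maxUsers: 5e2, first: "SQLite write lock", detectionMetric: "chess_db_query_seconds P95" },
  { maxUsers: 2e3, first: "Single-core CPU", detectionMetric: "nodejs_eventloop_lag_seconds" },
  { maxUsers: 1e4, first: "Single machine", detectionMetric: "chess_process_crashes_total" },
  { maxUsers: 1e5, first: "Matchmaker latency", detectionMetric: "chess_queue_wait_seconds P95" },
  { maxUsers: 1e6, first: "Redis memory", detectionMetric: "Redis used_memory, RTT" },
  { maxUsers: 1e8, first: "DB write throughput", detectionMetric: "Kafka consumer lag" },
  { maxUsers: 1e10, first: "Organizational complexity", detectionMetric: "Deployment frequency" },
];

// Return the row for the current load and the row that applies after the fix.
// Range endpoints overlap in the table (e.g. 100); this sketch assigns them
// to the lower range. Loads beyond the last row map to the last row.
function bottleneckAt(concurrentUsers: number): { current: BottleneckRow; next?: BottleneckRow } {
  const found = BOTTLENECKS.findIndex((row) => concurrentUsers <= row.maxUsers);
  const idx = found === -1 ? BOTTLENECKS.length - 1 : found;
  return { current: BOTTLENECKS[idx], next: BOTTLENECKS[idx + 1] };
}
```

For example, `bottleneckAt(300)` returns the SQLite write lock as the current bottleneck and single-core CPU as the next one.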
A detailed infrastructure plan at each order of magnitude, with cost estimates and architectural notes.
Cost: ~$0–6/month (Fly.io auto-stop, Vercel free tier)
Stack: Single Node.js + SQLite + Vercel CDN
Key insight: AI runs client-side (WASM), so AI games cost $0 in server resources.
| Component | Spec | Cost |
|---|---|---|
| Frontend | Vercel free tier | $0 |
| Backend | Fly.io shared-cpu-1x, 256MB, auto-stop | $0–6/mo |
| Database | SQLite on 1GB volume | Included |
| AI Engine | Client-side WASM | $0 |
Cost: ~$15–30/month
Change: Bigger instance, SQLite WAL, Litestream backups
New bottleneck to watch: SQLite write lock contention
Cost: ~$100–300/month
Change: PostgreSQL (Neon/Supabase), Redis, 2–4 server instances, load balancer
New bottleneck: Matchmaker becomes a hot service
Cost: ~$1,000–5,000/month
Change: Multi-region, Kubernetes, dedicated matchmaker, PostgreSQL read replicas
New bottleneck: Cross-region game state consistency
Cost: ~$10,000–100,000/month
Change: CockroachDB/Spanner, Redis Cluster, Kafka event bus, microservices
New bottleneck: Organizational — single team can't own all services
Cost: ~$500,000–5,000,000/month
Change: Event sourcing, CRDT game state, edge compute, tiered storage
New bottleneck: Regulatory (GDPR, data sovereignty per region)
Cost: $10M+/month
Context: ~50M–500M peak concurrent (1–5% of registered users)
Key architectural advantage: AI is client-side, so 10B single-player sessions = $0 server cost.
Only multiplayer sessions require server resources.
This architecture has a unique property: the most expensive computation (chess AI at ~5M positions/sec) runs entirely in the user's browser via WASM. This means:
- 1 trillion single-player games/year = $0 server cost
- Server only scales with multiplayer games
- At 10B users, if 1% play multiplayer simultaneously, that's 100M concurrent — which is Tier 5 architecture
- The remaining 99% of users are playing against WASM AI with zero server involvement
This is why the architecture was designed with browser-side AI from the start.
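The arithmetic behind these bullets is small enough to write down. The sketch below assumes server load is proportional only to the share of users in multiplayer at a given moment; `serverSideSessions` is a hypothetical helper, not code from this project.

```typescript
// Hypothetical helper: how many concurrent sessions reach the server at all.
// Assumption: single-player sessions run against the WASM engine in the
// browser and therefore never appear in this count.
function serverSideSessions(totalUsers: number, multiplayerPercent: number): number {
  if (multiplayerPercent < 0 || multiplayerPercent > 100) {
    throw new RangeError("multiplayerPercent must be between 0 and 100");
  }
  return (totalUsers * multiplayerPercent) / 100;
}

const totalUsers = 1e10; // 10B users, the figure used above
const multiplayer = serverSideSessions(totalUsers, 1); // 1% in multiplayer
const clientOnly = totalUsers - multiplayer;           // everyone else

console.log(multiplayer); // 100000000 (100M concurrent multiplayer sessions)
console.log(clientOnly);  // 9900000000 (sessions with no server involvement)
```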
Browser               Server                   Analytics
┌───────────┐         ┌──────────────┐         ┌─────────────┐
│ Game play │────────→│ Socket.io    │────────→│ Prometheus  │
│ events    │   WS    │ handlers     │ metrics │ /metrics    │
└───────────┘         │              │         └──────┬──────┘
                      │ ┌──────────┐ │                │
                      │ │  Prisma  │─┼───────→ SQLite (games, users, ELO)
                      │ └──────────┘ │                │
                      └──────────────┘                ▼
                                               Grafana dashboards
                                               k6 load test reports

Tournament Runner                              Tournament SQLite DB
┌──────────────┐
│ 1M AI games  │─────────────────────────────→ analytics.db
│ A/B testing  │                               (personas, rounds, games, ab_results)
└──────────────┘
| Statistic | Decision It Drives |
|---|---|
| `chess_connected_players` trend | When to scale up (>150 warning, >250 critical) |
| `chess_queue_wait_seconds` P95 | Whether matchmaker needs optimization or dedicated service |
| `chess_db_query_seconds` P95 | When to migrate from SQLite to PostgreSQL |
| `chess_games_completed_total` by reason | Whether games end naturally (checkmate) or abnormally (disconnect) |
| `chess_rate_limit_hits_total` rate | Whether rate limits are too aggressive (false positives) or too lenient (abuse) |
| `chess_errors_total` by code | Which error paths need hardening |
| `nodejs_eventloop_lag_seconds` | Whether the server is CPU-bound and needs horizontal scaling |
| `process_heap_used_bytes` | Memory leak detection; when to increase instance RAM |
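The first row of the table is the only one that states explicit thresholds, so it is the easiest to turn into code. A minimal sketch, assuming the reading comes from the `chess_connected_players` gauge; the `Severity` type and `classifyConnectedPlayers` function are illustrative names, and how the project actually alerts on this metric is not shown here.

```typescript
// Illustrative names; thresholds are the ones stated in the table:
// more than 150 connected players is a warning, more than 250 is critical.
type Severity = "ok" | "warning" | "critical";

function classifyConnectedPlayers(connected: number): Severity {
  if (connected > 250) return "critical";
  if (connected > 150) return "warning";
  return "ok";
}

console.log(classifyConnectedPlayers(120)); // "ok"
console.log(classifyConnectedPlayers(180)); // "warning"
console.log(classifyConnectedPlayers(300)); // "critical"
```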
| Statistic | Design Question It Answers |
|---|---|
| ELO distribution by group (A vs B) | Do reward bonuses improve play quality? |
| Win rate by `search_depth` | What depth range provides the most interesting games? |
| Win rate by `opening_style` | Are certain openings overpowered in our engine? (engine bug indicator) |
| Average game length | How much memory/time should we budget per game room? |
| Blunder rate vs ELO correlation | Does blunder rate map linearly to ELO? (difficulty tuning) |
| Games per round timing | How long does the engine take per game? (performance regression detection) |
| Score variance per round | Is the Swiss pairing producing fair matchups? |
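The blunder-rate question in the table is a question about linear association, which a Pearson correlation coefficient measures directly. The sketch below is one way to compute it; the `pearson` function and the sample numbers are made up for illustration and are not results from the tournament database.

```typescript
// Pearson correlation coefficient between two equally long series.
// A value near -1 or +1 suggests a roughly linear relationship.
function pearson(xs: number[], ys: number[]): number {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("need two series of equal length with at least 2 points");
  }
  const n = xs.length;
  const meanX = xs.reduce((sum, x) => sum + x, 0) / n;
  const meanY = ys.reduce((sum, y) => sum + y, 0) / n;

  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  if (varX === 0 || varY === 0) {
    throw new Error("correlation is undefined for a constant series");
  }
  return cov / Math.sqrt(varX * varY);
}

// Made-up sample: persona ELO vs blunders per 100 moves.
const elo = [800, 1200, 1600, 2000];
const blundersPer100Moves = [20, 15, 10, 5];
console.log(pearson(elo, blundersPer100Moves)); // -1 for this perfectly linear sample
```

A coefficient well away from ±1 on real tournament data would suggest that blunder rate alone is a poor knob for difficulty tuning.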
| k6 Metric | Capacity Decision |
|---|---|
| HTTP P95 latency at 50 VUs | Baseline — our SLO target (< 500ms) |
| HTTP P95 latency at 100 VUs | Are we within SLO under 2× normal load? |
| First HTTP failure VU count | Maximum safe concurrent users |
| WS connection success rate at 200 | Can we handle our target concurrent player count? |
| Stress test breaking-point VU | Absolute server capacity ceiling |
| Rate limit trigger count | Are our rate limits calibrated correctly? |
| Time to first byte at peak load | CDN/edge performance under pressure |
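Several rows above depend on a P95 latency compared against the 500 ms SLO. k6 reports percentiles itself, so the sketch below is only meant to make the comparison concrete. It uses the nearest-rank definition, which may differ slightly from k6's own calculation, and `percentile` and `meetsSlo` are hypothetical helper names.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// SLO check from the table: HTTP P95 latency must stay below 500 ms.
function meetsSlo(latenciesMs: number[], sloMs = 500): boolean {
  return percentile(latenciesMs, 95) < sloMs;
}

// Made-up samples: 19 fast requests and one slow outlier.
const latencies = [...Array(19).fill(100), 900];
console.log(percentile(latencies, 95)); // 100
console.log(meetsSlo(latencies));       // true
```

With 20 samples, a single slow request falls outside the 95th percentile, but a second one would push P95 to 900 ms and fail the check.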
| Document | Purpose | Audience |
|---|---|---|
| README.md | Everything — summary through deep dive | Everyone |
| docs/PART1_SUMMARY.md | 30-second project summary | Hiring managers |
| docs/PART2_TECH_STACK.md | Architecture and stack decisions | Senior engineers |
| docs/PART3_QUICK_START.md | Clone, install, run in 2 minutes | Developers |
| docs/PART4_FULL_TUTORIAL.md | Complete engine manual + system design | Learners |
| docs/SCOPE.md | MVP definition, non-goals, invariants, performance floors | Anyone scoping changes |
| docs/REQUIREMENTS.md | MUST/SHOULD/MAY requirements (RFC 2119) | Reviewers / testers |
| docs/ACCEPTANCE_TESTS.md | Every requirement → exact verification command | QA / CI |
| docs/DEFINITION_OF_DONE.md | Per-change quality checklist | Contributors |
| docs/RELEASE_CHECKLIST.md | Pre-deploy verification steps (~5 min) | Release engineers |
| docs/PRODUCTION_RESILIENCE.md | SLOs, defense-in-depth, failure modes | SRE / DevOps |
| docs/LOAD_TEST_PLAN.md | k6 methodology, capacity planning, CI integration | Performance engineers |
| docs/INCIDENT_RESPONSE.md | P0–P3 runbook, diagnostic commands, rollback | On-call engineers |
| docs/ARCHITECTURE_FAQ.md | "Why did you choose X?" — every architectural trade-off explained | Staff+ interviewers |
| CHANGELOG.md | All releases in Keep-a-Changelog format | Anyone tracking changes |
| TESTING.md | Playtest agent — bugs found, 13 E2E tests, architecture | QA / developers |
| ANDROID_RELEASE.md | Google Play Store release guide (Capacitor) | Mobile developers |
Built with Rust, TypeScript, and Three.js. 806 unit tests + 48 E2E Playwright tests across 4 suites. 3 k6 load test suites. 1-million-AI tournament runner. Zero frameworks. One <canvas>.