Status: Accepted
Date: 2026-03-24
Supersedes: ADR: Custom PySide6 Game Client (2026-03-22)
Design Reference: Neal Stephenson, The Diamond Age — The Young Lady's Illustrated Primer (unchanged)
The PySide6 custom client (ADR 2026-03-22) was built over 10 days and is functional: WebSocket connection, QWebEngineView narrative rendering, local TTS/music/SFX, genre CSS theming. However, development friction is compounding:
- CSS fights. QWebEngineView embeds Chromium, but Qt's widget layer interferes with styling. Style inheritance leaks between Qt widgets and the embedded web view, and scoping issues that would be trivial in a real browser require workarounds in Qt.
- Component model mismatch. The developer's expertise is React — composable components, hooks, declarative state management. PySide6's signal/slot system and QWidget hierarchy require a fundamentally different mental model. Every UI interaction is slower to build, harder to debug, and more fragile than it would be in React.
- QWebEngineView is a browser in disguise. The narrative rendering surface is already HTML + CSS + JavaScript running in Chromium. PySide6 is a desktop shell wrapping a browser engine. The question becomes: why maintain the shell?
- Distribution friction. PySide6 requires Python 3.11+, torch, PySide6, pygame-ce, sounddevice, mlx-whisper — a complex dependency tree that must be frozen into PyInstaller builds per platform. A web client requires a URL.
- The WebSocket protocol is transport-agnostic. The existing message protocol (GameMessage, the MessageType enum, JSON payloads) works identically whether the client is PySide6, React, or curl. The server doesn't know or care what renders the narration.
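To illustrate the transport-agnostic claim, here is a minimal sketch of the message envelope. The enum members and field names are illustrative assumptions; only the GameMessage and MessageType names come from the existing protocol.

```typescript
// Hypothetical sketch of the wire format: a JSON envelope keyed by type.
// The specific enum values and payload shape are assumptions.
enum MessageType {
  NARRATION = "narration",
  PLAYER_ACTION = "player_action",
}

interface GameMessage {
  type: MessageType;
  payload: Record<string, unknown>;
}

// Any transport that delivers JSON text can carry this —
// a PySide6 socket, a browser WebSocket, or curl piping stdin.
function encode(msg: GameMessage): string {
  return JSON.stringify(msg);
}

function decode(raw: string): GameMessage {
  return JSON.parse(raw) as GameMessage;
}
```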
What the PySide6 ADR got right (and this ADR preserves):
- Don't touch the Orchestrator
- Silent asymmetry is absolute
- Voice-only player communication
- The Primer metaphor (illustrated book, not app chrome)
- Genre pack CSS theming
What changed: The developer's assessment of where their time is best spent. PySide6 is learnable but not leverageable. React is mastered and productive.
Replace the PySide6 desktop client with a React web application. Move TTS synthesis and Whisper STT to the server. Keep WebRTC peer-to-peer voice chat in the browser. Preserve the existing WebSocket protocol, game server, and all game logic unchanged.
Browser (React + Vite)
├── React UI
│ ├── NarrativeView (HTML narrative flow, real CSS, inline images)
│ ├── PartyPanel (character portraits, HP, status)
│ ├── InputBar (action text + aside notes)
│ ├── TranscriptPreview (STT confirmation before submit)
│ └── AudioStatus (now-playing, volume controls)
├── WebSocket client (native WebSocket API)
├── Web Audio API
│ ├── Music channel (crossfade, genre tracks)
│ ├── SFX channel (combat, environment)
│ └── Voice channel (narrator TTS playback)
├── Local Whisper STT (Transformers.js + WebGPU)
│ └── whisper-tiny.en — 75MB cached in IndexedDB
├── MediaStream API (mic capture for PTT)
└── WebRTC (peer-to-peer group voice)
│
↓ WebSocket (JSON text + binary TTS audio)
│ (no raw voice audio sent — STT is local)
│
Game Server (M3 Max, aiohttp — unchanged)
├── Static file serving (React build + genre assets)
├── WebSocket handler (GameServer dispatch — unchanged)
├── MultiplayerSession (unchanged)
├── Orchestrator (unchanged)
├── Server-side Kokoro TTS → stream opus/PCM chunks to client
├── WebRTC signaling relay (unchanged)
└── Renderer daemon connection (Flux, ACE-Step — unchanged)
Same as before. The Orchestrator, MultiplayerSession, PerceptionRewriter, TurnManager, MessageComposer — all unchanged. The React client connects to the same WebSocket endpoint and speaks the same GameMessage protocol.
Each browser receives only its player's narration stream. No UI indicators of private information. Unchanged from the PySide6 ADR.
The audio pipeline splits by direction:
- STT (player → game): Local in browser. Transformers.js v3 runs whisper-tiny.en with WebGPU acceleration directly in the browser. Audio never leaves the client. The player holds PTT, speaks, releases, sees a transcript preview in ~1-1.7 seconds, edits if needed, confirms with Enter. Only the confirmed text is sent to the server as a PLAYER_ACTION — identical to typing. The server cannot distinguish voice input from keyboard input.
- TTS (game → player): Server-side. The server runs Kokoro and streams audio chunks (opus or PCM) over WebSocket as binary frames. The client plays them via the Web Audio API. The server already has a TTSWorker in the renderer daemon — this extends it to handle per-player synthesis.
- Music/SFX: Server serves genre pack audio files via HTTP static. Client fetches and plays via Web Audio API. Same mood/SFX cue protocol over WebSocket.
Why local STT works:
- Transformers.js v3 with whisper-tiny.en + WebGPU: ~1.1-1.7s for 10-second clip on M3 Max
- Model is 75MB, cached in IndexedDB after first download
- WebGPU ships in all major browsers (Chrome, Firefox, Safari, Edge) as of late 2025
- Fallback: WebAssembly CPU execution (~2-3s, acceptable with transcript preview UX)
- Raw audio never leaves the browser — privacy by architecture
- No server STT load, no binary audio WebSocket frames, simpler protocol
Why server TTS works at scale (4-5 players):
- Kokoro synthesis is fast (~80ms per utterance on M3 Max)
- Narration is sequential — only one player's narration synthesizes at a time
- Music/SFX are pre-rendered files, not generated on the fly
- The M3 Max has 128GB RAM and 80GB ML budget — headroom for concurrent TTS
The push-to-talk → local Whisper → transcript preview → confirm → submit pipeline has inherent latency from STT processing (~1-1.7s). This latency is consumed by the UX: the player reads their transcript, edits if needed, and confirms. The confirmation step prevents misheard commands ("I said attack, not attract!"). Instant STT would actually require an artificial pause for the same UX. The processing time is invisible because it's filled with useful interaction.
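The confirm step can be sketched as a tiny state machine. This is an illustration only — PreviewState and the handler names are assumptions, not the actual TranscriptPreview implementation.

```typescript
// Sketch: the transcript-preview confirm/edit flow as a pure state machine.
// Names (PreviewState, onTranscriptReady, onKey) are hypothetical.
type PreviewState =
  | { phase: "idle" }
  | { phase: "reviewing"; text: string };

// A finished local transcription opens the preview for review.
function onTranscriptReady(text: string): PreviewState {
  return { phase: "reviewing", text };
}

// Enter submits the (possibly edited) text; Escape discards it.
function onKey(
  state: PreviewState,
  key: string,
  send: (text: string) => void
): PreviewState {
  if (state.phase !== "reviewing") return state;
  if (key === "Enter") {
    send(state.text); // becomes an ordinary PLAYER_ACTION on the wire
    return { phase: "idle" };
  }
  if (key === "Escape") return { phase: "idle" };
  return state;
}
```

Because submission is an explicit transition rather than an automatic timeout, the STT latency simply overlaps with the player reading the preview.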
WebRTC peer-to-peer voice in the browser. No text chat between players. Text input is for game actions and Narrator notes only.
client/
├── src/
│ ├── App.tsx # Root layout, WebSocket provider
│ ├── components/
│ │ ├── NarrativeView.tsx # Scrolling narrative with inline images
│ │ ├── PartyPanel.tsx # Character cards, HP, status effects
│ │ ├── InputBar.tsx # Action input + aside toggle
│ │ ├── TranscriptPreview.tsx # STT confirmation overlay
│ │ ├── AudioStatus.tsx # Now-playing bar, volume sliders
│ │ ├── ConnectScreen.tsx # Server URL + player name
│ │ └── CharacterCreation/ # Multi-step character builder
│ ├── hooks/
│ │ ├── useWebSocket.ts # WebSocket connection + reconnect
│ │ ├── useAudio.ts # Web Audio API: music, SFX, voice
│ │ ├── useVoiceChat.ts # WebRTC peer mesh
│ │ ├── usePushToTalk.ts # PTT key binding + mic routing
│ │ └── useGameState.ts # Client-side state from server messages
│ ├── audio/
│ │ ├── AudioEngine.ts # Web Audio API wrapper (3 channels)
│ │ ├── Crossfader.ts # Music crossfade (GainNode ramps)
│ │ └── Ducker.ts # TTS→music ducking policy
│ ├── protocol/
│ │ └── messages.ts # MessageType enum, GameMessage type
│ └── theme/
│ └── GenreTheme.ts # CSS variable injection from server
├── public/
│ └── index.html
├── package.json
├── vite.config.ts
└── tsconfig.json
The existing GameMessage protocol is preserved with one addition:
New message type:
- VOICE_AUDIO (server → client) — binary opus/PCM frames for TTS playback

Removed (not needed — STT is local):
- STT_AUDIO — audio never leaves the client
- STT_TRANSCRIPT — transcript is generated locally and submitted as PLAYER_ACTION
All existing message types (NARRATION, IMAGE, AUDIO_CUE, VOICE_TEXT, TURN_STATUS, etc.) remain unchanged. The React client handles them identically to the PySide6 client. Voice input arrives as PLAYER_ACTION — indistinguishable from typed input.
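How the client might fold these types into state can be sketched as a pure reducer. The state shape, payload fields, and the reduceMessage name are assumptions for illustration; only the message type names come from the protocol.

```typescript
// Sketch: server messages fold into client state via a pure function,
// so the same logic can back a useGameState hook and plain unit tests.
type GameMessage = { type: string; payload: Record<string, any> };

interface ClientState {
  narration: string[]; // hypothetical: accumulated narration passages
  images: string[];    // hypothetical: inline image URLs
}

function reduceMessage(state: ClientState, msg: GameMessage): ClientState {
  switch (msg.type) {
    case "NARRATION":
      return { ...state, narration: [...state.narration, msg.payload.text] };
    case "IMAGE":
      return { ...state, images: [...state.images, msg.payload.url] };
    default:
      return state; // unknown types are ignored, not fatal
  }
}
```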
```python
# In server dispatch, when sending VOICE_TEXT:
async def handle_voice_text(self, player_id: str, segments: list[VoiceSegment]):
    """Synthesize TTS on server, stream audio to client."""
    for segment in segments:
        audio_bytes = await self.tts_engine.synthesize(
            text=segment.text,
            preset=segment.preset,
        )
        await self.send_binary(player_id, MessageType.VOICE_AUDIO, audio_bytes)
```

This replaces the current flow, where VOICE_TEXT sends text and the client synthesizes locally. The server now sends audio data. The Kokoro/Piper engines move from sidequest/client/local_audio.py to a server-side TTS service.
```typescript
// useWhisper.ts — local STT, audio never leaves browser
import { pipeline } from "@huggingface/transformers";

class LocalTranscriber {
  private transcriber: any;

  async initialize() {
    const device = navigator.gpu ? "webgpu" : "wasm";
    this.transcriber = await pipeline(
      "automatic-speech-recognition",
      "onnx-community/whisper-tiny.en",
      { device, dtype: device === "webgpu" ? "q4" : "q8" }
    );
    // Model cached in IndexedDB — 75MB one-time download
  }

  async transcribe(audioBuffer: Float32Array): Promise<string> {
    const result = await this.transcriber(audioBuffer);
    return result.text; // ~1.1-1.7s on M3 Max with WebGPU
  }
}
```

No server round-trip. No binary audio frames on the wire. The transcript is displayed in TranscriptPreview for player confirmation. On Enter, it becomes a PLAYER_ACTION — the server sees text, not audio.
```typescript
// usePushToTalk.ts — local STT + WebRTC mic muting
import { useRef } from "react";

function usePushToTalk(
  transcriber: LocalTranscriber,
  voiceChat: RTCPeerConnection,
  onTranscript: (text: string) => void
) {
  const mediaStream = useRef<MediaStream | null>(null); // mic stream, captured elsewhere
  const audioChunks = useRef<Blob[]>([]);
  const recorder = useRef<MediaRecorder>();

  const onKeyDown = (e: KeyboardEvent) => {
    if (e.code !== "Space" || e.repeat) return;
    // Mute WebRTC track (cut from group voice)
    voiceChat.getSenders().forEach((s) => {
      if (s.track) s.track.enabled = false;
    });
    // Start recording locally
    audioChunks.current = [];
    recorder.current = new MediaRecorder(mediaStream.current!);
    recorder.current.ondataavailable = (e) => audioChunks.current.push(e.data);
    recorder.current.start(250);
  };

  const onKeyUp = (e: KeyboardEvent) => {
    if (e.code !== "Space" || !recorder.current) return;
    // The final dataavailable fires before onstop, so assemble the blob there
    recorder.current.onstop = async () => {
      // Unmute WebRTC track (rejoin group voice)
      voiceChat.getSenders().forEach((s) => {
        if (s.track) s.track.enabled = true;
      });
      // Transcribe locally — audio stays in browser
      const audioBlob = new Blob(audioChunks.current);
      const audioBuffer = await decodeAudioToFloat32(audioBlob); // helper defined elsewhere
      const transcript = await transcriber.transcribe(audioBuffer);
      onTranscript(transcript); // Show in TranscriptPreview
    };
    recorder.current.stop();
  };
}
```

```typescript
// AudioEngine.ts — replaces pygame AudioMixer
class AudioEngine {
  private ctx: AudioContext;
  private channels: Map<string, GainNode>; // music, sfx, voice

  async playMusic(url: string, fadeMs: number = 3000) {
    const buffer = await this.fetchAndDecode(url);
    const source = this.ctx.createBufferSource();
    source.buffer = buffer;
    // Crossfade: ramp old down, new up over fadeMs (starts the source)
    this.crossfade(this.channels.get("music")!, source, fadeMs);
  }

  async playVoice(audioData: ArrayBuffer) {
    const buffer = await this.ctx.decodeAudioData(audioData);
    const source = this.ctx.createBufferSource();
    source.buffer = buffer;
    source.connect(this.channels.get("voice")!);
    // Duck music while voice plays
    this.duck("music", 0.3, 200);
    source.onended = () => this.unduck("music", 500);
    source.start();
  }
}
```

```typescript
// Server sends genre theme CSS URL on connect.
// Client loads it as a <link> tag — real CSS, no Qt bridge.
useEffect(() => {
  if (!genreThemeUrl) return;
  const link = document.createElement("link");
  link.rel = "stylesheet";
  link.href = genreThemeUrl;
  document.head.appendChild(link);
  // Remove the previous theme when the genre changes
  return () => link.remove();
}, [genreThemeUrl]);
```

Genre packs already have client_theme.css files. The server serves them as static files. The client loads them as real stylesheets. No runJavaScript() bridge, no Qt style leaking.
The game server adds a static file route for genre pack assets:
```python
# In server setup
app.router.add_static('/assets/', path='genre_packs/', name='assets')
app.router.add_static('/client/', path='client/dist/', name='client')
```

Genre fonts, images, audio tracks, and CSS are served over HTTP. The React client fetches them with standard <link>, <img>, and fetch() calls.
| File | Change | Why |
|---|---|---|
| sidequest/server/app.py | Add static file routes, TTS streaming handler | Serve React build + genre assets, stream TTS audio |
| sidequest/server/protocol.py | Add VOICE_AUDIO type | Binary TTS audio frames (server → client only) |
| sidequest/server/app.py | Add server-side Kokoro TTS service | Move TTS synthesis from client to server |
Orchestrator: Unchanged. MultiplayerSession: Unchanged. Game logic: Unchanged. Renderer daemon: Unchanged. Genre pack system: Unchanged (gains real CSS delivery).
| Existing Component | Reuse Strategy |
|---|---|
| Orchestrator.handle_player_input() | Called by server, unchanged |
| MultiplayerSession | Server wraps it directly, unchanged |
| PerceptionRewriter | Per-player narration, unchanged |
| CharacterBuilder | Driven over WebSocket, unchanged |
| GameState.save()/load() | Same save files, unchanged |
| SceneInterpreter + RenderQueue | Image delivery over WebSocket, unchanged |
| AudioInterpreter | Mood/SFX cue detection, unchanged |
| Kokoro/Piper TTS engines | Move to server-side service (same code, different caller) |
| Whisper STT concept | Reimplemented in browser via Transformers.js (not same code, same purpose) |
| Genre pack CSS (client_theme.css) | Served as real CSS instead of injected via Qt bridge |
| WebSocket protocol | Same GameMessage format, one new message type (VOICE_AUDIO) |
| WebRTC signaling | Same relay pattern, browser-native WebRTC instead of aiortc |
| Component | Why |
|---|---|
| sidequest/client/ (entire directory) | PySide6 desktop client replaced by React |
| PySide6, qasync dependencies | No longer needed |
| pygame-ce dependency (client) | Replaced by Web Audio API |
| sounddevice dependency | Replaced by MediaStream API |
| mlx-whisper (Python client) | Replaced by Transformers.js whisper-tiny.en in browser |
| aiortc dependency | Browser-native WebRTC replaces it |
| PyInstaller build stories (48-1, 48-2) | No desktop builds needed |
Rejected. Working code but fighting the developer's expertise. CSS friction, Qt widget quirks, and complex dependency chain for distribution. The developer thinks in React — forcing PySide6 burns cycles on the wrong problem.
Considered. Gets real CSS + local TTS/STT. But adds a native shell dependency, two build targets, and a sidecar process — reintroducing the distribution complexity that drove us away from PySide6. The M3 Max can handle server-side TTS for 4-5 players without a local sidecar.
Rejected. Combines the worst of both worlds — React's ecosystem with native platform bridging complexity. No meaningful advantage over a web browser for this application.
Positive:
- Real CSS, real DOM, real React — developer's primary expertise
- Zero-install distribution (URL)
- Browser-native WebRTC (battle-tested, no aiortc dependency)
- Genre pack CSS delivered as real stylesheets
- Vite hot reload for rapid UI iteration
- Standard web debugging tools (Chrome DevTools)
- Players join from any device with a browser
- STT stays local — raw voice audio never leaves the browser (privacy by architecture)
- Server protocol is simpler — no binary audio upload, no STT load
- Voice input and keyboard input are indistinguishable to the server
- Eliminates PySide6, qasync, pygame, sounddevice, aiortc, mlx-whisper from server deps
Negative:
- Server handles TTS synthesis for all players (CPU/GPU load — but sequential, not concurrent)
- No offline play (requires server connection — but already required for Orchestrator)
- Web Audio API is more verbose than pygame for mixing
- Rewrite of ~2000 LOC client code (presentation layer only, no game logic)
- Browser tab doesn't feel as "native" as a desktop app
- Audio autoplay policies require user interaction before first sound
- Local Whisper requires WebGPU for optimal latency; CPU fallback is slower (~2-3s)
- 75MB model download on first visit (cached after)
Risks:
- Server TTS concurrency. 4-5 simultaneous TTS requests during narration delivery. Mitigation: narration is sequential per turn; concurrent synthesis unlikely. M3 Max has headroom.
- Web Audio API complexity. Crossfade, ducking, and multi-channel mixing require manual AudioNode graph wiring. Mitigation: AudioEngine wrapper class encapsulates complexity.
- Browser audio autoplay. Browsers block audio until user gesture. Mitigation: "Click to enter the game" splash screen on connect.
- Binary WebSocket framing. Streaming TTS opus/PCM over WebSocket needs framing protocol for chunk boundaries. Mitigation: simple length-prefixed frames or use MediaSource Extensions.
- Local Whisper on older hardware. Users without WebGPU get CPU-only WASM (~2-3s latency). Mitigation: still usable with transcript preview UX; or fall back to server-side STT as a degradation path.
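The length-prefixed mitigation could look like this on the client side. This is a sketch, assuming a 4-byte big-endian length header per chunk; the actual framing format is not yet decided.

```typescript
// Sketch: length-prefixed framing for streamed TTS chunks.
// Each frame = 4-byte big-endian length + payload (an assumption).

// Encode one chunk into a frame.
function frameChunk(chunk: Uint8Array): Uint8Array {
  const out = new Uint8Array(4 + chunk.length);
  new DataView(out.buffer).setUint32(0, chunk.length, false);
  out.set(chunk, 4);
  return out;
}

// Split a buffer of concatenated frames back into chunks.
// Returns complete chunks plus any trailing partial frame to retry later.
function deframe(buf: Uint8Array): { chunks: Uint8Array[]; rest: Uint8Array } {
  const chunks: Uint8Array[] = [];
  let offset = 0;
  while (offset + 4 <= buf.length) {
    const len = new DataView(buf.buffer, buf.byteOffset + offset).getUint32(0, false);
    if (offset + 4 + len > buf.length) break; // partial frame, wait for more bytes
    chunks.push(buf.subarray(offset + 4, offset + 4 + len));
    offset += 4 + len;
  }
  return { chunks, rest: buf.subarray(offset) };
}
```

Keeping the deframer tolerant of partial frames matters because WebSocket binary messages may be re-chunked by intermediate buffering before reaching the Web Audio queue.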
| Phase | Contents | Playable After? | Risk |
|---|---|---|---|
| 1: React Shell | Vite + React app, WebSocket hook, NarrativeView, InputBar, ConnectScreen | Yes (text only) | Low |
| 2: Genre Theming | CSS variable injection, genre stylesheet loading, inline images | Yes (themed) | Low |
| 3: Server TTS | Server-side Kokoro, VOICE_AUDIO binary streaming, Web Audio playback | Yes (with voice) | Medium |
| 4: Audio Engine | Web Audio mixer (music/SFX channels), crossfade, ducking, volume controls | Yes (full audio) | Medium |
| 5: Push-to-Talk STT | MediaStream capture, local Whisper STT (Transformers.js), transcript preview | Yes (voice input) | Medium |
| 6: WebRTC Voice | Browser WebRTC peer mesh, group voice, PTT mic muting | Yes (full voice) | High |
| 7: Polish | Character creation flow, party panel, map overlay, quest log | Yes (complete) | Low |
Phases 1-2 reach feature parity with the TUI. Phase 3 adds the narrator's voice. Phase 6 is highest risk (WebRTC NAT traversal) but browser WebRTC is more mature than aiortc.