ADR: React Web Client Replacing PySide6 Desktop Client

Status: Accepted
Date: 2026-03-24
Supersedes: ADR: Custom PySide6 Game Client (2026-03-22)
Design Reference: Neal Stephenson, The Diamond Age — The Young Lady's Illustrated Primer (unchanged)

Context

The PySide6 custom client (ADR 2026-03-22) was built over 10 days and is functional: WebSocket connection, QWebEngineView narrative rendering, local TTS/music/SFX, genre CSS theming. However, development friction is compounding:

  1. CSS fights. QWebEngineView embeds Chromium but Qt's widget layer interferes with styling. Style inheritance leaks between Qt widgets and the embedded web view. Scoping issues that would be trivial in a real browser require workarounds in Qt.

  2. Component model mismatch. The developer's expertise is React — composable components, hooks, declarative state management. PySide6's signal/slot system and QWidget hierarchy require a fundamentally different mental model. Every UI interaction is slower to build, harder to debug, and more fragile than it would be in React.

  3. QWebEngineView is a browser in disguise. The narrative rendering surface is already HTML + CSS + JavaScript running in Chromium. PySide6 is a desktop shell wrapping a browser engine. The question becomes: why maintain the shell?

  4. Distribution friction. PySide6 requires Python 3.11+, torch, PySide6, pygame-ce, sounddevice, mlx-whisper — a complex dependency tree that must be frozen into PyInstaller builds per platform. A web client requires a URL.

  5. The WebSocket protocol is transport-agnostic. The existing message protocol (GameMessage, MessageType enum, JSON payloads) works identically whether the client is PySide6, React, or curl. The server doesn't know or care what renders the narration.

What the PySide6 ADR got right (and this ADR preserves):

  • Don't touch the Orchestrator
  • Silent asymmetry is absolute
  • Voice-only player communication
  • The Primer metaphor (illustrated book, not app chrome)
  • Genre pack CSS theming

What changed: The developer's assessment of where their time is best spent. PySide6 is learnable but not leverageable. React is mastered and productive.

Decision

Replace the PySide6 desktop client with a React web application. Move TTS synthesis and Whisper STT to the server. Keep WebRTC peer-to-peer voice chat in the browser. Preserve the existing WebSocket protocol, game server, and all game logic unchanged.

Architecture

Browser (React + Vite)
  ├── React UI
  │   ├── NarrativeView (HTML narrative flow, real CSS, inline images)
  │   ├── PartyPanel (character portraits, HP, status)
  │   ├── InputBar (action text + aside notes)
  │   ├── TranscriptPreview (STT confirmation before submit)
  │   └── AudioStatus (now-playing, volume controls)
  ├── WebSocket client (native WebSocket API)
  ├── Web Audio API
  │   ├── Music channel (crossfade, genre tracks)
  │   ├── SFX channel (combat, environment)
  │   └── Voice channel (narrator TTS playback)
  ├── Local Whisper STT (Transformers.js + WebGPU)
  │   └── whisper-tiny.en — 75MB cached in IndexedDB
  ├── MediaStream API (mic capture for PTT)
  └── WebRTC (peer-to-peer group voice)
        │
        ↓ WebSocket (JSON text + binary TTS audio)
        │  (no raw voice audio sent — STT is local)
        │
Game Server (M3 Max, aiohttp — unchanged)
  ├── Static file serving (React build + genre assets)
  ├── WebSocket handler (GameServer dispatch — unchanged)
  ├── MultiplayerSession (unchanged)
  ├── Orchestrator (unchanged)
  ├── Server-side Kokoro TTS → stream opus/PCM chunks to client
  ├── WebRTC signaling relay (unchanged)
  └── Renderer daemon connection (Flux, ACE-Step — unchanged)

Principle: Don't Touch the Orchestrator (Still)

Same as before. The Orchestrator, MultiplayerSession, PerceptionRewriter, TurnManager, MessageComposer — all unchanged. The React client connects to the same WebSocket endpoint and speaks the same GameMessage protocol.

Principle: Silent Asymmetry is Absolute (Still)

Each browser receives only its player's narration stream. No UI indicators of private information. Unchanged from the PySide6 ADR.

Principle: STT Stays Local, TTS Moves to Server

The audio pipeline splits by direction:

  • STT (player → game): Local in browser. Transformers.js v3 runs whisper-tiny.en with WebGPU acceleration directly in the browser. Audio never leaves the client. The player holds PTT, speaks, releases, sees a transcript preview in ~1-1.7 seconds, edits if needed, confirms with Enter. Only the confirmed text is sent to the server as a PLAYER_ACTION — identical to typing. The server cannot distinguish voice input from keyboard input.
  • TTS (game → player): Server-side. Server runs Kokoro, streams audio chunks (opus or PCM) over WebSocket as binary frames. Client plays via Web Audio API. Server already has a TTSWorker in the renderer daemon — this extends it to handle per-player synthesis.
  • Music/SFX: Server serves genre pack audio files via HTTP static. Client fetches and plays via Web Audio API. Same mood/SFX cue protocol over WebSocket.

Why local STT works:

  • Transformers.js v3 with whisper-tiny.en + WebGPU: ~1.1-1.7s for 10-second clip on M3 Max
  • Model is 75MB, cached in IndexedDB after first download
  • WebGPU ships in all major browsers (Chrome, Firefox, Safari, Edge) as of late 2025
  • Fallback: WebAssembly CPU execution (~2-3s, acceptable with transcript preview UX)
  • Raw audio never leaves the browser — privacy by architecture
  • No server STT load, no binary audio WebSocket frames, simpler protocol

Why server TTS works at scale (4-5 players):

  • Kokoro synthesis is fast (~80ms per utterance on M3 Max)
  • Narration is sequential — only one player's narration synthesizes at a time
  • Music/SFX are pre-rendered files, not generated on the fly
  • The M3 Max has 128GB RAM and 80GB ML budget — headroom for concurrent TTS
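
Under the figures above, the worst case is easy to budget. A rough sketch (the utterances-per-turn count is an illustrative assumption, not from this ADR):

```typescript
// Back-of-envelope TTS budget using the numbers above.
// utterancesPerPlayerTurn is an illustrative assumption.
const players = 5;
const synthesisMsPerUtterance = 80;  // Kokoro on M3 Max (from the list above)
const utterancesPerPlayerTurn = 4;
const worstCaseMs = players * utterancesPerPlayerTurn * synthesisMsPerUtterance;
console.log(worstCaseMs);  // 1600 — a few hundred ms per player, sequential
```

Even the full-table worst case stays under two seconds of synthesis per turn, and narration is delivered sequentially anyway.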

Principle: PTT Lag is a Feature

The push-to-talk → local Whisper → transcript preview → confirm → submit pipeline has inherent latency from STT processing (~1-1.7s). This latency is consumed by the UX: the player reads their transcript, edits if needed, and confirms. The confirmation step prevents misheard commands ("I said attack, not attract!"). Instant STT would actually require an artificial pause for the same UX. The processing time is invisible because it's filled with useful interaction.

Principle: Voice-Only Communication (Still)

WebRTC peer-to-peer voice in the browser. No text chat between players. Text input is for game actions and Narrator notes only.

Component Design

1. React Application Structure

client/
├── src/
│   ├── App.tsx                 # Root layout, WebSocket provider
│   ├── components/
│   │   ├── NarrativeView.tsx   # Scrolling narrative with inline images
│   │   ├── PartyPanel.tsx      # Character cards, HP, status effects
│   │   ├── InputBar.tsx        # Action input + aside toggle
│   │   ├── TranscriptPreview.tsx  # STT confirmation overlay
│   │   ├── AudioStatus.tsx     # Now-playing bar, volume sliders
│   │   ├── ConnectScreen.tsx   # Server URL + player name
│   │   └── CharacterCreation/  # Multi-step character builder
│   ├── hooks/
│   │   ├── useWebSocket.ts     # WebSocket connection + reconnect
│   │   ├── useAudio.ts         # Web Audio API: music, SFX, voice
│   │   ├── useVoiceChat.ts     # WebRTC peer mesh
│   │   ├── usePushToTalk.ts    # PTT key binding + mic routing
│   │   └── useGameState.ts     # Client-side state from server messages
│   ├── audio/
│   │   ├── AudioEngine.ts      # Web Audio API wrapper (3 channels)
│   │   ├── Crossfader.ts       # Music crossfade (GainNode ramps)
│   │   └── Ducker.ts           # TTS→music ducking policy
│   ├── protocol/
│   │   └── messages.ts         # MessageType enum, GameMessage type
│   └── theme/
│       └── GenreTheme.ts       # CSS variable injection from server
├── public/
│   └── index.html
├── package.json
├── vite.config.ts
└── tsconfig.json

2. WebSocket Protocol Changes

The existing GameMessage protocol is preserved, with one message type added and two retired:

New message type:
  VOICE_AUDIO    (server → client)  Binary opus/PCM frames for TTS playback

Removed (not needed — STT is local):
  STT_AUDIO      — audio never leaves client
  STT_TRANSCRIPT — transcript generated locally, submitted as PLAYER_ACTION

All existing message types (NARRATION, IMAGE, AUDIO_CUE, VOICE_TEXT, TURN_STATUS, etc.) remain unchanged. The React client handles them identically to the PySide6 client. Voice input arrives as PLAYER_ACTION — indistinguishable from typed input.
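
One plausible way to multiplex the new binary frames with the existing JSON text messages is a 1-byte type tag on binary frames. This is a sketch, not the documented protocol — the tag value and the `parseFrame` helper are assumptions:

```typescript
// Hypothetical framing: text frames carry JSON GameMessages; binary frames
// carry a 1-byte tag followed by the payload. Tag 0x01 = VOICE_AUDIO (assumed).
const VOICE_AUDIO_TAG = 0x01;

type ParsedFrame =
  | { kind: 'json'; message: unknown }
  | { kind: 'voice_audio'; payload: Uint8Array };

function parseFrame(data: string | ArrayBuffer): ParsedFrame {
  if (typeof data === 'string') {
    return { kind: 'json', message: JSON.parse(data) };
  }
  const bytes = new Uint8Array(data);
  if (bytes[0] !== VOICE_AUDIO_TAG) {
    throw new Error(`unknown binary frame tag ${bytes[0]}`);
  }
  return { kind: 'voice_audio', payload: bytes.subarray(1) };
}
```

With `ws.binaryType = 'arraybuffer'`, binary frames arrive as ArrayBuffer and text frames as strings, so the `typeof` check is enough to route them.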

3. Server-Side TTS Pipeline

# In server dispatch, when sending VOICE_TEXT:
async def handle_voice_text(self, player_id: str, segments: list[VoiceSegment]):
    """Synthesize TTS on server, stream audio to client."""
    for segment in segments:
        audio_bytes = await self.tts_engine.synthesize(
            text=segment.text,
            preset=segment.preset,
        )
        await self.send_binary(player_id, MessageType.VOICE_AUDIO, audio_bytes)

This replaces the current flow where VOICE_TEXT sends text and the client synthesizes locally. The server now sends audio data. The Kokoro/Piper engines move from sidequest/client/local_audio.py to a server-side TTS service.

4. Local Browser STT (Transformers.js)

// useWhisper.ts — local STT, audio never leaves browser
import { pipeline } from "@huggingface/transformers";

class LocalTranscriber {
  private transcriber: any;

  async initialize() {
    const device = navigator.gpu ? 'webgpu' : 'wasm';
    this.transcriber = await pipeline(
      "automatic-speech-recognition",
      "onnx-community/whisper-tiny.en",
      { device, dtype: device === 'webgpu' ? 'q4' : 'q8' }
    );
    // Model cached in IndexedDB — 75MB one-time download
  }

  async transcribe(audioBuffer: Float32Array): Promise<string> {
    const result = await this.transcriber(audioBuffer);
    return result.text;  // ~1.1-1.7s on M3 Max with WebGPU
  }
}

No server round-trip. No binary audio frames on the wire. The transcript is displayed in TranscriptPreview for player confirmation. On Enter, it becomes a PLAYER_ACTION — the server sees text, not audio.
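
On confirm, the client wraps the transcript in an ordinary PLAYER_ACTION message. A minimal sketch — the field names here are illustrative; the real shape lives in protocol/messages.ts:

```typescript
// Hypothetical message builder; actual GameMessage fields may differ.
function buildPlayerAction(playerId: string, text: string): string {
  return JSON.stringify({ type: 'PLAYER_ACTION', player_id: playerId, text });
}

// ws.send(buildPlayerAction('p1', 'I search the shelf for the hidden primer'));
```

Because the same builder serves the keyboard path, the server-side invariant holds by construction: voice and typed input produce byte-identical message shapes.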

5. Push-to-Talk in Browser

// usePushToTalk.ts — local STT + WebRTC mic muting
import { useEffect, useRef } from 'react';

function usePushToTalk(
  transcriber: LocalTranscriber,
  voiceChat: RTCPeerConnection,
  mediaStream: React.RefObject<MediaStream>,  // shared mic stream from the caller
  onTranscript: (text: string) => void
) {
  const audioChunks = useRef<Blob[]>([]);
  const recorder = useRef<MediaRecorder | null>(null);

  const onKeyDown = (e: KeyboardEvent) => {
    if (e.code !== 'Space' || e.repeat) return;
    // Mute WebRTC track (cut from group voice)
    voiceChat.getSenders().forEach(s => {
      if (s.track) s.track.enabled = false;
    });
    // Start recording locally
    audioChunks.current = [];
    recorder.current = new MediaRecorder(mediaStream.current!);
    recorder.current.ondataavailable = (ev) =>
      audioChunks.current.push(ev.data);
    recorder.current.start(250);  // emit a chunk every 250ms
  };

  const onKeyUp = async (e: KeyboardEvent) => {
    if (e.code !== 'Space') return;
    recorder.current?.stop();
    // Unmute WebRTC track (rejoin group voice)
    voiceChat.getSenders().forEach(s => {
      if (s.track) s.track.enabled = true;
    });
    // Transcribe locally — audio stays in browser
    const audioBlob = new Blob(audioChunks.current);
    const audioBuffer = await decodeAudioToFloat32(audioBlob);
    const transcript = await transcriber.transcribe(audioBuffer);
    onTranscript(transcript);  // Show in TranscriptPreview
  };

  // Bind/unbind the PTT key handlers
  useEffect(() => {
    window.addEventListener('keydown', onKeyDown);
    window.addEventListener('keyup', onKeyUp);
    return () => {
      window.removeEventListener('keydown', onKeyDown);
      window.removeEventListener('keyup', onKeyUp);
    };
  });
}
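
The decodeAudioToFloat32 helper above is assumed. Whisper expects 16 kHz mono Float32 PCM, so after `decodeAudioData` the samples need a resample step. A minimal linear-interpolation sketch — a real implementation might use OfflineAudioContext instead:

```typescript
// Hypothetical resampler: downsample decoded mono PCM to Whisper's 16 kHz.
function resampleMono(
  input: Float32Array,
  fromRate: number,
  toRate: number = 16000
): Float32Array {
  const ratio = fromRate / toRate;
  const outLen = Math.floor(input.length / ratio);
  const out = new Float32Array(outLen);
  for (let i = 0; i < outLen; i++) {
    const pos = i * ratio;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    // Linear interpolation between the two nearest source samples
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}
```

Multi-channel capture would also need a downmix to mono before this step.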

6. Web Audio Engine

// AudioEngine.ts — replaces pygame AudioMixer
class AudioEngine {
  private ctx = new AudioContext();
  private channels = new Map<string, GainNode>();  // music, sfx, voice

  constructor() {
    for (const name of ['music', 'sfx', 'voice']) {
      const gain = this.ctx.createGain();
      gain.connect(this.ctx.destination);
      this.channels.set(name, gain);
    }
  }

  private async fetchAndDecode(url: string): Promise<AudioBuffer> {
    const response = await fetch(url);
    return this.ctx.decodeAudioData(await response.arrayBuffer());
  }

  async playMusic(url: string, fadeMs: number = 3000) {
    const buffer = await this.fetchAndDecode(url);
    const source = this.ctx.createBufferSource();
    source.buffer = buffer;
    // Crossfade: ramp old down, new up over fadeMs (see Crossfader.ts)
    this.crossfade(this.channels.get('music')!, source, fadeMs);
  }

  async playVoice(audioData: ArrayBuffer) {
    const buffer = await this.ctx.decodeAudioData(audioData);
    const source = this.ctx.createBufferSource();
    source.buffer = buffer;
    source.connect(this.channels.get('voice')!);
    // Duck music while voice plays (see Ducker.ts)
    this.duck('music', 0.3, 200);
    source.onended = () => this.unduck('music', 500);
    source.start();
  }
}
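
The crossfade helper is only sketched above. One common choice is an equal-power curve, which keeps perceived loudness constant mid-fade; the Crossfader would sample this curve into GainNode ramps. The equal-power choice is an assumption — the ADR specifies only "GainNode ramps":

```typescript
// Equal-power crossfade: outgoing and incoming gains at progress t ∈ [0, 1].
function crossfadeGains(t: number): { outgoing: number; incoming: number } {
  return {
    outgoing: Math.cos((t * Math.PI) / 2),
    incoming: Math.sin((t * Math.PI) / 2),
  };
}
// At t = 0.5 both gains are ≈ 0.707, so summed power (g² + g²) stays ≈ 1
// throughout the fade — no mid-fade volume dip, unlike a linear ramp.
```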

7. Genre Theme Injection

// Server sends genre theme CSS URL on connect
// Client loads it as a <link> tag — real CSS, no Qt bridge
useEffect(() => {
  if (!genreThemeUrl) return;
  const link = document.createElement('link');
  link.rel = 'stylesheet';
  link.href = genreThemeUrl;
  document.head.appendChild(link);
  // Remove the previous theme's stylesheet when the theme changes
  return () => { document.head.removeChild(link); };
}, [genreThemeUrl]);

Genre packs already have client_theme.css files. The server serves them as static files. The client loads them as real stylesheets. No runJavaScript() bridge, no Qt style leaking.

8. Static Asset Serving

The game server adds a static file route for genre pack assets:

# In server setup
app.router.add_static('/assets/', path='genre_packs/', name='assets')
app.router.add_static('/client/', path='client/dist/', name='client')

Genre fonts, images, audio tracks, and CSS are served over HTTP. The React client fetches them with standard <link>, <img>, and fetch() calls.

What Changes in Existing Code

  • sidequest/server/app.py — add static file routes and a TTS streaming handler, to serve the React build + genre assets and stream TTS audio
  • sidequest/server/app.py — add a server-side Kokoro TTS service, moving TTS synthesis from the client to the server
  • sidequest/server/protocol.py — add the VOICE_AUDIO message type, for binary TTS audio frames (server → client only)

Orchestrator: Unchanged. MultiplayerSession: Unchanged. Game logic: Unchanged. Renderer daemon: Unchanged. Genre pack system: Unchanged (gains real CSS delivery).

What We Reuse

  • Orchestrator.handle_player_input() — called by the server, unchanged
  • MultiplayerSession — server wraps it directly, unchanged
  • PerceptionRewriter — per-player narration, unchanged
  • CharacterBuilder — driven over WebSocket, unchanged
  • GameState.save()/load() — same save files, unchanged
  • SceneInterpreter + RenderQueue — image delivery over WebSocket, unchanged
  • AudioInterpreter — mood/SFX cue detection, unchanged
  • Kokoro/Piper TTS engines — move to a server-side service (same code, different caller)
  • Whisper STT concept — reimplemented in the browser via Transformers.js (not the same code, same purpose)
  • Genre pack CSS (client_theme.css) — served as real CSS instead of injected via Qt bridge
  • WebSocket protocol — same GameMessage format, one new message type (VOICE_AUDIO)
  • WebRTC signaling — same relay pattern, browser-native WebRTC instead of aiortc

What We Retire

  • sidequest/client/ (entire directory) — PySide6 desktop client replaced by React
  • PySide6, qasync dependencies — no longer needed
  • pygame-ce dependency (client) — replaced by the Web Audio API
  • sounddevice dependency — replaced by the MediaStream API
  • mlx-whisper (Python client) — replaced by Transformers.js whisper-tiny.en in the browser
  • aiortc dependency — browser-native WebRTC replaces it
  • PyInstaller build stories (48-1, 48-2) — no desktop builds needed

Alternatives Considered

A: Stay with PySide6

Rejected. Working code but fighting the developer's expertise. CSS friction, Qt widget quirks, and complex dependency chain for distribution. The developer thinks in React — forcing PySide6 burns cycles on the wrong problem.

B: Electron/Tauri hybrid (React UI + native audio sidecar)

Considered. Gets real CSS + local TTS/STT. But adds a native shell dependency, two build targets, and a sidecar process — reintroducing the distribution complexity that drove us away from PySide6. The M3 Max can handle server-side TTS for 4-5 players without a local sidecar.

C: React Native Desktop

Rejected. Combines the worst of both worlds — React's ecosystem with native platform bridging complexity. No meaningful advantage over a web browser for this application.

Consequences

Positive:

  • Real CSS, real DOM, real React — developer's primary expertise
  • Zero-install distribution (URL)
  • Browser-native WebRTC (battle-tested, no aiortc dependency)
  • Genre pack CSS delivered as real stylesheets
  • Vite hot reload for rapid UI iteration
  • Standard web debugging tools (Chrome DevTools)
  • Players join from any device with a browser
  • STT stays local — raw voice audio never leaves the browser (privacy by architecture)
  • Server protocol is simpler — no binary audio upload, no STT load
  • Voice input and keyboard input are indistinguishable to the server
  • Eliminates PySide6, qasync, pygame, sounddevice, aiortc, and mlx-whisper from the Python dependency tree

Negative:

  • Server handles TTS synthesis for all players (CPU/GPU load — but sequential, not concurrent)
  • No offline play (requires server connection — but already required for Orchestrator)
  • Web Audio API is more verbose than pygame for mixing
  • Rewrite of ~2000 LOC client code (presentation layer only, no game logic)
  • Browser tab doesn't feel as "native" as a desktop app
  • Audio autoplay policies require user interaction before first sound
  • Local Whisper requires WebGPU for optimal latency; CPU fallback is slower (~2-3s)
  • 75MB model download on first visit (cached after)

Risks:

  • Server TTS concurrency. 4-5 simultaneous TTS requests during narration delivery. Mitigation: narration is sequential per turn; concurrent synthesis unlikely. M3 Max has headroom.
  • Web Audio API complexity. Crossfade, ducking, and multi-channel mixing require manual AudioNode graph wiring. Mitigation: AudioEngine wrapper class encapsulates complexity.
  • Browser audio autoplay. Browsers block audio until user gesture. Mitigation: "Click to enter the game" splash screen on connect.
  • Binary WebSocket framing. Streaming TTS opus/PCM over WebSocket needs framing protocol for chunk boundaries. Mitigation: simple length-prefixed frames or use MediaSource Extensions.
  • Local Whisper on older hardware. Users without WebGPU get CPU-only WASM (~2-3s latency). Mitigation: still usable with transcript preview UX; or fall back to server-side STT as a degradation path.

Implementation Phases

  • Phase 1: React Shell — Vite + React app, WebSocket hook, NarrativeView, InputBar, ConnectScreen. Playable after: yes (text only). Risk: low.
  • Phase 2: Genre Theming — CSS variable injection, genre stylesheet loading, inline images. Playable after: yes (themed). Risk: low.
  • Phase 3: Server TTS — server-side Kokoro, VOICE_AUDIO binary streaming, Web Audio playback. Playable after: yes (with voice). Risk: medium.
  • Phase 4: Audio Engine — Web Audio mixer (music/SFX channels), crossfade, ducking, volume controls. Playable after: yes (full audio). Risk: medium.
  • Phase 5: Push-to-Talk STT — MediaStream capture, local Whisper STT in the browser (Transformers.js), transcript preview. Playable after: yes (voice input). Risk: medium.
  • Phase 6: WebRTC Voice — browser WebRTC peer mesh, group voice, PTT mic muting. Playable after: yes (full voice). Risk: high.
  • Phase 7: Polish — character creation flow, party panel, map overlay, quest log. Playable after: yes (complete). Risk: low.

Phases 1-2 reach feature parity with the TUI. Phase 3 adds the narrator's voice. Phase 6 is highest risk (WebRTC NAT traversal) but browser WebRTC is more mature than aiortc.