voice: retire legacy WS transport, unify on LiveKit WebRTC #914
Conversation
VoiceStartServerCommand now returns LiveKit URL + JWT token instead of spinning up a legacy WebSocket server on port 3001. Same LiveKit token generation pattern as collaboration/live/join. Port 3001 is no longer needed — Docker compose never exposed it, so it was already dead in containerized deployments.

VoiceStartResult type adds livekitUrl + livekitToken fields (wsUrl kept for backwards compat, set to same as livekitUrl).

VoiceChatWidget browser-side migration (raw WS → AudioStreamClient LiveKit transport) is the next step — this commit unblocks it by providing the correct server-side response shape.

Existing TTS→STT roundtrip tests (livekit-audio-roundtrip.test.ts, sensory_pipeline_test.rs with gunfire/noise injection) validate the audio pipeline independently. Once the widget is wired to LiveKit, those tests cover the full voice path.
- VoiceChatWidget: replace raw WebSocket + AudioWorklet with AudioStreamClient (LiveKit WebRTC). 427→178 lines. VOICE_WS_PORT/3001 eliminated from browser side.
- VoiceStartTypes: result fields (handle, livekitUrl, livekitToken, roomId) now required in factory params — no more optional + empty-string defaults hiding what should be compile-time errors. Remove dead wsUrl field (legacy port-3001, zero consumers).
- docker-compose: livekit + livekit-bridge moved to profiles: [live]. Text chat works without WebRTC infrastructure. Carl saves ~300MB RAM and doesn't wait for LiveKit to boot.
- continuum-core: removed hard depends_on on livekit-bridge (profile-gated services can't be in depends_on). Core discovers the bridge via socket at startup when the live profile is active.
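The tightened result/factory shape described above can be sketched as follows. Field names come from the PR text; the exact interface and factory signature are assumptions, not the PR's actual code:

```typescript
// Sketch of the tightened VoiceStartResult shape (interface assumed).
interface VoiceStartResult {
  success: boolean;
  handle: string;
  roomId: string;
  livekitUrl: string;
  livekitToken: string;
  error?: { message: string };
}

// Required fields are required in the factory params too: a missing
// field is a compile-time error, not a silent '' default.
function createVoiceStartResult(
  params: Pick<VoiceStartResult, 'handle' | 'roomId' | 'livekitUrl' | 'livekitToken'>,
): VoiceStartResult {
  return { success: true, ...params };
}
```

With this shape, forgetting `livekitToken` at a call site fails type-checking instead of shipping an empty string to the browser.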
LiveKit provides the UDP/WebRTC transport that made 14 personas + 4 LLMs + TTS/STT + Bevy avatars work simultaneously on M1. Profiling it out would degrade the product to text-only. Same principle as Docker Model Runner — efficient transport is a core requirement. Restores: livekit + livekit-bridge as always-on services, continuum-core depends_on livekit-bridge health.
VoiceChatWidget now uses AudioStreamClient (LiveKit WebRTC) for all audio capture and playback. These worklet processors were only loaded by the old raw-WebSocket code path that was replaced in the previous commit. No remaining references in the codebase.
Remove VoiceWebSocketHandler.ts (586 lines) — its entire functionality is handled by the LiveKit WebRTC transport:
- Audio capture/playback → LiveKit SDK + AudioStreamClient
- STT triggering → LiveKit VAD + Rust STT listener
- Transcription routing → CollaborationLiveTranscriptionServerCommand
- TTS synthesis → AIAudioBridge → voiceSpeakInCall IPC → LiveKit publish

Keep: VoiceOrchestrator (persona routing), AIAudioBridge (TTS→LiveKit), AudioNativeBridge (voice-native AI models), VoiceSessionManager.

Port 3001 no longer binds on server startup. Tests updated to reference the LiveKit path instead of the deleted handler.
Pull request overview
This PR migrates the voice chat browser path off the legacy port-3001 raw WebSocket + AudioWorklet pipeline and standardizes on LiveKit/WebRTC (via AudioStreamClient), with voice/start now returning LiveKit connection details.
Changes:
- Remove the legacy browser AudioWorklet capture/playback processors.
- Rewrite `VoiceChatWidget` to join LiveKit via `AudioStreamClient` and drive UI state/events from LiveKit mic levels, transcription, and active-speaker signals.
- Update `voice/start` types + server implementation to return `livekitUrl` + `livekitToken` (and remove the legacy `wsUrl` field); update docs/compose commentary to reflect LiveKit always-on.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/widgets/voice-chat/voice-playback-processor.js | Deletes legacy playback AudioWorklet processor. |
| src/widgets/voice-chat/voice-capture-processor.js | Deletes legacy capture AudioWorklet processor. |
| src/widgets/voice-chat/VoiceChatWidget.ts | Switches widget transport to LiveKit AudioStreamClient and updates state/event wiring. |
| src/commands/voice/start/shared/VoiceStartTypes.ts | Replaces wsUrl with required livekitUrl/livekitToken fields in result typing/factory. |
| src/commands/voice/start/server/VoiceStartServerCommand.ts | Generates LiveKit JWT + returns LiveKit URL instead of starting/using port-3001 voice WS server. |
| docs/planning/ALPHA-GAP-ANALYSIS.md | Documents the LiveKit migration and remaining cleanup. |
| docker-compose.yml | Updates comments to emphasize LiveKit/livekit-bridge always-on assumptions. |
```typescript
// LiveKit URL for browser connection
const livekitUrl = getSecret('LIVEKIT_URL') || 'ws://localhost:7880';
```
livekitUrl is sourced from getSecret('LIVEKIT_URL'), but in docker-compose the node-server default LIVEKIT_URL points at the Docker-internal hostname (ws://livekit:7880). Returning that to the browser will fail because the browser can’t resolve livekit. Align this with LiveJoinServerCommand by returning a browser-reachable URL (e.g., fall back to @shared/AudioConstants.LIVEKIT_URL / getWebSocketUrl(LIVEKIT_TLS_PORT) or introduce/use a dedicated LIVEKIT_BROWSER_URL secret/env).
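A minimal sketch of the suggested fix, assuming a dedicated `LIVEKIT_BROWSER_URL` secret as the reviewer proposes. The function name and fallback logic are illustrative, not the PR's code:

```typescript
// Hypothetical helper: prefer an explicitly browser-reachable URL and
// reject the Docker-internal hostname that a browser cannot resolve.
function resolveBrowserLivekitUrl(
  getSecret: (name: string) => string | undefined,
): string {
  // A dedicated browser-facing secret wins outright (assumed name).
  const browserUrl = getSecret('LIVEKIT_BROWSER_URL');
  if (browserUrl) return browserUrl;

  const url = getSecret('LIVEKIT_URL') || 'ws://localhost:7880';
  // 'livekit' is the compose service name, only resolvable inside the
  // Docker network; fall back to a host-reachable address.
  if (new URL(url).hostname === 'livekit') {
    return 'ws://localhost:7880';
  }
  return url;
}
```

In practice the fallback would come from `@shared/AudioConstants` rather than a literal, per the comment above.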
```typescript
const apiKey = getSecret('LIVEKIT_API_KEY') || LIVEKIT_API_KEY;
const apiSecret = getSecret('LIVEKIT_API_SECRET') || LIVEKIT_API_SECRET;
const token = new AccessToken(apiKey, apiSecret, {
```
getSecret calls here omit the requestedBy argument, which means SecretManager audit logs will record these reads as coming from unknown. Pass a stable identifier (e.g., 'VoiceStartServerCommand') like LiveJoinServerCommand does so secret access is traceable in logs.
```typescript
this.handle = result.handle;
this.localUserId = result.roomId; // sessionId used as identity in JWT

// Create AudioStreamClient wired to our state
```
localUserId is being set to result.roomId, but LiveKit speaker identities are participant.identity from the JWT (here generated from params.sessionId). This will cause isSpeaking/isAISpeaking to be computed incorrectly (and can mark AI speaking when only the local user is speaking). Prefer returning the LiveKit identity in VoiceStartResult (e.g., identity/userId) or deriving it after connect (e.g., from the connected room/local participant) and using that for active-speaker comparisons.
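The reviewer's suggested derivation can be sketched structurally. The `RoomLike` type is a stand-in assumption; in the real widget this would be the connected LiveKit room, whose local participant's `identity` comes from the JWT:

```typescript
// Minimal structural stand-in for the connected LiveKit room (assumed).
interface RoomLike {
  localParticipant: { identity: string };
}

// Classify active speakers against the real LiveKit identity
// (the JWT's sessionId), not result.roomId.
function classifySpeakers(room: RoomLike, speakerIds: string[]) {
  const me = room.localParticipant.identity;
  return {
    isSpeaking: speakerIds.includes(me),
    isAISpeaking: speakerIds.some((id) => id !== me),
  };
}
```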
```typescript
async start(): Promise<void> {
  try {
    // Resume audio context if suspended (browser autoplay policy)
    if (this.audioContext?.state === 'suspended') {
      await this.audioContext.resume();
    }

    // Get LiveKit credentials from voice/start command
    const result: VoiceStartResult = await VoiceStart.execute({
      room: this.roomId || 'general',
    });

    if (!result.success) {
      throw new Error(result.error?.message || 'Failed to start voice session');
    }

    this.handle = result.handle;
    this.localUserId = result.roomId; // sessionId used as identity in JWT

    // Create AudioStreamClient wired to our state
    this.audioClient = new AudioStreamClient({
      onConnectionChange: (connected) => {
        this.updateState({ isConnected: connected, error: connected ? null : 'Disconnected' });
      },
      onMicLevel: (level) => {
```
start() unconditionally creates a new server session and a new AudioStreamClient even if the widget is already listening/connected. Calling start() twice (or toggle() rapidly) can leak LiveKit connections and orphan server-side handles. Consider guarding with if (this.voiceState.isListening) return; or calling await this.stop() before starting a new session, and ensure any existing audioClient is left before overwriting it.
```typescript
onActiveSpeakersChanged: (speakerIds: string[]) => {
  this.activeSpeakers = new Set(speakerIds);
  const isSpeaking = this.activeSpeakers.has(this.localUserId);
  const isAISpeaking = speakerIds.some(id => id !== this.localUserId);

  this.updateState({ isSpeaking, isAISpeaking });

  if (isSpeaking) {
    Events.emit('voice:speaking:start', { roomId: this.roomId });
  } else {
    Events.emit('voice:speaking:end', { roomId: this.roomId });
  }
  if (isAISpeaking) {
    Events.emit('voice:ai:speaking:start', { roomId: this.roomId });
  } else {
    Events.emit('voice:ai:speaking:end', { roomId: this.roomId });
  }
},
```
onActiveSpeakersChanged emits voice:speaking:start/end and voice:ai:speaking:start/end on every active-speaker list change, even when isSpeaking/isAISpeaking didn't transition. This can produce duplicate start/end events (e.g., when another participant starts speaking while the user is already speaking). Track the previous speaking states and only emit the corresponding events when the boolean value changes.
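A sketch of the edge-triggered approach the comment suggests (names are illustrative; only transitions emit events):

```typescript
type Emit = (event: string) => void;

// Track previous booleans and emit only on transitions, so churn in the
// active-speaker list can't produce duplicate start/end events.
function makeSpeakerTracker(localId: string, emit: Emit) {
  let wasSpeaking = false;
  let wasAISpeaking = false;
  return (speakerIds: string[]) => {
    const isSpeaking = speakerIds.includes(localId);
    const isAISpeaking = speakerIds.some((id) => id !== localId);
    if (isSpeaking !== wasSpeaking) {
      emit(isSpeaking ? 'voice:speaking:start' : 'voice:speaking:end');
      wasSpeaking = isSpeaking;
    }
    if (isAISpeaking !== wasAISpeaking) {
      emit(isAISpeaking ? 'voice:ai:speaking:start' : 'voice:ai:speaking:end');
      wasAISpeaking = isAISpeaking;
    }
    return { isSpeaking, isAISpeaking };
  };
}
```

With this shape, another participant starting to speak while the user is already speaking emits only the AI event, not a duplicate user event.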
joelteply left a comment
Reviewed for partnership / second pair of eyes.
LGTM from install-reliability lens. Key strengths:
- 1441 deletions vs 209 additions = real cleanup, not just shuffle
- docker-compose.yml prose embeds the architectural reasoning (texture-IDs over UDP, "do not profile-gate this") right next to the config — future readers won't repeat the profile-decouple mistake
- VoiceStartServerCommand.ts cleanly migrated to LiveKit JWT, getSecret() with sensible dev defaults at lines 95-96
- ALPHA-GAP-ANALYSIS.md updated with the right architectural call (LiveKit always-on, same as DMR)
- References the generator type-safety fix from this PR (b96a652) so result fields are properly required
Two follow-up observations (not blocking — file as separate issues if useful):
- `LIVEKIT_API_KEY`/`SECRET` defaults — dev keys (`devkey`/`secret`) match `livekit-server --dev`'s defaults so local installs work zero-config. For production deployments where Carl's instance is exposed (e.g., Tailscale grid hosting), those keys should be auto-generated at install time and written to `~/.continuum/config.env` so each user's grid has unique credentials. setup.sh could do this — `openssl rand -hex 32` for both. Could fold into my install-reliability work or your call.
- VoiceWebSocketHandler.ts still on boot — explicitly punted to a follow-up PR (per gap analysis prose). Fine as long as we track it. The cleanup now means LiveKit is the only voice client path, but the server-side WS handler still listens on 3001 unnecessarily. Worth a follow-up issue / commit to remove.
Otherwise — ship it. The architecture story is intact.
🤖 anvil reviewed via gh pr review while working on PR #913 (install-reliability lane). Both lanes converge on main when ready.
joelteply left a comment
Type-safety audit (per Joel's request to check for the same pattern your VoiceStartTypes fix was about):
Found one real violation in src/widgets/voice-chat/VoiceChatWidget.ts:
```typescript
// line 26
public roomId: string = ''; // required type, empty sentinel default

// line 47 — optional in constructor
constructor(options?: { roomId?: string; onStateChange?: ... })

// line 48-49 — conditional assignment
if (options?.roomId) {
  this.roomId = options.roomId;
}

// line 72 — silent fallback when empty
room: this.roomId || 'general',
```

Three layers all hiding the same gap: `roomId` is declared `string` (required) but allowed to land at `''`, optional in the constructor, then falls back to `'general'` at use-time. Same pattern your VoiceStartTypes hand-fix was correcting + my generator change formalized — the type promises required, the runtime accepts missing, the `||` covers the gap.
Honest shape depends on intent:

(a) roomId is genuinely required for this widget to function → `public roomId: string` (no default), `constructor(options: { roomId: string; ... })` (required in options), kill the `|| 'general'` (if no roomId, throw or refuse to start).

(b) `'general'` is the documented default when none is specified → `public roomId = 'general'` (no empty sentinel), `constructor(options?: { roomId?: string; ... })` is fine (default kicks in via class init), kill the `|| 'general'` at line 72 (already initialized).

Either is honest. Current shape has the worst of both — looks required, behaves optional, has a magic-string fallback nobody can audit by reading the type alone.
Other || short-circuits in this PR (params.sessionId || 'anonymous', getSecret('LIVEKIT_API_KEY') || LIVEKIT_API_KEY) look fine — those genuinely have defensible defaults at the right layer.
Catch this before merge if straightforward; otherwise track for a follow-up commit on this branch.
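Shape (b) as a minimal sketch (the class name is hypothetical and this is not the widget's actual code; it only shows the single-layer default):

```typescript
// Sketch: the default lives at construction, so the type alone
// documents the behavior and use sites need no runtime fallback.
class VoiceChatWidgetShape {
  public roomId: string;

  constructor(options?: { roomId?: string }) {
    this.roomId = options?.roomId ?? 'general';
  }

  startRoom(): string {
    return this.roomId; // no `|| 'general'` needed at use sites
  }
}
```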
VoiceChatWidget: roomId defaults to 'general' at class init, not empty string with runtime || fallback. Eliminates three-layer indirection (empty sentinel → optional constructor → runtime check). Addressed anvil's PR review. voiceSynthesize: default 120s timeout (up from 60s) to accommodate ONNX→Metal JIT cold start on M1. Subsequent calls are <2s.
Verification Proof — M1 Pro (MacBookPro-1959)
8/9 pass. 1 skip (jtag ping blocked by #915 — ORT Metal EP deadlock on M1, not a regression from this PR). Script:
Two bugs causing zero GPU usage on local personas:
1. CandleAdapter::initialize() eagerly loaded 2.5GB GGUF via embedded
llama.cpp on every startup — even though Candle is training-only.
This wasted RAM, caused Metal assertion crashes on M1 exit, and
the adapter was making resource decisions it has no authority to make.
Fix: initialize() just logs ready, no model load. Lazy-load on
explicit training request only.
2. AIProviderDaemon.selectAdapter() hard-coded 'local' → 'candle'
aliasing ("Candle is the ONLY local inference path"). Wrong since
DMR pivot. Fix: 'local' now routes through Rust IPC adapter which
has DMR registered at priority 0 (GPU). Candle only as last resort.
Was pointing at .continuum/jtag/data/database.sqlite, which doesn't exist on any install — reseed silently failed because data:reseed → data:clear → data:backup hit `cp: source not found`, the &&-chain halted, and data-clear.ts never ran.

Switch to sqlite3 .backup (WAL-safe — works with a running system, correctly captures uncommitted writes from main.db-wal). Backups now live in ~/.continuum/backups/ (consistent with the ~/.continuum/* convention everything else uses).

Live-tested on M5: 496MB main.db backed up cleanly while the system was running. Memento bug list #8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ContentTypeRegistry threw 'Unknown content type: metrics' when clicked because no metrics.json recipe existed. Mirror of diagnostics.json but pointed at metrics-detail-widget (the detail timeseries view) instead of diagnostics-widget. Memento bug list #2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…boot

CodebaseIndexer ran 64-chunk batches back-to-back with NO yield between batches. Each batch ~1.5s + ~80MB RSS growth. With 5000+ chunks in src/, that's 78+ batches × 1.5s = 2+ minutes of total event-loop saturation immediately after every boot. Local personas couldn't respond, voice couldn't connect, anything that needed the bus was blocked until indexing finished.

Two changes:
- Batch size 64→16 (smaller per-batch RSS hit, ~4× more chances for other IO to interleave between IPC roundtrips)
- 50ms pause between batches via setTimeout (yields the event loop so chat/voice/personas can process while indexing runs)

The throughput cost is small (16 vs 64 chunks per IPC) and the inter-batch pause is invisible at human timescales. The chat-arrival latency win is huge — the system is responsive within seconds of boot instead of minutes.

The deeper fix is querying GpuPressureWatcher / ResourcePressureWatcher before each batch and backing off when pressure is high — same principle Joel called out for InferenceCoordinator slot capacity. That's a follow-up; this is the floor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
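The batching change can be sketched as follows (names and structure are assumed; this is not the actual CodebaseIndexer code):

```typescript
// Sketch: index chunks in small batches, yielding the event loop
// between batches so chat/voice traffic can interleave with IPC work.
const BATCH_SIZE = 16;     // was 64
const INTER_BATCH_MS = 50; // pause between batches

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function indexInBatches<T>(
  chunks: T[],
  indexBatch: (batch: T[]) => Promise<void>,
): Promise<number> {
  let batches = 0;
  for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
    await indexBatch(chunks.slice(i, i + BATCH_SIZE));
    batches++;
    // Yield between batches, but not after the final one.
    if (i + BATCH_SIZE < chunks.length) await sleep(INTER_BATCH_MS);
  }
  return batches;
}
```

The follow-up described above would replace the fixed `sleep` with a check against the pressure watchers before each batch.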
M5 Verification — Candle eager-load fix WORKS but two follow-up bugs surface

What's confirmed working ✅

What surfaces as the real-tier bottleneck
Docker Model Runner defaults to model's max context (262k for Qwen3.5). With concurrent persona slots, KV cache balloons to 20GB+ on a 32GB machine, causing swap thrash and making the system unusable. 4096 context is sufficient for chat (RAG budget capped at 2-4k tokens). Drops llama-server from 20.87GB to ~1-2GB. Applied after model pull in install.sh so Carl and Dev both get it. Also: RAG context budget needs separate fix (currently sends 14k tokens to model, which is the actual prompt bloat — anvil working on that).
ChatRAGBuilder computed totalBudget = floor(contextWindow * 0.75).
For Qwen3.5-4b which advertises a 262144-token window, that's 196608
tokens — a budget no chat turn would ever sensibly fill.
Two costs from leaving it that wide:
1. RAG composition still ran with that budget, producing prompts
~14k tokens that were 10× what a chat turn needs.
2. llama-server allocated full 262k KV cache PER PERSONA SLOT.
Activity Monitor on M5 (Joel): com.docker.llama-server 20.87 GB
resident, total 44 GB across 4 personas vs 32 GB physical = swap.
CHAT_INPUT_BUDGET_CEILING = 8192. Sized for chat: ~2k system prompt +
~3k recent history + ~3k RAG context. Specialized recipes (research,
codereview) that legitimately need more can opt up via their own
RAGBuilder subclass.
This fix touches the RAG budget number only. The KV cache slot size
inside DMR's llama-server is set per-model at pull time and is a
separate (and harder) lever — capping the input prompt is what we
control from this layer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This reverts commit e57bcaf.
…c 8192

Joel's correction: no ceilings. The budget should be derived from the model's own characteristics, not a hardcoded escape hatch.

Previous commit set CHAT_INPUT_BUDGET_CEILING = 8192 as a workaround for the 196k → 14k → OOM chain. That's the same anti-pattern as hardcoded provider routing — a magic number in a builder instead of the authority deciding.

The right authority already exists: getLatencyAwareTokenLimit(model) returns the input ceiling that fits a chat-acceptable response time given the model's measured TPS. It's already used on line 616 for the message fetch limit. Apply it here for the total budget too.

Slow local model → latency-aware budget (Qwen3.5-4b on M5: ~24 TPS × 30s target = ~720 tokens — appropriately tight for the model). Fast cloud model → full 75% of context window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
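The principle can be sketched as below. The function name and the 30s/75% numbers follow the commit text; the real `getLatencyAwareTokenLimit` signature is not shown in this thread, so this is an assumption:

```typescript
// Sketch: derive the input-token budget from measured throughput
// rather than a hardcoded ceiling.
function latencyAwareTokenLimit(
  measuredTps: number,   // model's measured tokens/sec
  contextWindow: number, // model's advertised context window
  targetSeconds = 30,    // acceptable chat-turn latency
): number {
  const latencyBudget = Math.floor(measuredTps * targetSeconds);
  const windowBudget = Math.floor(contextWindow * 0.75);
  // Slow local model → the latency-derived budget dominates;
  // fast cloud model → the 75%-of-window budget dominates.
  return Math.min(latencyBudget, windowBudget);
}
```

For the M5 numbers in the commit, 24 TPS × 30s gives a 720-token budget against Qwen3.5-4b's 196608-token window budget.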
Three composable wins designed together:

1. Stable-first RAG ordering → llama-server prefix KV cache reuse → ~70× prompt-eval speedup (14k tokens reprocessed → ~200)
2. Multimodal content parts → delete STT/TTS sandwich for Qwen3.5 → 1 model invocation per voice turn instead of 3
3. Voice LoRA per persona → identity, not signal — the "Maya replied" experience that differentiates from Claude Code / OpenClaw / Aider

Acceptance: 6-persona LiveKit room on M5, voice turn round-trip <3s, total resident memory <8 GB, audio output recognizably persona-specific.

Companion to issue #917 (ModelMetadata refactor) — Phases 4-5 below depend on capability-declaration flowing through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joel's clarification: STT/TTS doesn't disappear. It becomes the universal substrate that gives ANY model class — niche 1B specialists, older Llama 3.1 text-only, cloud providers without audio — the same first-class persona experience. Local multimodal-native is the fast path; the bridge is what lets us mix model classes freely so users never know which class is actually serving their teammate. Updated decision matrix to cover all four model classes (local multimodal, cloud multimodal, cloud text-only, local text-only) and how voice identity stays a first-class property regardless. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
`VOICE_WS_PORT`/port-3001 eliminated from browser. Dead AudioWorklet processor files removed (348 lines). `wsUrl` field removed. Anvil fixed the generator to default `required: true` (b96a652) — 452 generated files will tighten on re-gen.

Remaining work on this PR
The old port-3001 WebSocket voice server still runs in parallel with LiveKit, doing the same work. This PR retires it.
What the exploration found
Current transport architecture (already good):
The duplication (old path duplicating LiveKit):
| Old path (port 3001) | LiveKit path |
|---|---|
| `getRustVoiceOrchestrator().onUtterance()` | `CollaborationLiveTranscriptionServerCommand` |
| `VoiceSynthesize.execute()` + manual 20ms chunking to WS | `voiceSpeakInCall()` IPC → Rust TTS → LiveKit publish |
| `voice:audio:level` event | `room.localParticipant.audioLevel` at 30fps |

Step-by-step plan
Step 1: Remove startVoiceServer() from boot
- `JTAGSystemServer.ts:223-230` — delete the `startVoiceServer()` call

Step 2: Remove VoiceWebSocketHandler.ts
- `system/voice/server/VoiceWebSocketHandler.ts` — delete
- `system/voice/server/index.ts` — remove re-exports

Step 3: Verify orchestration still works through LiveKit path
- `VoiceOrchestrator.ts` — KEEP. Used by LiveKit path
- `AIAudioBridge.ts` — KEEP. TTS→LiveKit publish
- `AudioNativeBridge.ts` — KEEP. Voice-native AI models
- `VoiceSessionManager.ts` — KEEP. Used by VoiceStartServerCommand

Step 4: Update tests
- `voice-websocket-transcription-handler.test.ts` — delete (tests deleted handler)

Step 5: Verify Docker LiveKit reliability
- `docker compose up` boots livekit + livekit-bridge + continuum-core

Architecture (non-negotiable)
Test plan
- `npx tsc --noEmit` — zero errors
- `docker compose config --quiet` — valid

🤖 Generated with Claude Code