
voice: retire legacy WS transport, unify on LiveKit WebRTC#914

Open
joelteply wants to merge 22 commits into main from fix/voice-livekit-migration

Conversation

@joelteply
Contributor

Summary

  • VoiceChatWidget browser migration: 427→178 lines. Raw WebSocket + AudioWorklet replaced with AudioStreamClient (LiveKit WebRTC). VOICE_WS_PORT/port-3001 eliminated from browser. Dead AudioWorklet processor files removed (348 lines).
  • Type safety: Required result fields enforced in factory params. Dead wsUrl field removed. Anvil fixed the generator to default required: true (b96a652) — 452 generated files will tighten on re-gen.
  • LiveKit stays always-on: Reverted a mistaken profile-gate attempt. LiveKit is THE efficient UDP/WebRTC transport backbone — 14 personas + 4 LLMs + TTS/STT + Bevy avatars worked simultaneously on M1 because of it. Texture-ID + mouth-animation params over the wire, NOT rasterized video.

Remaining work on this PR

The old port-3001 WebSocket voice server still runs in parallel with LiveKit, doing the same work. This PR retires it.

What the exploration found

Current transport architecture (already good):

  • Core↔Bridge IPC: binary frames over Unix socket. Audio = raw i16 PCM, video = raw RGBA. Zero base64.
  • Bridge↔LiveKit: WebRTC/UDP. Opus codec for audio, hardware-accelerated video.
  • Tile resolution: browser-driven CSS px → 6 tiers (Tiny 160×120 → FullHD 1920×1080), dynamic fps.
  • Grid: TCP for reliable commands, UDP (port 7118) for fire-and-forget events.
  • Avatar pipeline: texture-ID approach — Bevy renders locally from params, LiveKit carries voice + metadata. NOT rasterized streams.
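
The browser-driven tier selection above can be sketched as a pure lookup. Only the Tiny (160×120) and FullHD (1920×1080) endpoints are named in this PR, so the intermediate tiers below are placeholders:

```typescript
// Hypothetical tier table — only Tiny and FullHD come from the PR text;
// the four intermediate tiers are illustrative stand-ins.
interface Tier { name: string; width: number; height: number; }

const TIERS: Tier[] = [
  { name: 'Tiny',   width: 160,  height: 120 },
  { name: 'Small',  width: 320,  height: 240 },
  { name: 'VGA',    width: 640,  height: 480 },
  { name: 'SD',     width: 960,  height: 540 },
  { name: 'HD',     width: 1280, height: 720 },
  { name: 'FullHD', width: 1920, height: 1080 },
];

// Pick the smallest tier that covers the tile's CSS pixel size,
// falling back to FullHD for anything larger.
function pickTier(cssWidth: number, cssHeight: number): Tier {
  return TIERS.find(t => t.width >= cssWidth && t.height >= cssHeight)
      ?? TIERS[TIERS.length - 1];
}
```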

The duplication (old path duplicating LiveKit):

| Function | Old (VoiceWebSocketHandler.ts, port 3001) | New (LiveKit via bridge) |
|---|---|---|
| Audio capture | Manual binary WS frames, 500ms buffering | LiveKit SDK + getUserMedia, Opus, WebRTC |
| STT trigger | 500ms accumulation threshold | LiveKit VAD + Rust STT listener |
| Transcription routing | getRustVoiceOrchestrator().onUtterance() | Same call, via CollaborationLiveTranscriptionServerCommand |
| TTS synthesis | VoiceSynthesize.execute() + manual 20ms chunking to WS | voiceSpeakInCall() IPC → Rust TTS → LiveKit publish |
| Audio playback | Manual binary WS frames to browser | LiveKit remote track → HTMLAudioElement |
| Mic levels | voice:audio:level event | room.localParticipant.audioLevel at 30fps |
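
As an illustration of the mic-levels row: instead of consuming a voice:audio:level WS event, the LiveKit path can poll the local participant's audioLevel at ~30fps. The audioLevel field is the one named in the table; the poller itself is an assumed sketch, not the widget's actual code:

```typescript
// Minimal mic-level poller sketch. `participant` stands in for
// room.localParticipant; only `audioLevel` is assumed on it.
type LevelSource = { audioLevel: number };

function startMicLevelPolling(
  participant: LevelSource,
  onMicLevel: (level: number) => void,
  fps = 30,
): () => void {
  const timer = setInterval(() => onMicLevel(participant.audioLevel), 1000 / fps);
  return () => clearInterval(timer); // caller stops polling on disconnect
}
```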

Step-by-step plan

Step 1: Remove startVoiceServer() from boot

  • JTAGSystemServer.ts:223-230 — delete the startVoiceServer() call
  • Port 3001 stops binding. Zero impact on LiveKit path.

Step 2: Remove VoiceWebSocketHandler.ts

  • 590 lines of raw WebSocket audio handling that LiveKit replaces
  • system/voice/server/VoiceWebSocketHandler.ts → delete
  • system/voice/server/index.ts → remove re-exports

Step 3: Verify orchestration still works through LiveKit path

  • VoiceOrchestrator.ts — KEEP. Used by LiveKit path
  • AIAudioBridge.ts — KEEP. TTS→LiveKit publish
  • AudioNativeBridge.ts — KEEP. Voice-native AI models
  • VoiceSessionManager.ts — KEEP. Used by VoiceStartServerCommand

Step 4: Update tests

  • voice-websocket-transcription-handler.test.ts — delete (tests deleted handler)
  • Verify remaining voice tests use LiveKit path

Step 5: Verify Docker LiveKit reliability

  • docker compose up boots livekit + livekit-bridge + continuum-core
  • Verify on Mac + BigMama
  • Verify multi-persona live call

Architecture (non-negotiable)

  • No rasterization — Bevy renders GPU-accelerated, texture-ID approach. LiveKit carries metadata, not pixels.
  • Pointers not copies — binary IPC for audio/video. Base64 banned for real-time data.
  • UDP fire-and-forget — WebRTC handles this natively.
  • Dynamic resolution — 6 tiers, VGA dropback under pressure.
  • LiveKit always-on — the transport backbone, not a feature flag.

Test plan

  • npx tsc --noEmit — zero errors
  • docker compose config --quiet — valid
  • Step 1-2: Remove old WS server, verify boot without port 3001
  • Step 3: Verify persona response (orchestration intact)
  • Step 4: Run/update voice tests
  • Step 5: Docker LiveKit on Mac + BigMama

🤖 Generated with Claude Code

VoiceStartServerCommand now returns LiveKit URL + JWT token instead of
spinning up a legacy WebSocket server on port 3001. Same LiveKit token
generation pattern as collaboration/live/join.
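
For reference, a LiveKit token is a standard HS256 JWT whose payload carries a video grant. The real command uses livekit-server-sdk's AccessToken; this sketch builds an equivalent payload by hand with Node's crypto, purely to show the shape of what voice/start now returns (the helper is illustrative, not the PR's code):

```typescript
import { createHmac } from 'node:crypto';

const b64url = (buf: Buffer) => buf.toString('base64url');

// Illustrative LiveKit-style token: iss = API key, sub = participant
// identity, `video` grant names the room the token may join.
function makeLiveKitStyleToken(
  apiKey: string, apiSecret: string, identity: string, room: string,
): string {
  const header = { alg: 'HS256', typ: 'JWT' };
  const payload = {
    iss: apiKey,
    sub: identity,
    video: { roomJoin: true, room },
    exp: Math.floor(Date.now() / 1000) + 3600, // 1h expiry
  };
  const body =
    b64url(Buffer.from(JSON.stringify(header))) + '.' +
    b64url(Buffer.from(JSON.stringify(payload)));
  const sig = b64url(createHmac('sha256', apiSecret).update(body).digest());
  return `${body}.${sig}`;
}
```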

Port 3001 is no longer needed — Docker compose never exposed it, so it
was already dead in containerized deployments.

VoiceStartResult type adds livekitUrl + livekitToken fields (wsUrl kept
for backwards compat, set to same as livekitUrl).

VoiceChatWidget browser-side migration (raw WS → AudioStreamClient
LiveKit transport) is the next step — this commit unblocks it by
providing the correct server-side response shape.

Existing TTS→STT roundtrip tests (livekit-audio-roundtrip.test.ts,
sensory_pipeline_test.rs with gunfire/noise injection) validate the
audio pipeline independently. Once the widget is wired to LiveKit,
those tests cover the full voice path.
- VoiceChatWidget: replace raw WebSocket + AudioWorklet with
  AudioStreamClient (LiveKit WebRTC). 427→178 lines. VOICE_WS_PORT/3001
  eliminated from browser side.

- VoiceStartTypes: required result fields (handle, livekitUrl,
  livekitToken, roomId) now required in factory params — no more
  optional + empty-string defaults hiding compile-time errors.
  Remove dead wsUrl field (legacy port-3001, zero consumers).

- docker-compose: livekit + livekit-bridge moved to profiles: [live].
  Text chat works without WebRTC infrastructure. Carl saves ~300MB RAM
  and doesn't wait for LiveKit to boot.

- continuum-core depends_on: removed hard dep on livekit-bridge
  (profile-gated services can't be in depends_on). Core discovers
  bridge via socket at startup when live profile is active.
LiveKit provides the UDP/WebRTC transport that made 14 personas + 4 LLMs
+ TTS/STT + Bevy avatars work simultaneously on M1. Profiling it out
would degrade the product to text-only. Same principle as Docker Model
Runner — efficient transport is a core requirement.

Restores: livekit + livekit-bridge as always-on services,
continuum-core depends_on livekit-bridge health.
VoiceChatWidget now uses AudioStreamClient (LiveKit WebRTC) for all
audio capture and playback. These worklet processors were only loaded
by the old raw-WebSocket code path that was replaced in the previous
commit. No remaining references in the codebase.
Copilot AI review requested due to automatic review settings April 17, 2026 19:31
Remove VoiceWebSocketHandler.ts (586 lines) — its entire functionality
is handled by the LiveKit WebRTC transport:
- Audio capture/playback → LiveKit SDK + AudioStreamClient
- STT triggering → LiveKit VAD + Rust STT listener
- Transcription routing → CollaborationLiveTranscriptionServerCommand
- TTS synthesis → AIAudioBridge → voiceSpeakInCall IPC → LiveKit publish

Keep: VoiceOrchestrator (persona routing), AIAudioBridge (TTS→LiveKit),
AudioNativeBridge (voice-native AI models), VoiceSessionManager.

Port 3001 no longer binds on server startup.
Tests updated to reference LiveKit path instead of deleted handler.

Copilot AI left a comment


Pull request overview

This PR migrates the voice chat browser path off the legacy port-3001 raw WebSocket + AudioWorklet pipeline and standardizes on LiveKit/WebRTC (via AudioStreamClient), with voice/start now returning LiveKit connection details.

Changes:

  • Remove the legacy browser AudioWorklet capture/playback processors.
  • Rewrite VoiceChatWidget to join LiveKit via AudioStreamClient and drive UI state/events from LiveKit mic levels, transcription, and active-speaker signals.
  • Update voice/start types + server implementation to return livekitUrl + livekitToken (and remove the legacy wsUrl field); update docs/compose commentary to reflect LiveKit always-on.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| src/widgets/voice-chat/voice-playback-processor.js | Deletes legacy playback AudioWorklet processor. |
| src/widgets/voice-chat/voice-capture-processor.js | Deletes legacy capture AudioWorklet processor. |
| src/widgets/voice-chat/VoiceChatWidget.ts | Switches widget transport to LiveKit AudioStreamClient and updates state/event wiring. |
| src/commands/voice/start/shared/VoiceStartTypes.ts | Replaces wsUrl with required livekitUrl/livekitToken fields in result typing/factory. |
| src/commands/voice/start/server/VoiceStartServerCommand.ts | Generates LiveKit JWT and returns LiveKit URL instead of starting/using the port-3001 voice WS server. |
| docs/planning/ALPHA-GAP-ANALYSIS.md | Documents the LiveKit migration and remaining cleanup. |
| docker-compose.yml | Updates comments to emphasize LiveKit/livekit-bridge always-on assumptions. |


Comment on lines +65 to +66
// LiveKit URL for browser connection
const livekitUrl = getSecret('LIVEKIT_URL') || 'ws://localhost:7880';

Copilot AI Apr 17, 2026


livekitUrl is sourced from getSecret('LIVEKIT_URL'), but in docker-compose the node-server default LIVEKIT_URL points at the Docker-internal hostname (ws://livekit:7880). Returning that to the browser will fail because the browser can’t resolve livekit. Align this with LiveJoinServerCommand by returning a browser-reachable URL (e.g., fall back to @shared/AudioConstants.LIVEKIT_URL / getWebSocketUrl(LIVEKIT_TLS_PORT) or introduce/use a dedicated LIVEKIT_BROWSER_URL secret/env).

Comment on lines +91 to +93
const apiKey = getSecret('LIVEKIT_API_KEY') || LIVEKIT_API_KEY;
const apiSecret = getSecret('LIVEKIT_API_SECRET') || LIVEKIT_API_SECRET;
const token = new AccessToken(apiKey, apiSecret, {

Copilot AI Apr 17, 2026


getSecret calls here omit the requestedBy argument, which means SecretManager audit logs will record these reads as coming from unknown. Pass a stable identifier (e.g., 'VoiceStartServerCommand') like LiveJoinServerCommand does so secret access is traceable in logs.

Comment on lines +79 to +82
this.handle = result.handle;
this.localUserId = result.roomId; // sessionId used as identity in JWT

// Create AudioStreamClient wired to our state

Copilot AI Apr 17, 2026


localUserId is being set to result.roomId, but LiveKit speaker identities are participant.identity from the JWT (here generated from params.sessionId). This will cause isSpeaking/isAISpeaking to be computed incorrectly (and can mark AI speaking when only the local user is speaking). Prefer returning the LiveKit identity in VoiceStartResult (e.g., identity/userId) or deriving it after connect (e.g., from the connected room/local participant) and using that for active-speaker comparisons.

Comment on lines 68 to +87
async start(): Promise<void> {
try {
// Resume audio context if suspended (browser autoplay policy)
if (this.audioContext?.state === 'suspended') {
await this.audioContext.resume();
}
// Get LiveKit credentials from voice/start command
const result: VoiceStartResult = await VoiceStart.execute({
room: this.roomId || 'general',
});

// Initialize audio if needed
if (!this.audioContext) {
await this.initAudio();
if (!result.success) {
throw new Error(result.error?.message || 'Failed to start voice session');
}

// Start voice session via command to get handle
if (!this.handle) {
const result = await VoiceStart.execute({
room: this.roomId || 'general',
});

if (!result.success) {
throw new Error(result.error?.message || 'Failed to start voice session');
}
this.handle = result.handle;
this.localUserId = result.roomId; // sessionId used as identity in JWT

// Create AudioStreamClient wired to our state
this.audioClient = new AudioStreamClient({
onConnectionChange: (connected) => {
this.updateState({ isConnected: connected, error: connected ? null : 'Disconnected' });
},
onMicLevel: (level) => {

Copilot AI Apr 17, 2026


start() unconditionally creates a new server session and a new AudioStreamClient even if the widget is already listening/connected. Calling start() twice (or toggle() rapidly) can leak LiveKit connections and orphan server-side handles. Consider guarding with if (this.voiceState.isListening) return; or calling await this.stop() before starting a new session, and ensure any existing audioClient has left the room before being overwritten.

Copilot uses AI. Check for mistakes.
Comment on lines +99 to +116
onActiveSpeakersChanged: (speakerIds: string[]) => {
this.activeSpeakers = new Set(speakerIds);
const isSpeaking = this.activeSpeakers.has(this.localUserId);
const isAISpeaking = speakerIds.some(id => id !== this.localUserId);

this.updateState({ isSpeaking, isAISpeaking });

if (isSpeaking) {
Events.emit('voice:speaking:start', { roomId: this.roomId });
} else {
Events.emit('voice:speaking:end', { roomId: this.roomId });
}
if (isAISpeaking) {
Events.emit('voice:ai:speaking:start', { roomId: this.roomId });
} else {
Events.emit('voice:ai:speaking:end', { roomId: this.roomId });
}
},

Copilot AI Apr 17, 2026


onActiveSpeakersChanged emits voice:speaking:start/end and voice:ai:speaking:start/end on every active-speaker list change, even when isSpeaking/isAISpeaking didn't transition. This can produce duplicate start/end events (e.g., when another participant starts speaking while the user is already speaking). Track the previous speaking states and only emit the corresponding events when the boolean value changes.
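
A minimal sketch of the edge-triggered fix (emit stands in for the widget's Events.emit; event payloads omitted):

```typescript
// Track previous speaking booleans and emit start/end only on a flip,
// so active-speaker list churn can't produce duplicate events.
function makeSpeakingTracker(
  localUserId: string,
  emit: (event: string) => void,
) {
  let wasSpeaking = false;
  let wasAISpeaking = false;

  return (speakerIds: string[]) => {
    const isSpeaking = speakerIds.includes(localUserId);
    const isAISpeaking = speakerIds.some(id => id !== localUserId);

    if (isSpeaking !== wasSpeaking) {
      emit(isSpeaking ? 'voice:speaking:start' : 'voice:speaking:end');
      wasSpeaking = isSpeaking;
    }
    if (isAISpeaking !== wasAISpeaking) {
      emit(isAISpeaking ? 'voice:ai:speaking:start' : 'voice:ai:speaking:end');
      wasAISpeaking = isAISpeaking;
    }
  };
}
```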

Contributor Author

@joelteply joelteply left a comment


Reviewed for partnership / second pair of eyes.

LGTM from install-reliability lens. Key strengths:

  • 1441 deletions vs 209 additions = real cleanup, not just shuffle
  • docker-compose.yml prose embeds the architectural reasoning (texture-IDs over UDP, "do not profile-gate this") right next to the config — future readers won't repeat the profile-decouple mistake
  • VoiceStartServerCommand.ts cleanly migrated to LiveKit JWT, getSecret() with sensible dev defaults at line 95-96
  • ALPHA-GAP-ANALYSIS.md updated with the right architectural call (LiveKit always-on, same as DMR)
  • References the generator type-safety fix from this PR (b96a652) so result fields are properly required

Two follow-up observations (not blocking — file as separate issues if useful):

  1. LIVEKIT_API_KEY/SECRET defaults — dev keys (devkey / secret) match livekit-server --dev's defaults so local installs work zero-config. For production deployments where Carl's instance is exposed (e.g., Tailscale grid hosting), those keys should be auto-generated at install time and written to ~/.continuum/config.env so each user's grid has unique credentials. setup.sh could do this — openssl rand -hex 32 for both. Could fold into my install-reliability work or your call.

  2. VoiceWebSocketHandler.ts still on boot — explicitly punted to a follow-up PR (per gap analysis prose). Fine as long as we track it. The cleanup now means LiveKit is the only Voice client path, but the server-side WS handler still listens on 3001 unnecessarily. Worth a follow-up issue / commit to remove.

Otherwise — ship it. The architecture story is intact.

🤖 anvil reviewed via gh pr review while working on PR #913 (install-reliability lane). Both lanes converge on main when ready.

Contributor Author

@joelteply joelteply left a comment


Type-safety audit (per Joel's request to check for the same pattern your VoiceStartTypes fix was about):

Found one real violation in src/widgets/voice-chat/VoiceChatWidget.ts:

// line 26
public roomId: string = '';                                    // required type, empty sentinel default

// line 47 — optional in constructor
constructor(options?: { roomId?: string; onStateChange?: ... })

// line 48-49 — conditional assignment
if (options?.roomId) {
  this.roomId = options.roomId;
}

// line 72 — silent fallback when empty
room: this.roomId || 'general',

Three layers all hiding the same gap: roomId is declared string (required) but allowed to land at '', optional in the constructor, then falls back to 'general' at use-time. Same pattern your VoiceStartTypes hand-fix was correcting + my generator change formalized — the type promises required, the runtime accepts missing, the || covers the gap.

Honest shape depends on intent:

(a) roomId is genuinely required for this widget to function → public roomId: string (no default), constructor(options: { roomId: string; ... }) (required in options), kill the || 'general' (if no roomId, throw or refuse to start).

(b) 'general' is the documented default when none is specified → public roomId = 'general' (no empty sentinel), constructor(options?: { roomId?: string; ... }) is fine (default kicks in via class init), kill the || 'general' at line 72 (already initialized).

Either is honest. Current shape has the worst of both — looks required, behaves optional, has a magic-string fallback nobody can audit by reading the type alone.

Other || short-circuits in this PR (params.sessionId || 'anonymous', getSecret('LIVEKIT_API_KEY') || LIVEKIT_API_KEY) look fine — those genuinely have defensible defaults at the right layer.

Catch this before merge if straightforward; otherwise track for a follow-up commit on this branch.
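
Option (b) can be sketched in a few lines (class name illustrative):

```typescript
// 'general' is the documented default, assigned once at construction —
// no empty-string sentinel and no runtime || fallback at the use site.
class VoiceChatWidgetShape {
  public roomId: string;

  constructor(options?: { roomId?: string }) {
    this.roomId = options?.roomId ?? 'general'; // default lives in one place
  }
}
```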

VoiceChatWidget: roomId defaults to 'general' at class init, not empty
string with runtime || fallback. Eliminates three-layer indirection
(empty sentinel → optional constructor → runtime check). Addressed
anvil's PR review.

voiceSynthesize: default 120s timeout (up from 60s) to accommodate
ONNX→Metal JIT cold start on M1. Subsequent calls are <2s.
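
A generic sketch of that timeout behavior — the 120_000 ms default mirrors the commit; the wrapper itself is assumed, not the actual voiceSynthesize code:

```typescript
// Race the synthesis call against a timer; clear the timer on settle so
// a fast call (the <2s warm case) doesn't leave a pending timeout.
function withTimeout<T>(work: Promise<T>, ms = 120_000): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`voiceSynthesize timed out after ${ms}ms`)), ms);
    work.then(
      v => { clearTimeout(timer); resolve(v); },
      e => { clearTimeout(timer); reject(e); },
    );
  });
}
```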
@joelteply
Contributor Author

Verification Proof — M1 Pro (MacBookPro-1959)

Branch: fix/voice-livekit-migration
SHA: cc0bb3f21
Date: 2026-04-17T20:04:13Z
Machine: MacBookPro-1959.lan (Darwin 25.0.0, arm64)
| Check | Result | Detail |
|---|---|---|
| TypeScript compilation | PASS | Zero errors |
| Port 3001 not bound | PASS | Old voice WS server removed |
| VoiceWebSocketHandler.ts deleted | PASS | File removed |
| voice-start.json spec | PASS | Has livekitUrl + livekitToken |
| VoiceStartTypes type safety | PASS | Required fields enforced in factory |
| docker-compose.yml valid | PASS | Validates cleanly |
| LiveKit not profile-gated | PASS | Always-on in compose |
| jtag ping | SKIP | System not booted (ORT Metal deadlock #915 blocks TTS warmup on M1) |
| AudioWorklet processors deleted | PASS | Dead files removed |

8/9 pass. 1 skip (jtag ping blocked by #915 — ORT Metal EP deadlock on M1, not a regression from this PR).

Script: scripts/verify-pr-914.sh — run it to reproduce.

joelteply and others added 5 commits April 17, 2026 15:13
Two bugs causing zero GPU usage on local personas:

1. CandleAdapter::initialize() eagerly loaded 2.5GB GGUF via embedded
   llama.cpp on every startup — even though Candle is training-only.
   This wasted RAM, caused Metal assertion crashes on M1 exit, and
   the adapter was making resource decisions it has no authority to make.
   Fix: initialize() just logs ready, no model load. Lazy-load on
   explicit training request only.

2. AIProviderDaemon.selectAdapter() hard-coded 'local' → 'candle'
   aliasing ("Candle is the ONLY local inference path"). Wrong since
   DMR pivot. Fix: 'local' now routes through Rust IPC adapter which
   has DMR registered at priority 0 (GPU). Candle only as last resort.
Was pointing at .continuum/jtag/data/database.sqlite which doesn't
exist on any install — reseed silently failed because
data:reseed → data:clear → data:backup hit `cp: source not found`,
&&-chain halted, data-clear.ts never ran.

Switch to sqlite3 .backup (WAL-safe — works with running system,
correctly captures uncommitted writes from main.db-wal).

Backups now live in ~/.continuum/backups/ (consistent with the
~/.continuum/* convention everything else uses).

Live-tested on M5: 496MB main.db backed up cleanly while the
system was running.

Memento bug list #8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ContentTypeRegistry threw 'Unknown content type: metrics' when
clicked because no metrics.json recipe existed. Mirror of
diagnostics.json but pointed at metrics-detail-widget (the
detail timeseries view) instead of diagnostics-widget.

Memento bug list #2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…boot

CodebaseIndexer ran 64-batches back-to-back with NO yield between
batches. Each batch ~1.5s + ~80MB RSS growth. With 5000+ chunks in
src/, that's 78+ batches × 1.5s = 2+ minutes of total event-loop
saturation immediately after every boot. Local personas couldn't
respond, voice couldn't connect, anything that needed the bus was
blocked until indexing finished.

Two changes:
- Batch size 64→16 (smaller per-batch RSS hit, ~4× more chances
  for other IO to interleave between IPC roundtrips)
- 50ms pause between batches via setTimeout (yields the event loop
  so chat/voice/personas can process while indexing runs)

The throughput cost is small (16 vs 64 chunks per IPC) and the
inter-batch pause is invisible at human timescales. The chat-arrival
latency win is huge — system is responsive within seconds of boot
instead of minutes.

The deeper fix is querying GpuPressureWatcher / ResourcePressureWatcher
before each batch and backing off when pressure is high — same
principle Joel called out for InferenceCoordinator slot capacity.
That's a follow-up; this is the floor.
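
The batching change can be sketched as (processBatch stands in for the real IPC call):

```typescript
// Split work into 16-chunk batches (down from 64).
function chunkBatches<T>(items: T[], size = 16): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Process batches with a 50ms setTimeout pause between them, yielding
// the event loop so chat/voice IO can interleave with indexing.
async function indexAll<T>(
  items: T[],
  processBatch: (batch: T[]) => Promise<void>,
  pauseMs = 50,
): Promise<void> {
  for (const batch of chunkBatches(items)) {
    await processBatch(batch);
    await new Promise<void>(r => setTimeout(r, pauseMs));
  }
}
```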

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@joelteply
Contributor Author

M5 Verification — Candle eager-load fix WORKS but two follow-up bugs surface

What's confirmed working ✅

  • Test 1 (direct): ./jtag ai/generate --provider=local returned provider: "docker-model-runner" — confirmed. Routing correctly lands on DMR. Persona seed config (provider: "local") → registry select() → DMR adapter, no more silent Candle bypass.
  • Test 2 (chat): Helper AI, Teacher AI, Local Assistant, CodeReview AI all responded to chat msgs after the embedding-storm settled. Routing+chat pipeline functional end-to-end.
  • DMR backend: docker model status --json confirms llama.cpp latest-metal + vllm-metal both Running. Metal IS the backend.
  • Embedding storm fix (commit c1c6d62d4, also pushed to this branch): batch 64→16 + 50ms inter-batch yield. CodebaseIndexer no longer monopolizes event loop for 2+min after boot.
  • data:backup fix (commit 8164d6ca7): pointed at real ~/.continuum/database/main.db path with sqlite3 .backup (WAL-safe). Live-tested 496MB backup on M5.
  • metrics.json recipe (commit 8901f2618): fixes the Metrics tab crash (Unknown content type 'metrics' — missing recipe, #916).

What surfaces as the real-tier bottleneck ⚠️

1. llama.cpp Metal DeltaNet kernels are unoptimised for Qwen3.5.
DMR logs on M5 show prompt eval at 379 tok/s (GPU territory), but predicted (output) tok/s for Qwen3.5-4b-code-forged drops to ~4 tok/s on a 296-token completion. Qwen3.5 uses Gated DeltaNet (recurrence, not pure transformer). ssm_conv / ssm_scan / gated_delta_net Metal shaders in ggml/src/ggml-metal/ have a documented ~14× regression. This is an upstream llama.cpp gap, not a routing bug. Options: (a) patch our vendor copy of llama.cpp's Metal shaders, (b) install MLX backend (docker model status shows mlx: Not Installed), (c) default first-chat experience to Qwen2.5 (pure transformer, ~33 tok/s on M5) until DeltaNet kernels land.

2. Personas hit a 14,443-token prompt window per chat call (per docker model logs n_tokens = 14443). RAG isn't budgeting context. Even on optimised kernels this would hurt latency — current behaviour starts users with multi-second prompt eval before any output token streams. Plus it caused the visible echo-chamber: 14 chat replies from 4 personas all converging on identical "Webview authentication" hallucination because they're all loading the same bloated context. Separate fix: enforce RAG budget caps in PersonaResponseGenerator chain.

Recommendation

Merge this PR with the routing + Candle-eager-load fix as the user-facing first-chat unblock. File the Metal DeltaNet shader work and RAG budget bug as follow-up issues — both are real but neither belongs in a voice-LiveKit-migration PR.

joelteply and others added 5 commits April 17, 2026 16:20
Docker Model Runner defaults to model's max context (262k for Qwen3.5).
With concurrent persona slots, KV cache balloons to 20GB+ on a 32GB
machine, causing swap thrash and making the system unusable.

4096 context is sufficient for chat (RAG budget capped at 2-4k tokens).
Drops llama-server from 20.87GB to ~1-2GB. Applied after model pull
in install.sh so Carl and Dev both get it.

Also: RAG context budget needs separate fix (currently sends 14k tokens
to model, which is the actual prompt bloat — anvil working on that).
ChatRAGBuilder computed totalBudget = floor(contextWindow * 0.75).
For Qwen3.5-4b which advertises a 262144-token window, that's 196608
tokens — a budget no chat turn would ever sensibly fill.

Two costs from leaving it that wide:
  1. RAG composition still ran with that budget, producing prompts
     ~14k tokens that were 10× what a chat turn needs.
  2. llama-server allocated full 262k KV cache PER PERSONA SLOT.
     Activity Monitor on M5 (Joel): com.docker.llama-server 20.87 GB
     resident, total 44 GB across 4 personas vs 32 GB physical = swap.

CHAT_INPUT_BUDGET_CEILING = 8192. Sized for chat: ~2k system prompt +
~3k recent history + ~3k RAG context. Specialized recipes (research,
codereview) that legitimately need more can opt up via their own
RAGBuilder subclass.

This fix touches the RAG budget number only. The KV cache slot size
inside DMR's llama-server is set per-model at pull time and is a
separate (and harder) lever — capping the input prompt is what we
control from this layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…c 8192

Joel's correction: no ceilings. The budget should be derived from
the model's own characteristics, not a hardcoded escape hatch.

Previous commit set CHAT_INPUT_BUDGET_CEILING = 8192 as a workaround
for the 196k → 14k → OOM chain. That's the same anti-pattern as
hardcoded provider routing — a magic number in a builder instead of
the authority deciding.

The right authority already exists: getLatencyAwareTokenLimit(model)
returns the input ceiling that fits a chat-acceptable response time
given the model's measured TPS. It's already used on line 616 for
the message fetch limit. Apply it here for the total budget too.

Slow local model → latency-aware budget (Qwen3.5-4b on M5: ~24 TPS
× 30s target = ~720 tokens — appropriately tight for the model).
Fast cloud model → full 75% of context window.
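
The derivation above can be sketched as a pure function. The 30s latency target and the 75%-of-window ratio come from the commit text; the 200-TPS fast/slow threshold is an illustrative assumption:

```typescript
// Budget derived from the model's own characteristics: slow local models
// get TPS × target-seconds; fast cloud models get 75% of the window.
function latencyAwareTokenBudget(
  contextWindow: number,
  measuredTps: number,
  targetSeconds = 30,
  fastTpsThreshold = 200, // assumption: cloud-class throughput
): number {
  const windowBudget = Math.floor(contextWindow * 0.75);
  if (measuredTps >= fastTpsThreshold) return windowBudget;
  return Math.min(windowBudget, Math.floor(measuredTps * targetSeconds));
}
```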

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three composable wins designed together:
1. Stable-first RAG ordering → llama-server prefix KV cache reuse
   → ~70× prompt-eval speedup (14k tokens reprocessed → ~200)
2. Multimodal content parts → delete STT/TTS sandwich for Qwen3.5
   → 1 model invocation per voice turn instead of 3
3. Voice LoRA per persona → identity, not signal — the "Maya replied"
   experience that differentiates from Claude Code / OpenClaw / Aider

Acceptance: 6-persona LiveKit room on M5, voice turn round-trip <3s,
total resident memory <8 GB, audio output recognizably persona-specific.

Companion to issue #917 (ModelMetadata refactor) — Phases 4-5 below
depend on capability-declaration flowing through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
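
The stable-first ordering in win 1 can be sketched as follows (section names are illustrative):

```typescript
// Put parts that don't change between turns first so llama-server's
// prefix KV cache is reused; only the volatile tail is re-evaluated.
interface PromptParts {
  systemPrompt: string;   // stable across turns
  stableRag: string;      // stable retrieved context
  recentHistory: string;  // changes every turn
  userQuery: string;      // changes every turn
}

function assembleStableFirst(p: PromptParts): string {
  return [p.systemPrompt, p.stableRag, p.recentHistory, p.userQuery]
    .join('\n\n');
}
```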
Joel's clarification: STT/TTS doesn't disappear. It becomes the
universal substrate that gives ANY model class — niche 1B
specialists, older Llama 3.1 text-only, cloud providers without
audio — the same first-class persona experience. Local
multimodal-native is the fast path; the bridge is what lets us
mix model classes freely so users never know which class is
actually serving their teammate.

Updated decision matrix to cover all four model classes (local
multimodal, cloud multimodal, cloud text-only, local text-only)
and how voice identity stays a first-class property regardless.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>