Stop throwing computation away. Most of the prompt is invariant per request, and Qwen3.5 takes audio and images directly — eliminating STT/TTS entirely. Together these collapse end-to-end voice latency from minutes to ~2-3 seconds per turn while running 14 personas in parallel on a single M-series Mac.
Full design: docs/architecture/MULTIMODAL-WORKER-AND-PREFIX-REUSE.md
The thesis
We are not GPU-bound. We are waste-bound. Three composable wins:
- Stable-first RAG ordering → llama-server prefix KV cache reuse → ~70× prompt-eval speedup (today: 14k tokens reprocessed per turn → target: ~200, the volatile suffix only; layout sketch after this list)
- Multimodal content parts → delete STT/TTS sandwich for Qwen3.5 → 1 model invocation per voice turn instead of 3 (no Whisper, no Kokoro, no ORT-deadlock chain)
- Voice LoRA per persona → identity, not signal → the "Maya replied" experience that differentiates from Claude Code / OpenClaw / Aider
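To make the prefix-reuse mechanics concrete, a layout sketch in TypeScript; the section names and contents are illustrative (ours, not the codebase's), with token counts taken from the estimates above:

```ts
// Illustrative prompt layout for llama-server prefix KV reuse. llama-server
// skips prompt eval for the longest prefix already present in a slot's KV
// cache, so the volatile region must always come last.
const invariantParts = ["<system prompt>", "<persona card>", "<skill docs>"]; // identical every turn; the bulk of today's 14k tokens
const semiStableParts = ["<room roster>", "<pinned memories>"]; // changes rarely
const volatileParts = ["<history tail>", "<latest user message>"]; // ~200 tokens, changes every turn

// Stable-first: any reordering above the volatile region shifts bytes and
// forfeits the cached prefix, forcing a full re-eval.
const prompt = [...invariantParts, ...semiStableParts, ...volatileParts].join("\n");
```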
Phases (one PR, phased commits)
Each phase is independently mergeable but compounds with the prior. Suggested branch: feature/prefix-reuse-and-multimodal.
Phase 1 — Stable-first RAG ordering (TS only, no dependencies)
- `RAGComposer.assemble` returns sections explicitly tagged `INVARIANT` / `SEMI_STABLE` / `VOLATILE`
- Final concatenation always orders the three regions identically and sorts deterministically within each
- Every RAG source declares its tier
- `ChatRAGBuilder` emits stable-byte-prefix prompts
- Acceptance: SHA-256 of `prompt[:invariant_len]` is identical across consecutive turns of the same persona (checked in the sketch below)
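A minimal TypeScript sketch of the tiering and the acceptance check; the `Section` shape and sort key are assumptions about what `RAGComposer` would own, not its real types:

```ts
import { createHash } from "node:crypto";

type Tier = "INVARIANT" | "SEMI_STABLE" | "VOLATILE";
interface Section { tier: Tier; sourceId: string; text: string }

const TIER_ORDER: Record<Tier, number> = { INVARIANT: 0, SEMI_STABLE: 1, VOLATILE: 2 };

// Order by tier, then deterministically by sourceId within each tier,
// so the invariant byte prefix never shifts between turns.
function assemble(sections: Section[]): { prompt: string; invariantLen: number } {
  const sorted = [...sections].sort(
    (a, b) => TIER_ORDER[a.tier] - TIER_ORDER[b.tier] || a.sourceId.localeCompare(b.sourceId),
  );
  const prompt = sorted.map((s) => s.text).join("\n");
  const invariantLen = sorted
    .filter((s) => s.tier === "INVARIANT")
    .map((s) => s.text)
    .join("\n").length;
  return { prompt, invariantLen };
}

// Phase 1 acceptance: this digest must match across consecutive turns of the same persona.
const prefixSha256 = (prompt: string, invariantLen: number) =>
  createHash("sha256").update(prompt.slice(0, invariantLen)).digest("hex");
```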
Phase 2 — Per-persona DMR slot pinning (TS + small Rust)
- `AIProviderRustClient.generateText` accepts `slot_hint: u32` derived from `persona_id`
- DMR adapter passes `slot_id` in the OpenAI request
- Acceptance: persona Maya's requests consistently land on the same llama-server slot across turns; DMR logs show prompt processing for ≤200 tokens after first-turn warm-up (slot-derivation sketch below)
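A sketch of the slot derivation, assuming a fixed slot count matching llama-server's `--parallel` setting; the exact request field is an integration detail to confirm (llama.cpp's native completion API names it `id_slot`):

```ts
// Derive a stable slot hint from persona_id with FNV-1a, so the same persona
// always lands on the same llama-server slot and its KV cache stays warm.
function slotHint(personaId: string, slotCount: number): number {
  let h = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < personaId.length; i++) {
    h ^= personaId.charCodeAt(i);
    h = Math.imul(h, 0x01000193); // FNV prime
  }
  return (h >>> 0) % slotCount;
}

// Example: attached to the request body the DMR adapter builds (field name illustrative).
const body = { model: "qwen3.5", messages: [], id_slot: slotHint("maya", 14) };
```

Hashing collides once active personas outnumber slots; a small persona-to-slot table with LRU eviction is the obvious next step if that bites.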
Phase 3 — RAGComposition cache (TS only, depends on Phase 1)
- Memoize `RAGComposer.assemble` keyed by `(persona_id, room_id, recipe_id, history_tail_msg_ids)` (memo sketch below)
- TTL 5 min, invalidated by event subscriptions on the keyed inputs
- Acceptance: cache hit rate >80% on consecutive turns of the same conversation
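A sketch of the memo layer, assuming assembly is pure given these four inputs; event-subscription invalidation is elided and all names here are ours:

```ts
interface CacheEntry { value: string; expiresAt: number }

const TTL_MS = 5 * 60 * 1000; // 5 min, per the phase description
const cache = new Map<string, CacheEntry>();

function cacheKey(personaId: string, roomId: string, recipeId: string, historyTailMsgIds: string[]): string {
  return JSON.stringify([personaId, roomId, recipeId, historyTailMsgIds]);
}

function assembleCached(key: string, compute: () => string): string {
  const hit = cache.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.value; // cache hit: skip recomposition
  const value = compute();
  cache.set(key, { value, expiresAt: Date.now() + TTL_MS });
  return value;
}
```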
Phase 4 — Multimodal content parts (depends on #917 ModelMetadata)
- `LLMAdapter` request adds `audio_chunks: AudioInput[]` and `image_inputs: ImageInput[]`
- DMR adapter forwards them as OpenAI multimodal content parts
- `MediaArtifactSource` checks `ModelMetadata.capabilities.supports_audio` / `supports_vision`: if true → attach raw; else → STT/vision-description bridge (fallback path)
- `voice/start` pipeline rewires to send audio chunks directly; no Whisper invocation for Qwen3.5 personas (content-part sketch below)
- Acceptance: voice turn for a Qwen3.5 persona logs zero Whisper invocations and zero Kokoro invocations
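A sketch of the capability gate, assuming the `ModelMetadata` capability flags from #917 and OpenAI-style content parts (`input_audio`, `image_url`); whether DMR's endpoint accepts these parts is exactly what Phase 4 has to verify:

```ts
// Hypothetical hook into the existing Whisper fallback path.
declare function transcribeWithWhisper(audioB64: string): string;

interface Capabilities { supports_audio: boolean; supports_vision: boolean }

type ContentPart =
  | { type: "text"; text: string }
  | { type: "input_audio"; input_audio: { data: string; format: "wav" } }
  | { type: "image_url"; image_url: { url: string } };

// If the model takes the media natively, attach it raw; otherwise bridge
// through the legacy STT path so non-multimodal personas keep working.
function attachMedia(caps: Capabilities, text: string, audioB64?: string, imageUrl?: string): ContentPart[] {
  const parts: ContentPart[] = [{ type: "text", text }];
  if (audioB64) {
    parts.push(
      caps.supports_audio
        ? { type: "input_audio", input_audio: { data: audioB64, format: "wav" } }
        : { type: "text", text: transcribeWithWhisper(audioB64) }, // fallback bridge
    );
  }
  if (imageUrl && caps.supports_vision) {
    parts.push({ type: "image_url", image_url: { url: imageUrl } });
  }
  return parts;
}
```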
Phase 5 — Voice LoRA layer (depends on Phase 4 + existing genome paging)
- Persona entity gains `voiceAdapterId: AdapterId`
- Genome registry treats voice LoRAs as an adapter category alongside skill LoRAs
- LoRA pages in before the voice turn's first audio chunk (paging sketch below)
- Acceptance: persona's audio output is recognizably distinct from another persona's, and voice survives across sessions
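A sketch of the paging hook, assuming the existing genome registry exposes an `ensureLoaded`-style call; all names here are hypothetical:

```ts
interface Persona { id: string; voiceAdapterId?: string }
interface GenomeRegistry { ensureLoaded(adapterId: string): Promise<void> }

async function onVoiceTurnStart(persona: Persona, genome: GenomeRegistry): Promise<void> {
  if (persona.voiceAdapterId) {
    // Must resolve before the turn's first audio chunk is generated,
    // or the persona speaks with the base model's voice for that turn.
    await genome.ensureLoaded(persona.voiceAdapterId);
  }
}
```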
Phase 6 — Voice LoRA marketplace (follow-up PR, not blocking)
- HuggingFace publishing with `continuum:voice-lora` tag
- Browse / preview / pull commands in CLI
- Attribution + license preserved
Acceptance for the whole PR
A persona named Maya, voice LoRA loaded, on M5, in a LiveKit room with 6 personas active, processing a voice turn:
- `prompt processing progress` for ≤200 tokens, not 14k
- `gpu/stats` shows resident memory <8 GB across 6 active personas (vs 20+ GB / swap state today)
Why this is the PR
- Everything else we shipped today (Candle eager-load fix, RAG budget cap, embedding throttle) was triage. This is the architecture that turns triage into a system.
- The MacBook Air case (8 GB RAM, no GPU toggles) becomes plausible because we stop multiplying KV-cache by full-model-context per slot.
- The differentiation from Claude Code / OpenClaw / Aider — voice + face + memory + parallel personas — becomes real, not just claimed.
Dependencies
- #917 (`ModelMetadata` refactor) — must land before Phase 4. Phases 1-3 are independent.