Multimodal-native worker + prefix-reuse — collapse voice turn from 15s to 3s on a single laptop #918

@joelteply

Description

Stop throwing computation away. Most of the prompt is invariant across requests, and Qwen3.5 takes audio and images directly, eliminating STT/TTS entirely. Together these collapse end-to-end voice latency from ~15 seconds to ~2-3 seconds per turn while running 14 personas in parallel on a single M-series Mac.

Full design: docs/architecture/MULTIMODAL-WORKER-AND-PREFIX-REUSE.md

The thesis

We are not GPU-bound. We are waste-bound. Three composable wins:

  1. Stable-first RAG ordering → llama-server prefix KV cache reuse → ~70× prompt-eval speedup (today: 14k tokens reprocessed per turn → target: ~200, the volatile suffix only)
  2. Multimodal content parts → delete STT/TTS sandwich for Qwen3.5 → 1 model invocation per voice turn instead of 3 (no Whisper, no Kokoro, no ORT-deadlock chain)
  3. Voice LoRA per persona → identity, not signal → the "Maya replied" experience that differentiates from Claude Code / OpenClaw / Aider
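The ~70× figure in win 1 is just the ratio of prompt-eval token counts (assuming roughly constant per-token prompt-eval cost):

```typescript
// Prompt-eval work per turn, in tokens (figures from this issue).
const tokensToday = 14_000; // full prompt reprocessed every turn
const tokensTarget = 200;   // volatile suffix only; prefix served from KV cache

const speedup = tokensToday / tokensTarget;
console.log(`~${speedup}x prompt-eval speedup`); // ~70x
```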

Phases (one PR, phased commits)

Each phase is independently mergeable but compounds with the prior. Suggested branch: feature/prefix-reuse-and-multimodal.

Phase 1 — Stable-first RAG ordering (TS only, no dependencies)

  • RAGComposer.assemble returns sections explicitly tagged INVARIANT / SEMI_STABLE / VOLATILE
  • Final concatenation always orders the three regions identically and sorts sections deterministically within each region
  • Every RAG source declares its tier
  • ChatRAGBuilder emits stable-byte-prefix prompts
  • Acceptance: SHA-256 of prompt[:invariant_len] is identical across consecutive turns of the same persona
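A minimal sketch of the idea, with illustrative types (the real RAGComposer interfaces may differ): tag each section with a tier, always emit the three regions in the same order, sort deterministically within each, and verify the acceptance criterion by hashing the invariant prefix.

```typescript
import { createHash } from "node:crypto";

// Illustrative types; actual RAGComposer shapes may differ.
type Tier = "INVARIANT" | "SEMI_STABLE" | "VOLATILE";
interface RAGSection { tier: Tier; sourceId: string; text: string; }

const TIER_ORDER: Tier[] = ["INVARIANT", "SEMI_STABLE", "VOLATILE"];

// Same inputs in any order yield byte-identical output: regions in a fixed
// order, sections sorted deterministically (by sourceId) within each region.
function assemble(sections: RAGSection[]): { prompt: string; invariantLen: number } {
  const region = (t: Tier) =>
    sections
      .filter(s => s.tier === t)
      .sort((a, b) => a.sourceId.localeCompare(b.sourceId))
      .map(s => s.text)
      .join("\n");
  const parts = TIER_ORDER.map(region);
  return { prompt: parts.join("\n"), invariantLen: parts[0].length };
}

// Acceptance check: SHA-256 of prompt[:invariant_len] is stable across turns.
function prefixHash(prompt: string, invariantLen: number): string {
  return createHash("sha256").update(prompt.slice(0, invariantLen)).digest("hex");
}
```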

Phase 2 — Per-persona DMR slot pinning (TS + small Rust)

  • AIProviderRustClient.generateText accepts slot_hint: u32 derived from persona_id
  • DMR adapter passes slot_id in the OpenAI request
  • Acceptance: persona Maya's requests consistently land on the same llama-server slot across turns; DMR logs show prompt processing for ≤200 tokens after first turn warm-up
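One way to derive a stable slot_hint is a deterministic hash of persona_id modulo the slot count; FNV-1a keeps the mapping stable across process restarts. This is an illustrative sketch only, and the exact request field llama-server expects for slot selection depends on the DMR wiring.

```typescript
// Map a persona_id to a stable llama-server slot index.
// FNV-1a (32-bit) is deterministic across runs and platforms.
function slotHint(personaId: string, slotCount: number): number {
  let h = 0x811c9dc5;
  for (const ch of personaId) {
    h ^= ch.codePointAt(0)!;
    h = Math.imul(h, 0x01000193) >>> 0; // FNV prime, kept unsigned
  }
  return h % slotCount;
}
```

The same persona always hashes to the same slot, so its KV-cache prefix stays warm there; distinct personas spread across slots (with possible collisions when personas outnumber slots).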

Phase 3 — RAGComposition cache (TS only, depends on Phase 1)

  • Memoize RAGComposer.assemble keyed by (persona_id, room_id, recipe_id, history_tail_msg_ids)
  • TTL 5 min, invalidated by event subscriptions on the keyed inputs
  • Acceptance: cache hit rate >80% on consecutive turns of the same conversation
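A sketch of the memo layer, with illustrative names: a TTL map keyed on the inputs listed above. The real cache would additionally subscribe to invalidation events on those inputs, which a plain TTL map does not show.

```typescript
// Illustrative TTL cache; real implementation would also wire event-driven
// invalidation on the keyed inputs.
interface CacheEntry<V> { value: V; expiresAt: number; }

class TTLCache<V> {
  private map = new Map<string, CacheEntry<V>>();
  constructor(private ttlMs: number) {}

  get(key: string): V | undefined {
    const e = this.map.get(key);
    if (!e) return undefined;
    if (Date.now() > e.expiresAt) { this.map.delete(key); return undefined; }
    return e.value;
  }

  set(key: string, value: V): void {
    this.map.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }

  invalidate(key: string): void { this.map.delete(key); }
}

// Cache key mirrors the inputs named in this phase.
const compositionKey = (personaId: string, roomId: string, recipeId: string, tailIds: string[]) =>
  [personaId, roomId, recipeId, tailIds.join(",")].join("|");

const compositionCache = new TTLCache<string>(5 * 60 * 1000); // TTL 5 min
```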

Phase 4 — Multimodal content parts (depends on #917 ModelMetadata)

  • LLMAdapter request adds audio_chunks: AudioInput[] and image_inputs: ImageInput[]
  • DMR adapter forwards as OpenAI multimodal content parts
  • MediaArtifactSource checks ModelMetadata.capabilities.supports_audio / supports_vision: if true → attach raw, else → STT/vision-description bridge (fallback path)
  • voice/start pipeline rewires to send audio chunks directly, no Whisper invocation for Qwen3.5 personas
  • Acceptance: voice turn for a Qwen3.5 persona logs zero Whisper invocations and zero Kokoro invocations
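The capability gate in MediaArtifactSource reduces to a pure routing decision; a sketch with illustrative types (the actual ModelMetadata shape comes from #917 and may differ):

```typescript
// Illustrative capability flags; actual ModelMetadata shape is defined in #917.
interface ModelCapabilities { supports_audio: boolean; supports_vision: boolean; }

type AudioRoute = "raw-multimodal" | "stt-bridge";
type ImageRoute = "raw-multimodal" | "vision-description-bridge";

// If the model takes the modality natively, attach raw media as content
// parts; otherwise fall back to the text bridge.
function routeAudio(caps: ModelCapabilities): AudioRoute {
  return caps.supports_audio ? "raw-multimodal" : "stt-bridge";
}

function routeImage(caps: ModelCapabilities): ImageRoute {
  return caps.supports_vision ? "raw-multimodal" : "vision-description-bridge";
}
```

For a Qwen3.5 persona both flags are true, so every voice turn takes the raw path and neither Whisper nor Kokoro is invoked, which is exactly what the acceptance log check verifies.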

Phase 5 — Voice LoRA layer (depends on Phase 4 + existing genome paging)

  • Persona entity gains voiceAdapterId: AdapterId
  • Genome registry treats voice LoRAs as an adapter category alongside skill LoRAs
  • LoRA pages in before the voice turn's first audio chunk
  • Acceptance: persona's audio output is recognizably distinct from another persona's, and voice survives across sessions
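The page-in requirement can be sketched as a guard that runs before the turn's first audio chunk. All names here are hypothetical stand-ins for the existing genome-paging API:

```typescript
// Hypothetical sketch; GenomeRegistry names are illustrative, not the real API.
type AdapterId = string;
interface Persona { id: string; voiceAdapterId?: AdapterId; }

interface GenomeRegistry {
  isResident(id: AdapterId): boolean;
  pageIn(id: AdapterId): Promise<void>;
}

// Ensure the persona's voice LoRA is resident before the first audio chunk.
async function prepareVoiceTurn(persona: Persona, registry: GenomeRegistry): Promise<void> {
  const id = persona.voiceAdapterId;
  if (id && !registry.isResident(id)) {
    await registry.pageIn(id); // blocks the turn until the adapter is loaded
  }
}
```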

Phase 6 — Voice LoRA marketplace (follow-up PR, not blocking)

  • HuggingFace publishing with continuum:voice-lora tag
  • Browse / preview / pull commands in CLI
  • Attribution + license preserved

Acceptance for the whole PR

A persona named Maya, voice LoRA loaded, on M5, in a LiveKit room with 6 personas active, processing a voice turn:

  • Prompt sent to DMR has byte-identical prefix to her last turn
  • DMR slot logs show prompt processing progress for ≤200 tokens, not 14k
  • No Whisper invocation logged for this turn
  • No Kokoro invocation logged for this turn
  • Audio output published to LiveKit within 3s of audio input arrival
  • Audio output is recognizably Maya's voice (LoRA loaded, perceptible character)
  • gpu/stats shows resident memory <8 GB across 6 active personas (vs 20+ GB / swap state today)

Why this is the PR

  • Everything else we shipped today (Candle eager-load fix, RAG budget cap, embedding throttle) was triage. This is the architecture that turns triage into a system.
  • The MacBook Air case (8 GB RAM, no GPU toggles) becomes plausible because we stop multiplying KV-cache by full-model-context per slot.
  • The differentiation from Claude Code / OpenClaw / Aider — voice + face + memory + parallel personas — becomes real, not just claimed.

Dependencies

  • #917 (ModelMetadata capability flags) blocks Phase 4
  • Existing genome/LoRA paging infrastructure, required by Phase 5
