Multimodal-native worker + prefix-reuse — collapse voice turn from 15s to 3s on a single laptop #918

@joelteply

Description

Stop throwing computation away. Most of the prompt is invariant across requests, and Qwen3.5 takes audio and images directly, eliminating STT/TTS entirely. Together these collapse end-to-end voice latency from ~15 seconds to ~2-3 seconds per turn while running 14 personas in parallel on a single M-series Mac.

Full design: docs/architecture/MULTIMODAL-WORKER-AND-PREFIX-REUSE.md

The thesis

We are not GPU-bound. We are waste-bound. Three composable wins:

  1. Stable-first RAG ordering → llama-server prefix KV cache reuse → ~70× prompt-eval speedup (today: 14k tokens reprocessed per turn → target: ~200, the volatile suffix only)
  2. Multimodal content parts → delete STT/TTS sandwich for Qwen3.5 → 1 model invocation per voice turn instead of 3 (no Whisper, no Kokoro, no ORT-deadlock chain)
  3. Voice LoRA per persona → identity, not signal → the "Maya replied" experience that differentiates from Claude Code / OpenClaw / Aider
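The ~70× figure in win 1 is just the ratio of prompt-eval token counts (assuming roughly constant per-token prompt-eval cost):

```typescript
// Prompt-eval work per turn, in tokens (figures from this issue).
const tokensToday = 14_000; // full prompt reprocessed every turn
const tokensTarget = 200;   // volatile suffix only; prefix served from KV cache

const speedup = tokensToday / tokensTarget;
console.log(`~${speedup}x prompt-eval speedup`); // ~70x
```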

Phases (one PR, phased commits)

Each phase is independently mergeable but compounds with the prior. Suggested branch: feature/prefix-reuse-and-multimodal.

Phase 1 — Stable-first RAG ordering (TS only, no dependencies)

  • RAGComposer.assemble returns sections explicitly tagged INVARIANT / SEMI_STABLE / VOLATILE
  • Final concatenation always orders the three regions identically and sorts sections deterministically within each region
  • Every RAG source declares its tier
  • ChatRAGBuilder emits stable-byte-prefix prompts
  • Acceptance: SHA-256 of prompt[:invariant_len] is identical across consecutive turns of the same persona
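A minimal sketch of the idea, with illustrative types (the real RAGComposer interfaces may differ): tag each section with a tier, always emit the three regions in the same order, sort deterministically within each, and verify the acceptance criterion by hashing the invariant prefix.

```typescript
import { createHash } from "node:crypto";

// Illustrative types; actual RAGComposer shapes may differ.
type Tier = "INVARIANT" | "SEMI_STABLE" | "VOLATILE";
interface RAGSection { tier: Tier; sourceId: string; text: string; }

const TIER_ORDER: Tier[] = ["INVARIANT", "SEMI_STABLE", "VOLATILE"];

// Same inputs in any order yield byte-identical output: regions in a fixed
// order, sections sorted deterministically (by sourceId) within each region.
function assemble(sections: RAGSection[]): { prompt: string; invariantLen: number } {
  const region = (t: Tier) =>
    sections
      .filter(s => s.tier === t)
      .sort((a, b) => a.sourceId.localeCompare(b.sourceId))
      .map(s => s.text)
      .join("\n");
  const parts = TIER_ORDER.map(region);
  return { prompt: parts.join("\n"), invariantLen: parts[0].length };
}

// Acceptance check: SHA-256 of prompt[:invariant_len] is stable across turns.
function prefixHash(prompt: string, invariantLen: number): string {
  return createHash("sha256").update(prompt.slice(0, invariantLen)).digest("hex");
}
```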

Phase 2 — Per-persona DMR slot pinning (TS + small Rust)

  • AIProviderRustClient.generateText accepts slot_hint: u32 derived from persona_id
  • DMR adapter passes slot_id in the OpenAI request
  • Acceptance: persona Maya's requests consistently land on the same llama-server slot across turns; DMR logs show prompt processing for ≤200 tokens after first turn warm-up
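One way to derive a stable slot_hint is a deterministic hash of persona_id modulo the slot count; FNV-1a keeps the mapping stable across process restarts. This is an illustrative sketch only, and the exact request field llama-server expects for slot selection depends on the DMR wiring.

```typescript
// Map a persona_id to a stable llama-server slot index.
// FNV-1a (32-bit) is deterministic across runs and platforms.
function slotHint(personaId: string, slotCount: number): number {
  let h = 0x811c9dc5;
  for (const ch of personaId) {
    h ^= ch.codePointAt(0)!;
    h = Math.imul(h, 0x01000193) >>> 0; // FNV prime, kept unsigned
  }
  return h % slotCount;
}
```

The same persona always hashes to the same slot, so its KV-cache prefix stays warm there; distinct personas spread across slots (with possible collisions when personas outnumber slots).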

Phase 3 — RAGComposition cache (TS only, depends on Phase 1)

  • Memoize RAGComposer.assemble keyed by (persona_id, room_id, recipe_id, history_tail_msg_ids)
  • TTL 5 min, invalidated by event subscriptions on the keyed inputs
  • Acceptance: cache hit rate >80% on consecutive turns of the same conversation
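A sketch of the memo layer, with illustrative names: a TTL map keyed on the inputs listed above. The real cache would additionally subscribe to invalidation events on those inputs, which a plain TTL map does not show.

```typescript
// Illustrative TTL cache; real implementation would also wire event-driven
// invalidation on the keyed inputs.
interface CacheEntry<V> { value: V; expiresAt: number; }

class TTLCache<V> {
  private map = new Map<string, CacheEntry<V>>();
  constructor(private ttlMs: number) {}

  get(key: string): V | undefined {
    const e = this.map.get(key);
    if (!e) return undefined;
    if (Date.now() > e.expiresAt) { this.map.delete(key); return undefined; }
    return e.value;
  }

  set(key: string, value: V): void {
    this.map.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }

  invalidate(key: string): void { this.map.delete(key); }
}

// Cache key mirrors the inputs named in this phase.
const compositionKey = (personaId: string, roomId: string, recipeId: string, tailIds: string[]) =>
  [personaId, roomId, recipeId, tailIds.join(",")].join("|");

const compositionCache = new TTLCache<string>(5 * 60 * 1000); // TTL 5 min
```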

Phase 4 — Multimodal content parts (depends on #917 ModelMetadata)

  • LLMAdapter request adds audio_chunks: AudioInput[] and image_inputs: ImageInput[]
  • DMR adapter forwards as OpenAI multimodal content parts
  • MediaArtifactSource checks ModelMetadata.capabilities.supports_audio / supports_vision: if true → attach raw, else → STT/vision-description bridge (fallback path)
  • voice/start pipeline rewires to send audio chunks directly, no Whisper invocation for Qwen3.5 personas
  • Acceptance: voice turn for a Qwen3.5 persona logs zero Whisper invocations and zero Kokoro invocations
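The capability gate in MediaArtifactSource reduces to a pure routing decision; a sketch with illustrative types (the actual ModelMetadata shape comes from #917 and may differ):

```typescript
// Illustrative capability flags; actual ModelMetadata shape is defined in #917.
interface ModelCapabilities { supports_audio: boolean; supports_vision: boolean; }

type AudioRoute = "raw-multimodal" | "stt-bridge";
type ImageRoute = "raw-multimodal" | "vision-description-bridge";

// If the model takes the modality natively, attach raw media as content
// parts; otherwise fall back to the text bridge.
function routeAudio(caps: ModelCapabilities): AudioRoute {
  return caps.supports_audio ? "raw-multimodal" : "stt-bridge";
}

function routeImage(caps: ModelCapabilities): ImageRoute {
  return caps.supports_vision ? "raw-multimodal" : "vision-description-bridge";
}
```

For a Qwen3.5 persona both flags are true, so every voice turn takes the raw path and neither Whisper nor Kokoro is invoked, which is exactly what the acceptance log check verifies.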

Phase 5 — Voice LoRA layer (depends on Phase 4 + existing genome paging)

  • Persona entity gains voiceAdapterId: AdapterId
  • Genome registry treats voice LoRAs as an adapter category alongside skill LoRAs
  • LoRA pages in before the voice turn's first audio chunk
  • Acceptance: persona's audio output is recognizably distinct from another persona's, and voice survives across sessions
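The page-in requirement can be sketched as a guard that runs before the turn's first audio chunk. All names here are hypothetical stand-ins for the existing genome-paging API:

```typescript
// Hypothetical sketch; GenomeRegistry names are illustrative, not the real API.
type AdapterId = string;
interface Persona { id: string; voiceAdapterId?: AdapterId; }

interface GenomeRegistry {
  isResident(id: AdapterId): boolean;
  pageIn(id: AdapterId): Promise<void>;
}

// Ensure the persona's voice LoRA is resident before the first audio chunk.
async function prepareVoiceTurn(persona: Persona, registry: GenomeRegistry): Promise<void> {
  const id = persona.voiceAdapterId;
  if (id && !registry.isResident(id)) {
    await registry.pageIn(id); // blocks the turn until the adapter is loaded
  }
}
```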

Phase 6 — Voice LoRA marketplace (follow-up PR, not blocking)

  • HuggingFace publishing with continuum:voice-lora tag
  • Browse / preview / pull commands in CLI
  • Attribution + license preserved

Acceptance for the whole PR

A persona named Maya, voice LoRA loaded, on M5, in a LiveKit room with 6 personas active, processing a voice turn:

  • Prompt sent to DMR has byte-identical prefix to her last turn
  • DMR slot logs show prompt processing progress for ≤200 tokens, not 14k
  • No Whisper invocation logged for this turn
  • No Kokoro invocation logged for this turn
  • Audio output published to LiveKit within 3s of audio input arrival
  • Audio output is recognizably Maya's voice (LoRA loaded, perceptible character)
  • gpu/stats shows resident memory <8 GB across 6 active personas (vs 20+ GB / swap state today)

Why this is the PR

  • Everything else we shipped today (Candle eager-load fix, RAG budget cap, embedding throttle) was triage. This is the architecture that turns triage into a system.
  • The MacBook Air case (8 GB RAM, no GPU toggles) becomes plausible because we stop multiplying KV-cache by full-model-context per slot.
  • The differentiation from Claude Code / OpenClaw / Aider — voice + face + memory + parallel personas — becomes real, not just claimed.

Dependencies

  • #917 (ModelMetadata capability flags) blocks Phase 4
  • Existing genome/LoRA paging infrastructure, required by Phase 5
