feat(rag): Phase 1 — stable-first ordering for prefix-reuse (#918) #920
Adds a PromptTier enum (INVARIANT / SEMI_STABLE / VOLATILE) and makes every RAGSource declare its tier. RAGComposer sorts collected sections deterministically by (tier, sourceName) before returning.

Why: today the composer's parallel section assembly produces a different byte order on every chat call. llama-server / DMR's prefix-KV-cache reuse never fires, so each turn reprocesses the full 14k-token prompt from scratch (~35s prompt eval at 400 tok/s). With deterministic ordering AND stable bytes within each tier, the unchanging INVARIANT prefix gets reused — only the VOLATILE suffix needs evaluation. Expected: ~70× faster prompt eval per turn for repeat-context turns.

Architecture (per docs/architecture/MULTIMODAL-WORKER-AND-PREFIX-REUSE.md):

- INVARIANT: persona identity, tool definitions, recipe rules, docs (PersonaIdentity, ToolDefinitions, CodeTool, Documentation, ToolMethodology, ProjectContext)
- SEMI_STABLE: history, memories, participants, governance — append-only (ConversationHistory, LiveRoomAwareness, Governance, OpenProposals, SentinelAwareness, GlobalAwareness, SocialMediaRAG, SemanticMemory)
- VOLATILE: latest message, audio chunks, current activity, UI state (ActivityContext, CodebaseSearch, MediaArtifact, VoiceConversation, WidgetContext)

Implementation note: tier is a class-level declaration on each RAGSource (required field, no Option<>). Sources return Omit<RAGSection, 'tier'> from load() and fromBatchResult(); RAGComposer injects the source's declared tier when wrapping the section. Single-source-of-truth classification per source — no per-return-statement repetition.

Phases 2 (slot pinning) and 3 (composition cache) build on this. Phase 4 (multimodal content parts) depends on #917 ModelMetadata.

tsc clean. Branch: feature/prefix-reuse-and-multimodal off main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
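In code terms, the ordering contract described above looks roughly like the sketch below. It assumes the names from the commit message (`PromptTier`, `tier` and `sourceName` on `RAGSection`); the real `RAGTypes.ts` / `RAGComposer.ts` may differ in detail.

```typescript
// Sketch only — illustrative shapes, not the actual project files.
export enum PromptTier {
  INVARIANT = 0,   // persona identity, tool definitions, recipe rules, docs
  SEMI_STABLE = 1, // history, memories, participants, governance (append-only)
  VOLATILE = 2,    // latest message, audio chunks, current activity, UI state
}

interface RAGSection {
  sourceName: string;
  tier: PromptTier;
  content: string;
}

// Deterministic (tier, sourceName) ordering: ties within a tier break on the
// source name, so the same inputs always serialize to the same byte order,
// regardless of which parallel load() finished first.
function sortSections(sections: RAGSection[]): RAGSection[] {
  return [...sections].sort(
    (a, b) => a.tier - b.tier || a.sourceName.localeCompare(b.sourceName)
  );
}
```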
…boot CodebaseIndexer ran 64-chunk batches back-to-back with NO yield between batches. Each batch ~1.5s + ~80MB RSS growth. With 5000+ chunks in src/, that's 78+ batches × 1.5s = 2+ minutes of total event-loop saturation immediately after every boot. Local personas couldn't respond, voice couldn't connect, and anything that needed the bus was blocked until indexing finished.

Two changes:

- Batch size 64→16 (smaller per-batch RSS hit, ~4× more chances for other IO to interleave between IPC roundtrips)
- 50ms pause between batches via setTimeout (yields the event loop so chat/voice/personas can process while indexing runs)

The throughput cost is small (16 vs 64 chunks per IPC) and the inter-batch pause is invisible at human timescales. The chat-arrival latency win is huge — the system is responsive within seconds of boot instead of minutes.

The deeper fix is querying GpuPressureWatcher / ResourcePressureWatcher before each batch and backing off when pressure is high — the same principle Joel called out for InferenceCoordinator slot capacity. That's a follow-up; this is the floor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
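A rough sketch of the throttle, with hypothetical names (`indexChunks`, `embedBatch`) — only the 16-chunk batch size and the 50ms `setTimeout` pause come from the commit itself:

```typescript
// Hypothetical shape of the throttled indexing loop; embedBatch stands in for
// the real per-batch embedding IPC call.
const BATCH_SIZE = 16;            // was 64 — smaller per-batch RSS hit
const INTER_BATCH_PAUSE_MS = 50;  // yield so chat/voice/personas can run between batches

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function indexChunks(
  chunks: string[],
  embedBatch: (batch: string[]) => Promise<void>
): Promise<void> {
  for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
    await embedBatch(chunks.slice(i, i + BATCH_SIZE));
    if (i + BATCH_SIZE < chunks.length) {
      // Give the event loop a breather between IPC roundtrips.
      await sleep(INTER_BATCH_PAUSE_MS);
    }
  }
}
```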
Next-PR scope (after this lands) — speedup + everything we learned tonight

For tracking: the plan we built tonight + what surfaced during M5 verification. Most of these compound on Phase 1.

From the original plan (#918)
Newly surfaced tonight (not in original plan)
Embedding pipeline (separate from prompt path but same principle)
Upstream / out-of-scope but blocking real speed
Sequencing
Pull request overview
Implements Phase 1 of issue #918 by introducing a tiered (stable-first) ordering for RAG sections, enabling deterministic prompt-prefix bytes as a prerequisite for KV-cache prefix reuse in downstream inference servers. Also includes a codebase-index embedding throttle adjustment to prevent post-startup event-loop starvation.
Changes:

- Add `PromptTier` and require `tier` on `RAGSource`; propagate tier onto `RAGSection` via composer injection (sources return `Omit<RAGSection, 'tier'>`) — see the sketch after this list.
- Update all RAG sources to declare a tier and conform to the new load/fromBatchResult return shape.
- Make `RAGComposer` deterministically sort sections by `(tier, sourceName)` and inject tier for both TS and batched Rust paths.
- Reduce embedding batch size and add an inter-batch pause in `CodebaseIndexer` to improve runtime responsiveness.
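A minimal sketch of the contract the first and third bullets describe (names and shapes are illustrative, not the actual files): each source declares its tier once, returns sections without it, and the composer injects and sorts.

```typescript
// Illustrative only — assumes PromptTier and RAGSection as sketched earlier in this PR.
interface RAGSource {
  readonly sourceName: string;
  readonly tier: PromptTier; // required, no Option<>
  load(ctx: unknown): Promise<Omit<RAGSection, 'tier'>[]>;
}

async function composeSections(sources: RAGSource[], ctx: unknown): Promise<RAGSection[]> {
  const perSource = await Promise.all(
    sources.map(async (src) =>
      // The composer is the single authority that stamps the declared tier onto each section.
      (await src.load(ctx)).map((section) => ({ ...section, tier: src.tier }))
    )
  );
  // Deterministic (tier, sourceName) order, regardless of which load() finished first.
  return perSource
    .flat()
    .sort((a, b) => a.tier - b.tier || a.sourceName.localeCompare(b.sourceName));
}
```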
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/system/rag/shared/RAGTypes.ts | Adds PromptTier and documentation for tier semantics / stable ordering contract |
| src/system/rag/shared/RAGSource.ts | Requires tier on RAGSource and RAGSection; updates load/fromBatchResult to return Omit<..., 'tier'> |
| src/system/rag/shared/RAGComposer.ts | Injects tier from source declarations; sorts sections deterministically by (tier, sourceName) |
| src/system/rag/services/CodebaseIndexer.ts | Lowers embedding batch size and yields between batches to avoid event-loop starvation |
| src/system/rag/sources/ActivityContextSource.ts | Declares tier; updates load() return type to omit tier |
| src/system/rag/sources/CodeToolSource.ts | Declares tier; updates load() / helpers to omit tier |
| src/system/rag/sources/CodebaseSearchSource.ts | Declares tier; updates load() return type to omit tier |
| src/system/rag/sources/ConversationHistorySource.ts | Declares tier; updates load() / helpers to omit tier |
| src/system/rag/sources/DocumentationSource.ts | Declares tier; updates load() return type to omit tier |
| src/system/rag/sources/GlobalAwarenessSource.ts | Declares tier; updates load() / fromBatchResult() / helpers to omit tier |
| src/system/rag/sources/GovernanceSource.ts | Declares tier; updates load() return type to omit tier |
| src/system/rag/sources/LiveRoomAwarenessSource.ts | Declares tier; updates load() / helpers to omit tier |
| src/system/rag/sources/MediaArtifactSource.ts | Declares tier; updates load() return type to omit tier |
| src/system/rag/sources/OpenProposalsSource.ts | Declares tier; updates EMPTY_SECTION and load() return type to omit tier |
| src/system/rag/sources/PersonaIdentitySource.ts | Declares tier; updates load() / helpers to omit tier |
| src/system/rag/sources/ProjectContextSource.ts | Declares tier; updates caches / inflight typing and load() return type to omit tier |
| src/system/rag/sources/SemanticMemorySource.ts | Declares tier; updates load() / fromBatchResult() / helpers to omit tier |
| src/system/rag/sources/SentinelAwarenessSource.ts | Declares tier; updates load() return type to omit tier |
| src/system/rag/sources/SocialMediaRAGSource.ts | Declares tier; updates load() / helpers to omit tier |
| src/system/rag/sources/ToolDefinitionsSource.ts | Declares tier; updates load() / formatting helpers / emptySection to omit tier |
| src/system/rag/sources/ToolMethodologySource.ts | Declares tier; updates load() return type to omit tier |
| src/system/rag/sources/VoiceConversationSource.ts | Declares tier; updates load() / helpers to omit tier |
| src/system/rag/sources/WidgetContextSource.ts | Declares tier; updates load() / helpers to omit tier |
```typescript
// Re-export so source files only need one import
export { PromptTier } from './RAGTypes';
```
PromptTier is declared as a const enum (erased at emit), but this file re-exports it as a runtime export (export { PromptTier } from './RAGTypes'). With module: ES2020, this can produce a runtime ESM error because ./RAGTypes will not actually export PromptTier. Fix by either (a) making PromptTier a non-const enum (or a const PromptTier = {...} as const object) so it exists at runtime, or (b) removing this re-export and importing PromptTier directly from RAGTypes in the sources.
Suggested change:

```diff
-// Re-export so source files only need one import
-export { PromptTier } from './RAGTypes';
+// Keep PromptTier imported for use within this file; do not re-export it here
+// because `const enum` members are erased at emit and are not safe runtime ESM exports.
```
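If option (a) is preferred instead, one runtime-safe shape (an assumption, not what the PR currently does) is a plain const object plus a derived type, which survives emit and can be re-exported as a normal ESM binding:

```typescript
// A const object exists at runtime, so `export { PromptTier } from './RAGTypes'`
// becomes a valid ESM re-export; the companion type keeps the same ergonomics as an enum.
export const PromptTier = {
  INVARIANT: 0,
  SEMI_STABLE: 1,
  VOLATILE: 2,
} as const;

export type PromptTier = (typeof PromptTier)[keyof typeof PromptTier];
```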
Task ratings — for picking what to attack next

Per Joel: rate effectiveness × ease so we can pick by preference. Scale 1-5 each (5 = max). "Pick rating" = my subjective synthesis (★ to ★★★★★).

Tonight's PR-able / mostly-orthogonal work
Depend on #917 ModelMetadata landing first
Long-term / external
My preference (anvil)

If I'm picking by my own taste: Phase 1.5 (completes work I already shipped, finishes the prefix-reuse story end-to-end), then MLX backend install + verify (quick, big, removes the DeltaNet pain without us patching kernels), then cold-start prewarming (fixes a first-impression pain that I felt acutely tonight watching the silence). Memento — what's your pick?
…ability (#918)

Phase 1 (already shipped in PR #920) sorted RAGComposer's section list by (tier, sourceName). This commit makes ChatRAGBuilder respect that order when assembling the final prompt string, so the byte-prefix actually IS stable end-to-end.

Three reorderings in section 2.4 of buildContext():

1. Tool definitions injection moved from end to start (after identity). Tool defs are INVARIANT — they belong in the byte-stable prefix region, not after VOLATILE content.
2. The generic source loop already iterates the Map in insertion order, which equals tier-sorted order from extractFromComposition (which inserts in result.sections order, which Phase 1 sorted). So the loop now produces INVARIANT → SEMI_STABLE → VOLATILE content automatically — no per-section sorting needed.
3. HumanPresenceTracker injection moved from before-the-loop to after-the-loop. Presence is volatile (changes when users switch rooms) and must live in the suffix, never in the byte-stable prefix.

Final assembly order: identity (INVARIANT, from PersonaIdentitySource) → tool definitions (INVARIANT) → loop in tier order (remaining INVARIANT → SEMI_STABLE → VOLATILE) → human presence (VOLATILE) → conversation history (already separate, lives in the messages array).

Net effect for prefix-reuse: with the same persona+recipe, the INVARIANT region of the prompt is byte-identical across thousands of turns. llama-server / DMR's prefix-KV-cache match fires on the INVARIANT prefix; only the VOLATILE suffix gets reprocessed. Combined with future per-persona slot pinning (Phase 2), this is the ~70× prompt-eval speedup the design doc promised.

tsc clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
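A compressed sketch of that final assembly order — the function and field names here are hypothetical, not ChatRAGBuilder's actual internals:

```typescript
// Prefix-stable assembly: INVARIANT material first, VOLATILE material last.
function assemblePromptString(parts: {
  identity: string;              // INVARIANT — PersonaIdentitySource
  toolDefinitions: string;       // INVARIANT — moved from the end into the prefix region
  tierOrderedSections: string[]; // generic loop output: INVARIANT → SEMI_STABLE → VOLATILE
  humanPresence: string;         // VOLATILE — moved after the loop, lives in the suffix
}): string {
  // Conversation history is not part of this string; it stays in the messages array.
  return [
    parts.identity,
    parts.toolDefinitions,
    ...parts.tierOrderedSections,
    parts.humanPresence,
  ].join('\n\n');
}
```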
Implements Phase 1 of #918 — stable-first RAG ordering. Foundation for llama-server / DMR prefix KV-cache reuse, expected ~70× prompt-eval speedup once the consumer side (Phase 1.5, follow-up commit) lands.
Full design context: `docs/architecture/MULTIMODAL-WORKER-AND-PREFIX-REUSE.md` (already on this branch via memento's PR #914 base).

What changes

- `PromptTier` enum in `RAGTypes.ts`: `INVARIANT` / `SEMI_STABLE` / `VOLATILE`.
- `RAGSource` interface adds required `tier: PromptTier` (no `Option<>`, per Joel's required-not-optional discipline).
- `RAGSection` carries a `tier` so the composer can sort by it.
- `RAGComposer` injects each source's declared `tier` into its returned section, then sorts the section list by `(tier, sourceName)` deterministically before returning.
- `load()` and `fromBatchResult()` return `Omit<RAGSection, 'tier'>` — the composer is the single authority that injects from the class declaration. Sources never re-state their tier per-return.
- Includes the CodebaseIndexer embedding-throttle commit (203fb6534) so this branch boots cleanly without the post-startup event-loop saturation that was blocking persona responses.

Why
Today RAGComposer assembles sections in the order the parallel `load()` calls complete — non-deterministic per request. llama-server / DMR / vllm all support prefix KV-cache reuse: identical leading tokens skip token-by-token re-evaluation. With non-deterministic byte order, the prefix never matches across turns, so the full 14k-token prompt gets reprocessed every turn (~35s prompt eval). With deterministic `(tier, sourceName)` ordering, the INVARIANT prefix is byte-identical across thousands of turns for the same persona+recipe, and only the VOLATILE suffix needs evaluation.

This PR alone enforces stability at the section-list level. Phase 1.5 (a small follow-up commit on this branch) makes `ChatRAGBuilder.assemblePrompt` consume sections in the sorted order and emit a stable byte prefix end-to-end. Phases 2 (slot pinning), 3 (composition cache), 4 (multimodal content parts after #917), and 5 (voice LoRA) build on this.

Verification
`bash scripts/verify-issue-918-phase1.sh` — 8/8 static checks pass on M5:

- `tsc` — zero errors
- `tier` is required on the RAGSource interface (no Option)
- `TIER_ORDER` + `sections.sort` + `localeCompare`
- `load()` signatures return `Omit<RAGSection, 'tier'>`

The runtime determinism end-to-end test is gate-blocked by #919 — personas go silent after the first response wave (RateLimiter / cognition-gate / slot-accounting interaction; pre-existing on main, not introduced by this PR). Once #919 is fixed, the deterministic-ordering test fires identical probes, hashes the prompt prefix, and asserts identical bytes across turns. The script in this PR (`scripts/verify-issue-918-phase1.sh`) covers everything that doesn't depend on the silence bug.
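Once #919 unblocks the runtime check, the determinism assertion could look roughly like this sketch — all names are hypothetical, with `buildPromptForProbe` standing in for whatever re-runs the full composition for a fixed probe:

```typescript
import { createHash } from 'node:crypto';

// Hash only the leading bytes of the prompt: that is the region prefix-KV-cache reuse cares about.
function prefixHash(prompt: string, prefixChars: number): string {
  return createHash('sha256').update(prompt.slice(0, prefixChars)).digest('hex');
}

async function assertStablePrefix(
  buildPromptForProbe: () => Promise<string>,
  prefixChars = 4096
): Promise<void> {
  const first = prefixHash(await buildPromptForProbe(), prefixChars);
  const second = prefixHash(await buildPromptForProbe(), prefixChars);
  if (first !== second) {
    throw new Error('INVARIANT prompt prefix changed between identical probes');
  }
}
```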
Cross-test discipline

Per the cross-test pattern from today's session (caught two real bugs already): @memento will check out this branch and build before any merge to main. PR #914 lands first as the voice transport foundation.
Architectural alignment
The `tier` field has no `?` and no `Option<>`. New sources that don't declare a tier fail to compile.

🤖 Generated with Claude Code