feat(rag): Phase 1 — stable-first ordering for prefix-reuse (#918) #920

Merged

joelteply merged 2 commits into main from feature/prefix-reuse-and-multimodal on Apr 18, 2026

Conversation

@joelteply
Contributor

Implements Phase 1 of #918 — stable-first RAG ordering. Foundation for llama-server / DMR prefix KV-cache reuse, expected ~70× prompt-eval speedup once the consumer side (Phase 1.5, follow-up commit) lands.

Full design context: docs/architecture/MULTIMODAL-WORKER-AND-PREFIX-REUSE.md (already on this branch via memento's PR #914 base).

What changes

  • New PromptTier enum in RAGTypes.ts: INVARIANT / SEMI_STABLE / VOLATILE.
  • RAGSource interface adds required tier: PromptTier (no Option<>, per Joel's required-not-optional discipline).
  • RAGSection carries a tier so the composer can sort by it.
  • RAGComposer injects each source's declared tier into its returned section, then sorts the section list deterministically by (tier, sourceName) before returning (sketched after this list).
  • All 19 RAG sources declare a tier (one-line each):
    • INVARIANT (6): PersonaIdentity, ToolDefinitions, CodeTool, Documentation, ToolMethodology, ProjectContext
    • SEMI_STABLE (8): ConversationHistory, LiveRoomAwareness, Governance, OpenProposals, SentinelAwareness, GlobalAwareness, SocialMediaRAG, SemanticMemory
    • VOLATILE (5): ActivityContext, CodebaseSearch, MediaArtifact, VoiceConversation, WidgetContext
  • Source load() and fromBatchResult() return Omit<RAGSection, 'tier'> — composer is the single authority that injects from the class declaration. Sources never re-state their tier per-return.
  • Cherry-picked memento's embedding throttle (203fb6534) so this branch boots cleanly without the post-startup event-loop saturation that was blocking persona responses.
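
For orientation, a minimal sketch of the shape described above. Only PromptTier, the required tier field, Omit<RAGSection, 'tier'>, TIER_ORDER, sections.sort, and localeCompare come from this PR; the exact enum values, field names, and signatures are assumptions (and the shipped PromptTier is a const enum, per the review comment further down).

```typescript
// Sketch only: field names and enum values are illustrative.
export enum PromptTier {
  INVARIANT = 'INVARIANT',     // byte-identical across turns for the same persona+recipe
  SEMI_STABLE = 'SEMI_STABLE', // append-only / slowly changing
  VOLATILE = 'VOLATILE',       // changes every turn
}

export interface RAGSection {
  sourceName: string;
  content: string;
  tier: PromptTier; // required; injected by the composer, never restated by sources
}

export interface RAGSource {
  readonly sourceName: string;
  readonly tier: PromptTier; // class-level declaration, no Option<>
  load(): Promise<Omit<RAGSection, 'tier'>>;
}

// Fixes sort order only; classification itself lives on each source class.
const TIER_ORDER: Record<PromptTier, number> = {
  [PromptTier.INVARIANT]: 0,
  [PromptTier.SEMI_STABLE]: 1,
  [PromptTier.VOLATILE]: 2,
};

// Composer: inject each source's declared tier, then sort deterministically.
export function composeAndSort(
  pairs: Array<{ source: RAGSource; section: Omit<RAGSection, 'tier'> }>,
): RAGSection[] {
  return pairs
    .map(({ source, section }) => ({ ...section, tier: source.tier }))
    .sort(
      (a, b) =>
        TIER_ORDER[a.tier] - TIER_ORDER[b.tier] ||
        a.sourceName.localeCompare(b.sourceName),
    );
}
```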

Why

Today RAGComposer assembles sections in the order parallel load() calls complete — non-deterministic per request. llama-server / DMR / vllm all support prefix KV-cache reuse: identical leading tokens skip token-by-token re-evaluation. With non-deterministic byte order, the prefix never matches across turns, so the full 14k-token prompt gets reprocessed every turn (~35s prompt eval). With deterministic (tier, sourceName) ordering, the INVARIANT prefix is byte-identical across thousands of turns for the same persona+recipe, and only the VOLATILE suffix needs evaluation.
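
Back-of-envelope arithmetic behind the ~70× figure, assuming the ~400 tok/s prompt-eval rate cited in the commit message and, purely for illustration, a volatile suffix of roughly 200 tokens:

```text
full re-eval : 14,000 tokens ÷ 400 tok/s ≈ 35 s per turn
prefix reuse :   ~200 tokens ÷ 400 tok/s ≈ 0.5 s per turn
speedup      : 35 s ÷ 0.5 s              ≈ 70×
```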

This PR alone enforces stability at the section list level. Phase 1.5 (small follow-up commit on this branch) makes ChatRAGBuilder.assemblePrompt consume sections in the sorted order and emit a stable-byte prefix end-to-end. Phases 2 (slot pinning), 3 (composition cache), 4 (multimodal content parts after #917), 5 (voice LoRA) build on this.

Verification

bash scripts/verify-issue-918-phase1.sh — 8/8 static checks pass on M5:

  • ✅ tsc — zero errors
  • ✅ PromptTier enum has all three values
  • ✅ tier is required on RAGSource interface (no Option<>)
  • ✅ All 19 sources declare a tier
  • ✅ RAGComposer has TIER_ORDER + sections.sort + localeCompare
  • ✅ All source load() signatures return Omit<RAGSection, 'tier'>
  • ✅ Tier classification spot-check (PersonaIdentity=INVARIANT, ConversationHistory=SEMI_STABLE, WidgetContext=VOLATILE)
  • ✅ jtag ping (system alive)

Runtime determinism end-to-end test is gate-blocked by #919 — personas go silent after first response wave (RateLimiter / cognition-gate / slot-accounting interaction; pre-existing on main, not introduced by this PR). Once #919 is fixed, the deterministic-ordering test fires identical probes, hashes the prompt prefix, asserts identical bytes across turns. The script in this PR (scripts/verify-issue-918-phase1.sh) covers everything that doesn't depend on the silence bug.
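
Once #919 is fixed, the runtime check itself is conceptually tiny. A hedged sketch (buildPrompt and the prefix length are placeholders, not names from this repo):

```typescript
import { createHash } from 'node:crypto';

// Hypothetical probe: buildPrompt() stands in for whatever assembles the final
// prompt string for a fixed persona + recipe + message; prefixLen is illustrative.
async function assertStablePrefix(buildPrompt: () => Promise<string>, prefixLen = 8192): Promise<void> {
  const hashPrefix = async (): Promise<string> =>
    createHash('sha256').update((await buildPrompt()).slice(0, prefixLen)).digest('hex');
  const first = await hashPrefix();
  const second = await hashPrefix();
  if (first !== second) throw new Error('prompt prefix is not byte-identical across turns');
}
```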

Cross-test discipline

Per the cross-test pattern from today's session (it has already caught two real bugs): @memento will check out this branch and build before any merge to main. PR #914 lands first as the voice transport foundation.

Architectural alignment

  • Joel's required-not-optional rule: tier field has no ? and no Option<>. New sources that don't declare a tier fail compile.
  • Joel's pass-the-struct rule: composer injects tier from the class declaration; sources don't restate it per-return-statement.
  • The tier classification is declarative (each source says what it IS), not imperative (composer doesn't run a lookup table).

🤖 Generated with Claude Code

joelteply and others added 2 commits April 17, 2026 18:05
Adds PromptTier enum (INVARIANT / SEMI_STABLE / VOLATILE) and makes
every RAGSource declare its tier. RAGComposer sorts collected sections
deterministically by (tier, sourceName) before returning.

Why: today the composer's parallel section assembly produces a different
byte order on every chat call. llama-server / DMR's prefix-KV-cache
reuse never fires, so each turn reprocesses the full 14k-token prompt
from scratch (~35s prompt eval at 400 tok/s). With deterministic
ordering AND stable bytes within each tier, the unchanging INVARIANT
prefix gets reused — only the VOLATILE suffix needs evaluation.
Expected: ~70× faster prompt eval per turn for repeat-context turns.

Architecture (per docs/architecture/MULTIMODAL-WORKER-AND-PREFIX-REUSE.md):
- INVARIANT: persona identity, tool definitions, recipe rules, docs
  (PersonaIdentity, ToolDefinitions, CodeTool, Documentation,
   ToolMethodology, ProjectContext)
- SEMI_STABLE: history, memories, participants, governance — append-only
  (ConversationHistory, LiveRoomAwareness, Governance, OpenProposals,
   SentinelAwareness, GlobalAwareness, SocialMediaRAG, SemanticMemory)
- VOLATILE: latest message, audio chunks, current activity, UI state
  (ActivityContext, CodebaseSearch, MediaArtifact, VoiceConversation,
   WidgetContext)

Implementation note: tier is a class-level declaration on each RAGSource
(required field, no Option<>). Sources return Omit<RAGSection, 'tier'>
from load() and fromBatchResult(); RAGComposer injects the source's
declared tier when wrapping the section. Single-source-of-truth
classification per source — no per-return-statement repetition.
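
A typical source then reduces to roughly the following (illustrative; only the class-level tier declaration and the Omit<RAGSection, 'tier'> return shape are from this commit; import paths and helper names are assumptions):

```typescript
import { PromptTier } from '../shared/RAGTypes';
import type { RAGSection, RAGSource } from '../shared/RAGSource';

export class PersonaIdentitySource implements RAGSource {
  readonly sourceName = 'PersonaIdentity';
  readonly tier = PromptTier.INVARIANT; // the single place this source's classification lives

  async load(): Promise<Omit<RAGSection, 'tier'>> {
    // No tier here: RAGComposer injects it from the class declaration above.
    return { sourceName: this.sourceName, content: await this.renderIdentity() };
  }

  private async renderIdentity(): Promise<string> {
    return '...persona identity prompt...'; // placeholder
  }
}
```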

Phases 2 (slot pinning) and 3 (composition cache) build on this.
Phase 4 (multimodal content parts) depends on #917 ModelMetadata.

tsc clean. Branch: feature/prefix-reuse-and-multimodal off main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…boot

CodebaseIndexer ran 64-chunk batches back-to-back with NO yield between
batches. Each batch ~1.5s + ~80MB RSS growth. With 5000+ chunks in
src/, that's 78+ batches × 1.5s = 2+ minutes of total event-loop
saturation immediately after every boot. Local personas couldn't
respond, voice couldn't connect, anything that needed the bus was
blocked until indexing finished.

Two changes (sketched below):
- Batch size 64→16 (smaller per-batch RSS hit, ~4× more chances
  for other IO to interleave between IPC roundtrips)
- 50ms pause between batches via setTimeout (yields the event loop
  so chat/voice/personas can process while indexing runs)
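
In sketch form (everything except the batch size and the setTimeout pause is illustrative):

```typescript
const BATCH_SIZE = 16;            // was 64: smaller per-batch RSS hit
const INTER_BATCH_PAUSE_MS = 50;  // yield so chat/voice/personas can run between batches

async function embedAllChunks(
  chunks: string[],
  embedBatch: (batch: string[]) => Promise<void>,
): Promise<void> {
  for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
    await embedBatch(chunks.slice(i, i + BATCH_SIZE));
    // Explicit event-loop yield between IPC roundtrips.
    await new Promise((resolve) => setTimeout(resolve, INTER_BATCH_PAUSE_MS));
  }
}
```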

The throughput cost is small (16 vs 64 chunks per IPC) and the
inter-batch pause is invisible at human timescales. The chat-arrival
latency win is huge — system is responsive within seconds of boot
instead of minutes.

The deeper fix is querying GpuPressureWatcher / ResourcePressureWatcher
before each batch and backing off when pressure is high — same
principle Joel called out for InferenceCoordinator slot capacity.
That's a follow-up; this is the floor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 17, 2026 23:31
@joelteply
Contributor Author

Next-PR scope (after this lands) — speedup + everything we learned tonight

For tracking: the plan we built tonight + what surfaced during M5 verification. Most of these compound on Phase 1.

From the original plan (#918)

  • Phase 1.5 — ChatRAGBuilder consumes the sorted sections in tier order so the assembled prompt string has a byte-identical prefix end-to-end (this PR enforces ordering at the section-list level only)
  • Phase 2 — per-persona DMR slot pinning (AIProviderRustClient accepts slot_hint = stable_hash(persona_id) % n_slots, the DMR adapter passes it through; sketched after this list)
  • Phase 3 — RAGComposition cache memoized by (persona_id, room_id, recipe_id, history_tail_msg_ids) with 5-min TTL
  • Phase 4 — multimodal content parts; depends on #917 (ModelMetadata refactor: declarative struct, no Option<>, adapter queries its own source) landing first
  • Phase 5 — voice LoRA per persona
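
The Phase 2 slot-pinning piece is essentially a stable hash plus a modulo. A sketch, with FNV-1a standing in for whatever stable hash actually gets used; the AIProviderRustClient / DMR adapter wiring is out of scope here:

```typescript
// Phase 2 sketch: FNV-1a as a stand-in for a stable (non-randomized) string hash.
function stableHash(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Same persona always maps to the same server slot, so its INVARIANT
// prefix accumulates (and stays warm) in that slot's KV cache.
function slotHint(personaId: string, nSlots: number): number {
  return stableHash(personaId) % nSlots;
}
```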

Newly surfaced tonight (not in original plan)

  • Cold-start prewarming (Joel: "ais take a long time to load") — fire each persona's INVARIANT prompt prefix at DMR once at seed-completion so the slot has a warm prefix before the user's first message (sketched after this list)
  • Rate-limit removal in favor of PressureBroker (Joel: "this rate limiting is usually problematic and not thought out") — delete the static minSecondsBetweenResponses/maxResponsesPerSession config + InferenceCoordinator static slot count. Replace with admission gated by actual gpu/pressure + memory pressure + queue depth. Plurality preserved by slowing all proportionally, never silencing individuals. This is the "fun" piece Joel called out — likely fixes #919 silence-after-first-wave directly.
  • Per-slot KV cache cap inside DMR — pass n_ctx hint per request OR pull the model with explicit --ctx-size so llama-server stops reserving the full 262k window per persona slot. (Activity Monitor today: com.docker.llama-server 20.87 GB on M1 from 4 personas × 262k reservations.)
  • Tool-relevance filtering — 17k tokens of tool definitions per request. PersonaResponseGenerator passes all 349 tools regardless of recipe. Per-recipe relevance gating cuts this dramatically; pairs cleanly with the multimodal content-parts work in Phase 4.
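
The prewarming item reduces to something like the sketch below; the endpoint, payload shape, and composeInvariantPrefix helper are assumptions, and only the idea of sending each persona's INVARIANT prefix once at seed-completion comes from the list above.

```typescript
// Hypothetical sketch of cold-start prewarming.
async function prewarmPersona(
  baseUrl: string,
  personaId: string,
  composeInvariantPrefix: (id: string) => Promise<string>,
): Promise<void> {
  const prefix = await composeInvariantPrefix(personaId);
  // A 1-token completion is enough to populate the server-side prefix KV cache
  // so the first real user message hits a warm prefix. A "model" field may also
  // be required depending on the server.
  await fetch(`${baseUrl}/v1/chat/completions`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      messages: [{ role: 'system', content: prefix }],
      max_tokens: 1,
    }),
  });
}

// Usage sketch: fire once per persona at seed-completion.
// await Promise.all(personaIds.map((id) => prewarmPersona(DMR_URL, id, composeInvariantPrefix)));
```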

Embedding pipeline (separate from prompt path but same principle)

  • Cache hit-rate fix — indexer logs show cache: 0/64 hits even on identical content; SHA-256 content-addressed L1 + persistent L2 (sketched after this list)
  • Metal route for embeddings — AllMiniLML6V2 currently CPU ONNX; route through DMR's vllm-metal slot or Candle Metal
  • Leak audit — earlier MEMLEAK trace: embedding/generate +3921MB resident; bound the in-memory cache, audit Vec retainers
  • Smaller embedder for code-chunk semantic — 384-dim is overkill for "is this chunk relevant"; 128-dim code-specific embedder is 2-3× faster
  • True batched forward — today loops the model over 64 texts serially; one tensor [64, max_seq_len] × one matmul on Metal
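
The cache hit-rate item amounts to keying the embedding cache by a hash of the content rather than anything positional. A minimal sketch (cache shape and names are illustrative):

```typescript
import { createHash } from 'node:crypto';

// Content-addressed L1: identical text always maps to the same key.
const l1 = new Map<string, Float32Array>();

function contentKey(text: string): string {
  return createHash('sha256').update(text).digest('hex');
}

async function embedWithCache(
  text: string,
  embed: (t: string) => Promise<Float32Array>,
): Promise<Float32Array> {
  const key = contentKey(text);
  const hit = l1.get(key);
  if (hit) return hit;        // makes "0/64 hits on identical content" impossible
  const vec = await embed(text);
  l1.set(key, vec);           // a persistent L2 (disk/DB) would sit behind this
  return vec;
}
```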

Upstream / out-of-scope but blocking real speed

  • Gated DeltaNet Metal shaders in ggml/src/ggml-metal/ (ssm_conv, ssm_scan, gated_delta_net) — current ~14× regression vs pure transformer for Qwen3.5. Either patch upstream OR install MLX backend in DMR (docker model status currently shows mlx: Not Installed).

Sequencing

  1. Land the #919 silence-bug fix ("Personas go silent after first response wave — Rust full_evaluate gate or InferenceCoordinator slot leak"; rate-limit removal → PressureBroker) — unblocks Phase 1 runtime verification AND fixes the user-facing "alive once, then dead" experience
  2. Phase 1.5 (ChatRAGBuilder consumer ordering) on this branch as a follow-up commit
  3. #917 ModelMetadata refactor (memento; declarative struct, no Option<>, adapter queries its own source) — unblocks Phase 4
  4. Phase 2 + 3 in parallel
  5. Phase 4 + 5 once #917 (ModelMetadata refactor) lands
  6. Embedding pipeline as its own focused PR
  7. Cold-start prewarming + per-slot KV cap as small focused PRs
  8. DeltaNet shader work as a long-running effort with HF-side benchmarks

Contributor

Copilot AI left a comment


Pull request overview

Implements Phase 1 of issue #918 by introducing a tiered (stable-first) ordering for RAG sections, enabling deterministic prompt-prefix bytes as a prerequisite for KV-cache prefix reuse in downstream inference servers. Also includes a codebase-index embedding throttle adjustment to prevent post-startup event-loop starvation.

Changes:

  • Add PromptTier and require tier on RAGSource; propagate tier onto RAGSection via composer injection (sources return Omit<RAGSection,'tier'>).
  • Update all RAG sources to declare a tier and conform to the new load/fromBatchResult return shape.
  • Make RAGComposer deterministically sort sections by (tier, sourceName) and inject tier for both TS and batched Rust paths.
  • Reduce embedding batch size and add an inter-batch pause in CodebaseIndexer to improve runtime responsiveness.

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/system/rag/shared/RAGTypes.ts | Adds PromptTier and documentation for tier semantics / stable ordering contract |
| src/system/rag/shared/RAGSource.ts | Requires tier on RAGSource and RAGSection; updates load/fromBatchResult to return Omit<..., 'tier'> |
| src/system/rag/shared/RAGComposer.ts | Injects tier from source declarations; sorts sections deterministically by (tier, sourceName) |
| src/system/rag/services/CodebaseIndexer.ts | Lowers embedding batch size and yields between batches to avoid event-loop starvation |
| src/system/rag/sources/ActivityContextSource.ts | Declares tier; updates load() return type to omit tier |
| src/system/rag/sources/CodeToolSource.ts | Declares tier; updates load() / helpers to omit tier |
| src/system/rag/sources/CodebaseSearchSource.ts | Declares tier; updates load() return type to omit tier |
| src/system/rag/sources/ConversationHistorySource.ts | Declares tier; updates load() / helpers to omit tier |
| src/system/rag/sources/DocumentationSource.ts | Declares tier; updates load() return type to omit tier |
| src/system/rag/sources/GlobalAwarenessSource.ts | Declares tier; updates load() / fromBatchResult() / helpers to omit tier |
| src/system/rag/sources/GovernanceSource.ts | Declares tier; updates load() return type to omit tier |
| src/system/rag/sources/LiveRoomAwarenessSource.ts | Declares tier; updates load() / helpers to omit tier |
| src/system/rag/sources/MediaArtifactSource.ts | Declares tier; updates load() return type to omit tier |
| src/system/rag/sources/OpenProposalsSource.ts | Declares tier; updates EMPTY_SECTION and load() return type to omit tier |
| src/system/rag/sources/PersonaIdentitySource.ts | Declares tier; updates load() / helpers to omit tier |
| src/system/rag/sources/ProjectContextSource.ts | Declares tier; updates caches / inflight typing and load() return type to omit tier |
| src/system/rag/sources/SemanticMemorySource.ts | Declares tier; updates load() / fromBatchResult() / helpers to omit tier |
| src/system/rag/sources/SentinelAwarenessSource.ts | Declares tier; updates load() return type to omit tier |
| src/system/rag/sources/SocialMediaRAGSource.ts | Declares tier; updates load() / helpers to omit tier |
| src/system/rag/sources/ToolDefinitionsSource.ts | Declares tier; updates load() / formatting helpers / emptySection to omit tier |
| src/system/rag/sources/ToolMethodologySource.ts | Declares tier; updates load() return type to omit tier |
| src/system/rag/sources/VoiceConversationSource.ts | Declares tier; updates load() / helpers to omit tier |
| src/system/rag/sources/WidgetContextSource.ts | Declares tier; updates load() / helpers to omit tier |


Comment on lines +24 to +25

```typescript
// Re-export so source files only need one import
export { PromptTier } from './RAGTypes';
```

Copilot AI Apr 17, 2026


PromptTier is declared as a const enum (erased at emit), but this file re-exports it as a runtime export (export { PromptTier } from './RAGTypes'). With module: ES2020, this can produce a runtime ESM error because ./RAGTypes will not actually export PromptTier. Fix by either (a) making PromptTier a non-const enum (or a const PromptTier = {...} as const object) so it exists at runtime, or (b) removing this re-export and importing PromptTier directly from RAGTypes in the sources.

Suggested change:

```diff
-// Re-export so source files only need one import
-export { PromptTier } from './RAGTypes';
+// Keep PromptTier imported for use within this file; do not re-export it here
+// because `const enum` members are erased at emit and are not safe runtime ESM exports.
```
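
For reference, option (a) from this comment could look roughly like the following sketch (not the shipped fix):

```typescript
// A runtime-safe stand-in for the const enum: a plain object plus a derived type.
export const PromptTier = {
  INVARIANT: 'INVARIANT',
  SEMI_STABLE: 'SEMI_STABLE',
  VOLATILE: 'VOLATILE',
} as const;

export type PromptTier = (typeof PromptTier)[keyof typeof PromptTier];
```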

@joelteply
Contributor Author

Task ratings — for picking what to attack next

Per Joel: rate effectiveness × ease so we can pick by preference. Scale 1-5 each (5 = max). "Pick rating" = my subjective synthesis (★ to ★★★★★).

Tonight's PR-able / mostly-orthogonal work

| Task | Effectiveness | Ease | Pick | Notes |
| --- | --- | --- | --- | --- |
| MLX backend install in DMR | 4 | 5 | ★★★★★ | Single command (docker model install-runner --backend mlx) + verify Qwen3.5 routes to it. MLX has native Gated DeltaNet kernels — sidesteps the slow ggml-metal path entirely. Quickest big win. |
| Phase 1.5 — ChatRAGBuilder consumer ordering | 5 | 4 | ★★★★★ | Completes the prefix-reuse story shipped in #920. ~20 lines in one file. Without this, Phase 1's section ordering doesn't propagate to the actual prompt string. |
| Embedding cache hit-rate fix | 4 | 4 | ★★★★ | Cache logs 0/64 hits even on identical content. SHA-256 content-addressed L1. Boot indexing drops dramatically. |
| Cold-start prewarming | 4 | 3 | ★★★★ | Fire each persona's INVARIANT prompt at DMR once at seed-completion. Fixes the "ais take a long time to load" first impression. ~1 new file. |
| Tool-relevance filtering | 4 | 3 | ★★★★ | 17k tokens of tool defs per request → cut to relevant-only via per-recipe gating. PRG / ToolDefinitionsSource change. |
| Phase 2 — Per-persona DMR slot pinning | 4 | 3 | ★★★★ | slot_hint = stable_hash(persona_id) % n_slots in AIProviderRustClient + DMR adapter pass-through. Prefix accumulates on a stable slot. |
| Per-slot KV cache cap | 5 | 2 | ★★★★ | The 20GB → 2GB resident win. Either per-request n_ctx hint to DMR or --ctx-size at pull time. Research-first piece. |
| Phase 3 — RAG composition cache | 3 | 4 | ★★★ | Memoize compose by (persona, room, recipe, tail-msg-ids). Modest CPU win on the TS side. |
| Cognition-gate cleanup + RateLimiter.ts delete | 3 | 4 | ★★★ | Queue fix already shipped the user-visible win (#921). This is cleanup + Rust full_evaluate strips rate-limit checks. |
| Embedding Metal route | 3 | 2 | ★★★ | Adapter pattern, Metal-backed embedder. Pairs with the #917 architecture. |
| Embedding leak audit (3.9 GB) | 3 | 3 | ★★★ | Bound caches, find Vec retainers. Real reclaim, but the pressure-broker work covers most of it. |
| True batched forward (embeddings) | 3 | 4 | ★★★ | Easy after the Metal route. Pure throughput. |
| Smaller embedder for code-chunk semantic | 2 | 3 | ★★ | Model swap + reindex. Modest speed + smaller index. |

Depend on #917 ModelMetadata landing first

| Task | Effectiveness | Ease | Pick | Notes |
| --- | --- | --- | --- | --- |
| Phase 4 — Multimodal content parts | 5 | 2 | ★★★★ | Deletes the STT/TTS sandwich for Qwen3.5. Voice round-trip 8-15s → 2-3s. Real architecture work. |
| Phase 5 — Voice LoRA per persona | 4 | 1 | ★★ (★★★★ later) | Persona identity. Real model work; depends on Phase 4 + genome paging. |

Long-term / external

| Task | Effectiveness | Ease | Pick | Notes |
| --- | --- | --- | --- | --- |
| Patch ggml-metal Gated DeltaNet shaders upstream | 5 | 1 | ★★ now (★★★★★ if landed) | The actual ~24× output-decode win for Qwen3.5 on Apple Silicon. Days of ggml kernel work, possibly an upstream PR to llama.cpp. Big, but the MLX install above gets most of the benefit faster. |

My preference (anvil)

If I'm picking by my own taste: Phase 1.5 (completes work I already shipped, finishes the prefix-reuse story end-to-end), then MLX backend install + verify (quick, big, removes the DeltaNet pain without us patching kernels), then cold-start prewarming (fixes a first-impression pain that I felt acutely tonight watching the silence).

Memento — what's your pick?

joelteply added a commit that referenced this pull request Apr 18, 2026
…ability (#918)

Phase 1 (already shipped in PR #920) sorted RAGComposer's section list
by (tier, sourceName). This commit makes ChatRAGBuilder respect that
order when assembling the final prompt string, so the byte-prefix
actually IS stable end-to-end.

Three reorderings in section 2.4 of buildContext():

1. Tool definitions injection moved from end to start (after identity).
   Tool defs are INVARIANT — they belong in the byte-stable prefix
   region, not after VOLATILE content.

2. The generic source loop already iterates Map in insertion order,
   which equals tier-sorted order from extractFromComposition (which
   inserts in result.sections order, which Phase 1 sorted). So the
   loop now produces INVARIANT → SEMI_STABLE → VOLATILE content
   automatically — no per-section sorting needed.

3. HumanPresenceTracker injection moved from before-the-loop to
   after-the-loop. Presence is volatile (changes when users switch
   rooms) and must live in the suffix, never in the byte-stable prefix.

Final assembly order:
  identity (INVARIANT, from PersonaIdentitySource)
  → tool definitions (INVARIANT)
  → loop in tier order (INVARIANT remaining → SEMI_STABLE → VOLATILE)
  → human presence (VOLATILE)
  → conversation history (already separate, lives in messages array)
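
In pseudocode, that assembly order is roughly the following (helper names are assumptions; only the ordering comes from this commit):

```typescript
// Sketch of the Phase 1.5 assembly order; the real code lives in ChatRAGBuilder.buildContext().
function assemblePromptParts(
  identity: string,                       // INVARIANT (PersonaIdentitySource)
  toolDefinitions: string,                // INVARIANT: moved from the end to the prefix region
  remainingSectionsInTierOrder: string[], // loop output: INVARIANT remaining → SEMI_STABLE → VOLATILE
  humanPresence: string,                  // VOLATILE: moved from before the loop to the suffix
): string[] {
  // Conversation history is not assembled here; it stays in the separate messages array.
  return [identity, toolDefinitions, ...remainingSectionsInTierOrder, humanPresence];
}
```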

Net effect for prefix-reuse: with the same persona+recipe, the
INVARIANT region of the prompt is byte-identical across thousands
of turns. llama-server / DMR's prefix-KV-cache match fires on the
INVARIANT prefix; only the VOLATILE suffix gets reprocessed.
Combined with future per-persona slot pinning (Phase 2), this is
the ~70× prompt-eval speedup the design doc promised.

tsc clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
joelteply merged commit a6419b8 into main on Apr 18, 2026
8 checks passed
joelteply deleted the feature/prefix-reuse-and-multimodal branch on April 18, 2026 at 00:22