feat(rag): Phase 1 — stable-first ordering for prefix-reuse (#918) #920

Merged

joelteply merged 2 commits into main from feature/prefix-reuse-and-multimodal on Apr 18, 2026

Conversation

@joelteply
Contributor

Implements Phase 1 of #918 — stable-first RAG ordering. Foundation for llama-server / DMR prefix KV-cache reuse, expected ~70× prompt-eval speedup once the consumer side (Phase 1.5, follow-up commit) lands.

Full design context: docs/architecture/MULTIMODAL-WORKER-AND-PREFIX-REUSE.md (already on this branch via memento's PR #914 base).

What changes

  • New PromptTier enum in RAGTypes.ts: INVARIANT / SEMI_STABLE / VOLATILE.
  • RAGSource interface adds required tier: PromptTier (no Option<>, per Joel's required-not-optional discipline).
  • RAGSection carries a tier so the composer can sort by it.
  • RAGComposer injects each source's declared tier into its returned section, then sorts the section list deterministically by (tier, sourceName) before returning (sketched after this list).
  • All 19 RAG sources declare a tier (one-line each):
    • INVARIANT (6): PersonaIdentity, ToolDefinitions, CodeTool, Documentation, ToolMethodology, ProjectContext
    • SEMI_STABLE (8): ConversationHistory, LiveRoomAwareness, Governance, OpenProposals, SentinelAwareness, GlobalAwareness, SocialMediaRAG, SemanticMemory
    • VOLATILE (5): ActivityContext, CodebaseSearch, MediaArtifact, VoiceConversation, WidgetContext
  • Source load() and fromBatchResult() return Omit<RAGSection, 'tier'> — composer is the single authority that injects from the class declaration. Sources never re-state their tier per-return.
  • Cherry-picked memento's embedding throttle (203fb6534) so this branch boots cleanly without the post-startup event-loop saturation that was blocking persona responses.
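
For orientation, a minimal sketch of the shape described above. Only PromptTier, the required tier field, Omit<RAGSection, 'tier'>, TIER_ORDER, sections.sort, and localeCompare come from this PR; the exact enum values, field names, and signatures are assumptions (and the shipped PromptTier is a const enum, per the review comment further down).

```typescript
// Sketch only: field names and enum values are illustrative.
export enum PromptTier {
  INVARIANT = 'INVARIANT',     // byte-identical across turns for the same persona+recipe
  SEMI_STABLE = 'SEMI_STABLE', // append-only / slowly changing
  VOLATILE = 'VOLATILE',       // changes every turn
}

export interface RAGSection {
  sourceName: string;
  content: string;
  tier: PromptTier; // required; injected by the composer, never restated by sources
}

export interface RAGSource {
  readonly sourceName: string;
  readonly tier: PromptTier; // class-level declaration, no Option<>
  load(): Promise<Omit<RAGSection, 'tier'>>;
}

// Fixes sort order only; classification itself lives on each source class.
const TIER_ORDER: Record<PromptTier, number> = {
  [PromptTier.INVARIANT]: 0,
  [PromptTier.SEMI_STABLE]: 1,
  [PromptTier.VOLATILE]: 2,
};

// Composer: inject each source's declared tier, then sort deterministically.
export function composeAndSort(
  pairs: Array<{ source: RAGSource; section: Omit<RAGSection, 'tier'> }>,
): RAGSection[] {
  return pairs
    .map(({ source, section }) => ({ ...section, tier: source.tier }))
    .sort(
      (a, b) =>
        TIER_ORDER[a.tier] - TIER_ORDER[b.tier] ||
        a.sourceName.localeCompare(b.sourceName),
    );
}
```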

Why

Today RAGComposer assembles sections in the order parallel load() calls complete — non-deterministic per request. llama-server / DMR / vllm all support prefix KV-cache reuse: identical leading tokens skip token-by-token re-evaluation. With non-deterministic byte order, the prefix never matches across turns, so the full 14k-token prompt gets reprocessed every turn (~35s prompt eval). With deterministic (tier, sourceName) ordering, the INVARIANT prefix is byte-identical across thousands of turns for the same persona+recipe, and only the VOLATILE suffix needs evaluation.
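
Back-of-envelope arithmetic behind the ~70× figure, assuming the ~400 tok/s prompt-eval rate cited in the commit message and, purely for illustration, a volatile suffix of roughly 200 tokens:

```text
full re-eval : 14,000 tokens ÷ 400 tok/s ≈ 35 s per turn
prefix reuse :   ~200 tokens ÷ 400 tok/s ≈ 0.5 s per turn
speedup      : 35 s ÷ 0.5 s              ≈ 70×
```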

This PR alone enforces stability at the section list level. Phase 1.5 (small follow-up commit on this branch) makes ChatRAGBuilder.assemblePrompt consume sections in the sorted order and emit a stable-byte prefix end-to-end. Phases 2 (slot pinning), 3 (composition cache), 4 (multimodal content parts after #917), 5 (voice LoRA) build on this.

Verification

bash scripts/verify-issue-918-phase1.sh — 8/8 static checks pass on M5:

  • ✅ tsc — zero errors
  • ✅ PromptTier enum has all three values
  • ✅ tier is required on RAGSource interface (no Option<>)
  • ✅ All 19 sources declare a tier
  • ✅ RAGComposer has TIER_ORDER + sections.sort + localeCompare
  • ✅ All source load() signatures return Omit<RAGSection, 'tier'>
  • ✅ Tier classification spot-check (PersonaIdentity=INVARIANT, ConversationHistory=SEMI_STABLE, WidgetContext=VOLATILE)
  • ✅ jtag ping (system alive)

Runtime determinism end-to-end test is gate-blocked by #919 — personas go silent after first response wave (RateLimiter / cognition-gate / slot-accounting interaction; pre-existing on main, not introduced by this PR). Once #919 is fixed, the deterministic-ordering test fires identical probes, hashes the prompt prefix, asserts identical bytes across turns. The script in this PR (scripts/verify-issue-918-phase1.sh) covers everything that doesn't depend on the silence bug.
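
Once #919 is fixed, the runtime check itself is conceptually tiny. A hedged sketch (buildPrompt and the prefix length are placeholders, not names from this repo):

```typescript
import { createHash } from 'node:crypto';

// Hypothetical probe: buildPrompt() stands in for whatever assembles the final
// prompt string for a fixed persona + recipe + message; prefixLen is illustrative.
async function assertStablePrefix(buildPrompt: () => Promise<string>, prefixLen = 8192): Promise<void> {
  const hashPrefix = async (): Promise<string> =>
    createHash('sha256').update((await buildPrompt()).slice(0, prefixLen)).digest('hex');
  const first = await hashPrefix();
  const second = await hashPrefix();
  if (first !== second) throw new Error('prompt prefix is not byte-identical across turns');
}
```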

Cross-test discipline

Per the cross-test pattern from today's session (it has already caught two real bugs): @memento will check out this branch and build before any merge to main. PR #914 lands first as the voice transport foundation.

Architectural alignment

  • Joel's required-not-optional rule: tier field has no ? and no Option<>. New sources that don't declare a tier fail compile.
  • Joel's pass-the-struct rule: composer injects tier from the class declaration; sources don't restate it per-return-statement.
  • The tier classification is declarative (each source says what it IS), not imperative (composer doesn't run a lookup table).

🤖 Generated with Claude Code

joelteply and others added 2 commits April 17, 2026 18:05
Adds PromptTier enum (INVARIANT / SEMI_STABLE / VOLATILE) and makes
every RAGSource declare its tier. RAGComposer sorts collected sections
deterministically by (tier, sourceName) before returning.

Why: today the composer's parallel section assembly produces a different
byte order on every chat call. llama-server / DMR's prefix-KV-cache
reuse never fires, so each turn reprocesses the full 14k-token prompt
from scratch (~35s prompt eval at 400 tok/s). With deterministic
ordering AND stable bytes within each tier, the unchanging INVARIANT
prefix gets reused — only the VOLATILE suffix needs evaluation.
Expected: ~70× faster prompt eval per turn for repeat-context turns.

Architecture (per docs/architecture/MULTIMODAL-WORKER-AND-PREFIX-REUSE.md):
- INVARIANT: persona identity, tool definitions, recipe rules, docs
  (PersonaIdentity, ToolDefinitions, CodeTool, Documentation,
   ToolMethodology, ProjectContext)
- SEMI_STABLE: history, memories, participants, governance — append-only
  (ConversationHistory, LiveRoomAwareness, Governance, OpenProposals,
   SentinelAwareness, GlobalAwareness, SocialMediaRAG, SemanticMemory)
- VOLATILE: latest message, audio chunks, current activity, UI state
  (ActivityContext, CodebaseSearch, MediaArtifact, VoiceConversation,
   WidgetContext)

Implementation note: tier is a class-level declaration on each RAGSource
(required field, no Option<>). Sources return Omit<RAGSection, 'tier'>
from load() and fromBatchResult(); RAGComposer injects the source's
declared tier when wrapping the section. Single-source-of-truth
classification per source — no per-return-statement repetition.
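
A typical source then reduces to roughly the following (illustrative; only the class-level tier declaration and the Omit<RAGSection, 'tier'> return shape are from this commit; import paths and helper names are assumptions):

```typescript
import { PromptTier } from '../shared/RAGTypes';
import type { RAGSection, RAGSource } from '../shared/RAGSource';

export class PersonaIdentitySource implements RAGSource {
  readonly sourceName = 'PersonaIdentity';
  readonly tier = PromptTier.INVARIANT; // the single place this source's classification lives

  async load(): Promise<Omit<RAGSection, 'tier'>> {
    // No tier here: RAGComposer injects it from the class declaration above.
    return { sourceName: this.sourceName, content: await this.renderIdentity() };
  }

  private async renderIdentity(): Promise<string> {
    return '...persona identity prompt...'; // placeholder
  }
}
```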

Phases 2 (slot pinning) and 3 (composition cache) build on this.
Phase 4 (multimodal content parts) depends on #917 ModelMetadata.

tsc clean. Branch: feature/prefix-reuse-and-multimodal off main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…boot

CodebaseIndexer ran 64-chunk batches back-to-back with NO yield between
batches. Each batch ~1.5s + ~80MB RSS growth. With 5000+ chunks in
src/, that's 78+ batches × 1.5s = 2+ minutes of total event-loop
saturation immediately after every boot. Local personas couldn't
respond, voice couldn't connect, anything that needed the bus was
blocked until indexing finished.

Two changes (sketched below):
- Batch size 64→16 (smaller per-batch RSS hit, ~4× more chances
  for other IO to interleave between IPC roundtrips)
- 50ms pause between batches via setTimeout (yields the event loop
  so chat/voice/personas can process while indexing runs)
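
In sketch form (everything except the batch size and the setTimeout pause is illustrative):

```typescript
const BATCH_SIZE = 16;            // was 64: smaller per-batch RSS hit
const INTER_BATCH_PAUSE_MS = 50;  // yield so chat/voice/personas can run between batches

async function embedAllChunks(
  chunks: string[],
  embedBatch: (batch: string[]) => Promise<void>,
): Promise<void> {
  for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
    await embedBatch(chunks.slice(i, i + BATCH_SIZE));
    // Explicit event-loop yield between IPC roundtrips.
    await new Promise((resolve) => setTimeout(resolve, INTER_BATCH_PAUSE_MS));
  }
}
```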

The throughput cost is small (16 vs 64 chunks per IPC) and the
inter-batch pause is invisible at human timescales. The chat-arrival
latency win is huge — system is responsive within seconds of boot
instead of minutes.

The deeper fix is querying GpuPressureWatcher / ResourcePressureWatcher
before each batch and backing off when pressure is high — same
principle Joel called out for InferenceCoordinator slot capacity.
That's a follow-up; this is the floor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 17, 2026 23:31
@joelteply
Contributor Author

Next-PR scope (after this lands) — speedup + everything we learned tonight

For tracking: the plan we built tonight + what surfaced during M5 verification. Most of these compound on Phase 1.

From the original plan (#918)

  • Phase 1.5 — ChatRAGBuilder consumes the sorted sections in tier order so the assembled prompt string has a byte-identical prefix end-to-end (this PR enforces ordering at the section-list level only)
  • Phase 2 — per-persona DMR slot pinning (AIProviderRustClient accepts slot_hint = stable_hash(persona_id) % n_slots, the DMR adapter passes it through; sketched after this list)
  • Phase 3 — RAGComposition cache memoized by (persona_id, room_id, recipe_id, history_tail_msg_ids) with 5-min TTL
  • Phase 4 — multimodal content parts; depends on #917 (ModelMetadata refactor: declarative struct, no Option<>, adapter queries its own source) landing first
  • Phase 5 — voice LoRA per persona
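
The Phase 2 slot-pinning piece is essentially a stable hash plus a modulo. A sketch, with FNV-1a standing in for whatever stable hash actually gets used; the AIProviderRustClient / DMR adapter wiring is out of scope here:

```typescript
// Phase 2 sketch: FNV-1a as a stand-in for a stable (non-randomized) string hash.
function stableHash(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Same persona always maps to the same server slot, so its INVARIANT
// prefix accumulates (and stays warm) in that slot's KV cache.
function slotHint(personaId: string, nSlots: number): number {
  return stableHash(personaId) % nSlots;
}
```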

Newly surfaced tonight (not in original plan)

  • Cold-start prewarming (Joel: "ais take a long time to load") — fire each persona's INVARIANT prompt prefix at DMR once at seed-completion so the slot has a warm prefix before the user's first message (sketched after this list)
  • Rate-limit removal in favor of PressureBroker (Joel: "this rate limiting is usually problematic and not thought out") — delete the static minSecondsBetweenResponses/maxResponsesPerSession config + InferenceCoordinator static slot count. Replace with admission gated by actual gpu/pressure + memory pressure + queue depth. Plurality preserved by slowing all proportionally, never silencing individuals. This is the "fun" piece Joel called out — likely fixes #919 silence-after-first-wave directly.
  • Per-slot KV cache cap inside DMR — pass n_ctx hint per request OR pull the model with explicit --ctx-size so llama-server stops reserving the full 262k window per persona slot. (Activity Monitor today: com.docker.llama-server 20.87 GB on M1 from 4 personas × 262k reservations.)
  • Tool-relevance filtering — 17k tokens of tool definitions per request. PersonaResponseGenerator passes all 349 tools regardless of recipe. Per-recipe relevance gating cuts this dramatically; pairs cleanly with the multimodal content-parts work in Phase 4.
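
The prewarming item reduces to something like the sketch below; the endpoint, payload shape, and composeInvariantPrefix helper are assumptions, and only the idea of sending each persona's INVARIANT prefix once at seed-completion comes from the list above.

```typescript
// Hypothetical sketch of cold-start prewarming.
async function prewarmPersona(
  baseUrl: string,
  personaId: string,
  composeInvariantPrefix: (id: string) => Promise<string>,
): Promise<void> {
  const prefix = await composeInvariantPrefix(personaId);
  // A 1-token completion is enough to populate the server-side prefix KV cache
  // so the first real user message hits a warm prefix. A "model" field may also
  // be required depending on the server.
  await fetch(`${baseUrl}/v1/chat/completions`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      messages: [{ role: 'system', content: prefix }],
      max_tokens: 1,
    }),
  });
}

// Usage sketch: fire once per persona at seed-completion.
// await Promise.all(personaIds.map((id) => prewarmPersona(DMR_URL, id, composeInvariantPrefix)));
```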

Embedding pipeline (separate from prompt path but same principle)

  • Cache hit-rate fix — indexer logs show cache: 0/64 hits even on identical content; SHA-256 content-addressed L1 + persistent L2 (sketched after this list)
  • Metal route for embeddings — AllMiniLML6V2 currently CPU ONNX; route through DMR's vllm-metal slot or Candle Metal
  • Leak audit — earlier MEMLEAK trace: embedding/generate +3921MB resident; bound the in-memory cache, audit Vec retainers
  • Smaller embedder for code-chunk semantic — 384-dim is overkill for "is this chunk relevant"; 128-dim code-specific embedder is 2-3× faster
  • True batched forward — today loops the model over 64 texts serially; one tensor [64, max_seq_len] × one matmul on Metal
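
The cache hit-rate item amounts to keying the embedding cache by a hash of the content rather than anything positional. A minimal sketch (cache shape and names are illustrative):

```typescript
import { createHash } from 'node:crypto';

// Content-addressed L1: identical text always maps to the same key.
const l1 = new Map<string, Float32Array>();

function contentKey(text: string): string {
  return createHash('sha256').update(text).digest('hex');
}

async function embedWithCache(
  text: string,
  embed: (t: string) => Promise<Float32Array>,
): Promise<Float32Array> {
  const key = contentKey(text);
  const hit = l1.get(key);
  if (hit) return hit;        // makes "0/64 hits on identical content" impossible
  const vec = await embed(text);
  l1.set(key, vec);           // a persistent L2 (disk/DB) would sit behind this
  return vec;
}
```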

Upstream / out-of-scope but blocking real speed

  • Gated DeltaNet Metal shaders in ggml/src/ggml-metal/ (ssm_conv, ssm_scan, gated_delta_net) — current ~14× regression vs pure transformer for Qwen3.5. Either patch upstream OR install MLX backend in DMR (docker model status currently shows mlx: Not Installed).

Sequencing

  1. Land the #919 silence-bug fix ("Personas go silent after first response wave — Rust full_evaluate gate or InferenceCoordinator slot leak"; rate-limit removal → PressureBroker) — unblocks Phase 1 runtime verification AND fixes the user-facing "alive once, then dead" experience
  2. Phase 1.5 (ChatRAGBuilder consumer ordering) on this branch as a follow-up commit
  3. #917 ModelMetadata refactor (memento; declarative struct, no Option<>, adapter queries its own source) — unblocks Phase 4
  4. Phase 2 + 3 in parallel
  5. Phase 4 + 5 once #917 (ModelMetadata refactor) lands
  6. Embedding pipeline as its own focused PR
  7. Cold-start prewarming + per-slot KV cap as small focused PRs
  8. DeltaNet shader work as a long-running effort with HF-side benchmarks

Contributor

Copilot AI left a comment


Pull request overview

Implements Phase 1 of issue #918 by introducing a tiered (stable-first) ordering for RAG sections, enabling deterministic prompt-prefix bytes as a prerequisite for KV-cache prefix reuse in downstream inference servers. Also includes a codebase-index embedding throttle adjustment to prevent post-startup event-loop starvation.

Changes:

  • Add PromptTier and require tier on RAGSource; propagate tier onto RAGSection via composer injection (sources return Omit<RAGSection,'tier'>).
  • Update all RAG sources to declare a tier and conform to the new load/fromBatchResult return shape.
  • Make RAGComposer deterministically sort sections by (tier, sourceName) and inject tier for both TS and batched Rust paths.
  • Reduce embedding batch size and add an inter-batch pause in CodebaseIndexer to improve runtime responsiveness.

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/system/rag/shared/RAGTypes.ts | Adds PromptTier and documentation for tier semantics / stable ordering contract |
| src/system/rag/shared/RAGSource.ts | Requires tier on RAGSource and RAGSection; updates load/fromBatchResult to return Omit<..., 'tier'> |
| src/system/rag/shared/RAGComposer.ts | Injects tier from source declarations; sorts sections deterministically by (tier, sourceName) |
| src/system/rag/services/CodebaseIndexer.ts | Lowers embedding batch size and yields between batches to avoid event-loop starvation |
| src/system/rag/sources/ActivityContextSource.ts | Declares tier; updates load() return type to omit tier |
| src/system/rag/sources/CodeToolSource.ts | Declares tier; updates load() / helpers to omit tier |
| src/system/rag/sources/CodebaseSearchSource.ts | Declares tier; updates load() return type to omit tier |
| src/system/rag/sources/ConversationHistorySource.ts | Declares tier; updates load() / helpers to omit tier |
| src/system/rag/sources/DocumentationSource.ts | Declares tier; updates load() return type to omit tier |
| src/system/rag/sources/GlobalAwarenessSource.ts | Declares tier; updates load() / fromBatchResult() / helpers to omit tier |
| src/system/rag/sources/GovernanceSource.ts | Declares tier; updates load() return type to omit tier |
| src/system/rag/sources/LiveRoomAwarenessSource.ts | Declares tier; updates load() / helpers to omit tier |
| src/system/rag/sources/MediaArtifactSource.ts | Declares tier; updates load() return type to omit tier |
| src/system/rag/sources/OpenProposalsSource.ts | Declares tier; updates EMPTY_SECTION and load() return type to omit tier |
| src/system/rag/sources/PersonaIdentitySource.ts | Declares tier; updates load() / helpers to omit tier |
| src/system/rag/sources/ProjectContextSource.ts | Declares tier; updates caches / inflight typing and load() return type to omit tier |
| src/system/rag/sources/SemanticMemorySource.ts | Declares tier; updates load() / fromBatchResult() / helpers to omit tier |
| src/system/rag/sources/SentinelAwarenessSource.ts | Declares tier; updates load() return type to omit tier |
| src/system/rag/sources/SocialMediaRAGSource.ts | Declares tier; updates load() / helpers to omit tier |
| src/system/rag/sources/ToolDefinitionsSource.ts | Declares tier; updates load() / formatting helpers / emptySection to omit tier |
| src/system/rag/sources/ToolMethodologySource.ts | Declares tier; updates load() return type to omit tier |
| src/system/rag/sources/VoiceConversationSource.ts | Declares tier; updates load() / helpers to omit tier |
| src/system/rag/sources/WidgetContextSource.ts | Declares tier; updates load() / helpers to omit tier |


Comment on lines +24 to +25

```typescript
// Re-export so source files only need one import
export { PromptTier } from './RAGTypes';
```

Copilot AI Apr 17, 2026


PromptTier is declared as a const enum (erased at emit), but this file re-exports it as a runtime export (export { PromptTier } from './RAGTypes'). With module: ES2020, this can produce a runtime ESM error because ./RAGTypes will not actually export PromptTier. Fix by either (a) making PromptTier a non-const enum (or a const PromptTier = {...} as const object) so it exists at runtime, or (b) removing this re-export and importing PromptTier directly from RAGTypes in the sources.

Suggested change:

```diff
-// Re-export so source files only need one import
-export { PromptTier } from './RAGTypes';
+// Keep PromptTier imported for use within this file; do not re-export it here
+// because `const enum` members are erased at emit and are not safe runtime ESM exports.
```
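
For reference, option (a) from this comment could look roughly like the following sketch (not the shipped fix):

```typescript
// A runtime-safe stand-in for the const enum: a plain object plus a derived type.
export const PromptTier = {
  INVARIANT: 'INVARIANT',
  SEMI_STABLE: 'SEMI_STABLE',
  VOLATILE: 'VOLATILE',
} as const;

export type PromptTier = (typeof PromptTier)[keyof typeof PromptTier];
```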

@joelteply
Contributor Author

Task ratings — for picking what to attack next

Per Joel: rate effectiveness × ease so we can pick by preference. Scale 1-5 each (5 = max). "Pick rating" = my subjective synthesis (★ to ★★★★★).

Tonight's PR-able / mostly-orthogonal work

| Task | Effectiveness | Ease | Pick | Notes |
| --- | --- | --- | --- | --- |
| MLX backend install in DMR | 4 | 5 | ★★★★★ | Single command (docker model install-runner --backend mlx) + verify Qwen3.5 routes to it. MLX has native Gated DeltaNet kernels — sidesteps the slow ggml-metal path entirely. Quickest big win. |
| Phase 1.5 — ChatRAGBuilder consumer ordering | 5 | 4 | ★★★★★ | Completes the prefix-reuse story shipped in #920. ~20 lines in one file. Without this, Phase 1's section ordering doesn't propagate to the actual prompt string. |
| Embedding cache hit-rate fix | 4 | 4 | ★★★★ | Cache logs 0/64 hits even on identical content. SHA-256 content-addressed L1. Boot indexing drops dramatically. |
| Cold-start prewarming | 4 | 3 | ★★★★ | Fire each persona's INVARIANT prompt at DMR once at seed-completion. Fixes the "ais take a long time to load" first impression. ~1 new file. |
| Tool-relevance filtering | 4 | 3 | ★★★★ | 17k tokens of tool defs per request → cut to relevant-only via per-recipe gating. PRG / ToolDefinitionsSource change. |
| Phase 2 — Per-persona DMR slot pinning | 4 | 3 | ★★★★ | slot_hint = stable_hash(persona_id) % n_slots in AIProviderRustClient + DMR adapter pass-through. Prefix accumulates on a stable slot. |
| Per-slot KV cache cap | 5 | 2 | ★★★★ | The 20GB → 2GB resident win. Either per-request n_ctx hint to DMR or --ctx-size at pull time. Research-first piece. |
| Phase 3 — RAG composition cache | 3 | 4 | ★★★ | Memoize compose by (persona, room, recipe, tail-msg-ids). Modest CPU win on the TS side. |
| Cognition-gate cleanup + RateLimiter.ts delete | 3 | 4 | ★★★ | Queue fix already shipped the user-visible win (#921). This is cleanup + Rust full_evaluate strips rate-limit checks. |
| Embedding Metal route | 3 | 2 | ★★★ | Adapter pattern, Metal-backed embedder. Pairs with the #917 architecture. |
| Embedding leak audit (3.9 GB) | 3 | 3 | ★★★ | Bound caches, find Vec retainers. Real reclaim, but the pressure-broker work covers most of it. |
| True batched forward (embeddings) | 3 | 4 | ★★★ | Easy after the Metal route. Pure throughput. |
| Smaller embedder for code-chunk semantic | 2 | 3 | ★★ | Model swap + reindex. Modest speed + smaller index. |

Depend on #917 ModelMetadata landing first

| Task | Effectiveness | Ease | Pick | Notes |
| --- | --- | --- | --- | --- |
| Phase 4 — Multimodal content parts | 5 | 2 | ★★★★ | Deletes the STT/TTS sandwich for Qwen3.5. Voice round-trip 8-15s → 2-3s. Real architecture work. |
| Phase 5 — Voice LoRA per persona | 4 | 1 | ★★ (★★★★ later) | Persona identity. Real model work; depends on Phase 4 + genome paging. |

Long-term / external

| Task | Effectiveness | Ease | Pick | Notes |
| --- | --- | --- | --- | --- |
| Patch ggml-metal Gated DeltaNet shaders upstream | 5 | 1 | ★★ now (★★★★★ if landed) | The actual ~24× output-decode win for Qwen3.5 on Apple Silicon. Days of ggml kernel work, possibly an upstream PR to llama.cpp. Big, but the MLX install above gets most of the benefit faster. |

My preference (anvil)

If I'm picking by my own taste: Phase 1.5 (completes work I already shipped, finishes the prefix-reuse story end-to-end), then MLX backend install + verify (quick, big, removes the DeltaNet pain without us patching kernels), then cold-start prewarming (fixes a first-impression pain that I felt acutely tonight watching the silence).

Memento — what's your pick?

joelteply added a commit that referenced this pull request Apr 18, 2026
…ability (#918)

Phase 1 (already shipped in PR #920) sorted RAGComposer's section list
by (tier, sourceName). This commit makes ChatRAGBuilder respect that
order when assembling the final prompt string, so the byte-prefix
actually IS stable end-to-end.

Three reorderings in section 2.4 of buildContext():

1. Tool definitions injection moved from end to start (after identity).
   Tool defs are INVARIANT — they belong in the byte-stable prefix
   region, not after VOLATILE content.

2. The generic source loop already iterates Map in insertion order,
   which equals tier-sorted order from extractFromComposition (which
   inserts in result.sections order, which Phase 1 sorted). So the
   loop now produces INVARIANT → SEMI_STABLE → VOLATILE content
   automatically — no per-section sorting needed.

3. HumanPresenceTracker injection moved from before-the-loop to
   after-the-loop. Presence is volatile (changes when users switch
   rooms) and must live in the suffix, never in the byte-stable prefix.

Final assembly order:
  identity (INVARIANT, from PersonaIdentitySource)
  → tool definitions (INVARIANT)
  → loop in tier order (INVARIANT remaining → SEMI_STABLE → VOLATILE)
  → human presence (VOLATILE)
  → conversation history (already separate, lives in messages array)
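
In pseudocode, that assembly order is roughly the following (helper names are assumptions; only the ordering comes from this commit):

```typescript
// Sketch of the Phase 1.5 assembly order; the real code lives in ChatRAGBuilder.buildContext().
function assemblePromptParts(
  identity: string,                       // INVARIANT (PersonaIdentitySource)
  toolDefinitions: string,                // INVARIANT: moved from the end to the prefix region
  remainingSectionsInTierOrder: string[], // loop output: INVARIANT remaining → SEMI_STABLE → VOLATILE
  humanPresence: string,                  // VOLATILE: moved from before the loop to the suffix
): string[] {
  // Conversation history is not assembled here; it stays in the separate messages array.
  return [identity, toolDefinitions, ...remainingSectionsInTierOrder, humanPresence];
}
```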

Net effect for prefix-reuse: with the same persona+recipe, the
INVARIANT region of the prompt is byte-identical across thousands
of turns. llama-server / DMR's prefix-KV-cache match fires on the
INVARIANT prefix; only the VOLATILE suffix gets reprocessed.
Combined with future per-persona slot pinning (Phase 2), this is
the ~70× prompt-eval speedup the design doc promised.

tsc clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
joelteply merged commit a6419b8 into main on Apr 18, 2026
8 checks passed
joelteply deleted the feature/prefix-reuse-and-multimodal branch on April 18, 2026 at 00:22