11 changes: 7 additions & 4 deletions .lore.md
@@ -19,7 +19,7 @@
* **Batch API integration: gateway enhancement, not mandatory architecture shift**: Implementing Anthropic Message Batches API as a gateway-only feature (50% cost savings on distillation/curation workers) does not require mandating gateway for all deployments. Direct plugin path continues working normally; batching is an optional gateway optimization that transparently accumulates non-urgent distill/distill-curation calls, flushes every N seconds, polls results in background. Keeps gateway experimental status while capturing savings on high-volume workers (`distillSegment`, `metaDistill`, `consolidate`, worker validation). Estimate: ~$1,100/month savings on Lore workers alone.

<!-- lore:019dfa53-b925-70e2-8f84-cab808d8e115 -->
- * **Batch distillation consumption to reduce cache-bust frequency**: Batch distillation consumption at turn boundaries: Refresh `loadDistillations()` only at turn boundaries (new user message) or after idle gap > cache TTL (~5min). During autonomous tool chains (consecutive assistant→tool→assistant), freeze prefix—no DB hits. Context: prefix refresh costs `context_size × $3.75/MTok` (~$1.88 per bust for 500K Sonnet). New distillations have marginal value mid-chain—model already has raw messages. Turn-boundary refresh reduces 189 arrivals → 8 refresh points in typical session, cutting bust cache writes from $639 → ~$15 (97% reduction). Combine with batching background distill workers: accumulate `backgroundDistill()` calls, flush at turn boundaries instead of firing on every `message.updated` event.
+ * **Batch distillation consumption to reduce cache-bust frequency**: Refresh `loadDistillations()` only at turn boundaries (new user message) or after idle gap > cache TTL (~5min default). During autonomous tool chains (consecutive assistant→tool→assistant), freeze prefix—no DB hits. Meta-distillation rewrites row IDs; when cache is warm (lastTurnAt within TTL), skip meta to avoid prefix invalidation. Use `getLastTurnAt(sessionID)` to check cache warmth before distilling (see the warmth-check sketch after this diff). This reduces 189 arrivals → 8 refresh points, cutting cache-write cost from $639 → ~$15 (97% reduction).

### Gotcha

@@ -32,6 +32,9 @@
<!-- lore:019dfa4b-d2fb-7195-8f43-f93b5ffac9bb -->
* **Lore transform non-determinism breaks prompt cache between API calls**: Lore transform non-determinism breaks prompt cache. Root causes: (1) `sanitizeToolParts()` uses `Date.now()` on every call → different timestamps for same pending parts → different message bytes → cache bust. Fix: use deterministic timestamp (part.state.time.start or 0). (2) `distilledPrefixCached()` calls `addRelativeTimeToObservations(newRows, new Date())` per gen-0 row → relational time changes → cache bust. Fix: batch consumption at turn boundaries [[019dfa53-b925-70e2-8f84-cab808d8e115]]. Prevent regressions via unit tests covering transform determinism + runtime bust-rate tracking.

+ <!-- lore:019dfe4c-acaa-75d0-a1d3-701f52206945 -->
+ * **Meta-distillation row ID rewrites invalidate distilled prefix cache**: Meta-distillation changes gen-0 row IDs when consolidating rows (e.g., 10→1 gen-1 row), invalidating the distilled prefix cache on the next transform. If the upstream prompt cache is still warm (within the 5min default), this wastes a cache-write burst. Fix: pass `skipMeta: true` to `distillation.run()` when `Date.now() - getLastTurnAt(sessionID) < cacheTTLMs`. Idle workers use this check; manual distillation should too.
+
<!-- lore:019dfcb9-cad6-7290-b526-cc9e4186a290 -->
* **Runtime cost monitoring is log-only, no session budget enforcement**: Lore has cache-bust detection (prefix hash comparison) and overflow recovery, but NO session cost accumulator, alerts, or abort mechanisms. Cost is only tracked post-hoc in eval harnesses. Cache busts log individually via `log.info()` but are never counted or rated. No config option for session spend limits or cost thresholds. Plugin can't abort—only host (OpenCode) can halt. Must implement runtime cost tracking with stderr alerts when session spend exceeds threshold, paired with unit tests for transform determinism to prevent regressions.

@@ -44,7 +47,7 @@
### Pattern

<!-- lore:019dfcb9-cae2-7eb5-9769-8faf8cc8527d -->
- * **Cache bust detection via prefix ID hash but no rate tracking**: Gradient tracks byte-identity of message prefix between turns using `lastPrefixHash` (first 5 message IDs concatenated with layer). When prefix changes, logs cache-bust event via `log.info()` at lines 1682-1696. Also tracks `consecutiveHighLayer` counter for compaction hints (logs at count=3, fires once). But no rolling bust-rate counter, no cumulative bust count per session, no alerting threshold. Need to add per-session `bustCount` and `bustRate` metrics that fire stderr alert when rate > 50% after 20+ API calls.
+ * **Cache bust detection via prefix ID hash with rate tracking**: Gradient now tracks byte-identity of message prefix using `lastPrefixHash` (first 5 message IDs + layer) and logs cache-bust events with rate percentage (bustCount/transformCount). Runtime metrics added: `bustCount` (cumulative busts), `transformCount` (total transform calls). Alerts on stderr when rate > 50% after 20+ API calls (see the alert sketch after this diff). Helps identify regressions in transform determinism. Busts remain expensive: ~$3.75/MTok × context_size per bust.

<!-- lore:019dfa53-b921-766c-b46b-14390cf81010 -->
* **Distillation row arrivals trigger cache busts via prefix budget shifts**: Each new gen-0 distillation row (~189 total across session) changes the distilled prefix text length → shrinks raw window budget → `tryFitStable()` recalculates raw window cutoff → messages evicted/included from front → entire output array bytes change. Even with `tryFitStable()` pinning logic, prefix token growth forces re-evaluation. Result: alternating bust/warm pattern (bust when row arrives, warm on subsequent call with same row count). Meta-distillations compound this: 17 full re-renders with `new Date()` cause relational time annotations to potentially differ, plus row count collapse (e.g., 10 gen-0 → 1 gen-1 row) shrinks prefix drastically.
@@ -59,10 +62,10 @@
* **Gateway package: new fourth runtime adapter for proxy-based context management**: Gateway package: runtime-agnostic HTTP proxy accepting Anthropic `/v1/messages`, applying full Lore pipeline (gradient, LTM, distillation), forwarding upstream. Implements `LLMClient` in `llm-adapter.ts`. Supports optional interceptor for recording/replay. Plugin spawns gateway if not running (probes `http://127.0.0.1:6969/health`, waits 5s; see the probe sketch after this diff), then registers observer hooks in gateway mode to audit gateway decisions without mutating output — logs session ID verification, LTM entries selected, gradient layer/tokens chosen. Observer reads `temporal_messages`, `knowledge` tables; runs local `transform()` and `forSession()` for comparison.

<!-- lore:019dfa4b-d2ff-704a-97b4-e382a46cb7b4 -->
- * **Gradient layer transitions trigger cascade of cache busts in Lore**: Late-stage sessions show phase transition at ~step 668: bust rate jumps from 12% → 51%. Correlates with context window growth crossing layer-0 cap, escalating to layer-1+ (higher cost, different message restructuring). Each layer transition may alter how gradient injects context, changing message array bytes and invalidating prompt cache. Effect compounds: higher layer cost + more busts = quadratic explosion. Monitor gradient layer choice at step transitions; may need per-layer cache validation or deterministic layer boundary crossing.
+ * **Gradient layer transitions trigger cascade of cache busts in Lore**: Late-stage sessions show phase transition at ~step 668: bust rate jumps 12% → 51%. Sticky layer guard now pins to `lastLayer` (not layer 0) when message count is stable, preventing oscillation (0→1→0 or 1→2→1) that rewrites context bytes. Example: layer 2 strips tool outputs (different bytes), so bouncing back to layer 1 restores them → two busts. Guard only applies to calibrated sessions to isolate impact.

<!-- lore:019dc5e2-c998-7395-9591-b0214485832d -->
- * **Idle-resume cache refresh: clear caches when wall-clock gap exceeds prompt cache TTL**: Clear caches when wall-clock gap exceeds prompt cache TTL. If `now - lastTurnAt > 60min`, call `onIdleResume(sessionID)` in pre-LLM hook to clear `prefixCache`, `rawWindowCache`, delete `ltmSessionCache`, set `cameOutOfIdle=true`.
+ * **Idle-resume cache refresh: clear caches when wall-clock gap exceeds prompt cache TTL**: If `now - lastTurnAt > cacheTTLMs` (default 5min for Anthropic default-tier, configurable via `idleResumeMinutes`), call `onIdleResume(sessionID)` in pre-LLM hook to clear `prefixCache`, `rawWindowCache`, delete `ltmSessionCache`, set `cameOutOfIdle=true`. Anthropic's default-tier prompt cache TTL is ~5 minutes (not 1 hour); beyond that window, preserving byte-identity buys nothing: the full cache write is paid regardless.

<!-- lore:019df987-1c4f-7205-b320-f01f2c32cdce -->
* **Long-running autonomous sessions hit quadratic cache cost — session length budget needed**: Long-running sessions hit quadratic cache cost via non-deterministic transform. Session with 1,345 API calls: 314 calls (23%) read only 40,913 tokens (system prompt), rewriting 400–690K tokens each (busts). Two root causes: (1) Distillation row arrivals (~189 total) change `distilledPrefix()` length → shrink raw window budget → entire message array bytes change. (2) `sanitizeToolParts()` line 833 uses `Date.now()` to convert pending tool parts to error, producing different timestamps on every `transform()` call even with same input. OpenCode's cache fix (e148f00aa) preserves old pending parts in cached array—but Lore re-timestamps them. Fix distillation consumption at turn boundaries [[019dfa53-b925-70e2-8f84-cab808d8e115]] and use deterministic timestamp (0 or message.time.created) instead of `Date.now()` in sanitizeToolParts (see the sanitization sketch after this diff).
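The turn-boundary and warmth notes above reduce to a single check before background distillation. A minimal sketch, assuming `getLastTurnAt()` returns the wall-clock milliseconds of the session's last turn (0 if unknown); the import paths and the full shape of `run()`'s input are assumptions, but the `skipMeta`/`urgent` options match the distillation.ts diff below.

```ts
import * as distillation from "./distillation";
import { getLastTurnAt } from "./gradient"; // assumed export location
import type { LLMClient } from "./types";

// Anthropic default-tier prompt cache TTL (~5 minutes).
const DEFAULT_CACHE_TTL_MS = 5 * 60 * 1000;

async function backgroundDistill(
  llm: LLMClient,
  projectPath: string,
  sessionID: string,
  cacheTTLMs: number = DEFAULT_CACHE_TTL_MS,
): Promise<void> {
  // Warm = the last turn happened within the prompt cache TTL.
  const cacheWarm = Date.now() - getLastTurnAt(sessionID) < cacheTTLMs;
  await distillation.run({
    llm,
    projectPath,
    sessionID,
    // Meta-distillation rewrites gen-0 row IDs and would invalidate the
    // distilled prefix, so skip it while the upstream cache is warm.
    skipMeta: cacheWarm,
    // Background work blocks nobody; a batch-capable adapter can defer
    // non-urgent calls for the 50% Batch API discount.
    urgent: false,
  });
}
```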
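The bust-rate pattern is a handful of per-session counters. A sketch of the counting and alert logic under the thresholds quoted above (alert once when rate > 50% after 20+ calls); `bustCount`, `transformCount`, and `lastPrefixHash` match the gradient.ts session state, while the `alerted` latch and log text are assumptions.

```ts
type BustStats = {
  lastPrefixHash: string; // "" until the first transform
  bustCount: number;
  transformCount: number;
  alerted: boolean;
};

function trackBust(stats: BustStats, sessionID: string, prefixHash: string): void {
  stats.transformCount++;
  // A changed prefix hash between consecutive transforms means the upstream
  // prompt cache was invalidated (a bust).
  if (stats.lastPrefixHash !== "" && prefixHash !== stats.lastPrefixHash) {
    stats.bustCount++;
  }
  stats.lastPrefixHash = prefixHash;
  const rate = stats.bustCount / stats.transformCount;
  if (!stats.alerted && stats.transformCount >= 20 && rate > 0.5) {
    stats.alerted = true; // fire once per session
    process.stderr.write(
      `[lore] cache-bust rate ${(rate * 100).toFixed(0)}% ` +
        `(${stats.bustCount}/${stats.transformCount}) session=${sessionID}\n`,
    );
  }
}
```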
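The gateway bullet's probe-and-spawn handshake is concrete enough to sketch. The health URL and the 5s wait come from the note above; `spawnGateway()` and the polling cadence are assumptions.

```ts
const HEALTH_URL = "http://127.0.0.1:6969/health";

async function healthy(timeoutMs: number): Promise<boolean> {
  try {
    const res = await fetch(HEALTH_URL, { signal: AbortSignal.timeout(timeoutMs) });
    return res.ok;
  } catch {
    return false;
  }
}

// Probe for a running gateway; spawn one if absent, then wait up to 5s.
async function ensureGateway(spawnGateway: () => void): Promise<void> {
  if (await healthy(1_000)) return; // already running
  spawnGateway();
  const deadline = Date.now() + 5_000;
  while (Date.now() < deadline) {
    if (await healthy(500)) return;
    await new Promise((r) => setTimeout(r, 200)); // brief backoff between probes
  }
  throw new Error("lore gateway did not become healthy within 5s");
}
```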
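The `sanitizeToolParts()` determinism fix named in the gotchas above is mechanical: derive the timestamp from the part itself, never from the clock. A sketch with an assumed part shape (the real OpenCode part type differs in detail).

```ts
type ToolPartState =
  | { status: "pending"; time?: { start: number } }
  | { status: "error"; error: string; time: { start: number; end: number } };

function sanitizePendingPart(state: ToolPartState): ToolPartState {
  if (state.status !== "pending") return state;
  // Deterministic: reuse the part's recorded start time (or 0), never
  // Date.now(), so the same input always yields the same bytes and the
  // prompt-cache prefix stays identical across transform() calls.
  const t = state.time?.start ?? 0;
  return { status: "error", error: "tool call interrupted", time: { start: t, end: t } };
}
```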
13 changes: 11 additions & 2 deletions packages/core/src/distillation.ts
@@ -532,6 +532,11 @@ export async function run(input: {
* and causes a cache bust on the next turn. Callers should set this to true
* when `Date.now() - getLastTurnAt(sessionID) < cacheTTL`. */
skipMeta?: boolean;
+ /** When true, all LLM calls in this run are marked urgent and bypass the
+  * batch queue (if one is active). Use for compaction and overflow recovery
+  * where the caller is blocking on the result. Background/idle distillation
+  * should leave this false to benefit from batch API 50% cost savings. */
+ urgent?: boolean;
}): Promise<{ rounds: number; distilled: number }> {
// Reset orphaned messages (marked distilled by a deleted/migrated distillation)
const orphans = resetOrphans(input.projectPath, input.sessionID);
@@ -565,6 +570,7 @@
sessionID: input.sessionID,
messages: segment,
model: input.model,
+ urgent: input.urgent,
});
if (result) {
distilled += segment.length;
@@ -586,6 +592,7 @@
projectPath: input.projectPath,
sessionID: input.sessionID,
model: input.model,
+ urgent: input.urgent,
});
rounds++;
}
@@ -603,6 +610,7 @@
sessionID: string;
messages: TemporalMessage[];
model?: { providerID: string; modelID: string };
+ urgent?: boolean;
}): Promise<DistillationResult | null> {
const prior = latestObservations(input.projectPath, input.sessionID);
const text = messagesToText(input.messages);
@@ -625,7 +633,7 @@
const responseText = await input.llm.prompt(
DISTILLATION_SYSTEM,
userContent,
{ model, workerID: "lore-distill", thinking: false },
{ model, workerID: "lore-distill", thinking: false, urgent: input.urgent },
);
if (!responseText) return null;

Expand Down Expand Up @@ -676,6 +684,7 @@ export async function metaDistill(input: {
projectPath: string;
sessionID: string;
model?: { providerID: string; modelID: string };
+ urgent?: boolean;
}): Promise<DistillationResult | null> {
const existing = loadGen0(input.projectPath, input.sessionID);

@@ -703,7 +712,7 @@
const responseText = await input.llm.prompt(
RECURSIVE_SYSTEM,
userContent,
{ model, workerID: "lore-distill", thinking: false },
{ model, workerID: "lore-distill", thinking: false, urgent: input.urgent },
);
if (!responseText) return null;

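Taken together, the new `urgent` option threads from `run()` into both `distillSegment` and `metaDistill`. A sketch of the intended call-site split; `base` stands in for the `run()` inputs not shown in this diff (LLM client, project path, model), and `cacheWarm` is the warmth check sketched earlier.

```ts
// Blocking path (compaction, overflow recovery): the caller awaits the
// result before the turn can proceed, so bypass any batch queue.
await distillation.run({ ...base, urgent: true });

// Background path (incremental/idle distillation): fire-and-forget,
// eligible for Batch API pricing; skip meta while the prompt cache is warm.
void distillation.run({ ...base, skipMeta: cacheWarm, urgent: false });
```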
40 changes: 39 additions & 1 deletion packages/core/src/gradient.ts
@@ -119,6 +119,15 @@ type SessionState = {
* the post-idle turn regardless of conversation size.
*/
cameOutOfIdle: boolean;
+ /**
+  * Set true by onIdleResume() alongside cameOutOfIdle; consumed (and cleared)
+  * by transformInner() to activate the post-idle compact layer. When true AND
+  * distillations exist, transform skips layer 0 (full-raw passthrough) and
+  * uses a tighter raw budget for layer 1. Rationale: on a cold cache the
+  * entire context is a cache WRITE — a smaller total means lower write cost,
+  * and aggressive idle distillation already captured the older history.
+  */
+ postIdleCompact: boolean;
/** Consecutive turns at layer >= 2. When >= 3, log a compaction hint. */
consecutiveHighLayer: number;
/** Hash of the first message IDs in the last transform output — for cache-bust diagnostics. */
@@ -156,6 +165,7 @@ function makeSessionState(): SessionState {
rawWindowCache: null,
lastTurnAt: 0,
cameOutOfIdle: false,
+ postIdleCompact: false,
consecutiveHighLayer: 0,
lastPrefixHash: "",
bustCount: 0,
@@ -225,6 +235,7 @@ export function onIdleResume(
state.rawWindowCache = null;
state.distillationSnapshot = null;
state.cameOutOfIdle = true;
+ state.postIdleCompact = true;
return { triggered: true, idleMs };
}

@@ -416,6 +427,7 @@ export function inspectSessionState(sessionID: string): {
hasPrefixCache: boolean;
hasRawWindowCache: boolean;
cameOutOfIdle: boolean;
+ postIdleCompact: boolean;
lastTurnAt: number;
distillationSnapshot: DistillationSnapshot | null;
} | null {
@@ -425,6 +437,7 @@
hasPrefixCache: state.prefixCache !== null,
hasRawWindowCache: state.rawWindowCache !== null,
cameOutOfIdle: state.cameOutOfIdle,
+ postIdleCompact: state.postIdleCompact,
lastTurnAt: state.lastTurnAt,
distillationSnapshot: state.distillationSnapshot,
};
@@ -1254,7 +1267,8 @@ function transformInner(input: {
contextLimit - outputReserved - overhead - ltmTokens,
);
const distilledBudget = Math.floor(usable * cfg.budget.distilled);
- const rawBudget = Math.floor(usable * cfg.budget.raw);
+ // Base raw budget. May be overridden below for post-idle compact mode.
+ let rawBudget = Math.floor(usable * cfg.budget.raw);

// --- Force escalation (reactive error recovery) ---
// When the API previously rejected with "prompt is too long", skip layers
@@ -1308,6 +1322,30 @@
effectiveMinLayer = Math.max(effectiveMinLayer, sessState.lastLayer) as SafetyLayer;
}

+ // --- Post-idle compact layer ---
+ // When the cache just went cold (onIdleResume fired), skip layer 0 full-raw
+ // passthrough and use a tighter raw budget. Rationale: the entire context is
+ // a cache WRITE regardless — a smaller total costs less to write, and
+ // aggressive idle distillation already captured older history in the prefix.
+ // The flag is one-shot: consumed here and reset so subsequent turns use
+ // normal budgets once the cache is warm.
+ const postIdleCompact = sessState.postIdleCompact;
+ if (postIdleCompact) {
+   sessState.postIdleCompact = false;
+   // Skip layer 0 — don't pass through all raw messages on a cold cache.
+   effectiveMinLayer = Math.max(effectiveMinLayer, 1) as SafetyLayer;
+   // Use a tighter raw budget: 20% of usable instead of the normal 40%.
+   // The distilled prefix covers the older history; the raw window only
+   // needs the current turn + minimal recent context. This reduces the
+   // total cold-cache write cost by up to 20% of usable (~29K tokens on
+   // a 200K context model).
+   rawBudget = Math.floor(usable * 0.20);
+   log.info(
+     `post-idle compact: session=${sid} rawBudget=${rawBudget}` +
+       ` (${Math.floor(usable * cfg.budget.raw)}→${rawBudget})`,
+   );
+ }
+
let expectedInput: number;
if (calibrated) {
// Exact approach: prior API count + estimate of only genuinely new messages.
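The flag is deliberately one-shot. A schematic trace of its lifecycle using the inspection hook from this diff; `onIdleResume`'s full parameter list isn't shown here, so that call is abbreviated.

```ts
// Idle gap exceeded the prompt cache TTL: caches cleared, flag set.
onIdleResume(sessionID /* , ...args not shown in this diff */);
inspectSessionState(sessionID)?.postIdleCompact; // => true

// The next transform consumes the flag: layer 0 is skipped and the raw
// budget drops from 40% to 20% of usable, shrinking the unavoidable
// cold-cache write. The flag resets so later turns use normal budgets.
transform(/* ... */);
inspectSessionState(sessionID)?.postIdleCompact; // => false
```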
2 changes: 1 addition & 1 deletion packages/core/src/search.ts
@@ -299,7 +299,7 @@ export async function expandQuery(
llm.prompt(
QUERY_EXPANSION_SYSTEM,
`Input: "${query}"`,
{ model, workerID: "lore-query-expand", thinking: false },
{ model, workerID: "lore-query-expand", thinking: false, urgent: true },
),
new Promise<null>((resolve) => setTimeout(() => resolve(null), TIMEOUT_MS)),
]);
15 changes: 15 additions & 0 deletions packages/core/src/types.ts
@@ -217,6 +217,21 @@ export interface LLMClient {
* relies on Part A (non-reasoning model selection) instead
*/
thinking?: boolean;
+ /**
+  * When true, the request must be processed immediately and the result
+  * returned before the next user turn. When false or absent, the request
+  * may be deferred to a batch queue for cost savings (50% discount via
+  * Anthropic's Message Batches API).
+  *
+  * Callers that `await` the result for a blocking operation (compaction,
+  * overflow recovery, query expansion) should set `urgent: true`.
+  * Fire-and-forget background work (incremental distillation, idle
+  * curation) should leave it unset or set `false`.
+  *
+  * Only the gateway's BatchLLMClient honors this flag; other adapters
+  * (OpenCode, Pi) ignore it and always process immediately.
+  */
+ urgent?: boolean;
},
): Promise<string | null>;
}
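Only a batch-capable adapter needs to look at the flag. A sketch (not the gateway's actual `BatchLLMClient`) of how one might honor it; `submitBatchAndPoll()` is a hypothetical helper wrapping Anthropic's Message Batches API.

```ts
import type { LLMClient } from "./types";

type PromptOpts = Parameters<LLMClient["prompt"]>[2];
type Pending = {
  system: string;
  user: string;
  opts: PromptOpts;
  resolve: (r: string | null) => void;
};

declare function submitBatchAndPoll(pending: Pending[]): Promise<(string | null)[]>;

export class BatchingLLMClient implements LLMClient {
  private queue: Pending[] = [];

  constructor(private direct: LLMClient, flushEveryMs = 10_000) {
    setInterval(() => void this.flush(), flushEveryMs);
  }

  prompt(system: string, user: string, opts: PromptOpts): Promise<string | null> {
    // Urgent callers are blocking on the result: go straight upstream.
    if (opts?.urgent) return this.direct.prompt(system, user, opts);
    // Everything else waits for the next flush (50% Batch API discount).
    return new Promise((resolve) => this.queue.push({ system, user, opts, resolve }));
  }

  private async flush(): Promise<void> {
    const pending = this.queue.splice(0); // drain the queue atomically
    if (pending.length === 0) return;
    const results = await submitBatchAndPoll(pending);
    pending.forEach((p, i) => p.resolve(results[i] ?? null));
  }
}
```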