Fix inline summarization prompt cache misses #309701

kevin-m-kent wants to merge 1 commit into microsoft:main
Conversation
The background inline summarizer forks messages from the main render but applies different post-processing than the tool calling loop:

- Main agent call: stripInternalToolCallIds + validateToolMessages (filters orphaned tool results that lack matching assistant tool calls)
- Background summarizer: stripInternalToolCallIds only

After a summarization is applied, prompt-tsx re-renders the conversation with summarized history. Tool results that referenced tool calls from the now-summarized rounds become orphaned. The main call filters these out via validateToolMessages, but the background summarizer keeps them. This causes the message arrays to diverge, breaking prefix-based prompt caching (e.g., Anthropic's cache_control).

The divergence specifically occurs on the 2nd+ summarization in the same turn, because the 1st summarization creates the orphaned messages that the 2nd summarization's forked copy includes but the main call filters out. This explains the observed pattern of 0% cache hit rate on 2nd+ summarizations while 1st summarizations get 65-98% hits.

Fix: apply validateToolMessagesCore to the forked messages in the background summarizer, matching the main call's processing pipeline. Also move addCacheBreakpoints() to run before _startBackgroundSummarization to ensure cache breakpoint ordering is deterministic (code clarity improvement).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Fixes prompt cache misses in the inline background summarization flow by aligning its message post-processing and cache-breakpoint ordering with the main agent fetch pipeline, so Anthropic prefix-based prompt caching can achieve high hit rates on 2nd+ summarizations within the same user turn.
Changes:

- Move `addCacheBreakpoints(result.messages)` earlier so the background summarizer always receives messages with deterministic breakpoint placement.
- Apply `ToolCallingLoop.validateToolMessagesCore(...)` (after stripping internal tool call IDs) to the background summarizer's forked messages to filter orphaned tool results and keep message arrays aligned with the main call.
- Preserve the rendered last user message on the current turn earlier in the post-render flow via `RenderedUserMessageMetadata`.
| File | Description |
|---|---|
| extensions/copilot/src/extension/intents/node/agentIntent.ts | Align background inline summarization message processing with main agent pipeline and ensure cache breakpoints are applied before forking messages. |
Copilot's findings
- Files reviewed: 1/1 changed files
- Comments generated: 1
```typescript
const strippedMainMessages = ToolCallingLoop.validateToolMessagesCore(
	ToolCallingLoop.stripInternalToolCallIds(mainRenderMessages),
).messages;
```
In the inline background summarization path, validateToolMessagesCore is called without the stripOrphanedToolCalls option. The main agent fetch path applies applyMessagePostProcessing(..., { stripOrphanedToolCalls: isGeminiFamily(endpoint) }), so for Gemini endpoints the background summarizer can keep orphaned toolCalls on assistant messages. That can (a) re-diverge the message prefix again (hurting cache parity) and (b) cause Gemini 400s due to missing 1:1 tool_call ↔ tool_result pairing. Consider passing { stripOrphanedToolCalls: isGeminiFamily(this.endpoint) } here (and importing isGeminiFamily) to match the main loop’s post-processing exactly.
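The reviewer's concern can be sketched as follows. The message shapes and the stripping logic here are simplified stand-ins, not the repo's real `validateToolMessagesCore` implementation, assuming Gemini's requirement that assistant tool calls and tool results pair 1:1:

```typescript
// Simplified message shapes -- stand-ins, not the repo's real types.
type ToolCall = { id: string };
type Msg =
  | { role: 'assistant'; content: string; toolCalls?: ToolCall[] }
  | { role: 'tool'; toolCallId: string; content: string }
  | { role: 'user' | 'system'; content: string };

// Sketch of what a stripOrphanedToolCalls pass might do: drop assistant
// tool calls that have no matching tool-result message, since Gemini
// rejects requests where the pairing is not 1:1.
function stripOrphanedToolCalls(messages: Msg[]): Msg[] {
  const resultIds = new Set(
    messages.flatMap(m => (m.role === 'tool' ? [m.toolCallId] : []))
  );
  return messages.map(m =>
    m.role === 'assistant' && m.toolCalls
      ? { ...m, toolCalls: m.toolCalls.filter(c => resultIds.has(c.id)) }
      : m
  );
}
```

Skipping a pass like this on one side but not the other is exactly the kind of asymmetry that re-diverges the prefix.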
Problem
The background inline summarization call gets 0% prompt cache hit rate on 2nd+ summarizations within the same turn.
Root Cause
The background summarizer forks messages from the main render but applies different post-processing than the tool calling loop:
- Main agent call: `stripInternalToolCallIds` + `validateToolMessages` (filters orphaned tool results that lack matching assistant tool calls)
- Background summarizer: `stripInternalToolCallIds` only (no validation)

After a summarization is applied, prompt-tsx re-renders the conversation with summarized history. Tool results that referenced tool calls from the now-summarized rounds become orphaned. The main call filters these out via `validateToolMessages`, but the background summarizer keeps them. This causes the message arrays to diverge, breaking prefix-based prompt caching (Anthropic `cache_control`).

Why it only affects 2nd+ summarizations

The 1st summarization creates the orphaned tool results. The 2nd summarization's forked copy includes them while the main call filters them out, so the two message arrays first diverge on the 2nd summarization within the same turn.
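A minimal sketch of the divergence (simplified message shapes, not the repo's types): orphaned tool results kept on only one side collapse the shared prefix that Anthropic-style prefix caching matches on.

```typescript
// Illustrative only: trimmed-down message shape.
type Msg = { role: string; content: string; toolCallId?: string; toolCalls?: { id: string }[] };

// Roughly what validateToolMessages does, per the PR description: drop tool
// results whose toolCallId has no matching assistant tool call.
function dropOrphanedToolResults(messages: Msg[]): Msg[] {
  const callIds = new Set(messages.flatMap(m => (m.toolCalls ?? []).map(c => c.id)));
  return messages.filter(m => m.role !== 'tool' || callIds.has(m.toolCallId!));
}

// Prefix-based prompt caching only reuses the identical leading messages.
function sharedPrefixLength(a: Msg[], b: Msg[]): number {
  let i = 0;
  while (i < a.length && i < b.length && JSON.stringify(a[i]) === JSON.stringify(b[i])) i++;
  return i;
}
```

If only the main call filters the re-rendered history, an orphan near the front of the fork's copy limits the shared prefix to the messages before it; filtering both sides restores full overlap.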
Evidence
Analyzed multiple conversations across internal telemetry (`CopilotChatEvents` / `engine.messages.length`) and client-side exports (`panel.request` / `summarizedconversationhistory`):

- `engine.messages.length` data
- `promptcachetokencount` from `summarizedconversationhistory` telemetry events shows the pattern clearly

Fix
Apply `validateToolMessagesCore` to the forked messages in the background summarizer, matching the main call's processing pipeline. This ensures orphaned tool messages are filtered identically.

Move `addCacheBreakpoints()` to run before `_startBackgroundSummarization` for deterministic breakpoint ordering (code clarity).
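The ordering point can be illustrated with a toy `addCacheBreakpoints` (the real placement heuristic lives in the repo; this hypothetical version just tags the last message Anthropic-style): applying breakpoints before the fork guarantees the main call and the background summarizer see identical annotated messages.

```typescript
// Toy stand-in for the repo's addCacheBreakpoints: annotate the last message
// with an Anthropic-style cache_control marker. Mutates in place.
type Msg = { role: string; content: string; cache_control?: { type: 'ephemeral' } };

function addCacheBreakpoints(messages: Msg[]): void {
  if (messages.length > 0) {
    messages[messages.length - 1].cache_control = { type: 'ephemeral' };
  }
}

// Breakpoints applied before forking are shared by both consumers.
function forkAfterBreakpoints(messages: Msg[]): { main: Msg[]; fork: Msg[] } {
  addCacheBreakpoints(messages);
  return { main: messages, fork: structuredClone(messages) };
}
```

Had the fork happened first, a later in-place `addCacheBreakpoints` on the main array would leave the fork without the annotations.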
For a ~90K-token summarization call, going from 0% to ~90%+ cache hit means ~80K fewer uncached input tokens per summarization event. This adds up significantly for long agentic conversations that trigger multiple summarizations per turn.