Fix inline summarization prompt cache misses #309701

Open
kevin-m-kent wants to merge 1 commit into microsoft:main from kevin-m-kent:kevin-m-kent/fix-inline-summarization-cache

Conversation

@kevin-m-kent (Contributor)

Problem

The background inline summarization call gets a 0% prompt cache hit rate on the 2nd and subsequent summarizations within the same turn.

Root Cause

The background summarizer forks messages from the main render but applies different post-processing than the tool calling loop:

  • Main agent call: stripInternalToolCallIds + validateToolMessages (filters orphaned tool results that lack matching assistant tool calls)
  • Background summarizer: stripInternalToolCallIds only (no validation)

After a summarization is applied, prompt-tsx re-renders the conversation with summarized history. Tool results that referenced tool calls from the now-summarized rounds become orphaned. The main call filters these out via validateToolMessages, but the background summarizer keeps them. This causes the message arrays to diverge, breaking prefix-based prompt caching (Anthropic cache_control).
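The filtering step at the heart of the root cause can be sketched as follows. This is a minimal illustration with an invented message shape and helper name; the real validateToolMessages in the Copilot codebase operates on richer types:

```typescript
// Hypothetical sketch: drop tool results whose originating assistant
// tool call is no longer present (e.g. it was summarized away).
type Message =
  | { role: 'assistant'; toolCalls?: { id: string }[] }
  | { role: 'tool'; toolCallId: string }
  | { role: 'user' | 'system'; content: string };

function filterOrphanedToolResults(messages: Message[]): Message[] {
  // Collect every tool-call id still issued by an assistant message.
  const knownIds = new Set<string>();
  for (const m of messages) {
    if (m.role === 'assistant') {
      for (const tc of m.toolCalls ?? []) knownIds.add(tc.id);
    }
  }
  // Keep tool results only when their matching tool call survives.
  return messages.filter(m => m.role !== 'tool' || knownIds.has(m.toolCallId));
}
```

The main agent call runs this kind of validation after stripping internal tool-call ids; the background summarizer skipped it, which is exactly where the two message arrays diverge.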

Why it only affects 2nd+ summarizations

  • 1st summarization: no prior summarization applied → no orphaned tool messages → messages match → good cache hit rate (65-98%)
  • 2nd+ summarization in the same turn: the prior summarization created orphaned messages → the summarizer keeps them while the main call filters them → messages diverge → 0% cache hit rate
  • Later summarizations after many rounds: orphaned messages from the prior summarization are no longer in the prompt (prompt-tsx trimmed them) → messages match again → good cache hit rate (98%)
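The pattern above follows from how prefix-based prompt caching works: only the longest identical leading run of messages is reusable, so a single divergent message early in the array invalidates everything after it. A toy sketch (invented helper, not the actual cache logic):

```typescript
// Toy illustration, not the real cache mechanism: prefix caching only
// reuses the longest identical leading run of the two message arrays.
function cachedPrefixLength(prev: string[], next: string[]): number {
  let n = 0;
  while (n < prev.length && n < next.length && prev[n] === next[n]) n++;
  return n;
}
```

When the main call filters an orphaned tool result that the summarizer keeps, the arrays diverge at that message and nothing after it is served from cache.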

Evidence

Analyzed multiple conversations across internal telemetry (CopilotChatEvents / engine.messages.length) and client-side exports (panel.request / summarizedconversationhistory):

  • Tools, system prompt, thinking config, and max window are all constant across calls — confirmed from engine.messages.length data
  • The 0% cache events consistently occur on the 2nd+ summarization within the same user turn
  • 1st summarizations consistently achieve 65-98% cache hit rates
  • The promptcachetokencount from summarizedconversationhistory telemetry events shows the pattern clearly

Fix

  1. Apply validateToolMessagesCore to the forked messages in the background summarizer, matching the main call's processing pipeline. This ensures orphaned tool messages are filtered identically.

  2. Move addCacheBreakpoints() to run before _startBackgroundSummarization for deterministic breakpoint ordering (code clarity).

Impact

For a ~90K-token summarization call, going from 0% to ~90%+ cache hit means ~80K fewer uncached input tokens per summarization event. This adds up significantly for long agentic conversations that trigger multiple summarizations per turn.
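The savings arithmetic can be checked directly. The post-fix hit rate of ~90% is the PR's estimate, not a measurement here:

```typescript
// Back-of-the-envelope savings using the figures in the description.
const promptTokens = 90_000;
const hitRateBefore = 0.0;  // 2nd+ summarization, pre-fix
const hitRateAfter = 0.9;   // post-fix estimate
const uncachedBefore = promptTokens * (1 - hitRateBefore); // 90,000
const uncachedAfter = promptTokens * (1 - hitRateAfter);   // ~9,000
const saved = Math.round(uncachedBefore - uncachedAfter);  // ~81,000, i.e. the "~80K" above
```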

The background inline summarizer forks messages from the main render
but applies different post-processing than the tool calling loop:

- Main agent call: stripInternalToolCallIds + validateToolMessages
  (filters orphaned tool results that lack matching assistant tool calls)
- Background summarizer: stripInternalToolCallIds only

After a summarization is applied, prompt-tsx re-renders the conversation
with summarized history. Tool results that referenced tool calls from
the now-summarized rounds become orphaned. The main call filters these
out via validateToolMessages, but the background summarizer keeps them.
This causes the message arrays to diverge, breaking prefix-based prompt
caching (e.g., Anthropic's cache_control).

The divergence specifically occurs on the 2nd+ summarization in the
same turn, because the 1st summarization creates the orphaned messages
that the 2nd summarization's forked copy includes but the main call
filters out. This explains the observed pattern of 0% cache hit rate
on 2nd+ summarizations while 1st summarizations get 65-98% hits.

Fix: apply validateToolMessagesCore to the forked messages in the
background summarizer, matching the main call's processing pipeline.

Also move addCacheBreakpoints() to run before
_startBackgroundSummarization to ensure cache breakpoint ordering is
deterministic (code clarity improvement).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 14, 2026 02:35

Copilot AI left a comment


Pull request overview

Fixes prompt cache misses in the inline background summarization flow by aligning its message post-processing and cache-breakpoint ordering with the main agent fetch pipeline. This lets Anthropic prefix-based prompt caching achieve high hit rates on 2nd+ summarizations within the same user turn.

Changes:

  • Move addCacheBreakpoints(result.messages) earlier so the background summarizer always receives messages with deterministic breakpoint placement.
  • Apply ToolCallingLoop.validateToolMessagesCore(...) (after stripping internal tool call IDs) to background summarizer forked messages to filter orphaned tool results and keep message arrays aligned with the main call.
  • Preserve the rendered last user message on the current turn earlier in the post-render flow via RenderedUserMessageMetadata.
Summary per file:

  • extensions/copilot/src/extension/intents/node/agentIntent.ts — Align background inline summarization message processing with the main agent pipeline and ensure cache breakpoints are applied before forking messages.

Copilot's findings

  • Files reviewed: 1/1 changed files
  • Comments generated: 1

Comment on lines +817 to +819:

    const strippedMainMessages = ToolCallingLoop.validateToolMessagesCore(
        ToolCallingLoop.stripInternalToolCallIds(mainRenderMessages),
    ).messages;

Copilot AI Apr 14, 2026


In the inline background summarization path, validateToolMessagesCore is called without the stripOrphanedToolCalls option. The main agent fetch path applies applyMessagePostProcessing(..., { stripOrphanedToolCalls: isGeminiFamily(endpoint) }), so for Gemini endpoints the background summarizer can keep orphaned toolCalls on assistant messages. That can (a) re-diverge the message prefix again (hurting cache parity) and (b) cause Gemini 400s due to missing 1:1 tool_call ↔ tool_result pairing. Consider passing { stripOrphanedToolCalls: isGeminiFamily(this.endpoint) } here (and importing isGeminiFamily) to match the main loop’s post-processing exactly.
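The reviewer's point can be illustrated with a minimal sketch (invented types and helper name): stripping orphaned toolCalls on assistant messages is the mirror image of filtering orphaned tool results, and Gemini's strict 1:1 tool_call ↔ tool_result pairing is why the option matters there.

```typescript
// Hypothetical sketch of the stripOrphanedToolCalls behavior: remove
// assistant toolCalls that have no matching tool result in the array.
type M =
  | { role: 'assistant'; toolCalls?: { id: string }[] }
  | { role: 'tool'; toolCallId: string };

function stripOrphanedToolCalls(messages: M[]): M[] {
  // Collect every tool-call id that still has a tool result.
  const resultIds = new Set(
    messages.flatMap(m => (m.role === 'tool' ? [m.toolCallId] : []))
  );
  return messages.map(m =>
    m.role === 'assistant' && m.toolCalls
      ? { ...m, toolCalls: m.toolCalls.filter(tc => resultIds.has(tc.id)) }
      : m
  );
}
```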


3 participants