Skip to content

Improve prefix cache inspection and warmup#1966

Open
wplll wants to merge 1 commit into
Hmbown:mainfrom
wplll:feat/cache-hit-optimization
Open

Improve prefix cache inspection and warmup#1966
wplll wants to merge 1 commit into
Hmbown:mainfrom
wplll:feat/cache-hit-optimization

Conversation

@wplll
Copy link
Copy Markdown

@wplll wplll commented May 24, 2026

Summary

Improve DeepSeek prefix-cache stability diagnostics and warmup correctness by making cache inspection and warmup keys account for real prompt layers, tool schemas, project context packs, and skills.

Motivation

Previous cache diagnostics could report incomplete or misleading cache state because /cache inspect did not include the real tool catalog, warmup keys depended on prior inspection state, and project context pack cache invalidation could miss prompt-relevant changes.

Changes

  • Include the real request tool catalog in /cache inspect.
  • Add tool_catalog_hash and stable_prefix_hash to prompt inspection output.
  • Add a complete CacheWarmupKey covering provider, model, base URL, stable prefix, tool catalog, project pack, and skills.
  • Compute warmup keys from current app/request state instead of last_cache_inspection.
  • Make Project Context Pack generation deterministic with sorted entries and sort-before-truncate behavior.
  • Improve Project Context Pack cache invalidation with a manifest hash covering included file paths and README excerpt content.
  • Skip wire-payload dedup for short tool results and keep dedup strictly wire-only.
  • Update memory documentation to describe memory/session goal content as dynamic.

Cache hit improvement rationale

The changes increase probability of prefix cache reuse by stabilizing prompt-layer ordering, tool catalog ordering, and Project Context Pack rendering. They also make cache misses more diagnosable by exposing stable prefix, full reusable prefix, tool catalog, project pack, and skills hashes.

Performance considerations

Project Context Pack generation now walks prompt-relevant entries to compute a reliable manifest hash, while reusing rendered packs when the manifest is unchanged. Short tool results bypass dedup hashing and map lookup entirely. Warmup does not introduce extra catalog/network requests beyond the explicit warmup model call.

Risks and rollback

Risk is limited to cache diagnostics, warmup behavior, project context pack rendering, and wire payload compaction. Rollback can revert the cache inspection/warmup and project pack changes independently if provider behavior differs in production.

Follow-up work

  • Optionally allow first-run /cache inspect to build the current tool catalog before any model turn.
  • Consider adding config file excerpts to Project Context Pack only if they are intentionally part of the prompt, with matching manifest hash coverage.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances prompt stability and cache observability by implementing deterministic sorting for tools, MCP resources, and diagnostics, alongside a manifest-based caching system for project context packs. It introduces a CacheWarmupKey to identify stable prefixes and expands the /cache inspect command with JSON and verbose output modes. Additionally, user memory and session goals were relocated to dynamic prompt sections to improve prefix cache hits. The review feedback highlights opportunities to simplify JSON serialization using more idiomatic error handling and suggests lowering the tool result deduplication threshold to optimize token usage for medium-sized outputs.

Comment on lines +426 to +427
let content =
serde_json::to_string(&values).unwrap_or_else(|_| Value::Array(values).to_string());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The unwrap_or_else block here is redundant and potentially misleading. serde_json::to_string for a Vec<Value> is extremely unlikely to fail, and if it does, Value::Array(values).to_string() (which uses the Display trait) will also trigger a panic. It is cleaner and more idiomatic to use .expect() with a descriptive message.

Suggested change
let content =
serde_json::to_string(&values).unwrap_or_else(|_| Value::Array(values).to_string());
let content = serde_json::to_string(&values).expect("tool catalog serialization failed");

Comment on lines +678 to +679
let content = serde_json::to_string(tools)
.unwrap_or_else(|_| Value::Array(tools.to_vec()).to_string());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the tool_catalog_hash function, this fallback is redundant and includes an unnecessary tools.to_vec() clone. Using .expect() is preferred here as it avoids the overhead and clarifies that serialization is expected to succeed for these types.

        let content = serde_json::to_string(tools).expect("tool catalog serialization failed");

Comment on lines +1014 to +1016
// Short content: skip dedup entirely — not worth the ref overhead.
// Do NOT compute hash or insert into seen_tool_results.
if original_chars <= TOOL_RESULT_SENT_CHAR_BUDGET {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The threshold for skipping tool result deduplication is set to the full truncation budget (12,000 characters). This is quite high; 12,000 characters (~3,000 tokens) is a significant amount of data to repeat in a request if the same result appears multiple times in the history. Since hashing and map lookup for a few kilobytes of data is extremely cheap, consider using a much lower threshold (e.g., 1,024 characters) to allow deduplication of medium-sized repeated results while still skipping it for truly trivial ones.

Suggested change
// Short content: skip dedup entirely — not worth the ref overhead.
// Do NOT compute hash or insert into seen_tool_results.
if original_chars <= TOOL_RESULT_SENT_CHAR_BUDGET {
// Very short content: skip dedup entirely — not worth the ref overhead.
// Do NOT compute hash or insert into seen_tool_results.
if original_chars <= 1024 {

@Hmbown
Copy link
Copy Markdown
Owner

Hmbown commented May 24, 2026

Thanks for sending this. It is directly relevant to #1965 and the cache-continuity work, but I am not pulling it into v0.8.42 stabilization: the diff is large, currently DIRTY, only GitGuardian has run, and it touches prompt/cache/project-context behavior that needs focused review and telemetry.

I am routing this to v0.8.47. The most useful next shape would be smaller slices: cache inspection truth surface, warmup key correctness, project-context determinism, and wire-payload dedup, each with targeted tests and before/after cache-hit evidence.

@Hmbown Hmbown added this to the v0.8.47 milestone May 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants