feat(sight): observe-only LLM cache-hit shadow analyzer #666

Open
jfeng18 wants to merge 2 commits into alibaba:main from jfeng18:feat/cache-shadow-analysis

jfeng18 commented May 31, 2026

What

Phase 0 of an LLM/MaaS response-cache effort: an opt-in, observe-only analyzer (cache-shadow, a GenAIExporter) that measures — on real observed traffic — how much an exact-match LLM response cache would save and how trustworthy the cache key is. It never serves and never changes agent behaviour. It answers "is a cache worth building, and is our cache key correct?" before any serving proxy is built, and the key/fingerprint engine carries forward into that proxy.

How it works

  • Cacheability gate: a call counts only if deterministic — temperature == 0 (missing temperature is treated as non-deterministic) or an explicit opt-in marker.
  • Cache key: provider + model + canonical(request raw_body). Canonicalization strips request-volatile fields, sorts object keys, normalizes number forms, strips the per-call [timestamp] prefix agents inject into user messages, and rewrites server-random tool-call ids (call_*/toolu_*) to positional aliases so identical multi-turn replays hash the same.
  • Key-precision self-check: on a would-be hit, a structured (kind-tagged, role/message-delimited) fingerprint of the answer is compared to the stored baseline to report a false-hit rate — the core Phase-0 signal. The fingerprint's raw-body fallback strips the volatile response envelope (id/created/system_fingerprint) so deterministic answers aren't falsely flagged as divergent.
  • Savings are credited only for byte-identical answers; the report leads with hit-rate-among-all-calls and embeds explicit caveats (temperature=0 is not guaranteed deterministic; normalization is partial; table eviction makes the rate a lower bound). Persisted as JSON under the storage dir, with a periodic reporter thread and a final report on shutdown.
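The gate and cache-key steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the `Json` enum, the `obj` helper, the `VOLATILE` field list, the assumed `[timestamp]` shape, the id-like key names, and the `tool_call_{n}` alias format are all hypothetical, and `DefaultHasher` stands in for whatever stable hash the analyzer actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{BTreeMap, HashMap};
use std::hash::{Hash, Hasher};

/// Minimal JSON-like value (illustrative); BTreeMap keeps object keys sorted.
#[derive(Clone, Debug, PartialEq)]
enum Json {
    Null,
    Bool(bool),
    Num(f64),
    Str(String),
    Arr(Vec<Json>),
    Obj(BTreeMap<String, Json>),
}

/// Hypothetical convenience constructor for objects.
fn obj(pairs: Vec<(&str, Json)>) -> Json {
    Json::Obj(pairs.into_iter().map(|(k, v)| (k.to_string(), v)).collect())
}

/// Gate: temperature == 0 or an explicit opt-in; a missing temperature fails.
fn is_cacheable(temperature: Option<f64>, opt_in: bool) -> bool {
    opt_in || temperature == Some(0.0)
}

/// Assumed request-volatile fields that must not affect the key.
const VOLATILE: [&str; 2] = ["stream", "user"];

/// Strip a leading "[<timestamp>] " prefix (assumed shape: digits and -:.TZ+ only).
fn strip_timestamp_prefix(s: &str) -> &str {
    if let Some(end) = s.find(']') {
        let inner = s.get(1..end).unwrap_or("");
        let stamp = |c: char| c.is_ascii_digit() || "-:.TZ+ ".contains(c);
        if s.starts_with('[') && !inner.is_empty() && inner.chars().all(stamp) {
            return s[end + 1..].trim_start();
        }
    }
    s
}

/// Server-random ids: id-like key plus a call_ / toolu_ prefix.
fn is_tool_call_id(key: &str, value: &str) -> bool {
    matches!(key, "id" | "tool_call_id" | "tool_use_id")
        && (value.starts_with("call_") || value.starts_with("toolu_"))
}

fn canonicalize(v: &Json, key: &str, aliases: &mut HashMap<String, usize>) -> Json {
    match v {
        Json::Str(s) if is_tool_call_id(key, s) => {
            // Positional alias: the n-th distinct id seen becomes "tool_call_n".
            let next = aliases.len();
            let n = *aliases.entry(s.clone()).or_insert(next);
            Json::Str(format!("tool_call_{n}"))
        }
        Json::Str(s) if key == "content" => Json::Str(strip_timestamp_prefix(s).to_string()),
        Json::Arr(items) => {
            Json::Arr(items.iter().map(|i| canonicalize(i, key, aliases)).collect())
        }
        Json::Obj(map) => Json::Obj(
            map.iter()
                .filter(|(k, _)| !VOLATILE.contains(&k.as_str()))
                .map(|(k, v)| (k.clone(), canonicalize(v, k, aliases)))
                .collect(),
        ),
        other => other.clone(),
    }
}

/// Serialize deterministically (Debug-quoted strings; not strict JSON).
fn render(v: &Json, out: &mut String) {
    match v {
        Json::Null => out.push_str("null"),
        Json::Bool(b) => out.push_str(&b.to_string()),
        Json::Num(n) => out.push_str(&n.to_string()), // 1 and 1.0 print alike
        Json::Str(s) => out.push_str(&format!("{s:?}")),
        Json::Arr(items) => {
            out.push('[');
            for i in items { render(i, out); out.push(','); }
            out.push(']');
        }
        Json::Obj(map) => {
            out.push('{');
            for (k, v) in map { out.push_str(&format!("{k:?}:")); render(v, out); out.push(','); }
            out.push('}');
        }
    }
}

/// Key = hash(provider, model, canonical body). DefaultHasher is a stand-in.
fn cache_key(provider: &str, model: &str, body: &Json) -> u64 {
    let mut text = String::new();
    render(&canonicalize(body, "", &mut HashMap::new()), &mut text);
    let mut hasher = DefaultHasher::new();
    (provider, model, text).hash(&mut hasher);
    hasher.finish()
}
```

Under these assumptions, two replays of the same conversation that differ only in the injected timestamp, the server-generated tool-call id, or the presence of `stream` produce the same key, while a change of model or message text produces a different one. The sketch strips the prefix from every string under a `content` key rather than from user messages only, which is a simplification.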

Opt-in

Off by default. Enable with --enable-cache-analysis (config key enable_cache_analysis) on the CLI trace path. Not wired into the FFI event path.

Testing

  • 21 unit tests covering: the gate (temp/opt-in/sysom llmParamString nesting), key determinism + false-hit/false-miss cases (reordered keys, number forms, stream/user exclusion, [timestamp] collapse, tool-call id collapse, tool-name divergence), structured fingerprint (text vs tool-call no longer collide), envelope-stripped fallback, and the token-savings response-body fallback. Each of the 2 commits compiles independently.
  • E2E on the real binary (kernel 6.6.102): with --enable-cache-analysis, the analyzer registers, observes real captured LLM calls, computes keys, detects request-level would-be hits, and writes the report on shutdown.
  • Honest limitation: the synthetic E2E drove traffic via the plain-HTTP tcpsniff path, which (for this client pattern) does not capture HTTP responses into the LLMCall — so the response-side metrics (false-hit divergence, token/latency savings) were exercised by the unit tests but not end-to-end. On the SSL-sniffed real-agent path (AgentSight's primary mode) responses are captured, so those metrics populate there.
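To illustrate the response-side metrics that the unit tests above exercise (structured fingerprint, false-hit divergence, savings credited only on identical answers), here is a minimal sketch. The `Part` and `Shadow` types, their field names, and the fingerprint layout are assumptions made for illustration and are not taken from the PR.

```rust
use std::collections::HashMap;

/// One piece of a model answer, tagged by kind (hypothetical shape).
enum Part {
    Text { role: String, text: String },
    ToolCall { name: String, arguments: String },
}

/// Kind-tagged, length-prefixed fingerprint: a text part and a tool call
/// that carry the same characters still produce different strings.
fn fingerprint(parts: &[Part]) -> String {
    let mut out = String::new();
    for part in parts {
        let piece = match part {
            Part::Text { role, text } => format!("text|{role}|{}|{text}", text.len()),
            Part::ToolCall { name, arguments } => {
                format!("tool|{name}|{}|{arguments}", arguments.len())
            }
        };
        out.push_str(&piece);
        out.push('\n'); // message delimiter
    }
    out
}

/// Observe-only table: stores the first answer per key, never serves it.
#[derive(Default)]
struct Shadow {
    baseline: HashMap<u64, String>,
    calls: u64,
    hits: u64,
    false_hits: u64,
}

impl Shadow {
    fn observe(&mut self, key: u64, answer: &[Part]) {
        self.calls += 1;
        let fp = fingerprint(answer);
        // Some(true) = would-be hit whose answer diverged from the baseline.
        let diverged = self.baseline.get(&key).map(|stored| *stored != fp);
        match diverged {
            Some(d) => {
                self.hits += 1;
                if d {
                    self.false_hits += 1;
                }
            }
            None => {
                self.baseline.insert(key, fp);
            }
        }
    }

    /// Share of would-be hits whose answer differed from the stored baseline.
    fn false_hit_rate(&self) -> Option<f64> {
        if self.hits == 0 {
            None
        } else {
            Some(self.false_hits as f64 / self.hits as f64)
        }
    }

    /// Only hits with an identical answer count toward savings.
    fn credited_hits(&self) -> u64 {
        self.hits - self.false_hits
    }
}
```

In this sketch the hit rate among all calls would be `hits / calls`, and a key whose repeated answers diverge raises `false_hits` without ever being credited as a saving.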

Independent of #661#665 (branched from main); the only overlap with the other open PRs is textual proximity in config.rs/trace.rs/unified.rs.

This change went through an adversarial self-review; the confirmed correctness findings (tool-call id normalization, structured fingerprint, envelope-stripped fallback, token-usage fallback) are folded into the commits above.

jfeng18 requested a review from chengshuyi as a code owner May 31, 2026 11:16
github-actions (bot) added the component:sight (src/agentsight/) label May 31, 2026
jfeng18 force-pushed the feat/cache-shadow-analysis branch from 23f7cb5 to 77fbc92 June 3, 2026 11:19
jfeng18 commented Jun 3, 2026

Hi — this PR adds an observe-only LLM cache-hit shadow analyzer (no behavior change, just data collection via GenAIExporter). 918 lines + 16 tests, rebased onto latest main. Would appreciate a review when you get a chance — it unblocks Phase 1 cache proxy evaluation.

jfeng18 force-pushed the feat/cache-shadow-analysis branch 2 times, most recently from e912ad5 to b6af4d3 June 6, 2026 10:22
jfeng18 and others added 2 commits June 10, 2026 11:19
Observe-only GenAIExporter: measures would-be LLM cache hits and key
precision on real traffic, never serves. Key = provider+model+canonical
request body (strips injected [timestamp] user prefixes, normalizes
server-random tool-call ids to positional aliases). On a would-be hit it
compares a structured answer fingerprint to the stored baseline to report a
false-hit rate; token/latency savings are credited only on byte-identical
answers. The report leads with hit-rate-among-all-calls and carries explicit
caveats (temperature=0 is not guaranteed deterministic; normalization is partial;
eviction makes the rate a lower bound).

Dead code until wired in (next commit). 21 unit tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Register the shadow analyzer as a GenAI exporter and spawn its periodic
reporter when --enable-cache-analysis (config enable_cache_analysis,
default off) is set; write the final report on shutdown. CLI trace path
only; not wired into the FFI event path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
jfeng18 force-pushed the feat/cache-shadow-analysis branch from b6af4d3 to e7af367 June 10, 2026 03:22