feat(sight): observe-only LLM cache-hit shadow analyzer #666
Open
jfeng18 wants to merge 2 commits into
Hi, this PR adds an observe-only LLM cache-hit shadow analyzer (no behavior change, just data collection via GenAIExporter). 918 lines + 16 tests, rebased onto latest main. Would appreciate a review when you get a chance; it unblocks Phase 1 cache proxy evaluation.
Observe-only GenAIExporter: measures would-be LLM cache hits and key precision on real traffic, never serves. Key = provider+model+canonical request body (strips injected [timestamp] user prefixes, normalizes server-random tool-call ids to positional aliases). On a would-be hit it compares a structured answer fingerprint to the stored baseline to report a false-hit rate; token/latency savings are credited only on byte-identical answers. The report leads with hit-rate-among-all-calls and carries explicit caveats (temperature=0 is not deterministic; normalization is partial; eviction makes the rate a lower bound). Dead code until wired in (next commit). 21 unit tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Register the shadow analyzer as a GenAI exporter and spawn its periodic reporter when --enable-cache-analysis (config enable_cache_analysis, default off) is set; write the final report on shutdown. CLI trace path only; not wired into the FFI event path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
What
Phase 0 of an LLM/MaaS response-cache effort: an opt-in, observe-only analyzer (`cache-shadow`, a `GenAIExporter`) that measures, on real observed traffic, how much an exact-match LLM response cache would save and how trustworthy the cache key is. It never serves and never changes agent behaviour. It answers "is a cache worth building, and is our cache key correct?" before any serving proxy is built, and the key/fingerprint engine carries forward into that proxy.

How it works
- [...] `temperature == 0` (missing temperature is treated as non-deterministic) or an explicit opt-in marker.
- [...] `provider + model + canonical(request raw_body)`. Canonicalization strips request-volatile fields, sorts object keys, normalizes number forms, strips the per-call `[timestamp]` prefix agents inject into user messages, and rewrites server-random tool-call ids (`call_*`/`toolu_*`) to positional aliases so identical multi-turn replays hash the same.
- [...] `id`/`created`/`system_fingerprint`) so deterministic answers aren't falsely flagged as divergent.

Opt-in
Off by default. Enable with `--enable-cache-analysis` (CLI `trace`). Not wired into the FFI event path.

Testing
- [...] `llmParamString` nesting), key determinism + false-hit/false-miss cases (reordered keys, number forms, `stream`/`user` exclusion, `[timestamp]` collapse, tool-call id collapse, tool-name divergence), structured fingerprint (text vs tool-call no longer collide), envelope-stripped fallback, and the token-savings response-body fallback. Each of the 2 commits compiles independently.
- [...] `--enable-cache-analysis`, the analyzer registers, observes real captured LLM calls, computes keys, detects request-level would-be hits, and writes the report on shutdown.
- [...] `tcpsniff` path, which (for this client pattern) does not capture HTTP responses into the `LLMCall`, so the response-side metrics (false-hit divergence, token/latency savings) were exercised by the unit tests but not end-to-end. On the SSL-sniffed real-agent path (AgentSight's primary mode) responses are captured, so those metrics populate there.

Independent of #661–#665 (branched from `main`); only textual proximity in `config.rs`/`trace.rs`/`unified.rs` with the other open PRs.

This change went through an adversarial self-review; the confirmed correctness findings (tool-call id normalization, structured fingerprint, envelope-stripped fallback, token-usage fallback) are folded into the commits above.
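
For reviewers, the eligibility gate and key derivation described under "How it works" can be pictured with the minimal, std-only Rust sketch below. It is not the code in this PR. The `Json` type, `obj`, `is_cache_eligible`, `cache_key`, the `tool_N` alias format, the "strip any leading bracketed token" timestamp rule, and the use of `DefaultHasher` are all assumptions made for illustration; a real implementation would presumably use a proper JSON type and a stable hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{BTreeMap, HashMap};
use std::hash::{Hash, Hasher};

// Hypothetical stand-in for a real JSON value; BTreeMap keeps object keys sorted.
#[derive(Clone, Debug, PartialEq)]
enum Json {
    Null,
    Bool(bool),
    Num(f64),
    Str(String),
    Arr(Vec<Json>),
    Obj(BTreeMap<String, Json>),
}

// Assumed request-volatile top-level fields (the tests mention `stream` and `user`).
const VOLATILE_KEYS: [&str; 2] = ["stream", "user"];

// Convenience constructor for objects (illustration only).
fn obj(pairs: Vec<(&str, Json)>) -> Json {
    Json::Obj(pairs.into_iter().map(|(k, v)| (k.to_string(), v)).collect())
}

// Eligible only at temperature 0 or with an explicit opt-in; a missing
// temperature counts as non-deterministic.
fn is_cache_eligible(temperature: Option<f64>, opt_in: bool) -> bool {
    opt_in || temperature == Some(0.0)
}

// Assumption: the injected prefix is a single leading "[...]" token.
fn strip_timestamp_prefix(s: &str) -> &str {
    if s.starts_with('[') {
        if let Some(end) = s.find(']') {
            return s[end + 1..].trim_start();
        }
    }
    s
}

// Map server-random tool-call ids to aliases in order of first appearance.
fn alias_tool_id(s: &str, ids: &mut HashMap<String, String>) -> Option<String> {
    if s.starts_with("call_") || s.starts_with("toolu_") {
        let next = format!("tool_{}", ids.len());
        Some(ids.entry(s.to_string()).or_insert(next).clone())
    } else {
        None
    }
}

// Serialize into one canonical text form: sorted keys, one number format.
fn write_canonical(v: &Json, ids: &mut HashMap<String, String>, out: &mut String) {
    match v {
        Json::Null => out.push_str("null"),
        Json::Bool(b) => out.push_str(if *b { "true" } else { "false" }),
        Json::Num(n) => out.push_str(&format!("{}", n)), // 0 and 0.0 both print as "0"
        Json::Str(s) => {
            let text = alias_tool_id(s, ids).unwrap_or_else(|| s.clone());
            out.push_str(&format!("{:?}", text));
        }
        Json::Arr(items) => {
            out.push('[');
            for (i, item) in items.iter().enumerate() {
                if i > 0 {
                    out.push(',');
                }
                write_canonical(item, ids, out);
            }
            out.push(']');
        }
        Json::Obj(map) => {
            let is_user = map.get("role") == Some(&Json::Str("user".into()));
            out.push('{');
            for (i, (k, val)) in map.iter().enumerate() {
                if i > 0 {
                    out.push(',');
                }
                out.push_str(&format!("{:?}:", k));
                match val {
                    // Only user-message content has its timestamp prefix stripped.
                    Json::Str(s) if is_user && k.as_str() == "content" => {
                        out.push_str(&format!("{:?}", strip_timestamp_prefix(s)));
                    }
                    _ => write_canonical(val, ids, out),
                }
            }
            out.push('}');
        }
    }
}

// Key = provider + model + canonical(request body), hashed.
fn cache_key(provider: &str, model: &str, body: &Json) -> u64 {
    let mut top = body.clone();
    if let Json::Obj(map) = &mut top {
        map.retain(|k, _| !VOLATILE_KEYS.contains(&k.as_str()));
    }
    let mut canon = String::new();
    write_canonical(&top, &mut HashMap::new(), &mut canon);
    let mut hasher = DefaultHasher::new();
    (provider, model, canon).hash(&mut hasher);
    hasher.finish()
}
```

Under these assumptions, two requests that differ only in the `stream` flag, the injected `[timestamp]` prefix, and the server-assigned tool-call id map to the same key, while changing the provider or model yields a different canonical input and therefore (barring a hash collision) a different key.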