
Research: SOTA caching to reduce token consumption (build on existing, then explore semantic/embeddings) #900

@nickna


Research Task: SOTA Caching to Reduce Token Consumption

Type: Research / investigation (spike before implementation)
Goal: Establish state-of-the-art caching in Conduit to reduce token consumption and cost across providers. Build on the infrastructure we already have first, then go deeper into a semantic/embeddings-based caching system.

This is intentionally scoped as a learning task — we want to understand the full design space of LLM gateway caching before committing to an implementation, then phase the build.


Background — what we already have

An audit of the codebase found that most caching machinery exists but is largely unwired:

  • Provider-native prompt caching (partial): Shared/ConduitLLM.Core/Decorators/PromptCachingLLMClient.cs injects Anthropic-style cache_control breakpoints. Response mappers already extract cache_read_input_tokens / cache_creation_input_tokens into Usage.CachedInputTokens / Usage.CachedWriteTokens (OpenAICompatibleClient.Mapping.cs), and cost savings are computed in CostCalculationService.CacheSavings.cs. The mapping handles OpenAI cached_tokens, DeepSeek hit/miss fields, and Anthropic-via-OpenRouter.
  • Exact-match response caching (built, NOT wired in): Shared/ConduitLLM.Core/Caching/CachingLLMClient.cs SHA256-hashes the request, but AddLLMCaching() is never called, so it is absent from the decorator chain. Only handles non-streaming requests.
  • Semantic caching (does not exist): RedisEmbeddingCache only caches explicit embedding requests, not chat completions.
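To make the usage mapping above concrete, here is a minimal Python sketch of normalizing provider usage payloads into the two counters. It is not the code in OpenAICompatibleClient.Mapping.cs. The JSON field names are assumptions based on the providers' public response formats (prompt_tokens_details.cached_tokens for OpenAI, prompt_cache_hit_tokens for DeepSeek), and CachedUsage / extract_cached_usage are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CachedUsage:
    cached_input_tokens: int = 0  # prompt tokens served from the provider's cache
    cached_write_tokens: int = 0  # prompt tokens written to the cache (Anthropic-style)


def extract_cached_usage(usage: dict) -> CachedUsage:
    """Normalize a provider 'usage' object into cache read/write token counts."""
    # Anthropic-style fields (also seen when Claude is reached through a proxy).
    if "cache_read_input_tokens" in usage or "cache_creation_input_tokens" in usage:
        return CachedUsage(
            usage.get("cache_read_input_tokens") or 0,
            usage.get("cache_creation_input_tokens") or 0,
        )
    # DeepSeek-style hit/miss fields (assumed names); only hits count as cached reads.
    if "prompt_cache_hit_tokens" in usage:
        return CachedUsage(usage.get("prompt_cache_hit_tokens") or 0)
    # OpenAI-style nested details object.
    details = usage.get("prompt_tokens_details") or {}
    return CachedUsage(details.get("cached_tokens") or 0)
```

A payload with none of these fields maps to zero for both counters, which matches providers that expose no prompt-caching information.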

Caveat — provider lineup

ProviderType has no native Anthropic provider; Claude is only reachable via OpenRouter. Prompt-caching support is therefore uneven:

  • OpenAI / compatible: automatic prefix caching, savings surface in cached_tokens (already parsed).
  • Anthropic via OpenRouter: requires explicit cache_control injection (max 4 breakpoints).
  • Groq / Cerebras / SambaNova / Cloudflare / etc.: mostly no prompt-caching API — provider-native caching does not apply; response caching is the only lever.
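As a rough illustration of the cache_control injection mentioned for Anthropic via OpenRouter, the Python sketch below marks large text messages as cache breakpoints while respecting the 4-breakpoint limit. It is not PromptCachingLLMClient's actual logic: the min_chars heuristic and the function name are hypothetical, and the {"type": "ephemeral"} marker is assumed from Anthropic's public message format.

```python
import copy

MAX_BREAKPOINTS = 4  # limit stated above for Anthropic via OpenRouter
CACHE_MARK = {"type": "ephemeral"}  # assumed Anthropic-style marker


def inject_cache_breakpoints(messages, min_chars=2000):
    """Return a copy of messages with cache_control set on large text messages."""
    out = copy.deepcopy(messages)  # never mutate the caller's request
    # Breakpoints the client already placed count against the limit.
    existing = sum(
        1
        for m in out
        if isinstance(m.get("content"), list)
        for part in m["content"]
        if isinstance(part, dict) and "cache_control" in part
    )
    budget = MAX_BREAKPOINTS - existing
    for msg in out:
        if budget <= 0:
            break
        content = msg.get("content")
        # Only plain-string content is rewritten; structured content is left as sent.
        if isinstance(content, str) and len(content) >= min_chars:
            msg["content"] = [
                {"type": "text", "text": content, "cache_control": dict(CACHE_MARK)}
            ]
            budget -= 1
    return out
```

Leaving client-supplied content parts untouched is deliberate here: it keeps the sketch compatible with an explicit passthrough where callers place their own breakpoints.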

The three caching strategies (to research & document)

| Type | How it saves tokens | Hit condition | Risk |
| --- | --- | --- | --- |
| 1. Provider-native prompt caching | Provider charges ~0.1x for cached prefix tokens | Identical prefix within TTL (5 min / 1 hr) | None (provider-managed) |
| 2. Exact-match response caching | Skips the LLM call entirely (zero output tokens) | Byte-identical request (model + messages + params) | Stale responses; wrong for non-deterministic use |
| 3. Semantic response caching | Skips the LLM call for similar (not identical) requests | Embedding similarity above threshold | False hits return wrong answers |
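To put the ~0.1x figure for provider-native prompt caching in perspective, here is a small worked calculation in Python. The price and token counts are made-up inputs, the function name is hypothetical, and any cache-write surcharge a provider may apply is not modeled.

```python
def prompt_cache_savings(cached_tokens, input_price_per_mtok, cached_multiplier=0.1):
    """Dollars saved when cached_tokens are billed at a discounted rate.

    cached_multiplier=0.1 mirrors the ~0.1x figure above; real rates vary by provider.
    """
    full_cost = cached_tokens * input_price_per_mtok / 1_000_000
    discounted_cost = full_cost * cached_multiplier
    return full_cost - discounted_cost


# Hypothetical request: 8,000 cached prefix tokens at $3.00 per million input tokens.
# Full price would be $0.024; at 0.1x it is $0.0024, saving about $0.0216.
saving = prompt_cache_savings(8_000, 3.00)
```

By contrast, an exact-match or semantic response-cache hit avoids the whole call, so the saving is the full input plus output cost of that request.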

Phase 1 — Build on what we have (lower risk, higher confidence)

  • Wire up exact-match response caching (CachingLLMClient). Insert into the decorator chain in DatabaseAwareLLMClientFactory.cs (~line 241) and call AddLLMCaching() at startup. Gate behind the existing LLMCachingEnabled global setting and a per-virtual-key opt-in. Skip when temperature > 0 unless the caller explicitly opts in. It must never be on by default globally.
  • Verify cache-key correctness. The SHA256 key must include the model, all messages, and every output-affecting parameter (temperature, top_p, tools, response_format, max_tokens), including ChatCompletionRequest.ExtensionData. Any field missing from the key means two different requests can collide and the cache serves the wrong response.
  • Finish provider-native prompt caching. Confirm PromptCachingLLMClient is actually in the chain (referenced at DatabaseAwareLLMClientFactory.cs:228). Add an explicit cache_control passthrough field on ChatCompletionRequest so clients can place their own breakpoints. Surface cache-hit metrics in WebAdmin.
  • Decide on streaming. CachingLLMClient currently handles only non-streaming requests, so streaming requests bypass the cache entirely. Investigate caching the accumulated final result after a stream completes and replaying it as a synthetic stream on hits.
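The cache-key and gating rules in the Phase 1 list can be sketched as follows. This is a minimal Python illustration under stated assumptions, not CachingLLMClient's implementation: the request is modeled as a plain dict, cache_key and is_cacheable are hypothetical names, and a missing temperature is assumed to mean a non-zero provider default.

```python
import hashlib
import json


def cache_key(request: dict) -> str:
    """SHA256 over a canonical JSON form of the whole request.

    Hashing every field (rather than an allow-list) means newly added or
    extension parameters change the key by construction.
    """
    canonical = json.dumps(
        request,
        sort_keys=True,          # object key order must not affect the key
        separators=(",", ":"),   # no whitespace differences
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_cacheable(request: dict, allow_nondeterministic: bool = False) -> bool:
    """Apply the gating rules: no streaming, temperature 0 unless opted in."""
    if request.get("stream"):
        return False  # streaming is out of scope for this sketch
    # Assumption: an omitted temperature means the provider default, which is > 0.
    temperature = request.get("temperature", 1.0)
    return temperature == 0 or allow_nondeterministic
```

Sorting keys only normalizes object key order. The order of the messages list is preserved, which is what we want, since reordering messages changes the request.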

Phase 2 — Go deeper: semantic / embeddings-based caching

  • Research semantic cache designs (embedding model choice, similarity threshold tuning, vector store — extend RedisEmbeddingCache vs. dedicated vector DB).
  • Define correctness guardrails (false-positive mitigation; which workloads it is safe for — FAQ/support vs. general chat).
  • Prototype on a constrained, opt-in workload before any general rollout.
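For orientation, the core of a semantic cache lookup can be sketched in a few lines of Python. Everything here is illustrative: the class and method names are hypothetical, the 0.95 threshold is an arbitrary starting point rather than a recommendation, the linear scan stands in for a real vector index, and embeddings are assumed to be computed elsewhere.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self._entries = []  # list of (scope, embedding, response)

    def put(self, scope, embedding, response):
        self._entries.append((scope, list(embedding), response))

    def get(self, scope, embedding):
        """Return the most similar cached response within scope, or None."""
        best_score, best_response = 0.0, None
        for entry_scope, entry_embedding, response in self._entries:
            if entry_scope != scope:
                continue  # guardrail: never share answers across scopes
            score = cosine(embedding, entry_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

The scope argument is one possible guardrail against false hits: keying it on something like model plus system prompt would keep a similar-looking question asked in a different context from matching.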

Research questions / SOTA survey

  • Survey SOTA approaches: GPTCache-style semantic caching, prefix/KV-cache sharing, prompt-caching across the provider matrix, dedup of in-flight identical requests.
  • How do other gateways (LiteLLM, Portkey, Helicone, Cloudflare AI Gateway) implement caching? What knobs do they expose?
  • Cache invalidation & TTL strategy per cache type.
  • Observability: a cache-effectiveness dashboard (hit rate, tokens saved, cost saved) in WebAdmin.
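Of the approaches listed in the survey, dedup of in-flight identical requests is easy to sketch. The Python example below shows the general "single-flight" pattern with asyncio. The SingleFlight class is hypothetical and is not taken from Conduit or from any of the gateways named above.

```python
import asyncio


class SingleFlight:
    """Coalesce concurrent calls that share a key into one upstream call."""

    def __init__(self):
        self._inflight = {}  # key -> asyncio.Task for the call in progress

    async def run(self, key, call):
        task = self._inflight.get(key)
        if task is None:
            # First caller for this key starts the upstream call.
            task = asyncio.ensure_future(call())
            self._inflight[key] = task
            # Forget the key once the call finishes, whether it succeeded or failed.
            task.add_done_callback(lambda _done: self._inflight.pop(key, None))
        # shield() keeps one cancelled waiter from cancelling the shared call.
        return await asyncio.shield(task)
```

Every waiter receives the same result, or the same exception if the upstream call fails. Unlike a response cache, nothing is retained after the call completes, so there is no staleness risk. The key could be the same request hash used for exact-match caching.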

Key files

  • Shared/ConduitLLM.Core/Caching/CachingLLMClient.cs
  • Shared/ConduitLLM.Core/Caching/CachingServiceExtensions.cs
  • Shared/ConduitLLM.Core/Decorators/PromptCachingLLMClient.cs
  • Shared/ConduitLLM.Providers/DatabaseAwareLLMClientFactory.cs
  • Shared/ConduitLLM.Providers/Providers/OpenAICompatible/OpenAICompatibleClient.Mapping.cs
  • Shared/ConduitLLM.Core/Services/RedisEmbeddingCache.cs
  • Shared/ConduitLLM.Core/Services/CostCalculationService.CacheSavings.cs
  • Shared/ConduitLLM.Core/Models/Usage.cs

Out of scope (for now)

Implementation of Phase 2 — it depends on Phase 1 findings and the SOTA survey above.
