Research Task: SOTA Caching to Reduce Token Consumption
Type: Research / investigation (spike before implementation)
Goal: Establish state-of-the-art caching in Conduit to reduce token consumption and cost across providers. Build on the infrastructure we already have first, then go deeper into a semantic/embeddings-based caching system.
This is intentionally scoped as a learning task — we want to understand the full design space of LLM gateway caching before committing to an implementation, then phase the build.
Background — what we already have
An audit of the codebase found that most caching machinery exists but is largely unwired:
- Provider-native prompt caching (partial):
Shared/ConduitLLM.Core/Decorators/PromptCachingLLMClient.cs injects Anthropic-style cache_control breakpoints; response mappers already extract cache_read_input_tokens / cache_creation_input_tokens into Usage.CachedInputTokens / Usage.CachedWriteTokens (OpenAICompatibleClient.Mapping.cs). Cost savings computed in CostCalculationService.CacheSavings.cs. Handles OpenAI cached_tokens, DeepSeek hit/miss fields, and Anthropic-via-OpenRouter.
- Exact-match response caching (built, NOT wired in):
Shared/ConduitLLM.Core/Caching/CachingLLMClient.cs SHA256-hashes the request, but AddLLMCaching() is never called, so it is absent from the decorator chain. Only handles non-streaming requests.
- Semantic caching (does not exist):
RedisEmbeddingCache only caches explicit embedding requests, not chat completions.
Caveat — provider lineup
ProviderType has no native Anthropic provider; Claude is only reachable via OpenRouter. Prompt-caching support is therefore uneven:
- OpenAI / compatible: automatic prefix caching, savings surface in
cached_tokens (already parsed).
- Anthropic via OpenRouter: requires explicit
cache_control injection (max 4 breakpoints).
- Groq / Cerebras / SambaNova / Cloudflare / etc.: mostly no prompt-caching API — provider-native caching does not apply; response caching is the only lever.
The three caching strategies (to research & document)
| Type |
How it saves tokens |
Hit condition |
Risk |
| 1. Provider-native prompt caching |
Provider charges ~0.1x for cached prefix tokens |
Identical prefix within TTL (5 min / 1 hr) |
None — provider-managed |
| 2. Exact-match response caching |
Skips the LLM call entirely — zero output tokens |
Byte-identical request (model + messages + params) |
Stale responses; wrong for non-deterministic use |
| 3. Semantic response caching |
Skips the LLM call for similar (not identical) requests |
Embedding similarity above threshold |
False hits return wrong answers |
Phase 1 — Build on what we have (lower risk, higher confidence)
Phase 2 — Go deeper: semantic / embeddings-based caching
Research questions / SOTA survey
Key files
Shared/ConduitLLM.Core/Caching/CachingLLMClient.cs
Shared/ConduitLLM.Core/Caching/CachingServiceExtensions.cs
Shared/ConduitLLM.Core/Decorators/PromptCachingLLMClient.cs
Shared/ConduitLLM.Providers/DatabaseAwareLLMClientFactory.cs
Shared/ConduitLLM.Providers/Providers/OpenAICompatible/OpenAICompatibleClient.Mapping.cs
Shared/ConduitLLM.Core/Services/RedisEmbeddingCache.cs
Shared/ConduitLLM.Core/Services/CostCalculationService.CacheSavings.cs
Shared/ConduitLLM.Core/Models/Usage.cs
Out of scope (for now)
Implementation of Phase 2 — it depends on Phase 1 findings and the SOTA survey above.
Research Task: SOTA Caching to Reduce Token Consumption
Type: Research / investigation (spike before implementation)
Goal: Establish state-of-the-art caching in Conduit to reduce token consumption and cost across providers. Build on the infrastructure we already have first, then go deeper into a semantic/embeddings-based caching system.
This is intentionally scoped as a learning task — we want to understand the full design space of LLM gateway caching before committing to an implementation, then phase the build.
Background — what we already have
An audit of the codebase found that most caching machinery exists but is largely unwired:
Shared/ConduitLLM.Core/Decorators/PromptCachingLLMClient.csinjects Anthropic-stylecache_controlbreakpoints; response mappers already extractcache_read_input_tokens/cache_creation_input_tokensintoUsage.CachedInputTokens/Usage.CachedWriteTokens(OpenAICompatibleClient.Mapping.cs). Cost savings computed inCostCalculationService.CacheSavings.cs. Handles OpenAIcached_tokens, DeepSeek hit/miss fields, and Anthropic-via-OpenRouter.Shared/ConduitLLM.Core/Caching/CachingLLMClient.csSHA256-hashes the request, butAddLLMCaching()is never called, so it is absent from the decorator chain. Only handles non-streaming requests.RedisEmbeddingCacheonly caches explicit embedding requests, not chat completions.Caveat — provider lineup
ProviderTypehas no native Anthropic provider; Claude is only reachable via OpenRouter. Prompt-caching support is therefore uneven:cached_tokens(already parsed).cache_controlinjection (max 4 breakpoints).The three caching strategies (to research & document)
Phase 1 — Build on what we have (lower risk, higher confidence)
CachingLLMClient). Insert into the decorator chain inDatabaseAwareLLMClientFactory.cs(~line 241) and callAddLLMCaching()at startup. Gate behind existingLLMCachingEnabledglobal setting and a per-virtual-key opt-in. Skip whentemperature > 0unless caller explicitly opts in. Never global-default-on.ChatCompletionRequest.ExtensionData. A missing field = serving wrong responses.PromptCachingLLMClientis actually in the chain (referenced atDatabaseAwareLLMClientFactory.cs:228). Add an explicitcache_controlpassthrough field onChatCompletionRequestso clients can place their own breakpoints. Surface cache-hit metrics in WebAdmin.CachingLLMClientand the streaming path currently skip each other. Investigate caching the accumulated final result after a stream completes and replaying it as a synthetic stream on hits.Phase 2 — Go deeper: semantic / embeddings-based caching
RedisEmbeddingCachevs. dedicated vector DB).Research questions / SOTA survey
Key files
Shared/ConduitLLM.Core/Caching/CachingLLMClient.csShared/ConduitLLM.Core/Caching/CachingServiceExtensions.csShared/ConduitLLM.Core/Decorators/PromptCachingLLMClient.csShared/ConduitLLM.Providers/DatabaseAwareLLMClientFactory.csShared/ConduitLLM.Providers/Providers/OpenAICompatible/OpenAICompatibleClient.Mapping.csShared/ConduitLLM.Core/Services/RedisEmbeddingCache.csShared/ConduitLLM.Core/Services/CostCalculationService.CacheSavings.csShared/ConduitLLM.Core/Models/Usage.csOut of scope (for now)
Implementation of Phase 2 — it depends on Phase 1 findings and the SOTA survey above.