Research: SOTA caching to reduce token consumption (build on existing, then explore semantic/embeddings)

## Research Task: SOTA Caching to Reduce Token Consumption

**Type:** Research / investigation (spike before implementation)
**Goal:** Establish state-of-the-art caching in Conduit to reduce token consumption and cost across providers. Build on the infrastructure we already have first, then go deeper into a semantic/embeddings-based caching system.

This is intentionally scoped as a **learning task** — we want to understand the full design space of LLM gateway caching before committing to an implementation, then phase the build.

---

### Background — what we already have

An audit of the codebase found that **most caching machinery exists but is largely unwired**:

- **Provider-native prompt caching (partial):** `Shared/ConduitLLM.Core/Decorators/PromptCachingLLMClient.cs` injects Anthropic-style `cache_control` breakpoints; response mappers already extract `cache_read_input_tokens` / `cache_creation_input_tokens` into `Usage.CachedInputTokens` / `Usage.CachedWriteTokens` (`OpenAICompatibleClient.Mapping.cs`). Cost savings are computed in `CostCalculationService.CacheSavings.cs`. The existing code handles OpenAI `cached_tokens`, DeepSeek hit/miss fields, and Anthropic-via-OpenRouter.
- **Exact-match response caching (built, NOT wired in):** `Shared/ConduitLLM.Core/Caching/CachingLLMClient.cs` SHA256-hashes the request, but `AddLLMCaching()` is never called, so the client is absent from the decorator chain. It only handles non-streaming requests.
- **Semantic caching (does not exist):** `RedisEmbeddingCache` only caches explicit embedding requests, not chat completions.

### Caveat — provider lineup

`ProviderType` has **no native Anthropic provider**; Claude is only reachable via OpenRouter. Prompt-caching support is therefore uneven:
- **OpenAI / compatible:** automatic prefix caching (on OpenAI itself, for prompts of 1,024 tokens or more); savings surface in `cached_tokens` (already parsed).
- **Anthropic via OpenRouter:** requires explicit `cache_control` injection (max 4 breakpoints).
- **Groq / Cerebras / SambaNova / Cloudflare / etc.:** mostly no prompt-caching API — provider-native caching does not apply; response caching is the only lever.
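To make the explicit-injection case concrete, here is a minimal sketch of placing `cache_control` breakpoints on an OpenAI-style message list while respecting the four-breakpoint limit. It is written in Python for brevity and is not the logic in `PromptCachingLLMClient.cs`; the function name `add_cache_breakpoints` and the `min_chars` heuristic are assumptions made for illustration. The `{"type": "ephemeral"}` marker on a text content part follows Anthropic's documented format.

```python
from copy import deepcopy

MAX_BREAKPOINTS = 4  # Anthropic's documented per-request limit


def add_cache_breakpoints(messages, min_chars=4000, max_breakpoints=MAX_BREAKPOINTS):
    """Return a copy of `messages` with cache_control markers on large messages.

    Hypothetical policy: walk the messages in order and mark the last text
    part of each message whose text totals at least `min_chars` characters,
    stopping once `max_breakpoints` markers have been placed.
    """
    out = deepcopy(messages)
    used = 0
    for msg in out:
        if used >= max_breakpoints:
            break
        content = msg.get("content")
        # Normalize plain-string content to the content-part list form.
        if isinstance(content, str):
            content = [{"type": "text", "text": content}]
        if not isinstance(content, list):
            continue
        text_parts = [p for p in content if p.get("type") == "text"]
        total = sum(len(p.get("text", "")) for p in text_parts)
        if not text_parts or total < min_chars:
            continue  # leave small messages untouched
        text_parts[-1]["cache_control"] = {"type": "ephemeral"}
        msg["content"] = content
        used += 1
    return out
```

A real policy would likely count tokens rather than characters and prefer stable prefixes (system prompt, tool definitions) over recent turns; the sketch only shows the shape of the transformation and the limit check.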

---

### The three caching strategies (to research & document)

| Type | How it saves tokens | Hit condition | Risk |
|------|--------------------|--------------|------|
| **1. Provider-native prompt caching** | Provider bills cached prefix tokens at a discount (~0.1x on Anthropic cache reads; the rate varies by provider) | Identical prefix within TTL (5 min / 1 hr on Anthropic) | Low: provider-managed, though Anthropic bills cache writes at a premium |
| **2. Exact-match response caching** | Skips the LLM call entirely, so no input or output tokens are billed | Byte-identical request (model + messages + params) | Stale responses; wrong for non-deterministic use |
| **3. Semantic response caching** | Skips the LLM call for *similar* (not identical) requests | Embedding similarity above threshold | False hits return wrong answers |
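The savings from strategy 1 can be estimated with simple arithmetic. The sketch below is illustrative only and is not the logic in `CostCalculationService.CacheSavings.cs`. It assumes `cached_tokens` is a subset of `prompt_tokens` (providers differ in how they report this), uses the ~0.1x multiplier from the table, and uses a made-up price of $3.00 per million input tokens.

```python
def input_cost(prompt_tokens, cached_tokens, price_per_mtok, cached_multiplier=0.1):
    """Input-side cost in dollars when part of the prompt is served from cache.

    Assumes cached_tokens is counted inside prompt_tokens.
    """
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_tokens
    billable = uncached + cached_tokens * cached_multiplier
    return billable * price_per_mtok / 1_000_000


# Hypothetical request: 10,000 prompt tokens, 8,000 of them a cached prefix.
baseline = input_cost(10_000, 0, 3.00)
with_cache = input_cost(10_000, 8_000, 3.00)
savings = 1 - with_cache / baseline
```

With these illustrative numbers the input cost falls from $0.03 to $0.0084, a saving of about 72%. Output tokens are unaffected, which is why strategies 2 and 3 matter for output-heavy workloads.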

---

### Phase 1 — Build on what we have (lower risk, higher confidence)

- [ ] **Wire up exact-match response caching (`CachingLLMClient`).** Insert into the decorator chain in `DatabaseAwareLLMClientFactory.cs` (~line 241) and call `AddLLMCaching()` at startup. Gate behind existing `LLMCachingEnabled` global setting **and** a per-virtual-key opt-in. Skip when `temperature > 0` unless caller explicitly opts in. Never global-default-on.
- [ ] **Verify cache-key correctness.** The SHA256 key must include the model, all messages, and every output-affecting parameter (temperature, top_p, tools, response_format, max_tokens), including `ChatCompletionRequest.ExtensionData`. Any omitted field lets two different requests share a key, which means serving wrong responses.
- [ ] **Finish provider-native prompt caching.** Confirm `PromptCachingLLMClient` is actually in the chain (referenced at `DatabaseAwareLLMClientFactory.cs:228`). Add an explicit `cache_control` passthrough field on `ChatCompletionRequest` so clients can place their own breakpoints. Surface cache-hit metrics in WebAdmin.
- [ ] **Decide on streaming.** `CachingLLMClient` currently bypasses streaming requests, and the streaming path never consults the cache. Investigate caching the accumulated final result after a stream completes and replaying it as a synthetic stream on hits.
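The key-correctness and gating items above can be sketched as follows. This is a Python illustration under stated assumptions, not the existing `CachingLLMClient` behavior: `cache_key`, `is_cacheable`, and `IGNORED_FIELDS` are hypothetical names, the request is modeled as a plain dict, and treating a missing `temperature` as "sampling" is an assumption (OpenAI's default is 1). The sketch hashes every field except an explicit ignore list, so a newly added parameter changes the key by default instead of being silently dropped.

```python
import hashlib
import json

# Hypothetical: fields known not to affect the model's output.
IGNORED_FIELDS = frozenset({"user"})


def cache_key(request: dict) -> str:
    """SHA-256 over a canonical JSON form of the request.

    Sorting keys and fixing separators makes the key independent of
    field order and whitespace.
    """
    material = {k: v for k, v in request.items() if k not in IGNORED_FIELDS}
    canonical = json.dumps(
        material, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_cacheable(request: dict, key_opted_in: bool, allow_sampling: bool = False) -> bool:
    """Gate for exact-match caching: opt-in only, non-streaming, deterministic."""
    if not key_opted_in:
        return False
    if request.get("stream"):
        return False  # streaming is out of scope until the replay question is settled
    temperature = request.get("temperature")
    if temperature is None or temperature > 0:
        return allow_sampling  # sampled output is cached only on explicit opt-in
    return True
```

In the real client, anything carried in `ChatCompletionRequest.ExtensionData` would need to be part of the hashed material as well, and a global setting such as `LLMCachingEnabled` would sit in front of this per-key gate.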

### Phase 2 — Go deeper: semantic / embeddings-based caching

- [ ] Research semantic cache designs (embedding model choice, similarity threshold tuning, vector store — extend `RedisEmbeddingCache` vs. dedicated vector DB).
- [ ] Define correctness guardrails (false-positive mitigation; which workloads it is safe for — FAQ/support vs. general chat).
- [ ] Prototype on a constrained, opt-in workload before any general rollout.
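As a starting point for the prototype, the core lookup of a semantic cache can be sketched in a few lines. Everything here is an assumption for illustration: `SemanticCache`, `toy_embed`, the `scope` argument (standing in for a virtual key or workload), and the 0.92 threshold. The toy embedding is a bag of words so the example runs without a model; a real prototype would call an embedding model and a vector index instead of a linear scan.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: lowercase word counts. Not a real embedding model."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, embed=toy_embed, threshold=0.92):
        self._embed = embed
        self._threshold = threshold
        self._entries = []  # list of (scope, vector, response)

    def put(self, scope, prompt, response):
        self._entries.append((scope, self._embed(prompt), response))

    def get(self, scope, prompt):
        """Return the best response at or above the threshold, else None."""
        query = self._embed(prompt)
        best_score, best_response = 0.0, None
        for entry_scope, vector, response in self._entries:
            if entry_scope != scope:
                continue  # never share hits across scopes
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self._threshold else None
```

The threshold is the main correctness knob. In this toy setup, "how do I reset my password" and "how do I delete my account" score about 0.67, so a threshold of 0.6 would return the password answer to the account question. That is the false-hit risk the guardrails item needs to address.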

### Research questions / SOTA survey

- [ ] Survey SOTA approaches: GPTCache-style semantic caching, prefix/KV-cache sharing, prompt-caching across the provider matrix, dedup of in-flight identical requests.
- [ ] How do other gateways (LiteLLM, Portkey, Helicone, Cloudflare AI Gateway) implement caching? What knobs do they expose?
- [ ] Cache invalidation & TTL strategy per cache type.
- [ ] Observability: a cache-effectiveness dashboard (hit rate, tokens saved, cost saved) in WebAdmin.
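One of the survey items, dedup of in-flight identical requests, is simple enough to sketch directly. The following is a minimal "single-flight" illustration using Python's `asyncio`; the class name `SingleFlight` and the idea of keying it by the request hash are assumptions, not an existing Conduit component. Concurrent callers with the same key share one upstream call; once that call finishes, the next caller starts a fresh one.

```python
import asyncio


class SingleFlight:
    """Coalesce concurrent calls that share a key into one underlying call."""

    def __init__(self):
        self._inflight = {}  # key -> asyncio.Task

    async def run(self, key, fn):
        task = self._inflight.get(key)
        if task is None:
            # First caller for this key starts the real work.
            task = asyncio.create_task(fn())
            self._inflight[key] = task
            # Forget the key once the call settles, whether it succeeded or failed.
            task.add_done_callback(lambda _t: self._inflight.pop(key, None))
        # shield() keeps one cancelled caller from cancelling the shared call.
        return await asyncio.shield(task)
```

Unlike a response cache, this stores nothing after the call completes, so it carries no staleness risk. It only helps when identical requests overlap in time, and an upstream failure is propagated to every waiting caller.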

---

### Key files

- `Shared/ConduitLLM.Core/Caching/CachingLLMClient.cs`
- `Shared/ConduitLLM.Core/Caching/CachingServiceExtensions.cs`
- `Shared/ConduitLLM.Core/Decorators/PromptCachingLLMClient.cs`
- `Shared/ConduitLLM.Providers/DatabaseAwareLLMClientFactory.cs`
- `Shared/ConduitLLM.Providers/Providers/OpenAICompatible/OpenAICompatibleClient.Mapping.cs`
- `Shared/ConduitLLM.Core/Services/RedisEmbeddingCache.cs`
- `Shared/ConduitLLM.Core/Services/CostCalculationService.CacheSavings.cs`
- `Shared/ConduitLLM.Core/Models/Usage.cs`

### Out of scope (for now)

Implementation of Phase 2 — it depends on Phase 1 findings and the SOTA survey above.

