A semantic similarity-based cache for text embeddings that reduces redundant API calls and improves cache hit rates.
Current embedding cache implementations (LangChain, etc.) use exact string matching as cache keys. This means "How do I reset my password?" and "I forgot my password" are treated as completely different queries, causing cache misses even though they're semantically identical.
Problems with exact string matching:
- Sensitive to typos, punctuation, and minor wording changes
- Misses semantically similar queries
- Results in redundant API calls and increased costs
- Lower cache efficiency (~24% hit rate on varied queries)
SemanticEmbedCache solution: Uses semantic similarity search to find cached embeddings for semantically similar texts, even when worded differently. This improves cache hit rates by 2-3x compared to exact matching.
- Semantic similarity matching: Finds cached embeddings using cosine similarity (threshold: 0.85)
- Two-tier lookup: Fast exact match, fallback to similarity search
- Performance optimizations:
  - In-memory deserialized key cache
  - Early exit for high similarity (>0.98)
- Pluggable architecture:
  - Compatible with any embedder implementing `BaseEmbedder` or LangChain's `Embeddings` interface
  - Swappable storage backends (via the `BaseStorage` interface)
- Easy integration: simple API with a `.get(text)` method
```python
# src/const/const.py
SIMILARITY_THRESHOLD = 0.85          # Minimum similarity for cache hit
HIGHEST_SIMILARITY_THRESHOLD = 0.98  # Early exit threshold (near-duplicates)
```

```python
from langchain_cohere import CohereEmbeddings

from src.SemanticEmbedCache import SemanticEmbedCache
from src.embedder.KeyEmbedder import KeyEmbedder
from src.storage.InMemStorage import InMemStorage

# Initialize components
key_embedder = KeyEmbedder()  # FastEmbed local model
og_embedder = CohereEmbeddings(model="embed-english-v3.0")
storage = InMemStorage()

# Create cache
sec = SemanticEmbedCache(
    key_embedder=key_embedder,
    og_embedder=og_embedder,
    storage=storage,
)

# Use cache
embedding = sec.get("How do I reset my password?")  # API call (miss)
embedding = sec.get("I forgot my password")         # Cache hit! (0.87 similarity)
```

SemanticEmbedCache stores pairs of (key embedding, original embedding):
| Key (Serialized Key Embedding) | Value (Original Embedding) |
|---|---|
| String representation of key embedding | Embedding from original embedder |
Components:
- Key Embedding: fast, lightweight embedding using FastEmbed's `jinaai/jina-embeddings-v2-base-en` (local model)
  - Used for similarity search only
  - Serialized as a string for storage keys
- Original Embedding: full embedding from your chosen embedder (e.g., Cohere, OpenAI)
  - The actual embedding returned to your application
  - Stored as the value in the cache
- Key embeddings are cheap/fast (local FastEmbed) → used for finding similar queries
- Original embeddings are expensive (API calls) → the feature-rich embeddings your application actually needs
- This 2-tier design minimizes API costs while enabling semantic search
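For illustration, here is a minimal sketch of how a stored pair might look, assuming key embeddings are JSON-serialized into string keys (the project's actual serialization format may differ; the vectors and names below are hypothetical):

```python
import json

# Key embedding from the local FastEmbed model (cheap to compute)
key_embedding = [0.12, -0.45, 0.88]
storage_key = json.dumps(key_embedding)   # string key for the storage backend

# Original embedding from the API embedder (expensive; stand-in values here)
original_embedding = [0.01] * 1024

# The cache maps serialized key embedding -> original embedding
cache: dict[str, list[float]] = {}
cache[storage_key] = original_embedding

# Round trip: the key deserializes back to a vector for similarity search
assert json.loads(storage_key) == key_embedding
```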
```
Input text: "I forgot my password"
    ↓
1. Generate key embedding (FastEmbed - local, fast)
    ↓
2. Check exact match in cache (serialized key lookup)
    ↓ (miss)
3. Similarity search over all cached keys (cosine similarity)
    ↓ (found: "How do I reset my password?" with 0.87 similarity)
4. Return cached original embedding ✓ (cache hit!)

If no match found:
    ↓
5. Generate original embedding (API call - expensive)
    ↓
6. Store (key_embedding → original_embedding) in cache
    ↓
7. Return original embedding
```
- Key embedding generation: compute the FastEmbed embedding for the input text
- Exact match lookup: check if the serialized key embedding exists in storage → instant return if found
- Similarity search (if no exact match):
  - Iterate through all stored keys
  - Compute cosine similarity between the query key and each stored key
  - If similarity > `SIMILARITY_THRESHOLD` (0.85), return the cached embedding
  - If similarity > `HIGHEST_SIMILARITY_THRESHOLD` (0.98), early exit (near-duplicate)
- Cache miss: generate the original embedding via API, store it in the cache, return it
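The lookup steps above can be sketched as follows. This is a simplified, hypothetical helper (the real class also maintains an in-memory deserialized key cache); it assumes keys are JSON-serialized vectors:

```python
import json
import numpy as np

SIMILARITY_THRESHOLD = 0.85
HIGHEST_SIMILARITY_THRESHOLD = 0.98

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_lookup(query_key, store):
    """Two-tier lookup: exact serialized-key match, then linear cosine scan.

    Returns (cached_value, hit). On a miss the caller is expected to
    generate the original embedding and store it.
    """
    serialized = json.dumps(query_key)
    if serialized in store:                      # tier 1: exact match
        return store[serialized], True

    q = np.asarray(query_key)
    best_value, best_sim = None, 0.0
    for key, value in store.items():             # tier 2: similarity search
        sim = cosine(q, np.asarray(json.loads(key)))
        if sim > best_sim:
            best_value, best_sim = value, sim
        if sim > HIGHEST_SIMILARITY_THRESHOLD:   # near-duplicate: early exit
            break
    if best_sim > SIMILARITY_THRESHOLD:
        return best_value, True
    return None, False                           # miss

store = {json.dumps([1.0, 0.0]): "cached-original-embedding"}
value, hit = similarity_lookup([0.95, 0.31], store)  # semantically close key -> hit
```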
Tested on 100 diverse queries (exact duplicates, semantic variations, unique queries).

With `SIMILARITY_THRESHOLD = 0.85`:
| Implementation | Hit Rate | Avg Time | Improvement |
|---|---|---|---|
| SemanticEmbedCache | 60.0% | 0.300s | 2.5x better hit rate |
| LangChain CacheBackedEmbeddings | 24.0% | 0.281s | Only exact matches |
With `SIMILARITY_THRESHOLD = 0.90`:

| Implementation | Hit Rate | Avg Time | Improvement |
|---|---|---|---|
| SemanticEmbedCache | 49.0% | 0.327s | 2.04x better hit rate |
| LangChain CacheBackedEmbeddings | 24.0% | 0.282s | Only exact matches |
(Run the `benchmark.py` script to reproduce these results.)
Key Findings:
- Hit Rate Impact: lowering the threshold from 0.90 to 0.85 improves the hit rate from 49% to 60% (an 11-point gain, ~22% relative improvement)
- Speed Trade-off: SEC is ~7% slower due to similarity search, but this is offset by:
- 2-3x fewer API calls (60% cache hits vs 24%)
- Significant cost savings on embedding API usage
- Overall faster application performance due to reduced network latency
Threshold Selection Guide:
- 0.90: Strict matching, fewer false positives, 49% hit rate
- 0.85: Balanced (recommended), good semantic matching, 60% hit rate
- 0.80: More lenient, higher hit rate but risk of unrelated matches
Notes:
- Average time includes embedding generation and cache lookup
- LangChain's CacheBackedEmbeddings doesn't return hit/miss status natively; benchmark uses custom modification
- Benchmarks performed using Cohere's `embed-english-v3.0` model as the original embedder
- Dataset: 100 queries with ~30% exact duplicates, ~40% semantic variations, ~30% unique queries
Current implementation uses O(n) linear search through all cached keys. For larger caches:
Recommended optimization: Add FAISS for approximate nearest neighbor search
- Reduces search complexity from O(n) to O(log n)
- Expected speedup: 10-100x for caches with 1000+ entries
- See implementation guide in codebase discussions
`__init__(key_embedder, og_embedder, storage)`
- `key_embedder`: `KeyEmbedder` instance for generating key embeddings
- `og_embedder`: `BaseEmbedder | Embeddings` - your original embedder (Cohere, OpenAI, etc.)
- `storage`: `BaseStorage` - storage backend (`InMemStorage` or custom)
`get(text: str) -> list[float]`
- Embeds text using semantic cache
- Returns: List of floats (the original embedding)
`_benchmark_get(text: str) -> Tuple[list[float], bool]`
- Same as `get()` but also returns cache hit/miss status
- For benchmarking purposes only
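A hit-rate measurement loop over `_benchmark_get`'s `(embedding, hit)` return shape might look like the sketch below. `StubCache` is a hypothetical stand-in (it only does exact matching, no API calls) so the arithmetic is runnable without credentials:

```python
class StubCache:
    """Stand-in with the same _benchmark_get return shape as SemanticEmbedCache."""

    def __init__(self) -> None:
        self._seen: dict[str, list[float]] = {}

    def _benchmark_get(self, text: str) -> tuple[list[float], bool]:
        hit = text in self._seen
        if not hit:
            self._seen[text] = [0.0]  # placeholder "embedding"
        return self._seen[text], hit

queries = ["reset password", "reset password", "forgot login", "reset password"]
cache = StubCache()
hits = sum(cache._benchmark_get(q)[1] for q in queries)
hit_rate = hits / len(queries)  # 2 repeats hit out of 4 queries -> 0.5
```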
Custom storage backends: implement the `BaseStorage` interface:

```python
class RedisStorage(BaseStorage):
    def get(self, key: str) -> Any: ...
    def get_all_keys(self) -> list[str]: ...
    def set(self, key: str, value: Any) -> None: ...
```

Custom embedders: implement the `BaseEmbedder` interface or use any LangChain `Embeddings` class.
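As a concrete, runnable sketch of a storage backend, here is a dict-backed implementation of the three methods shown above. The local `BaseStorage` ABC is an illustrative stand-in for the project's actual class under `src/storage/`, and `DictStorage` is analogous to `InMemStorage`:

```python
from abc import ABC, abstractmethod
from typing import Any

class BaseStorage(ABC):
    """Stand-in for the project's storage interface (same three methods)."""

    @abstractmethod
    def get(self, key: str) -> Any: ...

    @abstractmethod
    def get_all_keys(self) -> list[str]: ...

    @abstractmethod
    def set(self, key: str, value: Any) -> None: ...

class DictStorage(BaseStorage):
    """Simple dict-backed backend, analogous to InMemStorage."""

    def __init__(self) -> None:
        self._data: dict[str, Any] = {}

    def get(self, key: str) -> Any:
        return self._data.get(key)  # None if absent

    def get_all_keys(self) -> list[str]:
        return list(self._data)     # keys iterated during similarity search

    def set(self, key: str, value: Any) -> None:
        self._data[key] = value

storage = DictStorage()
storage.set("k", [0.1, 0.2])
```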
MIT License