A semantic similarity-based cache for text embeddings that reduces redundant API calls and improves cache hit rates.
Current embedding cache implementations (LangChain, etc.) use exact string matching as cache keys. This means "How do I reset my password?" and "I forgot my password" are treated as completely different queries, causing cache misses even though they're semantically identical.
Problems with exact string matching:
- Sensitive to typos, punctuation, and minor wording changes
- Misses semantically similar queries
- Results in redundant API calls and increased costs
- Lower cache efficiency (~24% hit rate on varied queries)
SemanticEmbedCache solution: Uses semantic similarity search to find cached embeddings for semantically similar texts, even when worded differently. This improves cache hit rates by 2-3x compared to exact matching.
- Semantic similarity matching: Finds cached embeddings using cosine similarity (threshold: 0.85)
- Two-tier lookup: Fast exact match, fallback to similarity search
- Performance optimizations:
  - In-memory deserialized key cache
  - Early exit for high similarity (>0.98)
- Pluggable architecture:
  - Compatible with any embedder implementing `BaseEmbedder` or LangChain's `Embeddings` interface
  - Swappable storage backends (via the `BaseStorage` interface)
- Easy integration: simple API with a `.get(text)` method
```python
# src/const/const.py
SIMILARITY_THRESHOLD = 0.85          # Minimum similarity for cache hit
HIGHEST_SIMILARITY_THRESHOLD = 0.98  # Early exit threshold (near-duplicates)
```

```python
from langchain_cohere import CohereEmbeddings

from src.SemanticEmbedCache import SemanticEmbedCache
from src.embedder.KeyEmbedder import KeyEmbedder
from src.storage.InMemStorage import InMemStorage

# Initialize components
key_embedder = KeyEmbedder()  # FastEmbed local model
og_embedder = CohereEmbeddings(model="embed-english-v3.0")
storage = InMemStorage()

# Create cache
sec = SemanticEmbedCache(
    key_embedder=key_embedder,
    og_embedder=og_embedder,
    storage=storage,
)

# Use cache
embedding = sec.get("How do I reset my password?")  # API call (miss)
embedding = sec.get("I forgot my password")         # Cache hit! (0.87 similarity)
```

SemanticEmbedCache stores pairs of (key embedding, original embedding):
| Key (Serialized Key Embedding) | Value (Original Embedding) |
|---|---|
| String representation of key embedding | Embedding from original embedder |
Components:
- Key Embedding: fast, lightweight embedding using FastEmbed's `jinaai/jina-embeddings-v2-base-en` (local model)
  - Used for similarity search only
  - Serialized as a string for storage keys
- Original Embedding: full embedding from your chosen embedder (e.g., Cohere, OpenAI)
  - The actual embedding returned to your application
  - Stored as the value in the cache
- Key embeddings are cheap/fast (local FastEmbed) → used for finding similar queries
- Original embeddings are expensive (API calls) → the feature-rich embeddings your application actually needs
- This 2-tier design minimizes API costs while enabling semantic search
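For illustration, here is a minimal sketch of how a stored pair might look, assuming key embeddings are JSON-serialized into string keys (the project's actual serialization format may differ; the vectors and names below are hypothetical):

```python
import json

# Key embedding from the local FastEmbed model (cheap to compute)
key_embedding = [0.12, -0.45, 0.88]
storage_key = json.dumps(key_embedding)   # string key for the storage backend

# Original embedding from the API embedder (expensive; stand-in values here)
original_embedding = [0.01] * 1024

# The cache maps serialized key embedding -> original embedding
cache: dict[str, list[float]] = {}
cache[storage_key] = original_embedding

# Round trip: the key deserializes back to a vector for similarity search
assert json.loads(storage_key) == key_embedding
```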
```
Input text: "I forgot my password"
    ↓
1. Generate key embedding (FastEmbed - local, fast)
    ↓
2. Check exact match in cache (serialized key lookup)
    ↓ (miss)
3. Similarity search over all cached keys (cosine similarity)
    ↓ (found: "How do I reset my password?" with 0.87 similarity)
4. Return cached original embedding ✓ (cache hit!)

If no match found:
    ↓
5. Generate original embedding (API call - expensive)
    ↓
6. Store (key_embedding → original_embedding) in cache
    ↓
7. Return original embedding
```
- Key embedding generation: compute the FastEmbed embedding for the input text
- Exact match lookup: check if the serialized key embedding exists in storage → instant return if found
- Similarity search (if no exact match):
  - Iterate through all stored keys
  - Compute cosine similarity between the query key and each stored key
  - If similarity > `SIMILARITY_THRESHOLD` (0.85), return the cached embedding
  - If similarity > `HIGHEST_SIMILARITY_THRESHOLD` (0.98), early exit (near-duplicate)
- Cache miss: generate the original embedding via API, store it in the cache, return it
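The lookup steps above can be sketched as follows. This is a simplified, hypothetical helper (the real class also maintains an in-memory deserialized key cache); it assumes keys are JSON-serialized vectors:

```python
import json
import numpy as np

SIMILARITY_THRESHOLD = 0.85
HIGHEST_SIMILARITY_THRESHOLD = 0.98

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_lookup(query_key, store):
    """Two-tier lookup: exact serialized-key match, then linear cosine scan.

    Returns (cached_value, hit). On a miss the caller is expected to
    generate the original embedding and store it.
    """
    serialized = json.dumps(query_key)
    if serialized in store:                      # tier 1: exact match
        return store[serialized], True

    q = np.asarray(query_key)
    best_value, best_sim = None, 0.0
    for key, value in store.items():             # tier 2: similarity search
        sim = cosine(q, np.asarray(json.loads(key)))
        if sim > best_sim:
            best_value, best_sim = value, sim
        if sim > HIGHEST_SIMILARITY_THRESHOLD:   # near-duplicate: early exit
            break
    if best_sim > SIMILARITY_THRESHOLD:
        return best_value, True
    return None, False                           # miss

store = {json.dumps([1.0, 0.0]): "cached-original-embedding"}
value, hit = similarity_lookup([0.95, 0.31], store)  # semantically close key -> hit
```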
Tested on 100 diverse queries (exact duplicates, semantic variations, unique queries).

With `SIMILARITY_THRESHOLD = 0.85`:
| Implementation | Hit Rate | Avg Time | Improvement |
|---|---|---|---|
| SemanticEmbedCache | 60.0% | 0.300s | 2.5x better hit rate |
| LangChain CacheBackedEmbeddings | 24.0% | 0.281s | Only exact matches |
With `SIMILARITY_THRESHOLD = 0.90`:

| Implementation | Hit Rate | Avg Time | Improvement |
|---|---|---|---|
| SemanticEmbedCache | 49.0% | 0.327s | 2.04x better hit rate |
| LangChain CacheBackedEmbeddings | 24.0% | 0.282s | Only exact matches |
(Run the `benchmark.py` script to reproduce these results.)
Key Findings:
- Hit Rate Impact: lowering the threshold from 0.90 to 0.85 improves the hit rate from 49% to 60% (an 11-point gain, ~22% relative improvement)
- Speed Trade-off: SEC is ~7% slower due to similarity search, but this is offset by:
- 2-3x fewer API calls (60% cache hits vs 24%)
- Significant cost savings on embedding API usage
- Overall faster application performance due to reduced network latency
Threshold Selection Guide:
- 0.90: Strict matching, fewer false positives, 49% hit rate
- 0.85: Balanced (recommended), good semantic matching, 60% hit rate
- 0.80: More lenient, higher hit rate but risk of unrelated matches
Notes:
- Average time includes embedding generation and cache lookup
- LangChain's CacheBackedEmbeddings doesn't return hit/miss status natively; benchmark uses custom modification
- Benchmarks performed using Cohere's `embed-english-v3.0` model as the original embedder
- Dataset: 100 queries with ~30% exact duplicates, ~40% semantic variations, ~30% unique queries
Current implementation uses O(n) linear search through all cached keys. For larger caches:
Recommended optimization: Add FAISS for approximate nearest neighbor search
- Reduces search complexity from O(n) to O(log n)
- Expected speedup: 10-100x for caches with 1000+ entries
- See implementation guide in codebase discussions
`__init__(key_embedder, og_embedder, storage)`
- `key_embedder`: `KeyEmbedder` instance for generating key embeddings
- `og_embedder`: `BaseEmbedder | Embeddings` - your original embedder (Cohere, OpenAI, etc.)
- `storage`: `BaseStorage` - storage backend (`InMemStorage` or custom)
`get(text: str) -> list[float]`
- Embeds text using semantic cache
- Returns: List of floats (the original embedding)
`_benchmark_get(text: str) -> Tuple[list[float], bool]`
- Same as `get()` but also returns cache hit/miss status
- For benchmarking purposes only
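A hit-rate measurement loop over `_benchmark_get`'s `(embedding, hit)` return shape might look like the sketch below. `StubCache` is a hypothetical stand-in (it only does exact matching, no API calls) so the arithmetic is runnable without credentials:

```python
class StubCache:
    """Stand-in with the same _benchmark_get return shape as SemanticEmbedCache."""

    def __init__(self) -> None:
        self._seen: dict[str, list[float]] = {}

    def _benchmark_get(self, text: str) -> tuple[list[float], bool]:
        hit = text in self._seen
        if not hit:
            self._seen[text] = [0.0]  # placeholder "embedding"
        return self._seen[text], hit

queries = ["reset password", "reset password", "forgot login", "reset password"]
cache = StubCache()
hits = sum(cache._benchmark_get(q)[1] for q in queries)
hit_rate = hits / len(queries)  # 2 repeats hit out of 4 queries -> 0.5
```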
Custom storage backends: implement the `BaseStorage` interface:

```python
class RedisStorage(BaseStorage):
    def get(self, key: str) -> Any: ...
    def get_all_keys(self) -> list[str]: ...
    def set(self, key: str, value: Any) -> None: ...
```

Custom embedders: implement the `BaseEmbedder` interface or use any LangChain `Embeddings` class.
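As a concrete, runnable sketch of a storage backend, here is a dict-backed implementation of the three methods shown above. The local `BaseStorage` ABC is an illustrative stand-in for the project's actual class under `src/storage/`, and `DictStorage` is analogous to `InMemStorage`:

```python
from abc import ABC, abstractmethod
from typing import Any

class BaseStorage(ABC):
    """Stand-in for the project's storage interface (same three methods)."""

    @abstractmethod
    def get(self, key: str) -> Any: ...

    @abstractmethod
    def get_all_keys(self) -> list[str]: ...

    @abstractmethod
    def set(self, key: str, value: Any) -> None: ...

class DictStorage(BaseStorage):
    """Simple dict-backed backend, analogous to InMemStorage."""

    def __init__(self) -> None:
        self._data: dict[str, Any] = {}

    def get(self, key: str) -> Any:
        return self._data.get(key)  # None if absent

    def get_all_keys(self) -> list[str]:
        return list(self._data)     # keys iterated during similarity search

    def set(self, key: str, value: Any) -> None:
        self._data[key] = value

storage = DictStorage()
storage.set("k", [0.1, 0.2])
```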
MIT License