# Scalable RAG

A production-ready RAG system built with DevOps and MLOps best practices, including caching, rate limiting, circuit breakers, monitoring, and Kubernetes deployment.

## Features
- Distributed Caching: Redis-based caching for embeddings and query results
- Rate Limiting: Token bucket and sliding window rate limiters
- Circuit Breakers: Fault tolerance for external service calls
- Prometheus Metrics: Comprehensive observability
- Health Checks: Kubernetes-ready liveness and readiness probes
- Horizontal Scaling: Auto-scaling based on load
- Async Support: High-concurrency request handling
## Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                            Load Balancer                            │
└──────────────────────────────────┬──────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         Kubernetes Cluster                          │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │                      RAG API Pods (HPA)                       │  │
│  │   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐       │  │
│  │   │  Pod 1  │   │  Pod 2  │   │  Pod 3  │   │  Pod N  │       │  │
│  │   └────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘       │  │
│  └────────┼─────────────┼─────────────┼─────────────┼────────────┘  │
│           ▼             ▼             ▼             ▼               │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │                         Service Mesh                          │  │
│  │   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐      │  │
│  │   │ Rate Limiter │   │   Circuit    │   │    Cache     │      │  │
│  │   │              │   │   Breaker    │   │   (Redis)    │      │  │
│  │   └──────────────┘   └──────────────┘   └──────────────┘      │  │
│  └───────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────┬──────────────────────────────────┘
                                   │
                  ┌────────────────┼────────────────┐
                  ▼                ▼                ▼
            ┌──────────┐    ┌──────────┐    ┌───────────┐
            │  OpenAI  │    │ Pinecone │    │ Prometheus│
            │   API    │    │  Vector  │    │  Grafana  │
            └──────────┘    └──────────┘    └───────────┘
```
## Project Structure

```
scalable-rag/
├── config.py            # Configuration management
├── cache.py             # Redis caching layer
├── rate_limiter.py      # Rate limiting & circuit breakers
├── monitoring.py        # Prometheus metrics & health checks
├── rag_engine.py        # Core RAG implementation
├── main.py              # FastAPI application
├── Dockerfile           # Container image
├── kubernetes/
│   └── deployment.yaml  # K8s manifests
├── requirements.txt     # Dependencies
└── README.md            # Documentation
```
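The repo's `config.py` is not reproduced here, but a minimal stdlib-only sketch of how it might map the environment variables from the Configuration section onto a typed settings object (field names are illustrative assumptions, not the module's actual API) could look like:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Central configuration, read once from environment variables.

    Defaults mirror the values shown in the Configuration section.
    """
    llm_model: str = os.getenv("LLM_MODEL", "gpt-4o-mini")
    llm_temperature: float = float(os.getenv("LLM_TEMPERATURE", "0.3"))
    llm_max_tokens: int = int(os.getenv("LLM_MAX_TOKENS", "2000"))
    redis_url: str = os.getenv("REDIS_URL", "redis://localhost:6379")
    cache_enabled: bool = os.getenv("CACHE_ENABLED", "true").lower() == "true"
    cache_ttl: int = int(os.getenv("CACHE_TTL", "3600"))
    rate_limit_rpm: int = int(os.getenv("RATE_LIMIT_RPM", "60"))


# Module-level singleton, imported by the rest of the application.
settings = Settings()
```

A frozen dataclass keeps configuration immutable after startup, so a misbehaving handler cannot silently change limits at runtime.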
## Quick Start

```bash
# Install dependencies
pip install -r requirements.txt

# Set environment variables
export OPENAI_API_KEY=sk-...
export PINECONE_API_KEY=...
export REDIS_URL=redis://localhost:6379

# Start Redis (Docker)
docker run -d -p 6379:6379 redis:alpine

# Run the server
python main.py --port 8000 --reload
```

## Docker

```bash
# Build image
docker build -t scalable-rag .

# Run container
docker run -p 8000:8000 \
  -e OPENAI_API_KEY=sk-... \
  -e PINECONE_API_KEY=... \
  -e REDIS_URL=redis://redis:6379 \
  scalable-rag
```
## Kubernetes

```bash
# Create secrets
kubectl create secret generic rag-secrets \
  --from-literal=openai-api-key=sk-... \
  --from-literal=pinecone-api-key=...

# Deploy
kubectl apply -f kubernetes/deployment.yaml

# Check status
kubectl get pods -l app=scalable-rag
```

## API

### Query

```
POST /query
{
  "question": "What is machine learning?",
  "k": 5,
  "use_cache": true
}
```

### Health Checks

```
GET /health        # Full health status
GET /health/live   # Liveness probe
GET /health/ready  # Readiness probe
```

### Monitoring

```
GET /metrics  # Prometheus metrics
GET /status   # System status + circuit breakers
```

### Document Ingestion

```
POST /documents
{
  "texts": ["Document 1 content", "Document 2 content"],
  "metadatas": [{"source": "doc1"}, {"source": "doc2"}]
}
```

## Metrics

| Metric | Type | Description |
|---|---|---|
| `rag_requests_total` | Counter | Total requests by endpoint |
| `rag_request_latency_seconds` | Histogram | Request latency |
| `rag_llm_calls_total` | Counter | LLM API calls |
| `rag_llm_tokens_total` | Counter | Token usage |
| `rag_cache_hits_total` | Counter | Cache hits |
| `rag_errors_total` | Counter | Errors by type |
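The repo's `monitoring.py` is not shown here; assuming it uses the `prometheus_client` package, the metrics in the table could be declared roughly as follows (the `observe_request` helper is an illustrative name, not a documented API):

```python
# monitoring.py sketch: metric objects matching the table above.
from prometheus_client import Counter, Histogram

REQUESTS = Counter(
    "rag_requests_total", "Total requests by endpoint", ["endpoint"]
)
REQUEST_LATENCY = Histogram(
    "rag_request_latency_seconds", "Request latency", ["endpoint"]
)
LLM_CALLS = Counter("rag_llm_calls_total", "LLM API calls", ["model"])
LLM_TOKENS = Counter("rag_llm_tokens_total", "Token usage", ["model", "kind"])
CACHE_HITS = Counter("rag_cache_hits_total", "Cache hits", ["cache"])
ERRORS = Counter("rag_errors_total", "Errors by type", ["type"])


def observe_request(endpoint: str, seconds: float) -> None:
    """Record one request and its latency for a given endpoint label."""
    REQUESTS.labels(endpoint=endpoint).inc()
    REQUEST_LATENCY.labels(endpoint=endpoint).observe(seconds)
```

With `prometheus_client`, these metrics register in the default registry and are served verbatim by the `GET /metrics` endpoint.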
### Grafana Dashboard

Import the provided dashboard for visualizations:
- Request rate and latency
- Cache hit ratio
- LLM token usage
- Error rates
- Circuit breaker states
## Rate Limiting

```python
# Token bucket for smooth limiting
rate_limiter = TokenBucketRateLimiter(
    rate=60,       # tokens per minute
    capacity=100,  # burst capacity
)

# Sliding window for strict limits
rate_limiter = SlidingWindowRateLimiter(
    limit=60,
    window_seconds=60,
)
```

## Circuit Breakers

```python
# Protects against cascading failures
circuit = CircuitBreaker(
    name="llm",
    config=CircuitBreakerConfig(
        failure_threshold=5,   # Open after 5 failures
        timeout_seconds=30,    # Try again after 30s
        success_threshold=2,   # Close after 2 successes
    ),
)
```

## Caching Strategy

```
Query → Check Response Cache → Hit? Return
              │ Miss? Continue
              ▼
Check Embedding Cache → Hit? Use cached embedding
              │ Miss? Generate & cache
              ▼
Check Search Cache → Hit? Use cached results
              │ Miss? Search & cache
              ▼
Generate Response → Cache & Return
```
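Returning to the rate limiter configured above: the core token-bucket logic is small enough to sketch in full. This is an illustrative single-process implementation, not the internals of the repo's `rate_limiter.py` (which is distributed via Redis):

```python
import time


class TokenBucketRateLimiter:
    """Token bucket: refills at `rate` tokens per minute, up to `capacity`.

    A full bucket at startup allows an initial burst; sustained throughput
    is bounded by the refill rate.
    """

    def __init__(self, rate: float, capacity: int) -> None:
        self.rate = rate / 60.0          # convert to tokens per second
        self.capacity = capacity
        self.tokens = float(capacity)    # start full
        self.updated = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; otherwise reject the request."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(
            self.capacity, self.tokens + (now - self.updated) * self.rate
        )
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Usage mirrors the configuration shown earlier: `TokenBucketRateLimiter(rate=60, capacity=100).allow()` returns `False` once the burst is exhausted faster than tokens refill.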
## Configuration

All settings are provided via environment variables:

```bash
# LLM
LLM_MODEL=gpt-4o-mini
LLM_TEMPERATURE=0.3
LLM_MAX_TOKENS=2000

# Vector Store
PINECONE_INDEX=production-rag
PINECONE_ENV=us-east-1

# Cache
CACHE_ENABLED=true
CACHE_TTL=3600
REDIS_URL=redis://localhost:6379

# Rate Limiting
RATE_LIMIT_RPM=60
RATE_LIMIT_TPM=100000
MAX_CONCURRENT=10

# Monitoring
PROMETHEUS_ENABLED=true
PROMETHEUS_PORT=9090
LOG_LEVEL=INFO
```

## Testing

```bash
# Run tests
pytest tests/ -v

# Load testing
locust -f tests/locustfile.py --host=http://localhost:8000
```

## Scaling Guidelines

| Users | Pods | Redis | Notes |
|---|---|---|---|
| < 100 | 2 | 1 | Development |
| 100-1K | 3-5 | 1 | Small production |
| 1K-10K | 5-10 | 3 (cluster) | Medium production |
| 10K+ | 10+ | Redis Cluster | Large production |
## Security

- API key management via Kubernetes secrets
- Rate limiting prevents abuse
- Circuit breakers prevent cascade failures
- Health checks enable zero-downtime deploys
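The circuit-breaker behavior referenced above (and configured earlier with `failure_threshold`, `timeout_seconds`, and `success_threshold`) amounts to a small state machine. A sketch under those names follows; it is illustrative, not the repo's actual `rate_limiter.py` implementation:

```python
import time
from dataclasses import dataclass


@dataclass
class CircuitBreakerConfig:
    failure_threshold: int = 5     # consecutive failures before opening
    timeout_seconds: float = 30.0  # how long to stay open before probing
    success_threshold: int = 2     # successes in half-open before closing


class CircuitBreaker:
    """CLOSED -> OPEN on repeated failures; OPEN -> HALF_OPEN after the
    timeout; HALF_OPEN -> CLOSED after enough successes, or back to OPEN
    on any failure."""

    def __init__(self, name: str, config: CircuitBreakerConfig) -> None:
        self.name = name
        self.config = config
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        """Gate a call to the protected service (e.g. the LLM API)."""
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.config.timeout_seconds:
                self.state, self.successes = "HALF_OPEN", 0  # probe
                return True
            return False  # fail fast while open
        return True

    def record_success(self) -> None:
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.config.success_threshold:
                self.state, self.failures = "CLOSED", 0
        else:
            self.failures = 0  # reset the consecutive-failure count

    def record_failure(self) -> None:
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.config.failure_threshold:
            self.state = "OPEN"
            self.opened_at = time.monotonic()
```

Failing fast while the breaker is open is what prevents a slow or dead upstream (OpenAI, Pinecone) from tying up every API pod's workers at once.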
## License

MIT License