Shameendra/Scalable_RAG

🚀 Production-Grade Scalable RAG System

A production-ready RAG system built with DevOps and MLOps best practices, including caching, rate limiting, circuit breakers, monitoring, and Kubernetes deployment.

🎯 Key Features

  • Distributed Caching: Redis-based caching for embeddings and query results
  • Rate Limiting: Token bucket and sliding window rate limiters
  • Circuit Breakers: Fault tolerance for external service calls
  • Prometheus Metrics: Comprehensive observability
  • Health Checks: Kubernetes-ready liveness and readiness probes
  • Horizontal Scaling: Auto-scaling based on load
  • Async Support: High-concurrency request handling

πŸ—οΈ Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        Load Balancer                                │
└─────────────────────────┬───────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    Kubernetes Cluster                               │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                    RAG API Pods (HPA)                       │    │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐         │    │
│  │  │  Pod 1  │  │  Pod 2  │  │  Pod 3  │  │  Pod N  │         │    │
│  │  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘         │    │
│  └───────┼────────────┼────────────┼────────────┼──────────────┘    │
│          │            │            │            │                   │
│          ▼            ▼            ▼            ▼                   │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                    Service Mesh                             │    │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │    │
│  │  │ Rate Limiter │  │   Circuit    │  │    Cache     │       │    │
│  │  │              │  │   Breaker    │  │   (Redis)    │       │    │
│  │  └──────────────┘  └──────────────┘  └──────────────┘       │    │
│  └─────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────┘
                          │
          ┌───────────────┼───────────────┐
          ▼               ▼               ▼
    ┌──────────┐   ┌──────────┐   ┌───────────┐
    │  OpenAI  │   │ Pinecone │   │ Prometheus│
    │   API    │   │  Vector  │   │  Grafana  │
    └──────────┘   └──────────┘   └───────────┘

πŸ“ Project Structure

scalable-rag/
├── config.py              # Configuration management
├── cache.py               # Redis caching layer
├── rate_limiter.py        # Rate limiting & circuit breakers
├── monitoring.py          # Prometheus metrics & health checks
├── rag_engine.py          # Core RAG implementation
├── main.py                # FastAPI application
├── Dockerfile             # Container image
├── kubernetes/
│   └── deployment.yaml    # K8s manifests
├── requirements.txt       # Dependencies
└── README.md              # Documentation

🚀 Quick Start

Local Development

# Install dependencies
pip install -r requirements.txt

# Set environment variables
export OPENAI_API_KEY=sk-...
export PINECONE_API_KEY=...
export REDIS_URL=redis://localhost:6379

# Start Redis (Docker)
docker run -d -p 6379:6379 redis:alpine

# Run the server
python main.py --port 8000 --reload

Docker

# Build image
docker build -t scalable-rag .

# Run container
docker run -p 8000:8000 \
  -e OPENAI_API_KEY=sk-... \
  -e PINECONE_API_KEY=... \
  -e REDIS_URL=redis://redis:6379 \
  scalable-rag

Kubernetes

# Create secrets
kubectl create secret generic rag-secrets \
  --from-literal=openai-api-key=sk-... \
  --from-literal=pinecone-api-key=...

# Deploy
kubectl apply -f kubernetes/deployment.yaml

# Check status
kubectl get pods -l app=scalable-rag

📡 API Endpoints

Query

POST /query
{
  "question": "What is machine learning?",
  "k": 5,
  "use_cache": true
}
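
For reference, a minimal client sketch that builds this request using only the standard library (the payload fields mirror the example above; the base URL and helper name are assumptions for illustration):

```python
import json
import urllib.request

def build_query_request(base_url, question, k=5, use_cache=True):
    """Build a POST /query request object (hypothetical client helper)."""
    payload = {"question": question, "k": k, "use_cache": use_cache}
    return urllib.request.Request(
        f"{base_url}/query",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_query_request("http://localhost:8000", "What is machine learning?")
# Send with urllib.request.urlopen(req) once the server is running.
```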

Health Checks

GET /health        # Full health status
GET /health/live   # Liveness probe
GET /health/ready  # Readiness probe
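
One way the full /health status can be aggregated is to run a named check per dependency and mark the service degraded if any fails. The sketch below is an illustrative pattern, not this project's actual monitoring.py:

```python
def health_status(checks):
    """Run named dependency checks; any failure marks the service degraded.

    `checks` maps a dependency name (e.g. "redis") to a zero-arg callable
    that returns truthy when the dependency is reachable.
    """
    results = {}
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a crashing check counts as down, not as a 500
        results[name] = "up" if ok else "down"
    overall = "healthy" if all(v == "up" for v in results.values()) else "degraded"
    return {"status": overall, "checks": results}

status = health_status({"redis": lambda: True, "pinecone": lambda: True})
```

The liveness probe would skip dependency checks entirely (the process is alive), while readiness would gate on this aggregate.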

Metrics

GET /metrics       # Prometheus metrics
GET /status        # System status + circuit breakers

Documents

POST /documents
{
  "texts": ["Document 1 content", "Document 2 content"],
  "metadatas": [{"source": "doc1"}, {"source": "doc2"}]
}

📊 Monitoring

Prometheus Metrics

Metric                         Type       Description
rag_requests_total             Counter    Total requests by endpoint
rag_request_latency_seconds    Histogram  Request latency
rag_llm_calls_total            Counter    LLM API calls
rag_llm_tokens_total           Counter    Token usage
rag_cache_hits_total           Counter    Cache hits
rag_errors_total               Counter    Errors by type
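
To illustrate the labeling scheme behind these metrics, here is a tiny pure-Python stand-in for a labeled counter. The real service would use prometheus_client Counter objects; this class is only a sketch of the semantics:

```python
from collections import defaultdict

class LabeledCounter:
    """Minimal stand-in for a Prometheus counter with labels (illustrative)."""

    def __init__(self, name):
        self.name = name
        self.values = defaultdict(float)  # label set -> running total

    def inc(self, amount=1, **labels):
        # Sort labels so {"a": 1, "b": 2} and {"b": 2, "a": 1} share a series.
        self.values[tuple(sorted(labels.items()))] += amount

requests_total = LabeledCounter("rag_requests_total")
requests_total.inc(endpoint="/query")
requests_total.inc(endpoint="/query")
requests_total.inc(endpoint="/documents")
```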

Grafana Dashboard

Import the provided dashboard for visualizations:

  • Request rate and latency
  • Cache hit ratio
  • LLM token usage
  • Error rates
  • Circuit breaker states

πŸ›‘οΈ Fault Tolerance

Rate Limiting

# Token bucket for smooth limiting
rate_limiter = TokenBucketRateLimiter(
    rate=60,        # tokens per minute
    capacity=100    # burst capacity
)

# Sliding window for strict limits
rate_limiter = SlidingWindowRateLimiter(
    limit=60,
    window_seconds=60
)
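
A minimal token-bucket implementation consistent with the parameters above. This is an illustrative sketch, not the project's rate_limiter.py; `rate` is interpreted as tokens per minute, matching the comment in the snippet:

```python
import time

class TokenBucketRateLimiter:
    """Sketch of a token bucket: refills continuously, allows bursts up to capacity."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.refill_per_sec = rate / 60.0   # rate is tokens per minute
        self.capacity = capacity
        self.tokens = float(capacity)       # start full to permit an initial burst
        self.clock = clock
        self.last = clock()

    def allow(self, cost=1):
        now = self.clock()
        # Add tokens for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

limiter = TokenBucketRateLimiter(rate=60, capacity=3)
results = [limiter.allow() for _ in range(4)]  # burst of 4 against capacity 3
```

The token bucket smooths traffic while tolerating bursts; the sliding window variant shown above instead counts requests in a fixed trailing window for a strict cap.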

Circuit Breaker

# Protects against cascading failures
circuit = CircuitBreaker(
    name="llm",
    config=CircuitBreakerConfig(
        failure_threshold=5,    # Open after 5 failures
        timeout_seconds=30,     # Try again after 30s
        success_threshold=2     # Close after 2 successes
    )
)
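
The closed → open → half-open cycle can be sketched as follows. This is an illustrative implementation; only the configuration field names are taken from the snippet above:

```python
import time

class CircuitBreaker:
    """Illustrative circuit breaker: closed (normal), open (fail fast), half-open (probe)."""

    def __init__(self, name, failure_threshold=5, timeout_seconds=30,
                 success_threshold=2, clock=time.monotonic):
        self.name = name
        self.failure_threshold = failure_threshold
        self.timeout_seconds = timeout_seconds
        self.success_threshold = success_threshold
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.timeout_seconds:
                self.state = "half_open"  # timeout elapsed: let a probe through
            else:
                raise RuntimeError(f"circuit '{self.name}' is open")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            self.successes = 0
            # A failed probe, or too many failures, (re)opens the circuit.
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"  # enough probes succeeded: resume normally
                self.failures = 0
        else:
            self.failures = 0  # any success in closed state resets the count
        return result

breaker = CircuitBreaker("llm", failure_threshold=2)
for _ in range(2):
    try:
        breaker.call(lambda: 1 / 0)  # two failures trip the breaker
    except ZeroDivisionError:
        pass
```

While open, calls fail immediately instead of hammering the struggling dependency, which is what stops a slow LLM or vector-store outage from cascading.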

Caching Strategy

Query → Check Response Cache → Hit? Return
                             → Miss? Continue
        ↓
        Check Embedding Cache → Hit? Use cached embedding
                              → Miss? Generate & cache
        ↓
        Check Search Cache → Hit? Use cached results
                           → Miss? Search & cache
        ↓
        Generate Response → Cache & Return
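
The flow above can be sketched with plain dicts standing in for the Redis layers; `embed`, `search`, and `generate` are placeholder callables, not the project's real functions:

```python
def get_or_compute(cache, key, compute):
    """Return a cached value, computing and storing it only on a miss."""
    if key not in cache:
        cache[key] = compute()
    return cache[key]

def answer_query(question, caches, embed, search, generate):
    """Walk the cache layers from cheapest (full response) to most expensive."""
    response_cache, embedding_cache, search_cache = caches
    if question in response_cache:
        return response_cache[question]          # full hit: nothing else runs
    vector = get_or_compute(embedding_cache, question, lambda: embed(question))
    docs = get_or_compute(search_cache, question, lambda: search(vector))
    answer = generate(question, docs)
    response_cache[question] = answer            # populate for the next caller
    return answer

calls = []
caches = ({}, {}, {})
first = answer_query(
    "q", caches,
    embed=lambda q: calls.append("embed") or [0.1],
    search=lambda v: calls.append("search") or ["doc"],
    generate=lambda q, d: calls.append("generate") or "answer",
)
second = answer_query(  # response-cache hit: no pipeline stage runs again
    "q", caches,
    embed=lambda q: calls.append("embed") or [0.1],
    search=lambda v: calls.append("search") or ["doc"],
    generate=lambda q, d: calls.append("generate") or "answer",
)
```

A repeated question short-circuits at the first layer; a paraphrased question may still reuse the cached embedding or search results from the deeper layers.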

🔧 Configuration

All settings via environment variables:

# LLM
LLM_MODEL=gpt-4o-mini
LLM_TEMPERATURE=0.3
LLM_MAX_TOKENS=2000

# Vector Store
PINECONE_INDEX=production-rag
PINECONE_ENV=us-east-1

# Cache
CACHE_ENABLED=true
CACHE_TTL=3600
REDIS_URL=redis://localhost:6379

# Rate Limiting
RATE_LIMIT_RPM=60
RATE_LIMIT_TPM=100000
MAX_CONCURRENT=10

# Monitoring
PROMETHEUS_ENABLED=true
PROMETHEUS_PORT=9090
LOG_LEVEL=INFO
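
A sketch of how config.py might read these variables into typed settings. The defaults mirror the values above; the function name and plain-dict return type are assumptions (the real module may use pydantic or similar):

```python
import os

def load_settings(env=os.environ):
    """Read a subset of the environment variables above with typed defaults."""
    return {
        "llm_model": env.get("LLM_MODEL", "gpt-4o-mini"),
        "llm_temperature": float(env.get("LLM_TEMPERATURE", "0.3")),
        "llm_max_tokens": int(env.get("LLM_MAX_TOKENS", "2000")),
        "cache_enabled": env.get("CACHE_ENABLED", "true").lower() == "true",
        "cache_ttl": int(env.get("CACHE_TTL", "3600")),
        "redis_url": env.get("REDIS_URL", "redis://localhost:6379"),
        "rate_limit_rpm": int(env.get("RATE_LIMIT_RPM", "60")),
    }

# Overrides can be passed as a plain mapping, handy for tests.
settings = load_settings({"LLM_TEMPERATURE": "0.0", "CACHE_ENABLED": "false"})
```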

🧪 Testing

# Run tests
pytest tests/ -v

# Load testing
locust -f tests/locustfile.py --host=http://localhost:8000

📈 Scaling Guidelines

Users     Pods    Redis           Notes
< 100     2       1               Development
100-1K    3-5     1               Small production
1K-10K    5-10    3 (cluster)     Medium production
10K+      10+     Redis Cluster   Large production

πŸ” Security

  • API key management via Kubernetes secrets
  • Rate limiting prevents abuse
  • Circuit breakers prevent cascade failures
  • Health checks enable zero-downtime deploys

πŸ“ License

MIT License

