# Atlas SAT Assistant: A Production-Grade RAG System with Hybrid Search, Reranking, and Local Embeddings

An intelligent SAT study assistant powered by Retrieval-Augmented Generation (RAG), built for production use with a resilient architecture, local embeddings, and optional advanced reranking.
```mermaid
graph LR
    A[SAT Materials<br/>PDF/Markdown/TXT] --> B[Ingestion Pipeline]
    B --> C[FastEmbed<br/>bge-small-en-v1.5<br/>384-dim]
    C --> D[(Qdrant Vector DB<br/>Hybrid Search)]
    E[Student Query] --> F[Vector Retrieval<br/>Top 25]
    D --> F
    F --> G{Cohere API<br/>Available?}
    G -->|Yes| H[Cohere Rerank<br/>Top 5]
    G -->|No| I[Fallback<br/>Top 5 Pass-through]
    H --> J[Gemini 2.5 Flash<br/>SAT Tutor Persona]
    I --> J
    J --> K[Streamlit UI<br/>Answer + Citations]

    style A fill:#f9f,stroke:#333,stroke-width:2px
    style D fill:#bbf,stroke:#333,stroke-width:2px
    style J fill:#bfb,stroke:#333,stroke-width:2px
    style K fill:#fbb,stroke:#333,stroke-width:2px
```
Gracefully degrades when the Cohere API key is missing: the system automatically falls back to standard top-5 retrieval instead of crashing, with production-ready error handling throughout.
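The fallback behaves roughly like the sketch below. The function shape, node representation, and Cohere model name are assumptions for illustration, not the project's actual `reranker.py`; the key idea is that the Cohere dependency is only touched when a key is configured.

```python
import os

def rerank(query, nodes, top_n=5):
    """Return the top_n nodes, using Cohere reranking when an API key is
    configured and falling back to plain truncation otherwise.

    Hypothetical sketch: `nodes` are (text, score) pairs already sorted by
    retrieval score; the real pipeline presumably wraps LlamaIndex nodes.
    """
    api_key = os.environ.get("COHERE_API_KEY")
    if not api_key:
        # Pass-through mode: keep the retriever's own top-5 ranking.
        return nodes[:top_n]
    import cohere  # imported lazily so the dependency stays optional
    client = cohere.Client(api_key)
    response = client.rerank(
        model="rerank-english-v3.0",  # assumed model; check Cohere's docs
        query=query,
        documents=[text for text, _ in nodes],
        top_n=top_n,
    )
    return [nodes[result.index] for result in response.results]
```

Lazy-importing `cohere` inside the function keeps the package optional at install time, which is what makes the graceful-degradation promise cheap to keep.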
- Uses FastEmbed with `BAAI/bge-small-en-v1.5` for embedding generation
- Zero API calls for embeddings = zero latency, zero cost
- Fully offline-capable for privacy-sensitive deployments
- 384-dimensional vectors optimized for semantic search
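Semantic search over these 384-dimensional vectors boils down to cosine similarity between the query embedding and each stored chunk embedding. A pure-Python illustration of the comparison (Qdrant computes this internally, far faster):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, e.g. the
    384-dim outputs of bge-small-en-v1.5."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # → 0.0
```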
Every answer includes source references with:
- Source filename and section
- Relevance scores (0-100%)
- Direct citation to original study materials
- Built-in trust and verifiability
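The citation fields listed above could be rendered by a small helper like this; the function name and output format are hypothetical, not the project's actual UI code:

```python
def format_citation(filename, section, score):
    """Render one source reference (filename, section, 0-1 relevance
    score) as a human-readable citation line."""
    return f"{filename} ({section}) - relevance {score * 100:.0f}%"

print(format_citation("sat_math_formulas.pdf", "Quadratics", 0.87))
# → sat_math_formulas.pdf (Quadratics) - relevance 87%
```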
- Google Gemini-inspired interface with dark theme
- "Zero State" hero section with clickable suggestion cards
- Real-time system status indicators
- Smooth animations and glass-morphism effects
| Component | Technology | Why? |
|---|---|---|
| Orchestration | LlamaIndex | Context-aware retrieval with rich ecosystem |
| Vector Database | Qdrant | Hybrid search capabilities, cloud-native, high performance |
| Embeddings | FastEmbed (bge-small-en-v1.5) | Local execution, fast ONNX runtime, privacy-focused, zero cost |
| LLM | Google Gemini 2.5 Flash | Long context window (1M tokens), blazing speed, cost-effective |
| Reranking | Cohere Rerank (Optional) | SOTA precision with graceful fallback to standard retrieval |
| UI Framework | Streamlit | Rapid prototyping, native chat components, Python-first |
| Package Manager | uv | 10-100x faster than pip, modern dependency resolution |
| Configuration | Pydantic Settings | Type-safe config with automatic .env loading |
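The Pydantic Settings row describes type-safe config with automatic `.env` loading. As a dependency-free stand-in, the sketch below shows the equivalent behavior with a stdlib dataclass; field names are assumed from the environment-variable table later in this README, and the real `src/core/config.py` presumably uses `pydantic_settings.BaseSettings` instead:

```python
import os
from dataclasses import dataclass, field

@dataclass
class Settings:
    """Stdlib stand-in for the Pydantic Settings class: each field reads
    an environment variable, with typed defaults."""
    google_api_key: str = field(default_factory=lambda: os.environ.get("GOOGLE_API_KEY", ""))
    qdrant_url: str = field(default_factory=lambda: os.environ.get("QDRANT_URL", "http://localhost:6333"))
    embedding_model: str = field(default_factory=lambda: os.environ.get("EMBEDDING_MODEL", "BAAI/bge-small-en-v1.5"))
    embedding_dim: int = field(default_factory=lambda: int(os.environ.get("EMBEDDING_DIM", "384")))
    top_k: int = field(default_factory=lambda: int(os.environ.get("TOP_K", "5")))

settings = Settings()
print(settings.embedding_dim)  # 384 unless EMBEDDING_DIM overrides it
```

Pydantic adds validation and `.env`-file parsing on top of this pattern, which is why the project uses it rather than hand-rolled lookups.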
- Python 3.10 or higher
- uv package manager
- Docker (for local Qdrant, optional)
- Clone the repository

```bash
git clone https://github.com/yourusername/atlas-sat-assistant.git
cd atlas-sat-assistant
```

- Configure environment variables

```bash
cp .env.example .env
```

Edit `.env` and add your API keys:
```bash
# Required
GOOGLE_API_KEY=your_google_api_key_here

# Optional (for better reranking)
COHERE_API_KEY=your_cohere_key_here

# Qdrant Configuration
QDRANT_URL=http://localhost:6333  # or your Qdrant Cloud URL
QDRANT_API_KEY=                   # Optional for local, required for cloud

# Embedding Configuration (defaults are fine)
EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
EMBEDDING_DIM=384
```

- Install dependencies

```bash
uv sync
```

- Start Qdrant (if using local)

```bash
docker-compose up -d
```

Or use Qdrant Cloud and update `QDRANT_URL` and `QDRANT_API_KEY`.
Place your SAT materials (PDF, Markdown, TXT) in `data/raw/`:

```
data/raw/
├── sat_reading_sample.md
├── sat_math_formulas.pdf
└── writing_strategies.txt
```

Run the ingestion pipeline:

```bash
uv run python -m src.scripts.ingest

# Options:
#   --dry-run             Test without upserting to Qdrant
#   --chunking semantic   Use semantic chunking (default)
#   --chunking sentence   Use sentence-based chunking
#   --section reading     Only ingest reading materials
```

Launch the Streamlit app:

```bash
uv run streamlit run src/interface/app.py
```

Open your browser to http://localhost:8501.
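The `--chunking sentence` option mentioned above can be sketched as a greedy packer: split on sentence boundaries, then fill each chunk up to a size budget. This is an illustrative sketch assuming a character budget; the real `src/ingestion/chunking.py` presumably counts tokens against `CHUNK_SIZE=512` and handles edge cases like abbreviations.

```python
import re

def sentence_chunks(text, max_chars=512):
    """Greedy sentence-based chunking: pack whole sentences into chunks
    of at most max_chars characters (an oversized single sentence
    becomes its own chunk)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip() if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)   # budget exceeded: close the chunk
            current = sentence
        else:
            current = candidate      # sentence still fits: keep packing
    if current:
        chunks.append(current)
    return chunks

print(sentence_chunks("One. Two. Three.", max_chars=10))
# → ['One. Two.', 'Three.']
```

Keeping sentences intact (rather than slicing at a fixed offset) is what preserves the retrievability of each chunk as a self-contained statement.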
Test the pipeline from the CLI:

```bash
uv run python -m src.scripts.test_rag "What is the quadratic formula?"
```

Run the RAGAS evaluation:

```bash
# Install evaluation dependencies
pip install langchain-google-genai langchain-community

# Run RAGAS evaluation
uv run python -m src.evaluation.evaluate_ragas
```

Results are saved to `evaluation_results/`:

- `inference_results_*.csv`: questions, answers, ground truth
- `ragas_metrics_*.csv`: context precision, faithfulness scores
```
atlas-sat-assistant/
├── data/
│   ├── raw/                  # Input SAT materials (PDF, MD, TXT)
│   └── processed/            # Processed chunks (optional)
│
├── src/
│   ├── core/
│   │   ├── config.py         # Pydantic settings with .env loading
│   │   └── logging.py        # Centralized logging configuration
│   │
│   ├── ingestion/
│   │   ├── loader.py         # Custom SAT document loaders
│   │   ├── chunking.py       # Semantic + sentence chunking strategies
│   │   └── vector_db.py      # Qdrant client with retry logic
│   │
│   ├── rag/
│   │   ├── retriever.py      # Vector retrieval with FastEmbed (top 25)
│   │   ├── reranker.py       # Optional Cohere reranking (top 5)
│   │   └── engine.py         # Query engine orchestrating full pipeline
│   │
│   ├── interface/
│   │   └── app.py            # Modern Streamlit UI (Gemini-style)
│   │
│   ├── evaluation/
│   │   └── evaluate_ragas.py # RAGAS evaluation with Gemini LLM
│   │
│   └── scripts/
│       ├── ingest.py         # CLI ingestion with progress tracking
│       └── test_rag.py       # CLI RAG testing
│
├── evaluation_results/       # RAGAS evaluation outputs
├── docker-compose.yaml       # Local Qdrant setup
├── pyproject.toml            # uv project configuration
├── .env.example              # Environment variables template
└── README.md                 # You are here!
```
| Variable | Required | Default | Description |
|---|---|---|---|
| `GOOGLE_API_KEY` | Yes | - | Google API key for Gemini LLM |
| `COHERE_API_KEY` | No | - | Cohere key for reranking (graceful fallback) |
| `QDRANT_URL` | Yes | `http://localhost:6333` | Qdrant server URL |
| `QDRANT_API_KEY` | No | - | Required for Qdrant Cloud |
| `COLLECTION_NAME` | No | `sat_prep` | Qdrant collection name |
| `EMBEDDING_MODEL` | No | `BAAI/bge-small-en-v1.5` | FastEmbed model |
| `EMBEDDING_DIM` | No | `384` | Vector dimension |
| `CHUNK_SIZE` | No | `512` | Text chunk size |
| `TOP_K` | No | `5` | Number of results to retrieve |
```bash
# Unit tests (if implemented)
pytest tests/

# Lint and format
ruff check src/
black src/
```

- New Document Types: extend `src/ingestion/loader.py`
- Custom Chunking: add strategies to `src/ingestion/chunking.py`
- Different LLMs: modify `src/rag/engine.py`
- UI Customization: edit the CSS in `src/interface/app.py`
- Ingestion: ~50 documents/min with semantic chunking
- Query Latency:
  - Retrieval: ~200ms (local FastEmbed)
  - Reranking: ~300ms (Cohere API)
  - Generation: ~800ms (Gemini 2.5 Flash)
  - Total: ~1.3s per query
- Embedding Cost: $0 (local FastEmbed)
- LLM Cost: ~$0.0001 per query (Gemini pricing)
Average scores on golden SAT dataset:
- Context Precision: 0.85 (85% relevant contexts retrieved)
- Faithfulness: 0.92 (92% answers grounded in context)
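Context precision reduces to a ratio of relevant to retrieved contexts. The sketch below is a simplified, non-LLM stand-in for illustration only; RAGAS's actual metrics use an LLM judge to decide relevance rather than set membership:

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved contexts judged relevant. Simplified
    stand-in: `relevant` is a known ground-truth set, whereas RAGAS
    delegates the relevance judgment to an LLM."""
    if not retrieved:
        return 0.0
    return sum(1 for context in retrieved if context in relevant) / len(retrieved)

# 17 of 20 retrieved contexts relevant → a score of 0.85, matching the
# shape of the Context Precision figure reported above.
print(context_precision(list(range(20)), set(range(17))))  # → 0.85
```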
1. Gemini 404 Error

   ```
   Error: 404 Model is not found: models/gemini-1.5-flash
   ```

   Solution: update the model name to `gemini-2.5-flash` (the latest stable version).

2. Qdrant Connection Failed

   ```
   Error: Failed to connect to Qdrant at http://localhost:6333
   ```

   Solution: start Qdrant with `docker-compose up -d`, or check your cloud credentials.

3. Missing Cohere Reranker

   ```
   WARNING: Cohere API key not found. Reranking will use pass-through mode
   ```

   Solution: this is not an error! The system works in fallback mode; add `COHERE_API_KEY` for better results.

4. Empty Retrieval Results

   ```
   WARNING: No nodes retrieved for query
   ```

   Solution: run ingestion first: `python -m src.scripts.ingest`