End-to-end AI system for document parsing, NLP-powered data extraction, and Retrieval-Augmented Generation (RAG). Features hybrid retrieval (dense embeddings + BM25), CrossEncoder reranking, named entity recognition, and a production-ready FastAPI backend with Docker support.
| Feature | Implementation |
|---|---|
| Document ingestion | PDF, DOCX, TXT via LangChain loaders |
| Text preprocessing | Whitespace normalization, boilerplate removal |
| Named Entity Recognition | spaCy en_core_web_sm |
| Dense embeddings | sentence-transformers/all-MiniLM-L6-v2 (local) |
| Vector store | ChromaDB (persistent) |
| Sparse retrieval | BM25 (rank-bm25) |
| Hybrid retrieval | Reciprocal Rank Fusion (RRF) |
| Reranking | cross-encoder/ms-marco-MiniLM-L-6-v2 |
| LLM Q&A | OpenAI GPT or local HuggingFace model |
| REST API | FastAPI with OpenAPI docs |
| Monitoring | Structured JSON logs, metrics endpoint, alerting |
| Containerization | Docker multi-stage build + docker-compose |

    pip install -r requirements.txt
    python -m spacy download en_core_web_sm

    cp .env.example .env
    # Edit .env: set OPENAI_API_KEY (or LLM_PROVIDER=huggingface)

    uvicorn app.main:app --reload

The API is available at http://localhost:8000, with interactive docs at http://localhost:8000/docs.

    curl -X POST http://localhost:8000/api/v1/ingest \
      -F "file=@/path/to/document.pdf"

    curl -X POST http://localhost:8000/api/v1/search \
      -H "Content-Type: application/json" \
      -d '{"query": "machine learning", "top_k": 5, "use_hybrid": true, "use_reranker": true}'

    curl -X POST http://localhost:8000/api/v1/query \
      -H "Content-Type: application/json" \
      -d '{"question": "What are the key findings?", "top_k": 5}'

    curl http://localhost:8000/api/v1/documents

    curl -X DELETE http://localhost:8000/api/v1/documents/{document_id}

    curl http://localhost:8000/api/v1/health
    curl http://localhost:8000/api/v1/health/deep
    curl http://localhost:8000/api/v1/metrics

With Docker:

    cp .env.example .env   # configure your keys
    docker compose up --build

The app runs on port 8000. Data (ChromaDB + BM25 index + logs) is persisted to ./data/.

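
The `/api/v1/search` request above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_search_request` and `search` are hypothetical, and the shape of the JSON response is not documented here, so the sketch simply returns whatever the server sends back.

```python
import json
import urllib.request

# Base URL taken from the curl examples; adjust if the app runs elsewhere.
BASE_URL = "http://localhost:8000/api/v1"


def build_search_request(query, top_k=5, use_hybrid=True, use_reranker=True):
    """Build (but do not send) a POST request mirroring the curl search example."""
    payload = {
        "query": query,
        "top_k": top_k,
        "use_hybrid": use_hybrid,
        "use_reranker": use_reranker,
    }
    return urllib.request.Request(
        f"{BASE_URL}/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, **options):
    """Send the search request and decode the JSON body (requires a running server)."""
    with urllib.request.urlopen(build_search_request(query, **options)) as response:
        return json.load(response)
```

With the server running, `search("machine learning")` sends the same body as the curl command. Keeping request construction separate from sending makes the payload easy to inspect without a live backend.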
Upload → Load → Clean → Chunk → NER → Embed (dense) → ChromaDB
└──────────────→ BM25 index
Query → Hybrid Retrieval (Dense + Sparse, fused via RRF) → CrossEncoder Rerank → LLM → Answer
- Dense: ChromaDB cosine similarity search using all-MiniLM-L6-v2 embeddings
- Sparse: BM25Okapi keyword search across all ingested chunks
- Fusion: Reciprocal Rank Fusion (RRF) merges both ranked lists
- Reranking: CrossEncoder scores fused candidates against the query
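
To illustrate the fusion step, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of chunk IDs. It is not the project's actual implementation: the function name, the chunk IDs, and the constant `k=60` (a commonly used value for RRF) are assumptions, since this README does not state them. Each chunk scores `1 / (k + rank)` in every list where it appears, and the scores are summed.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60, top_k=5):
    """Fuse several ranked lists of chunk IDs into one list of (id, score) pairs."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based: the top hit in a list contributes 1 / (k + 1).
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a deterministic order.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_k]


# Hypothetical results from the dense and sparse retrievers.
dense_hits = ["c3", "c1", "c7"]
sparse_hits = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Chunks that appear in both lists ("c1" and "c3" here) end up ahead of chunks found by only one retriever. RRF needs only ranks, so the dense similarity scores and BM25 scores never have to be put on a common scale.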
| Variable | Default | Description |
|---|---|---|
| `LLM_PROVIDER` | `openai` | `openai` or `huggingface` |
| `OPENAI_API_KEY` | (none) | Required if using OpenAI |
| `OPENAI_MODEL` | `gpt-4o-mini` | OpenAI model name |
| `EMBEDDING_MODEL` | `all-MiniLM-L6-v2` | HuggingFace embedding model |
| `RERANKER_MODEL` | `ms-marco-MiniLM-L-6-v2` | CrossEncoder model |
| `CHUNK_SIZE` | `512` | Characters per chunk |
| `CHUNK_OVERLAP` | `64` | Overlap between chunks |
| `DENSE_TOP_K` | `10` | Dense retrieval candidates |
| `SPARSE_TOP_K` | `10` | BM25 retrieval candidates |
| `HYBRID_TOP_K` | `5` | Final results after fusion |
| `RERANKER_TOP_K` | `5` | Results after reranking |
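
To show how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, the sketch below splits text with a plain character-based sliding window. This is an illustration only: the project's own splitter may behave differently (for example by preferring sentence or paragraph boundaries), and `chunk_text` is a hypothetical helper. The two values are read from the environment with the defaults listed in the table.

```python
import os


def chunk_text(text, chunk_size=512, chunk_overlap=64):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Defaults mirror the configuration table above.
chunk_size = int(os.environ.get("CHUNK_SIZE", "512"))
chunk_overlap = int(os.environ.get("CHUNK_OVERLAP", "64"))
chunks = chunk_text("x" * 1000, chunk_size, chunk_overlap)
```

With the defaults, each window advances 448 characters (512 minus 64), so a 1000-character document yields three chunks. A larger overlap keeps more context across chunk boundaries but produces more chunks to embed and index.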

    pytest tests/ -v

rag/
├── app/
│ ├── main.py # FastAPI app + lifespan
│ ├── config.py # Settings (pydantic-settings)
│ ├── dependencies.py # DI helpers
│ ├── api/v1/endpoints/ # REST endpoints
│ ├── core/
│ │ ├── ingestion/ # Load, clean, chunk, NER
│ │ ├── embeddings/ # Dense, sparse, ChromaDB
│ │ ├── retrieval/ # Dense, sparse, hybrid, reranker
│ │ ├── generation/ # LLM, prompts, RAG chain
│ │ └── monitoring/ # Logging, metrics, alerting
│ └── models/ # Pydantic schemas
├── docker/ # Dockerfile + entrypoint
├── docker-compose.yml
├── requirements.txt
└── tests/