Document Intelligence System

End-to-end AI system for document parsing, NLP-powered data extraction, and Retrieval-Augmented Generation (RAG). Features hybrid retrieval (dense embeddings + BM25), CrossEncoder reranking, named entity recognition, and a production-ready FastAPI backend with Docker support.


Features

| Feature | Implementation |
| --- | --- |
| Document ingestion | PDF, DOCX, TXT via LangChain loaders |
| Text preprocessing | Whitespace normalization, boilerplate removal |
| Named Entity Recognition | spaCy en_core_web_sm |
| Dense embeddings | sentence-transformers/all-MiniLM-L6-v2 (local) |
| Vector store | ChromaDB (persistent) |
| Sparse retrieval | BM25 (rank-bm25) |
| Hybrid retrieval | Reciprocal Rank Fusion (RRF) |
| Reranking | cross-encoder/ms-marco-MiniLM-L-6-v2 |
| LLM Q&A | OpenAI GPT or local HuggingFace model |
| REST API | FastAPI with OpenAPI docs |
| Monitoring | Structured JSON logs, metrics endpoint, alerting |
| Containerization | Docker multi-stage build + docker-compose |

Quick Start

1. Install dependencies

pip install -r requirements.txt
python -m spacy download en_core_web_sm

2. Configure environment

cp .env.example .env
# Edit .env — set OPENAI_API_KEY (or LLM_PROVIDER=huggingface)

3. Run the API

uvicorn app.main:app --reload

The API is available at http://localhost:8000.
Interactive docs are at http://localhost:8000/docs.


API Endpoints

Ingest a document

curl -X POST http://localhost:8000/api/v1/ingest \
  -F "file=@/path/to/document.pdf"

Semantic search

curl -X POST http://localhost:8000/api/v1/search \
  -H "Content-Type: application/json" \
  -d '{"query": "machine learning", "top_k": 5, "use_hybrid": true, "use_reranker": true}'
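
For callers working from Python rather than the shell, the search call above can be sketched with the standard library alone. The endpoint path and the four request fields come from the curl example; the helper names `build_search_request` and `send` are hypothetical, and the shape of the response is not documented here, so the sketch simply returns whatever JSON the API sends back.

```python
import json
import urllib.request

# Base URL taken from the curl examples; adjust if the API runs elsewhere.
BASE_URL = "http://localhost:8000/api/v1"


def build_search_request(query, top_k=5, use_hybrid=True, use_reranker=True):
    """Build (but do not send) a POST request for the /search endpoint."""
    payload = {
        "query": query,
        "top_k": top_k,
        "use_hybrid": use_hybrid,
        "use_reranker": use_reranker,
    }
    return urllib.request.Request(
        f"{BASE_URL}/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_search_request("machine learning")
# results = send(request)  # uncomment once the API is running locally
```

Separating request construction from sending keeps the payload easy to inspect and test without a running server.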

RAG Q&A

curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the key findings?", "top_k": 5}'

List documents

curl http://localhost:8000/api/v1/documents

Delete a document

curl -X DELETE http://localhost:8000/api/v1/documents/{document_id}

Health & Metrics

curl http://localhost:8000/api/v1/health
curl http://localhost:8000/api/v1/health/deep
curl http://localhost:8000/api/v1/metrics

Docker

cp .env.example .env  # configure your keys
docker compose up --build

The app runs on port 8000. Data (ChromaDB + BM25 index + logs) is persisted to ./data/.


Architecture

Upload → Load → Clean → Chunk → NER → Embed (dense) → ChromaDB
                                   └──────────────→ BM25 index

Query → Hybrid Retrieval (Dense + Sparse, fused via RRF) → CrossEncoder Rerank → LLM → Answer

Hybrid Retrieval Detail

  1. Dense: ChromaDB cosine similarity search using all-MiniLM-L6-v2 embeddings
  2. Sparse: BM25Okapi keyword search across all ingested chunks
  3. Fusion: Reciprocal Rank Fusion (RRF) merges both ranked lists
  4. Reranking: CrossEncoder scores fused candidates against the query
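
The fusion step (3) can be illustrated with a minimal sketch of Reciprocal Rank Fusion. This is not the repository's implementation: the function name, the chunk IDs, and the smoothing constant `k=60` (a commonly used value) are assumptions. Each chunk receives 1 / (k + rank) from every list it appears in, and the summed scores decide the fused order.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60, top_k=5):
    """Fuse several ranked lists of chunk IDs into one list, best first.

    k is the RRF smoothing constant (60 is an assumed, commonly used value);
    top_k mirrors the HYBRID_TOP_K setting.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_k]


# Hypothetical chunk IDs standing in for dense (ChromaDB) and sparse (BM25) hits.
dense_hits = ["c3", "c1", "c7"]
sparse_hits = ["c1", "c9", "c3"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
for chunk_id, score in fused:
    print(f"{chunk_id}: {score:.5f}")
```

Chunks that appear in both lists (`c1`, `c3`) outrank chunks found by only one retriever, which is the behavior hybrid retrieval relies on. Because RRF uses ranks rather than raw scores, cosine similarities and BM25 scores never need to be put on a common scale.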

Configuration (.env)

| Variable | Default | Description |
| --- | --- | --- |
| LLM_PROVIDER | openai | openai or huggingface |
| OPENAI_API_KEY | | Required if using OpenAI |
| OPENAI_MODEL | gpt-4o-mini | OpenAI model name |
| EMBEDDING_MODEL | all-MiniLM-L6-v2 | HuggingFace embedding model |
| RERANKER_MODEL | ms-marco-MiniLM-L-6-v2 | CrossEncoder model |
| CHUNK_SIZE | 512 | Characters per chunk |
| CHUNK_OVERLAP | 64 | Overlap between chunks |
| DENSE_TOP_K | 10 | Dense retrieval candidates |
| SPARSE_TOP_K | 10 | BM25 retrieval candidates |
| HYBRID_TOP_K | 5 | Final results after fusion |
| RERANKER_TOP_K | 5 | Results after reranking |
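
To show how CHUNK_SIZE and CHUNK_OVERLAP interact, here is a simplified fixed-window chunker. It is only a sketch: the project's actual splitter is not shown in this README and may split on separators rather than raw character offsets, and the function name `chunk_text` is hypothetical. With the defaults, each window starts 512 - 64 = 448 characters after the previous one.

```python
def chunk_text(text, chunk_size=512, chunk_overlap=64):
    """Split text into character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between window starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# 1000 characters of repeating alphabet as stand-in document text.
sample = "".join(chr(97 + i % 26) for i in range(1000))
chunks = chunk_text(sample)
print([len(chunk) for chunk in chunks])
```

The last 64 characters of each chunk repeat as the first 64 of the next, so a sentence cut at a window boundary still appears whole in at least one chunk.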

Running Tests

pytest tests/ -v

Project Structure

rag/
├── app/
│   ├── main.py                  # FastAPI app + lifespan
│   ├── config.py                # Settings (pydantic-settings)
│   ├── dependencies.py          # DI helpers
│   ├── api/v1/endpoints/        # REST endpoints
│   ├── core/
│   │   ├── ingestion/           # Load, clean, chunk, NER
│   │   ├── embeddings/          # Dense, sparse, ChromaDB
│   │   ├── retrieval/           # Dense, sparse, hybrid, reranker
│   │   ├── generation/          # LLM, prompts, RAG chain
│   │   └── monitoring/          # Logging, metrics, alerting
│   └── models/                  # Pydantic schemas
├── docker/                      # Dockerfile + entrypoint
├── docker-compose.yml
├── requirements.txt
└── tests/
