Document Intelligence System

End-to-end AI system for document parsing, NLP-powered data extraction, and Retrieval-Augmented Generation (RAG). Features hybrid retrieval (dense embeddings + BM25), CrossEncoder reranking, named entity recognition, and a production-ready FastAPI backend with Docker support.

Features

Feature	Implementation
Document ingestion	PDF, DOCX, TXT via LangChain loaders
Text preprocessing	Whitespace normalization, boilerplate removal
Named Entity Recognition	spaCy `en_core_web_sm`
Dense embeddings	`sentence-transformers/all-MiniLM-L6-v2` (local)
Vector store	ChromaDB (persistent)
Sparse retrieval	BM25 (`rank-bm25`)
Hybrid retrieval	Reciprocal Rank Fusion (RRF)
Reranking	`cross-encoder/ms-marco-MiniLM-L-6-v2`
LLM Q&A	OpenAI GPT or local HuggingFace model
REST API	FastAPI with OpenAPI docs
Monitoring	Structured JSON logs, metrics endpoint, alerting
Containerization	Docker multi-stage build + docker-compose

Quick Start

1. Install dependencies

pip install -r requirements.txt
python -m spacy download en_core_web_sm

2. Configure environment

cp .env.example .env
# Edit .env — set OPENAI_API_KEY (or LLM_PROVIDER=huggingface)

3. Run the API

uvicorn app.main:app --reload

API available at http://localhost:8000
Interactive docs at http://localhost:8000/docs

API Endpoints

Ingest a document

curl -X POST http://localhost:8000/api/v1/ingest \
  -F "file=@/path/to/document.pdf"

Semantic search

curl -X POST http://localhost:8000/api/v1/search \
  -H "Content-Type: application/json" \
  -d '{"query": "machine learning", "top_k": 5, "use_hybrid": true, "use_reranker": true}'
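The same search call can be made from Python. The sketch below uses only the standard library and mirrors the curl request above. The helper names `build_search_request` and `search` are hypothetical, and the shape of the JSON response is not documented here, so the result is returned as parsed JSON without further assumptions.

```python
import json
import urllib.request

# Base URL taken from the curl examples in this README.
BASE_URL = "http://localhost:8000/api/v1"


def build_search_request(query, top_k=5, use_hybrid=True, use_reranker=True):
    """Build a POST request for the /search endpoint (hypothetical helper)."""
    payload = {
        "query": query,
        "top_k": top_k,
        "use_hybrid": use_hybrid,
        "use_reranker": use_reranker,
    }
    return urllib.request.Request(
        f"{BASE_URL}/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, **options):
    """Send the request and return the parsed JSON body (schema not assumed)."""
    request = build_search_request(query, **options)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Building the request does not touch the network; calling search() does.
request = build_search_request("machine learning")
```

With the API running locally, `search("machine learning", top_k=3)` sends the request and returns whatever JSON the endpoint responds with.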

RAG Q&A

curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the key findings?", "top_k": 5}'

List documents

curl http://localhost:8000/api/v1/documents

Delete a document

curl -X DELETE http://localhost:8000/api/v1/documents/{document_id}

Health & Metrics

curl http://localhost:8000/api/v1/health
curl http://localhost:8000/api/v1/health/deep
curl http://localhost:8000/api/v1/metrics

Docker

cp .env.example .env  # configure your keys
docker compose up --build

The app runs on port 8000. Data (ChromaDB + BM25 index + logs) is persisted to ./data/.

Architecture

Upload → Load → Clean → Chunk → NER → Embed (dense) → ChromaDB
                                   └──────────────→ BM25 index

Query → Hybrid Retrieval (Dense + Sparse, fused via RRF) → CrossEncoder Rerank → LLM → Answer

Hybrid Retrieval Detail

Dense: ChromaDB cosine similarity search using all-MiniLM-L6-v2 embeddings
Sparse: BM25Okapi keyword search across all ingested chunks
Fusion: Reciprocal Rank Fusion (RRF) merges both ranked lists
Reranking: CrossEncoder scores fused candidates against the query
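The fusion step can be illustrated with a minimal Reciprocal Rank Fusion sketch. Each chunk receives a score of 1 / (k + rank) from every ranked list it appears in, and the scores are summed. This is not necessarily how the project implements it: the function name, the chunk IDs, and the constant `k=60` (a commonly used default for RRF) are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_k=5):
    """Fuse ranked lists of chunk IDs into one list of (chunk_id, score).

    k=60 is an assumed smoothing constant, not a value taken from this project.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the best hit in each list contributes 1 / (k + 1).
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk ID for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_k]


# Hypothetical chunk IDs as returned by the dense and sparse retrievers.
dense = ["c3", "c1", "c7"]
sparse = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([dense, sparse])
```

Chunks that appear in both lists (`c1` and `c3` here) end up ahead of chunks found by only one retriever, which is the main reason to fuse ranks rather than raw scores: cosine similarities and BM25 scores are on different scales, while ranks are directly comparable.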

Configuration (`.env`)

Variable	Default	Description
`LLM_PROVIDER`	`openai`	`openai` or `huggingface`
`OPENAI_API_KEY`	—	Required if using OpenAI
`OPENAI_MODEL`	`gpt-4o-mini`	OpenAI model name
`EMBEDDING_MODEL`	`all-MiniLM-L6-v2`	HuggingFace embedding model
`RERANKER_MODEL`	`ms-marco-MiniLM-L-6-v2`	CrossEncoder model
`CHUNK_SIZE`	`512`	Characters per chunk
`CHUNK_OVERLAP`	`64`	Overlap between chunks
`DENSE_TOP_K`	`10`	Dense retrieval candidates
`SPARSE_TOP_K`	`10`	BM25 retrieval candidates
`HYBRID_TOP_K`	`5`	Final results after fusion
`RERANKER_TOP_K`	`5`	Results after reranking
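To show how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, here is a minimal sliding-window sketch that splits text by character count. It is an assumption for illustration only: the project's actual chunker may split on sentence or paragraph boundaries instead, and the function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=512, chunk_overlap=64):
    """Split text into character windows that overlap by chunk_overlap.

    Defaults mirror CHUNK_SIZE and CHUNK_OVERLAP from the table above.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # each window starts this far after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# A 1000-character sample produces windows starting at 0, 448 and 896.
sample = "".join(chr(97 + i % 26) for i in range(1000))
chunks = chunk_text(sample)
```

With the defaults, each chunk repeats the last 64 characters of the previous one, so a sentence that straddles a boundary is still fully contained in at least one chunk as long as it is shorter than the overlap.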

Running Tests

pytest tests/ -v

Project Structure

rag/
├── app/
│   ├── main.py                  # FastAPI app + lifespan
│   ├── config.py                # Settings (pydantic-settings)
│   ├── dependencies.py          # DI helpers
│   ├── api/v1/endpoints/        # REST endpoints
│   ├── core/
│   │   ├── ingestion/           # Load, clean, chunk, NER
│   │   ├── embeddings/          # Dense, sparse, ChromaDB
│   │   ├── retrieval/           # Dense, sparse, hybrid, reranker
│   │   ├── generation/          # LLM, prompts, RAG chain
│   │   └── monitoring/          # Logging, metrics, alerting
│   └── models/                  # Pydantic schemas
├── docker/                      # Dockerfile + entrypoint
├── docker-compose.yml
├── requirements.txt
└── tests/
