Production RAG AI Chat Agent

A production-grade retrieval augmented generation (RAG) system built with FastAPI, LangChain, Qdrant, and OpenRouter. This project demonstrates a scalable, containerized architecture for ingesting PDF documents and performing context-aware chat operations with advanced features including streaming responses, conversational memory, reranking, and comprehensive observability.

🚀 Features

Core RAG Pipeline

PDF Ingestion — Upload and process PDF documents with SHA256 deduplication
Semantic Search — Vector-based retrieval using sentence-transformers (all-MiniLM-L6-v2)
Context-Aware Chat — Ask questions about ingested documents with source attribution

Advanced Features

Streaming Responses — Real-time token streaming for improved UX
Conversational Memory — Multi-turn conversations with session management
Reranking — Two-stage retrieval with cross-encoder reranking for 20-30% quality improvement
Multi-Collection Support — Data isolation per user/project/tenant
Observability — Built-in metrics, performance tracking, and optional LangSmith tracing

Architecture Overview

graph TB
    User[User] -->|POST /ingest| Ingest[Ingest Pipeline]
    User -->|POST /chat| Chat[Chat Pipeline]
    User -->|POST /chat/stream| Stream[Streaming Pipeline]

    subgraph "Ingest Pipeline"
        Ingest -->|PDF| Parser[PyMuPDF Parser]
        Parser -->|Raw Text| Chunker[RecursiveCharacterTextSplitter]
        Chunker -->|Chunks| Hash[SHA256 Hasher]
        Hash -->|Deduplicated| Embedder[SentenceTransformer]
        Embedder -->|Vectors| VectorDB[(Qdrant)]
    end

    subgraph "RAG Pipeline with Memory"
        Chat -->|Query| Memory[Session Memory]
        Memory -->|History| Retriever[Vector Search]
        Retriever -->|Top-K Docs| VectorDB
        VectorDB -->|Context| Reranker[Cross-Encoder Reranker]
        Reranker -->|Top-N Best| LLM[LLM OpenRouter]
        LLM -->|Response| Memory
        Memory -->|Answer| User
    end

    subgraph "Observability"
        Chat -.->|Metrics| Metrics[Metrics Collector]
        Stream -.->|Metrics| Metrics
        Metrics -->|API| Dashboard["/metrics/summary"]
    end

Data Flow

Ingestion:

PDF → Parse → Chunk → Hash (dedup) → Embed (384-dim) → Qdrant

Query (with all features):

Query → Embed → Qdrant Search (top-10) → Rerank (top-3) →
Load Session Memory → LLM Generate (with history) →
Save to Memory → Stream Response → Record Metrics

Project Structure

rag-agent/
├── app/
│   ├── api/
│   │   ├── routes/
│   │   │   ├── ingest.py         # Document ingestion
│   │   │   ├── chat.py           # Standard chat endpoint
│   │   │   ├── stream.py         # Streaming chat endpoint
│   │   │   ├── collections.py    # Collection management
│   │   │   ├── sessions.py       # Session/memory management
│   │   │   └── metrics.py        # Observability endpoints
│   │   └── dependencies.py       # FastAPI DI
│   ├── core/
│   │   ├── config.py             # Settings management
│   │   ├── logging.py            # Structured logging
│   │   ├── metrics.py            # Metrics collection
│   │   └── tracing.py            # LangSmith integration
│   ├── pipelines/
│   │   ├── ingest.py             # Ingestion orchestration
│   │   └── rag.py                # RAG pipeline with memory
│   ├── services/
│   │   ├── parser.py             # PDF text extraction
│   │   ├── chunker.py            # Text splitting
│   │   ├── embedder.py           # Embedding generation
│   │   ├── reranker.py           # Cross-encoder reranking
│   │   ├── hasher.py             # SHA256 deduplication
│   │   ├── memory.py             # Conversation memory
│   │   └── vector_store.py       # Qdrant operations
│   ├── schemas/
│   │   ├── chat.py               # Chat request/response models
│   │   ├── ingest.py             # Ingestion models
│   │   ├── collections.py        # Collection models
│   │   └── sessions.py           # Session models
│   ├── main.py                   # FastAPI application
│   ├── Dockerfile
│   └── requirements.txt
├── qdrant_storage/               # Persisted vector data
├── docker-compose.yml
├── .env.example
├── .gitignore
├── LICENSE
└── README.md

Tech Stack

Component	Technology	Purpose
Framework	FastAPI (Python 3.11)	High-performance async API
Orchestration	LangChain	LLM workflow management
Vector Database	Qdrant	Persistent vector storage
LLM Provider	OpenRouter	Access to Google Gemma 3 27B
Embeddings	sentence-transformers	Local embedding generation (all-MiniLM-L6-v2)
Reranking	sentence-transformers	Cross-encoder reranking (ms-marco-MiniLM-L-6-v2)
Memory	In-Memory (upgradeable to Redis)	Conversation state management
Containerization	Docker & Docker Compose	Reproducible deployment
Observability	Custom Metrics + LangSmith (optional)	Performance monitoring

Quick Start

Prerequisites

Docker & Docker Compose
OpenRouter API Key (Get one free)

Installation

# 1. Clone the repository
git clone https://github.com/SRV-YouSoRandom/ragagent.git
cd ragagent

# 2. Configure environment
cp .env.example .env
nano .env  # Add your OPENROUTER_API_KEY

# 3. Build and run
docker compose up --build -d

# 4. Verify it's running
curl http://localhost:8000/health

The API will be available at http://localhost:8000

Swagger UI: http://localhost:8000/docs

API Usage

1. Ingest Documents

Upload PDF documents to the vector store.

curl -X POST http://localhost:8000/api/v1/ingest \
  -F "file=@document.pdf" \
  -F "collection_name=my_docs"

Response:

{
  "filename": "document.pdf",
  "doc_hash": "a3f5c2...",
  "total_chunks": 42,
  "new_chunks_indexed": 42,
  "collection_name": "my_docs",
  "message": "Document ingested successfully."
}

2. Chat (Standard)

Ask questions with source attribution.

curl -X POST http://localhost:8000/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is the main topic?",
    "collection_name": "my_docs",
    "session_id": null
  }'

Response:

{
  "answer": "Based on the document, the main topic is...",
  "sources": [
    {
      "filename": "document.pdf",
      "score": 0.923,
      "rerank_score": 0.987
    }
  ],
  "collection_name": "my_docs",
  "session_id": "a1b2c3d4-5e6f-..."
}

3. Chat (Streaming)

Stream responses in real-time.

curl -X POST http://localhost:8000/api/v1/chat/stream \
  -H "Content-Type: application/json" \
  -d '{
    "question": "Explain this in detail",
    "session_id": "a1b2c3d4-5e6f-..."
  }' \
  --no-buffer

Tokens appear as they're generated!

4. Multi-Turn Conversations

The system remembers context within a session.

# First question
curl -X POST http://localhost:8000/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{"question": "What is quantum computing?"}'

# Follow-up (use returned session_id)
curl -X POST http://localhost:8000/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{
    "question": "How does it compare to classical computing?",
    "session_id": "<session_id_from_previous_response>"
  }'

5. Collection Management

# Create collection
curl -X POST http://localhost:8000/api/v1/collections \
  -H "Content-Type: application/json" \
  -d '{"name": "project_alpha"}'

# List collections
curl http://localhost:8000/api/v1/collections

# Get collection stats
curl http://localhost:8000/api/v1/collections/project_alpha

# Delete collection
curl -X DELETE http://localhost:8000/api/v1/collections/project_alpha

6. Session Management

# List active sessions
curl http://localhost:8000/api/v1/sessions

# Get conversation history
curl http://localhost:8000/api/v1/sessions/{session_id}/history

# Clear session
curl -X DELETE http://localhost:8000/api/v1/sessions/{session_id}

7. Metrics & Observability

# Get aggregated metrics
curl http://localhost:8000/api/v1/metrics/summary

# Get recent queries
curl http://localhost:8000/api/v1/metrics/recent?limit=20

Example metrics response:

{
  "total_queries": 150,
  "successful_queries": 147,
  "failed_queries": 3,
  "avg_latency_ms": 1234.5,
  "avg_docs_retrieved": 10.0,
  "avg_docs_after_rerank": 3.0,
  "p95_latency_ms": 1850.2,
  "p99_latency_ms": 2100.5
}

Configuration

Edit .env to customize behavior:

# LLM Configuration
OPENROUTER_API_KEY=your_key_here
LLM_MODEL=google/gemma-3-27b-it:free

# Embedding & Reranking
EMBEDDING_MODEL=all-MiniLM-L6-v2
RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2

# RAG Parameters
CHUNK_SIZE=512
CHUNK_OVERLAP=64
TOP_K=5              # Initial retrieval
RERANK_TOP_K=3       # After reranking

# Observability (Optional)
LANGSMITH_TRACING=false
LANGSMITH_API_KEY=
LANGSMITH_PROJECT=rag-agent

Key Design Decisions

Decision	Reasoning
all-MiniLM-L6-v2	Fast, lightweight (384-dim), runs fully local without API costs
Cross-Encoder Reranking	20-30% quality improvement with minimal latency cost (~100ms)
SHA256 Hashing	Prevents re-indexing duplicate content at document and chunk levels
Qdrant over FAISS	Persistent storage, production-ready, REST API, easier scaling
OpenRouter	Access to top-tier models without local GPU requirements
In-Memory Sessions	Fast, simple, upgradeable to Redis for multi-instance deployments
LRU Caching	Avoids expensive model re-initialization per request
FastAPI Lifespan	Ensures resources are ready before serving traffic
Streaming by Default	Better UX, lower perceived latency
Two-Stage Retrieval	Fast vector search (top-10) + accurate reranking (top-3)

Docker Commands

# View logs
docker compose logs -f app

# Rebuild after code changes
docker compose up --build -d

# Stop services
docker compose down

# Restart specific service
docker compose restart app

# Shell into container
docker exec -it rag-app bash

# View Qdrant dashboard
# Open http://localhost:6333/dashboard

# Nuclear reset (deletes all data)
docker compose down -v && rm -rf qdrant_storage/

Advanced Features

Reranking Pipeline

Two-stage retrieval significantly improves answer quality:

Query → Vector Search (cosine similarity, top-10)
      ↓
      Cross-Encoder Reranking (semantic scoring, top-3)
      ↓
      LLM Generation

Impact: 20-30% improvement in relevance scores with minimal latency cost.

Conversational Memory

Uses LangChain's ConversationBufferWindowMemory to maintain context:

Keeps last 5 Q&A exchanges per session
Enables follow-up questions like "tell me more" or "what about X?"
Session-based isolation (in-memory, upgradeable to Redis)

Observability Stack

Metrics Collected:

Request latency (total, embedding, retrieval, reranking, LLM)
Retrieval quality (scores, document counts)
Success/failure rates
p95/p99 latency percentiles

Optional LangSmith Integration:

Full LLM call tracing
Token usage tracking
Chain visualization
Production debugging

Testing

# Health check
curl http://localhost:8000/health

# API docs (interactive)
open http://localhost:8000/docs

# Test ingestion
curl -X POST http://localhost:8000/api/v1/ingest \
  -F "file=@test.pdf"

# Test chat
curl -X POST http://localhost:8000/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{"question": "test"}'

# Test metrics
curl http://localhost:8000/api/v1/metrics/summary

Future Enhancements

Planned Improvements

Production Upgrade Path

Memory → Redis:

from langchain.memory import RedisChatMessageHistory

message_history = RedisChatMessageHistory(
    url="redis://localhost:6379",
    session_id=session_id
)

Metrics → Prometheus:

from prometheus_client import Counter, Histogram

query_counter = Counter('rag_queries_total', 'Total queries')
query_latency = Histogram('rag_query_latency_seconds', 'Query latency')

Rate Limiting:

from slowapi import Limiter

limiter = Limiter(key_func=get_remote_address)

@app.post("/chat")
@limiter.limit("10/minute")
async def chat(...):
    ...

Contributing

Contributions are welcome! This project serves as a boilerplate for building production-grade RAG systems.

How to Contribute

Fork the repository
Create a feature branch
```
git checkout -b feature/amazing-feature
```
Commit your changes
```
git commit -m 'Add amazing feature'
```
Push to the branch
```
git push origin feature/amazing-feature
```
Open a Pull Request

Contribution Guidelines

Follow PEP 8 style guide
Add tests for new features
Update documentation
Ensure Docker builds successfully
Test with sample PDFs before submitting

Areas for Contribution

🐛 Bug fixes
✨ New features (see Future Enhancements)
📝 Documentation improvements
🧪 Test coverage
🎨 UI/Frontend development
🔧 DevOps improvements (CI/CD, monitoring)

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

LangChain — LLM orchestration framework
Qdrant — High-performance vector database
sentence-transformers — State-of-the-art embeddings
FastAPI — Modern Python web framework
OpenRouter — Unified LLM API access

Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Documentation: See /docs endpoint when running locally

Performance Benchmarks

Tested on modest hardware (4 vCPU, 8GB RAM):

Operation	Latency	Notes
PDF Ingestion (10 pages)	~2-3s	Includes chunking, embedding, indexing
Vector Search	~50-100ms	Top-10 retrieval from 10k chunks
Reranking	~100-150ms	Cross-encoder on top-10 candidates
LLM Generation	~1-2s	Depends on response length
Total Query Latency	~1.5-2.5s	Full pipeline with reranking

🌟 Star History

If you find this project useful, please consider giving it a ⭐!

Built with ❤️ for the AI/ML community

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
app		app
.example.env		.example.env
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml

Folders and files

Latest commit

History

Repository files navigation

Production RAG AI Chat Agent

🚀 Features

Core RAG Pipeline

Advanced Features

Architecture Overview

Data Flow

Project Structure

Tech Stack

Quick Start

Prerequisites

Installation

API Usage

1. Ingest Documents

2. Chat (Standard)

3. Chat (Streaming)

4. Multi-Turn Conversations

5. Collection Management

6. Session Management

7. Metrics & Observability

Configuration

Key Design Decisions

Docker Commands

Advanced Features

Reranking Pipeline

Conversational Memory

Observability Stack

Testing

Future Enhancements

Planned Improvements

Production Upgrade Path

Contributing

How to Contribute

Contribution Guidelines

Areas for Contribution

License

Acknowledgments

Support

Performance Benchmarks

🌟 Star History

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages