# RAG PDF Chatbot

A professional, enterprise-grade Retrieval-Augmented Generation (RAG) system for PDF documents.
## Overview

Organizations struggle to extract actionable insights from large collections of PDF documents, and traditional keyword search fails to provide contextual, accurate answers to complex questions. RAG PDF Chatbot solves this by combining:

- Document Retrieval: finds relevant information across a PDF collection
- Contextual Understanding: uses an LLM to understand and synthesize the retrieved information
- Natural Language Interface: ask questions in plain English and get precise answers
## Architecture

```mermaid
graph TD
    A[PDF Documents] --> B[Document Processor]
    B --> C[Vector Store]
    C --> D[Retriever]
    D --> E[RAG Chain]
    E --> F[LLM]
    F --> G[Answer]
    G --> H[User]
    H -->|Question| E
```
### Components

- Document Processor: loads and chunks PDF documents
- Vector Store: stores document embeddings for efficient retrieval
- Retriever: finds the most relevant chunks for a given question
- RAG Chain: combines retrieved context with the LLM to generate an answer
- LLM Interface: uses Ollama to run language models locally
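The retrieval step of the flow above can be illustrated with a dependency-free toy sketch. Bag-of-words vectors stand in for the real nomic-embed-text embeddings and a plain cosine ranking stands in for FAISS; the names here are illustrative, not the project's actual API:

```python
import math
import re
from typing import List

def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z]+", text.lower())

def embed(text: str, vocab: List[str]) -> List[float]:
    """Toy embedding: normalized bag-of-words counts over a shared vocabulary."""
    tokens = tokenize(text)
    vec = [float(tokens.count(w)) for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def retrieve(question: str, chunks: List[str], k: int = 3) -> List[str]:
    """Rank chunks by cosine similarity to the question (FAISS does this at scale)."""
    vocab = sorted({w for text in chunks + [question] for w in tokenize(text)})
    q = embed(question, vocab)
    return sorted(
        chunks,
        key=lambda c: -sum(a * b for a, b in zip(q, embed(c, vocab))),
    )[:k]

# Chunks that would normally come from the Document Processor
chunks = [
    "BCAA supplements may reduce muscle soreness after exercise.",
    "Vitamin D supports bone health and immune function.",
    "Creatine improves short-burst power output in training.",
]
top = retrieve("What are the benefits of BCAA supplements?", chunks, k=1)
print(top[0])  # the BCAA chunk is the closest match
```

In the real pipeline, the retrieved chunks are then inserted into the LLM prompt as context, which is what the RAG Chain component does.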
## Tech Stack

- Core: Python 3.8+
- Document Processing: LangChain, PyMuPDF
- Embeddings: Ollama (nomic-embed-text)
- Vector Store: FAISS
- LLM: Ollama (llama3.2:3b)
- Configuration: Python dataclasses + environment variables
- Testing: pytest
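The dataclasses-plus-environment-variables approach might look like the sketch below. The field names mirror the `.env` options documented later in this README, but the project's actual `config.py` may differ:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class OllamaConfig:
    base_url: str = "http://localhost:11434"
    embedding_model: str = "nomic-embed-text"
    llm_model: str = "llama3.2:3b"

    @classmethod
    def from_env(cls) -> "OllamaConfig":
        """Read settings from environment variables, falling back to the defaults above."""
        return cls(
            base_url=os.getenv("OLLAMA_BASE_URL", cls.base_url),
            embedding_model=os.getenv("EMBEDDING_MODEL", cls.embedding_model),
            llm_model=os.getenv("LLM_MODEL", cls.llm_model),
        )

os.environ["LLM_MODEL"] = "llama3.2:3b"  # e.g. loaded from .env
config = OllamaConfig.from_env()
print(config.llm_model)
```

A frozen dataclass keeps configuration immutable after startup, which makes it safe to share across the processing, retrieval, and chain components.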
## Prerequisites

- Python 3.8+
- Ollama running locally with the required models (`nomic-embed-text`, `llama3.2:3b`)
- PDF documents in the `rag-dataset/` directory
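To check the first prerequisite, you can probe Ollama's `/api/tags` endpoint (the HTTP API route that lists locally installed models). This helper is a convenience sketch, not part of the project:

```python
import json
import urllib.error
import urllib.request

def ollama_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if the Ollama server answers on its /api/tags endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            json.load(resp)  # /api/tags returns the local model list as JSON
            return True
    except (urllib.error.URLError, OSError, ValueError):
        return False

print("Ollama reachable:", ollama_is_up())  # True only if the server is running
```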
## Installation

```bash
# Clone the repository
git clone https://github.com/your-org/rag-pdf-chatbot.git
cd rag-pdf-chatbot

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Pull the required Ollama models
ollama pull nomic-embed-text
ollama pull llama3.2:3b

# Set up environment variables
cp .env.example .env
# Edit .env with your configuration
```

## Usage

```bash
# Basic usage
python -m src.main --help

# Ask a specific question
python -m src.main --question "What are the benefits of BCAA supplements?"

# Interactive mode
python -m src.main --interactive

# Rebuild the vector store
python -m src.main --rebuild --interactive
```

## Project Structure
```
rag-pdf-chatbot/
├── src/                      # Core application code
│   ├── __init__.py           # Package initialization
│   ├── config.py             # Configuration management
│   ├── document_processor.py # Document loading and processing
│   ├── vector_store.py       # Vector storage and retrieval
│   ├── rag_chain.py          # RAG pipeline implementation
│   └── main.py               # Main application entry point
├── tests/                    # Unit and integration tests
├── docs/                     # Architecture and design documentation
├── config/                   # Configuration files
├── scripts/                  # Automation and utility scripts
├── .env.example              # Environment variable template
├── .gitignore                # Git ignore patterns
├── README.md                 # This file
└── requirements.txt          # Python dependencies
```
## Configuration

The application uses environment variables for configuration. See `.env.example` for all available options:

```bash
# Ollama Configuration
OLLAMA_BASE_URL=http://localhost:11434
EMBEDDING_MODEL=nomic-embed-text
LLM_MODEL=llama3.2:3b

# Document Processing
DATASET_PATH=rag-dataset
CHUNK_SIZE=1000
CHUNK_OVERLAP=100

# Vector Store
VECTOR_STORE_PATH=health_supplemets
SAVE_VECTOR_STORE=true

# Retrieval
RETRIEVAL_TYPE=mmr
RETRIEVAL_K=3
RETRIEVAL_FETCH_K=100
RETRIEVAL_LAMBDA=1.0
```
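The retrieval settings configure maximal marginal relevance (MMR): fetch `RETRIEVAL_FETCH_K` candidates, then select `RETRIEVAL_K` of them while trading relevance against diversity via `RETRIEVAL_LAMBDA` (at 1.0, MMR reduces to pure relevance ranking). A dependency-free sketch of the idea, not the project's actual implementation:

```python
from typing import Callable, List, Sequence

def mmr(
    query_sim: Sequence[float],             # similarity of each candidate to the query
    pair_sim: Callable[[int, int], float],  # similarity between two candidates
    k: int = 3,                             # RETRIEVAL_K
    lam: float = 1.0,                       # RETRIEVAL_LAMBDA: 1.0 = pure relevance
) -> List[int]:
    """Select k candidate indices by maximal marginal relevance."""
    selected: List[int] = []
    remaining = list(range(len(query_sim)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            # Penalize candidates that are close to something already selected
            diversity = max((pair_sim(i, j) for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * diversity
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Three candidates: 0 and 1 are near-duplicates, 2 is distinct but less relevant
sims = [0.9, 0.85, 0.5]
pair = lambda i, j: 0.95 if {i, j} == {0, 1} else 0.1

print(mmr(sims, pair, k=2, lam=1.0))  # pure relevance: keeps the near-duplicate
print(mmr(sims, pair, k=2, lam=0.5))  # diversity-aware: swaps it for the distinct chunk
```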
## Testing

```bash
# Run all tests
pytest tests/

# Run a specific test module
pytest tests/test_document_processor.py

# Run with coverage
pytest --cov=src tests/
```

## Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Benefits

**For Developers:**
- Clean, modular architecture following SOLID principles
- Easy to extend and customize
- Comprehensive documentation and examples

**For Organizations:**
- Extract insights from PDF documents efficiently
- Reduce manual document review time
- Improve knowledge discovery and decision making

**For Recruiters:**
- Professional, enterprise-grade codebase
- Follows best practices for security and maintainability
- Demonstrates advanced Python and AI/ML skills
## Security

This project follows GitGuardian security standards:

- No hardcoded secrets
- Environment variable configuration
- Secure dependency management
- Regular security audits
## Support

For issues, questions, or feature requests, please open an issue on GitHub.

Built with ❤️ for developers, by developers.