JosephJonathanFernandes/RAG-PDF_chat

RAG PDF Chatbot

RAG PDF Chatbot Logo

A Professional, Enterprise-Grade Retrieval-Augmented Generation System for PDF Documents

License: MIT Python 3.8+ Code Style: Black

🎯 Problem Statement

Organizations struggle with extracting actionable insights from large collections of PDF documents. Traditional search methods fail to provide contextual, accurate answers to complex questions. RAG PDF Chatbot solves this by combining:

  • Document Retrieval: Find relevant information from PDF collections
  • Contextual Understanding: Use LLM to understand and synthesize information
  • Natural Language Interface: Ask questions in plain English and get precise answers

πŸ—οΈ Architecture

graph TD
    A[PDF Documents] --> B[Document Processor]
    B --> C[Vector Store]
    C --> D[Retriever]
    D --> E[RAG Chain]
    E --> F[LLM]
    F --> G[Answer]
    G --> H[User]
    H -->|Question| E

Key Components

  1. Document Processor: Loads and chunks PDF documents
  2. Vector Store: Stores document embeddings for efficient retrieval
  3. Retriever: Finds relevant documents for a given question
  4. RAG Chain: Combines retrieved context with LLM for answer generation
  5. LLM Interface: Uses Ollama to run local language models
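The chunking step performed by the Document Processor can be sketched in plain Python. This is a simplified stand-alone illustration of fixed-size chunking with overlap (matching the CHUNK_SIZE/CHUNK_OVERLAP settings below); the actual project uses LangChain's text splitters, and `chunk_text` here is a hypothetical helper:

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list:
    """Split text into overlapping chunks.

    Overlap preserves context across chunk boundaries so that a sentence
    cut off at the end of one chunk is still visible at the start of the next.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - chunk_overlap  # how far the window advances each time
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Each chunk is then embedded (here via Ollama's nomic-embed-text) and stored in FAISS for similarity search.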

πŸ› οΈ Tech Stack

  • Core: Python 3.8+
  • Document Processing: LangChain, PyMuPDF
  • Embeddings: Ollama (nomic-embed-text)
  • Vector Store: FAISS
  • LLM: Ollama (llama3.2:3b)
  • Configuration: Python dataclasses + environment variables
  • Testing: pytest

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • Ollama running locally with required models
  • PDF documents in the rag-dataset/ directory

Installation

# Clone the repository
git clone https://github.com/your-org/rag-pdf-chatbot.git
cd rag-pdf-chatbot

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env with your configuration

Running the Application

# Basic usage
python -m src.main --help

# Ask a specific question
python -m src.main --question "What are the benefits of BCAA supplements?"

# Interactive mode
python -m src.main --interactive

# Rebuild vector store
python -m src.main --rebuild --interactive
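A minimal sketch of how these flags could be wired up with `argparse` (illustrative only; the actual argument handling in src/main.py may differ):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Mirrors the CLI flags shown above; defaults are assumptions, not the
    # project's actual implementation.
    parser = argparse.ArgumentParser(
        prog="rag-pdf-chatbot",
        description="Retrieval-augmented Q&A over a PDF corpus.",
    )
    parser.add_argument("--question", help="ask a single question and exit")
    parser.add_argument("--interactive", action="store_true",
                        help="start an interactive Q&A loop")
    parser.add_argument("--rebuild", action="store_true",
                        help="rebuild the FAISS vector store before answering")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```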

📂 Project Structure

rag-pdf-chatbot/
├── src/                  # Core application code
│   ├── __init__.py       # Package initialization
│   ├── config.py         # Configuration management
│   ├── document_processor.py  # Document loading and processing
│   ├── vector_store.py   # Vector storage and retrieval
│   ├── rag_chain.py      # RAG pipeline implementation
│   └── main.py           # Main application entry point
├── tests/                # Unit and integration tests
├── docs/                 # Architecture and design documentation
├── config/               # Configuration files
├── scripts/              # Automation and utility scripts
├── .env.example          # Environment variable template
├── .gitignore            # Git ignore patterns
├── README.md             # This file
└── requirements.txt      # Python dependencies

🔧 Configuration

The application uses environment variables for configuration. See .env.example for all available options:

# Ollama Configuration
OLLAMA_BASE_URL=http://localhost:11434
EMBEDDING_MODEL=nomic-embed-text
LLM_MODEL=llama3.2:3b

# Document Processing
DATASET_PATH=rag-dataset
CHUNK_SIZE=1000
CHUNK_OVERLAP=100

# Vector Store
VECTOR_STORE_PATH=health_supplemets
SAVE_VECTOR_STORE=true

# Retrieval
RETRIEVAL_TYPE=mmr
RETRIEVAL_K=3
RETRIEVAL_FETCH_K=100
RETRIEVAL_LAMBDA=1.0
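The dataclass-plus-environment-variable pattern listed in the tech stack could look roughly like this for the retrieval settings (a sketch only; field names follow the variables above, but the real src/config.py may be structured differently):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class RetrievalConfig:
    # Defaults mirror the .env.example values shown above.
    retrieval_type: str = "mmr"
    k: int = 3
    fetch_k: int = 100
    lambda_mult: float = 1.0

    @classmethod
    def from_env(cls) -> "RetrievalConfig":
        # Environment variables override the defaults when present.
        return cls(
            retrieval_type=os.getenv("RETRIEVAL_TYPE", cls.retrieval_type),
            k=int(os.getenv("RETRIEVAL_K", cls.k)),
            fetch_k=int(os.getenv("RETRIEVAL_FETCH_K", cls.fetch_k)),
            lambda_mult=float(os.getenv("RETRIEVAL_LAMBDA", cls.lambda_mult)),
        )
```

A frozen dataclass keeps the configuration immutable after startup, so a typo elsewhere in the code cannot silently change retrieval behavior mid-run.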

🧪 Testing

# Run all tests
pytest tests/

# Run specific test
pytest tests/test_document_processor.py

# Run with coverage
pytest --cov=src tests/
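A unit test in tests/ might look like the following (illustrative; the helper being tested is defined inline so the example is self-contained, and is not taken from the project's code):

```python
# test_chunking.py — example of a small, focused pytest-style unit test.

def split_into_chunks(text: str, size: int) -> list:
    """Toy splitter used only to illustrate the test style."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def test_split_into_chunks_covers_all_text():
    text = "abcdefghij"
    chunks = split_into_chunks(text, 4)
    # No characters are lost or duplicated across chunk boundaries.
    assert chunks == ["abcd", "efgh", "ij"]
    assert "".join(chunks) == text
```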

📖 Documentation

Architecture and design documents live in the docs/ directory.

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

🎯 Value Proposition

For Developers:

  • Clean, modular architecture following SOLID principles
  • Easy to extend and customize
  • Comprehensive documentation and examples

For Organizations:

  • Extract insights from PDF documents efficiently
  • Reduce manual document review time
  • Improve knowledge discovery and decision making

For Recruiters:

  • Professional, enterprise-grade codebase
  • Follows best practices for security and maintainability
  • Demonstrates advanced Python and AI/ML skills

🔒 Security

This project follows GitGuardian security standards:

  • No hardcoded secrets
  • Environment variable configuration
  • Secure dependency management
  • Regular security audits

📞 Support

For issues, questions, or feature requests, please open an issue on GitHub.


Built with ❤️ for developers, by developers.
