# RAG PDF Chatbot

A professional, enterprise-grade Retrieval-Augmented Generation (RAG) system for PDF documents.
## Overview

Organizations struggle to extract actionable insights from large collections of PDF documents, and traditional keyword search fails to provide contextual, accurate answers to complex questions. RAG PDF Chatbot solves this by combining:

- Document Retrieval: finds relevant information across a PDF collection
- Contextual Understanding: uses an LLM to understand and synthesize the retrieved information
- Natural Language Interface: ask questions in plain English and get precise answers
## Architecture

```mermaid
graph TD
    A[PDF Documents] --> B[Document Processor]
    B --> C[Vector Store]
    C --> D[Retriever]
    D --> E[RAG Chain]
    E --> F[LLM]
    F --> G[Answer]
    G --> H[User]
    H -->|Question| E
```
### Components

- Document Processor: loads and chunks PDF documents
- Vector Store: stores document embeddings for efficient retrieval
- Retriever: finds the most relevant chunks for a given question
- RAG Chain: combines retrieved context with the LLM to generate an answer
- LLM Interface: uses Ollama to run language models locally
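The retrieval step of the flow above can be illustrated with a dependency-free toy sketch. Bag-of-words vectors stand in for the real nomic-embed-text embeddings and a plain cosine ranking stands in for FAISS; the names here are illustrative, not the project's actual API:

```python
import math
import re
from typing import List

def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z]+", text.lower())

def embed(text: str, vocab: List[str]) -> List[float]:
    """Toy embedding: normalized bag-of-words counts over a shared vocabulary."""
    tokens = tokenize(text)
    vec = [float(tokens.count(w)) for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def retrieve(question: str, chunks: List[str], k: int = 3) -> List[str]:
    """Rank chunks by cosine similarity to the question (FAISS does this at scale)."""
    vocab = sorted({w for text in chunks + [question] for w in tokenize(text)})
    q = embed(question, vocab)
    return sorted(
        chunks,
        key=lambda c: -sum(a * b for a, b in zip(q, embed(c, vocab))),
    )[:k]

# Chunks that would normally come from the Document Processor
chunks = [
    "BCAA supplements may reduce muscle soreness after exercise.",
    "Vitamin D supports bone health and immune function.",
    "Creatine improves short-burst power output in training.",
]
top = retrieve("What are the benefits of BCAA supplements?", chunks, k=1)
print(top[0])  # the BCAA chunk is the closest match
```

In the real pipeline, the retrieved chunks are then inserted into the LLM prompt as context, which is what the RAG Chain component does.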
## Tech Stack

- Core: Python 3.8+
- Document Processing: LangChain, PyMuPDF
- Embeddings: Ollama (nomic-embed-text)
- Vector Store: FAISS
- LLM: Ollama (llama3.2:3b)
- Configuration: Python dataclasses + environment variables
- Testing: pytest
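The dataclasses-plus-environment-variables approach might look like the sketch below. The field names mirror the `.env` options documented later in this README, but the project's actual `config.py` may differ:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class OllamaConfig:
    base_url: str = "http://localhost:11434"
    embedding_model: str = "nomic-embed-text"
    llm_model: str = "llama3.2:3b"

    @classmethod
    def from_env(cls) -> "OllamaConfig":
        """Read settings from environment variables, falling back to the defaults above."""
        return cls(
            base_url=os.getenv("OLLAMA_BASE_URL", cls.base_url),
            embedding_model=os.getenv("EMBEDDING_MODEL", cls.embedding_model),
            llm_model=os.getenv("LLM_MODEL", cls.llm_model),
        )

os.environ["LLM_MODEL"] = "llama3.2:3b"  # e.g. loaded from .env
config = OllamaConfig.from_env()
print(config.llm_model)
```

A frozen dataclass keeps configuration immutable after startup, which makes it safe to share across the processing, retrieval, and chain components.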
## Prerequisites

- Python 3.8+
- Ollama running locally with the required models (`nomic-embed-text`, `llama3.2:3b`)
- PDF documents in the `rag-dataset/` directory
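To check the first prerequisite, you can probe Ollama's `/api/tags` endpoint (the HTTP API route that lists locally installed models). This helper is a convenience sketch, not part of the project:

```python
import json
import urllib.error
import urllib.request

def ollama_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if the Ollama server answers on its /api/tags endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            json.load(resp)  # /api/tags returns the local model list as JSON
            return True
    except (urllib.error.URLError, OSError, ValueError):
        return False

print("Ollama reachable:", ollama_is_up())  # True only if the server is running
```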
## Installation

```bash
# Clone the repository
git clone https://github.com/your-org/rag-pdf-chatbot.git
cd rag-pdf-chatbot

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Pull the required Ollama models
ollama pull nomic-embed-text
ollama pull llama3.2:3b

# Set up environment variables
cp .env.example .env
# Edit .env with your configuration
```

## Usage

```bash
# Basic usage
python -m src.main --help

# Ask a specific question
python -m src.main --question "What are the benefits of BCAA supplements?"

# Interactive mode
python -m src.main --interactive

# Rebuild the vector store
python -m src.main --rebuild --interactive
```

## Project Structure
```
rag-pdf-chatbot/
├── src/                      # Core application code
│   ├── __init__.py           # Package initialization
│   ├── config.py             # Configuration management
│   ├── document_processor.py # Document loading and processing
│   ├── vector_store.py       # Vector storage and retrieval
│   ├── rag_chain.py          # RAG pipeline implementation
│   └── main.py               # Main application entry point
├── tests/                    # Unit and integration tests
├── docs/                     # Architecture and design documentation
├── config/                   # Configuration files
├── scripts/                  # Automation and utility scripts
├── .env.example              # Environment variable template
├── .gitignore                # Git ignore patterns
├── README.md                 # This file
└── requirements.txt          # Python dependencies
```
## Configuration

The application uses environment variables for configuration. See `.env.example` for all available options:

```bash
# Ollama Configuration
OLLAMA_BASE_URL=http://localhost:11434
EMBEDDING_MODEL=nomic-embed-text
LLM_MODEL=llama3.2:3b

# Document Processing
DATASET_PATH=rag-dataset
CHUNK_SIZE=1000
CHUNK_OVERLAP=100

# Vector Store
VECTOR_STORE_PATH=health_supplemets
SAVE_VECTOR_STORE=true

# Retrieval
RETRIEVAL_TYPE=mmr
RETRIEVAL_K=3
RETRIEVAL_FETCH_K=100
RETRIEVAL_LAMBDA=1.0
```
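The retrieval settings configure maximal marginal relevance (MMR): fetch `RETRIEVAL_FETCH_K` candidates, then select `RETRIEVAL_K` of them while trading relevance against diversity via `RETRIEVAL_LAMBDA` (at 1.0, MMR reduces to pure relevance ranking). A dependency-free sketch of the idea, not the project's actual implementation:

```python
from typing import Callable, List, Sequence

def mmr(
    query_sim: Sequence[float],             # similarity of each candidate to the query
    pair_sim: Callable[[int, int], float],  # similarity between two candidates
    k: int = 3,                             # RETRIEVAL_K
    lam: float = 1.0,                       # RETRIEVAL_LAMBDA: 1.0 = pure relevance
) -> List[int]:
    """Select k candidate indices by maximal marginal relevance."""
    selected: List[int] = []
    remaining = list(range(len(query_sim)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            # Penalize candidates that are close to something already selected
            diversity = max((pair_sim(i, j) for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * diversity
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Three candidates: 0 and 1 are near-duplicates, 2 is distinct but less relevant
sims = [0.9, 0.85, 0.5]
pair = lambda i, j: 0.95 if {i, j} == {0, 1} else 0.1

print(mmr(sims, pair, k=2, lam=1.0))  # pure relevance: keeps the near-duplicate
print(mmr(sims, pair, k=2, lam=0.5))  # diversity-aware: swaps it for the distinct chunk
```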
## Testing

```bash
# Run all tests
pytest tests/

# Run a specific test module
pytest tests/test_document_processor.py

# Run with coverage
pytest --cov=src tests/
```

## Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Benefits

**For Developers:**
- Clean, modular architecture following SOLID principles
- Easy to extend and customize
- Comprehensive documentation and examples

**For Organizations:**
- Extract insights from PDF documents efficiently
- Reduce manual document review time
- Improve knowledge discovery and decision making

**For Recruiters:**
- Professional, enterprise-grade codebase
- Follows best practices for security and maintainability
- Demonstrates advanced Python and AI/ML skills
## Security

This project follows GitGuardian security standards:

- No hardcoded secrets
- Environment variable configuration
- Secure dependency management
- Regular security audits
## Support

For issues, questions, or feature requests, please open an issue on GitHub.

Built with ❤️ for developers, by developers.