A modular pipeline for document embedding, retrieval, and question answering using ChromaDB and state-of-the-art transformer models.
DocuRAG is a demonstration project that implements a Retrieval-Augmented Generation (RAG) pipeline. It covers the entire process: converting documents to embeddings, storing them in a vector database, and performing question answering (QA) on the stored data.
This project leverages several powerful tools and libraries:
- ChromaDB for managing vector embeddings.
- SentenceTransformers (LaBSE) for generating embeddings.
- Transformers for question answering and translation.
- spaCy for natural language processing and language detection.
- Other libraries such as PyMuPDF, python-docx, python-pptx, and nltk for document processing.
- Document Processing: Monitors a folder for new documents (PDF, DOCX, TXT, PPTX), converts them to text, and splits the text into chunks.
- Embedding Generation: Uses the LaBSE model to generate semantic embeddings from document text.
- Vector Storage: Stores document embeddings in a ChromaDB collection to support efficient similarity searches.
- Question Answering: Processes user queries in Spanish by retrieving relevant document contexts and generating answers using a QA model.
- Logging: Keeps track of processed files to avoid duplicate processing.
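As a rough illustration of the "splits the text into chunks" step, here is a minimal sketch that groups whole sentences into chunks of bounded size. The function name `split_into_chunks`, the `max_chars` parameter, and the regex-based sentence split are assumptions made for this example; the project lists nltk among its dependencies, and its actual chunking logic may differ.

```python
import re


def split_into_chunks(text, max_chars=500):
    """Group whole sentences into chunks of roughly max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    # Naive sentence split on '.', '!' or '?' followed by whitespace.
    # This is an assumption for the sketch; nltk's punkt tokenizer is more robust.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow the chunk: close it and start anew.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Keeping sentences intact tends to give the embedding model more coherent inputs than cutting at a fixed character offset, at the cost of slightly uneven chunk sizes.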
DocuRAG/
├── clean_database.py # Script to clean the embeddings collection and log file
├── document_vectorizer.py # Script to process documents and store embeddings in ChromaDB
├── document_qa.py # Script to perform question answering on stored document embeddings
├── config.yml # Configuration file for paths and model parameters
└── README.md # This file
- Python: 3.8 or above (Python 3.11 recommended)
- Libraries:
  - chromadb
  - PyYAML
  - PyMuPDF
  - sentence-transformers
  - python-docx
  - python-pptx
  - nltk
  - spacy
  - transformers
  - spacy-langdetect
  - (and their dependencies)
You can install all required packages with pip, either directly or inside a virtual environment, using requirements.txt if the repository includes one.
- Clone the Repository:
  git clone https://github.com/runciter2078/DocuRAG.git
  cd DocuRAG
- (Optional) Create and Activate a Virtual Environment:
  python -m venv venv
  source venv/bin/activate # On Windows use: venv\Scripts\activate
- Install Dependencies:
  If you have a requirements.txt:
  pip install -r requirements.txt
  Otherwise, install the libraries manually:
  pip install chromadb PyYAML PyMuPDF sentence-transformers python-docx python-pptx nltk spacy transformers spacy-langdetect
- Download Necessary Data:
  - For nltk, ensure the punkt tokenizer is downloaded. This will happen automatically on first run.
  - For spaCy, download the Spanish model if not already installed:
    python -m spacy download es_core_news_sm
- Configuration:
  Edit the config.yml file if needed. The default settings use relative paths and generic parameters suitable for testing and demonstration.
Before processing new documents, you may want to clean the current embeddings collection and log file.
Run:
python clean_database.py
Follow the on-screen prompt to confirm the deletion.
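The confirm-before-delete pattern can be sketched as follows. This is a hypothetical example, not the contents of clean_database.py: the function names, the accepted answers ("y"/"yes"), and the log path are assumptions, and removing the ChromaDB collection itself is left out.

```python
from pathlib import Path


def is_confirmed(answer):
    """Treat only an explicit 'y' or 'yes' (any case) as confirmation."""
    return answer.strip().lower() in {"y", "yes"}


def clean_log(log_path, answer):
    """Delete the log file only if the user confirmed; return True if removed."""
    log_file = Path(log_path)
    if is_confirmed(answer) and log_file.exists():
        log_file.unlink()
        return True
    return False
```

In an interactive script, `answer` would typically come from `input()`. Defaulting to "no" for anything other than an explicit yes makes an accidental deletion less likely.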
Place your documents (PDF, DOCX, TXT, PPTX) in the folder specified by data_path (default is Data).
Then, run:
python document_vectorizer.py
This script will process new or modified files, generate embeddings, and store them in ChromaDB.
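One way to detect "new or modified" files is to keep a log that maps each filename to a hash of its content. The sketch below assumes a JSON log and SHA-256 fingerprints; the helper names and the log format are hypothetical, and the project's own log may use a different scheme (for example, modification times).

```python
import hashlib
import json
from pathlib import Path


def file_fingerprint(path):
    """Return a SHA-256 hex digest of the file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def load_log(log_path):
    """Load the {filename: fingerprint} log, or an empty dict if none exists."""
    log_file = Path(log_path)
    if not log_file.exists():
        return {}
    return json.loads(log_file.read_text(encoding="utf-8"))


def find_pending(data_dir, log):
    """Return files that are new or whose content changed since they were logged."""
    pending = []
    for path in sorted(Path(data_dir).iterdir()):
        if path.is_file() and log.get(path.name) != file_fingerprint(path):
            pending.append(path)
    return pending


def mark_processed(path, log, log_path):
    """Record the file's current fingerprint and persist the log to disk."""
    log[Path(path).name] = file_fingerprint(path)
    Path(log_path).write_text(json.dumps(log, indent=2), encoding="utf-8")
```

Hashing content rather than comparing timestamps avoids reprocessing files that were merely copied or touched, though it requires reading every file on each run.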
Once the documents are processed and embeddings are stored, you can ask questions (in Spanish) based on the stored data.
Run:
python document_qa.py
Enter your question when prompted. The script will retrieve relevant contexts and generate an answer.
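To make the retrieval step concrete, here is a minimal pure-Python sketch of ranking stored chunks by cosine similarity to a query vector. It only stands in for what ChromaDB does internally: in DocuRAG the vectors would come from the LaBSE model and the search would be delegated to the ChromaDB collection. The function names and the toy two-dimensional vectors are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, indexed_chunks, k=3):
    """Return the texts of the k chunks most similar to the query.

    indexed_chunks is a list of (text, vector) pairs.
    """
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy example: the query points along the first axis.
chunks = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
best = top_k_chunks([1.0, 0.0], chunks, k=2)  # "a" first, then "c"
```

The retrieved texts would then be passed as context to the QA model, which extracts or generates the answer.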
- Safety: Use the cleaning script with caution as it will permanently remove the stored embeddings and log file.
- Customization: Feel free to modify config.yml to suit your directory structure and model preferences.
- Extensibility: DocuRAG is modular; you can integrate additional processing steps or swap models as needed.
MIT License [https://opensource.org/license/mit]
Contributions are welcome! If you have suggestions, improvements, or bug fixes, please open an issue or submit a pull request.
Enjoy exploring DocuRAG and showcasing your work!