A Python-based Retrieval-Augmented Generation pipeline that understands text, images, and tables — powered by OCR, FAISS, and Groq LLaMA 3.
Most RAG systems only handle plain text. This one goes further — it ingests PDFs, Word docs, CSVs, Excel files, and even images embedded inside PDFs (via Tesseract OCR). Everything gets chunked, embedded with Google Generative AI, stored in FAISS, and queried through Groq LLaMA 3 for grounded answers with source attribution.
If your answer came from an image, you get the image path back too.
- 📄 PDF & DOCX text extraction — direct parsing from PDF and Word documents
- 🖼️ OCR for embedded images — extracts text from images inside PDFs using Tesseract, preserving image paths
- 📊 Table parsing — CSV & Excel files converted to searchable text via pandas
- 🔗 Chunking & embedding — powered by Google Generative AI Embeddings
- ⚡ FAISS vector search — fast similarity-based retrieval
- 🤖 Groq LLaMA 3 QA — generates grounded responses from retrieved context
- 🖼️ Image path return — includes the source image path when answers originate from OCR
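The chunking step can be illustrated with a minimal sliding-window splitter. This is a simplified stand-in for the LangChain splitters the pipeline actually uses; `chunk_text`, its parameters, and the defaults here are illustrative, not the project's real configuration:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks.

    A toy stand-in for LangChain's text splitters: fixed-size windows
    that overlap so no sentence is cut off without context on one side.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# A 1200-character document yields three overlapping chunks
doc = "x" * 1200
chunks = chunk_text(doc)
```

The overlap means neighbouring chunks share a margin of text, which helps retrieval return enough context even when a relevant passage straddles a chunk boundary.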
```
Documents (PDF, DOCX, CSV, Excel)
            │
            ▼
┌────────────────────────┐
│   Content Extraction   │ ← PyMuPDF, python-docx, pandas, Tesseract OCR
└──────────┬─────────────┘
           ▼
┌────────────────────────┐
│     Chunk + Embed      │ ← Google Generative AI Embeddings + LangChain
└──────────┬─────────────┘
           ▼
┌────────────────────────┐
│   FAISS Vector Store   │ ← Similarity search & retrieval
└──────────┬─────────────┘
           ▼
┌────────────────────────┐
│      Groq LLaMA 3      │ ← Answer generation + source attribution
└────────────────────────┘
```
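The retrieval layer in the middle of the diagram can be sketched with a tiny in-memory index. Everything here is a toy stand-in: `embed` fakes the Google Generative AI embedding with a hashed character-trigram vector, and `TinyVectorStore` plays the role of FAISS with a brute-force cosine search:

```python
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy deterministic embedding (stand-in for Google Generative AI
    Embeddings): hash character trigrams into a fixed-size unit vector."""
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        vec[hash(text[i:i + 3]) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

class TinyVectorStore:
    """Brute-force in-memory similarity index, a stand-in for FAISS."""

    def __init__(self):
        self.entries = []  # (vector, chunk, metadata) triples

    def add(self, chunk: str, metadata: dict) -> None:
        self.entries.append((embed(chunk), chunk, metadata))

    def search(self, query: str, k: int = 2) -> list[tuple[str, dict]]:
        qv = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(qv, e[0]),
                        reverse=True)
        return [(chunk, meta) for _, chunk, meta in ranked[:k]]

store = TinyVectorStore()
store.add("FAISS is a library for fast vector similarity search",
          {"source": "faiss.pdf"})
store.add("pandas converts CSV and Excel tables to text",
          {"source": "tables.csv"})
results = store.search("vector similarity search", k=1)
```

Carrying a metadata dict alongside each chunk is what makes source attribution (and the image-path return for OCR chunks) possible at answer time.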
Prerequisites: Python 3.11+, Tesseract OCR, a Google API key, and a Groq API key.
```bash
# 1. Clone
git clone https://github.com/AnithaKarre/multimodel_RAG.git
cd multimodel_RAG

# 2. Virtual environment
python -m venv .venv
.venv\Scripts\activate        # Windows
# source .venv/bin/activate   # Linux/macOS

# 3. Install dependencies
pip install -r requirements.txt

# 4. Set the Tesseract path in backend/rag-model.py
# pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# 5. Create backend/assets/.env with your API keys
# EMBED_MODEL_API_KEY=your_google_api_key
# GROQ_API_KEY=your_groq_api_key

# 6. Run
python backend/rag-model.py
```

Drop your documents (PDF, DOCX, CSV, Excel) into backend/assets/ before running.
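Reading the two keys from backend/assets/.env can be done with a minimal loader like the one below. This is a sketch only: the project may well use python-dotenv instead, and `load_env` is an illustrative helper, not part of the repo:

```python
import os

def load_env(path: str) -> None:
    """Minimal .env loader: KEY=value lines, '#' comments ignored.
    A stand-in for python-dotenv's load_dotenv()."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                # setdefault so real environment variables win over the file
                os.environ.setdefault(key.strip(), value.strip())
```

After loading, the pipeline can pick up `os.environ["EMBED_MODEL_API_KEY"]` and `os.environ["GROQ_API_KEY"]` wherever the embedding and LLM clients are constructed.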
```
multimodel_RAG/
├── main.py              # Entry point (extendable for UI/API)
├── requirements.txt     # Python dependencies
├── pyproject.toml       # Project configuration
├── uv.lock              # Lock file for reproducible installs
├── backend/
│   ├── rag-model.py     # Main RAG pipeline
│   ├── app.ipynb        # Jupyter notebook for experiments
│   └── assets/          # Input documents + .env file
└── images/              # Extracted images from PDFs (with OCR text)
```
- **Core:** Python 3.11+ · LangChain · FAISS · Groq LLaMA 3
- **Document Processing:** PyMuPDF (fitz) · python-docx · pandas · pytesseract + PIL
- **Embeddings:** Google Generative AI Embeddings
1. Load documents from backend/assets/
2. Extract text (PDF/DOCX) + parse tables (CSV/Excel)
3. Extract images from PDFs → OCR with Tesseract → save to images/
4. Chunk all extracted text with LangChain splitters
5. Embed chunks → index in FAISS
6. User asks a question
7. Query embedded → similarity search in FAISS
8. Top chunks fed to Groq LLaMA 3 → grounded answer returned
9. If answer came from OCR text → image path included in response
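Steps 6–9 can be sketched end to end. The LLM call itself is elided (the real pipeline sends the prompt to Groq LLaMA 3); `answer`, the sample chunks, and the image path below are all hypothetical, shown only to illustrate how source attribution and the image-path return fall out of the chunk metadata:

```python
def answer(question: str, retrieved: list[tuple[str, dict]]) -> dict:
    """Assemble a grounded prompt and attribution from retrieved chunks.
    In the real pipeline, `prompt` would be sent to Groq LLaMA 3."""
    context = "\n\n".join(chunk for chunk, _ in retrieved)
    prompt = (f"Answer using only this context:\n{context}\n\n"
              f"Question: {question}")
    response = {
        "prompt": prompt,
        "sources": [meta["source"] for _, meta in retrieved],
    }
    # Step 9: if any retrieved chunk came from OCR, surface its image path
    image_paths = [meta["image_path"] for _, meta in retrieved
                   if "image_path" in meta]
    if image_paths:
        response["image_paths"] = image_paths
    return response

# Hypothetical retrieval result: one plain-text chunk, one OCR chunk
retrieved = [
    ("Total revenue was $4.2M.", {"source": "report.pdf"}),
    ("Chart caption: Q3 growth 12%.",
     {"source": "report.pdf", "image_path": "images/report_p3_img1.png"}),
]
result = answer("What was Q3 growth?", retrieved)
```

Because the OCR step stored an `image_path` in the chunk's metadata at ingestion time, attribution is just a metadata lookup at answer time, with no extra bookkeeping in the LLM call.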
MIT — See LICENSE for details.