
🔎 Multimodal RAG System

A Python-based Retrieval-Augmented Generation pipeline that understands text, images, and tables — powered by OCR, FAISS, and Groq LLaMA 3.



What is this?

Most RAG systems only handle plain text. This one goes further — it ingests PDFs, Word docs, CSVs, Excel files, and even images embedded inside PDFs (via Tesseract OCR). Everything gets chunked, embedded with Google Generative AI, stored in FAISS, and queried through Groq LLaMA 3 for grounded answers with source attribution.

If your answer came from an image, you get the image path back too.


✨ Features

  • 📄 PDF & DOCX text extraction — direct parsing from PDF and Word documents
  • 🖼️ OCR for embedded images — extracts text from images inside PDFs using Tesseract, preserving image paths
  • 📊 Table parsing — CSV & Excel files converted to searchable text via pandas
  • 🔗 Chunking & embedding — powered by Google Generative AI Embeddings
  • ⚡ FAISS vector search — fast similarity-based retrieval
  • 🤖 Groq LLaMA 3 QA — generates grounded responses from retrieved context
  • 🖼️ Image path return — includes the source image path when answers originate from OCR
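The chunking step above can be illustrated with a minimal sliding-window splitter. The real pipeline uses LangChain's text splitters; the chunk size and overlap values here are illustrative assumptions:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks — a simplified stand-in
    for LangChain's character-based splitters."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "word " * 100  # 500 characters of dummy text
chunks = chunk_text(doc, chunk_size=200, overlap=50)
```

The overlap means the tail of one chunk repeats at the head of the next, so a sentence split across a boundary is still retrievable from at least one chunk.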

🏗️ Architecture

 Documents (PDF, DOCX, CSV, Excel)
          │
          ▼
 ┌────────────────────────┐
 │   Content Extraction   │  ← PyMuPDF, python-docx, pandas, Tesseract OCR
 └──────────┬─────────────┘
            ▼
 ┌────────────────────────┐
 │  Chunk + Embed         │  ← Google Generative AI Embeddings + LangChain
 └──────────┬─────────────┘
            ▼
 ┌────────────────────────┐
 │  FAISS Vector Store    │  ← Similarity search & retrieval
 └──────────┬─────────────┘
            ▼
 ┌────────────────────────┐
 │  Groq LLaMA 3          │  ← Answer generation + source attribution
 └────────────────────────┘
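The Content Extraction stage treats tables as text: each row is flattened into "column: value" pairs so tabular data can be chunked and embedded like any document. The repo does this with pandas; this stdlib sketch shows the idea:

```python
import csv
import io

def table_to_text(csv_text: str) -> list[str]:
    """Flatten CSV rows into 'column: value' lines so tabular data
    becomes searchable text. (A simplified stand-in for the pandas
    conversion used in the pipeline.)"""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [" | ".join(f"{k}: {v}" for k, v in row.items()) for row in reader]

rows = table_to_text("name,revenue\nAcme,120\nGlobex,95\n")
# rows → ["name: Acme | revenue: 120", "name: Globex | revenue: 95"]
```

Keeping the column names attached to every value is what lets a later query like "what is Acme's revenue?" match the right chunk.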

🚀 Getting Started

Prerequisites: Python 3.11+, Tesseract OCR, a Google API key, and a Groq API key.

# 1. Clone
git clone https://github.com/AnithaKarre/multimodel_RAG.git
cd multimodel_RAG

# 2. Virtual environment
python -m venv .venv
.venv\Scripts\activate          # Windows
# source .venv/bin/activate     # Linux/macOS

# 3. Install dependencies
pip install -r requirements.txt

# 4. Set Tesseract path in backend/rag-model.py
#    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# 5. Create backend/assets/.env
#    EMBED_MODEL_API_KEY=your_google_api_key
#    GROQ_API_KEY=your_groq_api_key

# 6. Run
python backend/rag-model.py

Drop your documents (PDF, DOCX, CSV, Excel) into backend/assets/ before running.
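Step 5 above expects `EMBED_MODEL_API_KEY` and `GROQ_API_KEY` in `backend/assets/.env`. A minimal loader for that file might look like this (the pipeline may well use `python-dotenv` instead; this is a hedged stdlib sketch of the same behavior):

```python
import os

def load_env(path: str) -> dict[str, str]:
    """Parse simple KEY=value lines from a .env file and export them
    to os.environ — a minimal stand-in for python-dotenv's load_dotenv()."""
    values = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
            os.environ.setdefault(key.strip(), value.strip())
    return values

# e.g. load_env("backend/assets/.env")
```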


📂 Project Structure

multimodel_RAG/
├── main.py                # Entry point (extendable for UI/API)
├── requirements.txt       # Python dependencies
├── pyproject.toml         # Project configuration
├── uv.lock                # Lock file for reproducible installs
├── backend/
│   ├── rag-model.py       # Main RAG pipeline
│   ├── app.ipynb          # Jupyter notebook for experiments
│   └── assets/            # Input documents + .env file
└── images/                # Extracted images from PDFs (with OCR text)

🛠️ Tech Stack

Core: Python 3.11+ · LangChain · FAISS · Groq LLaMA 3

Document Processing: PyMuPDF (fitz) · python-docx · pandas · pytesseract + PIL

Embeddings: Google Generative AI Embeddings


🔄 How It Works

1. Load documents from backend/assets/
2. Extract text (PDF/DOCX) + parse tables (CSV/Excel)
3. Extract images from PDFs → OCR with Tesseract → save to images/
4. Chunk all extracted text with LangChain splitters
5. Embed chunks → index in FAISS
6. User asks a question
7. Query is embedded → similarity search in FAISS
8. Top chunks fed to Groq LLaMA 3 → grounded answer returned
9. If answer came from OCR text → image path included in response
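Steps 5–8 can be sketched without the real services: a toy bag-of-words "embedding" plus cosine similarity stands in for Google's embedding model and FAISS (the actual pipeline calls those libraries, so everything below is illustrative):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' — the pipeline uses
    Google Generative AI Embeddings instead."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by similarity to the query, as FAISS does at scale."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "Tesseract extracted this text from an image on page 3",
    "Quarterly revenue table parsed from a CSV file",
    "Introduction chapter of the annual report",
]
top = retrieve("which image mentions page 3 text", chunks, k=1)
```

The top-ranked chunks (plus any image paths attached to OCR-derived chunks) form the context handed to Groq LLaMA 3 in step 8.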

📜 License

MIT — see LICENSE for details.
