
🔍 PDF RAG — Retrieval-Augmented Generation from Scratch

A self-directed deep dive into RAG (Retrieval-Augmented Generation) — built to understand how modern AI pipelines work under the hood, from raw text ingestion all the way to semantic search.

Status: 🚧 In Development — architecture planned, implementation in progress


💡 What This Project Does

Upload any PDF and ask it questions in plain English. Instead of relying on keyword matching, the system retrieves by meaning, surfacing the most relevant passages with vector similarity search.

PDF → Parse → Chunk → Embed → Store in Vector DB → Query → Answer

This is the same fundamental pipeline behind tools like ChatGPT's file uploads, Notion AI, and enterprise document search systems.
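The "Chunk" step of the pipeline above can be sketched as a simple sliding window. This is a minimal illustration, not the project's final chunker; the `chunk_size` and `overlap` values are placeholders:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows.

    The overlap keeps sentences that straddle a boundary present in
    both neighboring chunks, so retrieval doesn't lose context.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]

# Tiny demo with character-level windows:
chunk_text("abcdefgh", chunk_size=4, overlap=2)
# → ['abcd', 'cdef', 'efgh', 'gh']
```

A production chunker would typically split on sentence or token boundaries rather than raw characters, but the overlap idea is the same.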


Resources / to-do:

  • https://realpython.com/chromadb-vector-database/
  • https://realpython.com/tutorials/ai/
  • https://www.datacamp.com/
  • https://huggingface.co/models?library=sentence-transformers&author=sentence-transformers
  • https://www.sbert.net/docs/sentence_transformer/pretrained_models.html


🧱 System Architecture

┌─────────────┐     ┌──────────────┐     ┌─────────────────────┐
│  PDF Input  │────▶│  Text Parser │────▶│   Chunking Engine   │
└─────────────┘     └──────────────┘     └──────────┬──────────┘
                                                     │
                                          ┌──────────▼──────────┐
                                          │  Embedding Model    │
                                          │  (Sentence-BERT)    │
                                          └──────────┬──────────┘
                                                     │
                                          ┌──────────▼──────────┐
                                          │  Vector Store       │
                                          │  (ChromaDB)         │
                                          └──────────┬──────────┘
                                                     │
                              ┌──────────────────────▼────────────────────────┐
                              │  Query Interface — Semantic Similarity Search  │
                              └───────────────────────────────────────────────┘

🛠️ Tech Stack

Component        | Tool                   | Why
-----------------|------------------------|-----------------------------------------------
PDF Parsing      | PyMuPDF / pdfplumber   | Robust text + layout extraction
Embeddings       | sentence-transformers  | Pretrained SBERT models, no fine-tuning needed
Vector Database  | ChromaDB               | Lightweight, local-first vector store
Language         | Python 3.10+           | Industry standard for ML pipelines

📐 Key Concepts Explored

  • Text Embeddings — converting unstructured text into dense numerical vectors that encode semantic meaning
  • Chunking Strategy — splitting documents into overlapping segments to preserve context across chunk boundaries
  • Cosine Similarity — measuring vector proximity to rank retrieved passages by relevance
  • RAG Pipeline Design — decoupling retrieval from generation; understanding where each component lives in the stack
  • Vector Databases — indexing, storing, and querying high-dimensional embeddings efficiently
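The cosine-similarity ranking described above can be shown with tiny hand-made vectors. Real SBERT embeddings have hundreds of dimensions; these 3-D vectors are purely illustrative:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(theta) = (a · b) / (|a| |b|); 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0, 1.0]
passages = {
    "relevant":  [0.9, 0.1, 0.8],  # points roughly the same way as the query
    "unrelated": [0.0, 1.0, 0.0],  # near-orthogonal direction
}
ranked = sorted(passages, key=lambda k: cosine_similarity(query, passages[k]),
                reverse=True)
# "relevant" ranks above "unrelated"
```

Because cosine similarity ignores vector magnitude and compares only direction, two passages of very different lengths can still score as semantically close.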

🗂️ Project Structure Goal

pdf-rag/
├── data/               # Sample PDFs for testing
├── src/
│   ├── ingest.py       # PDF loading and text extraction
│   ├── chunker.py      # Text splitting logic
│   ├── embed.py        # Embedding generation via sentence-transformers
│   ├── store.py        # ChromaDB read/write interface
│   └── query.py        # Similarity search and result ranking
├── notebooks/          # Exploratory work and model comparisons
├── requirements.txt
└── README.md
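The store.py / query.py pair can be prototyped with an in-memory stand-in before wiring up ChromaDB. This is a toy sketch under that assumption; the class and method names are hypothetical, not the project's actual interface:

```python
import math

class InMemoryVectorStore:
    """Toy stand-in for ChromaDB: holds (vector, text) records, ranks by cosine."""

    def __init__(self) -> None:
        self._records: dict[str, tuple[list[float], str]] = {}

    def add(self, doc_id: str, vector: list[float], text: str) -> None:
        self._records[doc_id] = (vector, text)

    def query(self, vector: list[float], top_k: int = 3) -> list[tuple[str, float]]:
        def cos(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.sqrt(sum(x * x for x in a))
                          * math.sqrt(sum(x * x for x in b)))
        scored = [(text, cos(vector, vec)) for vec, text in self._records.values()]
        return sorted(scored, key=lambda t: t[1], reverse=True)[:top_k]

store = InMemoryVectorStore()
store.add("c1", [1.0, 0.0], "chunk about embeddings")
store.add("c2", [0.0, 1.0], "chunk about cooking")
results = store.query([0.9, 0.1], top_k=1)
# top hit is the embeddings chunk
```

Swapping this for ChromaDB later changes only the storage backend; the add/query interface the rest of the pipeline sees can stay the same, which is the point of keeping store.py as a thin layer.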

🗺️ Roadmap

  • PDF ingestion and text extraction
  • Chunking with overlap
  • Embedding pipeline with sentence-transformers
  • ChromaDB integration
  • Semantic query interface
  • Notebook with embedding visualizations (UMAP/t-SNE)
  • Swap and benchmark multiple SBERT models
  • Optional: lightweight frontend (Streamlit or React)

🎯 Why I Built This

RAG is one of the most widely deployed patterns in production AI systems today. Rather than treating it as a black box, I wanted to build it myself and understand each layer: how embeddings encode meaning, how chunking preserves context, and how vector databases retrieve by similarity.

