
🔍 PDF RAG — Retrieval-Augmented Generation from Scratch

A self-directed deep dive into RAG (Retrieval-Augmented Generation) — built to understand how modern AI pipelines work under the hood, from raw text ingestion all the way to semantic search.

Status: 🚧 In Development — architecture planned, implementation in progress


💡 What This Project Does

Upload any PDF and ask it questions in plain English. Instead of relying on keyword matching, the system retrieves by meaning, surfacing the most relevant passages with vector similarity search.

PDF → Parse → Chunk → Embed → Store in Vector DB → Query → Answer

This is the same fundamental pipeline behind tools like ChatGPT's file uploads, Notion AI, and enterprise document search systems.
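The "Chunk" step of the pipeline above can be sketched as a simple sliding window. This is a minimal illustration, not the project's final chunker; the `chunk_size` and `overlap` values are placeholders:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows.

    The overlap keeps sentences that straddle a boundary present in
    both neighboring chunks, so retrieval doesn't lose context.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]

# Tiny demo with character-level windows:
chunk_text("abcdefgh", chunk_size=4, overlap=2)
# → ['abcd', 'cdef', 'efgh', 'gh']
```

A production chunker would typically split on sentence or token boundaries rather than raw characters, but the overlap idea is the same.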


Resources / to-do:

  • https://realpython.com/chromadb-vector-database/
  • https://realpython.com/tutorials/ai/
  • https://www.datacamp.com/
  • https://huggingface.co/models?library=sentence-transformers&author=sentence-transformers
  • https://www.sbert.net/docs/sentence_transformer/pretrained_models.html


🧱 System Architecture

┌─────────────┐     ┌──────────────┐     ┌─────────────────────┐
│  PDF Input  │────▶│  Text Parser │────▶│   Chunking Engine   │
└─────────────┘     └──────────────┘     └──────────┬──────────┘
                                                     │
                                          ┌──────────▼──────────┐
                                          │  Embedding Model    │
                                          │  (Sentence-BERT)    │
                                          └──────────┬──────────┘
                                                     │
                                          ┌──────────▼──────────┐
                                          │  Vector Store       │
                                          │  (ChromaDB)         │
                                          └──────────┬──────────┘
                                                     │
                              ┌──────────────────────▼────────────────────────┐
                              │  Query Interface — Semantic Similarity Search  │
                              └───────────────────────────────────────────────┘

🛠️ Tech Stack

Component        | Tool                   | Why
-----------------|------------------------|-----------------------------------------------
PDF Parsing      | PyMuPDF / pdfplumber   | Robust text + layout extraction
Embeddings       | sentence-transformers  | Pretrained SBERT models, no fine-tuning needed
Vector Database  | ChromaDB               | Lightweight, local-first vector store
Language         | Python 3.10+           | Industry standard for ML pipelines

📐 Key Concepts Explored

  • Text Embeddings — converting unstructured text into dense numerical vectors that encode semantic meaning
  • Chunking Strategy — splitting documents into overlapping segments to preserve context across chunk boundaries
  • Cosine Similarity — measuring vector proximity to rank retrieved passages by relevance
  • RAG Pipeline Design — decoupling retrieval from generation; understanding where each component lives in the stack
  • Vector Databases — indexing, storing, and querying high-dimensional embeddings efficiently
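The cosine-similarity ranking described above can be shown with tiny hand-made vectors. Real SBERT embeddings have hundreds of dimensions; these 3-D vectors are purely illustrative:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(theta) = (a · b) / (|a| |b|); 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0, 1.0]
passages = {
    "relevant":  [0.9, 0.1, 0.8],  # points roughly the same way as the query
    "unrelated": [0.0, 1.0, 0.0],  # near-orthogonal direction
}
ranked = sorted(passages, key=lambda k: cosine_similarity(query, passages[k]),
                reverse=True)
# "relevant" ranks above "unrelated"
```

Because cosine similarity ignores vector magnitude and compares only direction, two passages of very different lengths can still score as semantically close.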

🗂️ Project Structure Goal

pdf-rag/
├── data/               # Sample PDFs for testing
├── src/
│   ├── ingest.py       # PDF loading and text extraction
│   ├── chunker.py      # Text splitting logic
│   ├── embed.py        # Embedding generation via sentence-transformers
│   ├── store.py        # ChromaDB read/write interface
│   └── query.py        # Similarity search and result ranking
├── notebooks/          # Exploratory work and model comparisons
├── requirements.txt
└── README.md
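The store.py / query.py pair can be prototyped with an in-memory stand-in before wiring up ChromaDB. This is a toy sketch under that assumption; the class and method names are hypothetical, not the project's actual interface:

```python
import math

class InMemoryVectorStore:
    """Toy stand-in for ChromaDB: holds (vector, text) records, ranks by cosine."""

    def __init__(self) -> None:
        self._records: dict[str, tuple[list[float], str]] = {}

    def add(self, doc_id: str, vector: list[float], text: str) -> None:
        self._records[doc_id] = (vector, text)

    def query(self, vector: list[float], top_k: int = 3) -> list[tuple[str, float]]:
        def cos(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.sqrt(sum(x * x for x in a))
                          * math.sqrt(sum(x * x for x in b)))
        scored = [(text, cos(vector, vec)) for vec, text in self._records.values()]
        return sorted(scored, key=lambda t: t[1], reverse=True)[:top_k]

store = InMemoryVectorStore()
store.add("c1", [1.0, 0.0], "chunk about embeddings")
store.add("c2", [0.0, 1.0], "chunk about cooking")
results = store.query([0.9, 0.1], top_k=1)
# top hit is the embeddings chunk
```

Swapping this for ChromaDB later changes only the storage backend; the add/query interface the rest of the pipeline sees can stay the same, which is the point of keeping store.py as a thin layer.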

🗺️ Roadmap

  • PDF ingestion and text extraction
  • Chunking with overlap
  • Embedding pipeline with sentence-transformers
  • ChromaDB integration
  • Semantic query interface
  • Notebook with embedding visualizations (UMAP/t-SNE)
  • Swap and benchmark multiple SBERT models
  • Optional: lightweight frontend (Streamlit or React)

🎯 Why I Built This

RAG is one of the most widely deployed patterns in production AI systems today. Rather than treating it as a black box, I wanted to build it myself and understand each layer: how embeddings encode meaning, how chunking preserves context, and how vector databases retrieve by similarity.

