A Retrieval-Augmented Generation (RAG) pipeline that enables semantic search and question-answering over PDF documents and Wikipedia articles.
Retrieval-Augmented Generation (RAG) is a technique that enhances Large Language Models (LLMs) by providing them with relevant context from external documents. Instead of relying solely on the model's training data, RAG:
- Retrieves relevant document chunks based on semantic similarity to the user's question
- Augments the LLM prompt with these retrieved chunks as context
- Generates an answer grounded in the actual document content
This approach allows LLMs to answer questions about private documents, recent information, or specialized content they weren't trained on, while providing source citations for transparency.
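As a rough sketch, the three steps come down to a few lines of code. The example below assumes a LangChain-style vector store and chat model are already constructed (`vectorstore` and `llm` are placeholders, not names from this project), and the prompt wording is illustrative:

```python
# Minimal sketch of the retrieve -> augment -> generate loop.
# Assumes `vectorstore` (e.g. a LangChain PineconeVectorStore) and
# `llm` (e.g. ChatOpenAI) have already been constructed.
def rag_answer(question, vectorstore, llm, k=10):
    # Retrieve: chunks semantically similar to the question
    chunks = vectorstore.similarity_search(question, k=k)
    # Augment: inject the retrieved chunks into the prompt as context
    context = "\n\n".join(c.page_content for c in chunks)
    prompt = (
        "Answer the question using only the context below, "
        "citing page numbers.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    # Generate: the answer is grounded in the document content
    return llm.invoke(prompt).content
```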
- PDF Document Ingestion: Load and process PDF files using PyPDFLoader (the ingestion pipeline is sketched after this list)
- Wikipedia Ingestion: Query and load Wikipedia articles directly
- Text Chunking: Split documents into overlapping chunks for optimal retrieval
- Vector Embeddings: Generate embeddings using OpenAI's text-embedding-3-small model
- Pinecone Vector Store: Store and query embeddings in Pinecone's serverless vector database
- Semantic Search: Find relevant document chunks based on meaning, not just keywords
- Cross-Encoder Reranking: Two-stage retrieval with reranking for improved accuracy (sketched after the pipeline diagram below)
  - Stage 1: Retrieve top-N candidates using fast vector similarity
  - Stage 2: Rerank using a cross-encoder model for precise relevance scoring
- GPT-4o Integration: Generate answers with source citations (page numbers)
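A sketch of how these ingestion pieces might fit together, assuming LangChain's community loaders and the `langchain-pinecone` integration; the file name, chunk size, overlap, and index name are illustrative, not the project's actual settings:

```python
from langchain_community.document_loaders import PyPDFLoader, WikipediaLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

# Load a PDF (or swap in WikipediaLoader(query="Treasure Island").load())
docs = PyPDFLoader("treasure_island.pdf").load()

# Split into overlapping chunks; sizes here are illustrative
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Embed with text-embedding-3-small and upsert into an existing
# Pinecone index (index creation is shown later in this README)
vectorstore = PineconeVectorStore.from_documents(
    chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    index_name="doc-query",  # hypothetical index name
)
```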
```
User Question
      |
      v
[Vector Similarity Search] --> Retrieve top-30 candidate chunks
      |
      v
[Cross-Encoder Reranking] --> Rerank and select top-10 most relevant
      |
      v
[Prompt Construction] --> Combine question + context + instructions
      |
      v
[GPT-4o] --> Generate answer with citations
      |
      v
Answer with [p. nn] citations
```
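A sketch of the two retrieval stages, using sentence-transformers for the cross-encoder. The model name is an assumption (a widely used MS MARCO cross-encoder); the README does not pin a specific reranking model:

```python
from sentence_transformers import CrossEncoder

# Assumed model: a common cross-encoder default, not specified by this project
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(question, vectorstore, top_n=30, top_k=10):
    # Stage 1: fast vector similarity over the whole index
    candidates = vectorstore.similarity_search(question, k=top_n)
    # Stage 2: precise relevance score for each (question, chunk) pair
    scores = reranker.predict([(question, c.page_content) for c in candidates])
    # Keep the top_k chunks by cross-encoder score
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```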
- Python 3.10+
- Pinecone account (free tier available)
- OpenAI API key
- Clone and navigate to the project:

  ```bash
  cd doc_query
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure API keys by creating a `.env` file in the project root (loading these at runtime is sketched after these steps):

  ```
  OPENAI_API_KEY=your-openai-api-key
  PINECONE_API_KEY=your-pinecone-api-key
  ```
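At runtime, the notebook and app can pick these keys up with python-dotenv; a minimal sketch, assuming that package is installed:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root into the environment
openai_key = os.environ["OPENAI_API_KEY"]
pinecone_key = os.environ["PINECONE_API_KEY"]
```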
```bash
# Register the virtual environment with Jupyter (first time only)
pip install ipykernel
python -m ipykernel install --user --name=doc_query --display-name "Python (doc_query)"

# Start Jupyter
jupyter notebook doc_query.ipynb
```

To run the web app instead:

```bash
streamlit run app.py
```

The app will open in your browser at http://localhost:8501.
- Place your PDF document in an accessible location
- Update the file path in the `load_document()` call
- Run cells sequentially to:
  - Load and chunk the document
  - Create/load the Pinecone index with embeddings (see the sketch after this list)
  - Query the document with natural language questions
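The create-or-load step might look like the following, assuming the `pinecone` v3+ client. The index name and region are assumptions; the dimension of 1536 matches text-embedding-3-small's output size:

```python
import os
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create the serverless index only if it does not exist yet
if "doc-query" not in pc.list_indexes().names():
    pc.create_index(
        name="doc-query",                # hypothetical index name
        dimension=1536,                  # text-embedding-3-small output size
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # assumed region
    )
```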
- Upload a PDF document using the file uploader
- Wait for the document to be processed and indexed
- Enter your question in the text input
- View the answer with source citations (a minimal sketch of this flow follows)
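For orientation, the general shape of such a Streamlit flow is sketched below; `ingest`, `rag_answer`, and `llm` are hypothetical placeholders, not names taken from the project's actual `app.py`:

```python
import streamlit as st

st.title("Doc Query")

uploaded = st.file_uploader("Upload a PDF", type="pdf")
if uploaded is not None:
    with st.spinner("Processing and indexing..."):
        vectorstore = ingest(uploaded)  # hypothetical helper: load, chunk, embed, index

    question = st.text_input("Enter your question")
    if question:
        # rag_answer as sketched in the introduction; llm assumed constructed elsewhere
        st.write(rag_answer(question, vectorstore, llm))
```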
Using "Treasure Island" PDF as the source document:
Query: "What does Jim fear most?"
Output:
Jim fears torture the most. This is evident from the excerpt where he expresses his fear to the doctor: "but what I fear is torture. If they come to torture me --" [p. 124]. This statement highlights that, despite the various dangers and threats he faces, the prospect of being tortured is what he dreads the most.
Query: "What is Long John Silver's motivation?"
Output:
Long John Silver's primary motivation appears to be the acquisition of treasure. This is evident from his direct statement: "Well, here it is," said Silver. "We want that treasure, and we'll have it -- that's our point!" [p. 80]. His desire for the treasure is further emphasized when he insists on obtaining the chart from the captain: "What I mean is, we want your chart. Now, I never meant you no harm, myself." [p. 80].
Approximate costs for document ingestion and a typical query session:
| Operation | Model | Cost |
|---|---|---|
| Embeddings | text-embedding-3-small | ~$0.002 per 100K tokens |
| LLM Queries | GPT-4o | ~$0.005 per query (1.5K input + 100 output tokens) |
| Pinecone | Serverless (free tier) | Free for small indexes |
Example: Processing a 142-page novel (Treasure Island):
- Embedding cost: ~$0.002 (113K tokens)
- Per query cost: ~$0.005
- Total for 10 queries: ~$0.05 (checked in the sketch below)
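A quick arithmetic check of these totals against the table's rates:

```python
# Back-of-envelope check using the rates from the table above
embed_rate = 0.002 / 100_000        # $ per token (text-embedding-3-small)
query_cost = 0.005                  # $ per GPT-4o query
ingest = 113_000 * embed_rate       # ~$0.002 to embed the 113K-token novel
total = ingest + 10 * query_cost    # ~$0.052 for a 10-query session
print(f"ingest ~ ${ingest:.4f}, 10-query session ~ ${total:.3f}")
```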
- Push to GitHub:

  ```bash
  git init
  git add .
  git commit -m "Initial commit"
  git remote add origin https://github.com/yourusername/doc-query.git
  git push -u origin main
  ```

- Deploy on Streamlit Cloud:
  - Go to share.streamlit.io
  - Click "New app"
  - Connect your GitHub repository
  - Set the main file path to `app.py`
  - Add secrets in the Streamlit Cloud dashboard: `OPENAI_API_KEY` and `PINECONE_API_KEY`

- Secrets Configuration: in Streamlit Cloud, add secrets via the dashboard (Settings > Secrets); reading them at runtime is sketched below:

  ```
  OPENAI_API_KEY = "your-openai-api-key"
  PINECONE_API_KEY = "your-pinecone-api-key"
  ```
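In app code, Streamlit exposes these values via `st.secrets`. A common pattern (an assumption about this app, not taken from it) is to fall back to environment variables so the same code also runs locally with `.env`:

```python
import os
import streamlit as st

# Prefer Streamlit Cloud secrets; fall back to the local environment (.env)
OPENAI_API_KEY = st.secrets.get("OPENAI_API_KEY", os.getenv("OPENAI_API_KEY"))
PINECONE_API_KEY = st.secrets.get("PINECONE_API_KEY", os.getenv("PINECONE_API_KEY"))
```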
```
doc_query/
├── app.py              # Streamlit web application
├── doc_query.ipynb     # Jupyter notebook with full pipeline
├── requirements.txt    # Python dependencies
├── README.md           # This file
├── .env                # API keys (not committed)
├── .gitignore          # Git ignore rules
└── venv/               # Virtual environment (not committed)
```
- LangChain: Document loaders, text splitters, prompt templates, and chains
- OpenAI: Embeddings (text-embedding-3-small) and LLM (GPT-4o)
- Pinecone: Serverless vector database for similarity search
- Sentence Transformers: Cross-encoder models for reranking
- Streamlit: Web application framework
This project is for educational purposes as part of the ZTM LLM Web Apps course.