A Retrieval-Augmented Generation (RAG) pipeline that enables semantic search and question-answering over PDF documents and Wikipedia articles.
Retrieval-Augmented Generation (RAG) is a technique that enhances Large Language Models (LLMs) by providing them with relevant context from external documents. Instead of relying solely on the model's training data, RAG:
- Retrieves relevant document chunks based on semantic similarity to the user's question
- Augments the LLM prompt with these retrieved chunks as context
- Generates an answer grounded in the actual document content
This approach allows LLMs to answer questions about private documents, recent information, or specialized content they weren't trained on, while providing source citations for transparency.
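As a rough sketch, the three steps come down to a few lines of code. The example below assumes a LangChain-style vector store and chat model are already constructed (`vectorstore` and `llm` are placeholders, not names from this project), and the prompt wording is illustrative:

```python
# Minimal sketch of the retrieve -> augment -> generate loop.
# Assumes `vectorstore` (e.g. a LangChain PineconeVectorStore) and
# `llm` (e.g. ChatOpenAI) have already been constructed.
def rag_answer(question, vectorstore, llm, k=10):
    # Retrieve: chunks semantically similar to the question
    chunks = vectorstore.similarity_search(question, k=k)
    # Augment: inject the retrieved chunks into the prompt as context
    context = "\n\n".join(c.page_content for c in chunks)
    prompt = (
        "Answer the question using only the context below, "
        "citing page numbers.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    # Generate: the answer is grounded in the document content
    return llm.invoke(prompt).content
```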
- PDF Document Ingestion: Load and process PDF files using PyPDFLoader (the ingestion pipeline is sketched after this list)
- Wikipedia Ingestion: Query and load Wikipedia articles directly
- Text Chunking: Split documents into overlapping chunks for optimal retrieval
- Vector Embeddings: Generate embeddings using OpenAI's text-embedding-3-small model
- Pinecone Vector Store: Store and query embeddings in Pinecone's serverless vector database
- Semantic Search: Find relevant document chunks based on meaning, not just keywords
- Cross-Encoder Reranking: Two-stage retrieval with reranking for improved accuracy (sketched after the pipeline diagram below)
  - Stage 1: Retrieve top-N candidates using fast vector similarity
  - Stage 2: Rerank using a cross-encoder model for precise relevance scoring
- GPT-4o Integration: Generate answers with source citations (page numbers)
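A sketch of how these ingestion pieces might fit together, assuming LangChain's community loaders and the `langchain-pinecone` integration; the file name, chunk size, overlap, and index name are illustrative, not the project's actual settings:

```python
from langchain_community.document_loaders import PyPDFLoader, WikipediaLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

# Load a PDF (or swap in WikipediaLoader(query="Treasure Island").load())
docs = PyPDFLoader("treasure_island.pdf").load()

# Split into overlapping chunks; sizes here are illustrative
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Embed with text-embedding-3-small and upsert into an existing
# Pinecone index (index creation is shown later in this README)
vectorstore = PineconeVectorStore.from_documents(
    chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    index_name="doc-query",  # hypothetical index name
)
```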
```
User Question
      |
      v
[Vector Similarity Search] --> Retrieve top-30 candidate chunks
      |
      v
[Cross-Encoder Reranking] --> Rerank and select top-10 most relevant
      |
      v
[Prompt Construction] --> Combine question + context + instructions
      |
      v
[GPT-4o] --> Generate answer with citations
      |
      v
Answer with [p. nn] citations
```
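A sketch of the two retrieval stages, using sentence-transformers for the cross-encoder. The model name is an assumption (a widely used MS MARCO cross-encoder); the README does not pin a specific reranking model:

```python
from sentence_transformers import CrossEncoder

# Assumed model: a common cross-encoder default, not specified by this project
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(question, vectorstore, top_n=30, top_k=10):
    # Stage 1: fast vector similarity over the whole index
    candidates = vectorstore.similarity_search(question, k=top_n)
    # Stage 2: precise relevance score for each (question, chunk) pair
    scores = reranker.predict([(question, c.page_content) for c in candidates])
    # Keep the top_k chunks by cross-encoder score
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```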
- Python 3.10+
- Pinecone account (free tier available)
- OpenAI API key
- Clone and navigate to the project:

  ```bash
  cd doc_query
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure API keys by creating a `.env` file in the project root (loading these at runtime is sketched after these steps):

  ```
  OPENAI_API_KEY=your-openai-api-key
  PINECONE_API_KEY=your-pinecone-api-key
  ```
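At runtime, the notebook and app can pick these keys up with python-dotenv; a minimal sketch, assuming that package is installed:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root into the environment
openai_key = os.environ["OPENAI_API_KEY"]
pinecone_key = os.environ["PINECONE_API_KEY"]
```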
```bash
# Register the virtual environment with Jupyter (first time only)
pip install ipykernel
python -m ipykernel install --user --name=doc_query --display-name "Python (doc_query)"

# Start Jupyter
jupyter notebook doc_query.ipynb
```

To run the web app instead:

```bash
streamlit run app.py
```

The app will open in your browser at http://localhost:8501.
- Place your PDF document in an accessible location
- Update the file path in the `load_document()` call
- Run cells sequentially to:
  - Load and chunk the document
  - Create/load the Pinecone index with embeddings (see the sketch after this list)
  - Query the document with natural language questions
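The create-or-load step might look like the following, assuming the `pinecone` v3+ client. The index name and region are assumptions; the dimension of 1536 matches text-embedding-3-small's output size:

```python
import os
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create the serverless index only if it does not exist yet
if "doc-query" not in pc.list_indexes().names():
    pc.create_index(
        name="doc-query",                # hypothetical index name
        dimension=1536,                  # text-embedding-3-small output size
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # assumed region
    )
```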
- Upload a PDF document using the file uploader
- Wait for the document to be processed and indexed
- Enter your question in the text input
- View the answer with source citations (a minimal sketch of this flow follows)
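For orientation, the general shape of such a Streamlit flow is sketched below; `ingest`, `rag_answer`, and `llm` are hypothetical placeholders, not names taken from the project's actual `app.py`:

```python
import streamlit as st

st.title("Doc Query")

uploaded = st.file_uploader("Upload a PDF", type="pdf")
if uploaded is not None:
    with st.spinner("Processing and indexing..."):
        vectorstore = ingest(uploaded)  # hypothetical helper: load, chunk, embed, index

    question = st.text_input("Enter your question")
    if question:
        # rag_answer as sketched in the introduction; llm assumed constructed elsewhere
        st.write(rag_answer(question, vectorstore, llm))
```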
Using "Treasure Island" PDF as the source document:
Query: "What does Jim fear most?"
Output:
Jim fears torture the most. This is evident from the excerpt where he expresses his fear to the doctor: "but what I fear is torture. If they come to torture me --" [p. 124]. This statement highlights that, despite the various dangers and threats he faces, the prospect of being tortured is what he dreads the most.
Query: "What is Long John Silver's motivation?"
Output:
Long John Silver's primary motivation appears to be the acquisition of treasure. This is evident from his direct statement: "Well, here it is," said Silver. "We want that treasure, and we'll have it -- that's our point!" [p. 80]. His desire for the treasure is further emphasized when he insists on obtaining the chart from the captain: "What I mean is, we want your chart. Now, I never meant you no harm, myself." [p. 80].
Approximate costs for document ingestion and a typical query session:
| Operation | Model | Cost |
|---|---|---|
| Embeddings | text-embedding-3-small | ~$0.002 per 100K tokens |
| LLM Queries | GPT-4o | ~$0.005 per query (1.5K input + 100 output tokens) |
| Pinecone | Serverless (free tier) | Free for small indexes |
Example: Processing a 142-page novel (Treasure Island):
- Embedding cost: ~$0.002 (113K tokens)
- Per query cost: ~$0.005
- Total for 10 queries: ~$0.05 (checked in the sketch below)
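A quick arithmetic check of these totals against the table's rates:

```python
# Back-of-envelope check using the rates from the table above
embed_rate = 0.002 / 100_000        # $ per token (text-embedding-3-small)
query_cost = 0.005                  # $ per GPT-4o query
ingest = 113_000 * embed_rate       # ~$0.002 to embed the 113K-token novel
total = ingest + 10 * query_cost    # ~$0.052 for a 10-query session
print(f"ingest ~ ${ingest:.4f}, 10-query session ~ ${total:.3f}")
```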
- Push to GitHub:

  ```bash
  git init
  git add .
  git commit -m "Initial commit"
  git remote add origin https://github.com/yourusername/doc-query.git
  git push -u origin main
  ```

- Deploy on Streamlit Cloud:
  - Go to share.streamlit.io
  - Click "New app"
  - Connect your GitHub repository
  - Set the main file path to `app.py`
  - Add secrets in the Streamlit Cloud dashboard: `OPENAI_API_KEY` and `PINECONE_API_KEY`

- Secrets Configuration: in Streamlit Cloud, add secrets via the dashboard (Settings > Secrets); reading them at runtime is sketched below:

  ```
  OPENAI_API_KEY = "your-openai-api-key"
  PINECONE_API_KEY = "your-pinecone-api-key"
  ```
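In app code, Streamlit exposes these values via `st.secrets`. A common pattern (an assumption about this app, not taken from it) is to fall back to environment variables so the same code also runs locally with `.env`:

```python
import os
import streamlit as st

# Prefer Streamlit Cloud secrets; fall back to the local environment (.env)
OPENAI_API_KEY = st.secrets.get("OPENAI_API_KEY", os.getenv("OPENAI_API_KEY"))
PINECONE_API_KEY = st.secrets.get("PINECONE_API_KEY", os.getenv("PINECONE_API_KEY"))
```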
```
doc_query/
├── app.py              # Streamlit web application
├── doc_query.ipynb     # Jupyter notebook with full pipeline
├── requirements.txt    # Python dependencies
├── README.md           # This file
├── .env                # API keys (not committed)
├── .gitignore          # Git ignore rules
└── venv/               # Virtual environment (not committed)
```
- LangChain: Document loaders, text splitters, prompt templates, and chains
- OpenAI: Embeddings (text-embedding-3-small) and LLM (GPT-4o)
- Pinecone: Serverless vector database for similarity search
- Sentence Transformers: Cross-encoder models for reranking
- Streamlit: Web application framework
This project is for educational purposes as part of the ZTM LLM Web Apps course.