FastAPI service for PDF ingestion, indexing, retrieval, and citation-grounded chat.
This project builds a document RAG system over PDF files.
- It chunks PDF content into structured units.
- It runs OCR for scanned pages and visual regions.
- It stores embeddings in Chroma.
- It stores searchable text in SQLite with FTS5 BM25.
- It serves retrieval and chat APIs on top of indexed data.
This repository has two RAG implementations:
- Linear RAG implementation in src/agents/rag, exposed by POST /chat.
- Agentic RAG with conversational memory in src/agents/agentic_rag (agent_rag), exposed by POST /agent_chat.
The main path of this project is the agentic RAG endpoint, POST /agent_chat.
Linear RAG is a fixed sequence graph.
- retriever gets vector and text candidates.
- If images=true, image_blob_loader reads image bytes from disk.
- llm generates the answer using retrieved context.
- citation_validation removes invalid citations and enforces strict grounding.
- output returns final answer, sources, usage, and image send stats.
Agentic RAG is a tool calling graph with memory.
- input loads conversation history from the in-memory cache and, if it is not cached, fetches it from Supabase.
- llm_node decides whether to call tools or to answer directly, as it does for greeting messages.
- tool_node executes retrieve tool calls (up to 5 tool calls in one run).
- citation_validation validates the final answer against tool-retrieved sources.
- save_node writes the final turn to the cache and, optionally, to Supabase.
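The loop between llm_node and tool_node can be pictured with the sketch below. It is not the project's graph code: run_agent, decide, and retrieve are hypothetical names, and only the cap of 5 tool calls per run comes from the description above.

```python
from typing import Callable, Optional

MAX_TOOL_CALLS = 5  # cap taken from the description above


def run_agent(
    message: str,
    decide: Callable[[str, list[str]], Optional[str]],
    retrieve: Callable[[str], str],
) -> tuple[list[str], int]:
    """Alternate between a decision step and a retrieve tool.

    decide returns a search query to request a tool call, or None to answer.
    Returns the collected context and the number of tool calls made.
    """
    context: list[str] = []
    calls = 0
    while calls < MAX_TOOL_CALLS:
        query = decide(message, context)
        if query is None:  # the model chose to answer directly
            break
        context.append(retrieve(query))
        calls += 1
    return context, calls
```

A greeting would make decide return None on the first pass, so no tool call is made; a decision step that keeps asking for retrieval is stopped after the fifth call.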
Conversational memory behavior:
- Recent turns are cached in process memory by chat_id.
- On each request, prior turns are loaded and inserted into the graph input.
- If Supabase is configured, turns are persisted and can be restored after restart.
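A minimal sketch of this cache-then-store behavior, assuming a hypothetical ChatMemory class, a hypothetical fetch_remote callable standing in for the Supabase lookup, and an illustrative max_turns limit that the project may not use:

```python
from typing import Callable, Optional


class ChatMemory:
    """Per-process turn cache keyed by chat_id, with an optional remote fallback."""

    def __init__(
        self,
        fetch_remote: Optional[Callable[[str], list]] = None,
        max_turns: int = 10,  # illustrative limit, must be positive
    ) -> None:
        self._cache: dict[str, list] = {}
        self._fetch_remote = fetch_remote
        self._max_turns = max_turns

    def load(self, chat_id: str) -> list:
        """Return cached turns, falling back to the remote store on a cache miss."""
        if chat_id not in self._cache:
            remote = self._fetch_remote(chat_id) if self._fetch_remote else []
            self._cache[chat_id] = list(remote)[-self._max_turns:]
        return list(self._cache[chat_id])

    def save(self, chat_id: str, turn: dict) -> None:
        """Append a turn and keep only the most recent max_turns entries."""
        turns = self._cache.setdefault(chat_id, [])
        turns.append(turn)
        del turns[:-self._max_turns]
```

Because the cache lives in process memory, it is empty after a restart; the remote fallback is what restores earlier turns in that case.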
If evidence is missing, both chat implementations return:
Not found in the document.
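The grounding rule can be illustrated as follows. This is a sketch under assumptions, not the project's validator: it assumes citations appear in the answer as bracketed numbers such as [1], and that an answer left with no valid citation falls back to the fixed message. The function name validate_citations is hypothetical.

```python
import re

NOT_FOUND = "Not found in the document."


def validate_citations(answer: str, available: set[int]) -> dict:
    """Drop citation markers that do not match a retrieved source."""
    found = [int(n) for n in re.findall(r"\[(\d+)\]", answer)]
    invalid = sorted({n for n in found if n not in available})
    used = sorted({n for n in found if n in available})

    cleaned = answer
    for n in invalid:
        cleaned = cleaned.replace(f"[{n}]", "")
    # Tidy whitespace left behind by removed markers.
    cleaned = re.sub(r"\s+([.,;])", r"\1", cleaned)
    cleaned = re.sub(r"\s{2,}", " ", cleaned).strip()

    if not used:  # assumed strict grounding: no valid evidence, no answer
        cleaned = NOT_FOUND
    return {"answer": cleaned, "used_citations": used, "invalid_citations": invalid}
```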
The chunking entry point is PDFChunker. It supports three pdf_type values: ppt, non_ppt, and image_only.
The standard profile used in this documentation is pdf_type=ppt.
ppt:
- Processes page by page.
- Uses a page screenshot and OCR for visually heavy slides.
- Produces page_ocr chunks and text chunks as available.
- Stores full page image references for multimodal answering.
non_ppt:
- Detects mixed pages with text, tables, and visual regions.
- Extracts text chunks by configured granularity (page, paragraph, heading, fixed).
- Extracts table chunks with extract_then_markdown using pdfplumber, camelot, or cascade.
- Supports ocr_first_then_construct for OCR on cropped table images.
- Extracts visual chunks (chunk_type=image) for charts and diagrams.
image_only:
- Targets scanned documents with a minimal text layer.
- Runs OCR on rendered page images.
- Produces OCR-driven searchable chunks.
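The option values named above can be checked before a PDF is chunked. The sketch below is illustrative only: ChunkingOptions is a hypothetical stand-in, not the project's ChunkingConfig, and the defaults are assumptions taken from the example values in this document.

```python
from dataclasses import dataclass

# Allowed values as listed in this document.
ALLOWED = {
    "pdf_type": {"ppt", "non_ppt", "image_only"},
    "granularity": {"page", "paragraph", "heading", "fixed"},
    "table_strategy": {"extract_then_markdown", "ocr_first_then_construct"},
    "table_engine": {"pdfplumber", "camelot", "cascade"},
}


@dataclass(frozen=True)
class ChunkingOptions:
    """Hypothetical subset of chunking controls with assumed defaults."""

    pdf_type: str = "ppt"
    granularity: str = "page"
    table_strategy: str = "extract_then_markdown"
    table_engine: str = "pdfplumber"

    def __post_init__(self) -> None:
        # Reject any value that is not in the documented set.
        for field_name, allowed in ALLOWED.items():
            value = getattr(self, field_name)
            if value not in allowed:
                raise ValueError(
                    f"{field_name}={value!r} is not one of {sorted(allowed)}"
                )
```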
After chunking, indexing runs through embed_and_store.
- Clean chunk text for embedding.
- Create embeddings with Google embedding model.
- Upsert vectors and lightweight metadata to Chroma.
- Insert full text and metadata JSON to SQLite.
- Maintain FTS5 index via triggers for BM25 retrieval.
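The SQLite side of these steps can be sketched with the standard library. The table, trigger, and function names here are assumptions rather than the project's schema, and the example needs a Python build whose SQLite includes FTS5 (most do). An update trigger is omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    chunk_id TEXT UNIQUE,
    text TEXT,
    metadata TEXT
);
-- External-content FTS5 index over chunks.text
CREATE VIRTUAL TABLE chunks_fts USING fts5(text, content='chunks', content_rowid='id');
-- Triggers keep the index in step with the base table
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES ('delete', old.id, old.text);
END;
""")
conn.executemany(
    "INSERT INTO chunks (chunk_id, text, metadata) VALUES (?, ?, ?)",
    [
        ("c1", "Quarterly revenue grew in the retail segment", '{"page": 1}'),
        ("c2", "The appendix lists office locations", '{"page": 9}'),
    ],
)


def bm25_search(query: str, limit: int = 5) -> list:
    """Return (chunk_id, score) rows; FTS5 bm25() scores are lower for better matches."""
    return conn.execute(
        "SELECT c.chunk_id, bm25(chunks_fts) AS score "
        "FROM chunks_fts JOIN chunks c ON c.id = chunks_fts.rowid "
        "WHERE chunks_fts MATCH ? ORDER BY score LIMIT ?",
        (query, limit),
    ).fetchall()
```

Keeping the index in triggers means the write path only inserts into the base table, and the FTS5 index cannot drift from it.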
Storage breakdown:
- Chroma stores embeddings.
- Chroma stores lightweight metadata (doc_id, page, chunk_type, image_path, and document fields).
- SQLite stores full chunk text.
- SQLite stores cleaned text indexed in FTS5.
- SQLite stores rich metadata JSON including bbox and image dimensions.
- Filesystem stores extracted images under data/images/{pdf_stem}.
- Filesystem stores uploaded PDFs in data/tmp.
Default paths under DATA_DIR (./data by default):
data/
  chroma/          # Chroma vector store
  db/
    rag_chunks.db  # SQLite tables + FTS5 index
  images/          # Extracted page and crop images
  tmp/             # Uploaded PDFs kept on disk
  logs/
    app.log        # Rotating logs
- Chroma stores vectors and lightweight metadata per chunk.
- SQLite stores full chunk text and JSON metadata for BM25 search.
- Image files are saved on disk and linked by relative image_path.
- Chat memory for agent_chat is cached in process memory.
- If Supabase is configured and available, chat turns are also persisted to DB.
src/
app.py # FastAPI app and core endpoints
api/agent_chat.py # /agent_chat router
config.py # Settings and ChunkingConfig
components/
ingestion/
chunker/ # PDF chunking, OCR, image capture
store/ # Chroma and SQLite write/read layers
retriever/ # Retrieval orchestration
utils/embeddings.py # Embedding helpers
agents/
rag/ # Fixed graph chat flow for /chat
agentic_rag/ # Tool calling chat flow for /agent_chat
llm/llm.py # Gemini client
log/logs.py # Logger config
- Python 3.11 recommended.
- Google API key for LLM and embedding models.
- PDF and OCR dependencies from requirements.txt.
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
cp .env.example .env
Minimum required .env:
GOOGLE_API_KEY=your_google_api_key_here
LLM_MODEL=models/gemini-2.5-flash
LLM_MAX_TOKENS=10000
LLM_TEMPERATURE=0
Run the server:
uvicorn src.app:app --reload
Docs:
http://127.0.0.1:8000/docs
Primary chat endpoint is POST /agent_chat.
POST /chunk accepts multipart/form-data:
- required: file (PDF)
- optional: all chunking controls in ChunkingConfig
Example:
curl -X POST "http://127.0.0.1:8000/chunk" \
  -F "file=@/absolute/path/report.pdf" \
  -F "pdf_type=ppt" \
  -F "granularity=page"
Exact response JSON format:
{
  "doc_id": "string",
  "source_file": "/absolute/path/to/saved/upload.pdf",
  "pdf_type": "ppt",
  "granularity": "page",
  "table_strategy": "extract_then_markdown",
  "table_engine": "pdfplumber",
  "ocr_engine": "doctr",
  "image_ocr": true,
  "total_chunks": 0,
  "chunks_by_type": {},
  "chunks_by_page": {},
  "extraction_confidence": {},
  "avg_token_count": 0,
  "max_token_count": 0,
  "generated_at": "2026-03-07T00:00:00+00:00"
}
Retrieval request JSON:
{
  "query": "string",
  "top_k": 5,
  "images": true,
  "image_payload": "ref",
  "text_search": true
}
Exact response JSON key format when images=true and text_search=true:
{
  "vector_results": [
    {
      "chunk_id": "string",
      "text": "string",
      "score": 0.0,
      "doc_id": "string",
      "page": 0,
      "chunk_type": "text",
      "image_path": "pdf_stem/p01_page_ocr_00.webp",
      "section_title": "string",
      "token_count": 0,
      "extraction_confidence": "high",
      "image_width_px": 0,
      "image_height_px": 0,
      "bbox": {
        "x0": 0.0,
        "y0": 0.0,
        "x1": 0.0,
        "y1": 0.0
      },
      "source_uri": "file:///abs/path/file.pdf",
      "source_file": "/abs/path/file.pdf",
      "filename": "original_upload.pdf",
      "total_pages": 0,
      "pdf_type": "non_ppt",
      "created_at": "2026-03-07T00:00:00+00:00",
      "doc_metadata": {},
      "retrieval_type": "vector",
      "images": {
        "page_image": {
          "key": "pdf_stem/p01_page_ocr_00.webp",
          "path": "/abs/path/data/images/pdf_stem/p01_page_ocr_00.webp",
          "size_bytes": 0,
          "media_type": "image/webp"
        },
        "inline_images": [
          {
            "name": "image_0.webp",
            "key": "pdf_stem/image_0.webp",
            "path": "/abs/path/data/images/pdf_stem/image_0.webp",
            "size_bytes": 0,
            "media_type": "image/webp"
          }
        ]
      }
    }
  ],
  "text_results": [
    {
      "chunk_id": "string",
      "text": "string",
      "score": 0.0,
      "doc_id": "string",
      "page": 0,
      "chunk_type": "text",
      "image_path": "pdf_stem/p01_page_ocr_00.webp",
      "section_title": "string",
      "token_count": 0,
      "extraction_confidence": "high",
      "bbox": {
        "x0": 0.0,
        "y0": 0.0,
        "x1": 0.0,
        "y1": 0.0
      },
      "image_width_px": 0,
      "image_height_px": 0,
      "source_uri": "file:///abs/path/file.pdf",
      "source_file": "/abs/path/file.pdf",
      "total_pages": 0,
      "pdf_type": "non_ppt",
      "created_at": "2026-03-07T00:00:00+00:00",
      "doc_metadata": {},
      "retrieval_type": "text",
      "images": {
        "page_image": {
          "key": "pdf_stem/p01_page_ocr_00.webp",
          "path": "/abs/path/data/images/pdf_stem/p01_page_ocr_00.webp",
          "size_bytes": 0,
          "media_type": "image/webp"
        },
        "inline_images": []
      }
    }
  ]
}
Notes:
- If images=false, the images key is not attached to results.
- If text_search=false, output may contain only vector_results.
- If image_payload="blob", each image object includes "blob" bytes.
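Because the same chunk can appear in both lists, a client may want one combined list. The sketch below is a client-side illustration rather than anything the service does: merge_results and the retrieval_types field are hypothetical, and the merge is a plain de-duplication by chunk_id, not a score fusion.

```python
def merge_results(payload: dict) -> list[dict]:
    """Combine both result lists, keeping the first occurrence of each chunk_id.

    Each merged entry records which retrieval paths returned it.
    """
    merged: dict[str, dict] = {}
    for key in ("vector_results", "text_results"):
        for item in payload.get(key, []):  # text_results may be absent
            entry = merged.setdefault(
                item["chunk_id"], {**item, "retrieval_types": []}
            )
            entry["retrieval_types"].append(item["retrieval_type"])
    return list(merged.values())
```

Chunks found by both the vector and the text path end up with two entries in retrieval_types, which is a simple signal that both retrievers agree on them.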
POST /chat request JSON:
{
  "message": "string",
  "top_k": 3,
  "images": true,
  "include_text": false
}
Exact response JSON format:
{
  "answer": "string",
  "metadata": {
    "sources": [
      {
        "citation": 1,
        "source_file": "/abs/path/file.pdf",
        "filename": "original_upload.pdf",
        "page": 0,
        "chunk_type": "text",
        "chunk_id": "string",
        "bbox": {
          "x0": 0.0,
          "y0": 0.0,
          "x1": 0.0,
          "y1": 0.0
        },
        "text": "string"
      }
    ],
    "used_citations": [1],
    "usage": {
      "input_tokens": 0,
      "output_tokens": 0,
      "total_tokens": 0
    },
    "images_sent": {
      "enabled": true,
      "selected_count": 0,
      "selected_citations": [],
      "mode": "media_bytes",
      "total_image_bytes": 0
    }
  }
}
POST /agent_chat request JSON:
{
  "message": "string",
  "top_k": 3,
  "images": true,
  "include_text": false,
  "text_search": true,
  "config": {
    "chat_id": "string",
    "session_id": "string"
  }
}
Exact response JSON format:
{
  "answer": "string",
  "metadata": {
    "history_turns_loaded": 0,
    "sources": [
      {
        "citation": 1,
        "source_file": "/abs/path/file.pdf",
        "filename": "original_upload.pdf",
        "page": 0,
        "chunk_type": "text",
        "chunk_id": "string",
        "bbox": {
          "x0": 0.0,
          "y0": 0.0,
          "x1": 0.0,
          "y1": 0.0
        },
        "text": "string"
      }
    ],
    "used_citations": [1],
    "rephrased_queries": ["string"],
    "citation_validation": {
      "is_valid": true,
      "issues": [],
      "available_citations": [1],
      "found_citations": [1],
      "invalid_citations": []
    },
    "usage_metadata": {
      "input_tokens": 0,
      "output_tokens": 0,
      "total_tokens": 0,
      "llm_calls": 0
    },
    "images_sent": {
      "enabled": true,
      "selected_count": 1,
      "selected_citations": [1],
      "mode": "tool_message_image_blocks",
      "total_image_bytes": 12345
    }
  }
}
Optional usage_metadata keys that may appear:
- internal_tokens
- final_output_tokens
- llm_calls_with_usage
- input_token_details
- output_token_details
- per_call
- aggregated_input_token_details
- aggregated_output_token_details
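A client reading the /agent_chat response might resolve the used citations to their sources and tolerate the optional usage keys. The helper names cited_sources and usage_summary below are hypothetical; only the JSON keys come from the response format above.

```python
def cited_sources(response: dict) -> list[dict]:
    """Return only the sources whose citation number appears in used_citations."""
    meta = response.get("metadata", {})
    used = set(meta.get("used_citations", []))
    return [s for s in meta.get("sources", []) if s.get("citation") in used]


def usage_summary(response: dict) -> dict:
    """Read the core usage counters and list any optional keys that are present."""
    usage = response.get("metadata", {}).get("usage_metadata", {})
    core = ("input_tokens", "output_tokens", "total_tokens", "llm_calls")
    summary = {key: usage.get(key, 0) for key in core}
    summary["optional_keys"] = sorted(k for k in usage if k not in core)
    return summary
```

Reading optional keys through a presence check, rather than indexing them directly, keeps the client working when a response omits them.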
- Uploaded PDFs in data/tmp are intentionally kept on disk.
- Clean data/tmp and other runtime data based on your retention policy.
- For agent_chat DB persistence, install supabase and set SUPABASE_URL and SUPABASE_KEY.
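A startup check for the two Supabase variables could look like the sketch below. The function name supabase_settings is hypothetical, and treating a missing variable as "persistence disabled" is an assumption based on Supabase being described as optional.

```python
import os
from typing import Mapping, Optional


def supabase_settings(env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Return connection settings when both variables are set, else None."""
    url = env.get("SUPABASE_URL")
    key = env.get("SUPABASE_KEY")
    if url and key:
        return {"url": url, "key": key}
    return None  # fall back to the in-process cache only
```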