
Python + React PDF RAG framework

This project is a minimal full-stack RAG app with recursive PDF ingestion, a React chatbot UI, OpenAI embeddings using text-embedding-ada-002, GPT-4o answers, clickable source citations, and an in-memory vector store.

Features

  1. Reads PDFs recursively from backend/data/pdfs and all subfolders.
  2. Provides a React chatbot built with Vite.
  3. Exposes FastAPI endpoints that call the OpenAI embeddings and chat APIs, authenticated with your OpenAI API key.
  4. Stores vector embeddings in memory using a small NumPy cosine-similarity vector database.
  5. Turns model citations like [1] into clickable links to the cited source PDF and page.
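The in-memory store from feature 4 can be sketched as a NumPy class that keeps a matrix of embeddings and ranks them by cosine similarity. This is an illustrative sketch of the idea, not the project's actual InMemoryVectorStore; the class and method names are assumptions.

```python
import numpy as np

class InMemoryVectorStore:
    """Minimal cosine-similarity vector store (illustrative sketch)."""

    def __init__(self, dim: int):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.metadata: list[dict] = []

    def add(self, embedding: list[float], meta: dict) -> None:
        vec = np.asarray(embedding, dtype=np.float32).reshape(1, self.dim)
        self.vectors = np.vstack([self.vectors, vec])
        self.metadata.append(meta)

    def search(self, query: list[float], top_k: int = 3) -> list[dict]:
        if not self.metadata:
            return []
        q = np.asarray(query, dtype=np.float32)
        # Cosine similarity = dot product divided by the product of norms.
        norms = np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(q)
        sims = (self.vectors @ q) / np.where(norms == 0, 1, norms)
        order = np.argsort(sims)[::-1][:top_k]
        return [{"score": float(sims[i]), **self.metadata[i]} for i in order]
```

Because everything lives in a NumPy array, a full rebuild on every re-index is cheap at small scale, which is why the project can afford to re-embed on restart.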

Run backend

cd backend
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
export OPENAI_API_KEY="sk-your-key-here"
python run.py

The API runs at http://localhost:8000.

Run frontend

cd frontend
npm install
npm run dev

The UI runs at http://localhost:5173.

Add PDFs

Drop PDFs into backend/data/pdfs, including nested folders like backend/data/pdfs/policies or backend/data/pdfs/contracts/client-a. Then click Re-index PDFs in the UI or call:

curl -X POST "http://localhost:8000/ingest?reset=true"
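Recursive discovery of those nested folders can be done with a single pathlib glob. A minimal sketch; the function name and relative-path return shape are assumptions, not the backend's actual API.

```python
from pathlib import Path

def find_pdfs(root: str) -> list[str]:
    """Recursively collect PDF paths under root (e.g. backend/data/pdfs),
    returned relative to root so they can double as /source?path= values."""
    base = Path(root)
    return sorted(str(p.relative_to(base)) for p in base.rglob("*.pdf"))
```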

Main API endpoints

  • GET /health checks API and in-memory vector count.
  • GET /pdfs lists PDFs found recursively.
  • GET /source?path=folder/file.pdf serves a PDF source file; page anchors such as #page=3 open the browser's PDF viewer at that page.
  • POST /upload uploads a PDF into data/pdfs/uploads.
  • POST /ingest?reset=true extracts, chunks, embeds, and stores vectors in memory.
  • POST /chat retrieves relevant chunks and answers with GPT-4o.
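The "chunks" step of /ingest can be illustrated with a simple overlapping character splitter. This is a sketch only; the chunk size, overlap, and function name are assumptions, not the project's actual settings.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split extracted page text into overlapping character chunks
    before embedding, so retrieval doesn't cut sentences at hard edges."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping some overlap
    return chunks
```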

How citation links work

The backend stores metadata.source_url for every chunk, for example /source?path=policies/handbook.pdf#page=7. The prompt asks GPT-4o to cite using bracket numbers like [1]. The React app detects [1], [2], etc. in assistant messages and converts them into links using the matching item from the returned sources array.
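The substitution the React app performs can be sketched in Python with a regex that maps [1] to sources[0], [2] to sources[1], and so on. The HTML output shape and function name here are illustrative assumptions, not the UI's actual code.

```python
import re

def link_citations(answer: str, sources: list[dict]) -> str:
    """Replace bracket citations like [1] with links to the matching
    source's source_url; unmatched numbers are left as plain text."""
    def repl(match: re.Match) -> str:
        idx = int(match.group(1)) - 1  # [1] maps to sources[0]
        if 0 <= idx < len(sources):
            return f'<a href="{sources[idx]["source_url"]}">[{idx + 1}]</a>'
        return match.group(0)
    return re.sub(r"\[(\d+)\]", repl, answer)
```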

Important note

Because this uses an in-memory vector store, embeddings are lost whenever the backend restarts; this is intentional for a minimal demo, so just re-index after a restart. For production, swap InMemoryVectorStore for Qdrant, Postgres pgvector, Pinecone, Weaviate, Chroma, or Redis.