
Python + React PDF RAG framework

This project is a minimal full-stack RAG app with recursive PDF ingestion, a React chatbot UI, OpenAI embeddings using text-embedding-ada-002, GPT-4o answers, clickable source citations, and an in-memory vector store.

Features

  1. Reads PDFs recursively from backend/data/pdfs and all subfolders.
  2. Provides a React chatbot built with Vite.
  3. Exposes FastAPI endpoints that call the OpenAI embeddings and chat APIs, authenticated with your OpenAI API key.
  4. Stores vector embeddings in memory using a small NumPy cosine-similarity vector database.
  5. Turns model citations like [1] into clickable links to the cited source PDF and page.
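The in-memory store from feature 4 can be sketched as a NumPy class that keeps a matrix of embeddings and ranks them by cosine similarity. This is an illustrative sketch of the idea, not the project's actual InMemoryVectorStore; the class and method names are assumptions.

```python
import numpy as np

class InMemoryVectorStore:
    """Minimal cosine-similarity vector store (illustrative sketch)."""

    def __init__(self, dim: int):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.metadata: list[dict] = []

    def add(self, embedding: list[float], meta: dict) -> None:
        vec = np.asarray(embedding, dtype=np.float32).reshape(1, self.dim)
        self.vectors = np.vstack([self.vectors, vec])
        self.metadata.append(meta)

    def search(self, query: list[float], top_k: int = 3) -> list[dict]:
        if not self.metadata:
            return []
        q = np.asarray(query, dtype=np.float32)
        # Cosine similarity = dot product divided by the product of norms.
        norms = np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(q)
        sims = (self.vectors @ q) / np.where(norms == 0, 1, norms)
        order = np.argsort(sims)[::-1][:top_k]
        return [{"score": float(sims[i]), **self.metadata[i]} for i in order]
```

Because everything lives in a NumPy array, a full rebuild on every re-index is cheap at small scale, which is why the project can afford to re-embed on restart.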

Run backend

cd backend
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
export OPENAI_API_KEY="sk-your-key-here"
python run.py

The API runs at http://localhost:8000.

Run frontend

cd frontend
npm install
npm run dev

The UI runs at http://localhost:5173.

Add PDFs

Drop PDFs into backend/data/pdfs, including nested folders like backend/data/pdfs/policies or backend/data/pdfs/contracts/client-a. Then click Re-index PDFs in the UI or call:

curl -X POST "http://localhost:8000/ingest?reset=true"
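Recursive discovery of those nested folders can be done with a single pathlib glob. A minimal sketch; the function name and relative-path return shape are assumptions, not the backend's actual API.

```python
from pathlib import Path

def find_pdfs(root: str) -> list[str]:
    """Recursively collect PDF paths under root (e.g. backend/data/pdfs),
    returned relative to root so they can double as /source?path= values."""
    base = Path(root)
    return sorted(str(p.relative_to(base)) for p in base.rglob("*.pdf"))
```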

Main API endpoints

  • GET /health checks API and in-memory vector count.
  • GET /pdfs lists PDFs found recursively.
  • GET /source?path=folder/file.pdf serves a PDF source file; page anchors such as #page=3 open the browser's PDF viewer at that page.
  • POST /upload uploads a PDF into data/pdfs/uploads.
  • POST /ingest?reset=true extracts, chunks, embeds, and stores vectors in memory.
  • POST /chat retrieves relevant chunks and answers with GPT-4o.
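The "chunks" step of /ingest can be illustrated with a simple overlapping character splitter. This is a sketch only; the chunk size, overlap, and function name are assumptions, not the project's actual settings.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split extracted page text into overlapping character chunks
    before embedding, so retrieval doesn't cut sentences at hard edges."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping some overlap
    return chunks
```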

How citation links work

The backend stores metadata.source_url for every chunk, for example /source?path=policies/handbook.pdf#page=7. The prompt asks GPT-4o to cite using bracket numbers like [1]. The React app detects [1], [2], etc. in assistant messages and converts them into links using the matching item from the returned sources array.
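The substitution the React app performs can be sketched in Python with a regex that maps [1] to sources[0], [2] to sources[1], and so on. The HTML output shape and function name here are illustrative assumptions, not the UI's actual code.

```python
import re

def link_citations(answer: str, sources: list[dict]) -> str:
    """Replace bracket citations like [1] with links to the matching
    source's source_url; unmatched numbers are left as plain text."""
    def repl(match: re.Match) -> str:
        idx = int(match.group(1)) - 1  # [1] maps to sources[0]
        if 0 <= idx < len(sources):
            return f'<a href="{sources[idx]["source_url"]}">[{idx + 1}]</a>'
        return match.group(0)
    return re.sub(r"\[(\d+)\]", repl, answer)
```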

Important note

Because this uses an in-memory vector store, embeddings are lost whenever the backend restarts; this is intentional for a minimal demo, so just re-index after a restart. For production, swap InMemoryVectorStore for Qdrant, Postgres pgvector, Pinecone, Weaviate, Chroma, or Redis.