A Retrieval-Augmented Generation (RAG) system for intelligent contract analysis, powered by PDF embeddings, PostgreSQL/pgvector, and a LangChain ReAct agent with a Streamlit conversational interface.
The project follows a three-stage ETL pipeline:
- PDF Partitioning — Splits large contract PDFs into manageable chunks with configurable page overlap
- Embedding Generation — Generates vector embeddings for each partition using Ollama
- Database Loading — Stores embeddings in PostgreSQL with the pgvector extension for semantic search
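The page-overlap idea in the partitioning stage can be illustrated with a small sketch. This is not the code in PDFPartitioner.py; the function name `partition_page_ranges` and its parameters `pages_per_chunk` and `overlap` are hypothetical, chosen only to show how consecutive chunks might share pages so that a clause spanning a page boundary appears whole in at least one chunk.

```python
from typing import List, Tuple


def partition_page_ranges(
    total_pages: int, pages_per_chunk: int, overlap: int
) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) page ranges.

    Hypothetical helper: consecutive ranges share `overlap` pages.
    """
    if total_pages < 1:
        return []
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    if not 0 <= overlap < pages_per_chunk:
        raise ValueError("overlap must satisfy 0 <= overlap < pages_per_chunk")

    step = pages_per_chunk - overlap  # how far each new chunk advances
    ranges: List[Tuple[int, int]] = []
    start = 1
    while True:
        end = min(start + pages_per_chunk - 1, total_pages)
        ranges.append((start, end))
        if end == total_pages:  # last page reached, stop
            break
        start += step
    return ranges


if __name__ == "__main__":
    # 10 pages, 4 pages per chunk, 1 shared page between neighbours
    print(partition_page_ranges(total_pages=10, pages_per_chunk=4, overlap=1))
    # → [(1, 4), (4, 7), (7, 10)]
```

A larger overlap lowers the risk of cutting a clause in half, at the cost of more chunks to embed and store.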
A Streamlit web interface backed by a LangChain ReAct agent allows users to query contracts in natural language.
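To make the retrieval step concrete, here is a minimal sketch of semantic ranking done in plain Python. In the real system this ranking would run inside PostgreSQL via pgvector (which offers a cosine distance operator, `<=>`); the README does not state which distance metric the project uses, so cosine similarity is an assumption here, and the chunk names and three-dimensional vectors are toy values for illustration only.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], chunks: Dict[str, Sequence[float]], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [(cid, cosine_similarity(query, vec)) for cid, vec in chunks.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy embeddings; real ones would come from an Ollama embedding model.
    chunks = {
        "termination": [1.0, 0.0, 0.0],
        "payment": [0.0, 1.0, 0.0],
        "liability": [0.7, 0.7, 0.0],
    }
    query = [0.9, 0.1, 0.0]
    for chunk_id, score in top_k(query, chunks, k=2):
        print(chunk_id, round(score, 3))
```

The agent would then pass the text of the top-ranked chunks to the LLM as context for answering the user's question.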
- Python 3.10+
- Ollama installed and running
- PostgreSQL with pgvector extension
- Conda (optional, for environment management)
# Clone the repository
git clone https://github.com/fmanc23/contract-rag-system.git
cd contract-rag-system
# Install dependencies
pip install -r requirements.txt
# Set up environment variables
cp .env.example .env
# Edit .env with your actual values
# Initialize the pgvector database
python setup_pgvector.py
# Run the full pipeline (partitioning + embedding + loading)
python main.py
# Launch the web interface
streamlit run interface.py

| File | Description |
|---|---|
| main.py | Entry point for the ETL pipeline |
| interface.py | Streamlit UI for querying contracts |
| agent_setup.py | LangChain ReAct agent configuration |
| contract_tools.py | Agent tools (contract search, chat history) |
| retriever.py | Semantic retriever powered by pgvector |
| PDFPartitioner.py | PDF partitioning with page overlap |
| PDFEmbeddingGenerator.py | Embedding generation via Ollama |
| EmbeddingLoader.py | Embedding loader for database ingestion |
| connection.py | Database connection manager |
| setup_pgvector.py | Initial pgvector database setup |
| models.py | SQLAlchemy models |
| config.py | Centralized configuration |
| chat_history.py | Conversation history management |
| log.py | Logging utility |
| run.sh | Bash script for automated execution |

Create a .env file in the project root (see .env.example):
DATABASE_URL=postgresql://postgres:password@localhost:5432/postgres
INPUT_PATH_PARTITIONS=/path/to/contract.pdf
CONTRACT_FILES=./contract_files
EMBEDDING_FILES=./embedding_files

- LLM: Ollama (local inference)
- Framework: LangChain (ReAct agent)
- Vector Store: PostgreSQL + pgvector
- Frontend: Streamlit
- Embeddings: Ollama embedding models
- ORM: SQLAlchemy
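As an illustration of how the `.env` variables listed earlier (`DATABASE_URL`, `INPUT_PATH_PARTITIONS`, `CONTRACT_FILES`, `EMBEDDING_FILES`) might be consumed, the following sketch reads them into a typed settings object. It is not the contents of config.py: the `Settings` class, the `load_settings` function, the choice of which variables are required, and the fallback defaults are all assumptions. It also assumes the variables are already present in the process environment; loading the `.env` file itself is left out.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values defined in .env."""

    database_url: str
    input_path_partitions: str
    contract_files: str
    embedding_files: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ)."""
    source = os.environ if env is None else env

    # Assumption: these two have no sensible default, so fail early.
    required = ("DATABASE_URL", "INPUT_PATH_PARTITIONS")
    missing = [name for name in required if not source.get(name)]
    if missing:
        raise KeyError(
            "Missing required environment variables: " + ", ".join(missing)
        )

    return Settings(
        database_url=source["DATABASE_URL"],
        input_path_partitions=source["INPUT_PATH_PARTITIONS"],
        # Assumption: fall back to the directories shown in the example .env.
        contract_files=source.get("CONTRACT_FILES", "./contract_files"),
        embedding_files=source.get("EMBEDDING_FILES", "./embedding_files"),
    )


if __name__ == "__main__":
    example = {
        "DATABASE_URL": "postgresql://postgres:password@localhost:5432/postgres",
        "INPUT_PATH_PARTITIONS": "/path/to/contract.pdf",
    }
    print(load_settings(example))
```

Failing early on a missing variable tends to give a clearer error than a connection failure deep inside the pipeline.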
This project was developed as part of a Master's thesis.