📄 Contract RAG System

A Retrieval-Augmented Generation (RAG) system for intelligent contract analysis, powered by PDF embeddings, PostgreSQL/pgvector, and a LangChain ReAct agent with a Streamlit conversational interface.

🏗️ Architecture

The project follows a three-stage ETL pipeline:

1. PDF Partitioning: splits large contract PDFs into manageable chunks with configurable page overlap
2. Embedding Generation: generates vector embeddings for each partition using Ollama
3. Database Loading: stores embeddings in PostgreSQL with the pgvector extension for semantic search
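
To make the "configurable page overlap" in the partitioning stage concrete, here is a minimal sketch of how overlapping page ranges could be computed. The function name `partition_ranges` and its parameters are hypothetical and are not taken from `PDFPartitioner.py`; the actual implementation may differ.

```python
def partition_ranges(total_pages: int, chunk_size: int, overlap: int) -> list[tuple[int, int]]:
    """Return inclusive, 0-based (start, end) page ranges that cover the document.

    Hypothetical helper: consecutive ranges share `overlap` pages so that a
    clause spanning a page break appears whole in at least one chunk.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    ranges: list[tuple[int, int]] = []
    start = 0
    while start < total_pages:
        end = min(start + chunk_size, total_pages) - 1
        ranges.append((start, end))
        if end == total_pages - 1:
            break  # the last page is covered; avoid emitting a redundant tail chunk
        start += step
    return ranges


print(partition_ranges(total_pages=10, chunk_size=4, overlap=1))
# [(0, 3), (3, 6), (6, 9)]
```

Each range can then be handed to whichever PDF library the project uses to write the corresponding pages into a separate file.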

A Streamlit web interface backed by a LangChain ReAct agent allows users to query contracts in natural language.

📋 Prerequisites

Python 3.10+
Ollama installed and running
PostgreSQL with pgvector extension
Conda (optional, for environment management)

🚀 Getting Started

# Clone the repository
git clone https://github.com/fmanc23/contract-rag-system.git
cd contract-rag-system

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env with your actual values

# Initialize the pgvector database
python setup_pgvector.py

# Run the full pipeline (partitioning + embedding + loading)
python main.py

# Launch the web interface
streamlit run interface.py

📁 Project Structure

File	Description
`main.py`	Entry point for the ETL pipeline
`interface.py`	Streamlit UI for querying contracts
`agent_setup.py`	LangChain ReAct agent configuration
`contract_tools.py`	Agent tools (contract search, chat history)
`retriever.py`	Semantic retriever powered by pgvector
`PDFPartitioner.py`	PDF partitioning with page overlap
`PDFEmbeddingGenerator.py`	Embedding generation via Ollama
`EmbeddingLoader.py`	Embedding loader for database ingestion
`connection.py`	Database connection manager
`setup_pgvector.py`	Initial pgvector database setup
`models.py`	SQLAlchemy models
`config.py`	Centralized configuration
`chat_history.py`	Conversation history management
`log.py`	Logging utility
`run.sh`	Bash script for automated execution

⚙️ Configuration

Create a .env file in the project root (see .env.example):

DATABASE_URL=postgresql://postgres:password@localhost:5432/postgres
INPUT_PATH_PARTITIONS=/path/to/contract.pdf
CONTRACT_FILES=./contract_files
EMBEDDING_FILES=./embedding_files

🛠️ Tech Stack

LLM: Ollama (local inference)
Framework: LangChain (ReAct agent)
Vector Store: PostgreSQL + pgvector
Frontend: Streamlit
Embeddings: Ollama embedding models
ORM: SQLAlchemy
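
For readers who want to try the embedding side of the stack in isolation, the sketch below requests an embedding from a locally running Ollama server over its HTTP API, using only the standard library. The model name `nomic-embed-text` is an assumption (this README does not say which embedding model the project uses), and the helper names are hypothetical; `PDFEmbeddingGenerator.py` may well go through a LangChain or Ollama client wrapper instead.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address
MODEL = "nomic-embed-text"  # assumption: substitute the model you have pulled


def build_embedding_request(
    text: str, model: str = MODEL, base_url: str = OLLAMA_URL
) -> urllib.request.Request:
    """Prepare a POST to Ollama's /api/embeddings endpoint without sending it."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str, model: str = MODEL, base_url: str = OLLAMA_URL) -> list[float]:
    """Send the request and return the embedding vector from the JSON response."""
    request = build_embedding_request(text, model, base_url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())["embedding"]


if __name__ == "__main__":
    # Requires a running Ollama server with the chosen model available.
    vector = embed("The supplier shall deliver the goods within 30 days.")
    print(len(vector))
```

The length of the returned vector depends on the model, and the vector column in PostgreSQL has to be declared with the same dimension.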

📝 License

This project was developed as part of a Master's thesis.
