fmanc23/ContractRagSystem


📄 Contract RAG System

A Retrieval-Augmented Generation (RAG) system for intelligent contract analysis, powered by PDF embeddings, PostgreSQL/pgvector, and a LangChain ReAct agent with a Streamlit conversational interface.

🏗️ Architecture

The project follows a three-stage ETL pipeline:

  1. PDF Partitioning — Splits large contract PDFs into manageable chunks with configurable page overlap
  2. Embedding Generation — Generates vector embeddings for each partition using Ollama
  3. Database Loading — Stores embeddings in PostgreSQL with the pgvector extension for semantic search
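The page-overlap idea in stage 1 can be illustrated with a small, dependency-free sketch. It only computes which pages each partition would contain; the function name `partition_page_ranges` and its parameters are hypothetical and are not taken from `PDFPartitioner.py`.

```python
from typing import List, Tuple


def partition_page_ranges(
    total_pages: int, pages_per_chunk: int, overlap: int
) -> List[Tuple[int, int]]:
    """Return inclusive, 1-based (first_page, last_page) ranges.

    Consecutive ranges share `overlap` pages, so text that straddles a
    partition boundary appears in both neighbouring partitions.
    """
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    if not 0 <= overlap < pages_per_chunk:
        raise ValueError("overlap must satisfy 0 <= overlap < pages_per_chunk")
    if total_pages < 1:
        return []

    step = pages_per_chunk - overlap  # how far each new partition advances
    ranges: List[Tuple[int, int]] = []
    start = 1
    while True:
        end = min(start + pages_per_chunk - 1, total_pages)
        ranges.append((start, end))
        if end == total_pages:  # last page reached, stop
            break
        start += step
    return ranges


if __name__ == "__main__":
    # A 10-page contract, 4 pages per partition, 1 page of overlap
    print(partition_page_ranges(total_pages=10, pages_per_chunk=4, overlap=1))
    # → [(1, 4), (4, 7), (7, 10)]
```

Rejecting an overlap that is as large as the chunk size keeps the loop from stalling, since each partition must advance by at least one page.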

A Streamlit web interface backed by a LangChain ReAct agent allows users to query contracts in natural language.
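To make the semantic search step concrete, here is a minimal in-memory sketch of nearest-neighbour retrieval by cosine distance, which is the metric pgvector exposes through its `<=>` operator. Which metric `retriever.py` actually uses is not stated in this README, so treat the choice of cosine distance, the function names, and the toy vectors as assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance: 1 minus the cosine similarity of two vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    chunks: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k chunk ids closest to the query, nearest first."""
    scored = [(chunk_id, cosine_distance(query, vec)) for chunk_id, vec in chunks.items()]
    scored.sort(key=lambda pair: pair[1])  # smaller distance means more similar
    return scored[:k]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real embeddings have hundreds of dimensions
    store = {
        "termination_clause": [1.0, 0.0],
        "payment_terms": [0.0, 1.0],
        "notice_period": [1.0, 1.0],
    }
    for chunk_id, distance in top_k([1.0, 0.0], store, k=2):
        print(chunk_id, round(distance, 4))
```

In the database the same ranking would be done by the vector index rather than a Python loop; the sketch only shows what "closest" means for the retriever.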

📋 Prerequisites

  • Python 3.10+
  • Ollama installed and running
  • PostgreSQL with pgvector extension
  • Conda (optional, for environment management)

🚀 Getting Started

```bash
# Clone the repository
git clone https://github.com/fmanc23/ContractRagSystem.git
cd ContractRagSystem

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env with your actual values

# Initialize the pgvector database
python setup_pgvector.py

# Run the full pipeline (partitioning + embedding + loading)
python main.py

# Launch the web interface
streamlit run interface.py
```

📁 Project Structure

| File | Description |
| --- | --- |
| main.py | Entry point for the ETL pipeline |
| interface.py | Streamlit UI for querying contracts |
| agent_setup.py | LangChain ReAct agent configuration |
| contract_tools.py | Agent tools (contract search, chat history) |
| retriever.py | Semantic retriever powered by pgvector |
| PDFPartitioner.py | PDF partitioning with page overlap |
| PDFEmbeddingGenerator.py | Embedding generation via Ollama |
| EmbeddingLoader.py | Embedding loader for database ingestion |
| connection.py | Database connection manager |
| setup_pgvector.py | Initial pgvector database setup |
| models.py | SQLAlchemy models |
| config.py | Centralized configuration |
| chat_history.py | Conversation history management |
| log.py | Logging utility |
| run.sh | Bash script for automated execution |

⚙️ Configuration

Create a .env file in the project root (see .env.example):

```
DATABASE_URL=postgresql://postgres:password@localhost:5432/postgres
INPUT_PATH_PARTITIONS=/path/to/contract.pdf
CONTRACT_FILES=./contract_files
EMBEDDING_FILES=./embedding_files
```
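As an illustration of how these four variables might be read and validated in one place, the sketch below loads them from the process environment and fails early if any are missing. The `Settings` class and `load_settings` function are hypothetical names, not the contents of `config.py`, and the sketch assumes the `.env` values have already been exported into the environment (for example by a loader such as python-dotenv).

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Variable names as listed in the .env example above
REQUIRED_VARIABLES = (
    "DATABASE_URL",
    "INPUT_PATH_PARTITIONS",
    "CONTRACT_FILES",
    "EMBEDDING_FILES",
)


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the project's configuration values."""

    database_url: str
    input_path_partitions: str
    contract_files: str
    embedding_files: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping, raising if any variable is unset or empty."""
    missing = [name for name in REQUIRED_VARIABLES if not env.get(name)]
    if missing:
        raise RuntimeError("Missing configuration: " + ", ".join(missing))
    return Settings(
        database_url=env["DATABASE_URL"],
        input_path_partitions=env["INPUT_PATH_PARTITIONS"],
        contract_files=env["CONTRACT_FILES"],
        embedding_files=env["EMBEDDING_FILES"],
    )


if __name__ == "__main__":
    example = {
        "DATABASE_URL": "postgresql://postgres:password@localhost:5432/postgres",
        "INPUT_PATH_PARTITIONS": "/path/to/contract.pdf",
        "CONTRACT_FILES": "./contract_files",
        "EMBEDDING_FILES": "./embedding_files",
    }
    print(load_settings(example))
```

Passing the mapping as a parameter keeps the function easy to test without touching the real environment.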

🛠️ Tech Stack

  • LLM: Ollama (local inference)
  • Framework: LangChain (ReAct agent)
  • Vector Store: PostgreSQL + pgvector
  • Frontend: Streamlit
  • Embeddings: Ollama embedding models
  • ORM: SQLAlchemy
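For the embeddings entry, one way to obtain a vector from a local Ollama server is its REST API. The sketch below is an assumption about how that could be done, not the project's code (`PDFEmbeddingGenerator.py` may well use a LangChain wrapper instead). It targets Ollama's `/api/embeddings` endpoint on the default port 11434; the model name `nomic-embed-text` and both helper names are illustrative.

```python
import json
import urllib.request
from typing import List

# Default address of a locally running Ollama server
OLLAMA_EMBEDDINGS_URL = "http://localhost:11434/api/embeddings"


def build_embedding_request(
    text: str, model: str = "nomic-embed-text"
) -> urllib.request.Request:
    """Prepare the POST request that asks Ollama to embed `text`."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_EMBEDDINGS_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str, model: str = "nomic-embed-text") -> List[float]:
    """Send the request and return the embedding vector.

    Requires a running Ollama server with the chosen model already pulled.
    """
    with urllib.request.urlopen(build_embedding_request(text, model)) as response:
        body = json.loads(response.read())
    return body["embedding"]


if __name__ == "__main__":
    vector = embed("The supplier may terminate this agreement with 30 days notice.")
    print(len(vector))  # dimension depends on the embedding model
```

Splitting request construction from the network call makes the payload easy to inspect without a server running.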

📝 License

This project was developed as part of a Master's thesis.
