mhosigiri/DocuExtract

🏠 Document & Policy Assistant

AI-powered mortgage document extraction and intelligent policy Q&A system with RAG (Retrieval-Augmented Generation).


🚀 Features

  • 📤 Smart Document Upload - Multi-file drag & drop with automatic processing
  • 🔍 Intelligent Text Extraction - OCR for images, PDF parsing, key-value pair extraction
  • 💬 Unified AI Assistant - Query uploaded documents & official mortgage policies
  • 📚 Policy Knowledge Base - Pre-indexed Fannie Mae, FHA, USDA, and Freddie Mac guidelines (14,383+ chunks)
  • 🔊 Text-to-Speech - Optional voice responses with ElevenLabs AI
  • 🌐 Web Search Fallback - Live mortgage data when documents don't contain the answer
  • 🎯 Smart Document Routing - Single-page: key-value extraction | Multi-page: RAG indexing

🛠️ Tech Stack

Backend

Technology     Purpose
FastAPI        High-performance Python web framework
Uvicorn        ASGI server
Python 3.9+    Runtime

AI & Machine Learning

Technology               Purpose
Google Gemini 2.0 Flash  Large language model for intelligent responses
Sentence Transformers    Text embeddings (all-MiniLM-L6-v2)
ChromaDB                 Vector database for document similarity search
LangChain                Text splitting and chunking utilities
ElevenLabs               Text-to-speech AI (multilingual)
SerpAPI                  Real-time web search integration

Document Processing

Technology           Purpose
Google Document AI   Advanced OCR & entity extraction (optional)
PyPDF2 / pdfplumber  PDF text extraction
Pytesseract          OCR for images
Pillow               Image processing

Frontend

Technology    Purpose
React 18      UI framework
TypeScript    Type-safe JavaScript
Tailwind CSS  Utility-first styling
React Router  Client-side routing

📦 Installation

Prerequisites

# Python 3.9+
python3 --version

# Node.js 16+
node --version

# Tesseract OCR (for image text extraction)
brew install tesseract  # macOS
# or: sudo apt-get install tesseract-ocr  # Linux

Backend Setup

cd backend

# Install Python dependencies
pip install -r requirements.txt

# Configure environment variables
cp .env.example .env
# Edit .env with your API keys:
# - GEMINI_API_KEY
# - SERPAPI_API_KEY
# - ELEVENLABS_API_KEY (optional, for TTS)

# Start backend server
python3 -m uvicorn main:app --reload --host 0.0.0.0 --port 8000

Frontend Setup

cd frontend

# Install dependencies
npm install

# Build production version
npm run build

# Serve static build
npx serve -s build -l 3000

# Or run development server
npm start

🔑 API Keys Required

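Essential

  • GEMINI_API_KEY - Google Gemini API (answer generation)
  • SERPAPI_API_KEY - SerpAPI (web search fallback)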

Optional

  • ELEVENLABS_API_KEY - Get from ElevenLabs (for TTS)
  • DOCAI_PROJECT_ID - Google Cloud Document AI (if unset, the app falls back to free OCR)

Add these to backend/.env:

GEMINI_API_KEY="your_key_here"
SERPAPI_API_KEY="your_key_here"
ELEVENLABS_API_KEY="your_key_here"  # Optional
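
A startup check along these lines can catch a missing key before the first request fails. This is a minimal sketch, not the project's actual code: the `missing_keys` helper is hypothetical, only the variable names come from this README, and it assumes the values from backend/.env have already been loaded into the process environment.

```python
import os

# Key names taken from this README; the required/optional split follows it too.
REQUIRED_KEYS = ("GEMINI_API_KEY", "SERPAPI_API_KEY")
OPTIONAL_KEYS = ("ELEVENLABS_API_KEY",)


def missing_keys(env=None, keys=REQUIRED_KEYS):
    """Return the names in `keys` that are unset or blank in `env`.

    `env` defaults to os.environ; pass a plain dict to check other sources.
    """
    env = os.environ if env is None else env
    return [name for name in keys if not env.get(name, "").strip()]
```

Calling `missing_keys()` at startup and refusing to boot when the list is non-empty gives a clearer error than a failed Gemini or SerpAPI call later on.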

🏗️ Architecture

System Flow

┌─────────────────────────────────────────────────────┐
│  Frontend (React + TypeScript + Tailwind)           │
│  http://localhost:3000                              │
└────────────────────┬────────────────────────────────┘
                     │
                     ↓ HTTP/REST API
┌─────────────────────────────────────────────────────┐
│  Backend (FastAPI + Python)                         │
│  http://localhost:8000                              │
├─────────────────────────────────────────────────────┤
│  ┌─────────────────┐  ┌─────────────────┐           │
│  │ Document AI     │  │ Mortgage KB     │           │
│  │ OCR + Extraction│  │ RAG System      │           │
│  └─────────────────┘  └─────────────────┘           │
│                                                     │
│  ┌──────────────────────────────────────┐           │
│  │ ChromaDB Vector Database             │           │
│  │ • Policy Collection: 14,383 chunks   │           │
│  │ • User Collection: Dynamic           │           │
│  └──────────────────────────────────────┘           │
│                                                     │
│  ┌──────────┐  ┌──────────┐  ┌───────────┐          │
│  │ Gemini   │  │ SerpAPI  │  │ ElevenLabs│          │
│  │ AI       │  │ Search   │  │ TTS       │          │
│  └──────────┘  └──────────┘  └───────────┘          │
└─────────────────────────────────────────────────────┘

Document Processing Pipeline

Upload File
    ↓
Page Count Detection
    ↓
┌───────────┴───────────┐
│                       │
1 Page              2+ Pages
│                       │
Key-Value           Key-Value +
Extraction          RAG Indexing
    ↓                   ↓
Display          Queryable in Chat

📊 API Endpoints

Document Management

POST   /api/documents/upload          Upload files
POST   /api/documents/{id}/process    Process & extract
GET    /api/documents                 List all documents
GET    /api/documents/{id}            Get document details
DELETE /api/documents/{id}            Delete document

Knowledge Base & Chat

POST   /api/mortgage-kb/query         Query unified KB (ChromaDB → Web)
GET    /api/mortgage-kb/stats         Get KB statistics
POST   /api/mortgage-kb/tts           Text-to-speech conversion
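
A client for the query endpoint might look like the sketch below, using only the standard library. The endpoint path and port come from this README; the JSON body shape (a single `question` field) and the helper names are assumptions, so check routers.py for the real request schema before relying on it.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # backend address from the setup section


def build_kb_query(question: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/mortgage-kb/query."""
    # ASSUMPTION: the body is JSON with a "question" field.
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/mortgage-kb/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str) -> dict:
    """Send the query and decode the JSON response (needs a running backend)."""
    with urllib.request.urlopen(build_kb_query(question), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Splitting request construction from sending keeps the first half testable without a live server.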

Legacy Endpoints (Still Available)

POST   /api/rag/query                 RAG query with web search
GET    /api/rag/stats                 RAG statistics

🗂️ Project Structure

HackUTA/
├── backend/
│   ├── main.py                    # FastAPI application
│   ├── routers.py                 # API endpoints
│   ├── mortgage_kb_service.py     # Unified RAG + TTS
│   ├── document_ai_service.py     # OCR & extraction
│   ├── rag_service.py             # RAG with web search
│   ├── storage_service.py         # Google Cloud Storage
│   ├── audio_cache/               # TTS audio files
│   ├── uploads/                   # User uploaded files
│   ├── chroma_storage/            # Vector DB
│   └── RAG/
│       └── documents/             # Policy PDFs (14K+ chunks)
│
├── frontend/
│   ├── src/
│   │   ├── components/
│   │   │   ├── DocumentUpload.tsx
│   │   │   ├── MortgageKnowledgeBase.tsx  # Unified chat
│   │   │   ├── ExtractedDataView.tsx
│   │   │   └── Header.tsx
│   │   ├── pages/
│   │   │   └── Dashboard.tsx
│   │   └── config/
│   │       └── constants.ts
│   └── build/                     # Production build
│
├── requirements.txt               # Python dependencies
└── README.md                      # This file

💡 How It Works

1. Document Upload

  • User uploads PDFs, images, or text files
  • System detects page count automatically
  • Single-page: Extract key-values (invoices, IDs)
  • Multi-page: Extract + add to RAG for Q&A
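
Before a multi-page document can be added to RAG, its text has to be split into chunks for embedding. The tech stack lists LangChain for that step; the sketch below is a plain sliding-window stand-in rather than the project's splitter, and the `chunk_text` name and the default sizes are illustrative assumptions.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end, so no tail-only duplicate is added.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, which tends to help retrieval.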

2. Intelligent Query System

Priority Order:

  1. 🥇 ChromaDB - Search user documents + policy documents first
  2. 🥈 Web Search - Only if no relevant docs found in ChromaDB
  3. 🥉 Gemini AI - Generate the answer from whichever context was retrieved
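
The priority order can be sketched as one function that takes the three services as plain callables. Everything here is hypothetical scaffolding (the `answer_question` name, the callable signatures, the `"chromadb"`/`"web"` labels); the real logic lives in mortgage_kb_service.py and may differ.

```python
from typing import Callable, List, Tuple


def answer_question(
    question: str,
    search_chroma: Callable[[str], List[str]],
    search_web: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
) -> Tuple[str, str]:
    """Return (answer, source), where source is "chromadb" or "web"."""
    # Step 1: look in the vector store first.
    context = search_chroma(question)
    source = "chromadb"
    # Step 2: fall back to web search only when nothing relevant came back.
    if not context:
        context = search_web(question)
        source = "web"
    # Step 3: the LLM writes the answer from whichever context was found.
    return generate(question, context), source
```

Passing the services in as arguments lets the fallback rule be tested with stubs, without API keys or a populated database.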

3. Text-to-Speech (Optional)

  • Toggle TTS on/off in UI
  • Responses read aloud with professional voice
  • Cached for instant replay (saves API calls)
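
Caching for replay can be done by keying each audio file on a hash of the response text. This is a sketch, not the project's implementation: `cached_speech` and the injected `synthesize` callable are hypothetical, the `.mp3` extension is an assumption, and only the audio_cache directory name comes from this README.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_speech(
    text: str,
    synthesize: Callable[[str], bytes],
    cache_dir: Path = Path("audio_cache"),
) -> Path:
    """Return the path to audio for `text`, synthesizing only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Identical text hashes to the same file name, so repeats hit the cache.
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.mp3"  # ASSUMPTION: audio is stored as MP3
    if not path.exists():
        path.write_bytes(synthesize(text))
    return path
```

On a cache hit the TTS provider is never called, which is where the saved API calls come from.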

🎯 Use Cases

For Mortgage Professionals

  • ✅ Upload client loan applications
  • ✅ Extract key information automatically
  • ✅ Query multi-page loan agreements
  • ✅ Compare against policy guidelines

For Homebuyers

  • ✅ Understand document requirements
  • ✅ Get mortgage process guidance
  • ✅ Ask policy-related questions
  • ✅ Organize loan documents

🧪 Testing

Test Document Upload

# Visit http://localhost:3000
# Upload a PDF or image
# Watch automatic processing
# View extracted key-value pairs

Test Knowledge Base

# In the Document & Policy Assistant:
# Ask: "What are Fannie Mae debt-to-income requirements?"
# Get answer with policy document citations

Test TTS

# Enable TTS toggle (purple checkbox)
# Ask any question
# Hear response read aloud

📈 Performance

Document Processing

  • Single-page extraction: < 2 seconds
  • Multi-page RAG indexing: ~1 second per page
  • OCR fallback: 2-5 seconds (images)

Query Response Time

  • ChromaDB search: < 500ms
  • Gemini generation: 1-3 seconds
  • Cached TTS: < 50ms (instant!)
  • New TTS: 2-4 seconds

Knowledge Base

  • 14,383+ policy document chunks indexed
  • Semantic search across 500+ pages
  • Sub-second retrieval

🔒 Security Notes

  • ✅ CORS configured for local development
  • ✅ API keys stored in .env (never committed)
  • ✅ File uploads validated and sanitized
  • ✅ UUID-based file naming prevents collisions
  • ⚠️ For production: Add authentication, rate limiting, and HTTPS
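
Upload validation and UUID-based naming can be combined in one helper. The sketch below is illustrative only: `safe_storage_name` is a hypothetical name and the extension allow-list is an assumption, not the project's actual list of accepted types.

```python
import uuid
from pathlib import PurePath

# ASSUMPTION: illustrative allow-list; the project's accepted types may differ.
ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".txt"}


def safe_storage_name(original_name: str) -> str:
    """Validate the extension and return a fresh UUID-based file name."""
    # Only the extension of the user-supplied name is used; the rest,
    # including any directory components, is discarded.
    suffix = PurePath(original_name.replace("\\", "/")).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}")
    return f"{uuid.uuid4().hex}{suffix}"
```

Because the stored name is generated rather than taken from the client, two uploads called `scan.pdf` cannot overwrite each other, and path segments such as `../` never reach the filesystem.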

🐛 Troubleshooting

Backend won't start

# Check Python version
python3 --version  # Must be 3.9+

# Install dependencies
pip install -r requirements.txt

# Check API keys
grep -E "GEMINI|SERP" backend/.env

TTS not working

# Verify ElevenLabs key is set
grep ELEVENLABS backend/.env

# Check backend logs
tail -20 backend/backend.log | grep -i elevenlabs

Documents not queryable

  • Single-page docs are NOT queryable (by design)
  • Only multi-page PDFs are added to RAG
  • Check processing logs for "added to RAG" message

📚 Dependencies

Python (Backend)

google-generativeai  # Gemini AI
chromadb            # Vector database
sentence-transformers  # Embeddings
elevenlabs          # Text-to-speech
google-search-results  # SerpAPI
fastapi             # Web framework
pdfplumber          # PDF processing
pytesseract         # OCR

Node.js (Frontend)

react              # UI framework
typescript         # Type safety
tailwindcss        # Styling
react-router-dom   # Routing

🎨 Screenshots

Main Dashboard

  • Document upload with drag & drop
  • Unified AI assistant interface
  • TTS toggle for voice responses
  • Document list with status tracking

Knowledge Base Chat

  • Color-coded sources (blue = user docs, green = policies)
  • Real-time query with source citations
  • Suggested questions
  • Auto-play audio responses

🀝 Contributing

This project was built for HackUTA. Key features:

  • Clean, modular architecture
  • Comprehensive error handling
  • Persistent vector storage
  • Intelligent fallback systems
  • Production-ready codebase

📝 License

MIT License - Built for HackUTA 2025


🌟 Credits

Technologies Used:

  • Google Gemini AI
  • ElevenLabs TTS
  • ChromaDB Vector Database
  • Sentence Transformers
  • SerpAPI
  • FastAPI
  • React + TypeScript

Built with ❤️ at HackUTA


🚀 Quick Start

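# 1. Install dependencies (run from the repository root)
# Subshells keep the working directory at the root between commands.
(cd backend && pip install -r requirements.txt)
# Build the frontend so that step 3 has a build/ directory to serve.
(cd frontend && npm install && npm run build)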

# 2. Configure API keys in backend/.env
GEMINI_API_KEY="your_key"
SERPAPI_API_KEY="your_key"
ELEVENLABS_API_KEY="your_key"  # Optional

# 3. Start services
cd backend && python3 -m uvicorn main:app --reload --host 0.0.0.0 --port 8000 &
cd frontend && npx serve -s build -l 3000 &

# 4. Open browser
open http://localhost:3000

You're ready to go! 🎉

