An AI-powered system that extracts structured metadata from rental/lease agreement documents (.docx and .png files) using Large Language Models (Google Gemini) with few-shot prompting.
```
┌─────────────┐    ┌──────────────────┐     ┌─────────────────────┐    ┌────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│ Document    │    │ Text Extraction  │     │ Prompt Construction │    │ LLM Inference  │    │ Post-Processing  │    │ JSON/CSV Output  │
│ (.docx/.png)│───▶│ python-docx/OCR  │────▶│ 5 few-shot examples │───▶│ Google Gemini  │───▶│ Title stripping  │───▶│ Structured       │
│             │    │ Tesseract+OpenCV │     │ Chain-of-thought    │    │ Multi-model    │    │ Date & value     │    │ metadata         │
│             │    │                  │     │                     │    │ retry          │    │ cleanup          │    │                  │
└─────────────┘    └──────────────────┘     └─────────────────────┘    └────────────────┘    └──────────────────┘    └──────────────────┘
```
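The first two stages can be sketched as a simple extension-based dispatcher. This is an illustrative sketch, not the project's actual `src/text_extractor.py` API; it assumes `python-docx`, `opencv-python`, and `pytesseract` are installed, and the function name `extract_text` is hypothetical:

```python
from pathlib import Path

def extract_text(path: str) -> str:
    """Route a document to the right extractor by file extension (illustrative sketch)."""
    ext = Path(path).suffix.lower()
    if ext == ".docx":
        import docx  # python-docx
        return "\n".join(p.text for p in docx.Document(path).paragraphs)
    if ext == ".png":
        import cv2  # opencv-python
        import pytesseract
        img = cv2.imread(path)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        # Otsu binarization often helps Tesseract on noisy scanned agreements
        _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        return pytesseract.image_to_string(binary)
    raise ValueError(f"unsupported extension: {ext}")
```

The extracted text then flows into prompt construction regardless of which branch produced it.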
Training Set Recall (10 documents)

Test Set Recall (4 documents)
| # | Reason | Impact |
|---|--------|--------|
| 1 | 24158401 file is missing from train folder | 0% recall for that document |
| 2 | Train has harder documents (scanned images + corrupted .docx) | Garbled OCR → wrong extractions |
| 3 | Test set is smaller (4 docs vs 10) | Each correct prediction carries more weight |
| 4 | Test files are cleaner (2 .pdf.docx vs noisy scans) | Higher extraction accuracy |
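Recall here can be scored per document as the fraction of ground-truth fields the model reproduced exactly. A minimal sketch of that metric (the real scoring lives in `src/evaluate.py` and may differ, e.g. in normalization):

```python
def field_recall(truth: dict, pred: dict) -> float:
    """Fraction of ground-truth fields whose predicted value matches exactly."""
    if not truth:
        return 0.0
    hits = sum(
        1 for field, value in truth.items()
        if str(pred.get(field, "")).strip() == str(value).strip()
    )
    return hits / len(truth)
```

A missing document (reason 1 above) yields an empty prediction dict and therefore 0% recall for all of its fields.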
```bash
# Clone the repository
git clone <repo-url>
cd metadata-extraction

# Install dependencies
pip install -r requirements.txt

# Create .env file with your API key
echo "GEMINI_API_KEY=your_api_key_here" > .env

# Optional: add a second key for rotation
echo "GEMINI_API_KEY_2=your_second_key_here" >> .env
```

Then run:

```bash
python main.py
```

This will prompt you to choose:
- Validate on training data — processes train/ documents and computes recall against train.csv
- Predict on test data — processes test/ documents and saves predictions.csv
- Both — runs validation then prediction
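The multi-model retry and key rotation mentioned in the pipeline can be approximated by cycling through (model, key) pairs until one call succeeds. `with_retry` is a hypothetical helper, not the actual `src/llm_client.py` API:

```python
import itertools
import time

def with_retry(call, keys, models, attempts=2, delay=1.0):
    """Try each (model, key) pair in turn, retrying transient failures (hypothetical helper)."""
    last_err = None
    for model, key in itertools.product(models, keys):
        for _ in range(attempts):
            try:
                return call(model, key)
            except Exception as exc:  # e.g. quota exhaustion or transient API errors
                last_err = exc
                time.sleep(delay)
    raise last_err
```

In practice `call` would wrap a Gemini request (e.g. via the `google-generativeai` SDK) configured with the given key, so a quota error on one key or model falls through to the next pair.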
```bash
# Start the API server
uvicorn api.app:app --host 0.0.0.0 --port 8000
```

Then upload a document:
```bash
curl -X POST "http://localhost:8000/extract" \
  -F "file=@path/to/document.docx"
```

Response:
```json
{
  "filename": "document.docx",
  "metadata": {
    "Agreement Value": "8000",
    "Agreement Start Date": "01.04.2011",
    "Agreement End Date": "31.03.2012",
    "Renewal Notice (Days)": "90",
    "Party One": "K. Parthasarathy",
    "Party Two": "Veerabrahmam Bathini"
  },
  "status": "success"
}
```

Project structure:

```
metadata-extraction/
├── main.py
├── requirements.txt
├── README.md
├── Dockerfile
├── render.yaml
├── .gitignore
├── .dockerignore
├── .env
├── predictions.csv
├── train_predictions.csv
├── api/
│   └── app.py
├── data/
│   ├── train.csv
│   ├── test.csv
│   ├── train/
│   └── test/
├── src/
│   ├── __init__.py
│   ├── text_extractor.py
│   ├── prompt_builder.py
│   ├── llm_client.py
│   ├── post_processor.py
│   └── evaluate.py
└── notebooks/
```
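The post-processing stage's date and value cleanup can be sketched with two small normalizers. These are illustrative stand-ins for `src/post_processor.py`, and the set of accepted input date formats is an assumption:

```python
import re
from datetime import datetime

def clean_value(raw: str) -> str:
    """Keep digits only, e.g. strip currency symbols and separators (illustrative)."""
    return re.sub(r"\D", "", raw)

def clean_date(raw: str) -> str:
    """Normalize common date spellings to DD.MM.YYYY; pass through if unparseable."""
    raw = raw.strip().rstrip(".,")
    for fmt in ("%d.%m.%Y", "%d/%m/%Y", "%d-%m-%Y", "%d %B %Y", "%B %d %Y"):
        try:
            return datetime.strptime(raw, fmt).strftime("%d.%m.%Y")
        except ValueError:
            continue
    return raw  # leave unknown formats untouched rather than guess
```

Normalizing before comparison keeps recall scoring from penalizing purely cosmetic differences like "Rs. 8,000/-" vs "8000".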
Where we are today, and where we're headed:
```
┌───────────────────┐                         ┌───────────────────────────────────┐
│ .docx & .png      │ ───── Format ──────▶    │ PDF, TIFF, handwritten scans      │
│ English only      │ ───── Language ────▶    │ Hindi, Tamil, Telugu & more       │
│ Tesseract OCR     │ ───── Engine ──────▶    │ Google Document AI / AWS Textract │
│ 6 metadata fields │ ───── Coverage ────▶    │ 20+ fields (deposit, address…)    │
│ Single file       │ ───── Scale ───────▶    │ Batch upload + async queue        │
│ Gemini only       │ ───── Intelligence ─▶   │ Multi-LLM ensemble + RAG          │
│ Rental agreements │ ───── Domain ──────▶    │ Any legal document type           │
└───────────────────┘                         └───────────────────────────────────┘
```
- Fine-tune open-source LLMs
- RAG-powered few-shot selection
- Multimodal models
- Self-improving prompts
> Note: The first request after idle may take ~30-60 seconds.

