An open financial document processing system that extracts entities and relationships from financial documents (PDFs, Excel files) and enables intelligent querying through a hybrid RAG (Retrieval-Augmented Generation) system using Knowledge Graphs.
- Multi-format Document Processing: Support for PDF and Excel files
- Financial Entity Extraction: Automated extraction of financial entities (amounts, account numbers, companies, etc.)
- Relationship Discovery: Hybrid approach combining rule-based, table-based, LLM-powered, and proximity-based relationship extraction
- Knowledge Graph Storage: SQLite-based knowledge graph with comprehensive relationship support
- RAG-Powered Querying: Natural language querying with specialized handlers for financial queries
- REST API: Complete FastAPI-based REST API for all operations
- Command Line Interface: Full CLI for document processing and querying
- Audit System: Built-in data quality checks and validation
- Vector Store: ChromaDB integration for semantic search
┌─────────────────────────────────────────────────────────────────┐
│ Processing Engine │
├─────────────────────────────────────────────────────────────────┤
│ • Model Loading & Management │
│ • Pipeline Orchestration │
│ • Memory Management │
│ • Error Handling & Recovery │
│ • Configuration Management │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Document Parsers │
├─────────────────────────────────────────────────────────────────┤
│ PDF Parser │ Excel Parser │
│ ├─ Text Extraction │ ├─ Sheet Processing │
│ ├─ Table Detection │ ├─ Financial Data Detection │
│ ├─ Metadata Extraction │ ├─ Cell Type Recognition │
│ └─ Structure Analysis │ └─ Formula Extraction │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Entity Extraction │
├─────────────────────────────────────────────────────────────────┤
│ NER Models │ Pattern Matching │
│ ├─ BERT-based Models │ ├─ Financial Regex │
│ ├─ Token Classification │ ├─ Account Numbers │
│ ├─ Confidence Scoring │ ├─ Currency Detection │
│ └─ Entity Properties │ └─ Date/Time Patterns │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Relationship Extraction │
├─────────────────────────────────────────────────────────────────┤
│ Rule-based │ Table-based │ LLM-powered │ Proximity │
│ ├─ Pattern Rules │ ├─ Row Analysis │ ├─ Context │ ├─ Distance│
│ ├─ Financial │ ├─ Column │ ├─ Semantic │ ├─ Co-occur│
│ │ Logic │ │ Relationships │ │ Analysis │ │ Analysis│
│ └─ Domain Rules │ └─ Cell Links │ └─ Generate │ └─ Scoring │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Knowledge Graph Storage │
├─────────────────────────────────────────────────────────────────┤
│ SQLite Database │
│ ├─ Entities Table (id, type, text, confidence, properties) │
│ ├─ Relationships Table (id, source, target, type, confidence) │
│ ├─ Documents Table (id, filename, content, metadata) │
│ └─ Indexes (entity_type, relationship_type, confidence) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ RAG System │
├─────────────────────────────────────────────────────────────────┤
│ Vector Store (ChromaDB) │ Query Processing │
│ ├─ Document Embeddings │ ├─ Question Analysis │
│ ├─ Semantic Search │ ├─ Intent Recognition │
│ ├─ Similarity Matching │ ├─ Handler Selection │
│ └─ Retrieval Scoring │ └─ Response Generation │
└─────────────────────────────────────────────────────────────────┘
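The storage layer shown in the diagram can be sketched with Python's built-in `sqlite3` module. The table and column names below follow the diagram; the column types, the JSON-encoded `properties` and `metadata` columns, the foreign keys, the index names, and the helper functions are assumptions for illustration, not the project's actual schema.

```python
import json
import sqlite3

# Assumed schema: names come from the architecture diagram, types are guesses.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id TEXT PRIMARY KEY,
    filename TEXT NOT NULL,
    content TEXT,
    metadata TEXT
);
CREATE TABLE IF NOT EXISTS entities (
    id TEXT PRIMARY KEY,
    type TEXT NOT NULL,
    text TEXT NOT NULL,
    confidence REAL,
    properties TEXT
);
CREATE TABLE IF NOT EXISTS relationships (
    id TEXT PRIMARY KEY,
    source TEXT NOT NULL REFERENCES entities(id),
    target TEXT NOT NULL REFERENCES entities(id),
    type TEXT NOT NULL,
    confidence REAL
);
CREATE INDEX IF NOT EXISTS idx_entity_type ON entities(type);
CREATE INDEX IF NOT EXISTS idx_relationship_type ON relationships(type);
CREATE INDEX IF NOT EXISTS idx_entity_confidence ON entities(confidence);
"""


def open_graph(path=":memory:"):
    """Open (or create) the database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_entity(conn, entity_id, entity_type, text, confidence, properties=None):
    """Insert one entity; properties are stored as a JSON string."""
    conn.execute(
        "INSERT INTO entities (id, type, text, confidence, properties) "
        "VALUES (?, ?, ?, ?, ?)",
        (entity_id, entity_type, text, confidence, json.dumps(properties or {})),
    )


def entities_by_type(conn, entity_type, min_confidence=0.0):
    """Return (id, text, confidence) rows for one type, most confident first."""
    rows = conn.execute(
        "SELECT id, text, confidence FROM entities "
        "WHERE type = ? AND confidence >= ? ORDER BY confidence DESC",
        (entity_type, min_confidence),
    )
    return rows.fetchall()
```

Indexing `type` and `confidence` keeps the filtered lookups that the query handlers rely on cheap, which is presumably why the diagram lists those indexes.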
- Python 3.8+
- 4GB+ RAM (8GB+ recommended for GPU usage)
- CUDA-compatible GPU (optional, but recommended)
The system uses open-source models and libraries:
- Language Models: Transformers-based models (configurable)
- NER Models: Token classification models for entity extraction
- Embeddings: Sentence transformers for semantic search
- Vector Store: ChromaDB for document retrieval
- Database: SQLite for knowledge graph storage
- Clone the repository:
git clone https://github.com/abtonmoy/Financial-Knowledge-Graph.git
cd Financial-Knowledge-Graph
- Create a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Initialize the system:
python -m financial_kg.main test

# Process a PDF or Excel file
python -m financial_kg.main process /path/to/financial_document.pdf

# Ask questions about your documents
python -m financial_kg.main query "What is the highest amount in the statement?"
python -m financial_kg.main query "Who are the account holders?"

# Start the FastAPI server
python -m financial_kg.main server

Access the API documentation at http://localhost:8000/docs
curl -X POST "http://localhost:8000/api/v1/upload" \
-H "accept: application/json" \
-H "Content-Type: multipart/form-data" \
-F "file=@your_document.pdf"

curl -X POST "http://localhost:8000/api/v1/query" \
-H "accept: application/json" \
-H "Content-Type: application/json" \
-d '{"question": "What is the total amount of all transactions?"}'

curl -X GET "http://localhost:8000/api/v1/entities?entity_type=MONEY&limit=10"

curl -X GET "http://localhost:8000/api/v1/relationships?limit=10"
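The same query and entity calls can be made from Python using only the standard library. This is a minimal sketch: the endpoint paths and the `question`, `entity_type`, and `limit` parameters come from the curl examples above, while the helper names are hypothetical and the shape of the JSON response is not documented here, so `send` simply returns whatever the server sends back.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000/api/v1"


def build_query_request(question, base_url=BASE_URL):
    """Build the POST request for the /query endpoint."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def build_entities_url(entity_type=None, limit=10, base_url=BASE_URL):
    """Build the GET URL for the /entities endpoint."""
    params = {}
    if entity_type:
        params["entity_type"] = entity_type
    params["limit"] = limit
    return f"{base_url}/entities?{urllib.parse.urlencode(params)}"


def send(request_or_url, timeout=30):
    """Send a request to a running server and decode the JSON reply."""
    with urllib.request.urlopen(request_or_url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `send(build_query_request("What is the total amount of all transactions?"))` would perform the second curl call above.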
- Processing Engine (processing_engine.py)
  - Orchestrates the entire processing pipeline
  - Manages model loading and inference
  - Coordinates between all system components
- Document Parsers (parsers/)
  - PDF Parser: Extracts text and tables from PDF files
  - Excel Parser: Processes Excel spreadsheets with financial data detection
- Entity Extraction (extractors/entity_extractor.py)
  - Combines NER models with regex patterns
  - Extracts financial entities: amounts, account numbers, companies, etc.
  - Provides confidence scoring and property extraction
- Relationship Extraction (extractors/relationship_extractor.py)
  - Rule-based: Pattern matching for common financial relationships
  - Table-based: Infers relationships from table structures
  - LLM-powered: Uses language models for complex relationship understanding
  - Proximity-based: Finds relationships based on entity proximity in text
- Knowledge Graph (storage/knowledge_graph.py)
  - SQLite-based storage with full relationship support
  - Entity and relationship querying with multiple filters
  - Statistical analysis and data validation
- RAG System (rag/generator.py)
  - Hybrid retrieval combining vector search and knowledge graph facts
  - Specialized query handlers for financial questions
  - Local LLM integration for answer generation
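Of the four relationship strategies listed above, the proximity-based one is the easiest to illustrate in isolation. The sketch below proposes a candidate relationship for every pair of entities whose character spans lie within a maximum distance, with a confidence that decays linearly as the gap grows. The `Span` type, the 50-character default, the linear scoring, and the choice of `ASSOCIATION` as the label are all assumptions; the project's extractor may work differently.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Span:
    """Hypothetical entity mention with character offsets into the text."""
    text: str
    type: str
    start: int
    end: int


def proximity_relationships(spans, max_distance=50):
    """Propose ASSOCIATION candidates for entity pairs that lie close together."""
    candidates = []
    ordered = sorted(spans, key=lambda s: s.start)
    for left, right in combinations(ordered, 2):
        # Gap between the end of the earlier mention and the start of the later one.
        gap = max(0, right.start - left.end)
        if gap >= max_distance:
            continue
        # Assumed scoring: 1.0 for adjacent mentions, approaching 0 at max_distance.
        confidence = round(1.0 - gap / max_distance, 2)
        candidates.append((left.text, right.text, "ASSOCIATION", confidence))
    return candidates
```

In a hybrid setup, low-confidence candidates like these would typically be kept only when no rule-based or table-based relationship already links the same pair.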
- Entity: Represents financial entities with type, text, confidence, and properties
- Relationship: Connects entities with type, confidence, and metadata
- Document: Stores processed documents with extracted content
- QueryResult: Encapsulates query responses with sources and metadata
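The four data models above map naturally onto Python dataclasses. The field names below are inferred from the descriptions in this list and from the architecture diagram; the exact names, types, and defaults in the codebase may differ, so treat this as an illustrative sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Entity:
    id: str
    type: str            # e.g. "MONEY"
    text: str            # surface form found in the document
    confidence: float
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    id: str
    source: str          # id of the source entity
    target: str          # id of the target entity
    type: str            # e.g. "PAYMENT"
    confidence: float
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Document:
    id: str
    filename: str
    content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class QueryResult:
    answer: str
    sources: List[str] = field(default_factory=list)
    metadata: Dict[str, Any] = field(default_factory=dict)
```

Using `default_factory` gives every instance its own dictionary or list, so properties added to one entity never leak into another.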
The system is configured through config.py. Key settings include:
# Model Configuration
MODELS = {
"llm": {"name": "microsoft/DialoGPT-medium"},
"ner": {"name": "dbmdz/bert-large-cased-finetuned-conll03-english"},
"embeddings": {"name": "all-MiniLM-L6-v2"}
}
# Processing Settings
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50
MAX_GENERATION_LENGTH = 200
# Storage Paths
DATABASE_PATH = "./data/knowledge_graph.db"
VECTOR_STORE_PATH = "./data/vector_store"

The system includes specialized handlers for different types of financial queries:
- "What is the highest amount?"
- "Find the maximum payment"
- Automatically analyzes all monetary entities and finds extremes
- "What account numbers are mentioned?"
- "Who are the companies involved?"
- Targeted entity retrieval with type-specific filtering
- Uses hybrid RAG approach for complex questions
- Combines document search with knowledge graph facts
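One way to picture the handler selection described above is a small keyword router that sends a question to an amount handler, an entity lookup, or the general hybrid RAG path. The keyword lists, handler names, and entity type labels below are hypothetical and only illustrate the idea; the project's intent recognition may be more involved.

```python
import re

# Assumed trigger words for "find the extreme amount" questions.
AMOUNT_KEYWORDS = ("highest", "maximum", "largest", "lowest", "minimum", "smallest")
# Assumed keyword stems mapped to entity type labels.
ENTITY_KEYWORDS = {"account": "ACCOUNT_NUMBER", "compan": "ORG"}


def classify_query(question):
    """Pick a handler name for a natural-language question."""
    q = question.lower()
    if any(word in q for word in AMOUNT_KEYWORDS):
        return "amount_extreme"
    if any(stem in q for stem in ENTITY_KEYWORDS):
        return "entity_lookup"
    return "general_rag"


def parse_amount(text):
    """Convert a monetary string such as '$1,200.50' to a float."""
    return float(re.sub(r"[^\d.\-]", "", text))


def highest_amount(money_texts):
    """Return the monetary string with the largest numeric value."""
    return max(money_texts, key=parse_amount)
```

Questions that match neither keyword list fall through to `general_rag`, mirroring the fallback to combined document search and knowledge graph facts.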
python -m financial_kg.main stats
python -m financial_kg.main audit
curl -X GET "http://localhost:8000/api/v1/health"

The system extracts rich properties for different entity types:
- Money Entities: Amount, currency, category (small/medium/large), readable format
- Account Numbers: Validation, masking, length checks
- Percentages: Decimal values, categorization
- Routing Numbers: Format validation
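As an illustration of the money properties listed above, the following sketch derives an amount, currency, size category, and readable format from a matched string. The regular expression, the symbol-to-currency table, and the category thresholds (under 100 is small, under 10,000 is medium) are assumptions chosen for the example, not values taken from the project.

```python
import re

CURRENCY_SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP"}  # assumed mapping
MONEY_PATTERN = re.compile(r"([$€£])\s?(\d+(?:,\d{3})*(?:\.\d{1,2})?)")


def money_properties(text):
    """Extract properties for the first monetary amount in text, or None."""
    match = MONEY_PATTERN.search(text)
    if match is None:
        return None
    symbol, digits = match.groups()
    amount = float(digits.replace(",", ""))
    # Assumed thresholds for the small/medium/large categories.
    if amount < 100:
        category = "small"
    elif amount < 10_000:
        category = "medium"
    else:
        category = "large"
    return {
        "amount": amount,
        "currency": CURRENCY_SYMBOLS[symbol],
        "category": category,
        "readable": f"{symbol}{amount:,.2f}",
    }
```

Calling `money_properties("Total due: $1,200.50")` yields an amount of `1200.5` in `USD`, categorized as `medium` under these thresholds.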
- PAYMENT: Payment relationships between entities
- OWNERSHIP: Account ownership relationships
- TRANSACTION: Financial transactions
- EMPLOYMENT: Employment relationships
- ASSOCIATION: General associations
# Export entities and relationships
curl -X POST "http://localhost:8000/api/v1/export/graph" \
-H "Content-Type: application/json" \
-d '{"format": "json"}'

- Account number masking for sensitive data
- Input validation and sanitization
- Error handling and graceful degradation
- No external API dependencies for core functionality
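Two of the safeguards above, account number masking and input validation, are simple enough to sketch. The helper names, the choice to keep the last four digits visible, and the list of accepted file extensions are assumptions for illustration only.

```python
import re
from pathlib import Path

ALLOWED_EXTENSIONS = {".pdf", ".xlsx", ".xls"}  # assumed upload whitelist


def mask_account_number(account_number, visible=4):
    """Replace all but the last few digits of an account number with asterisks."""
    digits = re.sub(r"\D", "", account_number)
    if len(digits) <= visible:
        # Too short to reveal anything safely: mask everything.
        return "*" * len(digits)
    return "*" * (len(digits) - visible) + digits[-visible:]


def is_supported_upload(filename):
    """Accept only file names whose extension is on the whitelist."""
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS
```

Masking at extraction time, rather than at display time, means the unmasked number never needs to be stored in the knowledge graph.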
Run the built-in test with sample data:
python -m financial_kg.main test

| Command | Description |
|---|---|
| server | Start the FastAPI web server |
| process <file> | Process a single document |
| query '<question>' | Query the knowledge base |
| entities [type] | List entities, optionally filtered by type |
| audit | Run data quality audit checks |
| stats | Display system statistics |
| test | Run system test with sample data |
| clear | Clear all data from the system |
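For readers who want to see how a command set like this is usually wired up, here is a minimal `argparse` sketch with one subcommand per table row. It mirrors the command names and arguments from the table but is not the project's actual entry point; the argument names `file`, `question`, and `type` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser with one subcommand per CLI command in the table."""
    parser = argparse.ArgumentParser(prog="financial_kg.main")
    sub = parser.add_subparsers(dest="command", required=True)

    process = sub.add_parser("process", help="Process a single document")
    process.add_argument("file")

    query = sub.add_parser("query", help="Query the knowledge base")
    query.add_argument("question")

    entities = sub.add_parser("entities", help="List entities")
    entities.add_argument("type", nargs="?", default=None)  # optional filter

    # Commands that take no arguments.
    for name, description in [
        ("server", "Start the FastAPI web server"),
        ("audit", "Run data quality audit checks"),
        ("stats", "Display system statistics"),
        ("test", "Run system test with sample data"),
        ("clear", "Clear all data from the system"),
    ]:
        sub.add_parser(name, help=description)
    return parser
```

A dispatcher would then branch on `args.command` after calling `build_parser().parse_args()`.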
- Memory Issues: Reduce batch sizes or use CPU-only mode
- Model Loading Errors: Check internet connection for initial model downloads
- CUDA Issues: Ensure CUDA drivers are properly installed
- Use GPU acceleration when available
- Adjust chunk sizes based on document complexity
- Configure model quantization for memory efficiency
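To make the chunk size tip concrete, the sketch below splits text into overlapping windows using the `CHUNK_SIZE` and `CHUNK_OVERLAP` defaults from config.py. It assumes the sizes are measured in characters; the project may count tokens or split on sentence boundaries instead, and the function name is hypothetical.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Smaller chunks give more precise retrieval hits for dense tables and statements, while larger chunks preserve more context for narrative sections; the overlap keeps an entity that straddles a boundary visible in both neighbors.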
Business Source License 1.1 (BSL 1.1)
- Abdul Basit Tonmoy
- Hugging Face for transformer models
- ChromaDB for vector storage
- FastAPI for the web framework
- The open-source ML community

