Financial Knowledge Graph

An open financial document processing system that extracts entities and relationships from financial documents (PDF and Excel files) and supports natural-language querying through a hybrid Retrieval-Augmented Generation (RAG) system backed by a knowledge graph.

Features

Multi-format Document Processing: Support for PDF and Excel files
Financial Entity Extraction: Automated extraction of financial entities (amounts, account numbers, companies, etc.)
Relationship Discovery: Hybrid approach combining rule-based, table-based, LLM-powered, and proximity-based relationship extraction
Knowledge Graph Storage: SQLite-based knowledge graph with comprehensive relationship support
RAG-Powered Querying: Natural language querying with specialized handlers for financial queries
REST API: Complete FastAPI-based REST API for all operations
Command Line Interface: Full CLI for document processing and querying
Audit System: Built-in data quality checks and validation
Vector Store: ChromaDB integration for semantic search

System Architecture

                ┌─────────────────────────────────────────────────────────────────┐
                │                    Processing Engine                            │
                ├─────────────────────────────────────────────────────────────────┤
                │ • Model Loading & Management                                    │
                │ • Pipeline Orchestration                                        │
                │ • Memory Management                                             │
                │ • Error Handling & Recovery                                     │
                │ • Configuration Management                                      │
                └─────────────────────────────────────────────────────────────────┘
                                                    │
                                                    ▼
                ┌─────────────────────────────────────────────────────────────────┐
                │                    Document Parsers                             │
                ├─────────────────────────────────────────────────────────────────┤
                │ PDF Parser                  │  Excel Parser                     │
                │ ├─ Text Extraction         │  ├─ Sheet Processing               │
                │ ├─ Table Detection         │  ├─ Financial Data Detection       │
                │ ├─ Metadata Extraction     │  ├─ Cell Type Recognition          │
                │ └─ Structure Analysis      │  └─ Formula Extraction             │
                └─────────────────────────────────────────────────────────────────┘
                                                    │
                                                    ▼
                ┌─────────────────────────────────────────────────────────────────┐
                │                    Entity Extraction                            │
                ├─────────────────────────────────────────────────────────────────┤
                │ NER Models                  │  Pattern Matching                 │
                │ ├─ BERT-based Models       │  ├─ Financial Regex                │
                │ ├─ Token Classification    │  ├─ Account Numbers                │
                │ ├─ Confidence Scoring      │  ├─ Currency Detection             │
                │ └─ Entity Properties       │  └─ Date/Time Patterns             │
                └─────────────────────────────────────────────────────────────────┘
                                                    │
                                                    ▼
                ┌─────────────────────────────────────────────────────────────────┐
                │                 Relationship Extraction                         │
                ├─────────────────────────────────────────────────────────────────┤
                │ Rule-based        │ Table-based      │ LLM-powered │ Proximity  │
                │ ├─ Pattern Rules  │ ├─ Row Analysis  │ ├─ Context  │ ├─ Distance│
                │ ├─ Financial      │ ├─ Column        │ ├─ Semantic │ ├─ Co-occur│
                │ │  Logic          │ │  Relationships │ │  Analysis │ │  Analysis│
                │ └─ Domain Rules   │ └─ Cell Links    │ └─ Generate │ └─ Scoring │
                └─────────────────────────────────────────────────────────────────┘
                                                    │
                                                    ▼
                ┌─────────────────────────────────────────────────────────────────┐
                │                    Knowledge Graph Storage                      │
                ├─────────────────────────────────────────────────────────────────┤
                │ SQLite Database                                                 │
                │ ├─ Entities Table (id, type, text, confidence, properties)      │
                │ ├─ Relationships Table (id, source, target, type, confidence)   │
                │ ├─ Documents Table (id, filename, content, metadata)            │
                │ └─ Indexes (entity_type, relationship_type, confidence)         │
                └─────────────────────────────────────────────────────────────────┘
                                                    │
                                                    ▼
                ┌─────────────────────────────────────────────────────────────────┐
                │                       RAG System                                │
                ├─────────────────────────────────────────────────────────────────┤
                │ Vector Store (ChromaDB)     │  Query Processing                 │
                │ ├─ Document Embeddings     │  ├─ Question Analysis              │
                │ ├─ Semantic Search         │  ├─ Intent Recognition             │
                │ ├─ Similarity Matching     │  ├─ Handler Selection              │
                │ └─ Retrieval Scoring       │  └─ Response Generation            │
                └─────────────────────────────────────────────────────────────────┘
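
The storage box in the diagram can be pictured as a small SQLite schema. The following is a minimal sketch, not the project's actual schema: the table and column names follow the diagram, while the column types, the JSON-encoded `properties` and `metadata` columns, the index names, and the `create_schema` helper are assumptions made for illustration.

```python
import json
import sqlite3

# Hypothetical schema: names follow the architecture diagram;
# types, constraints, and index names are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id TEXT PRIMARY KEY,
    filename TEXT NOT NULL,
    content TEXT,
    metadata TEXT
);
CREATE TABLE IF NOT EXISTS entities (
    id TEXT PRIMARY KEY,
    type TEXT NOT NULL,
    text TEXT NOT NULL,
    confidence REAL,
    properties TEXT
);
CREATE TABLE IF NOT EXISTS relationships (
    id TEXT PRIMARY KEY,
    source TEXT NOT NULL REFERENCES entities(id),
    target TEXT NOT NULL REFERENCES entities(id),
    type TEXT NOT NULL,
    confidence REAL
);
CREATE INDEX IF NOT EXISTS idx_entity_type ON entities(type);
CREATE INDEX IF NOT EXISTS idx_relationship_type ON relationships(type);
CREATE INDEX IF NOT EXISTS idx_entity_confidence ON entities(confidence);
"""


def create_schema(conn):
    """Create the illustrative tables and indexes on an open connection."""
    conn.executescript(SCHEMA)


# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
create_schema(conn)

conn.execute(
    "INSERT INTO entities VALUES (?, ?, ?, ?, ?)",
    ("e1", "MONEY", "$1,250.00", 0.95, json.dumps({"amount": 1250.0})),
)
conn.execute(
    "INSERT INTO entities VALUES (?, ?, ?, ?, ?)",
    ("e2", "COMPANY", "Acme Corp", 0.90, json.dumps({})),
)
conn.execute(
    "INSERT INTO relationships VALUES (?, ?, ?, ?, ?)",
    ("r1", "e2", "e1", "PAYMENT", 0.80),
)

# Resolve relationship endpoints back to entity text.
rows = conn.execute(
    """
    SELECT s.text, r.type, t.text
    FROM relationships AS r
    JOIN entities AS s ON s.id = r.source
    JOIN entities AS t ON t.id = r.target
    """
).fetchall()
print(rows)
```

Storing `properties` as JSON text keeps the schema flat while still allowing type-specific fields per entity; whether the project does the same is not stated here.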

Requirements

System Requirements

Python 3.8+
4GB+ RAM (8GB+ recommended for GPU usage)
CUDA-compatible GPU (optional, but recommended)

Dependencies

The system uses open-source models and libraries:

Language Models: Transformers-based models (configurable)
NER Models: Token classification models for entity extraction
Embeddings: Sentence transformers for semantic search
Vector Store: ChromaDB for document retrieval
Database: SQLite for knowledge graph storage

Installation

Clone the repository:

git clone https://github.com/abtonmoy/Financial-Knowledge-Graph.git
cd Financial-Knowledge-Graph

Create a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Initialize the system:

python -m financial_kg.main test

Quick Start

1. Process a Document

# Process a PDF or Excel file
python -m financial_kg.main process /path/to/financial_document.pdf

2. Query the Knowledge Base

# Ask questions about your documents
python -m financial_kg.main query "What is the highest amount in the statement?"
python -m financial_kg.main query "Who are the account holders?"

3. Start the Web API

# Start the FastAPI server
python -m financial_kg.main server

Access the API documentation at http://localhost:8000/docs

API Usage

Upload and Process Documents

curl -X POST "http://localhost:8000/api/v1/upload" \
     -H "accept: application/json" \
     -H "Content-Type: multipart/form-data" \
     -F "file=@your_document.pdf"

Query Documents

curl -X POST "http://localhost:8000/api/v1/query" \
     -H "accept: application/json" \
     -H "Content-Type: application/json" \
     -d '{"question": "What is the total amount of all transactions?"}'

Get Entities

curl -X GET "http://localhost:8000/api/v1/entities?entity_type=MONEY&limit=10"

Get Relationships

curl -X GET "http://localhost:8000/api/v1/relationships?limit=10"
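
The curl calls above can also be issued from Python. The sketch below uses only the standard library and prepares requests for the `/api/v1/query` and `/api/v1/entities` endpoints shown in this section. The helper names (`build_query_request`, `build_entities_request`, `send`) are hypothetical, and the shape of the JSON responses is not documented here, so `send` simply returns whatever the server sends back.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8000/api/v1"  # default address used in this README


def build_query_request(question):
    """Prepare a POST to /query with a JSON body, mirroring the curl example."""
    body = json.dumps({"question": question}).encode("utf-8")
    return Request(
        f"{BASE_URL}/query",
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def build_entities_request(entity_type=None, limit=10):
    """Prepare a GET to /entities with optional type filter and a result limit."""
    params = {}
    if entity_type is not None:
        params["entity_type"] = entity_type
    params["limit"] = limit
    return Request(f"{BASE_URL}/entities?{urlencode(params)}", method="GET")


def send(request):
    """Send a prepared request and decode the JSON reply (needs a running server)."""
    with urlopen(request) as response:
        return json.load(response)


# Usage, once the server is running:
#   send(build_query_request("What is the total amount of all transactions?"))
#   send(build_entities_request("MONEY", limit=10))
```

Separating request construction from sending makes the request shape easy to inspect or test without a live server.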

Architecture

Core Components

Processing Engine (processing_engine.py)
- Orchestrates the entire processing pipeline
- Manages model loading and inference
- Coordinates between all system components

Document Parsers (parsers/)
- PDF Parser: Extracts text and tables from PDF files
- Excel Parser: Processes Excel spreadsheets with financial data detection

Entity Extraction (extractors/entity_extractor.py)
- Combines NER models with regex patterns
- Extracts financial entities: amounts, account numbers, companies, etc.
- Provides confidence scoring and property extraction

Relationship Extraction (extractors/relationship_extractor.py)
- Rule-based: Pattern matching for common financial relationships
- Table-based: Infers relationships from table structures
- LLM-powered: Uses language models for complex relationship understanding
- Proximity-based: Finds relationships based on entity proximity in text

Knowledge Graph (storage/knowledge_graph.py)
- SQLite-based storage with full relationship support
- Entity and relationship querying with multiple filters
- Statistical analysis and data validation

RAG System (rag/generator.py)
- Hybrid retrieval combining vector search and knowledge graph facts
- Specialized query handlers for financial questions
- Local LLM integration for answer generation
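
To make the regex and proximity-based steps above concrete, here is a minimal sketch of how pattern matching and proximity scoring might fit together. It is not the project's code: the two regexes, the `ACCOUNT_NUMBER` type name, the 50-character window, and the linear confidence formula are all assumptions chosen for illustration. Only the `MONEY` type and the `ASSOCIATION` relationship type appear elsewhere in this README.

```python
import re
from itertools import combinations

# Illustrative patterns only; the project's actual regexes are not shown here.
PATTERNS = {
    "MONEY": re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?"),
    "ACCOUNT_NUMBER": re.compile(r"\b\d{8,12}\b"),
}


def extract_entities(text):
    """Return (type, text, start, end) tuples sorted by position in the text."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((entity_type, match.group(), match.start(), match.end()))
    return sorted(found, key=lambda entity: entity[2])


def proximity_relationships(entities, max_distance=50):
    """Link entities of different types that lie within max_distance characters.

    Closer pairs get a higher confidence (assumed linear falloff).
    """
    relationships = []
    for first, second in combinations(entities, 2):
        if first[0] == second[0]:
            continue  # skip same-type pairs in this sketch
        gap = max(0, second[2] - first[3])  # characters between the two spans
        if gap <= max_distance:
            confidence = round(1.0 - gap / (max_distance + 1), 2)
            relationships.append((first[1], "ASSOCIATION", second[1], confidence))
    return relationships


text = "Account 123456789 received a payment of $1,250.00 on March 3."
entities = extract_entities(text)
relations = proximity_relationships(entities)
for relation in relations:
    print(relation)
```

A proximity pass like this is cheap but noisy, which is presumably why the system combines it with rule-based, table-based, and LLM-powered extraction.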

Data Models

Entity: Represents financial entities with type, text, confidence, and properties
Relationship: Connects entities with type, confidence, and metadata
Document: Stores processed documents with extracted content
QueryResult: Encapsulates query responses with sources and metadata
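
As a rough picture of these models, the sketch below expresses them as dataclasses. The class names and the fields mentioned above (type, text, confidence, properties, metadata, sources) come from this README; every other field name, the types, and the defaults are assumptions, and the real definitions in the `models` package may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Entity:
    id: str
    type: str            # e.g. "MONEY"
    text: str            # the matched text as it appears in the document
    confidence: float
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    id: str
    source_id: str       # assumed: id of the source Entity
    target_id: str       # assumed: id of the target Entity
    type: str            # e.g. "PAYMENT"
    confidence: float
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Document:
    id: str
    filename: str
    content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class QueryResult:
    question: str        # assumed field
    answer: str          # assumed field
    sources: List[str] = field(default_factory=list)
    metadata: Dict[str, Any] = field(default_factory=dict)


amount = Entity(id="e1", type="MONEY", text="$1,250.00", confidence=0.95)
payer = Entity(id="e2", type="COMPANY", text="Acme Corp", confidence=0.90)
payment = Relationship(
    id="r1", source_id=payer.id, target_id=amount.id, type="PAYMENT", confidence=0.80
)
print(payment)
```

Using `default_factory` gives each instance its own `properties` or `metadata` dictionary rather than a shared mutable default.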

Configuration

The system is configured through config.py. Key settings include:

# Model Configuration
MODELS = {
    "llm": {"name": "microsoft/DialoGPT-medium"},
    "ner": {"name": "dbmdz/bert-large-cased-finetuned-conll03-english"},
    "embeddings": {"name": "all-MiniLM-L6-v2"}
}

# Processing Settings
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50
MAX_GENERATION_LENGTH = 200

# Storage Paths
DATABASE_PATH = "./data/knowledge_graph.db"
VECTOR_STORE_PATH = "./data/vector_store"
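
To illustrate how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, here is a minimal sliding-window chunker. It assumes the sizes are measured in characters, which this README does not specify (they could equally be tokens), and the `chunk_text` function is a hypothetical helper rather than the project's implementation.

```python
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50


def chunk_text(text, chunk_size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap  # how far each new window advances
    # Stop once the remaining tail is already covered by the previous overlap.
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


sample = "".join(str(i % 10) for i in range(1000))
chunks = chunk_text(sample)
print([len(chunk) for chunk in chunks])
```

With the defaults, each window advances by 450 characters, so the last 50 characters of one chunk reappear at the start of the next. The overlap reduces the chance that an entity or sentence is split across a chunk boundary, at the cost of some duplicated text in the vector store.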

Specialized Query Features

The system includes specialized handlers for different types of financial queries:

Comparative Amount Queries

"What is the highest amount?"
"Find the maximum payment"
Automatically analyzes all monetary entities and finds extremes

Specific Entity Queries

"What account numbers are mentioned?"
"Who are the companies involved?"
Targeted entity retrieval with type-specific filtering

General Financial Queries

Uses hybrid RAG approach for complex questions
Combines document search with knowledge graph facts
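
The three query categories above suggest a simple routing step ahead of answer generation. The sketch below shows one way such routing and a comparative-amount handler might look. The keyword lists, handler names, and the dictionary shape of an entity are all assumptions; the project's actual intent recognition is not described in this README.

```python
import re

# Assumed keyword lists for routing; not taken from the project.
COMPARATIVE = re.compile(
    r"\b(highest|largest|maximum|max|lowest|smallest|minimum|min)\b", re.IGNORECASE
)
AMOUNT_TERMS = re.compile(r"\b(amount|payment|transaction)", re.IGNORECASE)
ENTITY_KEYWORDS = ("account number", "compan", "routing number")


def classify(question):
    """Return the name of the handler that should answer the question."""
    lowered = question.lower()
    if COMPARATIVE.search(lowered) and AMOUNT_TERMS.search(lowered):
        return "comparative_amount"
    if any(keyword in lowered for keyword in ENTITY_KEYWORDS):
        return "specific_entity"
    return "general"


def parse_amount(text):
    """Convert a string such as '$1,250.00' to a float (US-style formatting assumed)."""
    return float(text.replace("$", "").replace(",", ""))


def highest_amount(entities):
    """Return the MONEY entity with the largest value, or None if there is none."""
    money = [entity for entity in entities if entity["type"] == "MONEY"]
    return max(money, key=lambda entity: parse_amount(entity["text"]), default=None)


entities = [
    {"type": "MONEY", "text": "$1,250.00"},
    {"type": "MONEY", "text": "$980.10"},
    {"type": "COMPANY", "text": "Acme Corp"},
]
print(classify("What is the highest amount?"))
print(highest_amount(entities))
```

Answering comparative questions directly from stored entities avoids asking a language model to do arithmetic over retrieved text, which is a plausible reason for having a dedicated handler.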

System Statistics and Monitoring

View Statistics

python -m financial_kg.main stats

Run Audit Checks

python -m financial_kg.main audit

Health Check

curl -X GET "http://localhost:8000/api/v1/health"

Advanced Features

Entity Properties

The system extracts rich properties for different entity types:

Money Entities: Amount, currency, category (small/medium/large), readable format
Account Numbers: Validation, masking, length checks
Percentages: Decimal values, categorization
Routing Numbers: Format validation
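
The property types listed above can be illustrated with a few small helpers. This is a sketch under stated assumptions: the function names are hypothetical, the small/medium/large thresholds (100 and 10,000) are invented for the example, and the README does not say whether routing-number validation goes beyond format. The checksum shown is the standard ABA routing number check, included here as one plausible validation.

```python
import re


def mask_account_number(number, visible=4):
    """Replace all but the last `visible` digits with asterisks."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) <= visible:
        return "*" * len(digits)
    return "*" * (len(digits) - visible) + digits[-visible:]


def is_valid_routing_number(number):
    """Check the 9-digit format and the ABA checksum.

    3*(d1+d4+d7) + 7*(d2+d5+d8) + (d3+d6+d9) must be divisible by 10.
    """
    if not re.fullmatch(r"[0-9]{9}", number):
        return False
    d = [int(ch) for ch in number]
    total = 3 * (d[0] + d[3] + d[6]) + 7 * (d[1] + d[4] + d[7]) + (d[2] + d[5] + d[8])
    return total % 10 == 0


def money_properties(text, currency="USD"):
    """Derive amount, category, and a readable form from a string like '$1,250.00'."""
    amount = float(text.replace("$", "").replace(",", ""))
    if amount < 100:            # assumed threshold
        category = "small"
    elif amount < 10_000:       # assumed threshold
        category = "medium"
    else:
        category = "large"
    return {
        "amount": amount,
        "currency": currency,
        "category": category,
        "readable": f"${amount:,.2f}",
    }


print(mask_account_number("123456789"))
print(is_valid_routing_number("021000021"))
print(money_properties("$1,250.00"))
```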

Relationship Types

PAYMENT: Payment relationships between entities
OWNERSHIP: Account ownership relationships
TRANSACTION: Financial transactions
EMPLOYMENT: Employment relationships
ASSOCIATION: General associations

Data Export

# Export entities and relationships
curl -X POST "http://localhost:8000/api/v1/export/graph" \
     -H "Content-Type: application/json" \
     -d '{"format": "json"}'

Security Features

Account number masking for sensitive data
Input validation and sanitization
Error handling and graceful degradation
No external API dependencies for core functionality

Testing

Run the built-in test with sample data:

python -m financial_kg.main test

CLI Commands

Command	Description
`server`	Start the FastAPI web server
`process <file>`	Process a single document
`query '<question>'`	Query the knowledge base
`entities [type]`	List entities, optionally filtered by type
`audit`	Run data quality audit checks
`stats`	Display system statistics
`test`	Run system test with sample data
`clear`	Clear all data from the system

Troubleshooting

Common Issues

Memory Issues: Reduce batch sizes or use CPU-only mode
Model Loading Errors: Check internet connection for initial model downloads
CUDA Issues: Ensure CUDA drivers are properly installed

Performance Optimization

Use GPU acceleration when available
Adjust chunk sizes based on document complexity
Configure model quantization for memory efficiency

License

Business Source License 1.1 (BSL 1.1)

Developer

Abdul Basit Tonmoy

Acknowledgments

Hugging Face for transformer models
ChromaDB for vector storage
FastAPI for the web framework
The open-source ML community
