Synthetic fraud graph generator for training and benchmarking graph-based fraud detection models in financial services.
gen_fraud_graph is an open-source Python tool that generates massive synthetic financial transaction graphs with injected fraud patterns and optional vector embeddings. It produces CSV datasets ready for ingestion into graph databases (TigerGraph, Neptune, Neo4j, JanusGraph) or for training graph neural networks (GNNs).
The generator creates three types of data:

- Account nodes: synthetic customer accounts with balance, risk score, and optional embedding vectors
- Transaction edges: normal financial transactions between accounts
- Fraud rings: cyclic money-laundering patterns with suspicious transaction descriptions

Key features:

- Massive scale: generate from 1K to 100M+ accounts with a configurable scale factor
- Fraud pattern injection: cyclic money-laundering rings with configurable depth (4–7 hops)
- Parallel generation: multi-process workers for fast generation on high-core machines
- Vector embeddings: three providers, `fake` (random, fast), `local` (SentenceTransformers), and `openai` (API)
- Multiple formats: generic CSV or AWS Neptune bulk-load format
- Resume support: interrupted generation can resume from where it left off
- Privacy by design: all data is 100% synthetic; no real financial data is used

Typical use cases:

- Training and evaluating graph neural networks (GNNs) for fraud detection
- Benchmarking anti-money laundering (AML) detection algorithms
- Load-testing graph databases (TigerGraph, Neptune, JanusGraph, NebulaGraph, FalkorDB)
- Research in financial crime detection and anomaly detection on graphs
- Generating labeled datasets for deep learning on graph-structured data
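
To make the cyclic ring structure concrete, here is a minimal sketch of how one ring could be laid out as edges. It is not the package's implementation: `make_ring_edges` is a hypothetical helper, and it assumes that a ring of depth d visits d distinct accounts and closes back on the first one, using the fixed 9999 amount that this README lists for fraud transactions.

```python
import random


def make_ring_edges(account_ids, depth, rng, amount=9999.0):
    """Pick `depth` distinct accounts and link them into one closed cycle.

    Hypothetical helper for illustration; the real generator may differ.
    """
    if not 4 <= depth <= 7:
        raise ValueError("depth is expected to be between 4 and 7 hops")
    ring = rng.sample(account_ids, depth)
    # Edge i goes from ring[i] to the next account, wrapping to ring[0] at the end.
    return [(ring[i], ring[(i + 1) % depth], amount) for i in range(depth)]


accounts = [f"acc_{i}" for i in range(100)]
edges = make_ring_edges(accounts, depth=5, rng=random.Random(42))
for src, dst, amount in edges:
    print(f"{src} -> {dst}  {amount}")
```

Following the destination of each edge leads back to the starting account after `depth` hops, which is the property a cycle detector would look for.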

Install with pip:

    pip install gen-fraud-graph

With optional embedding providers:

    pip install 'gen-fraud-graph[local]'   # SentenceTransformers (local model)
    pip install 'gen-fraud-graph[openai]'  # OpenAI API embeddings
    pip install 'gen-fraud-graph[all]'     # Everything including dev tools

Or from source using uv:

    git clone https://github.com/SantanderAI/gen-fraud-graph.git
    cd gen-fraud-graph
    uv venv && source .venv/bin/activate
    uv pip install -e '.[dev]'

Example invocations:

    # Quick test (~1K accounts, ~9K transactions, fake embeddings)
    gen-fraud-graph --scale 0.0001 --provider fake --output ./data

    # Medium scale (~100K accounts, parallelized)
    gen-fraud-graph --scale 0.01 --workers 4 --output ./data

    # Full benchmark (~10M accounts, ~90M transactions)
    gen-fraud-graph --scale 1.0 --workers 24 --output ./data

    # Neptune bulk-load format
    gen-fraud-graph --scale 0.01 --format neptune --output ./neptune_data

    # Resume interrupted generation (skips completed files)
    gen-fraud-graph --scale 1.0 --workers 24 --skip-accounts --output ./data

Command-line options:

| Flag | Default | Description |
|---|---|---|
| `--scale` | `1.0` | Scale factor. `1.0` = ~10M accounts / ~90M transactions. `0.01` = ~100K accounts. |
| `--provider` | `fake` | Embedding provider: `fake` (random vectors), `local` (SentenceTransformers), `openai`. |
| `--output` | `data` | Output directory for generated CSV files. |
| `--workers` | `1` | Number of parallel worker processes. |
| `--batches` | `1` | Number of file chunks per worker. |
| `--format` | `csv` | Output format: `csv` (generic) or `neptune` (AWS Neptune bulk-load). |
| `--fraud-rings` | auto | Number of fraud rings. Default: auto-scaled from `--scale`. |
| `--compress` | off | ZIP-compress output CSV files. |
| `--skip-accounts` | off | Skip account generation (useful when resuming). |
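
For scripted runs at several scales, it can help to assemble the command line in code. The sketch below only builds and prints argument lists from the flags documented above; `build_command` is a hypothetical helper, and it assumes `--compress` is a plain on/off switch that takes no value.

```python
import shlex


def build_command(scale, output, workers=1, provider="fake", fmt="csv", compress=False):
    """Assemble an argv list from the documented gen-fraud-graph flags."""
    cmd = [
        "gen-fraud-graph",
        "--scale", str(scale),
        "--provider", provider,
        "--workers", str(workers),
        "--format", fmt,
        "--output", output,
    ]
    if compress:
        # Assumed to be a boolean switch, since its default is listed as "off".
        cmd.append("--compress")
    return cmd


for scale in (0.0001, 0.001, 0.01):
    cmd = build_command(scale, output=f"./data_{scale}", workers=4)
    # To execute instead of print: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids shell quoting problems when an output path contains spaces.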

Python API:

    # Copyright (c) 2026 Santander Group
    # SPDX-License-Identifier: Apache-2.0
    from gen_fraud_graph import Config, FraudGraphGenerator

    config = Config(
        scale_factor=0.001,         # ~10K accounts, ~90K transactions
        num_fraud_rings=50,         # 50 cyclic fraud patterns
        embedding_provider="fake",  # random vectors (fast, no model needed)
        workers=2,                  # 2 parallel processes
        output_dir="./output",
    )
    generator = FraudGraphGenerator(config)
    generator.run()

Run the pattern verification utility on a generated dataset:

    python -m gen_fraud_graph.verify --data-dir ./data

Output layout:

    data/
    ├── accounts/
    │   ├── accounts_0_0.csv        # Account nodes (worker 0, batch 0)
    │   └── accounts_1_0.csv        # Account nodes (worker 1, batch 0)
    ├── transactions/
    │   ├── transactions_0_0.csv    # Transaction edges (worker 0, batch 0)
    │   └── transactions_1_0.csv    # Transaction edges (worker 1, batch 0)
    └── fraud/
        ├── transactions_fraud.csv  # Fraud ring transaction edges
        └── fraud_cases.csv         # Fraud ring metadata (pattern_id, accounts, depth)
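
Because each worker and batch writes its own file, downstream loaders usually need to enumerate the shards first. The following is a minimal sketch that assumes the `<kind>_<worker>_<batch>.csv` naming shown in the layout above; `list_shards` is a hypothetical helper, and the demo runs against empty placeholder files in a temporary directory rather than real output.

```python
import re
import tempfile
from pathlib import Path

SHARD_RE = re.compile(r"^(accounts|transactions)_(\d+)_(\d+)\.csv$")


def list_shards(data_dir, kind):
    """Return (worker, batch, path) tuples for one shard kind, sorted numerically."""
    shards = []
    for path in Path(data_dir, kind).glob(f"{kind}_*.csv"):
        match = SHARD_RE.match(path.name)
        if match:  # ignore names that do not follow <kind>_<worker>_<batch>.csv
            shards.append((int(match.group(2)), int(match.group(3)), path))
    return sorted(shards)


# Demo on an empty stand-in layout; point data_dir at real output instead.
with tempfile.TemporaryDirectory() as data_dir:
    for name in ("accounts_0_0.csv", "accounts_1_0.csv", "accounts_10_0.csv"):
        target = Path(data_dir, "accounts", name)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    found = [(w, b, p.name) for w, b, p in list_shards(data_dir, "accounts")]
    print(found)
```

Sorting on the parsed integers keeps worker 10 after worker 1, which a plain filename sort would not guarantee once there are ten or more workers.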

accounts (`accounts_*.csv`)

| Column | Type | Description |
|---|---|---|
| `account_id` | string | Unique account identifier (`acc_0`, `acc_1`, ...) |
| `customer_name` | string | Synthetic customer name |
| `balance` | float | Account balance (100 – 100,000) |
| `risk_score` | float | Risk score (0.0 – 1.0) |
| `creation_date` | string | Account creation date |

transactions (`transactions_*.csv`)

| Column | Type | Description |
|---|---|---|
| `tx_id` | string | Unique transaction identifier |
| `src_id` | string | Source account |
| `dst_id` | string | Destination account |
| `amount` | float | Transaction amount (10 – 500 for normal, 9999 for fraud) |
| `timestamp` | string | Transaction timestamp |
| `description` | string | Transaction description |
| `embedding` | string | Pipe-separated embedding vector |
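
Since the `embedding` column is stored as a pipe-separated string, it has to be split before it can be used as a numeric vector. Here is a minimal sketch using only the standard library. It assumes comma-delimited files with a header row matching the column names above; the two sample rows, including the timestamp format, are made up for illustration.

```python
import csv
import io

# Two made-up rows that follow the documented column layout.
SAMPLE = """tx_id,src_id,dst_id,amount,timestamp,description,embedding
tx_0,acc_1,acc_2,42.5,2024-01-01T10:00:00,Grocery payment,0.1|0.2|0.3
tx_1,acc_2,acc_3,9999,2024-01-01T11:00:00,Urgent transfer,0.4|0.5|0.6
"""


def read_transactions(handle):
    """Yield one dict per row with amount and embedding converted to numbers."""
    for row in csv.DictReader(handle):
        row["amount"] = float(row["amount"])
        raw = row["embedding"]
        # An empty field is treated as "no embedding" (an assumption).
        row["embedding"] = [float(x) for x in raw.split("|")] if raw else []
        yield row


rows = list(read_transactions(io.StringIO(SAMPLE)))
print(rows[0]["embedding"])
print([r["tx_id"] for r in rows if r["amount"] == 9999])
```

For real shards, replace the `io.StringIO` handle with `open(path, newline="")`. The generator-based reader processes one row at a time, which matters at the larger scale factors.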

fraud_cases (`fraud/fraud_cases.csv`)

| Column | Type | Description |
|---|---|---|
| `pattern_id` | string | Pattern identifier (`pat_0`, `pat_1`, ...) |
| `start_acc_id` | string | First account in the ring |
| `pattern_type` | string | Always "cycle" |
| `depth` | int | Number of hops in the ring (4–7) |
| `involved_accounts` | string | Pipe-separated list of accounts |
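
One common use of `fraud_cases.csv` is to derive node-level labels for supervised training. The sketch below maps every account listed in `involved_accounts` to the patterns it belongs to. `fraud_account_labels` is a hypothetical helper, the sample rows are invented, and a comma-delimited file with a header row is assumed.

```python
import csv
import io

# Made-up rows following the documented fraud_cases columns.
SAMPLE = """pattern_id,start_acc_id,pattern_type,depth,involved_accounts
pat_0,acc_5,cycle,4,acc_5|acc_9|acc_2|acc_7
pat_1,acc_3,cycle,5,acc_3|acc_8|acc_1|acc_6|acc_4
"""


def fraud_account_labels(handle):
    """Map each account that appears in any ring to its list of pattern_ids."""
    labels = {}
    for row in csv.DictReader(handle):
        for account_id in row["involved_accounts"].split("|"):
            labels.setdefault(account_id, []).append(row["pattern_id"])
    return labels


labels = fraud_account_labels(io.StringIO(SAMPLE))
print(len(labels), "accounts flagged")
print(labels["acc_5"])
```

Any account missing from the resulting dictionary can be treated as a negative example. Keeping a list of pattern IDs per account allows for the possibility that one account takes part in more than one ring.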

Approximate dataset size by scale factor:

| Scale | Accounts | Transactions | Fraud Rings | Approx. Size |
|---|---|---|---|---|
| `0.0001` | 1,000 | 9,000 | 10 | ~2 MB |
| `0.001` | 10,000 | 90,000 | 10 | ~20 MB |
| `0.01` | 100,000 | 900,000 | 10 | ~200 MB |
| `0.1` | 1,000,000 | 9,000,000 | 100 | ~2 GB |
| `1.0` | 10,000,000 | 90,000,000 | 1,000 | ~20 GB |
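
The figures above scale linearly, so values for intermediate scale factors can be estimated with simple arithmetic. The sketch below extrapolates from the `1.0` row. `estimate_size` is a hypothetical helper, and the floor of 10 fraud rings is inferred from the table rather than taken from the generator's code, so treat the results as rough planning numbers.

```python
def estimate_size(scale):
    """Extrapolate the reference table linearly; all results are approximate."""
    return {
        "accounts": round(10_000_000 * scale),
        "transactions": round(90_000_000 * scale),
        # Ring counts in the table never drop below 10; this floor is inferred.
        "fraud_rings": max(10, round(1_000 * scale)),
        "size_gb": 20 * scale,
    }


for scale in (0.0001, 0.01, 0.5):
    print(scale, estimate_size(scale))
```

This is mainly useful for checking disk space before a long run, for example when choosing between `--scale 0.5` and `--scale 1.0` on a shared machine.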

Project structure:

    gen_fraud_graph/
    ├── src/gen_fraud_graph/
    │   ├── __init__.py             # Package entry point
    │   ├── cli.py                  # CLI (gen-fraud-graph command)
    │   ├── config.py               # Configuration dataclass
    │   ├── embeddings.py           # Embedding providers (fake/local/openai)
    │   ├── exporters.py            # CSV/ZIP output writers
    │   ├── generator.py            # Core 3-phase pipeline orchestrator
    │   ├── typologies.py           # Fraud ring generator
    │   └── verify.py               # Pattern verification utility
    ├── tests/
    │   └── test_generator.py       # Unit and integration tests
    ├── examples/
    │   └── basic_usage.py          # Minimal Python API example
    ├── .github/
    │   ├── workflows/              # CI (ci, codeql, dep-scan, license-check,
    │   │                           # pattern-check, cla, stale, release)
    │   ├── ISSUE_TEMPLATE/         # Bug + feature templates
    │   ├── PULL_REQUEST_TEMPLATE.md
    │   ├── dependabot.yml          # Weekly Python + Actions updates
    │   └── pattern-check-allowlist.txt
    ├── pyproject.toml              # Package metadata and tool config
    ├── LICENSE                     # Apache 2.0
    ├── NOTICE                      # Apache 2.0 attribution
    ├── CONTRIBUTING.md             # Contribution guidelines
    ├── CODE_OF_CONDUCT.md          # Contributor Covenant v2.1
    ├── SECURITY.md                 # Vulnerability disclosure policy
    ├── CODEOWNERS                  # Maintainer approvals
    └── CHANGELOG.md                # Release history

Core (always installed):
- Python >= 3.10
- NumPy >= 1.24
- Pandas >= 2.0
- tqdm >= 4.65
Optional:
- sentence-transformers >= 2.2, for `--provider local`
- openai >= 1.0, for `--provider openai`

We welcome contributions from the community. Please read our CONTRIBUTING.md before submitting a pull request.
By contributing, you agree to the terms of our Contributor License Agreement (CLA).
To report a security vulnerability, please follow the process described in SECURITY.md. Do not open a public issue for security vulnerabilities.
This project is licensed under the Apache License 2.0 — see the LICENSE file for details.
Copyright (c) 2026 Santander Group
SPDX-License-Identifier: Apache-2.0

If you use this tool in your research, please cite:

    @software{gen_fraud_graph,
      title   = {gen\_fraud\_graph: Synthetic Fraud Graph Generator},
      author  = {Santander AI Lab},
      year    = {2026},
      url     = {https://github.com/SantanderAI/gen-fraud-graph},
      license = {Apache-2.0}
    }