🚀 Now available on PyPI! Install with:
pip install calm-data-generator
CALM-Data-Generator is a comprehensive Python library for synthetic data generation with advanced features for:
- Clinical/Medical Data - Generate realistic patient demographics, genes, proteins
- Tabular Data Synthesis - CTGAN, TVAE, CART, and more
- Time Series - TimeGAN, DGAN
- Single-Cell - scVI, GEARS (Perturbation Prediction)
- Drift Injection - Test ML model robustness with controlled drift
- Privacy Assessment - DCR metrics for re-identification risk
- Scenario Evolution - Feature evolution and target construction
Calm-Data-Generator is optimized for structured tabular data. It is designed to handle:
- ✅ Classification (Binary & Multi-class)
- ✅ Regression (Continuous variables)
- ✅ Multi-label problems
- ✅ Clustering (Preserving natural groupings)
- ✅ Time Series (Temporal correlations and patterns)
- ✅ Single-Cell / Genomics (scRNA-seq expression data)
Important
This library is NOT designed for unstructured data such as Images, Videos, or Audio. It does not include Computer Vision or Signal Processing models.
CALM-Data-Generator is not just another synthetic data tool; it is a unified ecosystem that brings together the best open-source libraries under a single, consistent API.
Instead of learning and managing multiple complex libraries separately, CALM-Data-Generator provides:
- One API for 15+ synthesis methods from different sources (Synthcity, scvi-tools, GEARS, imbalanced-learn, etc.)
- Seamless interoperability between tabular, time-series, streaming, and genomic data generators
- Consistent configuration across all methods with automatic parameter validation
- Integrated reporting with YData Profiling for all generation methods
The DriftInjector module is one of the most comprehensive drift simulation tools available:
- 14+ drift types: Feature drift (gradual, abrupt, incremental, recurrent), label drift, concept drift, correlation drift, outlier injection, and more
- Correlation-aware drift: Propagate realistic drift across correlated features (e.g., increase income → increase spending)
- Multi-modal drift profiles: Sigmoid, linear, cosine transitions for gradual drift
- Conditional drift: Apply drift only to specific data subsets based on business rules
- Integrated with generators: Inject drift directly during synthesis or post-hoc on existing data
- Perfect for MLOps: Test data drift monitoring, concept drift detection, and model robustness before production
In summary: While other tools focus on a single approach (e.g., just GANs, just statistical methods), CALM-Data-Generator unifies the ecosystem and adds production-grade drift simulation that most libraries don't offer.
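The transition profiles listed above (sigmoid, linear, cosine) can be pictured as a weight that grows from 0 to 1 across a window of rows. The sketch below is only an illustration of that idea, not the DriftInjector implementation: the function name `transition_weight` and the sigmoid steepness constant are assumptions, while `center` and `width` borrow the parameter names used in the drift configuration elsewhere in this README.

```python
import math

def transition_weight(i, center, width, profile="sigmoid"):
    """Return a weight in [0, 1]: how far the drift has progressed at row i.

    Hypothetical helper for illustration; assumes width > 0.
    """
    start = center - width / 2
    # Position inside the transition window, clamped to [0, 1].
    t = min(1.0, max(0.0, (i - start) / width))
    if profile == "linear":
        return t
    if profile == "cosine":
        return 0.5 * (1.0 - math.cos(math.pi * t))
    if profile == "sigmoid":
        # Logistic curve centred on `center`; the factor 8 is an assumed
        # steepness that takes the weight from about 0.02 to about 0.98
        # across the window. The exponent is clamped to avoid overflow.
        z = max(-50.0, min(50.0, 8.0 * (i - center) / width))
        return 1.0 / (1.0 + math.exp(-z))
    raise ValueError(f"unknown profile: {profile}")

# Weight at a few row positions for a window of 200 rows centred on row 500.
rows = (0, 400, 500, 600, 999)
weights = {p: [transition_weight(i, 500, 200, p) for i in rows]
           for p in ("linear", "cosine", "sigmoid")}
```

A drifted value at row `i` could then be written as `original + weight * shift`, so rows before the window stay close to the original distribution and rows after it carry the full shift.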
This library leverages and unifies best-in-class open-source tools to provide a seamless data generation experience:
- Synthcity: The core engine for tabular deep learning models (CTGAN, TVAE) and privacy. Included by default.
- River: Powers the streaming generation capabilities ([stream] extra).
- YData Profiling: Generates comprehensive automated quality reports.
Calm-Data-Generator includes a set of Presets designed to give you a head start. These are pre-configured settings for common use cases.
Tip
Presets are Baselines: They are intended to be used as a starting point. You should import a preset, modify it to fit your specific data needs (e.g., change columns, tweak thresholds), and then pass it to the generator.
Located in calm_data_generator/presets/, easy to import and iterate upon:
- FastPrototypePreset: Optimized for speed (fewer epochs, simple models) to test pipelines quickly.
- HighFidelityPreset: Tuned for maximum quality (CTGAN/TVAE with more training) for production data.
- ClinicalDataGenerator (and presets): Specialized for healthcare data (demographics + omics).
- ImbalancePreset: Configured to handle and rebalance highly skewed datasets.
- TimeSeriesPreset: Set up for sequential data generation.
Example Usage:
from calm_data_generator.presets import FastPrototypePreset
from calm_data_generator import RealGenerator
# 1. Load the preset config
config = FastPrototypePreset
# 2. Modify it for your specific case
config["epochs"] = 50 # Overwrite default
# 3. Use it
gen = RealGenerator()
gen.generate(data, **config)

| Library | Role | Usage in Calm-Data-Generator |
|---|---|---|
| Synthcity | Deep Learning Engine | Powers CTGAN, TVAE, DDPM, TimeGAN. Handles privacy and fidelity. |
| scvi-tools | Single-Cell Analysis | Powers scvi method for high-dimensional genomic/transcriptomic data. |
| GEARS | Graph Perturbation | Powers gears method for predicting single-cell perturbation effects. |
| River | Streaming ML | Powers StreamGenerator for concept drift simulation and real-time data flow. |
| YData Profiling | Reporting | Generates automated quality reports (QualityReporter). |
| Pydantic | Validation | Ensures strict type checking and configuration management. |
| PyTorch | Backend | Underlying tensor computation for all deep learning models. |
| Copulae | Statistical Modeling | Powers the copula method for multivariate dependence modeling. |
A key advantage of Calm-Data-Generator is enabling the use of private data in public or collaborative environments:
- Private Origin: You start with sensitive data (e.g., GDPR/HIPAA restricted) that cannot leave your secure environment.
- Synthetic Twin: The library generates a synthetic dataset that statistically mirrors the original but contains no real individuals.
- Safe Distribution: Once validated (using QualityReporter's privacy checks), this synthetic dataset allows for risk-free sharing, model training, and testing without exposing confidential information.
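To make the privacy check more concrete, here is a minimal sketch of a DCR-style measurement, reading DCR as Distance to Closest Record. It is not the QualityReporter code: the function name `distance_to_closest_record` is hypothetical, the rows are made-up values, and the sketch assumes all features are numeric and already scaled to comparable ranges.

```python
import math

def distance_to_closest_record(synthetic_rows, real_rows):
    """For each synthetic row, return the Euclidean distance to its nearest real row."""
    return [min(math.dist(s, r) for r in real_rows) for s in synthetic_rows]

# Illustrative rows with two features already scaled to [0, 1].
real = [(0.2, 0.4), (0.6, 0.7), (0.1, 0.3)]
synthetic = [(0.2, 0.4), (0.5, 0.6)]

dcr = distance_to_closest_record(synthetic, real)

# A distance of exactly zero means a synthetic row reproduces a real record.
exact_copies = sum(1 for d in dcr if d == 0.0)
```

Distances near zero suggest that a synthetic row is a copy or near copy of a real individual, which is the situation a re-identification check is meant to flag before a dataset is shared.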
- MLOps Monitoring Validation: Use StreamGenerator and DriftInjector to simulate data drift (gradual, abrupt) and verify if your monitoring alerts trigger correctly before deployment.
- Biomedical Research (HealthTech): Generate synthetic patient cohorts with ClinicalDataGenerator that preserve complex biological correlations (e.g., gene-age relationships) for collaborative studies without compromising patient privacy.
- Stress Testing ("What-If" Analysis): Use ScenarioInjector to simulate future scenarios (e.g., "What if the customer age base increases by 10 years?") and measure model performance degradation under stress.
- Correlation-Aware Drift: Inject drift that realistically propagates to correlated features (e.g., increasing income also proportionally increases spending) using the correlations=True parameter.
- Development Data: Provide developers with high-fidelity synthetic replicas of production databases, allowing them to build and test features safely without accessing sensitive real-world data.
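How the library propagates drift between correlated features is not described in this README. One simple way to picture the idea, assuming a linear relationship between two columns, is to shift the follower column by the least-squares slope times the shift applied to the driver column. The function name `propagate_shift` and the sample values below are hypothetical.

```python
from statistics import mean

def propagate_shift(driver, follower, shift):
    """Shift `driver` by `shift` and move `follower` by what a linear fit predicts."""
    mx, my = mean(driver), mean(follower)
    cov = sum((x - mx) * (y - my) for x, y in zip(driver, follower))
    var = sum((x - mx) ** 2 for x in driver)
    slope = cov / var  # least-squares slope of follower on driver
    new_driver = [x + shift for x in driver]
    new_follower = [y + slope * shift for y in follower]
    return new_driver, new_follower

# Made-up data in which spending is 0.4 times income.
income = [30.0, 40.0, 50.0, 60.0]
spending = [12.0, 16.0, 20.0, 24.0]

new_income, new_spending = propagate_shift(income, spending, shift=10.0)
```

With these values the fitted slope is 0.4, so raising income by 10 raises spending by about 4 and the original relationship between the two columns is preserved.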
Minimalist view of the system's core components and data flow.
Warning
The installation might be heavy (~2-3 GB) and take a few minutes depending on your internet connection. We strongly recommend using a fresh virtual environment.
- GitHub (Recommended for latest features): The main branch contains the most up-to-date version with the latest bug fixes and features.
- PyPI (Stable): Releases on PyPI are stable versions updated less frequently for major changes.
The library is available on PyPI. For the most stable experience, we recommend using a virtual environment:
# 1. Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate
# 2. Upgrade pip, setuptools and wheel (Crucial for successful installation)
pip install --upgrade pip setuptools wheel
# 3. Install the core library (optimized for speed)
pip install calm-data-generator
Depending on your needs, you can install additional capabilities:
# For Stream Generator (River)
pip install "calm-data-generator[stream]"
# Full suite
pip install "calm-data-generator[full]"
Note
Performance Note: We have optimized the dependency tree in version 1.0.0 by pinning specific versions of pydantic, xgboost, and cloudpickle. This significantly reduces the initial installation time from ~40 minutes to just a few minutes. 🚀
From source (GitHub - Latest Updates): Use this method to get the latest bug fixes and features not yet released on PyPI.
# Option A: Install directly from GitHub
pip install git+https://github.com/AlejandroBeldaFernandez/Calm-Data_Generator.git
# Option B: Clone and install (for development)
git clone https://github.com/AlejandroBeldaFernandez/Calm-Data_Generator.git
cd Calm-Data_Generator
pip install .
Zsh shell (macOS/Linux): If brackets cause errors, use quotes:
pip install "calm-data-generator[stream]"
River compilation errors (Linux/macOS):
# Ubuntu/Debian
sudo apt install build-essential python3-dev
# macOS
xcode-select --install
# Then retry
pip install calm-data-generator
Windows users: Install Visual Studio Build Tools first:
- Download Visual Studio Build Tools
- Install "Desktop development with C++"
- Then retry installation
PyTorch CPU-only (no GPU):
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install calm-data-generator
Dependency conflicts: Use a clean virtual environment:
python -m venv venv
source venv/bin/activate # Linux/macOS
# or: venv\Scripts\activate # Windows
pip install calm-data-generator
from calm_data_generator import RealGenerator
import pandas as pd
# Your real dataset (can be a DataFrame, path to .csv, .h5, or .h5ad)
data = pd.read_csv("your_data.csv") # or "your_data.h5ad"
# Initialize generator
gen = RealGenerator()
# Generate 1000 synthetic samples using CTGAN
synthetic = gen.generate(
    data=data,
    n_samples=1000,
    method='ctgan',
    target_col='label',
    epochs=300,
    batch_size=500,
    discriminator_steps=1,
)
print(f"Generated {len(synthetic)} samples")
Methods with GPU support:
| Method | GPU Support | Parameter |
|---|---|---|
| `ctgan`, `tvae` | ✅ CUDA/MPS | `enable_gpu=True` |
| `diffusion` | ✅ PyTorch | Auto-detected |
| `ddpm` | ✅ PyTorch + Synthcity | Auto-detected |
| `timegan` | ✅ PyTorch + Synthcity | Auto-detected |
| `timevae` | ✅ PyTorch + Synthcity | Auto-detected |
| `smote`, `adasyn`, `cart`, `rf`, `lgbm`, `gmm`, `copula` | ❌ CPU only | - |
synthetic = gen.generate(
    data=data,
    n_samples=1000,
    method='ctgan',
    epochs=300,
    enable_gpu=True,
)
Note: Requires PyTorch with CUDA support:
pip install torch --index-url https://download.pytorch.org/whl/cu118
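Before setting `enable_gpu=True`, it can help to check which accelerator the local PyTorch build actually exposes. The sketch below is a suggestion rather than part of the library: `pick_device` is a hypothetical helper, and it falls back to CPU when PyTorch is not installed or has no CUDA/MPS support.

```python
def pick_device():
    """Return "cuda", "mps", or "cpu" depending on the local PyTorch build."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is nothing to accelerate
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is absent on older PyTorch releases.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

device = pick_device()
use_gpu = device != "cpu"
```

The resulting `use_gpu` flag could then be passed as `enable_gpu=use_gpu` in the `gen.generate(...)` call shown above, so the same script runs on machines with and without a GPU.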
from calm_data_generator import ClinicalDataGenerator
from calm_data_generator.generators.configs import DateConfig
gen = ClinicalDataGenerator()
# Generate patient data with genes and proteins
result = gen.generate(
    n_samples=100,
    n_genes=500,
    n_proteins=200,
    date_config=DateConfig(start_date="2024-01-01")
)
demographics = result['demographics']
genes = result['genes']
proteins = result['proteins']
Option 1: Directly from generate() (recommended)
from calm_data_generator import RealGenerator
gen = RealGenerator()
# Generate synthetic data WITH drift in one call
synthetic = gen.generate(
    data=real_data,
    n_samples=1000,
    method='ctgan',
    target_col='label',
    drift_injection_config=[
        {
            "method": "inject_drift",
            "params": {
                "columns": ["age", "income", "label"],
                "drift_mode": "gradual",  # Auto-detects column types
                "drift_magnitude": 0.3,
                "center": 500,
                "width": 200
            }
        }
    ]
)
Option 2: Standalone DriftInjector
from calm_data_generator import DriftInjector
injector = DriftInjector()
# Unified drift injection (auto-detects types)
drifted_data = injector.inject_drift(
    df=data,
    columns=['feature1', 'feature2', 'status'],
    drift_mode='gradual',
    drift_magnitude=0.5,
    # Optional specific configs
    numeric_operation='shift',
    categorical_operation='frequency',
    boolean_operation='flip'
)
**Available drift methods:** `inject_feature_drift`, `inject_feature_drift_gradual`, `inject_feature_drift_incremental`, `inject_feature_drift_recurrent`, `inject_label_drift`, `inject_concept_drift`, `inject_categorical_frequency_drift`, and more. See [DRIFT_INJECTOR_REFERENCE.md](calm_data_generator/docs/DRIFT_INJECTOR_REFERENCE.md).
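The exact semantics of the `shift`, `frequency`, and `flip` operations are documented in the reference linked above. As a rough mental model only, the sketch below shows one plausible way to dispatch on column type and perturb each kind of column. The function `drift_column`, its interpretation of `magnitude`, and the sample table are all assumptions made for illustration.

```python
import random

def drift_column(values, magnitude, rng):
    """Apply a type-appropriate perturbation to one column (a list of values)."""
    sample = values[0]
    # bool must be tested before int/float because bool is a subclass of int.
    if isinstance(sample, bool):
        # Assumed 'flip': invert each value with probability `magnitude`.
        return [(not v) if rng.random() < magnitude else v for v in values]
    if isinstance(sample, (int, float)):
        # Assumed 'shift': move every value by `magnitude` standard deviations.
        mean = sum(values) / len(values)
        std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
        return [v + magnitude * std for v in values]
    # Assumed 'frequency': reassign a fraction of rows to the rarest category.
    counts = {c: values.count(c) for c in set(values)}
    rarest = min(sorted(counts), key=counts.get)
    return [rarest if rng.random() < magnitude else v for v in values]

rng = random.Random(42)  # seeded so the sketch is reproducible
table = {
    "age": [30, 40, 50],
    "status": ["ok", "ok", "late"],
    "active": [True, False, True],
}
drifted = {name: drift_column(col, 0.5, rng) for name, col in table.items()}
```

Here a single magnitude value drives all three operations, which mirrors how one `drift_magnitude` is passed for a mixed list of columns in the examples above.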
### Single-Cell / Gene Expression Data
Generate synthetic single-cell RNA-seq-like data using specialized VAE models:
```python
from calm_data_generator import RealGenerator

gen = RealGenerator()

# scVI: Generate new cells from scratch
synthetic = gen.generate(
    data="expression_data.h5ad",  # Paths to .h5 or .h5ad are supported directly
    n_samples=1000,
    method='scvi',
    target_col='cell_type',
    epochs=100,
    n_latent=10,
)
```
| Method | Use Case |
|---|---|
| `scvi` | Generate new cells from learned distribution |
from calm_data_generator import StreamGenerator
stream_gen = StreamGenerator()
# Generate a data stream with Concept Drift
stream_data = stream_gen.generate(
    n_chunks=10,
    chunk_size=1000,
    concept_drift=True,  # Simulate concept drift over time
    n_features=10
)
print(f"Generated stream with {len(stream_data)} total samples")
from calm_data_generator import QualityReporter
# Generate a quality report comparing real vs synthetic data
reporter = QualityReporter()
reporter.generate_report(
    real_data=data,
    synthetic_data=synthetic,
    output_dir="./quality_report",
    target_col="target"
)
# Report saved to ./quality_report/report.html

| Module | Import | Description |
|---|---|---|
| Tabular | `generators.tabular` | RealGenerator, QualityReporter |
| Clinical | `generators.clinical` | ClinicalDataGenerator, ClinicalDataGeneratorBlock |
| Stream | `generators.stream` | StreamGenerator, StreamBlockGenerator |
| Blocks | `generators.tabular` | RealBlockGenerator |
| Drift | `generators.drift` | DriftInjector |
| Dynamics | `generators.dynamics` | ScenarioInjector |
| Reports | `reports` | Visualizer |
| Method | Type | Description | Requirements / Notes |
|---|---|---|---|
| `cart` | ML | CART-based iterative synthesis (fast) | Base installation |
| `rf` | ML | Random Forest synthesis | Base installation |
| `lgbm` | ML | LightGBM-based synthesis | Base installation (Requires lightgbm) |
| `ctgan` | DL | Conditional GAN for tabular data | Requires synthcity |
| `tvae` | DL | Variational Autoencoder | Requires synthcity |
| `diffusion` | DL | Tabular Diffusion (custom, fast) | Base installation (PyTorch) |
| `ddpm` | DL | Synthcity TabDDPM (advanced) | Requires synthcity |
| `timegan` | Time Series | TimeGAN for sequential data | Requires synthcity |
| `timevae` | Time Series | TimeVAE for sequential data | Requires synthcity |
| `smote` | Augmentation | SMOTE oversampling | Base installation |
| `adasyn` | Augmentation | ADASYN adaptive sampling | Base installation |
| `copula` | Copula | Copula-based synthesis | Base installation |
| `gmm` | Statistical | Gaussian Mixture Models | Base installation |
| `scvi` | Single-Cell | scVI (Variational Inference) for RNA-seq | Requires scvi-tools |
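For readers unfamiliar with the augmentation rows in the table, the core idea behind SMOTE-style oversampling is interpolation between a minority-class point and one of its nearest minority neighbours. The sketch below is a simplified illustration of that single step, not the imbalanced-learn implementation that the library relies on. The helper name `smote_like_sample` and the sample points are hypothetical.

```python
import math
import random

def smote_like_sample(minority, k, rng):
    """Create one synthetic point between a minority point and a near neighbour."""
    base = rng.choice(minority)
    # k nearest neighbours of `base` among the other minority points.
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment from base to neighbour
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))

rng = random.Random(0)  # seeded so the sketch is reproducible
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = [smote_like_sample(minority, k=2, rng=rng) for _ in range(5)]
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region the minority class already occupies.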
# List all tutorials
calm-data-generator tutorials
# Show a specific tutorial
calm-data-generator tutorials show 1
# Run a tutorial
calm-data-generator tutorials run 1
# Show version
calm-data-generator version

| # | Tutorial | Description |
|---|---|---|
| 1 | Real Generator | Tabular data synthesis |
| 2 | Clinical Generator | Clinical/medical data |
| 3 | Drift Injector | Drift injection for ML |
| 4 | Stream Generator | Stream-based generation |
| 5 | Scenario Injector | Feature evolution |
Explore the full documentation in the calm_data_generator/docs/ directory:
| Document | Description |
|---|---|
| DOCUMENTATION.md | Main User Guide. Comprehensive manual covering all modules, concepts, and advanced usage. |
| REAL_GENERATOR_REFERENCE.md | API Reference for RealGenerator. Detailed parameters for all synthesis methods (ctgan, lgbm, scvi, etc.). |
| DRIFT_INJECTOR_REFERENCE.md | API Reference for DriftInjector. Guide to using inject_drift and specialized drift capabilities. |
| STREAM_GENERATOR_REFERENCE.md | API Reference for StreamGenerator. Details on stream simulation and drift integration. |
| CLINICAL_GENERATOR_REFERENCE.md | API Reference for ClinicalDataGenerator. Configuration for genes, proteins, and patient data. |
| API.md | Technical API Index. High-level index of classes and functions. |
MIT License - see LICENSE file
We stand on the shoulders of giants. This library is possible thanks to these amazing open-source projects:
- Synthcity (Apache 2.0) - The engine behind our deep learning models.
- River (BSD-3-Clause) - Powering our streaming capabilities.
- YData Profiling (MIT) - Providing comprehensive data reporting.
- scvi-tools (BSD-3-Clause) - Enabling single-cell analysis.
- GEARS (MIT) - Supporting graph-based perturbation prediction.
- Imbalanced-learn (MIT) - Providing SMOTE and ADASYN implementations.
- SDMetrics (MIT) - Powering the standardized metrics in our QualityReporter.
- Copulae (MIT) - Enabling multivariate dependence modeling via Gaussian Copulas.
- AnnData (BSD-3-Clause) - Providing the core data structure for single-cell and omics integration.
- LightGBM (MIT) - Powering our gradient boosting synthesis method.
- PyTorch (BSD-3-Clause) - The deep learning framework powering our generative models.
- PyTorch Geometric (MIT) - Enabling Graph Neural Network operations for relational data.
- XGBoost (Apache-2.0) - Providing optimized distributed gradient boosting.
- Opacus (Apache-2.0) - Enabling differential privacy training for PyTorch models.
- Hugging Face Hub (Apache-2.0) - Facilitating model sharing and versioning.
- Plotly (MIT) - Enabling interactive data visualizations.
