Merged
70 changes: 70 additions & 0 deletions docs/evaluation_standards.md
@@ -235,3 +235,73 @@ Nightly runs re-evaluate models on:
- Current test set (detect data drift)
- New edge cases (expanding test coverage)
- Robustness sweeps (track stability)

## Synthetic Tabular Data Target Standard

This section defines the target evaluation standard for the planned `projects/synthetic_data_tabular/` project. It describes the intended evaluation protocol and does not imply that synthetic data generators, metrics, notebooks, MLflow runs, or reports are currently implemented.

### Utility Metrics

Synthetic tabular data should be evaluated by downstream task performance, not only by visual similarity.

Target utility checks:

- Train-on-synthetic-test-on-real (TSTR)
- Train-on-real-test-on-synthetic (TRTS)
- Train-on-real-test-on-real (TRTR) reference baseline
- Classification metrics such as ROC-AUC, PR-AUC, F1, Brier score, and calibration where applicable
- Regression metrics such as RMSE, MAE, and interval coverage where applicable
- Slice-level utility for missingness bands, rare categories, outlier bands, and class imbalance
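
As a minimal sketch of how the three train/test pairings relate, the snippet below scores a toy nearest-centroid classifier under each one. The classifier, the `(features, label)` row shape, and the function names are illustrative assumptions, not part of the planned project; a real run would plug in a proper model and the classification or regression metrics listed above.

```python
from collections import defaultdict

# Illustrative row shape: (numeric feature tuple, integer class label).
Row = tuple[tuple[float, ...], int]


def fit_centroids(rows: list[Row]) -> dict[int, tuple[float, ...]]:
    """Toy stand-in model: the mean feature vector of each class."""
    grouped: dict[int, list[tuple[float, ...]]] = defaultdict(list)
    for features, label in rows:
        grouped[label].append(features)
    return {
        label: tuple(sum(col) / len(col) for col in zip(*vectors))
        for label, vectors in grouped.items()
    }


def predict(centroids: dict[int, tuple[float, ...]], features: tuple[float, ...]) -> int:
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(label: int) -> float:
        return sum((a - b) ** 2 for a, b in zip(centroids[label], features))
    return min(centroids, key=sq_dist)


def accuracy(centroids: dict[int, tuple[float, ...]], rows: list[Row]) -> float:
    hits = sum(predict(centroids, features) == label for features, label in rows)
    return hits / len(rows)


def utility_table(real_train: list[Row], real_test: list[Row], synthetic: list[Row]) -> dict[str, float]:
    """Score the reference baseline and both cross pairings with the same model family."""
    real_model = fit_centroids(real_train)
    synth_model = fit_centroids(synthetic)
    return {
        "TRTR": accuracy(real_model, real_test),   # train real, test real (reference)
        "TSTR": accuracy(synth_model, real_test),  # train synthetic, test real
        "TRTS": accuracy(real_model, synthetic),   # train real, test synthetic
    }


# Tiny hand-made rows, purely for illustration.
real_train = [((0.0, 0.0), 0), ((0.2, 0.1), 0), ((1.0, 1.0), 1), ((0.9, 1.1), 1)]
real_test = [((0.1, 0.0), 0), ((1.0, 0.9), 1)]
synthetic = [((0.1, 0.1), 0), ((1.1, 1.0), 1)]
scores = utility_table(real_train, real_test, synthetic)
```

Reading TSTR against the TRTR reference, rather than in isolation, is what shows how much downstream signal the synthetic data loses.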

### Fidelity Metrics

Fidelity checks should measure whether synthetic data preserves useful structure, without assuming that a closer match to the real data is always safer.

Target fidelity checks:

- Marginal distributions for numerical and categorical columns
- Category frequency preservation, especially rare categories
- Pairwise correlations and dependency structure
- Missing-value pattern similarity
- Target distribution preservation
- Comparison of real and synthetic feature interactions
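
One possible way to quantify categorical marginal fidelity and rare-category preservation is total variation distance plus a check for rare real categories that never appear in the synthetic column. This is a sketch under assumptions: the function names and the `max_share` cutoff are hypothetical, and both inputs are assumed to be non-empty.

```python
from collections import Counter


def category_frequencies(values: list[str]) -> dict[str, float]:
    """Relative frequency of each category in a non-empty column."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(real: list[str], synthetic: list[str]) -> float:
    """Half the L1 distance between the two category distributions (0 = identical, 1 = disjoint)."""
    p = category_frequencies(real)
    q = category_frequencies(synthetic)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def missing_rare_categories(real: list[str], synthetic: list[str], max_share: float = 0.05) -> list[str]:
    """Real categories with share <= max_share that are absent from the synthetic column."""
    seen = set(synthetic)
    return sorted(
        category
        for category, share in category_frequencies(real).items()
        if share <= max_share and category not in seen
    )


# Toy columns: the synthetic column has dropped the minority category "b".
real_col = ["a"] * 8 + ["b"] * 2
synth_col = ["a"] * 10
tvd = total_variation_distance(real_col, synth_col)
dropped = missing_rare_categories(real_col, synth_col, max_share=0.25)
```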

### Privacy Metrics

Synthetic data privacy evaluation should use multiple risk indicators.

Target privacy checks:

- Distance to Closest Record (DCR)
- Duplicate and near-duplicate detection
- Membership inference attack simulation
- Singling-out and outlier-risk analysis using rare real-data equivalence classes
- Optional l-diversity and t-closeness style checks for sensitive attributes within quasi-identifier groups

These checks are risk indicators and do not constitute a formal differential privacy guarantee.

### Validity Metrics

Synthetic rows must satisfy the declared dataset contract before utility or privacy scores are trusted.

Target validity checks:

- Schema conformity
- Parquet type preservation
- Allowed category validation
- Range constraints
- Business-rule constraints
- Missing-value semantics
- Target leakage checks
- Train/test split isolation checks
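
A row-level validity check could look roughly like the sketch below. The `CONTRACT` dictionary layout and the rule keys (`type`, `nullable`, `allowed`, `min`, `max`) are hypothetical; the actual dataset contract format has not been defined yet.

```python
from typing import Any

# Hypothetical contract fragment, for illustration only.
CONTRACT: dict[str, dict[str, Any]] = {
    "age": {"type": int, "nullable": False, "min": 0, "max": 120},
    "plan": {"type": str, "nullable": False, "allowed": {"free", "pro"}},
    "income": {"type": float, "nullable": True, "min": 0.0},
}


def validate_row(row: dict[str, Any], contract: dict[str, dict[str, Any]]) -> list[str]:
    """Return human-readable violations for one row; an empty list means the row is valid."""
    errors = [f"{name}: unexpected column" for name in sorted(set(row) - set(contract))]
    for name, rule in contract.items():
        if name not in row:
            errors.append(f"{name}: missing column")
            continue
        value = row[name]
        if value is None:
            # Null handling comes first so missing-value semantics stay explicit.
            if not rule.get("nullable", False):
                errors.append(f"{name}: null not allowed")
            continue
        if type(value) is not rule["type"]:  # strict: bool is not accepted as int
            errors.append(f"{name}: expected {rule['type'].__name__}, got {type(value).__name__}")
            continue
        if "allowed" in rule and value not in rule["allowed"]:
            errors.append(f"{name}: category {value!r} not allowed")
        if "min" in rule and value < rule["min"]:
            errors.append(f"{name}: {value} below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{name}: {value} above maximum {rule['max']}")
    return errors


good = validate_row({"age": 34, "plan": "pro", "income": None}, CONTRACT)
bad = validate_row({"age": -1, "plan": "gold", "income": 10.0}, CONTRACT)
```

Running a gate like this before any scoring keeps invalid rows from inflating or deflating utility and privacy numbers.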

### Target Artifacts

Future implemented synthetic tabular experiments should produce:

- `metrics.json` with utility, fidelity, privacy, and validity summaries
- Slice-level utility outputs
- Validity report for generated datasets
- Privacy risk report
- Fidelity plots or tables
- Reproducible generator run metadata
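
A possible shape for `metrics.json` is one top-level key per metric family. The nested key names below are illustrative assumptions, and every value is `None` because no results exist yet.

```python
import json


def build_metrics_summary(utility: dict, fidelity: dict, privacy: dict, validity: dict) -> str:
    """Serialize the four summary sections into one JSON document with stable key order."""
    summary = {
        "utility": utility,
        "fidelity": fidelity,
        "privacy": privacy,
        "validity": validity,
    }
    return json.dumps(summary, indent=2, sort_keys=True)


# None marks values that have not been measured; these are placeholders, not results.
text = build_metrics_summary(
    utility={"tstr_roc_auc": None, "trtr_roc_auc": None},
    fidelity={"max_marginal_tvd": None},
    privacy={"dcr_flagged_rows": None},
    validity={"contract_violations": None},
)
```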
2 changes: 1 addition & 1 deletion projects/bayesian_optimization/README.md
@@ -53,7 +53,7 @@ Implement and compare Bayesian optimization frameworks for sample-efficient hype
## References

- [Optuna Documentation](https://optuna.readthedocs.io/)
- [BoTorch Tutorials](https://botorch.org/tutorials/)
- [BoTorch Tutorials](https://botorch.org/docs/tutorials)
- [Bayesian Optimization Book](https://bayesoptbook.com/)
- [AutoML Book Chapter](https://www.automl.org/book/)

3 changes: 3 additions & 0 deletions projects/quantum_ml/README.md
@@ -139,6 +139,9 @@ quantum_ml/
- [Cirq Documentation](https://quantumai.google/cirq)

### Papers
- [Qiskit Machine Learning: an open-source library for quantum machine learning tasks at scale on quantum hardware and classical simulators](https://arxiv.org/html/2505.17756v1)
- [Quantum Machine Learning](https://arxiv.org/abs/1611.09347) - Schuld & Killoran

- [Variational Quantum Eigensolver (VQE)](https://arxiv.org/abs/1304.3061) - Peruzzo et al.
- [Quantum Machine Learning](https://arxiv.org/abs/1611.09347) - Schuld & Killoran
- [Barren Plateaus in QML](https://arxiv.org/abs/1803.11173) - McClean et al.
176 changes: 155 additions & 21 deletions projects/synthetic_data_tabular/README.md
@@ -1,29 +1,163 @@
# Synthetic Tabular Data

**Status**: 📋 Planned
**Status**: 📋 Planned priority track

Experimental evaluation of synthetic data generation for tabular datasets, focusing on utility vs privacy tradeoffs.
## Current State

## Planned Features
This project is currently in the planning and design phase. No runnable scripts, notebooks, trained generators, synthetic datasets, MLflow runs, privacy reports, or benchmark results are implemented yet.

- Generator comparison:
  - CTGAN baseline
  - Modern alternatives (TabDDPM, etc.)
  - Use Diffusion Models to generate tabular data
- Utility evaluation:
  - Train-on-synthetic-test-on-real (TSTR)
  - Train-on-real-test-on-synthetic (TRTS)
  - Distributional metrics (marginals, pairwise correlations)
- Privacy risk assessment:
  - Nearest neighbor distance leakage
  - Membership inference attack simulation
- Decision framework for when to use synthetic data
The goal is to build this project around an evaluation-first question:

## Coming Soon
> When does synthetic tabular data preserve useful downstream signal without creating unacceptable privacy risk?

This project will demonstrate:
## Purpose

- Practical synthetic data evaluation
- Privacy-utility tradeoff analysis
- Guidance on when synthetic data helps vs harms
- Reproducible experimental framework
This project will evaluate synthetic data generation for tabular datasets through a utility, fidelity, privacy, and validity lens. The emphasis is not only on trying modern generators, but on building an evaluation ladder that makes it clear when a synthetic dataset is useful, when it is misleading, and when it may create privacy risk.

## Planned Evaluation Ladder

### Phase 0: Design and Contracts

- Define dataset contracts for tabular inputs and generator outputs.
- Define train/validation/test split ownership.
- Define utility, fidelity, privacy, and validity metrics.
- Document generator isolation strategy for dependency-sensitive tooling.

### Phase 1: Simple Baselines

Planned simple baselines:

- Bootstrap / row resampling
- Independent marginal sampling, for example with [DataSynthesizer](https://github.com/DataResponsibly/DataSynthesizer) or [SDV](https://github.com/sdv-dev/SDV) classical models
- Gaussian copula or other classical statistical synthesizers

These baselines will establish whether more complex generators provide value beyond simple distributional approximations.
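
To make the first two baselines concrete, here is a minimal stdlib sketch, assuming rows are plain tuples; the function names are hypothetical and neither DataSynthesizer nor SDV is used. Bootstrap resampling copies real rows verbatim, so it should score well on fidelity and badly on privacy. Independent marginal sampling keeps each column's distribution but discards cross-column dependencies.

```python
import random


def bootstrap_rows(rows: list[tuple], n: int, seed: int) -> list[tuple]:
    """Sample n whole rows with replacement; every output row is an exact real row."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in range(n)]


def independent_marginal_rows(rows: list[tuple], n: int, seed: int) -> list[tuple]:
    """Sample each column independently from its empirical distribution."""
    rng = random.Random(seed)
    columns = list(zip(*rows))  # transpose rows into columns
    return [tuple(rng.choice(column) for column in columns) for _ in range(n)]


# Toy rows, purely for illustration.
real_rows = [(1, "a"), (2, "b"), (3, "c")]
boot = bootstrap_rows(real_rows, n=5, seed=0)
indep = independent_marginal_rows(real_rows, n=5, seed=0)
```

Both functions take an explicit seed so that a baseline run can be reproduced exactly.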

### Phase 2: Neural Baselines

Planned neural baselines:

- Autoencoder or variational autoencoder style generator
- [CTGAN](https://github.com/sdv-dev/CTGAN)
- TVAE or similar tabular VAE baseline, likely through SDV tooling

These models are planned as practical synthetic-data baselines, likely through isolated generator environments when dependency compatibility requires it.

### Phase 3: Modern Generators

Planned modern approaches:

- Tabular diffusion models such as TabDDPM-style methods
- [Synthcity](https://github.com/vanderschaarlab/synthcity)-style generator comparisons, if dependency support is practical

These approaches are planned only after the evaluation protocol and simple baselines are stable.

## Planned Data Contract Architecture

The main repository targets Python 3.13. Some synthetic data libraries may lag behind current Python versions or require incompatible dependency sets.

To avoid forcing the main repository environment around one generator library, this project plans to use an isolated generator architecture:

- The main evaluation pipeline stays in the repository's Python 3.13 environment.
- Complex generators may run in isolated `uv` environments, Docker images, or other reproducible execution contexts.
- Generators communicate with the evaluation core through a strict tabular data contract.
- The data contract uses Parquet files plus a metadata schema file rather than ad hoc CSV exchange, so that data types such as dates and nullable integers are preserved and silent schema drift is avoided.

Planned contract files:

- `train.parquet`: real training data made available to the generator.
- `dataset_contract.yaml`: schema, roles, constraints, target definition, split policy, and privacy-relevant metadata.
- `synthetic.parquet`: generated synthetic rows returned by the generator.
- `run_manifest.yaml`: generator name, version, seed, input row count, output row count, and environment metadata.

The generator must not receive validation or test rows unless a future experiment explicitly documents that choice.
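
The fields listed for `run_manifest.yaml` could be modelled as below before serialization. This is a sketch: the class and field names and the contents of `environment` are assumptions, and writing the YAML file itself is left out because it would need a third-party library.

```python
import platform
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical in-memory form of run_manifest.yaml."""
    generator_name: str
    generator_version: str
    seed: int
    input_row_count: int
    output_row_count: int
    environment: dict[str, str] = field(default_factory=dict)


def make_manifest(name: str, version: str, seed: int, n_in: int, n_out: int) -> RunManifest:
    """Capture run parameters plus basic interpreter and OS information."""
    return RunManifest(
        generator_name=name,
        generator_version=version,
        seed=seed,
        input_row_count=n_in,
        output_row_count=n_out,
        environment={
            "python": platform.python_version(),
            "platform": platform.platform(),
        },
    )


# "bootstrap" and "0.0.0" are placeholder values for illustration.
manifest = make_manifest("bootstrap", "0.0.0", seed=7, n_in=100, n_out=100)
manifest_dict = asdict(manifest)  # plain dict, ready for a YAML or JSON writer
```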

## Planned Evaluation Protocol

### Utility

Target utility checks:

- Train-on-synthetic-test-on-real (TSTR)
- Train-on-real-test-on-synthetic (TRTS)
- Train-on-real-test-on-real (TRTR) reference baseline
- Downstream classification or regression performance
- Slice-level performance for missingness, rare categories, outliers, and class imbalance

### Fidelity

Target fidelity checks:

- Marginal distributions
- Category frequencies
- Pairwise correlations or dependency structure
- Missing-value patterns
- Class balance preservation
- Constraint preservation
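
For the pairwise correlation check, one simple option is to compare Pearson correlations column pair by column pair. The sketch below assumes numeric, non-constant columns stored as lists keyed by column name; the function names are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_gaps(
    real: dict[str, list[float]], synthetic: dict[str, list[float]]
) -> dict[tuple[str, str], float]:
    """Absolute difference between real and synthetic correlation for each column pair."""
    return {
        (a, b): abs(pearson(real[a], real[b]) - pearson(synthetic[a], synthetic[b]))
        for a, b in combinations(sorted(real), 2)
    }


# Toy columns: the synthetic data has flipped the sign of the x/y relationship.
real_cols = {"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]}
synth_cols = {"x": [1.0, 2.0, 3.0, 4.0], "y": [8.0, 6.0, 4.0, 2.0]}
gaps = correlation_gaps(real_cols, synth_cols)
```

Pearson correlation only captures linear dependence, so categorical or nonlinear structure would need a different dependency measure.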

### Privacy

Target privacy checks:

- Distance to Closest Record (DCR)
- Duplicate and near-duplicate detection
- Membership inference attack simulation, with tools such as Statice's [Anonymeter](https://github.com/statice/anonymeter) as candidates
- Singling-out and outlier-risk analysis using rare real-data equivalence classes

These checks are risk indicators, not formal privacy guarantees. Formal differential privacy is out of scope unless a future generator explicitly implements and documents it.

### Validity

Target validity checks:

- Schema conformity
- Type preservation
- Allowed category validation
- Range and business-rule constraints
- Missing-value semantics
- Target leakage checks

## Planned Artifacts

Future implemented versions of this project may produce:

- Evaluation metrics in `metrics.json`
- Fidelity and privacy summary reports
- Slice-level utility outputs
- Validity reports for generated datasets
- MLflow experiment runs
- A report notebook summarizing results

These artifacts are planned and are not currently present.

## Planned Tooling References

Candidate tools and libraries to evaluate:

- [DataSynthesizer](https://github.com/DataResponsibly/DataSynthesizer) for classical statistical synthetic-data baselines.
- [SDV](https://github.com/sdv-dev/SDV) for Gaussian copula, CTGAN, and TVAE-style tabular generation.
- [CTGAN](https://github.com/sdv-dev/CTGAN) as a recognizable neural tabular generation baseline.
- [Synthcity](https://github.com/vanderschaarlab/synthcity) for broader synthetic-data benchmarking and possible diffusion-style comparisons.
- [Anonymeter](https://github.com/statice/anonymeter) for privacy attack and risk evaluation, if dependency compatibility is practical.

These references identify candidate tooling for the planned project. They are not currently integrated into this repository.

## Non-Goals

The initial project scope excludes:

- Fine-tuning LLMs for tabular generation
- Commercial synthetic-data platforms
- Claims of formal anonymization
- Claims of differential privacy without an implemented DP mechanism
- Broad benchmark claims before reproducible experiments exist

## Roadmap

1. Finalize README, ADR, and privacy evaluation design.
2. Define the Parquet + metadata dataset contract.
3. Implement one small tabular dataset and one simple baseline.
4. Add utility and fidelity evaluation.
5. Add privacy and validity checks.
6. Add CTGAN or VAE-style generator through an isolated environment.
7. Evaluate whether diffusion-based generators add measurable value.
@@ -0,0 +1,50 @@
# ADR 0001: Isolated Generator Environments

**Status**: Planned
**Project**: Synthetic Tabular Data

## Context

The main repository targets Python 3.13 as a deliberate modern Python baseline. Some synthetic data libraries, especially legacy or research-oriented tabular generation tools, may lag behind current Python versions or require dependency versions that conflict with the main repository environment.

The synthetic data project should be able to evaluate generators such as CTGAN, VAE-style models, and tabular diffusion models without forcing the entire repository to adopt each generator's dependency constraints.

## Decision

Complex synthetic data generators will run in isolated environments when needed. These environments may use separate `uv` environments, Docker images, or another reproducible execution boundary.

The main evaluation core will remain in the repository's Python 3.13 environment.

Generator environments will communicate with the evaluation core through a strict data contract:

- Input data: `train.parquet`
- Metadata contract: `dataset_contract.yaml`
- Generated output: `synthetic.parquet`
- Optional run metadata: `run_manifest.yaml`

Generators must not receive validation or test data unless a future experiment explicitly documents and justifies that choice.

## Rationale

Parquet is preferred over CSV because tabular synthetic data evaluation depends on preserving types and missing-value semantics. CSV can silently change dates, nullable integers, booleans, categorical identifiers, leading zeros, and null representations. These changes can create false evaluation results or hide generator failures.

Parquet provides stronger native support for typed tabular exchange and reduces accidental schema drift between the evaluation core and isolated generator environments.
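
The type-loss problem can be reproduced with the standard library alone. In this small sketch, a row containing an integer, a null, a boolean, a date, and a zero-padded identifier is written to CSV and read back; the sample values are made up for illustration.

```python
import csv
import io
from datetime import date


def csv_round_trip(rows: list[list]) -> list[list[str]]:
    """Write rows to an in-memory CSV and read them back with the csv module."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    buffer.seek(0)
    return list(csv.reader(buffer))


original = [[7, None, True, date(2024, 1, 31), "00123"]]
restored = csv_round_trip(original)
# Every value comes back as a string, and None has become the empty string.
```

After the round trip the reader can no longer tell a null from an empty string, or an integer from a zero-padded identifier, without outside schema information. Any type inference applied afterwards is a guess.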

The metadata sidecar is still required because file types alone do not describe the full dataset contract. `dataset_contract.yaml` will define information such as:

- Column roles: numerical, categorical, ordinal, datetime, ID, target, sensitive, quasi-identifier
- Allowed categories
- Missing-value semantics
- Target column and prediction task
- Train/validation/test split policy
- Business rules and range constraints
- Privacy-relevant quasi-identifiers and sensitive attributes
- Whether labels may be used by the generator
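
One way the contract contents above might be represented in the evaluation core is sketched below. All class, field, and enum names are hypothetical. Treating "sensitive" and "quasi-identifier" as flags rather than exclusive roles is also an assumption, made so that a categorical column can be a quasi-identifier at the same time.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Role(Enum):
    NUMERICAL = "numerical"
    CATEGORICAL = "categorical"
    ORDINAL = "ordinal"
    DATETIME = "datetime"
    ID = "id"
    TARGET = "target"


@dataclass(frozen=True)
class ColumnSpec:
    name: str
    role: Role
    nullable: bool = False
    sensitive: bool = False
    quasi_identifier: bool = False
    allowed_categories: Optional[frozenset[str]] = None


@dataclass(frozen=True)
class DatasetContract:
    columns: tuple[ColumnSpec, ...]
    task: str  # e.g. "classification" or "regression"
    labels_visible_to_generator: bool = True

    def target(self) -> ColumnSpec:
        """Return the single target column, or fail loudly if the contract is ambiguous."""
        targets = [c for c in self.columns if c.role is Role.TARGET]
        if len(targets) != 1:
            raise ValueError(f"expected exactly one target column, found {len(targets)}")
        return targets[0]

    def quasi_identifiers(self) -> list[str]:
        return [c.name for c in self.columns if c.quasi_identifier]


# Toy contract, for illustration only.
contract = DatasetContract(
    columns=(
        ColumnSpec("zip", Role.CATEGORICAL, quasi_identifier=True),
        ColumnSpec("age", Role.NUMERICAL, quasi_identifier=True),
        ColumnSpec("diagnosis", Role.CATEGORICAL, sensitive=True),
        ColumnSpec("churned", Role.TARGET, allowed_categories=frozenset({"yes", "no"})),
    ),
    task="classification",
)
```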

## Consequences

This architecture keeps the main repository environment stable while allowing dependency-sensitive generator experiments.

It also creates extra validation work. Every generator output must be checked against the dataset contract before utility, fidelity, or privacy metrics are trusted.

This decision does not imply that any generator environment, adapter, or evaluation pipeline has been implemented yet.
@@ -0,0 +1,47 @@
# Design 0001: Privacy Evaluation Protocol

**Status**: Planned
**Project**: Synthetic Tabular Data

## Context

Synthetic data does not automatically provide privacy. A generator can memorize training records, recreate rare individuals, leak sensitive attributes, or produce rows that are dangerously close to real records.

This project needs a defense-in-depth privacy evaluation protocol before synthetic datasets can be treated as useful portfolio artifacts.

## Decision

The planned privacy evaluation will combine multiple risk indicators rather than relying on a single metric.

### 1. Distance to Closest Record

Distance to Closest Record (DCR) will be used to identify synthetic rows that are exact or near copies of real training rows.

The planned evaluation will compare synthetic-to-real distances against real-to-real distance distributions where appropriate, so that unusually close synthetic records can be flagged for review.
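
A minimal sketch of that comparison follows, assuming records are already encoded as scaled numeric tuples and that Euclidean distance is appropriate. The function names and the `quantile` default are hypothetical choices, not decided parameters.

```python
import math
from typing import Optional

Point = tuple[float, ...]


def dcr(record: Point, reference: list[Point], skip_index: Optional[int] = None) -> float:
    """Distance from record to its closest reference row, optionally skipping one index."""
    return min(
        math.dist(record, other)
        for i, other in enumerate(reference)
        if i != skip_index
    )


def flag_close_synthetic(real: list[Point], synthetic: list[Point], quantile: float = 0.05) -> list[int]:
    """Indices of synthetic rows at least as close to a real row as a low quantile of real-to-real DCR."""
    # Leave-one-out baseline: each real row's distance to its nearest other real row.
    baseline = sorted(dcr(row, real, skip_index=i) for i, row in enumerate(real))
    threshold = baseline[int(quantile * (len(baseline) - 1))]
    return [i for i, row in enumerate(synthetic) if dcr(row, real) <= threshold]


# Toy data: the first synthetic row is an exact copy of a real row.
real_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
synthetic_points = [(0.0, 0.0), (2.5, 2.5)]
flagged = flag_close_synthetic(real_points, synthetic_points)
```

Mixed numeric and categorical tables would need a different distance, such as Gower distance, and the brute-force search here is quadratic, so it only suits small datasets.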

### 2. Membership Inference Attacks

Membership inference attack (MIA) simulations will be used to estimate whether an attacker can infer that a specific real record was included in the generator's training data.

Tools such as Anonymeter may be evaluated for this purpose if they are compatible with the project architecture and dependency constraints.
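
As a rough illustration of the idea, a very simple distance-threshold attack guesses "member" whenever a candidate record lies close to some synthetic row. It is scored on known training records (members) and on held-out records the generator never saw (non-members). This is a toy sketch with hypothetical names and a hand-picked threshold; it is not Anonymeter's API or method.

```python
import math

Point = tuple[float, ...]


def nearest_distance(record: Point, synthetic: list[Point]) -> float:
    return min(math.dist(record, row) for row in synthetic)


def attack_rates(
    members: list[Point], non_members: list[Point], synthetic: list[Point], threshold: float
) -> dict[str, float]:
    """True/false positive rates of the threshold attack, plus their difference (advantage)."""
    def guess_member(record: Point) -> bool:
        return nearest_distance(record, synthetic) <= threshold

    tpr = sum(guess_member(r) for r in members) / len(members)
    fpr = sum(guess_member(r) for r in non_members) / len(non_members)
    return {"tpr": tpr, "fpr": fpr, "advantage": tpr - fpr}


# Toy data: the synthetic rows sit almost on top of the training records.
members = [(0.0, 0.0), (1.0, 1.0)]
non_members = [(5.0, 5.0), (6.0, 6.0)]
synthetic = [(0.0, 0.0), (1.0, 1.05)]
rates = attack_rates(members, non_members, synthetic, threshold=0.1)
```

An advantage near zero means the attack cannot tell members from non-members. An advantage near one, as in this contrived example, indicates heavy memorization.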

### 3. Singling-Out and Outlier-Risk Analysis

Classic k-anonymity, l-diversity, and t-closeness are not treated here as privacy guarantees for synthetic data.

Instead, this project plans to use rare equivalence classes as a singling-out and outlier-risk signal:

1. Identify quasi-identifiers in the original real training dataset.
2. Find highly unique or low-k equivalence classes in that real dataset.
3. Cross-reference generated synthetic rows against those rare real groups.
4. Flag cases where the generator appears to recreate rare, isolated, or sensitive real-data patterns.

This analysis is intended to detect whether the generator copied or over-preserved unusual real individuals or groups.
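
The four steps above can be sketched as follows, assuming rows are dictionaries and the quasi-identifiers have already been chosen. The column names, the `max_k` parameter, and the sample records are all made up for illustration.

```python
from collections import Counter


def rare_equivalence_classes(
    real_rows: list[dict], quasi_identifiers: list[str], max_k: int = 1
) -> set[tuple]:
    """Quasi-identifier combinations shared by at most max_k real rows."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in real_rows)
    return {key for key, k in counts.items() if k <= max_k}


def flag_synthetic_matches(
    synthetic_rows: list[dict], quasi_identifiers: list[str], rare_classes: set[tuple]
) -> list[int]:
    """Indices of synthetic rows whose quasi-identifiers fall in a rare real class."""
    return [
        i
        for i, row in enumerate(synthetic_rows)
        if tuple(row[q] for q in quasi_identifiers) in rare_classes
    ]


# Fabricated toy records, not real data.
qi = ["zip", "age_band"]
real_rows = [
    {"zip": "10001", "age_band": "30-39", "dx": "A"},
    {"zip": "10001", "age_band": "30-39", "dx": "B"},
    {"zip": "99999", "age_band": "80-89", "dx": "B"},  # unique combination (k = 1)
]
synthetic_rows = [
    {"zip": "10001", "age_band": "30-39", "dx": "A"},
    {"zip": "99999", "age_band": "80-89", "dx": "B"},
]
rare = rare_equivalence_classes(real_rows, qi)
flagged = flag_synthetic_matches(synthetic_rows, qi, rare)
```

A flagged row is a prompt for review rather than proof of leakage, since a generator can land on a rare combination by chance.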

## Limitations

These checks are risk indicators, not formal privacy proofs.

The protocol does not claim differential privacy unless a future generator explicitly implements a documented DP mechanism with a stated privacy budget.

Privacy results must be interpreted alongside utility, fidelity, and validity metrics. A dataset with high utility but high memorization risk should not be considered successful.