From d77934c41936fb88d69fe06e6b9be3e5c8867c18 Mon Sep 17 00:00:00 2001
From: Martin
Date: Fri, 22 May 2026 22:55:19 +0200
Subject: [PATCH 1/2] docs: outline synthetic tabular evaluation strategy

Co-authored-by: Cursor
---
 docs/evaluation_standards.md                  |  70 +++++++
 projects/synthetic_data_tabular/README.md     | 176 +++++++++++++++---
 .../0001-isolated-generator-environments.md   |  50 +++++
 .../0001-privacy-evaluation-protocol.md       |  47 +++++
 4 files changed, 322 insertions(+), 21 deletions(-)
 create mode 100644 projects/synthetic_data_tabular/docs/adr/0001-isolated-generator-environments.md
 create mode 100644 projects/synthetic_data_tabular/docs/design/0001-privacy-evaluation-protocol.md

diff --git a/docs/evaluation_standards.md b/docs/evaluation_standards.md
index dada940..77fa1b9 100644
--- a/docs/evaluation_standards.md
+++ b/docs/evaluation_standards.md
@@ -235,3 +235,73 @@ Nightly runs re-evaluate models on:
 - Current test set (detect data drift)
 - New edge cases (expanding test coverage)
 - Robustness sweeps (track stability)
+
+## Synthetic Tabular Data Target Standard
+
+This section defines the target evaluation standard for the planned `projects/synthetic_data_tabular/` project. It describes the intended evaluation protocol and does not imply that synthetic data generators, metrics, notebooks, MLflow runs, or reports are currently implemented.
+
+### Utility Metrics
+
+Synthetic tabular data should be evaluated by downstream task performance, not only by visual similarity.
+
+Target utility checks:
+
+- Train-on-synthetic-test-on-real (TSTR)
+- Train-on-real-test-on-synthetic (TRTS)
+- Train-on-real-test-on-real reference baseline
+- Classification metrics such as ROC-AUC, PR-AUC, F1, Brier score, and calibration where applicable
+- Regression metrics such as RMSE, MAE, and interval coverage where applicable
+- Slice-level utility for missingness bands, rare categories, outlier bands, and class imbalance
+
+### Fidelity Metrics
+
+Fidelity checks should measure whether synthetic data preserves useful structure without assuming that closer is always safer.
+
+Target fidelity checks:
+
+- Marginal distributions for numerical and categorical columns
+- Category frequency preservation, especially rare categories
+- Pairwise correlations and dependency structure
+- Missing-value pattern similarity
+- Target distribution preservation
+- Comparison of real and synthetic feature interactions
+
+### Privacy Metrics
+
+Synthetic data privacy evaluation should use multiple risk indicators.
+
+Target privacy checks:
+
+- Distance to Closest Record (DCR)
+- Duplicate and near-duplicate detection
+- Membership inference attack simulation
+- Singling-out and outlier-risk analysis using rare real-data equivalence classes
+- Optional l-diversity and t-closeness style checks for sensitive attributes within quasi-identifier groups
+
+These checks are risk indicators and do not constitute a formal differential privacy guarantee.
+
+### Validity Metrics
+
+Synthetic rows must satisfy the declared dataset contract before utility or privacy scores are trusted.
+
+Target validity checks:
+
+- Schema conformity
+- Parquet type preservation
+- Allowed category validation
+- Range constraints
+- Business-rule constraints
+- Missing-value semantics
+- Target leakage checks
+- Train/test split isolation checks
+
+### Target Artifacts
+
+Future implemented synthetic tabular experiments should produce:
+
+- `metrics.json` with utility, fidelity, privacy, and validity summaries
+- Slice-level utility outputs
+- Validity report for generated datasets
+- Privacy risk report
+- Fidelity plots or tables
+- Reproducible generator run metadata
diff --git a/projects/synthetic_data_tabular/README.md b/projects/synthetic_data_tabular/README.md
index 1d527af..46e7a46 100644
--- a/projects/synthetic_data_tabular/README.md
+++ b/projects/synthetic_data_tabular/README.md
@@ -1,29 +1,163 @@
 # Synthetic Tabular Data
 
-**Status**: 📋 Planned
+**Status**: 📋 Planned priority track
 
-Experimental evaluation of synthetic data generation for tabular datasets, focusing on utility vs privacy tradeoffs.
+## Current State
 
-## Planned Features
+This project is currently in the planning and design phase. No runnable scripts, notebooks, trained generators, synthetic datasets, MLflow runs, privacy reports, or benchmark results are implemented yet.
 
-- Generator comparison:
-  - CTGAN baseline
-  - Modern alternatives (TabDDPM, etc.)
-  - Use Diffusion Models to generate tabular data
-- Utility evaluation:
-  - Train-on-synthetic-test-on-real (TSTR)
-  - Train-on-real-test-on-synthetic (TRTS)
-  - Distributional metrics (marginals, pairwise correlations)
-- Privacy risk assessment:
-  - Nearest neighbor distance leakage
-  - Membership inference attack simulation
-- Decision framework for when to use synthetic data
+The goal is to build this project around an evaluation-first question:
 
-## Coming Soon
+> When does synthetic tabular data preserve useful downstream signal without creating unacceptable privacy risk?
 
-This project will demonstrate:
+## Purpose
 
-- Practical synthetic data evaluation
-- Privacy-utility tradeoff analysis
-- Guidance on when synthetic data helps vs harms
-- Reproducible experimental framework
+This project will evaluate synthetic data generation for tabular datasets through a utility, fidelity, privacy, and validity lens. The emphasis is not only on trying modern generators, but on building an evaluation ladder that makes it clear when a synthetic dataset is useful, when it is misleading, and when it may create privacy risk.
+
+## Planned Evaluation Ladder
+
+### Phase 0: Design and Contracts
+
+- Define dataset contracts for tabular inputs and generator outputs.
+- Define train/validation/test split ownership.
+- Define utility, fidelity, privacy, and validity metrics.
+- Document generator isolation strategy for dependency-sensitive tooling.
+
+### Phase 1: Simple Baselines
+
+Planned simple baselines:
+
+- Bootstrap / row resampling
+- Independent marginal sampling, for example with [DataSynthesizer](https://github.com/DataResponsibly/DataSynthesizer) or [SDV](https://github.com/sdv-dev/SDV) classical models
+- Gaussian copula or other classical statistical synthesizers
+
+These baselines will establish whether more complex generators provide value beyond simple distributional approximations.
+
+### Phase 2: Neural Baselines
+
+Planned neural baselines:
+
+- Autoencoder or variational autoencoder style generator
+- [CTGAN](https://github.com/sdv-dev/CTGAN)
+- TVAE or similar tabular VAE baseline, likely through SDV tooling
+
+These models are planned as practical synthetic-data baselines, likely through isolated generator environments when dependency compatibility requires it.
+
+### Phase 3: Modern Generators
+
+Planned modern approaches:
+
+- Tabular diffusion models such as TabDDPM-style methods
+- [Synthcity](https://github.com/vanderschaarlab/synthcity)-style generator comparisons, if dependency support is practical
+
+These approaches are planned only after the evaluation protocol and simple baselines are stable.
+
+## Planned Data Contract Architecture
+
+The main repository targets Python 3.13. Some synthetic data libraries may lag behind current Python versions or require incompatible dependency sets.
+
+To avoid forcing the main repository environment around one generator library, this project plans to use an isolated generator architecture:
+
+- The main evaluation pipeline stays in the repository's Python 3.13 environment.
+- Complex generators may run in isolated `uv` environments, Docker images, or other reproducible execution contexts.
+- Generators communicate with the evaluation core through a strict tabular data contract.
+- The data contract uses Parquet files plus a metadata schema file, not ad hoc CSV exchange, to strictly preserve data types such as dates and nullable integers and prevent silent drift.
+
+Planned contract files:
+
+- `train.parquet`: real training data made available to the generator.
+- `dataset_contract.yaml`: schema, roles, constraints, target definition, split policy, and privacy-relevant metadata.
+- `synthetic.parquet`: generated synthetic rows returned by the generator.
+- `run_manifest.yaml`: generator name, version, seed, input row count, output row count, and environment metadata.
+
+The generator must not receive validation or test rows unless a future experiment explicitly documents that choice.
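As an illustration of how generator output could be checked against the planned data contract, here is a minimal sketch using only the Python standard library. It assumes the rows of `synthetic.parquet` have already been loaded into dictionaries and that `dataset_contract.yaml` has been parsed into per-column rules. The `ColumnRule` class, `validate_rows` function, and example columns are hypothetical names chosen for this sketch; nothing like them exists in the repository yet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnRule:
    """Hypothetical per-column rule, as it might be parsed from dataset_contract.yaml."""
    dtype: type
    nullable: bool = False
    allowed: Optional[frozenset] = None   # allowed categories, if categorical
    min_value: Optional[float] = None     # inclusive lower bound, if numerical
    max_value: Optional[float] = None     # inclusive upper bound, if numerical


def validate_rows(rows, contract):
    """Return (row_index, column, problem) for every contract violation found."""
    problems = []
    expected_columns = set(contract)
    for i, row in enumerate(rows):
        if set(row) != expected_columns:
            problems.append((i, None, "column set does not match contract"))
            continue
        for name, rule in contract.items():
            value = row[name]
            if value is None:
                if not rule.nullable:
                    problems.append((i, name, "null not allowed"))
                continue
            # Strict type check: no silent coercion such as "40" -> 40.
            if type(value) is not rule.dtype:
                problems.append((i, name, "wrong type"))
                continue
            if rule.allowed is not None and value not in rule.allowed:
                problems.append((i, name, "category not allowed"))
            if rule.min_value is not None and value < rule.min_value:
                problems.append((i, name, "below minimum"))
            if rule.max_value is not None and value > rule.max_value:
                problems.append((i, name, "above maximum"))
    return problems


# Hypothetical contract and synthetic rows, for illustration only.
contract = {
    "age": ColumnRule(int, min_value=0, max_value=120),
    "plan": ColumnRule(str, allowed=frozenset({"basic", "pro"})),
    "income": ColumnRule(float, nullable=True),
}
synthetic_rows = [
    {"age": 34, "plan": "pro", "income": 52000.0},
    {"age": -3, "plan": "basic", "income": None},
    {"age": 51, "plan": "gold", "income": 41000.5},
    {"age": "40", "plan": "basic", "income": 1.0},
]
problems = validate_rows(synthetic_rows, contract)
```

In this example the first row passes and the other three are reported for a range violation, a disallowed category, and a string where an integer was declared. The last case is the kind of type drift that ad hoc CSV exchange can hide.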
+
+## Planned Evaluation Protocol
+
+### Utility
+
+Target utility checks:
+
+- Train-on-synthetic-test-on-real (TSTR)
+- Train-on-real-test-on-synthetic (TRTS)
+- Real-train-real-test reference baseline
+- Downstream classification or regression performance
+- Slice-level performance for missingness, rare categories, outliers, and class imbalance
+
+### Fidelity
+
+Target fidelity checks:
+
+- Marginal distributions
+- Category frequencies
+- Pairwise correlations or dependency structure
+- Missing-value patterns
+- Class balance preservation
+- Constraint preservation
+
+### Privacy
+
+Target privacy checks:
+
+- Distance to Closest Record (DCR)
+- Duplicate and near-duplicate detection
+- Membership inference attack simulation, targeting tools like Statice's [Anonymeter](https://github.com/statice/anonymeter)
+- Singling-out and outlier-risk analysis using rare real-data equivalence classes
+
+These checks are risk indicators, not formal privacy guarantees. Formal differential privacy is out of scope unless a future generator explicitly implements and documents it.
+
+### Validity
+
+Target validity checks:
+
+- Schema conformity
+- Type preservation
+- Allowed category validation
+- Range and business-rule constraints
+- Missing-value semantics
+- Target leakage checks
+
+## Planned Artifacts
+
+Future implemented versions of this project may produce:
+
+- Evaluation metrics in `metrics.json`
+- Fidelity and privacy summary reports
+- Slice-level utility outputs
+- Validity reports for generated datasets
+- MLflow experiment runs
+- A report notebook summarizing results
+
+These artifacts are planned and are not currently present.
+
+## Planned Tooling References
+
+Candidate tools and libraries to evaluate:
+
+- [DataSynthesizer](https://github.com/DataResponsibly/DataSynthesizer) for classical statistical synthetic-data baselines.
+- [SDV](https://github.com/sdv-dev/SDV) for Gaussian copula, CTGAN, and TVAE-style tabular generation.
+- [CTGAN](https://github.com/sdv-dev/CTGAN) as a recognizable neural tabular generation baseline.
+- [Synthcity](https://github.com/vanderschaarlab/synthcity) for broader synthetic-data benchmarking and possible diffusion-style comparisons.
+- [Anonymeter](https://github.com/statice/anonymeter) for privacy attack and risk evaluation, if dependency compatibility is practical.
+
+These references identify candidate tooling for the planned project. They are not currently integrated into this repository.
+
+## Non-Goals
+
+The initial project scope excludes:
+
+- Fine-tuning LLMs for tabular generation
+- Commercial synthetic-data platforms
+- Claims of formal anonymization
+- Claims of differential privacy without an implemented DP mechanism
+- Broad benchmark claims before reproducible experiments exist
+
+## Roadmap
+
+1. Finalize README, ADR, and privacy evaluation design.
+2. Define the Parquet + metadata dataset contract.
+3. Implement one small tabular dataset and one simple baseline.
+4. Add utility and fidelity evaluation.
+5. Add privacy and validity checks.
+6. Add CTGAN or VAE-style generator through an isolated environment.
+7. Evaluate whether diffusion-based generators add measurable value.
diff --git a/projects/synthetic_data_tabular/docs/adr/0001-isolated-generator-environments.md b/projects/synthetic_data_tabular/docs/adr/0001-isolated-generator-environments.md
new file mode 100644
index 0000000..82b1c5c
--- /dev/null
+++ b/projects/synthetic_data_tabular/docs/adr/0001-isolated-generator-environments.md
@@ -0,0 +1,50 @@
+# ADR 0001: Isolated Generator Environments
+
+**Status**: Planned
+**Project**: Synthetic Tabular Data
+
+## Context
+
+The main repository targets Python 3.13 as a deliberate modern Python baseline. Some synthetic data libraries, especially legacy or research-oriented tabular generation tools, may lag behind current Python versions or require dependency versions that conflict with the main repository environment.
+
+The synthetic data project should be able to evaluate generators such as CTGAN, VAE-style models, and tabular diffusion models without forcing the entire repository to adopt each generator's dependency constraints.
+
+## Decision
+
+Complex synthetic data generators will run in isolated environments when needed. These environments may use separate `uv` environments, Docker images, or another reproducible execution boundary.
+
+The main evaluation core will remain in the repository's Python 3.13 environment.
+
+Generator environments will communicate with the evaluation core through a strict data contract:
+
+- Input data: `train.parquet`
+- Metadata contract: `dataset_contract.yaml`
+- Generated output: `synthetic.parquet`
+- Optional run metadata: `run_manifest.yaml`
+
+Generators must not receive validation or test data unless a future experiment explicitly documents and justifies that choice.
+
+## Rationale
+
+Parquet is preferred over CSV because tabular synthetic data evaluation depends on preserving types and missing-value semantics. CSV can silently change dates, nullable integers, booleans, categorical identifiers, leading zeros, and null representations. These changes can create false evaluation results or hide generator failures.
+
+Parquet provides stronger native support for typed tabular exchange and reduces accidental schema drift between the evaluation core and isolated generator environments.
+
+The metadata sidecar is still required because file types alone do not describe the full dataset contract. `dataset_contract.yaml` will define information such as:
+
+- Column roles: numerical, categorical, ordinal, datetime, ID, target, sensitive, quasi-identifier
+- Allowed categories
+- Missing-value semantics
+- Target column and prediction task
+- Train/validation/test split policy
+- Business rules and range constraints
+- Privacy-relevant quasi-identifiers and sensitive attributes
+- Whether labels may be used by the generator
+
+## Consequences
+
+This architecture keeps the main repository environment stable while allowing dependency-sensitive generator experiments.
+
+It also creates extra validation work. Every generator output must be checked against the dataset contract before utility, fidelity, or privacy metrics are trusted.
+
+This decision does not imply that any generator environment, adapter, or evaluation pipeline has been implemented yet.
diff --git a/projects/synthetic_data_tabular/docs/design/0001-privacy-evaluation-protocol.md b/projects/synthetic_data_tabular/docs/design/0001-privacy-evaluation-protocol.md
new file mode 100644
index 0000000..9d2442f
--- /dev/null
+++ b/projects/synthetic_data_tabular/docs/design/0001-privacy-evaluation-protocol.md
@@ -0,0 +1,47 @@
+# Design 0001: Privacy Evaluation Protocol
+
+**Status**: Planned
+**Project**: Synthetic Tabular Data
+
+## Context
+
+Synthetic data does not automatically provide privacy. A generator can memorize training records, recreate rare individuals, leak sensitive attributes, or produce rows that are dangerously close to real records.
+
+This project needs a defense-in-depth privacy evaluation protocol before synthetic datasets can be treated as useful portfolio artifacts.
+
+## Decision
+
+The planned privacy evaluation will combine multiple risk indicators rather than relying on a single metric.
+
+### 1. Distance to Closest Record
+
+Distance to Closest Record (DCR) will be used to identify synthetic rows that are exact or near copies of real training rows.
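A DCR check of this kind could be sketched as follows, using only the Python standard library. The sketch assumes purely numerical columns, min-max scaling fitted on the real rows, and Euclidean distance. The function names, example rows, and `NEAR_COPY_THRESHOLD` are hypothetical illustrations, not values or APIs defined by this project; mixed-type data would need a different distance.

```python
import math


def fit_min_max_scaler(real_rows):
    """Build a scaler from the real rows so every column spans roughly [0, 1]."""
    columns = list(zip(*real_rows))
    lows = [min(col) for col in columns]
    # Guard against constant columns, which would otherwise divide by zero.
    spans = [(max(col) - min(col)) or 1.0 for col in columns]

    def scale(row):
        return [(value - low) / span for value, low, span in zip(row, lows, spans)]

    return scale


def distance_to_closest_record(synthetic_rows, real_rows):
    """For each synthetic row, return the distance to its nearest real row."""
    scale = fit_min_max_scaler(real_rows)
    real_scaled = [scale(row) for row in real_rows]
    return [
        min(math.dist(scale(row), real) for real in real_scaled)
        for row in synthetic_rows
    ]


# Hypothetical (age, income) rows, for illustration only.
real_rows = [(25, 30000.0), (40, 52000.0), (61, 75000.0)]
synthetic_rows = [(40, 52000.0), (41, 52500.0), (30, 90000.0)]

NEAR_COPY_THRESHOLD = 0.05  # assumed cut-off in scaled units, not a project value

dcr = distance_to_closest_record(synthetic_rows, real_rows)
exact_copies = [i for i, d in enumerate(dcr) if d == 0.0]
near_copies = [i for i, d in enumerate(dcr) if 0.0 < d < NEAR_COPY_THRESHOLD]
```

Here the first synthetic row is an exact copy of a real row, the second is a near copy, and the third is far from every real record. The brute-force search is quadratic in the number of rows, so a real implementation would probably need a nearest-neighbor index.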
+
+The planned evaluation will compare synthetic-to-real distances against real-to-real distance distributions where appropriate, so that unusually close synthetic records can be flagged for review.
+
+### 2. Membership Inference Attacks
+
+Membership inference attack (MIA) simulations will be used to estimate whether an attacker can infer that a specific real record was included in the generator's training data.
+
+Tools such as Anonymeter may be evaluated for this purpose if they are compatible with the project architecture and dependency constraints.
+
+### 3. Singling-Out and Outlier-Risk Analysis
+
+Classic k-anonymity, l-diversity, and t-closeness are not treated here as privacy guarantees for synthetic data.
+
+Instead, this project plans to use rare equivalence classes as a singling-out and outlier-risk signal:
+
+1. Identify quasi-identifiers in the original real training dataset.
+2. Find highly unique or low-k equivalence classes in that real dataset.
+3. Cross-reference generated synthetic rows against those rare real groups.
+4. Flag cases where the generator appears to recreate rare, isolated, or sensitive real-data patterns.
+
+This analysis is intended to detect whether the generator copied or over-preserved unusual real individuals or groups.
+
+## Limitations
+
+These checks are risk indicators, not formal privacy proofs.
+
+The protocol does not claim differential privacy unless a future generator explicitly implements a documented DP mechanism with a stated privacy budget.
+
+Privacy results must be interpreted alongside utility, fidelity, and validity metrics. A dataset with high utility but high memorization risk should not be considered successful.
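The four singling-out steps in Design 0001 could be sketched roughly as below, using only the Python standard library. The sketch assumes rows are dictionaries and that the quasi-identifiers are already known from the dataset contract. The function names, the column names, and the default `k=2` are hypothetical choices for illustration. A flagged row is a signal for manual review, not proof that the generator copied a real individual.

```python
from collections import Counter


def rare_equivalence_classes(real_rows, quasi_identifiers, k=2):
    """Steps 1-2: quasi-identifier combinations shared by fewer than k real rows."""
    counts = Counter(
        tuple(row[name] for name in quasi_identifiers) for row in real_rows
    )
    return {key for key, count in counts.items() if count < k}


def flag_rare_matches(synthetic_rows, rare_classes, quasi_identifiers):
    """Steps 3-4: indices of synthetic rows that fall into a rare real class."""
    return [
        index
        for index, row in enumerate(synthetic_rows)
        if tuple(row[name] for name in quasi_identifiers) in rare_classes
    ]


# Hypothetical quasi-identifiers and rows, for illustration only.
QUASI_IDENTIFIERS = ("zip", "age_band", "sex")

real_rows = [
    {"zip": "10115", "age_band": "30-39", "sex": "F", "diagnosis": "A"},
    {"zip": "10115", "age_band": "30-39", "sex": "F", "diagnosis": "B"},
    {"zip": "10115", "age_band": "30-39", "sex": "M", "diagnosis": "A"},
    {"zip": "10115", "age_band": "30-39", "sex": "M", "diagnosis": "C"},
    {"zip": "99999", "age_band": "80-89", "sex": "F", "diagnosis": "D"},
]
synthetic_rows = [
    {"zip": "10115", "age_band": "30-39", "sex": "F", "diagnosis": "A"},
    {"zip": "99999", "age_band": "80-89", "sex": "F", "diagnosis": "D"},
    {"zip": "20095", "age_band": "40-49", "sex": "M", "diagnosis": "B"},
]

rare_classes = rare_equivalence_classes(real_rows, QUASI_IDENTIFIERS, k=2)
flagged = flag_rare_matches(synthetic_rows, rare_classes, QUASI_IDENTIFIERS)
```

In this example only one real quasi-identifier combination is unique, and the second synthetic row reproduces it together with the same sensitive value, so it is the row flagged for review.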
From 6354f8a0f8886366c54032eea82c76009fa4c05e Mon Sep 17 00:00:00 2001
From: Martin
Date: Mon, 1 Jun 2026 16:34:48 +0200
Subject: [PATCH 2/2] added readmes

---
 projects/bayesian_optimization/README.md | 2 +-
 projects/quantum_ml/README.md            | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/projects/bayesian_optimization/README.md b/projects/bayesian_optimization/README.md
index f897a06..e3bd79a 100644
--- a/projects/bayesian_optimization/README.md
+++ b/projects/bayesian_optimization/README.md
@@ -53,7 +53,7 @@ Implement and compare Bayesian optimization frameworks for sample-efficient hype
 ## References
 
 - [Optuna Documentation](https://optuna.readthedocs.io/)
-- [BoTorch Tutorials](https://botorch.org/tutorials/)
+- [BoTorch Tutorials](https://botorch.org/docs/tutorials)
 - [Bayesian Optimization Book](https://bayesoptbook.com/)
 - [AutoML Book Chapter](https://www.automl.org/book/)
 
diff --git a/projects/quantum_ml/README.md b/projects/quantum_ml/README.md
index 2e33cd8..d331d8f 100644
--- a/projects/quantum_ml/README.md
+++ b/projects/quantum_ml/README.md
@@ -139,6 +139,9 @@ quantum_ml/
 - [Cirq Documentation](https://quantumai.google/cirq)
 
 ### Papers
+- [Qiskit Machine Learning: an open-source library for quantum machine learning tasks at scale on quantum hardware and classical simulators](https://arxiv.org/html/2505.17756v1)
+- [Quantum Machine Learning](https://arxiv.org/abs/1611.09347) - Schuld & Killoran
+
 - [Variational Quantum Eigensolver (VQE)](https://arxiv.org/abs/1304.3061) - Peruzzo et al.
 - [Quantum Machine Learning](https://arxiv.org/abs/1611.09347) - Schuld & Killoran
 - [Barren Plateaus in QML](https://arxiv.org/abs/1803.11173) - McClean et al.