From d77934c41936fb88d69fe06e6b9be3e5c8867c18 Mon Sep 17 00:00:00 2001
From: Martin <your.email@example.com>
Date: Fri, 22 May 2026 22:55:19 +0200
Subject: [PATCH 1/2] docs: outline synthetic tabular evaluation strategy

Co-authored-by: Cursor <cursoragent@cursor.com>
---
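Note on the planned DCR check (privacy design doc, "Distance to Closest Record"): the comparison of synthetic-to-real distances against real-to-real distances might look roughly like the sketch below. It is a minimal illustration, not the planned implementation. It assumes purely numeric, already-scaled features and Euclidean distance; the helper names (`dcr`, `flag_close_rows`) and the low-quantile threshold rule are hypothetical.

```python
import math


def dcr(rows, reference, skip_same_index=False):
    """Euclidean distance from each row to its closest row in `reference`."""
    distances = []
    for i, row in enumerate(rows):
        best = math.inf
        for j, ref in enumerate(reference):
            # When rows and reference are the same list, skip the row itself.
            if skip_same_index and i == j:
                continue
            best = min(best, math.dist(row, ref))
        distances.append(best)
    return distances


def flag_close_rows(synthetic, real_train, quantile=0.05):
    """Flag synthetic rows at least as close to a real row as the chosen
    low quantile of real-to-real nearest-neighbour distances (assumed rule)."""
    real_nn = sorted(dcr(real_train, real_train, skip_same_index=True))
    # Index of the requested quantile in the sorted real-to-real distances.
    idx = max(0, math.ceil(quantile * len(real_nn)) - 1)
    threshold = real_nn[idx]
    synthetic_dcr = dcr(synthetic, real_train)
    flagged = [i for i, d in enumerate(synthetic_dcr) if d <= threshold]
    return flagged, threshold


# Toy data: synthetic row 0 is an exact copy of a real row, row 1 is a near copy.
real_train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
synthetic = [(0.0, 0.0), (0.5, 0.5), (3.0, 3.0)]
flagged, threshold = flag_close_rows(synthetic, real_train, quantile=0.25)
print(flagged, threshold)  # prints [0, 1] 1.0
```

Rows flagged this way would only be candidates for review. Mixed-type columns would need an encoding or a different distance, which this sketch does not attempt.
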
 docs/evaluation_standards.md                  |  70 +++++++
 projects/synthetic_data_tabular/README.md     | 176 +++++++++++++++---
 .../0001-isolated-generator-environments.md   |  50 +++++
 .../0001-privacy-evaluation-protocol.md       |  47 +++++
 4 files changed, 322 insertions(+), 21 deletions(-)
 create mode 100644 projects/synthetic_data_tabular/docs/adr/0001-isolated-generator-environments.md
 create mode 100644 projects/synthetic_data_tabular/docs/design/0001-privacy-evaluation-protocol.md
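
Note on the singling-out steps in the privacy design doc (identify quasi-identifiers, find low-k equivalence classes in the real training data, cross-reference synthetic rows, flag matches): a minimal sketch under stated assumptions. Rows are plain dicts, the column names (`age_band`, `zip3`, `diagnosis`) are invented for illustration, and the helper names and the `max_k` cutoff are hypothetical rather than part of any planned API.

```python
from collections import Counter


def rare_classes(real_rows, quasi_identifiers, max_k=1):
    """Return quasi-identifier combinations shared by at most `max_k` real rows."""
    keys = [tuple(row[col] for col in quasi_identifiers) for row in real_rows]
    counts = Counter(keys)
    return {key for key, n in counts.items() if n <= max_k}


def flag_recreated_rare_rows(synthetic_rows, real_rows, quasi_identifiers, max_k=1):
    """Indices of synthetic rows whose quasi-identifiers match a rare real class."""
    rare = rare_classes(real_rows, quasi_identifiers, max_k)
    return [
        i
        for i, row in enumerate(synthetic_rows)
        if tuple(row[col] for col in quasi_identifiers) in rare
    ]


# Toy data with invented columns; the third real row is unique on its quasi-identifiers.
real_rows = [
    {"age_band": "30-39", "zip3": "100", "diagnosis": "A"},
    {"age_band": "30-39", "zip3": "100", "diagnosis": "B"},
    {"age_band": "80-89", "zip3": "945", "diagnosis": "C"},
]
synthetic_rows = [
    {"age_band": "30-39", "zip3": "100", "diagnosis": "A"},
    {"age_band": "80-89", "zip3": "945", "diagnosis": "C"},
    {"age_band": "50-59", "zip3": "606", "diagnosis": "B"},
]
flagged_rare = flag_recreated_rare_rows(
    synthetic_rows, real_rows, quasi_identifiers=["age_band", "zip3"]
)
print(flagged_rare)  # prints [1]
```

A match on quasi-identifiers alone is a review signal, not proof of copying; comparing the sensitive attribute of flagged rows against the real group would be a natural follow-up.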

diff --git a/docs/evaluation_standards.md b/docs/evaluation_standards.md
index dada940..77fa1b9 100644
--- a/docs/evaluation_standards.md
+++ b/docs/evaluation_standards.md
@@ -235,3 +235,73 @@ Nightly runs re-evaluate models on:
 - Current test set (detect data drift)
 - New edge cases (expanding test coverage)
 - Robustness sweeps (track stability)
+
+## Synthetic Tabular Data Target Standard
+
+This section defines the target evaluation standard for the planned `projects/synthetic_data_tabular/` project. It describes the intended evaluation protocol and does not imply that synthetic data generators, metrics, notebooks, MLflow runs, or reports are currently implemented.
+
+### Utility Metrics
+
+Synthetic tabular data should be evaluated by downstream task performance, not only by visual similarity.
+
+Target utility checks:
+
+- Train-on-synthetic-test-on-real (TSTR)
+- Train-on-real-test-on-synthetic (TRTS)
+- Train-on-real-test-on-real reference baseline
+- Classification metrics such as ROC-AUC, PR-AUC, F1, Brier score, and calibration where applicable
+- Regression metrics such as RMSE, MAE, and interval coverage where applicable
+- Slice-level utility for missingness bands, rare categories, outlier bands, and class imbalance
+
+### Fidelity Metrics
+
+Fidelity checks should measure whether synthetic data preserves useful structure, without assuming that a closer match to the real data is always safer: near-exact matches can signal memorization of training records.
+
+Target fidelity checks:
+
+- Marginal distributions for numerical and categorical columns
+- Category frequency preservation, especially rare categories
+- Pairwise correlations and dependency structure
+- Missing-value pattern similarity
+- Target distribution preservation
+- Comparison of real and synthetic feature interactions
+
+### Privacy Metrics
+
+Synthetic data privacy evaluation should use multiple risk indicators.
+
+Target privacy checks:
+
+- Distance to Closest Record (DCR)
+- Duplicate and near-duplicate detection
+- Membership inference attack simulation
+- Singling-out and outlier-risk analysis using rare real-data equivalence classes
+- Optional l-diversity-style and t-closeness-style checks for sensitive attributes within quasi-identifier groups
+
+These checks are risk indicators and do not constitute a formal differential privacy guarantee.
+
+### Validity Metrics
+
+Synthetic rows must satisfy the declared dataset contract before utility or privacy scores are trusted.
+
+Target validity checks:
+
+- Schema conformity
+- Parquet type preservation
+- Allowed category validation
+- Range constraints
+- Business-rule constraints
+- Missing-value semantics
+- Target leakage checks
+- Train/test split isolation checks
+
+### Target Artifacts
+
+Future implemented synthetic tabular experiments should produce:
+
+- `metrics.json` with utility, fidelity, privacy, and validity summaries
+- Slice-level utility outputs
+- Validity report for generated datasets
+- Privacy risk report
+- Fidelity plots or tables
+- Reproducible generator run metadata
diff --git a/projects/synthetic_data_tabular/README.md b/projects/synthetic_data_tabular/README.md
index 1d527af..46e7a46 100644
--- a/projects/synthetic_data_tabular/README.md
+++ b/projects/synthetic_data_tabular/README.md
@@ -1,29 +1,163 @@
 # Synthetic Tabular Data
 
-**Status**: 📋 Planned
+**Status**: 📋 Planned (priority track)
 
-Experimental evaluation of synthetic data generation for tabular datasets, focusing on utility vs privacy tradeoffs.
+## Current State
 
-## Planned Features
+This project is currently in the planning and design phase. No runnable scripts, notebooks, trained generators, synthetic datasets, MLflow runs, privacy reports, or benchmark results are implemented yet.
 
-- Generator comparison:
-  - CTGAN baseline
-  - Modern alternatives (TabDDPM, etc.)
-  - Use Diffusion Models to generate tabular data
-- Utility evaluation:
-  - Train-on-synthetic-test-on-real (TSTR)
-  - Train-on-real-test-on-synthetic (TRTS)
-  - Distributional metrics (marginals, pairwise correlations)
-- Privacy risk assessment:
-  - Nearest neighbor distance leakage
-  - Membership inference attack simulation
-- Decision framework for when to use synthetic data
+The goal is to build this project around an evaluation-first question:
 
-## Coming Soon
+> When does synthetic tabular data preserve useful downstream signal without creating unacceptable privacy risk?
 
-This project will demonstrate:
+## Purpose
 
-- Practical synthetic data evaluation
-- Privacy-utility tradeoff analysis
-- Guidance on when synthetic data helps vs harms
-- Reproducible experimental framework
+This project will evaluate synthetic data generation for tabular datasets through a utility, fidelity, privacy, and validity lens. The emphasis is not only on trying modern generators, but on building an evaluation ladder that makes it clear when a synthetic dataset is useful, when it is misleading, and when it may create privacy risk.
+
+## Planned Evaluation Ladder
+
+### Phase 0: Design and Contracts
+
+- Define dataset contracts for tabular inputs and generator outputs.
+- Define train/validation/test split ownership.
+- Define utility, fidelity, privacy, and validity metrics.
+- Document generator isolation strategy for dependency-sensitive tooling.
+
+### Phase 1: Simple Baselines
+
+Planned simple baselines:
+
+- Bootstrap / row resampling
+- Independent marginal sampling, for example with [DataSynthesizer](https://github.com/DataResponsibly/DataSynthesizer) or [SDV](https://github.com/sdv-dev/SDV) classical models
+- Gaussian copula or other classical statistical synthesizers
+
+These baselines will establish whether more complex generators provide value beyond simple distributional approximations.
+
+### Phase 2: Neural Baselines
+
+Planned neural baselines:
+
+- Autoencoder or variational autoencoder style generator
+- [CTGAN](https://github.com/sdv-dev/CTGAN)
+- TVAE or similar tabular VAE baseline, likely through SDV tooling
+
+These models are planned as practical synthetic-data baselines and will run in isolated generator environments where dependency compatibility requires it.
+
+### Phase 3: Modern Generators
+
+Planned modern approaches:
+
+- Tabular diffusion models such as TabDDPM-style methods
+- [Synthcity](https://github.com/vanderschaarlab/synthcity)-style generator comparisons, if dependency support is practical
+
+These approaches are planned only after the evaluation protocol and simple baselines are stable.
+
+## Planned Data Contract Architecture
+
+The main repository targets Python 3.13. Some synthetic data libraries may lag behind current Python versions or require incompatible dependency sets.
+
+To avoid forcing the main repository environment around one generator library, this project plans to use an isolated generator architecture:
+
+- The main evaluation pipeline stays in the repository's Python 3.13 environment.
+- Complex generators may run in isolated `uv` environments, Docker images, or other reproducible execution contexts.
+- Generators communicate with the evaluation core through a strict tabular data contract.
+- The data contract uses Parquet files plus a metadata schema file rather than ad hoc CSV exchange, so that data types such as dates and nullable integers are preserved and silent schema drift is avoided.
+
+Planned contract files:
+
+- `train.parquet`: real training data made available to the generator.
+- `dataset_contract.yaml`: schema, roles, constraints, target definition, split policy, and privacy-relevant metadata.
+- `synthetic.parquet`: generated synthetic rows returned by the generator.
+- `run_manifest.yaml`: generator name, version, seed, input row count, output row count, and environment metadata.
+
+The generator must not receive validation or test rows unless a future experiment explicitly documents that choice.
+
+## Planned Evaluation Protocol
+
+### Utility
+
+Target utility checks:
+
+- Train-on-synthetic-test-on-real (TSTR)
+- Train-on-real-test-on-synthetic (TRTS)
+- Train-on-real-test-on-real reference baseline
+- Downstream classification or regression performance
+- Slice-level performance for missingness, rare categories, outliers, and class imbalance
+
+### Fidelity
+
+Target fidelity checks:
+
+- Marginal distributions
+- Category frequencies
+- Pairwise correlations or dependency structure
+- Missing-value patterns
+- Class balance preservation
+- Constraint preservation
+
+### Privacy
+
+Target privacy checks:
+
+- Distance to Closest Record (DCR)
+- Duplicate and near-duplicate detection
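+- Membership inference attack simulation, complemented by attack-based risk evaluation with tools like Statice's [Anonymeter](https://github.com/statice/anonymeter), which covers singling-out, linkability, and attribute-inference risks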
+- Singling-out and outlier-risk analysis using rare real-data equivalence classes
+
+These checks are risk indicators, not formal privacy guarantees. Formal differential privacy is out of scope unless a future generator explicitly implements and documents it.
+
+### Validity
+
+Target validity checks:
+
+- Schema conformity
+- Type preservation
+- Allowed category validation
+- Range and business-rule constraints
+- Missing-value semantics
+- Target leakage checks
+
+## Planned Artifacts
+
+Future implemented versions of this project may produce:
+
+- Evaluation metrics in `metrics.json`
+- Fidelity and privacy summary reports
+- Slice-level utility outputs
+- Validity reports for generated datasets
+- MLflow experiment runs
+- A report notebook summarizing results
+
+These artifacts are planned and are not currently present.
+
+## Planned Tooling References
+
+Candidate tools and libraries to evaluate:
+
+- [DataSynthesizer](https://github.com/DataResponsibly/DataSynthesizer) for classical statistical synthetic-data baselines.
+- [SDV](https://github.com/sdv-dev/SDV) for Gaussian copula, CTGAN, and TVAE-style tabular generation.
+- [CTGAN](https://github.com/sdv-dev/CTGAN) as a recognizable neural tabular generation baseline.
+- [Synthcity](https://github.com/vanderschaarlab/synthcity) for broader synthetic-data benchmarking and possible diffusion-style comparisons.
+- [Anonymeter](https://github.com/statice/anonymeter) for privacy attack and risk evaluation, if dependency compatibility is practical.
+
+These references identify candidate tooling for the planned project. They are not currently integrated into this repository.
+
+## Non-Goals
+
+The initial project scope excludes:
+
+- Fine-tuning LLMs for tabular generation
+- Commercial synthetic-data platforms
+- Claims of formal anonymization
+- Claims of differential privacy without an implemented DP mechanism
+- Broad benchmark claims before reproducible experiments exist
+
+## Roadmap
+
+1. Finalize README, ADR, and privacy evaluation design.
+2. Define the Parquet + metadata dataset contract.
+3. Implement one small tabular dataset and one simple baseline.
+4. Add utility and fidelity evaluation.
+5. Add privacy and validity checks.
+6. Add CTGAN or VAE-style generator through an isolated environment.
+7. Evaluate whether diffusion-based generators add measurable value.
diff --git a/projects/synthetic_data_tabular/docs/adr/0001-isolated-generator-environments.md b/projects/synthetic_data_tabular/docs/adr/0001-isolated-generator-environments.md
new file mode 100644
index 0000000..82b1c5c
--- /dev/null
+++ b/projects/synthetic_data_tabular/docs/adr/0001-isolated-generator-environments.md
@@ -0,0 +1,50 @@
+# ADR 0001: Isolated Generator Environments
+
+**Status**: Planned
+**Project**: Synthetic Tabular Data
+
+## Context
+
+The main repository targets Python 3.13 as a deliberate modern Python baseline. Some synthetic data libraries, especially legacy or research-oriented tabular generation tools, may lag behind current Python versions or require dependency versions that conflict with the main repository environment.
+
+The synthetic data project should be able to evaluate generators such as CTGAN, VAE-style models, and tabular diffusion models without forcing the entire repository to adopt each generator's dependency constraints.
+
+## Decision
+
+Complex synthetic data generators will run in isolated environments when needed. These environments may use separate `uv` environments, Docker images, or another reproducible execution boundary.
+
+The main evaluation core will remain in the repository's Python 3.13 environment.
+
+Generator environments will communicate with the evaluation core through a strict data contract:
+
+- Input data: `train.parquet`
+- Metadata contract: `dataset_contract.yaml`
+- Generated output: `synthetic.parquet`
+- Optional run metadata: `run_manifest.yaml`
+
+Generators must not receive validation or test data unless a future experiment explicitly documents and justifies that choice.
+
+## Rationale
+
+Parquet is preferred over CSV because tabular synthetic data evaluation depends on preserving types and missing-value semantics. CSV can silently change dates, nullable integers, booleans, categorical identifiers, leading zeros, and null representations. These changes can create false evaluation results or hide generator failures.
+
+Parquet provides stronger native support for typed tabular exchange and reduces accidental schema drift between the evaluation core and isolated generator environments.
+
+The metadata sidecar is still required because file types alone do not describe the full dataset contract. `dataset_contract.yaml` will define information such as:
+
+- Column roles: numerical, categorical, ordinal, datetime, ID, target, sensitive, quasi-identifier
+- Allowed categories
+- Missing-value semantics
+- Target column and prediction task
+- Train/validation/test split policy
+- Business rules and range constraints
+- Privacy-relevant quasi-identifiers and sensitive attributes
+- Whether labels may be used by the generator
+
+## Consequences
+
+This architecture keeps the main repository environment stable while allowing dependency-sensitive generator experiments.
+
+It also creates extra validation work. Every generator output must be checked against the dataset contract before utility, fidelity, or privacy metrics are trusted.
+
+This decision does not imply that any generator environment, adapter, or evaluation pipeline has been implemented yet.
diff --git a/projects/synthetic_data_tabular/docs/design/0001-privacy-evaluation-protocol.md b/projects/synthetic_data_tabular/docs/design/0001-privacy-evaluation-protocol.md
new file mode 100644
index 0000000..9d2442f
--- /dev/null
+++ b/projects/synthetic_data_tabular/docs/design/0001-privacy-evaluation-protocol.md
@@ -0,0 +1,47 @@
+# Design 0001: Privacy Evaluation Protocol
+
+**Status**: Planned
+**Project**: Synthetic Tabular Data
+
+## Context
+
+Synthetic data does not automatically provide privacy. A generator can memorize training records, recreate rare individuals, leak sensitive attributes, or produce rows that are dangerously close to real records.
+
+This project needs a defense-in-depth privacy evaluation protocol before synthetic datasets can be treated as useful portfolio artifacts.
+
+## Decision
+
+The planned privacy evaluation will combine multiple risk indicators rather than relying on a single metric.
+
+### 1. Distance to Closest Record
+
+Distance to Closest Record (DCR) will be used to identify synthetic rows that are exact or near copies of real training rows.
+
+The planned evaluation will compare synthetic-to-real distances against real-to-real distance distributions where appropriate, so that unusually close synthetic records can be flagged for review.
+
+### 2. Membership Inference Attacks
+
+Membership inference attack (MIA) simulations will be used to estimate whether an attacker can infer that a specific real record was included in the generator's training data.
+
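+Tools such as Anonymeter, whose attack-based evaluators target singling-out, linkability, and attribute inference rather than membership inference itself, may be evaluated as a complement if they are compatible with the project architecture and dependency constraints.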
+
+### 3. Singling-Out and Outlier-Risk Analysis
+
+Classic k-anonymity, l-diversity, and t-closeness are not treated here as privacy guarantees for synthetic data.
+
+Instead, this project plans to use rare equivalence classes as a singling-out and outlier-risk signal:
+
+1. Identify quasi-identifiers in the original real training dataset.
+2. Find highly unique or low-k equivalence classes in that real dataset.
+3. Cross-reference generated synthetic rows against those rare real groups.
+4. Flag cases where the generator appears to recreate rare, isolated, or sensitive real-data patterns.
+
+This analysis is intended to detect whether the generator copied or over-preserved unusual real individuals or groups.
+
+## Limitations
+
+These checks are risk indicators, not formal privacy proofs.
+
+The protocol does not claim differential privacy unless a future generator explicitly implements a documented DP mechanism with a stated privacy budget.
+
+Privacy results must be interpreted alongside utility, fidelity, and validity metrics. A dataset with high utility but high memorization risk should not be considered successful.

From 6354f8a0f8886366c54032eea82c76009fa4c05e Mon Sep 17 00:00:00 2001
From: Martin <your.email@example.com>
Date: Mon, 1 Jun 2026 16:34:48 +0200
Subject: [PATCH 2/2] docs: fix BoTorch tutorials link and add quantum ML references

---
 projects/bayesian_optimization/README.md | 2 +-
 projects/quantum_ml/README.md            | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/projects/bayesian_optimization/README.md b/projects/bayesian_optimization/README.md
index f897a06..e3bd79a 100644
--- a/projects/bayesian_optimization/README.md
+++ b/projects/bayesian_optimization/README.md
@@ -53,7 +53,7 @@ Implement and compare Bayesian optimization frameworks for sample-efficient hype
 ## References
 
 - [Optuna Documentation](https://optuna.readthedocs.io/)
-- [BoTorch Tutorials](https://botorch.org/tutorials/)
+- [BoTorch Tutorials](https://botorch.org/docs/tutorials)
 - [Bayesian Optimization Book](https://bayesoptbook.com/)
 - [AutoML Book Chapter](https://www.automl.org/book/)
 
diff --git a/projects/quantum_ml/README.md b/projects/quantum_ml/README.md
index 2e33cd8..d331d8f 100644
--- a/projects/quantum_ml/README.md
+++ b/projects/quantum_ml/README.md
@@ -139,6 +139,9 @@ quantum_ml/
 - [Cirq Documentation](https://quantumai.google/cirq)
 
 ### Papers
+- [Qiskit Machine Learning: an open-source library for quantum machine learning tasks at scale on quantum hardware and classical simulators](https://arxiv.org/html/2505.17756v1)
+- [Quantum Machine Learning](https://arxiv.org/abs/1611.09347) - Biamonte et al.
+
 - [Variational Quantum Eigensolver (VQE)](https://arxiv.org/abs/1304.3061) - Peruzzo et al.
 - [Quantum Machine Learning](https://arxiv.org/abs/1611.09347) - Schuld & Killoran
 - [Barren Plateaus in QML](https://arxiv.org/abs/1803.11173) - McClean et al.