12 changes: 6 additions & 6 deletions README.md
@@ -194,18 +194,18 @@ Both code paths support the following DP mechanisms:

## Further Documentation

-Detailed guides are available in the [`documentation/`](dpsynth/documentation/)
+Detailed guides are available in the [`docs/`](docs/)
directory:

-* **In-Memory DataFrame API Guide** (`documentation/in_memory_api.md`):
+* **In-Memory DataFrame API Guide** (`docs/in_memory_api.md`):
Detailed guide to using the Pandas-based API and local CLI.
-* **Scalable Pipeline API Guide** (`documentation/scalable_beam_api.md`):
+* **Scalable Pipeline API Guide** (`docs/scalable_beam_api.md`):
Guide for distributed data generation.
-* **Data Model & Terminology** (`documentation/data_and_terminology.md`):
+* **Data Model & Terminology** (`docs/data_and_terminology.md`):
Attributes, schema specifications, and `domain.yaml` format.
-* **Processing Lifecycle** (`documentation/processing_lifecycle.md`):
+* **Processing Lifecycle** (`docs/processing_lifecycle.md`):
The 5-stage mathematical lifecycle shared by both code paths.
-* **Contributor Guide** (`documentation/contributors_guide.md`):
+* **Contributor Guide** (`docs/contributors_guide.md`):
Architecture, PipelineBackend programming rules, and evaluation framework.

*This is not an officially supported Google product. This project is
7 changes: 6 additions & 1 deletion docs/data_and_terminology.md
@@ -52,7 +52,12 @@ string categories. * **Boolean (`BOOL`)**: True/False binary flags. * **Enum

### 3. Record Independence (Differential Privacy Assumption)

-It is assumed that each **record** comes from different **privacy unit**.
+> [!IMPORTANT]
+> DPSynth provides record-level differential privacy: each **record** is assumed
+> to come from a different **privacy unit**. If one person or entity can
+> contribute multiple rows, callers must enforce the appropriate user-level
+> contribution bounds before running DPSynth; otherwise the guarantee is not
+> user-level DP.
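
One way a caller might meet this precondition is to reduce the input to at most one row per privacy unit before synthesis, so that every remaining record really does come from a different unit. The sketch below is an illustration only and is not part of DPSynth: the helper `one_record_per_unit` and the `user_id` column are hypothetical names, and the code relies solely on standard pandas calls.

```python
import pandas as pd


def one_record_per_unit(df, unit_col, seed=None):
    """Keep one uniformly chosen row per privacy unit (hypothetical helper)."""
    # Shuffle all rows, then keep the first row encountered for each unit.
    # The first row of a unit in a uniform shuffle is a uniform pick among
    # that unit's rows.
    shuffled = df.sample(frac=1.0, random_state=seed)
    deduped = shuffled.drop_duplicates(subset=unit_col, keep="first")
    # Restore the original row order and produce a clean index.
    return deduped.sort_index().reset_index(drop=True)


# Hypothetical input: `user_id` identifies the privacy unit.
raw_df = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "amount": [10.0, 12.5, 7.0, 3.0, 4.0, 5.5],
})
bounded_df = one_record_per_unit(raw_df, "user_id", seed=0)
print(bounded_df)
```

The row kept for each unit is chosen at random rather than by a data-dependent rule. Keeping more than one row per unit would require a separate analysis of the resulting guarantee.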

## Supported Attribute Classifications

13 changes: 6 additions & 7 deletions docs/in_memory_api.md
@@ -31,7 +31,7 @@ synthetic_df = dpsynth.generate(
epsilon: float,
delta: float,
*,
-discrete_config: discrete_mechanisms.DiscreteMechanismConfig = discrete_mechanisms.MSTConfig(),
+discrete_config: discrete_mechanisms.DiscreteMechanism = discrete_mechanisms.MSTMechanism(),
numerical_bins: int = 32,
one_way_marginal_budget_fraction: float = 0.1,
cross_attribute_constraints: list = (),
@@ -63,8 +63,7 @@ synthetic_df = dpsynth.generate(
## End-to-End Python Example

Here is a complete Python script demonstrating how to load data, parse a domain
-YAML file, configure the AIM mechanism with a fixed random seed, and generate
-synthetic records.
+YAML file, configure the AIM mechanism, and generate synthetic records.

```python
import dpsynth
@@ -80,8 +79,7 @@ attribute_domains = domain.from_yaml_file("transaction_domain.yaml")

# 3. Configure the synthesis mechanism (AIM)
aim_config = discrete_mechanisms.AIMConfig(
-seed=42,
-rounds=50,
+max_rounds=50,
pgm_iters=1000,
)

@@ -130,8 +128,9 @@ python3 bin/main.py \
* `--epsilon`, `--delta`: Total DP privacy budget.
* `--mechanism`: Supported options are `mst`, `aim`, `independent`, and
`aim_gdp`.
-* `--seed`: Integer seed for reproducible randomness across DP sampling and
-PGM inference.
+* `--seed`: Seeds NumPy's legacy global random state. The in-memory generator
+also creates `np.random.default_rng()` internally, so identical CLI
+invocations are not guaranteed to be bit-for-bit reproducible.
* `--output_path`: Destination filepath where the synthetic CSV will be
written.
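
The `--seed` caveat follows from how NumPy separates its two random APIs: seeding the legacy global state has no effect on a `Generator` created by `np.random.default_rng()` with no argument. The sketch below demonstrates this with NumPy alone; it does not reproduce DPSynth's internals, and the function names are illustrative.

```python
import numpy as np


def legacy_draw(seed):
    # Seed NumPy's legacy global RandomState, then draw from it.
    np.random.seed(seed)
    return np.random.random(3)


def unseeded_generator_draw(seed):
    # Seeding the global state does not reach a new Generator:
    # default_rng() without an argument pulls fresh entropy from the OS.
    np.random.seed(seed)
    return np.random.default_rng().random(3)


def seeded_generator_draw(seed):
    # Passing the seed to default_rng() makes the Generator reproducible.
    return np.random.default_rng(seed).random(3)


legacy_same = np.array_equal(legacy_draw(42), legacy_draw(42))
unseeded_same = np.array_equal(unseeded_generator_draw(42), unseeded_generator_draw(42))
seeded_same = np.array_equal(seeded_generator_draw(42), seeded_generator_draw(42))
print(legacy_same, unseeded_same, seeded_same)
```

The first and third comparisons are equal on every run, while the second differs except with negligible probability. That is why a seed applied only to the global state cannot guarantee bit-for-bit reproducibility.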
