mivertowski · May 20, 2026 · May 21, 2026
diff --git a/experiments/ml/.gitignore b/experiments/ml/.gitignore
@@ -0,0 +1,32 @@
+# DataSynth ML experiments — keep ALL corpus-derived artifacts out of git.
+# Only code (*.py) and specs (*.md, requirements.txt) are tracked.
+
+# Exported training tensors / parquet (corpus-derived)
+data/
+*.parquet
+*.npz
+*.npy
+*.pt
+*.pth
+*.ckpt
+*.safetensors
+
+# Trained weights + run outputs (treat as sensitive as the corpus until
+# memorization-reviewed — see README § Privacy)
+weights/
+runs/
+checkpoints/
+lightning_logs/
+wandb/
+*.log
+
+# Any run config that carries a corpus path
+*.local.yaml
+*.local.json
+config.local.*
+
+# Python env
+.venv/
+__pycache__/
+*.pyc
+.ipynb_checkpoints/
diff --git a/experiments/ml/FINDINGS.md b/experiments/ml/FINDINGS.md
@@ -0,0 +1,140 @@
+# Corpus → synthetic gap: what's missing, and what the learning tracks recover
+
+A100 study (2026-05-21), DataSynth v5.27. Goal: **learn from the corpus what
+the synthetic generator is missing** by comparing the aggregated corpus (53.4M
+JE lines, 11.8M JEs) against the v5.27 engine. All learning runs on the corpus
+on the private box; weights stay on-box (memorization rule). Outputs: paper
+grounding + generator-optimization targets.
+
+## 1. What's missing (descriptive, corpus vs synthetic)
+
+Raw observables — interpretable units, not normalized DRs:
+
+| Observable | Corpus | Synthetic | Gap |
+|---|--:|--:|---|
+| Source diversity (entropy / count) | 3.37 / 4,504 | 0.75 / 4 | synthetic **far too concentrated** (one source ≈ 75%) |
+| Inter-event-time **std** (days) | 0.0169 | 0.00028 | synthetic **~60× too regular** (irregular-gap structure absent) |
+| Amount **p99** | $33k | $542k | synthetic tail **~16× too fat** |
+| log-amount std / skew | 2.46 / 0.56 | 3.43 / 0.99 | synthetic **over-dispersed, over-skewed** |
+| Lines per JE (mean) | 4.5 | 10.3 | synthetic JEs **~2.3× too large** |
+| Benford MAD | 0.0081 | 0.0057 | synthetic slightly *more* Benford-clean than the corpus |
+
+Top generator-optimization targets: **(a)** amount density (tail + spread),
+**(b)** IET-variance / lines-per-JE structure, **(c)** source-mix breadth.
+
+## 2. Methodological finding — the DR eval degenerates at full corpus scale
+
+`behavioral score` on the 53.4M-line corpus returns `is_degenerate_baseline =
+true` for **every** metric: the corpus-vs-corpus 50/50 noise floor is ≈0, so
+each degradation ratio divides by ~0 and saturates at the 100 cap. The
+normalized composite is therefore uninformative at this scale — the descriptive
+comparison (§1) is the actionable signal. **For the paper:** the DR noise-floor
+needs a resampling scheme that stays non-degenerate at large N (e.g. per-entity
+block bootstrap), or the composite should fall back to raw distances when the
+baseline underflows.
+
+## 3. Learning tracks — recovering the missing structure (corpus-trained)
+
+### Flow (amount density) — `flow/`
+Conditional neural-spline flow over `signed log1p(|amount|)`, conditioned on
+account-class (COA join, 294/294 accounts matched). **Bug found + fixed:** the
+NSF default spline domain (~[-5,5]) cannot represent corpus log-amounts (which
+reach ~10.4), collapsing learned p99 to ~$142 (v1). Standardizing `y` before
+the flow fixes it (v2).
+
+| | log-amt mean | std | skew | p99 | Benford MAD |
+|---|--:|--:|--:|--:|--:|
+| Corpus (held-out) | 3.91 | 2.45 | 0.54 | $32,950 | 0.0086 |
+| Flow v1 (un-standardized) | 2.81 | 1.45 | −0.39 | $142 | 0.0182 |
+| **Flow v2 (standardized y)** | **3.89** | **2.46** | **0.54** | **$31,754** | **0.0081** |
+| Synthetic (3-comp mixture) | 3.65 | 3.43 | 0.99 | $541,617 | 0.0057 |
+
+**v2 matches the corpus amount density almost exactly** — NLL 8.96 → 0.67;
+p99 $31,754 vs corpus $33,688 (within ~6%), std/skew spot-on — whereas the
+current 3-component mixture overshoots p99 by ~16× and is 1.4× over-dispersed.
+Headline result: a learned per-account-class flow recovers the corpus amount
+distribution the shipped mixture misses. Handoff: export spline knots → candle
+`AmountSampler`, or keep as a build-time density artifact.
+
+### Sequence (event-stream temporal) — `sequence/`
+Decoder-only transformer over per-(client, source) event-token streams (Δt /
+line-count / account-class / weekday buckets), factorized heads. **Trains
+cleanly** (loss 1.99 → 1.93 / 25 epochs over 2,500 streams) — the corpus event
+structure *is* learnable. Finding: corpus dt-bucket **lag-1 autocorr = −0.118**
+(only 11.6% of streams positively autocorrelated), so the corpus is **not**
+strongly *sequentially* bursty at this granularity — the §1 "60×" gap is
+inter-event-time **variance**, a distinct axis from autocorrelation. **Held-out
+lift over an iid per-field marginal sampler: +3.37 nats/token** (account_class
++1.55, weekday +1.36, lines +0.58; Δt ≈ flat at −0.12 — Δt really is near
+memoryless). So the autoregressive model captures the joint
+source→account-class→line-count→weekday structure the current per-event
+marginal sampler discards — the concrete case for an AR event scheduler.
+Data-quality note: the corpus COA `Account Class` carries encoding-mangled
+label variants that inflate the class count to 397, which makes label
+normalization a cleaning target.
+
+### Inverse SBI — run the engine backward — `inverse/`
+Amortized neural posterior `q(θ | x)` (zuko NSF) over 3 tier-1 knobs
+(`fraud_rate`, amount `mu`, amount `sigma`), trained on **1,000 forward-simulated
+`(θ, GL-summary)` pairs (0 failures)**, validated on held-out synthetic with
+simulation-based calibration + 90% credible-interval coverage:
+
+| knob | MAE (norm) | 90% coverage | verdict |
+|---|--:|--:|---|
+| **amount_mu** | 0.049 | **0.92** | strongly identifiable |
+| **fraud.fraud_rate** | 0.078 | **0.88** | identifiable, calibrated |
+| amount_sigma | 0.209 | 0.77 | poorly identified (honest) |
+
+A GL's amount **location** and **fraud rate** are recoverable with calibrated
+uncertainty; amount **width** is not (other variance sources swamp the single
+component's σ). This is the audit-analytics direction — *"the GL most likely
+came from these process parameters"* — validated on synthetic before any
+out-of-sample-GL use. Identifiability is gated by forward-model fidelity (the §1 gap), so
+the flow/sequence work directly improves how much an inverse can recover.
+
+**Capstone — posterior applied to the corpus.** Feeding the corpus's
+summary into the SBC-calibrated `q(θ|x)` returns a **degenerate, boundary-pinned
+posterior** — `fraud_rate→0.100`, `amount_mu→3.0`, `amount_sigma→2.6`, all with
+**zero-width 90% CIs** — i.e. confidently *wrong* (corpus log-amount mean 3.92
+implies `amount_mu≈6.2`). The corpus `x` is **out-of-distribution** for the
+synthetic-trained posterior: the §1 gaps (source entropy, lines-per-JE tail, IET
+variance) put out-of-sample GLs outside the manifold the forward model produces, so the
+flow extrapolates to the prior bounds and collapses its uncertainty. This is
+**"distribution shift = the BF gap" made empirical**: the inverse is
+well-calibrated on synthetic (cov 0.92) yet *untrustworthy on the corpus
+until the forward-fidelity gap is closed*. It is the single strongest argument
+for the flow/sequence fidelity work — closing §1 is precisely what makes
+backward inference on out-of-sample GLs valid. (Methodology lands; the headline number is
+the negative transfer, not a recovered θ.)
+
+### Surrogate / tuning loop — `surrogate/`
+Grounded CMA-ES: MLP surrogate `θ → distance-to-corpus` over 10 robust
+observables, fit on the campaign, searched by CMA-ES. **Machinery runs
+end-to-end on campaign data** (vs the scaffold `optimize.py`'s synthetic-seed
+placeholder). Honest result: held-out Spearman **0.46**, and CMA-ES landed
+`amount_mu` at its upper bound (10.0) rather than the corpus-implied ≈6.2 — the
+single-small-generate summary stats are too noisy for the surrogate to locate
+the optimum reliably. (A first attempt was worse, Spearman −0.08, until the
+corpus `lpje_std=123` heavy-tail outlier was dropped from the distance + the
+features clipped.) **Takeaway:** the accelerator needs a larger / lower-variance
+campaign; the calibrated **inverse posterior** above is the more principled
+route to "what params did the corpus come from" — `amount_mu` is strongly
+identified there (cov 0.92), so feeding the corpus summary into `q(θ|x)` is the
+recommended next step over the distance-surrogate.
+
+## 4. GNN fraud showcase (public synthetic data) — `scripts/ml/`
+Separate, publishable result (see `scripts/ml/RESULTS_v5.27.md`): binary fraud
+GraphSAGE test AUC 0.909 (≈ a LogReg on edge features — graph adds little);
+fraud-**typology** is near-random on the collapsed edge list (macro-F1 0.09) but
+**0.58 on the line-level view** — `fraud_type` is learnable, but consumers must
+join the line table.
+
+## 5. Implications
+- **Amount sampler**: the corpus tail is *thinner* and less skewed than the
+  synthetic mixture — the engine over-generates extreme amounts. A learned flow
+  (v2) or a re-fit mixture narrows this.
+- **Source mix**: the engine emits ~4–24 sources vs the corpus's thousands;
+  source-mix breadth is a generation gap (priors bundle partially addresses it).
+- **Lines per JE**: synthetic JEs are ~2.3× too large (mean 10.3 vs 4.5), so the
+  lines-per-JE prior needs down-weighting toward the corpus mean of ~4.5.
+- **Eval**: fix the DR noise-floor degeneracy at corpus scale before re-baselining.
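The raw-distance fallback proposed in FINDINGS.md §2 can be sketched as follows. This is a minimal illustration and not the shipped eval: the function name `degradation_ratio`, the `FLOOR_EPS` threshold, and the `(value, kind)` return shape are assumptions made for the example. Only the divide-by-noise-floor form and the 100 cap come from the findings.

```python
DR_CAP = 100.0     # saturation cap reported in FINDINGS.md §2
FLOOR_EPS = 1e-6   # hypothetical underflow threshold (assumption, tune per metric)


def degradation_ratio(distance, noise_floor, eps=FLOOR_EPS, cap=DR_CAP):
    """Return (value, kind) for one metric.

    kind == "ratio": distance / noise_floor, clipped at `cap`.
    kind == "raw":   the noise floor underflowed, so the raw distance is
                     reported instead of a saturated ratio.
    """
    if noise_floor < eps:
        # Degenerate baseline: dividing would only reproduce the cap.
        return distance, "raw"
    return min(distance / noise_floor, cap), "ratio"
```

Returning the `kind` tag alongside the value keeps raw distances from being averaged into a normalized composite by accident; a caller would aggregate the two kinds separately.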
diff --git a/experiments/ml/README.md b/experiments/ml/README.md
@@ -0,0 +1,95 @@
+# DataSynth ML experiments — neuro-symbolic realism
+
+Five experiment tracks. Four try to close the behavioral-fidelity (BF) gap by
+learning the **proposal distributions** of the generator while leaving the
+**symbolic constraint layer** (debits = credits, A = L + E, document chains,
+IC matching) untouched; the fifth runs the generator backward.
+
+## The principle
+
+DataSynth is a probabilistic program with hard constraints. The realism gap
+lives in *what we propose* (when a JE posts, how many lines, which accounts /
+entities interact, what amounts), **not** in the constraints. So every model
+here is subordinate to the symbolic engine:
+
+```
+  corpus ──extract──▶ training tensors ──train(A100)──▶ learned proposal
+                                                              │
+                                                              ▼
+   NN emits *structure / latent shape*  ──▶  symbolic decoder enforces every
+   (timing, line count, accounts,            invariant (balance, A=L+E, chains)
+    entity edges, amount density)            ──▶ coherent synthetic output
+```
+
+The NN never emits a final balance. It emits shape; the existing Rust
+generator projects that shape onto the feasible manifold. Coherence stays a
+hard guarantee by construction.
+
+## The five tracks
+
+Tracks 1–4 sharpen the **forward** model (closing the BF gap); track 5 runs it
+**backward** (recover the latent parameters from a GL).
+
+| Dir | Track | Architecture | Targets |
+|-----|-------|--------------|---------|
+| [`gnn/`](gnn/SPEC.md) | Relational / interconnectivity sampler | GraphSAGE encoder + edge/degree decoder (GAE-style) | P3 ClusteringGap, TriangleLogRatio (TP / vendor / IC graphs) |
+| [`sequence/`](sequence/SPEC.md) | Temporal / behavioral stream | Autoregressive transformer over per-(source, entity) JE token streams | P1 IETD + Autocorr, P2 JELineBurst, P4 MeanGap |
+| [`flow/`](flow/SPEC.md) | Amount marginals | Conditional normalizing flow per (source, account-class) | Benford / multimodal amount fidelity |
+| [`surrogate/`](surrogate/SPEC.md) | Tuning-loop accelerator | MLP surrogate of the BF composite + CMA-ES over generator knobs | *Performance* of calibration (no coherence risk — never touches generation) |
+| [`inverse/`](inverse/SPEC.md) | Backward inference (SBI) | Amortized neural posterior `q(θ\|GL)` trained on forward-simulated pairs | Recover the latent process *parameters* a GL was distilled from, with calibrated uncertainty (SBC + coverage validated on synthetic) |
+
+Start with **`gnn/`** (highest leverage on the structural gaps the hand-tuned
+motif samplers can't close) or **`surrogate/`** (pure iteration-speed win,
+zero coherence risk). **`inverse/`** is the audit-analytics direction — it
+reuses the `flow/` density + the forward simulator's free ground truth.
+
+See each `SPEC.md` for objective, data contract, architecture, and success
+criteria.
+
+## Privacy / legal (read before running)
+
+The training data is **corpus-derived**. Two hard rules:
+
+1. **Nothing corpus-derived is committed.** `data/`, `weights/`, `runs/`, and
+   any run config carrying a corpus path are gitignored. Only code + specs are
+   tracked. See [`.gitignore`](.gitignore).
+2. **Models can memorize.** A GNN trained on raw entity graphs can memorize
+   genuine counterparty relationships; a sequence model can memorize rare
+   account/text patterns. Before *any* trained weight leaves the private box,
+   it must pass a memorization review (the GNN spec describes a k-anonymity /
+   DP-SGD path). Treat weights as sensitive as the corpus until reviewed.
+
+The corpus location is supplied via the `DATASYNTH_CORPUS_DIR` environment
+variable — never hard-coded, never logged. Matches the existing
+`scripts/regenerate-industry-priors.sh` convention.
+
+## Setup (on the A100 box, when free)
+
+```bash
+cd experiments/ml
+python -m venv .venv && source .venv/bin/activate
+pip install -r requirements.txt
+export DATASYNTH_CORPUS_DIR=/path/to/private/corpus   # never committed
+
+# 1. export training tensors from the corpus (CPU, ~minutes)
+python -m common.data_export --track gnn --out data/gnn
+
+# 2. train (A100)
+python -m gnn.train --data data/gnn --out weights/gnn
+
+# 3. score the lift against the BF eval baseline
+python -m common.bf_bridge --candidate weights/gnn/samples.parquet
+```
+
+## Handoff to the Rust generator
+
+PyTorch-first: prove the metric lift Python-side, then decide per-track
+whether to (a) port the learned sampler to `candle` for the shipped generator,
+or (b) keep a Python sidecar that emits structure artifacts the Rust generator
+consumes at build time. Recorded per track in its `SPEC.md` § Handoff.
+
+## Status
+
+Scaffold only — no model trained yet. Built while the A100 was occupied with
+another job. Each `train.py` is runnable but the model bodies carry `TODO`
+markers where corpus-schema-specific wiring lands after the first data export.
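The README's rule that the corpus location comes from `DATASYNTH_CORPUS_DIR` and is never hard-coded or logged could be enforced with a small helper along these lines. This is a sketch under assumptions: the helper name `corpus_dir` and its error messages are hypothetical and do not appear in the diff.

```python
import os
from pathlib import Path

ENV_VAR = "DATASYNTH_CORPUS_DIR"  # variable name taken from the README


def corpus_dir() -> Path:
    """Resolve the corpus directory from the environment.

    Error messages name the variable but never echo its value, so the
    private path cannot leak into logs or tracebacks via this helper.
    """
    raw = os.environ.get(ENV_VAR)
    if not raw:
        raise RuntimeError(f"{ENV_VAR} is not set")
    path = Path(raw)
    if not path.is_dir():
        # Deliberately omit the path itself from the message.
        raise RuntimeError(f"{ENV_VAR} does not point to a directory")
    return path
```

Export and training scripts would call `corpus_dir()` once at startup rather than accepting a corpus path as a CLI flag, which keeps the path out of shell history and run configs.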
diff --git a/experiments/ml/common/__init__.py b/experiments/ml/common/__init__.py
@@ -0,0 +1 @@
+"""Shared utilities for the DataSynth ML experiment tracks."""