18 commits
ab4378c
feat(ml): scaffold 4 neuro-symbolic realism experiment tracks
mivertowski May 20, 2026
044fdfb
feat(ml): add inverse / simulation-based-inference (SBI) track
mivertowski May 20, 2026
63c09e8
feat(ml/inverse): implement simulate.py + trim params to 5 verified s…
mivertowski May 21, 2026
701e7b2
feat(ml): descriptive corpus-vs-synthetic gap (interpretable 'what is…
mivertowski May 21, 2026
c66d8f5
feat(ml/flow): implement export_flow (COA-joined amounts, account-cla…
mivertowski May 21, 2026
5a4734b
fix(ml/flow): standardize y before the NSF (was collapsing the amount…
mivertowski May 21, 2026
1c54d55
feat(ml/sequence): implement export_sequence (factorized event-token …
mivertowski May 21, 2026
6ebfb05
docs(ml): corpus->synthetic gap findings + learning-track results
mivertowski May 21, 2026
33d40d2
docs(ml): flow v2 result — learned flow matches corpus amount density
mivertowski May 21, 2026
8b5808c
feat(ml): sequence-lift (NLL vs marginal) + inverse make_base + 3-kno…
mivertowski May 21, 2026
54e4bf9
fix(ml/inverse): log_normal enum (not lognormal) + record sequence +3…
mivertowski May 21, 2026
851f1dc
feat(ml/surrogate): grounded surrogate + CMA-ES (match corpus via cam…
mivertowski May 21, 2026
1b8fbf5
docs(ml): inverse SBI result — amount_mu + fraud_rate recoverable, si…
mivertowski May 21, 2026
4b93304
fix(ml/surrogate): drop heavy-tailed features + clip + cache corpus_x
mivertowski May 21, 2026
aec959a
docs(ml): surrogate result (honest) + close out the 4-track study
mivertowski May 21, 2026
140cf2b
feat(ml/inverse): apply.py — posterior over the params a GL came from…
mivertowski May 21, 2026
b067e28
docs(ml): inverse capstone — posterior is degenerate on the real corp…
mivertowski May 21, 2026
18985ae
docs(ml): legal scrub of the ML experiment artifacts
mivertowski May 21, 2026
32 changes: 32 additions & 0 deletions experiments/ml/.gitignore
@@ -0,0 +1,32 @@
# DataSynth ML experiments — keep ALL corpus-derived artifacts out of git.
# Only code (*.py) and specs (*.md, requirements.txt) are tracked.

# Exported training tensors / parquet (corpus-derived)
data/
*.parquet
*.npz
*.npy
*.pt
*.pth
*.ckpt
*.safetensors

# Trained weights + run outputs (treat as sensitive as the corpus until
# memorization-reviewed — see README § Privacy)
weights/
runs/
checkpoints/
lightning_logs/
wandb/
*.log

# Any run config that carries a corpus path
*.local.yaml
*.local.json
config.local.*

# Python env
.venv/
__pycache__/
*.pyc
.ipynb_checkpoints/
140 changes: 140 additions & 0 deletions experiments/ml/FINDINGS.md
@@ -0,0 +1,140 @@
# Corpus → synthetic gap: what's missing, and what the learning tracks recover

A100 study (2026-05-21), DataSynth v5.27. Goal: **learn from the corpus what
the synthetic generator is missing** by comparing the aggregated corpus (53.4M
JE lines, 11.8M JEs) against the v5.27 engine. All learning uses the corpus and
runs on the private box; weights stay on-box (memorization rule). The results
serve as paper grounding and as generator-optimization targets.

## 1. What's missing (descriptive, corpus vs synthetic)

Raw observables — interpretable units, not normalized DRs:

| Observable | Corpus | Synthetic | Gap |
|---|--:|--:|---|
| Source diversity (entropy / count) | 3.37 / 4,504 | 0.75 / 4 | synthetic **far too concentrated** (one source ≈ 75%) |
| Inter-event-time **std** (days) | 0.0169 | 0.00028 | synthetic **~60× too regular** (irregular-gap structure absent) |
| Amount **p99** | $33k | $542k | synthetic tail **~16× too fat** |
| log-amount std / skew | 2.46 / 0.56 | 3.43 / 0.99 | synthetic **over-dispersed, over-skewed** |
| Lines per JE (mean) | 4.5 | 10.3 | synthetic JEs **~2.3× too large** |
| Benford MAD | 0.0081 | 0.0057 | synthetic slightly *more* Benford-clean than the corpus |

Top generator-optimization targets: **(a)** amount density (tail + spread),
**(b)** IET-variance / lines-per-JE structure, **(c)** source-mix breadth.
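
As a rough illustration of how observables like those in the table can be computed, here is a minimal stdlib-only sketch. The function names are hypothetical and not taken from the repo; the entropy unit (nats) and the nearest-rank percentile rule are assumptions, since the table does not state which conventions the study used. Amounts are assumed finite.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy, in nats (assumed unit), of the empirical label distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def first_digit(x):
    """Leading non-zero decimal digit of |x|, or None when x is zero."""
    x = abs(x)
    if x == 0:
        return None
    # Scientific notation puts the leading digit first, e.g. '5.42...e+05'.
    return int(f"{x:.15e}"[0])


def benford_mad(amounts):
    """Mean absolute deviation between first-digit frequencies and Benford's law."""
    digits = [d for d in map(first_digit, amounts) if d is not None]
    counts = Counter(digits)
    n = len(digits)
    return sum(
        abs(counts.get(d, 0) / n - math.log10(1 + 1 / d)) for d in range(1, 10)
    ) / 9


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100] (assumed convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]
```

Running the same functions over corpus and synthetic amounts (and over source labels for the entropy) would give side-by-side numbers in interpretable units, which is the comparison style the table uses.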

## 2. Methodological finding — the DR eval degenerates at full corpus scale

`behavioral score` on the 53.4M-line corpus returns `is_degenerate_baseline =
true` for **every** metric: the corpus-vs-corpus 50/50 noise floor is ≈0, so
each degradation ratio divides by ~0 and saturates at the 100 cap. The
normalized composite is therefore uninformative at this scale — the descriptive
comparison (§1) is the actionable signal. **For the paper:** the DR noise-floor
needs a resampling scheme that stays non-degenerate at large N (e.g. per-entity
block bootstrap), or the composite should fall back to raw distances when the
baseline underflows.
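
A minimal sketch of the two remedies suggested above, under stated assumptions: `degradation_ratio`, `block_bootstrap_floor`, and `mean_gap` are hypothetical names, the `eps` threshold is arbitrary, and the absolute mean gap is only a stand-in for whatever distance the eval actually uses. The cap of 100 is the one value taken from the text.

```python
import random


def degradation_ratio(candidate_distance, noise_floor, cap=100.0, eps=1e-9):
    """Return (value, kind); fall back to the raw distance when the floor underflows."""
    if noise_floor < eps:
        return candidate_distance, "raw_distance"
    return min(candidate_distance / noise_floor, cap), "ratio"


def block_bootstrap_floor(values_by_entity, distance, n_rounds=200, seed=0):
    """Noise floor estimated by resampling whole entities (blocks) with replacement."""
    rng = random.Random(seed)
    entities = sorted(values_by_entity)
    total = 0.0
    for _ in range(n_rounds):
        draw_a = rng.choices(entities, k=len(entities))
        draw_b = rng.choices(entities, k=len(entities))
        sample_a = [v for e in draw_a for v in values_by_entity[e]]
        sample_b = [v for e in draw_b for v in values_by_entity[e]]
        total += distance(sample_a, sample_b)
    return total / n_rounds


def mean_gap(a, b):
    """Stand-in distance: absolute difference of sample means."""
    return abs(sum(a) / len(a) - sum(b) / len(b))
```

Resampling by entity keeps between-entity variation in the floor, which a line-level 50/50 split averages away at large N. Whether this is enough to keep the floor non-degenerate on the real corpus is untested here.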

## 3. Learning tracks — recovering the missing structure (corpus-trained)

### Flow (amount density) — `flow/`
Conditional neural-spline flow over `signed log1p(|amount|)`, conditioned on
account-class (COA join, 294/294 accounts matched). **Bug found + fixed:** the
NSF default spline domain (~[-5,5]) cannot represent corpus log-amounts (which
reach ~10.4), collapsing learned p99 to ~$142 (v1). Standardizing `y` before
the flow fixes it (v2).

| | log-amt mean | std | skew | p99 | Benford MAD |
|---|--:|--:|--:|--:|--:|
| Corpus (held-out) | 3.91 | 2.45 | 0.54 | $32,950 | 0.0086 |
| Flow v1 (un-standardized) | 2.81 | 1.45 | −0.39 | $142 | 0.0182 |
| **Flow v2 (standardized y)** | **3.89** | **2.46** | **0.54** | **$31,754** | **0.0081** |
| Synthetic (3-comp mixture) | 3.65 | 3.43 | 0.99 | $541,617 | 0.0057 |

**v2 matches the corpus amount density almost exactly** — NLL 8.96 → 0.67;
p99 $31,754 vs corpus $33,688 (within ~6%), std/skew spot-on — whereas the
current 3-component mixture overshoots p99 by ~16× and is 1.4× over-dispersed.
Headline result: a learned per-account-class flow recovers the corpus amount
distribution the shipped mixture misses. Handoff: export spline knots → candle
`AmountSampler`, or keep as a build-time density artifact.
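
The preprocessing behind the v2 fix can be sketched as follows. This is an illustration of the `signed log1p(|amount|)` transform plus standardization, not the code in `flow/`; the `Standardizer` class and function names are hypothetical.

```python
import math
import statistics


def signed_log1p(amount):
    """sign(amount) * log(1 + |amount|)."""
    return math.copysign(math.log1p(abs(amount)), amount)


def signed_expm1(y):
    """Inverse of signed_log1p."""
    return math.copysign(math.expm1(abs(y)), y)


class Standardizer:
    """Fit mean/std on training targets so the flow sees roughly unit-scale y."""

    def fit(self, ys):
        self.mean = statistics.fmean(ys)
        self.std = statistics.pstdev(ys) or 1.0  # guard against constant input
        return self

    def transform(self, y):
        return (y - self.mean) / self.std

    def inverse(self, z):
        return z * self.std + self.mean
```

For example, `signed_log1p(33000)` is about 10.4, which is outside a spline domain of roughly [-5, 5]; after standardization the same value lands well inside it. Samples drawn from the flow would then be mapped back with `inverse` followed by `signed_expm1`.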

### Sequence (event-stream temporal) — `sequence/`
Decoder-only transformer over per-(client, source) event-token streams (Δt /
line-count / account-class / weekday buckets), factorized heads. **Trains
cleanly** (loss 1.99 → 1.93 / 25 epochs over 2,500 streams) — the corpus event
structure *is* learnable. Finding: corpus dt-bucket **lag-1 autocorr = −0.118**
(only 11.6% of streams positively autocorrelated), so the corpus is **not**
strongly *sequentially* bursty at this granularity — the §1 "60×" gap is
inter-event-time **variance**, a distinct axis from autocorrelation. **Held-out
lift over an iid per-field marginal sampler: +3.37 nats/token** (account_class
+1.55, weekday +1.36, lines +0.58; Δt ≈ flat at −0.12 — Δt really is near
memoryless). So the autoregressive model captures the joint
source→account-class→line-count→weekday structure the current per-event
marginal sampler discards — the concrete case for an AR event scheduler.
Data-quality note: the corpus COA `Account Class` carries encoding-mangled
label variants inflating the class count to 397 — a cleaning target.
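
The autocorrelation statistic quoted above can be sketched per stream like this. It is a hypothetical helper, assuming each stream is a list of numeric Δt bucket indices; how the study treated short or constant streams is not stated, so skipping them is an assumption.

```python
import statistics


def lag1_autocorr(series):
    """Lag-1 autocorrelation of one stream; None if too short or constant."""
    n = len(series)
    if n < 3:
        return None
    mean = statistics.fmean(series)
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        return None
    num = sum((series[i] - mean) * (series[i + 1] - mean) for i in range(n - 1))
    return num / denom


def summarize_streams(streams):
    """Return (mean lag-1 autocorr, fraction of streams with positive autocorr)."""
    values = [r for r in map(lag1_autocorr, streams) if r is not None]
    return statistics.fmean(values), sum(r > 0 for r in values) / len(values)
```

A stream that alternates between short and long gaps has high Δt variance but negative lag-1 autocorrelation, which is why the two axes have to be reported separately.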

### Inverse SBI — run the engine backward — `inverse/`
Amortized neural posterior `q(θ | x)` (zuko NSF) over 3 tier-1 knobs
(`fraud_rate`, amount `mu`, amount `sigma`), trained on **1,000 forward-simulated
`(θ, GL-summary)` pairs (0 failures)**, validated on held-out synthetic with
simulation-based calibration + 90% credible-interval coverage:

| knob | MAE (norm) | 90% coverage | verdict |
|---|--:|--:|---|
| **amount_mu** | 0.049 | **0.92** | strongly identifiable |
| **fraud.fraud_rate** | 0.078 | **0.88** | identifiable, calibrated |
| amount_sigma | 0.209 | 0.77 | poorly identified (honest) |

A GL's amount **location** and **fraud rate** are recoverable with calibrated
uncertainty; amount **width** is not (other variance sources swamp the single
component's σ). This is the audit-analytics direction — *"the GL most likely
came from these process parameters"* — validated on synthetic before any
out-of-sample-GL use. Identifiability is gated by forward-model fidelity (the §1 gap), so
the flow/sequence work directly improves how much an inverse can recover.
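
The 90% coverage column can be read as the following check, sketched here with hypothetical helpers. It assumes that for each held-out simulation we have the true knob value and a list of posterior samples for that knob; the central (equal-tailed) interval is an assumption about how the interval was formed.

```python
import math


def central_interval(samples, level=0.90):
    """Equal-tailed credible interval taken from sorted posterior samples."""
    ordered = sorted(samples)
    n = len(ordered)
    alpha = (1.0 - level) / 2.0
    lo_idx = math.floor(alpha * (n - 1))
    hi_idx = math.ceil((1.0 - alpha) * (n - 1))
    return ordered[lo_idx], ordered[hi_idx]


def coverage(true_values, posterior_samples, level=0.90):
    """Fraction of held-out cases whose true value falls inside the interval."""
    hits = 0
    for theta, samples in zip(true_values, posterior_samples):
        lo, hi = central_interval(samples, level)
        hits += lo <= theta <= hi
    return hits / len(true_values)
```

A calibrated posterior should give coverage close to the nominal level, so 0.92 and 0.88 read as calibrated while 0.77 signals intervals that are too narrow for `amount_sigma`.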

**Capstone — posterior applied to the corpus.** Feeding the corpus's
summary into the SBC-calibrated `q(θ|x)` returns a **degenerate, boundary-pinned
posterior** — `fraud_rate→0.100`, `amount_mu→3.0`, `amount_sigma→2.6`, all with
**zero-width 90% CIs** — i.e. confidently *wrong* (corpus log-amount mean 3.92
implies `amount_mu≈6.2`). The corpus `x` is **out-of-distribution** for the
synthetic-trained posterior: the §1 gaps (source entropy, lines-per-JE tail, IET
variance) put out-of-sample GLs outside the manifold the forward model produces, so the
flow extrapolates to the prior bounds and collapses its uncertainty. This is
**"distribution shift = the BF gap" made empirical**: the inverse is
well-calibrated on synthetic (cov 0.92) yet *untrustworthy on the corpus
until the forward-fidelity gap is closed*. It is the single strongest argument
for the flow/sequence fidelity work — closing §1 is precisely what makes
backward inference on out-of-sample GLs valid. (Methodology lands; the headline number is
the negative transfer, not a recovered θ.)
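
One way to catch this failure mode automatically is a guard that flags a posterior whose interval has collapsed onto a prior bound. The sketch below is hypothetical: the function name, the 1% tolerance, and the prior bounds passed in are assumptions, not values from the study.

```python
def is_boundary_pinned(samples, prior_low, prior_high, rel_tol=0.01):
    """True if the 90% interval is near zero-width and sits at a prior bound."""
    ordered = sorted(samples)
    n = len(ordered)
    lo = ordered[int(0.05 * (n - 1))]
    hi = ordered[int(0.95 * (n - 1))]
    prior_width = prior_high - prior_low
    collapsed = (hi - lo) <= rel_tol * prior_width
    at_bound = min(lo - prior_low, prior_high - hi) <= rel_tol * prior_width
    return collapsed and at_bound
```

A flag like this does not say what the right θ is; it only marks the input summary as likely outside what the forward model can produce, so the posterior should not be reported.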

### Surrogate / tuning loop — `surrogate/`
Grounded CMA-ES: MLP surrogate `θ → distance-to-corpus` over 10 robust
observables, fit on the campaign, searched by CMA-ES. **Machinery runs
end-to-end on campaign data** (vs the scaffold `optimize.py`'s synthetic-seed
placeholder). Honest result: held-out Spearman **0.46**, and CMA-ES landed
`amount_mu` at its upper bound (10.0) rather than the corpus-implied ≈6.2 — the
single-small-generate summary stats are too noisy for the surrogate to locate
the optimum reliably. (A first attempt was worse, Spearman −0.08, until the
corpus `lpje_std=123` heavy-tail outlier was dropped from the distance + the
features clipped.) **Takeaway:** the accelerator needs a larger / lower-variance
campaign; the calibrated **inverse posterior** above is the more principled
route to "what params did the corpus come from", since `amount_mu` is strongly
identified there on synthetic data (cov 0.92). Per the capstone above, however,
that posterior is degenerate on the corpus itself, so it becomes the better
route only once the forward-fidelity gap (§1) is closed.
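
For reference, the held-out Spearman figure is a rank correlation between surrogate-predicted and actual distances. A minimal stdlib sketch with average ranks for ties follows; the function names are hypothetical and this is not the repo's evaluation code.

```python
import math
import statistics


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)
```

Because CMA-ES only needs the surrogate to order candidates correctly, rank correlation is the relevant quality measure; a value near 0.46 means many pairwise orderings are still wrong.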

## 4. GNN fraud showcase (public synthetic data) — `scripts/ml/`
Separate, publishable result (see `scripts/ml/RESULTS_v5.27.md`): binary fraud
GraphSAGE test AUC 0.909 (≈ a LogReg on edge features — graph adds little);
fraud-**typology** is near-random on the collapsed edge list (macro-F1 0.09) but
**0.58 on the line-level view** — `fraud_type` is learnable, but consumers must
join the line table.

## 5. Implications
- **Amount sampler**: the corpus tail is *thinner* and less skewed than the
synthetic mixture — the engine over-generates extreme amounts. A learned flow
(v2) or a re-fit mixture narrows this.
- **Source mix**: the engine emits ~4–24 sources vs the corpus's thousands;
source-mix breadth is a generation gap (priors bundle partially addresses it).
- **Lines per JE**: synthetic JEs are ~2.3× too large (mean 10.3 vs 4.5 lines),
so the lines-per-JE prior needs down-weighting toward the corpus mean of ~4.5.
- **Eval**: fix the DR noise-floor degeneracy at corpus scale before re-baselining.
95 changes: 95 additions & 0 deletions experiments/ml/README.md
@@ -0,0 +1,95 @@
# DataSynth ML experiments — neuro-symbolic realism

Five experiment tracks. Four try to close the behavioral-fidelity (BF) gap by
learning the **proposal distributions** of the generator while leaving the
**symbolic constraint layer** (debits = credits, A = L + E, document chains,
IC matching) untouched; the fifth runs the generator backward.

## The principle

DataSynth is a probabilistic program with hard constraints. The realism gap
lives in *what we propose* (when a JE posts, how many lines, which accounts /
entities interact, what amounts), **not** in the constraints. So every model
here is subordinate to the symbolic engine:

```
corpus ──extract──▶ training tensors ──train(A100)──▶ learned proposal

NN emits *structure / latent shape*   ──▶  symbolic decoder enforces every
(timing, line count, accounts,             invariant (balance, A=L+E, chains)
 entity edges, amount density)        ──▶  coherent synthetic output
```

The NN never emits a final balance. It emits shape; the existing Rust
generator projects that shape onto the feasible manifold. Coherence stays a
hard guarantee by construction.
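
To make the "NN proposes shape, symbolic layer enforces the invariant" split concrete, here is a toy sketch for the debits = credits constraint only. It is not the Rust generator's decoder: the function name, the integer-cents representation, and the proportional rescaling rule are all assumptions chosen for illustration.

```python
def project_to_balance(debit_cents, credit_cents):
    """Rescale proposed credit lines so total credits equal total debits exactly."""
    total_debit = sum(debit_cents)
    total_credit = sum(credit_cents)
    if total_debit <= 0 or total_credit <= 0:
        raise ValueError("both sides need a positive total")
    # Integer arithmetic keeps the result exact in cents.
    scaled = [c * total_debit // total_credit for c in credit_cents]
    # Floor division can leave a few cents over; put them on the largest line.
    remainder = total_debit - sum(scaled)
    largest = max(range(len(scaled)), key=scaled.__getitem__)
    scaled[largest] += remainder
    return scaled
```

The proposal keeps its relative shape (which lines are large, which are small) while the balance holds exactly, whatever the network emitted.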

## The five tracks

Tracks 1–4 sharpen the **forward** model (closing the BF gap); track 5 runs it
**backward** (recover the latent parameters from a GL).

| Dir | Track | Architecture | Targets |
|-----|-------|--------------|---------|
| [`gnn/`](gnn/SPEC.md) | Relational / interconnectivity sampler | GraphSAGE encoder + edge/degree decoder (GAE-style) | P3 ClusteringGap, TriangleLogRatio (TP / vendor / IC graphs) |
| [`sequence/`](sequence/SPEC.md) | Temporal / behavioral stream | Autoregressive transformer over per-(source, entity) JE token streams | P1 IETD + Autocorr, P2 JELineBurst, P4 MeanGap |
| [`flow/`](flow/SPEC.md) | Amount marginals | Conditional normalizing flow per (source, account-class) | Benford / multimodal amount fidelity |
| [`surrogate/`](surrogate/SPEC.md) | Tuning-loop accelerator | MLP surrogate of the BF composite + CMA-ES over generator knobs | *Performance* of calibration (no coherence risk — never touches generation) |
| [`inverse/`](inverse/SPEC.md) | Backward inference (SBI) | Amortized neural posterior `q(θ\|GL)` trained on forward-simulated pairs | Recover the latent process *parameters* a GL was distilled from, with calibrated uncertainty (SBC + coverage validated on synthetic) |

Start with **`gnn/`** (highest leverage on the structural gaps the hand-tuned
motif samplers can't close) or **`surrogate/`** (pure iteration-speed win,
zero coherence risk). **`inverse/`** is the audit-analytics direction — it
reuses the `flow/` density + the forward simulator's free ground truth. See
each `SPEC.md` for objective, data contract, architecture, and success
criteria.

## Privacy / legal (read before running)

The training data is **corpus-derived**. Two hard rules:

1. **Nothing corpus-derived is committed.** `data/`, `weights/`, `runs/`, and
any run config carrying a corpus path are gitignored. Only code + specs are
tracked. See [`.gitignore`](.gitignore).
2. **Models can memorize.** A GNN trained on raw entity graphs can memorize
genuine counterparty relationships; a sequence model can memorize rare
account/text patterns. Before *any* trained weight leaves the private box,
it must pass a memorization review (the GNN spec describes a k-anonymity /
DP-SGD path). Treat weights as sensitive as the corpus until reviewed.

The corpus location is supplied via the `DATASYNTH_CORPUS_DIR` environment
variable — never hard-coded, never logged. Matches the existing
`scripts/regenerate-industry-priors.sh` convention.
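
A minimal sketch of how a script might honor that rule. The helper name `corpus_dir` is hypothetical and not taken from `common/`; only the `DATASYNTH_CORPUS_DIR` variable name comes from this README.

```python
import os
from pathlib import Path


def corpus_dir(env=os.environ):
    """Resolve the corpus directory from the environment without echoing the path."""
    raw = env.get("DATASYNTH_CORPUS_DIR")
    if not raw:
        raise RuntimeError("DATASYNTH_CORPUS_DIR is not set")
    path = Path(raw)
    if not path.is_dir():
        # The message deliberately omits the path so it cannot leak into logs.
        raise RuntimeError("DATASYNTH_CORPUS_DIR does not point to a directory")
    return path
```

Keeping the path out of error messages matters because tracebacks tend to end up in `*.log` files and run outputs, which are exactly the places the gitignore rules are trying to keep clean.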

## Setup (on the A100 box, when free)

```bash
cd experiments/ml
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
export DATASYNTH_CORPUS_DIR=/path/to/private/corpus # never committed

# 1. export training tensors from the corpus (CPU, ~minutes)
python -m common.data_export --track gnn --out data/gnn

# 2. train (A100)
python -m gnn.train --data data/gnn --out weights/gnn

# 3. score the lift against the BF eval baseline
python -m common.bf_bridge --candidate weights/gnn/samples.parquet
```

## Handoff to the Rust generator

PyTorch-first: prove the metric lift Python-side, then decide per-track
whether to (a) port the learned sampler to `candle` for the shipped generator,
or (b) keep a Python sidecar that emits structure artifacts the Rust generator
consumes at build time. Recorded per track in its `SPEC.md` § Handoff.

## Status

Scaffold only — no model trained yet. Built while the A100 was occupied with
another job. Each `train.py` is runnable but the model bodies carry `TODO`
markers where corpus-schema-specific wiring lands after the first data export.
1 change: 1 addition & 0 deletions experiments/ml/common/__init__.py
@@ -0,0 +1 @@
"""Shared utilities for the DataSynth ML experiment tracks."""