From ab4378cdc9a4885debe7b73e076f8fdd15a4215a Mon Sep 17 00:00:00 2001
From: Michael Ivertowski <mivertowski@outlook.com>
Date: Wed, 20 May 2026 16:09:49 +0200
Subject: [PATCH 01/18] feat(ml): scaffold 4 neuro-symbolic realism experiment
 tracks
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Runnable PyTorch scaffolding + per-track specs under experiments/ml/ for
closing the behavioral-fidelity gap by learning the generator's *proposal
distributions* while leaving the symbolic constraint layer (debits=credits,
A=L+E, document chains, IC matching) untouched. The NN proposes structure;
the existing Rust engine enforces every invariant — coherence stays a hard
guarantee.

Tracks:
- gnn/       GraphSAGE+GAE relational sampler → P3 ClusteringGap / TriangleLogRatio
- sequence/  causal transformer over JE event streams → P1 IETD/Autocorr, P2, P4
- flow/      conditional neural-spline flow for amounts → Benford / multimodal tails
- surrogate/ MLP eval-surrogate + CMA-ES knob search → faster calibration loop
             (performance only, zero coherence risk — never touches generation)

common/ holds the shared corpus→tensor exporter and the BF-eval bridge
(canonical Rust scorer + a fast Python approximation for in-loop use).

Scaffold only — no model trained. Built to run on an A100 when free; this
dev box OOMs on orchestrator-scale work. Each train.py is runnable; model
bodies carry TODO markers where corpus-schema wiring lands after the first
data export.

Privacy: corpus path via DATASYNTH_CORPUS_DIR (never hard-coded); data/
weights/run-configs gitignored; per-track memorization review (k-anon /
DP-SGD path in the GNN spec) required before any weight leaves the private
box. No corpus content committed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
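Reviewer note: the split described in the commit message (the NN proposes a shape, the symbolic layer fixes the exact values) can be pictured with a small sketch. It is an illustration under assumptions only. `balance_lines` is a hypothetical helper written for this note, not the Rust engine's balancing step, and absorbing the rounding residual on the largest credit line is an assumed policy.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def balance_lines(debits, credits):
    """Rescale proposed credit magnitudes so credits equal debits to the cent.

    Hypothetical illustration: the proposal supplies relative magnitudes,
    and this step fixes the exact values so the entry balances.
    """
    debits = [Decimal(str(d)).quantize(CENT, ROUND_HALF_UP) for d in debits]
    raw_credits = [Decimal(str(c)) for c in credits]
    total_debit = sum(debits, Decimal("0"))
    total_credit = sum(raw_credits, Decimal("0"))
    if total_debit <= 0 or total_credit <= 0:
        raise ValueError("need positive debit and credit totals")

    # Preserve the proposed proportions between credit lines.
    scale = total_debit / total_credit
    scaled = [(c * scale).quantize(CENT, ROUND_HALF_UP) for c in raw_credits]

    # Rounding can leave the totals a few cents apart; put the residual on
    # the largest credit line (assumed policy) so the sums match exactly.
    residual = total_debit - sum(scaled, Decimal("0"))
    largest = max(range(len(scaled)), key=lambda i: scaled[i])
    scaled[largest] += residual
    return debits, scaled
```

Calling `balance_lines([1000.0, 250.5], [700, 300, 333.33])` returns credit lines in roughly the proposed proportions whose total equals the debit total, which is the sense in which a learned proposal cannot break the balance invariant.
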
 experiments/ml/.gitignore             |  32 ++++++
 experiments/ml/README.md              |  88 +++++++++++++++
 experiments/ml/common/__init__.py     |   1 +
 experiments/ml/common/bf_bridge.py    | 139 ++++++++++++++++++++++++
 experiments/ml/common/data_export.py  | 147 ++++++++++++++++++++++++++
 experiments/ml/common/schema.py       |  65 ++++++++++++
 experiments/ml/flow/SPEC.md           |  57 ++++++++++
 experiments/ml/flow/__init__.py       |   1 +
 experiments/ml/flow/model.py          |  52 +++++++++
 experiments/ml/flow/train.py          |  62 +++++++++++
 experiments/ml/gnn/SPEC.md            |  80 ++++++++++++++
 experiments/ml/gnn/__init__.py        |   1 +
 experiments/ml/gnn/model.py           |  83 +++++++++++++++
 experiments/ml/gnn/sample.py          |  38 +++++++
 experiments/ml/gnn/train.py           |  82 ++++++++++++++
 experiments/ml/requirements.txt       |  33 ++++++
 experiments/ml/sequence/SPEC.md       |  62 +++++++++++
 experiments/ml/sequence/__init__.py   |   1 +
 experiments/ml/sequence/model.py      |  85 +++++++++++++++
 experiments/ml/sequence/train.py      |  61 +++++++++++
 experiments/ml/surrogate/SPEC.md      |  59 +++++++++++
 experiments/ml/surrogate/__init__.py  |   1 +
 experiments/ml/surrogate/knobs.py     |  55 ++++++++++
 experiments/ml/surrogate/optimize.py  | 103 ++++++++++++++++++
 experiments/ml/surrogate/surrogate.py |  71 +++++++++++++
 25 files changed, 1459 insertions(+)
 create mode 100644 experiments/ml/.gitignore
 create mode 100644 experiments/ml/README.md
 create mode 100644 experiments/ml/common/__init__.py
 create mode 100644 experiments/ml/common/bf_bridge.py
 create mode 100644 experiments/ml/common/data_export.py
 create mode 100644 experiments/ml/common/schema.py
 create mode 100644 experiments/ml/flow/SPEC.md
 create mode 100644 experiments/ml/flow/__init__.py
 create mode 100644 experiments/ml/flow/model.py
 create mode 100644 experiments/ml/flow/train.py
 create mode 100644 experiments/ml/gnn/SPEC.md
 create mode 100644 experiments/ml/gnn/__init__.py
 create mode 100644 experiments/ml/gnn/model.py
 create mode 100644 experiments/ml/gnn/sample.py
 create mode 100644 experiments/ml/gnn/train.py
 create mode 100644 experiments/ml/requirements.txt
 create mode 100644 experiments/ml/sequence/SPEC.md
 create mode 100644 experiments/ml/sequence/__init__.py
 create mode 100644 experiments/ml/sequence/model.py
 create mode 100644 experiments/ml/sequence/train.py
 create mode 100644 experiments/ml/surrogate/SPEC.md
 create mode 100644 experiments/ml/surrogate/__init__.py
 create mode 100644 experiments/ml/surrogate/knobs.py
 create mode 100644 experiments/ml/surrogate/optimize.py
 create mode 100644 experiments/ml/surrogate/surrogate.py
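
Reviewer note on the degradation ratio used by `common/bf_bridge.py`: the module docstring defines DR = d(synth, corpus) / noise_floor, with the noise floor taken as a corpus-vs-corpus distance under resampling. The sketch below shows one way to compute that in NumPy. The function names, the random half-split scheme, and the use of the median are assumptions made for illustration; the canonical Rust eval may compute the floor differently.

```python
import numpy as np


def js_divergence(a, b, bins=64):
    """Jensen-Shannon divergence of two samples on shared histogram bins."""
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    if hi <= lo:
        return 0.0
    edges = np.linspace(lo, hi, bins + 1)
    pa = np.histogram(a, edges)[0].astype(float)
    pb = np.histogram(b, edges)[0].astype(float)
    pa /= pa.sum()
    pb /= pb.sum()
    m = 0.5 * (pa + pb)

    def kl(p, q):
        mask = p > 0  # q = m is positive wherever p is positive
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    return 0.5 * kl(pa, m) + 0.5 * kl(pb, m)


def noise_floor(corpus, n_resamples=50, seed=0):
    """Median divergence between random disjoint halves of the corpus sample."""
    rng = np.random.default_rng(seed)
    half = len(corpus) // 2
    dists = []
    for _ in range(n_resamples):
        perm = rng.permutation(len(corpus))
        dists.append(js_divergence(corpus[perm[:half]], corpus[perm[half:]]))
    return float(np.median(dists))


def degradation_ratio(synth, corpus, **kwargs):
    """d(synth, corpus) divided by the resampled corpus-vs-corpus floor."""
    floor = max(noise_floor(corpus, **kwargs), 1e-12)
    return js_divergence(synth, corpus) / floor
```

Because the halves are smaller than the full sample, this floor is somewhat inflated relative to a full-size comparison, which is probably what a volume correction would need to address.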

diff --git a/experiments/ml/.gitignore b/experiments/ml/.gitignore
new file mode 100644
index 00000000..2c91598b
--- /dev/null
+++ b/experiments/ml/.gitignore
@@ -0,0 +1,32 @@
+# DataSynth ML experiments — keep ALL corpus-derived artifacts out of git.
+# Only code (*.py) and specs (*.md, requirements.txt) are tracked.
+
+# Exported training tensors / parquet (corpus-derived)
+data/
+*.parquet
+*.npz
+*.npy
+*.pt
+*.pth
+*.ckpt
+*.safetensors
+
+# Trained weights + run outputs (treat as sensitive as the corpus until
+# memorization-reviewed — see README § Privacy)
+weights/
+runs/
+checkpoints/
+lightning_logs/
+wandb/
+*.log
+
+# Any run config that carries a corpus path
+*.local.yaml
+*.local.json
+config.local.*
+
+# Python env
+.venv/
+__pycache__/
+*.pyc
+.ipynb_checkpoints/
diff --git a/experiments/ml/README.md b/experiments/ml/README.md
new file mode 100644
index 00000000..7c306d0b
--- /dev/null
+++ b/experiments/ml/README.md
@@ -0,0 +1,88 @@
+# DataSynth ML experiments — neuro-symbolic realism
+
+Four experiment tracks that try to close the behavioral-fidelity (BF) gap by
+learning the **proposal distributions** of the generator while leaving the
+**symbolic constraint layer** (debits = credits, A = L + E, document chains,
+IC matching) untouched.
+
+## The principle
+
+DataSynth is a probabilistic program with hard constraints. The realism gap
+lives in *what we propose* (when a JE posts, how many lines, which accounts /
+entities interact, what amounts), **not** in the constraints. So every model
+here is subordinate to the symbolic engine:
+
+```
+  corpus ──extract──▶ training tensors ──train(A100)──▶ learned proposal
+                                                              │
+                                                              ▼
+   NN emits *structure / latent shape*  ──▶  symbolic decoder enforces every
+   (timing, line count, accounts,            invariant (balance, A=L+E, chains)
+    entity edges, amount density)            ──▶ coherent synthetic output
+```
+
+The NN never emits a final balance. It emits shape; the existing Rust
+generator projects that shape onto the feasible manifold. Coherence stays a
+hard guarantee by construction.
+
+## The four tracks
+
+| Dir | Track | Architecture | BF metrics targeted |
+|-----|-------|--------------|---------------------|
+| [`gnn/`](gnn/SPEC.md) | Relational / interconnectivity sampler | GraphSAGE encoder + edge/degree decoder (GAE-style) | P3 ClusteringGap, TriangleLogRatio (TP / vendor / IC graphs) |
+| [`sequence/`](sequence/SPEC.md) | Temporal / behavioral stream | Autoregressive transformer over per-(source, entity) JE token streams | P1 IETD + Autocorr, P2 JELineBurst, P4 MeanGap |
+| [`flow/`](flow/SPEC.md) | Amount marginals | Conditional normalizing flow per (source, account-class) | Benford / multimodal amount fidelity |
+| [`surrogate/`](surrogate/SPEC.md) | Tuning-loop accelerator | MLP surrogate of the BF composite + CMA-ES over generator knobs | *Performance* of calibration (no coherence risk — never touches generation) |
+
+Start with **`gnn/`** (highest leverage on the structural gaps the hand-tuned
+motif samplers can't close) or **`surrogate/`** (pure iteration-speed win,
+zero coherence risk). See each `SPEC.md` for objective, data contract,
+architecture, and success criteria.
+
+## Privacy / legal (read before running)
+
+The training data is **corpus-derived**. Two hard rules:
+
+1. **Nothing corpus-derived is committed.** `data/`, `weights/`, `runs/`, and
+   any run config carrying a corpus path are gitignored. Only code + specs are
+   tracked. See [`.gitignore`](.gitignore).
+2. **Models can memorize.** A GNN trained on raw entity graphs can memorize
+   real counterparty relationships; a sequence model can memorize rare
+   account/text patterns. Before *any* trained weight leaves the private box,
+   it must pass a memorization review (the GNN spec describes a k-anonymity /
+   DP-SGD path). Treat weights as sensitive as the corpus until reviewed.
+
+The corpus location is supplied via the `DATASYNTH_CORPUS_DIR` environment
+variable — never hard-coded, never logged. Matches the existing
+`scripts/regenerate-industry-priors.sh` convention.
+
+## Setup (on the A100 box, when free)
+
+```bash
+cd experiments/ml
+python -m venv .venv && source .venv/bin/activate
+pip install -r requirements.txt
+export DATASYNTH_CORPUS_DIR=/path/to/private/corpus   # never committed
+
+# 1. export training tensors from the corpus (CPU, ~minutes)
+python -m common.data_export --track gnn --out data/gnn
+
+# 2. train (A100)
+python -m gnn.train --data data/gnn --out weights/gnn
+
+# 3. score the lift against the BF eval baseline
+python -m common.bf_bridge --candidate weights/gnn/samples.parquet
+```
+
+## Handoff to the Rust generator
+
+PyTorch-first: prove the metric lift Python-side, then decide per-track
+whether to (a) port the learned sampler to `candle` for the shipped generator,
+or (b) keep a Python sidecar that emits structure artifacts the Rust generator
+consumes at build time. Recorded per track in its `SPEC.md` § Handoff.
+
+## Status
+
+Scaffold only — no model trained yet. Built while the A100 was occupied with
+another job. Each `train.py` is runnable but the model bodies carry `TODO`
+markers where corpus-schema-specific wiring lands after the first data export.
diff --git a/experiments/ml/common/__init__.py b/experiments/ml/common/__init__.py
new file mode 100644
index 00000000..368e9a73
--- /dev/null
+++ b/experiments/ml/common/__init__.py
@@ -0,0 +1 @@
+"""Shared utilities for the DataSynth ML experiment tracks."""
diff --git a/experiments/ml/common/bf_bridge.py b/experiments/ml/common/bf_bridge.py
new file mode 100644
index 00000000..12d84cc5
--- /dev/null
+++ b/experiments/ml/common/bf_bridge.py
@@ -0,0 +1,139 @@
+"""Bridge to the behavioral-fidelity (BF) eval.
+
+Two scoring paths:
+
+1. `score_canonical(candidate_dir)` — shells out to the Rust
+   `datasynth-data behavioral score`, the single source of truth for the
+   composite. Use for final lift numbers.
+
+2. `score_fast(corpus_df, candidate_df)` — a lightweight Python
+   re-implementation of the headline degradation ratios (IETD, JELineBurst,
+   MeanGap, clustering) for *in-loop* use where shelling out per iteration is
+   too slow (the surrogate track trains on these). It is an APPROXIMATION —
+   always confirm a candidate with `score_canonical` before believing a win.
+
+Degradation ratio (DR), per the eval: DR = d(synth, corpus) / noise_floor,
+where the noise floor is the corpus-vs-corpus distance under resampling.
+DR = 1.0 means "indistinguishable from a fresh corpus draw"; higher = worse.
+The composite is the (volume-corrected) mean / median of per-metric DRs.
+"""
+
+from __future__ import annotations
+
+import json
+import shutil
+import subprocess
+from pathlib import Path
+
+import numpy as np
+
+
+# --------------------------------------------------------------------------
+# 1. Canonical scorer — defer to the Rust eval.
+# --------------------------------------------------------------------------
+def _find_cli() -> str:
+    for cand in ("./target/release/datasynth-data", "datasynth-data"):
+        if shutil.which(cand) or Path(cand).exists():
+            return cand
+    raise FileNotFoundError(
+        "datasynth-data binary not found — build with "
+        "`cargo build --release -p datasynth-cli` or add it to PATH"
+    )
+
+
+def score_canonical(candidate_dir: Path, profile: str = "gl-source-tp") -> dict:
+    """Run the Rust BF eval on a generated candidate archive.
+
+    Returns the parsed composite report. The exact subcommand/flags can drift;
+    confirm with `datasynth-data behavioral --help`. We capture JSON output.
+    """
+    cli = _find_cli()
+    cmd = [
+        cli, "behavioral", "score",
+        "--synthetic", str(candidate_dir),
+        "--profile", profile,
+        "--format", "json",
+    ]
+    print(f"[bf_bridge] $ {' '.join(cmd)}")
+    out = subprocess.run(cmd, capture_output=True, text=True)
+    if out.returncode != 0:
+        raise RuntimeError(f"behavioral score failed:\n{out.stderr}")
+    return json.loads(out.stdout)
+
+
+# --------------------------------------------------------------------------
+# 2. Fast in-loop approximation (Python). APPROXIMATE — see module docstring.
+# --------------------------------------------------------------------------
+def _hist_distance(a: np.ndarray, b: np.ndarray, bins: int = 64) -> float:
+    """Symmetric histogram Jensen-Shannon divergence on a shared support."""
+    lo = float(min(a.min(), b.min()))
+    hi = float(max(a.max(), b.max()))
+    if hi <= lo:
+        return 0.0
+    edges = np.linspace(lo, hi, bins + 1)
+    pa, _ = np.histogram(a, edges, density=True)
+    pb, _ = np.histogram(b, edges, density=True)
+    pa = pa / (pa.sum() + 1e-12)
+    pb = pb / (pb.sum() + 1e-12)
+    m = 0.5 * (pa + pb)
+
+    def _kl(p, q):
+        mask = p > 0
+        return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + 1e-12))))
+
+    return 0.5 * _kl(pa, m) + 0.5 * _kl(pb, m)
+
+
+def _iet_days(df) -> np.ndarray:
+    """Inter-event times (days) within (source, entity) streams."""
+    g = df.sort_values("entry_date").groupby(["source", "trading_partner"])
+    out = []
+    for _, sub in g:
+        dates = sub["entry_date"].values.astype("datetime64[D]")
+        if len(dates) >= 2:
+            out.append(np.diff(dates).astype("timedelta64[D]").astype(float))
+    return np.concatenate(out) if out else np.array([0.0])
+
+
+def _lines_per_je(df) -> np.ndarray:
+    return df.groupby("je_number").size().to_numpy(dtype=float)
+
+
+def score_fast(corpus_df, candidate_df) -> dict:
+    """Approximate per-metric DRs from two pandas frames.
+
+    Noise floor here is a crude constant per metric; the canonical eval
+    computes it by corpus resampling. Intended to give CMA-ES / the
+    surrogate a smooth, roughly rank-preserving signal between full evals.
+    """
+    # IETD (P1) and JELineBurst (P2): JS divergence. MeanGap (P4): |Δ mean IET|.
+    iet_corpus, iet_cand = _iet_days(corpus_df), _iet_days(candidate_df)
+    ietd = _hist_distance(iet_corpus, iet_cand)
+    burst = _hist_distance(_lines_per_je(corpus_df), _lines_per_je(candidate_df))
+    mean_gap = abs(iet_corpus.mean() - iet_cand.mean())
+
+    # TODO(P3 clustering): port the TP co-occurrence triangle/clustering
+    # distance once the GNN export lands (shares the edge builder).
+    noise = {"IETD": 0.02, "JELineBurst": 0.05, "MeanGap": 1.0}
+    return {
+        "IETD": ietd / noise["IETD"],
+        "JELineBurst": burst / noise["JELineBurst"],
+        "MeanGap": mean_gap / noise["MeanGap"],
+    }
+
+
+def composite(drs: dict) -> dict:
+    vals = np.array(list(drs.values()), dtype=float)
+    return {"mean": float(vals.mean()), "median": float(np.median(vals))}
+
+
+if __name__ == "__main__":
+    import argparse
+
+    ap = argparse.ArgumentParser(description=__doc__)
+    ap.add_argument("--candidate", type=Path, required=True,
+                    help="generated candidate archive dir for the canonical eval")
+    ap.add_argument("--profile", default="gl-source-tp")
+    args = ap.parse_args()
+    report = score_canonical(args.candidate, args.profile)
+    print(json.dumps(report, indent=2))
diff --git a/experiments/ml/common/data_export.py b/experiments/ml/common/data_export.py
new file mode 100644
index 00000000..df6a9645
--- /dev/null
+++ b/experiments/ml/common/data_export.py
@@ -0,0 +1,147 @@
+"""Export per-track training tensors from the corpus.
+
+Reads corpus parquet from `$DATASYNTH_CORPUS_DIR` (never hard-coded), applies
+the `ColumnMap`, and writes track-specific artifacts under `--out`. All
+outputs are gitignored.
+
+Usage:
+    export DATASYNTH_CORPUS_DIR=/path/to/private/corpus
+    python -m common.data_export --track gnn      --out data/gnn
+    python -m common.data_export --track sequence --out data/sequence
+    python -m common.data_export --track flow     --out data/flow
+
+Design notes
+------------
+* CPU-only, streaming where possible — safe to run on a laptop; does NOT
+  invoke the orchestrator (which OOMs small boxes).
+* Emits ONLY aggregated / structural tensors, never row-level corpus text.
+  The GNN track in particular emits an *anonymized* edge index (integer node
+  ids), so committed-by-accident artifacts would still carry no names — but
+  they are gitignored regardless.
+"""
+
+from __future__ import annotations
+
+import argparse
+import os
+import sys
+from pathlib import Path
+
+from .schema import ColumnMap
+
+
+def _corpus_dir() -> Path:
+    d = os.environ.get("DATASYNTH_CORPUS_DIR")
+    if not d:
+        sys.exit(
+            "ERROR: set DATASYNTH_CORPUS_DIR to the private corpus directory "
+            "(never hard-code it)."
+        )
+    p = Path(d)
+    if not p.is_dir():
+        sys.exit("ERROR: DATASYNTH_CORPUS_DIR is not a directory (path withheld)")
+    return p
+
+
+def _load_je_frame(corpus: Path, cols: ColumnMap):
+    """Load the JE-line table as a pandas DataFrame with canonical columns.
+
+    The corpus ships one parquet per client; we concatenate. TODO after the
+    first run: confirm the per-client file glob + any client-id column to
+    keep entity namespaces disjoint across clients (see SP3.11 namespace
+    canonicalisation).
+    """
+    import pandas as pd
+    import pyarrow.parquet as pq
+
+    files = sorted(corpus.glob("JE_*.parquet"))
+    if not files:
+        sys.exit(f"ERROR: no JE_*.parquet under {corpus}")
+    frames = []
+    rename = {getattr(cols, f): f for f in cols.__dataclass_fields__}  # noqa: SLF001
+    for fp in files:
+        tbl = pq.read_table(fp)
+        present = [c for c in tbl.column_names if c in rename]
+        df = tbl.select(present).to_pandas().rename(columns=rename)
+        df["__client__"] = fp.stem  # keep namespaces disjoint
+        frames.append(df)
+    return pd.concat(frames, ignore_index=True)
+
+
+# --------------------------------------------------------------------------
+# Track exporters. Each writes a small set of .pt / .parquet artifacts.
+# --------------------------------------------------------------------------
+def export_gnn(corpus: Path, cols: ColumnMap, out: Path) -> None:
+    """Edge list + node features for the relational graph(s).
+
+    Builds three graphs the symbolic motif samplers approximate today:
+      * TP co-occurrence (trading partners sharing a JE / source)
+      * vendor / counterparty network (from gl_account ↔ trading_partner)
+      * IC bilateral edges (cross-client matched flows)
+
+    Emits anonymized integer node ids + a node-degree / source-mix feature
+    matrix. See gnn/SPEC.md § Data.
+    """
+    raise NotImplementedError(
+        "TODO(gnn): build edge_index + node features. Pseudocode in "
+        "gnn/SPEC.md § Data — node = (client, trading_partner); edge weight = "
+        "co-occurrence count; node feat = [degree, source-mix histogram, "
+        "active-window length]. Write out/edge_index.pt, out/node_feat.pt, "
+        "out/node_ids.parquet (anonymized)."
+    )
+
+
+def export_sequence(corpus: Path, cols: ColumnMap, out: Path) -> None:
+    """Per-(source, entity) ordered event streams → token tensors.
+
+    Token = (Δt-bucket, line-count-bucket, account-class, weekday, hour-band).
+    See sequence/SPEC.md § Data.
+    """
+    raise NotImplementedError(
+        "TODO(sequence): group by (client, source, trading_partner), sort by "
+        "entry_date, derive inter-event Δt + per-JE line count, bucketize, and "
+        "write out/streams.pt (padded) + out/vocab.json."
+    )
+
+
+def export_flow(corpus: Path, cols: ColumnMap, out: Path) -> None:
+    """Per-(source, account-class) amount samples + conditioning features.
+
+    See flow/SPEC.md § Data.
+    """
+    raise NotImplementedError(
+        "TODO(flow): collect log|amount| per (source, account_class), plus "
+        "conditioning one-hots; write out/amounts.parquet."
+    )
+
+
+EXPORTERS = {
+    "gnn": export_gnn,
+    "sequence": export_sequence,
+    "flow": export_flow,
+}
+
+
+def main(argv: list[str] | None = None) -> None:
+    ap = argparse.ArgumentParser(description=__doc__)
+    ap.add_argument("--track", required=True, choices=sorted(EXPORTERS))
+    ap.add_argument("--out", required=True, type=Path)
+    ap.add_argument(
+        "--column-map",
+        type=Path,
+        default=None,
+        help="local (gitignored) YAML mapping canonical->corpus column names",
+    )
+    args = ap.parse_args(argv)
+
+    cols = ColumnMap.from_yaml(str(args.column_map)) if args.column_map else ColumnMap()
+    corpus = _corpus_dir()
+    args.out.mkdir(parents=True, exist_ok=True)
+
+    print(f"[data_export] track={args.track} -> {args.out}")  # corpus path never logged
+    EXPORTERS[args.track](corpus, cols, args.out)
+    print("[data_export] done")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/experiments/ml/common/schema.py b/experiments/ml/common/schema.py
new file mode 100644
index 00000000..2288fe2f
--- /dev/null
+++ b/experiments/ml/common/schema.py
@@ -0,0 +1,65 @@
+"""Canonical JE-line schema shared by every track.
+
+These are DataSynth's *own* internal model field names (see the `Record`
+struct consumed by the behavioral-fidelity eval and CLAUDE.md), NOT
+corpus-verbatim column names. The corpus parquet may use different column
+names; map them in a local (gitignored) `config.local.yaml` rather than
+editing this file, so no corpus-specific naming lands in git.
+"""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class ColumnMap:
+    """Maps canonical field -> corpus column name.
+
+    Defaults are the canonical names; override per-corpus via
+    `ColumnMap.from_yaml("config.local.yaml")` (gitignored).
+    """
+
+    source: str = "source"
+    gl_account: str = "gl_account"
+    cost_center: str = "cost_center"
+    profit_center: str = "profit_center"
+    trading_partner: str = "trading_partner"
+    je_number: str = "je_number"
+    je_line_number: str = "je_line_number"
+    effective_date: str = "effective_date"
+    entry_date: str = "entry_date"
+    created_at: str = "created_at"
+    amount: str = "functional_amount"
+    # ISO 21378 account-class label (joined from CoA), used as the
+    # coherence key by both the symbolic generator and these models.
+    account_class: str = "account_class"
+
+    @classmethod
+    def from_yaml(cls, path: str) -> "ColumnMap":
+        import yaml  # local import keeps the module import-light
+
+        with open(path) as fh:
+            raw = yaml.safe_load(fh) or {}
+        known = {f for f in cls.__dataclass_fields__}  # noqa: SLF001
+        return cls(**{k: v for k, v in raw.items() if k in known})
+
+    def required(self) -> list[str]:
+        return [
+            self.source,
+            self.gl_account,
+            self.je_number,
+            self.je_line_number,
+            self.entry_date,
+            self.amount,
+        ]
+
+
+# Behavioral-fidelity metric families the experiments target. Kept here so
+# every track + the surrogate agree on metric identifiers.
+BF_METRICS: dict[str, list[str]] = {
+    "P1": ["IETD", "Autocorr"],
+    "P2": ["JELineBurst"],
+    "P3": ["ClusteringGap", "TriangleLogRatio"],
+    "P4": ["MeanGap"],
+}
diff --git a/experiments/ml/flow/SPEC.md b/experiments/ml/flow/SPEC.md
new file mode 100644
index 00000000..0d2129a3
--- /dev/null
+++ b/experiments/ml/flow/SPEC.md
@@ -0,0 +1,57 @@
+# Track 3 — Conditional normalizing flow for amount marginals
+
+## Objective
+
+Replace the per-(source, account-class) log-normal *mixture* with a learned
+**conditional normalizing flow** that models the multimodal amount density
+(heavy tails, round-number spikes, threshold clustering) while staying
+invertible (exact log-density, exact sampling).
+
+## Why a flow (vs log-normal mixture)
+
+The mixture has a fixed number of log-normal components; real amount
+distributions have sharp round-number atoms ($1k/$5k/$10k), regulatory
+thresholds, and fat tails that a 3-component mixture smooths over. A flow
+learns the density nonparametrically and still gives the analytic likelihood
+the eval / Benford checks want.
+
+## Data (`common.data_export --track flow`)
+
+Per JE line: `y = signed log1p(|amount|)` (sign kept as a separate Bernoulli
+conditioned on account-class), conditioning `c = one_hot(source) ⊕
+one_hot(account_class) ⊕ [is_period_end, is_fraud]`. Artifact (gitignored):
+`amounts.parquet` (y, c) — aggregated numeric, no text.
+
+## Architecture
+
+`zuko` neural spline flow (NSF), 4 transforms, conditioned on `c`:
+
+```
+base N(0,1) ──(c-conditioned rational-quadratic spline × 4)──▶ y
+```
+
+Round-number atoms are handled with a **dequantization + atom mixture**: a
+small classifier picks "round atom k vs continuous"; the flow models the
+continuous part. Keeps the spikes crisp instead of smearing them.
+
+## Sampling → handoff
+
+Sample `y | c`, invert the log1p + sign → amount. This is the cleanest port
+target: the flow is small and the inverse is closed-form per transform, so
+**porting to candle is feasible** — or export the spline knots as a lookup the
+Rust `AmountSampler` interpolates. Decide after measuring the Benford / tail
+lift. The symbolic balance step still rescales the final line set to enforce
+debits = credits (the flow sets the *distributional shape*, balance sets the
+*exact values*) — so coherence is untouched.
+
+## Success criteria
+
+* Benford MAD and amount-distribution-fit DR improve vs the mixture baseline;
+  round-number occurrence within the eval's tolerance band.
+* Tail quantiles (p99, p99.9) match the corpus within the noise floor.
+* Balance / Benford-compliance checks still pass after the symbolic rescale.
+
+## Privacy
+
+Low risk (aggregated numeric density). Guard against the flow memorizing rare
+exact large amounts: clip the training tail at a high quantile and note it.
diff --git a/experiments/ml/flow/__init__.py b/experiments/ml/flow/__init__.py
new file mode 100644
index 00000000..8887be48
--- /dev/null
+++ b/experiments/ml/flow/__init__.py
@@ -0,0 +1 @@
+"""Track 3 — conditional normalizing flow for amount marginals (zuko NSF)."""
diff --git a/experiments/ml/flow/model.py b/experiments/ml/flow/model.py
new file mode 100644
index 00000000..ff636146
--- /dev/null
+++ b/experiments/ml/flow/model.py
@@ -0,0 +1,52 @@
+"""Conditional neural-spline flow for amounts (Track 3).
+
+Thin wrapper over zuko's NSF with the round-number atom mixture described in
+SPEC.md. The continuous flow is fully runnable; the atom classifier carries a
+TODO until the corpus round-number support is exported.
+"""
+
+from __future__ import annotations
+
+import torch
+import torch.nn as nn
+
+try:
+    import zuko
+except ImportError as exc:  # pragma: no cover
+    raise ImportError("zuko required: pip install -r ../requirements.txt") from exc
+
+# Canonical round-number atoms (currency-agnostic magnitudes the symbolic
+# fraud-bias layer also uses). Continuous flow models everything else.
+ROUND_ATOMS = [1_000.0, 5_000.0, 10_000.0, 25_000.0, 50_000.0, 100_000.0]
+
+
+class ConditionalAmountFlow(nn.Module):
+    def __init__(self, cond_dim: int, transforms: int = 4, hidden=(128, 128)):
+        super().__init__()
+        # 1-D target (signed log1p amount), conditioned on c.
+        self.flow = zuko.flows.NSF(
+            features=1, context=cond_dim, transforms=transforms, hidden_features=hidden
+        )
+        # P(round atom k | c) vs continuous; index 0 = "continuous".
+        self.atom_head = nn.Sequential(
+            nn.Linear(cond_dim, 64), nn.ReLU(), nn.Linear(64, len(ROUND_ATOMS) + 1)
+        )
+
+    def log_prob(self, y: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
+        """Continuous-part log-density. Atom mixture handled in train/sample.
+
+        TODO(flow): combine with the atom classifier into a proper mixture
+        log-likelihood once round-atom membership labels are exported.
+        """
+        return self.flow(c).log_prob(y)
+
+    def sample(self, c: torch.Tensor) -> torch.Tensor:
+        atom_logits = self.atom_head(c)
+        atom = torch.distributions.Categorical(logits=atom_logits).sample()
+        cont = self.flow(c).sample()  # (B, 1) in signed-log1p space
+        out = cont.squeeze(-1).clone()
+        for k, val in enumerate(ROUND_ATOMS, start=1):
+            mask = atom == k
+            # store atoms in the same signed-log1p space for a uniform inverse
+            out[mask] = torch.log1p(torch.as_tensor(val, device=out.device))
+        return out
diff --git a/experiments/ml/flow/train.py b/experiments/ml/flow/train.py
new file mode 100644
index 00000000..b24d919d
--- /dev/null
+++ b/experiments/ml/flow/train.py
@@ -0,0 +1,62 @@
+"""Train the conditional amount flow (Track 3).
+
+    python -m flow.train --data data/flow --out weights/flow --epochs 50
+"""
+
+from __future__ import annotations
+
+import argparse
+from pathlib import Path
+
+import torch
+from torch.utils.data import DataLoader, TensorDataset
+
+from .model import ConditionalAmountFlow
+
+
+def main(argv: list[str] | None = None) -> None:
+    ap = argparse.ArgumentParser(description=__doc__)
+    ap.add_argument("--data", type=Path, required=True)
+    ap.add_argument("--out", type=Path, required=True)
+    ap.add_argument("--epochs", type=int, default=50)
+    ap.add_argument("--batch-size", type=int, default=4096)
+    ap.add_argument("--lr", type=float, default=1e-3)
+    ap.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu")
+    args = ap.parse_args(argv)
+    args.out.mkdir(parents=True, exist_ok=True)
+    dev = torch.device(args.device)
+
+    import pandas as pd
+
+    df = pd.read_parquet(args.data / "amounts.parquet")
+    y = torch.tensor(df["y"].to_numpy(), dtype=torch.float32).unsqueeze(-1)
+    c = torch.tensor(
+        df.drop(columns=["y"]).to_numpy(), dtype=torch.float32
+    )  # conditioning one-hots
+    ds = TensorDataset(y, c)
+    dl = DataLoader(ds, batch_size=args.batch_size, shuffle=True)
+
+    model = ConditionalAmountFlow(cond_dim=c.size(1)).to(dev)
+    opt = torch.optim.Adam(model.parameters(), lr=args.lr)
+
+    model.train()
+    for epoch in range(1, args.epochs + 1):
+        running = 0.0
+        for yb, cb in dl:
+            yb, cb = yb.to(dev), cb.to(dev)
+            loss = -model.log_prob(yb, cb).mean()
+            opt.zero_grad()
+            loss.backward()
+            opt.step()
+            running += loss.item()
+        print(f"epoch {epoch:3d}  nll={running/len(dl):.4f}")
+
+    torch.save({"model": model.state_dict(), "cond_dim": c.size(1)},
+               args.out / "amount_flow.pt")
+    print(f"[flow.train] saved {args.out/'amount_flow.pt'}")
+    print("TODO(flow): export spline knots for the candle AmountSampler port, "
+          "and validate Benford MAD via common.bf_bridge.")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/experiments/ml/gnn/SPEC.md b/experiments/ml/gnn/SPEC.md
new file mode 100644
index 00000000..1a8e74b9
--- /dev/null
+++ b/experiments/ml/gnn/SPEC.md
@@ -0,0 +1,80 @@
+# Track 1 — GNN relational sampler
+
+## Objective
+
+Learn the **interconnectivity structure** of the corpus's entity graphs
+(trading-partner co-occurrence, vendor/counterparty network, IC bilateral
+edges) and sample new graphs with the same motif statistics — closing the
+**P3 ClusteringGap** and **TriangleLogRatio** gaps the hand-tuned
+`CrossEntityMotifSampler` / TP motif sampler only partially close.
+
+The symbolic generator keeps ownership of *what posts on each edge* (JEs,
+amounts, balance). This model only decides *which entities connect and how
+densely* — the relational scaffold.
+
+## Why a GNN (vs the current motif sampler)
+
+The current samplers bias draws toward recent cluster-mates — a local
+heuristic. They can't represent global structure (community sizes, degree
+distribution tails, triangle density) jointly. A graph autoencoder learns a
+latent node embedding whose inner-product geometry reproduces the corpus's
+joint edge structure, so sampled graphs approach the corpus clustering +
+triangle counts through training rather than hand-tuning.
+
+## Data (`common.data_export --track gnn`)
+
+Per client (namespaces kept disjoint — see SP3.11):
+
+* **Nodes** = entities `(client, trading_partner)`; anonymized integer ids.
+* **Edges** = co-occurrence: two TPs sharing a JE or a (source, period) bucket;
+  weight = count.
+* **Node features** `x`: `[log-degree, source-mix histogram (k sources),
+  active-window length (days), mean lines-per-JE]`. All aggregated — no names,
+  no row-level text.
+
+Artifacts (gitignored): `edge_index.pt`, `edge_weight.pt`, `node_feat.pt`,
+`node_ids.parquet` (anonymized id ↔ opaque hash, stays private).
+
+## Architecture
+
+GAE-style:
+
+```
+x, edge_index ─▶ GraphSAGE(2 layers, hidden=128) ─▶ z  (node embeddings, d=64)
+sample edges:   p(i~j) = σ(zᵢ · zⱼ)             (inner-product decoder)
+```
+
+Loss: negative-sampling reconstruction (BCE on observed vs sampled non-edges)
++ a **degree-distribution KL** regularizer and a **triangle-count** penalty so
+the embedding geometry matches the corpus's P3 statistics, not just edges.
+
+Sampling: draw a degree sequence from the fitted tail, then realize edges by
+thresholding `σ(zᵢ·zⱼ)` with calibrated sparsity → new anonymized graph.
+
+## Success criteria
+
+* P3 ClusteringGap DR and TriangleLogRatio DR (Source + TP) **down ≥ 40%** vs
+  the v5.26 baseline, measured by `common.bf_bridge.score_canonical` on a full
+  generate run that consumes the sampled graph.
+* No regression > 10% on P1/P2/P4 (relational change shouldn't perturb timing).
+* Coherence unaffected — IC matching coverage + balance checks still pass
+  (they run downstream of edge selection).
+
+## Handoff to the Rust generator
+
+The model emits a **graph artifact** (anonymized edge list + per-node
+source-mix), not weights the generator must run. The Rust generator gains a
+"load relational scaffold" path that consumes this artifact at build time and
+routes entity selection through it. → keep the NN Python-side; ship the
+sampled scaffold. (Re-evaluate if we want online sampling later.)
+
+## Privacy
+
+Highest memorization risk of the four tracks — the embedding can encode real
+counterparty adjacency. Before sharing any weights or artifact off the private
+box:
+* node ids are opaque hashes (done at export);
+* apply **k-anonymity** on the degree/feature join (drop nodes with degree <
+  k) and/or **DP-SGD** (ε budget recorded in the run config);
+* memorization probe: nearest-neighbour attack on embeddings must not recover
+  held-out edges above chance. Gate in `train.py --privacy-check`.
diff --git a/experiments/ml/gnn/__init__.py b/experiments/ml/gnn/__init__.py
new file mode 100644
index 00000000..e4687513
--- /dev/null
+++ b/experiments/ml/gnn/__init__.py
@@ -0,0 +1 @@
+"""Track 1 — GNN relational sampler (GAE over entity co-occurrence graphs)."""
diff --git a/experiments/ml/gnn/model.py b/experiments/ml/gnn/model.py
new file mode 100644
index 00000000..e4408d8a
--- /dev/null
+++ b/experiments/ml/gnn/model.py
@@ -0,0 +1,83 @@
+"""Graph autoencoder for the relational sampler (Track 1).
+
+GraphSAGE encoder + inner-product decoder (Kipf & Welling GAE), with hooks for
+the degree-KL and triangle penalties described in SPEC.md. Runnable as-is on
+torch-geometric; the structural regularizers carry TODO markers where they
+need the exported corpus statistics.
+"""
+
+from __future__ import annotations
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+try:
+    from torch_geometric.nn import SAGEConv
+except ImportError as exc:  # pragma: no cover - import-time guard
+    raise ImportError(
+        "torch-geometric required: pip install -r ../requirements.txt"
+    ) from exc
+
+
+class GraphSAGEEncoder(nn.Module):
+    def __init__(self, in_dim: int, hidden: int = 128, latent: int = 64):
+        super().__init__()
+        self.conv1 = SAGEConv(in_dim, hidden)
+        self.conv2 = SAGEConv(hidden, latent)
+        self.dropout = nn.Dropout(0.1)
+
+    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
+        h = F.relu(self.conv1(x, edge_index))
+        h = self.dropout(h)
+        return self.conv2(h, edge_index)  # node embeddings z
+
+
+class InnerProductDecoder(nn.Module):
+    """p(edge i~j) = sigmoid(z_i . z_j)."""
+
+    def forward(self, z: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
+        src, dst = edge_index
+        logits = (z[src] * z[dst]).sum(dim=-1)
+        return logits  # caller applies BCEWithLogits / sigmoid
+
+
+class GraphAutoencoder(nn.Module):
+    def __init__(self, in_dim: int, hidden: int = 128, latent: int = 64):
+        super().__init__()
+        self.encoder = GraphSAGEEncoder(in_dim, hidden, latent)
+        self.decoder = InnerProductDecoder()
+
+    def encode(self, x, edge_index) -> torch.Tensor:
+        return self.encoder(x, edge_index)
+
+    def recon_loss(
+        self,
+        z: torch.Tensor,
+        pos_edge_index: torch.Tensor,
+        neg_edge_index: torch.Tensor,
+    ) -> torch.Tensor:
+        pos = self.decoder(z, pos_edge_index)
+        neg = self.decoder(z, neg_edge_index)
+        logits = torch.cat([pos, neg])
+        target = torch.cat([torch.ones_like(pos), torch.zeros_like(neg)])
+        return F.binary_cross_entropy_with_logits(logits, target)
+
+    # --- structural regularizers (SPEC.md § Architecture) -----------------
+    @staticmethod
+    def degree_kl(z: torch.Tensor, target_degree_hist: torch.Tensor) -> torch.Tensor:
+        """KL between sampled expected-degree dist and the corpus target.
+
+        TODO(gnn): expected degree of node i ≈ Σ_j σ(z_i·z_j); bucketize and
+        KL against `target_degree_hist` exported from the corpus.
+        """
+        raise NotImplementedError("degree_kl: see SPEC.md § Architecture")
+
+    @staticmethod
+    def triangle_penalty(z: torch.Tensor) -> torch.Tensor:
+        """Penalize deviation of expected triangle count from corpus.
+
+        TODO(gnn): E[triangles] from the soft adjacency σ(ZZ^T); compare to the
+        corpus TriangleLogRatio target. Keep it batched/sparse for the A100.
+        """
+        raise NotImplementedError("triangle_penalty: see SPEC.md § Architecture")
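Review sketch for the two regularizer TODOs in `gnn/model.py` (not part of the patch): both quantities can be read off the soft adjacency `σ(ZZᵀ)` if edges are treated as independent Bernoulli draws, which is an assumption of this sketch. It uses dense NumPy for clarity; the function names are hypothetical, and a training version would need torch ops and the batched/sparse form the TODO asks for.

```python
import numpy as np


def soft_adjacency(z: np.ndarray) -> np.ndarray:
    """Edge probabilities p_ij = sigmoid(z_i . z_j), with no self-loops."""
    p = 1.0 / (1.0 + np.exp(-(z @ z.T)))
    np.fill_diagonal(p, 0.0)
    return p


def expected_degree(p: np.ndarray) -> np.ndarray:
    """E[deg_i] = sum_j p_ij under independent edges."""
    return p.sum(axis=1)


def expected_triangles(p: np.ndarray) -> float:
    """E[#triangles] = trace(P^3) / 6 for symmetric P with zero diagonal."""
    return float(np.trace(p @ p @ p) / 6.0)
```

The division by 6 accounts for each triangle being counted once per starting node and direction in the closed 3-walks summed by the trace.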
diff --git a/experiments/ml/gnn/sample.py b/experiments/ml/gnn/sample.py
new file mode 100644
index 00000000..7dbbfea9
--- /dev/null
+++ b/experiments/ml/gnn/sample.py
@@ -0,0 +1,38 @@
+"""Sample a new relational scaffold from the trained GAE (Track 1).
+
+    python -m gnn.sample --weights weights/gnn/gae.pt --out weights/gnn/scaffold.parquet
+
+Emits the artifact the Rust generator consumes: an anonymized edge list +
+per-node source-mix. No corpus content — only the learned structure.
+"""
+
+from __future__ import annotations
+
+import argparse
+from pathlib import Path
+
+import torch
+
+
+def main(argv: list[str] | None = None) -> None:
+    ap = argparse.ArgumentParser(description=__doc__)
+    ap.add_argument("--weights", type=Path, required=True)
+    ap.add_argument("--data", type=Path, required=True,
+                    help="dir with node_feat.pt / edge_index.pt used at train time")
+    ap.add_argument("--out", type=Path, required=True)
+    ap.add_argument("--target-sparsity", type=float, default=None,
+                    help="calibrated edge density; default = corpus density")
+    ap.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu")
+    args = ap.parse_args(argv)
+
+    raise NotImplementedError(
+        "TODO(gnn): load GAE, encode nodes, draw a degree sequence from the "
+        "fitted tail, realize edges by thresholding sigmoid(z_i·z_j) to hit "
+        "--target-sparsity, write out/scaffold.parquet (anonymized edge list + "
+        "per-node source-mix). Then validate clustering/triangle stats with "
+        "common.bf_bridge before handing to the generator."
+    )
+
+
+if __name__ == "__main__":
+    main()
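Review sketch for the `gnn/sample.py` TODO (not part of the patch): one way to "threshold `sigmoid(z_i·z_j)` to hit `--target-sparsity`" is to keep the top-scoring pairs until the requested density is reached. This pure-Python version enumerates all pairs, so it only suits small graphs, and it deliberately omits the degree-sequence draw. `edge_prob` and `realize_edges` are hypothetical names.

```python
import math
from itertools import combinations


def edge_prob(zi, zj) -> float:
    """Inner-product decoder: sigmoid(z_i . z_j)."""
    return 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(zi, zj))))


def realize_edges(z, target_density: float):
    """Keep the top-scoring node pairs so the graph hits target_density."""
    pairs = list(combinations(range(len(z)), 2))
    n_keep = round(target_density * len(pairs))
    # Rank every candidate pair by its decoded edge probability.
    ranked = sorted(pairs, key=lambda p: edge_prob(z[p[0]], z[p[1]]), reverse=True)
    return ranked[:n_keep]
```

Selecting by rank instead of a fixed probability cutoff makes the output density exact, at the cost of an O(n²) scan over pairs.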
diff --git a/experiments/ml/gnn/train.py b/experiments/ml/gnn/train.py
new file mode 100644
index 00000000..06703586
--- /dev/null
+++ b/experiments/ml/gnn/train.py
@@ -0,0 +1,82 @@
+"""Train the relational GAE (Track 1).
+
+    python -m gnn.train --data data/gnn --out weights/gnn --epochs 200
+
+Runnable skeleton: loads the exported tensors, trains reconstruction, and
+checkpoints. The degree-KL / triangle regularizers and the --privacy-check
+gate are wired but raise until the corpus targets are exported (see SPEC.md).
+"""
+
+from __future__ import annotations
+
+import argparse
+from pathlib import Path
+
+import torch
+
+from .model import GraphAutoencoder
+
+
+def negative_sample(num_nodes: int, num_neg: int, device) -> torch.Tensor:
+    return torch.randint(0, num_nodes, (2, num_neg), device=device)
+
+
+def main(argv: list[str] | None = None) -> None:
+    ap = argparse.ArgumentParser(description=__doc__)
+    ap.add_argument("--data", type=Path, required=True)
+    ap.add_argument("--out", type=Path, required=True)
+    ap.add_argument("--epochs", type=int, default=200)
+    ap.add_argument("--lr", type=float, default=1e-3)
+    ap.add_argument("--latent", type=int, default=64)
+    ap.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu")
+    ap.add_argument("--lambda-degree", type=float, default=0.1)
+    ap.add_argument("--lambda-triangle", type=float, default=0.1)
+    ap.add_argument("--privacy-check", action="store_true",
+                    help="run the embedding nearest-neighbour memorization probe")
+    ap.add_argument("--dp-sgd", action="store_true",
+                    help="enable DP-SGD (records epsilon in run config)")
+    args = ap.parse_args(argv)
+
+    args.out.mkdir(parents=True, exist_ok=True)
+    dev = torch.device(args.device)
+
+    edge_index = torch.load(args.data / "edge_index.pt").to(dev)
+    x = torch.load(args.data / "node_feat.pt").to(dev)
+    num_nodes = x.size(0)
+
+    model = GraphAutoencoder(in_dim=x.size(1), latent=args.latent).to(dev)
+    opt = torch.optim.Adam(model.parameters(), lr=args.lr)
+
+    if args.dp_sgd:
+        raise NotImplementedError(
+            "TODO(gnn): wrap opt with opacus PrivacyEngine; persist (eps, delta) "
+            "to out/run.json before any weight leaves the box."
+        )
+
+    model.train()
+    for epoch in range(1, args.epochs + 1):
+        opt.zero_grad()
+        z = model.encode(x, edge_index)
+        neg = negative_sample(num_nodes, edge_index.size(1), dev)
+        loss = model.recon_loss(z, edge_index, neg)
+        # Structural regularizers (raise until corpus targets exported):
+        #   loss += args.lambda_degree   * model.degree_kl(z, target_hist)
+        #   loss += args.lambda_triangle * model.triangle_penalty(z)
+        loss.backward()
+        opt.step()
+        if epoch % 20 == 0:
+            print(f"epoch {epoch:4d}  recon_loss={loss.item():.4f}")
+
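+    run_args = {k: str(v) if isinstance(v, Path) else v for k, v in vars(args).items()}  # Path objects block weights_only loading
+    torch.save({"model": model.state_dict(), "args": run_args}, args.out / "gae.pt")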
+    print(f"[gnn.train] saved {args.out/'gae.pt'}")
+
+    if args.privacy_check:
+        raise NotImplementedError(
+            "TODO(gnn): nearest-neighbour membership probe — held-out edges must "
+            "not be recoverable from embeddings above chance (SPEC.md § Privacy)."
+        )
+
+
+if __name__ == "__main__":
+    main()
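Review sketch for `negative_sample` in `gnn/train.py` (not part of the patch): uniform `torch.randint` pairs can include self-loops and observed edges, which then get labelled as negatives. A rejection-sampling variant is sketched below in plain Python, assuming undirected edges. The function name is hypothetical; torch-geometric also provides `torch_geometric.utils.negative_sampling` for the tensor case.

```python
import random


def sample_negative_edges(num_nodes, pos_edges, num_neg, seed=0):
    """Draw num_neg distinct node pairs that are neither self-loops nor observed edges."""
    observed = {frozenset(e) for e in pos_edges}  # treat edges as undirected
    max_neg = num_nodes * (num_nodes - 1) // 2 - len(observed)
    if num_neg > max_neg:
        raise ValueError("not enough non-edges to sample from")
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < num_neg:
        i, j = rng.randrange(num_nodes), rng.randrange(num_nodes)
        pair = frozenset((i, j))
        # Reject self-loops and anything already present in the corpus graph.
        if i != j and pair not in observed:
            negatives.add(pair)
    return [tuple(sorted(p)) for p in negatives]
```

On sparse graphs the rejection rate is low, so the loop rarely redraws more than a handful of pairs.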
diff --git a/experiments/ml/requirements.txt b/experiments/ml/requirements.txt
new file mode 100644
index 00000000..d0088102
--- /dev/null
+++ b/experiments/ml/requirements.txt
@@ -0,0 +1,33 @@
+# DataSynth ML experiments — pin major versions; the A100 box installs the
+# CUDA build of torch separately (see README).
+#
+#   pip install torch --index-url https://download.pytorch.org/whl/cu124
+# then:
+#   pip install -r requirements.txt
+
+# Core
+torch>=2.4
+numpy>=1.26
+pandas>=2.2
+pyarrow>=16          # read corpus parquet directly
+polars>=1.0          # faster columnar pre-aggregation (optional path)
+
+# GNN track
+torch-geometric>=2.6
+
+# Flow track — zuko is a small, composable normalizing-flow lib on torch
+zuko>=1.1
+
+# Sequence track uses plain torch.nn.Transformer (no extra dep);
+# tokenizer/util only:
+einops>=0.8
+
+# Surrogate / tuning track
+cma>=3.4             # CMA-ES gradient-free optimizer
+scikit-learn>=1.5    # quick baselines + train/val splits
+
+# Eval bridge + plotting
+scipy>=1.13
+matplotlib>=3.9
+tqdm>=4.66
+pyyaml>=6.0
diff --git a/experiments/ml/sequence/SPEC.md b/experiments/ml/sequence/SPEC.md
new file mode 100644
index 00000000..a14a5a6f
--- /dev/null
+++ b/experiments/ml/sequence/SPEC.md
@@ -0,0 +1,62 @@
+# Track 2 — Autoregressive temporal stream model
+
+## Objective
+
+Model each entity's JE stream as a sequence of discrete event tokens and learn
+the temporal dynamics — closing **P1 IETD + Autocorr**, **P2 JELineBurst**, and
+**P4 MeanGap**. These are the gaps the per-source IET sampler + lines-per-JE
+prior approximate marginally but miss in their *joint, autocorrelated* form
+(the W2/W8 autocorr regressions in the project history).
+
+## Why autoregressive (vs marginal samplers)
+
+The current samplers draw IET and line-count independently per event. Real GL
+streams are bursty and autocorrelated: a flurry of postings clusters, then
+quiets. A causal transformer conditions each event on the recent history, so
+burst structure and lag-1 autocorrelation emerge instead of being imposed.
+
+## Data (`common.data_export --track sequence`)
+
+Group by `(client, source, trading_partner)`, sort by `entry_date`. Per event,
+emit a token:
+
+```
+token = (Δt-bucket, line-count-bucket, account-class, weekday, hour-band)
+```
+
+Δt bucketized log-spaced (0, 1, 2-3, 4-7, 8-14, 15-30, 31+ days); line-count
+bucketized (1, 2, 3-4, 5-8, 9-16, 17+). Artifacts (gitignored):
+`streams.pt` (padded id sequences), `vocab.json` (bucket edges — structural,
+no corpus content).
+
+## Architecture
+
+Decoder-only transformer (`torch.nn.TransformerEncoder` with a causal mask),
+~4 layers, d_model=256, 4 heads. Factorized head: predict each token field
+with its own softmax (Δt, line-count, account-class, weekday, hour-band), so
+given history `h` the joint is `p(Δt|h)·p(lines|h)·…` (fields conditionally
+independent, as in `model.py`). Conditioning prefix = `(source, entity-type)` embedding.
+
+Loss: sum of per-field cross-entropies. Teacher-forced.
+
+## Sampling → handoff
+
+Generate token streams per (source, entity); decode buckets back to concrete
+Δt / line-count *ranges*. The Rust generator draws the concrete value
+uniformly within the predicted bucket and — crucially — still routes amounts +
+balance through the symbolic layer. So the model sets *timing + shape*, the
+engine sets *values*. Keep Python-side; emit a per-entity event schedule
+artifact, OR port the (small) transformer to candle for online use if the
+schedule artifact proves too large.
+
+## Success criteria
+
+* P1 IETD DR and Autocorr DR (Source) **down ≥ 30%** vs v5.26; P2 JELineBurst
+  DR **down ≥ 20%**; P4 MeanGap DR **down ≥ 15%** — via `bf_bridge.score_canonical`.
+* Balance / coherence unaffected (amounts unchanged path).
+
+## Privacy
+
+Lower risk than the GNN (tokens are coarse buckets, no names/text). Still:
+rare (source, account-class) combos can be near-unique — drop buckets with
+support < k before training; note in run config.
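Review sketch for the bucket scheme in `sequence/SPEC.md` (not part of the patch): the Δt and line-count buckets can be expressed as sorted upper edges and looked up with `bisect`. The last Δt bucket is taken to mean anything above 30 days. Constant and function names are hypothetical, and the ids here are zero-based.

```python
from bisect import bisect_left

# Inclusive upper edge of every bucket except the open-ended last one.
DT_UPPER = [0, 1, 3, 7, 14, 30]   # 0, 1, 2-3, 4-7, 8-14, 15-30, then >30 days
LINES_UPPER = [1, 2, 4, 8, 16]    # 1, 2, 3-4, 5-8, 9-16, then 17+


def bucket_id(value: int, upper_edges) -> int:
    """Zero-based bucket index; values above the last edge share the final bucket."""
    return bisect_left(upper_edges, value)


def bucket_range(idx: int, upper_edges):
    """Inclusive (lo, hi) for a bucket; hi is None for the open-ended bucket."""
    lo = upper_edges[idx - 1] + 1 if idx > 0 else upper_edges[0]
    hi = upper_edges[idx] if idx < len(upper_edges) else None
    return lo, hi
```

`bucket_range` is the inverse the handoff needs: the generator can draw a concrete value uniformly inside the returned range. Since `model.loss` defaults to `pad_idx=0`, an exporter following this sketch would likely shift ids by one so that 0 stays free for padding.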
diff --git a/experiments/ml/sequence/__init__.py b/experiments/ml/sequence/__init__.py
new file mode 100644
index 00000000..50fcdacb
--- /dev/null
+++ b/experiments/ml/sequence/__init__.py
@@ -0,0 +1 @@
+"""Track 2 — autoregressive temporal stream model (causal transformer)."""
diff --git a/experiments/ml/sequence/model.py b/experiments/ml/sequence/model.py
new file mode 100644
index 00000000..5160b1a6
--- /dev/null
+++ b/experiments/ml/sequence/model.py
@@ -0,0 +1,85 @@
+"""Decoder-only transformer over JE event-token streams (Track 2).
+
+Factorized multi-field head: each event's distribution is the product of
+per-field softmaxes over (Δt-bucket, line-count-bucket, account-class, weekday,
+hour-band), independent given the history. Causal mask makes it autoregressive. Runnable on plain torch.
+"""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+
+import torch
+import torch.nn as nn
+
+
+@dataclass
+class FieldVocab:
+    """Vocabulary sizes per token field (filled from vocab.json at load)."""
+
+    dt: int
+    lines: int
+    account_class: int
+    weekday: int = 7
+    hour_band: int = 6
+
+
+class EventStreamTransformer(nn.Module):
+    def __init__(self, vocab: FieldVocab, d_model: int = 256, n_layers: int = 4,
+                 n_heads: int = 4, max_len: int = 512, n_sources: int = 64):
+        super().__init__()
+        self.vocab = vocab
+        # One embedding per field; summed into the token representation.
+        self.emb_dt = nn.Embedding(vocab.dt, d_model)
+        self.emb_lines = nn.Embedding(vocab.lines, d_model)
+        self.emb_class = nn.Embedding(vocab.account_class, d_model)
+        self.emb_weekday = nn.Embedding(vocab.weekday, d_model)
+        self.emb_hour = nn.Embedding(vocab.hour_band, d_model)
+        self.emb_source = nn.Embedding(n_sources, d_model)  # conditioning prefix
+        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
+
+        layer = nn.TransformerEncoderLayer(
+            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
+        )
+        self.backbone = nn.TransformerEncoder(layer, n_layers)
+
+        # Factorized output heads.
+        self.head_dt = nn.Linear(d_model, vocab.dt)
+        self.head_lines = nn.Linear(d_model, vocab.lines)
+        self.head_class = nn.Linear(d_model, vocab.account_class)
+        self.head_weekday = nn.Linear(d_model, vocab.weekday)
+        self.head_hour = nn.Linear(d_model, vocab.hour_band)
+
+    def forward(self, tokens: dict[str, torch.Tensor], source_id: torch.Tensor):
+        # tokens[field]: (B, T) long. source_id: (B,) long.
+        h = (
+            self.emb_dt(tokens["dt"])
+            + self.emb_lines(tokens["lines"])
+            + self.emb_class(tokens["account_class"])
+            + self.emb_weekday(tokens["weekday"])
+            + self.emb_hour(tokens["hour_band"])
+        )
+        b, t, _ = h.shape
+        h = h + self.pos[:, :t]
+        h = h + self.emb_source(source_id).unsqueeze(1)  # broadcast prefix
+        mask = nn.Transformer.generate_square_subsequent_mask(t, device=h.device)
+        h = self.backbone(h, mask=mask, is_causal=True)
+        return {
+            "dt": self.head_dt(h),
+            "lines": self.head_lines(h),
+            "account_class": self.head_class(h),
+            "weekday": self.head_weekday(h),
+            "hour_band": self.head_hour(h),
+        }
+
+    @staticmethod
+    def loss(logits: dict[str, torch.Tensor], target: dict[str, torch.Tensor],
+             pad_idx: int = 0) -> torch.Tensor:
+        ce = nn.functional.cross_entropy
+        total = 0.0
+        for field, lg in logits.items():
+            # shift: logits at t predict the token at t+1; id 0 is ignored as padding in every field
+            pred = lg[:, :-1].reshape(-1, lg.size(-1))
+            tgt = target[field][:, 1:].reshape(-1)
+            total = total + ce(pred, tgt, ignore_index=pad_idx)
+        return total
diff --git a/experiments/ml/sequence/train.py b/experiments/ml/sequence/train.py
new file mode 100644
index 00000000..e07437dc
--- /dev/null
+++ b/experiments/ml/sequence/train.py
@@ -0,0 +1,61 @@
+"""Train the event-stream transformer (Track 2).
+
+    python -m sequence.train --data data/sequence --out weights/sequence --epochs 30
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+from pathlib import Path
+
+import torch
+from torch.utils.data import DataLoader, TensorDataset
+
+from .model import EventStreamTransformer, FieldVocab
+
+
+def main(argv: list[str] | None = None) -> None:
+    ap = argparse.ArgumentParser(description=__doc__)
+    ap.add_argument("--data", type=Path, required=True)
+    ap.add_argument("--out", type=Path, required=True)
+    ap.add_argument("--epochs", type=int, default=30)
+    ap.add_argument("--batch-size", type=int, default=64)
+    ap.add_argument("--lr", type=float, default=3e-4)
+    ap.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu")
+    args = ap.parse_args(argv)
+    args.out.mkdir(parents=True, exist_ok=True)
+    dev = torch.device(args.device)
+
+    vocab = FieldVocab(**json.loads((args.data / "vocab.json").read_text())["sizes"])
+    model = EventStreamTransformer(vocab).to(dev)
+    opt = torch.optim.AdamW(model.parameters(), lr=args.lr)
+
+    # streams.pt: dict of (N, T) field tensors + (N,) source_id. TODO(seq):
+    # confirm packing in data_export; this loader assumes that layout.
+    blob = torch.load(args.data / "streams.pt")
+    fields = ["dt", "lines", "account_class", "weekday", "hour_band"]
+    ds = TensorDataset(*[blob[f] for f in fields], blob["source_id"])
+    dl = DataLoader(ds, batch_size=args.batch_size, shuffle=True)
+
+    model.train()
+    for epoch in range(1, args.epochs + 1):
+        running = 0.0
+        for batch in dl:
+            *field_tensors, source_id = [t.to(dev) for t in batch]
+            tokens = dict(zip(fields, field_tensors))
+            logits = model(tokens, source_id)
+            loss = model.loss(logits, tokens)
+            opt.zero_grad()
+            loss.backward()
+            opt.step()
+            running += loss.item()
+        print(f"epoch {epoch:3d}  loss={running/len(dl):.4f}")
+
+    torch.save({"model": model.state_dict(), "vocab": vars(vocab)},
+               args.out / "stream_tf.pt")
+    print(f"[sequence.train] saved {args.out/'stream_tf.pt'}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/experiments/ml/surrogate/SPEC.md b/experiments/ml/surrogate/SPEC.md
new file mode 100644
index 00000000..dc4f49ca
--- /dev/null
+++ b/experiments/ml/surrogate/SPEC.md
@@ -0,0 +1,59 @@
+# Track 4 — Learned eval surrogate + CMA-ES tuning loop
+
+## Objective
+
+Make the **calibration loop fast**. Today, tuning a generator knob means:
+edit config → full generate → full BF eval (the hours-long cycle the project
+history keeps deferring). Replace most of those full evals with a learned
+**surrogate** `f(knobs) → predicted BF composite`, and search the knob space
+with **CMA-ES** against the surrogate — validating only the promising points
+with the real eval.
+
+This is the only track that touches **performance, not realism**, and it has
+**zero coherence risk** — it never changes generation, only *which config* we
+pick. The generator + constraints are untouched.
+
+## Why a surrogate (vs grid / manual tuning)
+
+The knob space (bypass share, drift thresholds, motif-bias weights, per-source
+IET scales, …) is ~10-20 dimensional with expensive, noisy evaluations —
+exactly the regime where Bayesian / surrogate-assisted optimization wins.
+A cheap surrogate turns "10 full evals/day" into "thousands of surrogate
+queries + a handful of confirmatory evals."
+
+## Data
+
+Bootstrapped from the baseline history: every `docs/baselines/*/metrics.csv`
+is a `(knobs, composite)` sample. Plus an active-learning loop: each
+confirmatory full eval adds a labeled point and retrains the surrogate.
+Knob vector schema lives in `surrogate/knobs.py` (TODO: enumerate from
+`GeneratorConfig` + the SP-series tuning params).
+
+## Architecture
+
+* **Surrogate**: small MLP (or GP for calibrated uncertainty at low data) —
+  `knobs (d) → [P1, P2, P3, P4 DRs]`, composite = aggregation. Predict the
+  *vector* of DRs, not just the scalar, so the optimizer can target specific
+  gaps.
+* **Optimizer**: `cma.CMAEvolutionStrategy` over normalized knobs; acquisition
+  = surrogate-predicted composite minus an uncertainty bonus (UCB-style; subtracted because the composite is minimized) to keep exploring.
+* **Active loop**: every N surrogate-proposed optima → 1 real
+  `bf_bridge.score_canonical` → append → retrain surrogate.
+
+## Success criteria
+
+* Reach the current best composite (v5.26 ≈ 42 mean / 18 median) in **≤ ⅓ the
+  full evals** a manual sweep needed.
+* Surrogate rank-correlation (Spearman) with the real eval > 0.8 on held-out
+  configs before trusting its proposals.
+
+## Handoff
+
+No generator change — output is a **tuned config patch** (same format the
+existing `AutoTuner` emits). Drops straight into the regen pipeline. Runs CPU
+or A100; the A100 just makes the surrogate retrain + CMA-ES batches instant.
+
+## Privacy
+
+None — operates on knob vectors + aggregate composite scores, never corpus
+data or row-level output.
diff --git a/experiments/ml/surrogate/__init__.py b/experiments/ml/surrogate/__init__.py
new file mode 100644
index 00000000..1ccc5906
--- /dev/null
+++ b/experiments/ml/surrogate/__init__.py
@@ -0,0 +1 @@
+"""Track 4 — learned BF-eval surrogate + CMA-ES tuning loop."""
diff --git a/experiments/ml/surrogate/knobs.py b/experiments/ml/surrogate/knobs.py
new file mode 100644
index 00000000..043f733d
--- /dev/null
+++ b/experiments/ml/surrogate/knobs.py
@@ -0,0 +1,55 @@
+"""Knob vector schema for the tuning surrogate (Track 4).
+
+A knob = a normalized generator parameter the optimizer is allowed to move.
+Bounds keep CMA-ES inside the validated config envelope. TODO: enumerate the
+full set from `GeneratorConfig` + the SP-series tuning params; the entries
+below are the ones the project history actually swept (bypass share, drift
+thresholds, motif bias).
+"""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+
+import numpy as np
+
+
+@dataclass(frozen=True)
+class Knob:
+    name: str
+    lo: float
+    hi: float
+    default: float
+
+
+# Seed set from the baseline history (extend as more knobs are exposed).
+KNOBS: list[Knob] = [
+    Knob("priors_amount_bypass_share", 0.0, 0.5, 0.25),   # SP5.3 sweet spot
+    Knob("drift_sigma_per_account", 1.0, 3.0, 2.0),       # SP5.1
+    Knob("drift_aggregate_pct", 0.001, 0.02, 0.005),      # SP5.1
+    Knob("tp_motif_bias", 0.0, 1.0, 0.5),                 # SP3.12 W2
+    Knob("source_iet_scale", 0.5, 2.0, 1.0),
+    Knob("lines_per_je_dispersion", 0.5, 2.0, 1.0),
+    # TODO: append remaining swept params (W7.M bypass, semantic-split rate, …)
+]
+
+
+def to_vector(d: dict[str, float]) -> np.ndarray:
+    """Dict -> normalized [0,1] vector in KNOBS order."""
+    return np.array(
+        [(d.get(k.name, k.default) - k.lo) / (k.hi - k.lo) for k in KNOBS],
+        dtype=np.float64,
+    )
+
+
+def from_vector(x: np.ndarray) -> dict[str, float]:
+    """Normalized vector -> concrete knob dict (clamped to bounds)."""
+    out = {}
+    for k, v in zip(KNOBS, x):
+        v = float(np.clip(v, 0.0, 1.0))
+        out[k.name] = k.lo + v * (k.hi - k.lo)
+    return out
+
+
+def dim() -> int:
+    return len(KNOBS)
diff --git a/experiments/ml/surrogate/optimize.py b/experiments/ml/surrogate/optimize.py
new file mode 100644
index 00000000..1d9f471c
--- /dev/null
+++ b/experiments/ml/surrogate/optimize.py
@@ -0,0 +1,103 @@
+"""CMA-ES tuning loop over generator knobs, surrogate-assisted (Track 4).
+
+    python -m surrogate.optimize --history docs/baselines --out weights/surrogate
+
+Loop:
+  1. seed surrogate from baseline history (knobs, DRs)
+  2. CMA-ES proposes knob vectors, scored cheaply by the surrogate (mean prediction for now; the UCB bonus from SPEC.md is not wired yet)
+  3. every --confirm-every generations, run the REAL BF scorer on the
+     incumbent, append the labeled point, retrain the surrogate (active learning)
+  4. emit the best knobs as a config patch (AutoTuner format)
+
+The confirmation step shells out to a full generate + canonical BF scorer;
+that is the only expensive call, and we make few of them.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+from pathlib import Path
+
+import numpy as np
+
+from . import knobs as K
+from .surrogate import composite_from_drs, fit
+
+
+def load_history(baselines_dir: Path) -> tuple[np.ndarray, np.ndarray]:
+    """Parse docs/baselines/*/metrics.csv into (knobs, DR-vector) samples.
+
+    TODO(surrogate): map each baseline's recorded params -> knob vector and its
+    per-family DRs -> Y. Until wired, returns a tiny synthetic seed so the loop
+    is runnable end-to-end for smoke-testing the machinery.
+    """
+    rng = np.random.default_rng(0)
+    X = rng.random((16, K.dim()))
+    Y = 1.0 + rng.random((16, 4)) * 40.0  # placeholder DRs
+    print("[optimize] WARNING: using synthetic seed — wire load_history to "
+          "docs/baselines before trusting results.")
+    return X, Y
+
+
+def confirm_score(knob_dict: dict) -> np.ndarray:
+    """Full generate + canonical BF scorer for one knob config -> DR vector.
+
+    TODO(surrogate): write knob_dict into a config patch, run
+    `datasynth-data generate`, then bf_bridge.score_canonical; return
+    [P1,P2,P3,P4] DRs. Expensive — called only on incumbents.
+    """
+    raise NotImplementedError(
+        "confirm_score: wire to the regen pipeline + bf_bridge.score_canonical"
+    )
+
+
+def main(argv: list[str] | None = None) -> None:
+    ap = argparse.ArgumentParser(description=__doc__)
+    ap.add_argument("--history", type=Path, default=Path("../../docs/baselines"))
+    ap.add_argument("--out", type=Path, required=True)
+    ap.add_argument("--generations", type=int, default=50)
+    ap.add_argument("--confirm-every", type=int, default=10)
+    ap.add_argument("--sigma0", type=float, default=0.2)
+    ap.add_argument("--smoke", action="store_true",
+                    help="surrogate-only loop (no confirmation runs) to test machinery")
+    args = ap.parse_args(argv)
+    args.out.mkdir(parents=True, exist_ok=True)
+
+    try:
+        import cma
+    except ImportError as exc:
+        raise SystemExit("cma required: pip install -r ../requirements.txt") from exc
+
+    X, Y = load_history(args.history)
+    model = fit(X, Y)
+
+    es = cma.CMAEvolutionStrategy(np.full(K.dim(), 0.5), args.sigma0,
+                                  {"bounds": [0.0, 1.0], "verbose": -1})
+    import torch
+
+    best = (np.inf, None)
+    for gen in range(1, args.generations + 1):
+        sols = es.ask()
+        with torch.no_grad():
+            pred = model(torch.tensor(np.array(sols), dtype=torch.float32)).numpy()
+        scores = [composite_from_drs(r) for r in pred]  # minimize composite
+        es.tell(sols, scores)
+        gbest = min(zip(scores, sols), key=lambda t: t[0])
+        if gbest[0] < best[0]:
+            best = gbest
+        if gen % args.confirm_every == 0 and not args.smoke:
+            drs = confirm_score(K.from_vector(best[1]))   # expensive, rare
+            X = np.vstack([X, best[1]])
+            Y = np.vstack([Y, drs])
+            model = fit(X, Y)                             # active-learning retrain
+            print(f"gen {gen}: confirmed composite={composite_from_drs(drs):.2f}")
+
+    patch = K.from_vector(best[1])
+    (args.out / "tuned_knobs.json").write_text(json.dumps(patch, indent=2))
+    print(f"[optimize] best surrogate composite={best[0]:.2f}")
+    print(f"[optimize] wrote {args.out/'tuned_knobs.json'} (AutoTuner-compatible)")
+
+
+if __name__ == "__main__":
+    main()
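Review sketch for `surrogate/optimize.py` (not part of the patch): SPEC.md calls for an uncertainty bonus on top of the predicted composite, while the loop above ranks candidates on the surrogate's point prediction. One possible shape for that bonus is sketched here, assuming a small ensemble of surrogates supplies several composite predictions per candidate. The ensemble and the name `lcb_score` are hypothetical.

```python
import statistics


def lcb_score(member_predictions, kappa: float = 1.0) -> float:
    """Lower confidence bound: ensemble mean minus kappa times its spread."""
    mean = statistics.fmean(member_predictions)
    spread = statistics.pstdev(member_predictions)
    # Lower is better, so uncertainty is subtracted to reward exploration.
    return mean - kappa * spread
```

Passing `lcb_score` values to `es.tell` instead of the plain composite would make disagreement between ensemble members count in a candidate's favour, with `kappa` setting how strongly.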
diff --git a/experiments/ml/surrogate/surrogate.py b/experiments/ml/surrogate/surrogate.py
new file mode 100644
index 00000000..8d3af30b
--- /dev/null
+++ b/experiments/ml/surrogate/surrogate.py
@@ -0,0 +1,71 @@
+"""MLP surrogate: knob vector -> predicted per-family BF degradation ratios.
+
+Predicts the DR *vector* [P1, P2, P3, P4] (not just the scalar) so the
+optimizer can target specific gaps. Small + CPU-friendly; the A100 just makes
+retraining instant inside the active loop.
+"""
+
+from __future__ import annotations
+
+import numpy as np
+import torch
+import torch.nn as nn
+
+DR_FAMILIES = ["P1", "P2", "P3", "P4"]
+
+
+class SurrogateMLP(nn.Module):
+    def __init__(self, in_dim: int, hidden=(64, 64), out_dim: int = len(DR_FAMILIES)):
+        super().__init__()
+        layers: list[nn.Module] = []
+        d = in_dim
+        for h in hidden:
+            layers += [nn.Linear(d, h), nn.SiLU()]
+            d = h
+        layers += [nn.Linear(d, out_dim), nn.Softplus()]  # DRs are >= 0
+        self.net = nn.Sequential(*layers)
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return self.net(x)
+
+
+def composite_from_drs(drs: np.ndarray, weights: np.ndarray | None = None) -> float:
+    """Weighted mean over DR families, uniform by default (cf. volume-corrected composites in CHANGELOG)."""
+    w = np.ones(drs.shape[-1]) if weights is None else weights
+    return float((drs * w).sum() / w.sum())
+
+
+def fit(
+    X: np.ndarray,
+    Y: np.ndarray,
+    epochs: int = 500,
+    lr: float = 1e-3,
+    device: str = "cpu",
+) -> SurrogateMLP:
+    """X: (n, d) normalized knobs. Y: (n, 4) per-family DRs."""
+    dev = torch.device(device)
+    model = SurrogateMLP(X.shape[1]).to(dev)
+    opt = torch.optim.Adam(model.parameters(), lr=lr)
+    xt = torch.tensor(X, dtype=torch.float32, device=dev)
+    yt = torch.tensor(Y, dtype=torch.float32, device=dev)
+    model.train()
+    for ep in range(epochs):
+        pred = model(xt)
+        loss = nn.functional.mse_loss(pred, yt)
+        opt.zero_grad()
+        loss.backward()
+        opt.step()
+        if (ep + 1) % 100 == 0:
+            print(f"  surrogate epoch {ep+1}  mse={loss.item():.4f}")
+    return model
+
+
+def spearman(model: SurrogateMLP, X: np.ndarray, Y: np.ndarray) -> float:
+    """Rank-correlation of predicted vs true composite (gate: > 0.8)."""
+    from scipy.stats import spearmanr
+
+    with torch.no_grad():
+        pred = model(torch.tensor(X, dtype=torch.float32)).numpy()
+    pc = np.array([composite_from_drs(r) for r in pred])
+    tc = np.array([composite_from_drs(r) for r in Y])
+    return float(spearmanr(pc, tc).correlation)

From 044fdfbba897748f686edcdb61dcc0d007116f11 Mon Sep 17 00:00:00 2001
From: Michael Ivertowski <mivertowski@outlook.com>
Date: Wed, 20 May 2026 21:49:58 +0200
Subject: [PATCH 02/18] feat(ml): add inverse / simulation-based-inference
 (SBI) track
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Track 5 — run the generator backward. Given an observed GL, recover a
posterior over the latent process parameters that could have produced it
(audit-analytics direction). Feasible here because DataSynth is a
structured generative model with KNOWN ground truth: it manufactures
labeled (θ → GL) pairs for free, and the hard accounting constraints
regularize the otherwise ill-posed inverse.

Amortized SNPE: a conditional normalizing flow q_φ(θ | x) (reuses the
flow/ NSF) trained on forward-simulated pairs, where x is a GL
summary-stat vector. Many-to-one forward map ⇒ we recover a posterior,
not a point.

Files:
- params.py   tier-1 identifiable parameter set + priors (fraud rates,
              amount σ, posting-lag μ/σ, concentration)
- simulate.py draw θ ~ prior → datasynth-data generate → summary stats x
- model.py    PosteriorFlow (zuko NSF conditioned on x)
- train.py    maximize Σ log q_φ(θ|x) on the simulated pairs
- validate.py SBC rank histograms + credible-interval COVERAGE on
              held-out synthetic — measure how well 'backward' works
              before pointing at any real GL
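
For reference, the two validate.py checks can be sketched on a toy problem
whose posterior is known in closed form. This is a minimal NumPy sketch, not
the code in validate.py: the conjugate-Gaussian model and the helper name
sbc_ranks_and_coverage are assumptions made for illustration only.

```python
import numpy as np


def sbc_ranks_and_coverage(theta_true, samples, cred=0.90):
    """theta_true: (n,) truths; samples: (n, m) posterior draws per case."""
    # SBC rank: how many posterior draws fall below the true value (0..m).
    ranks = (samples < theta_true[:, None]).sum(axis=1)
    # Central credible interval from empirical quantiles of the draws.
    tail = (1.0 - cred) / 2.0
    lo = np.quantile(samples, tail, axis=1)
    hi = np.quantile(samples, 1.0 - tail, axis=1)
    covered = (theta_true >= lo) & (theta_true <= hi)
    return ranks, float(covered.mean())


# Toy stand-in for (theta, x) pairs: theta ~ N(0, 1), x = theta + noise.
rng = np.random.default_rng(0)
n, m, noise = 2000, 200, 0.5
theta = rng.normal(0.0, 1.0, n)
x = theta + rng.normal(0.0, noise, n)

# Exact conjugate posterior: N(x / (1 + noise^2), noise^2 / (1 + noise^2)).
post_mean = x / (1.0 + noise**2)
post_std = np.sqrt(noise**2 / (1.0 + noise**2))
samples = post_mean[:, None] + post_std * rng.normal(size=(n, m))

ranks, cov = sbc_ranks_and_coverage(theta, samples)
print(f"coverage={cov:.3f}  mean rank={ranks.mean():.1f} (uniform -> {m / 2})")
```

With an exact posterior the ranks come out roughly uniform and coverage sits
close to the nominal level; a miscalibrated q_φ would show up as a skewed
rank histogram or coverage far from nominal.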

Scope ladder in SPEC.md: parameters (this) → process attribution
(overlaps ocpm + gnn) → latent fraud/anomaly labels. Inversion quality is
gated by the forward model's BF fidelity (distribution shift). Privacy:
trains on synthetic only; emits parameter posteriors, never row-level
content.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 experiments/ml/README.md           | 15 +++--
 experiments/ml/inverse/SPEC.md     | 97 ++++++++++++++++++++++++++++++
 experiments/ml/inverse/__init__.py |  1 +
 experiments/ml/inverse/model.py    | 46 ++++++++++++++
 experiments/ml/inverse/params.py   | 68 +++++++++++++++++++++
 experiments/ml/inverse/simulate.py | 93 ++++++++++++++++++++++++++++
 experiments/ml/inverse/train.py    | 74 +++++++++++++++++++++++
 experiments/ml/inverse/validate.py | 80 ++++++++++++++++++++++++
 8 files changed, 470 insertions(+), 4 deletions(-)
 create mode 100644 experiments/ml/inverse/SPEC.md
 create mode 100644 experiments/ml/inverse/__init__.py
 create mode 100644 experiments/ml/inverse/model.py
 create mode 100644 experiments/ml/inverse/params.py
 create mode 100644 experiments/ml/inverse/simulate.py
 create mode 100644 experiments/ml/inverse/train.py
 create mode 100644 experiments/ml/inverse/validate.py

diff --git a/experiments/ml/README.md b/experiments/ml/README.md
index 7c306d0b..bf581847 100644
--- a/experiments/ml/README.md
+++ b/experiments/ml/README.md
@@ -25,18 +25,25 @@ The NN never emits a final balance. It emits shape; the existing Rust
 generator projects that shape onto the feasible manifold. Coherence stays a
 hard guarantee by construction.
 
-## The four tracks
+## The five tracks
 
-| Dir | Track | Architecture | BF metrics targeted |
-|-----|-------|--------------|---------------------|
+Tracks 1–4 sharpen the **forward** model (closing the BF gap); track 5 runs it
+**backward** (recover the latent parameters from a GL).
+
+| Dir | Track | Architecture | Targets |
+|-----|-------|--------------|---------|
 | [`gnn/`](gnn/SPEC.md) | Relational / interconnectivity sampler | GraphSAGE encoder + edge/degree decoder (GAE-style) | P3 ClusteringGap, TriangleLogRatio (TP / vendor / IC graphs) |
 | [`sequence/`](sequence/SPEC.md) | Temporal / behavioral stream | Autoregressive transformer over per-(source, entity) JE token streams | P1 IETD + Autocorr, P2 JELineBurst, P4 MeanGap |
 | [`flow/`](flow/SPEC.md) | Amount marginals | Conditional normalizing flow per (source, account-class) | Benford / multimodal amount fidelity |
 | [`surrogate/`](surrogate/SPEC.md) | Tuning-loop accelerator | MLP surrogate of the BF composite + CMA-ES over generator knobs | *Performance* of calibration (no coherence risk — never touches generation) |
+| [`inverse/`](inverse/SPEC.md) | Backward inference (SBI) | Amortized neural posterior `q(θ\|GL)` trained on forward-simulated pairs | Recover the latent process *parameters* a GL was distilled from, with calibrated uncertainty (SBC + coverage validated on synthetic) |
 
 Start with **`gnn/`** (highest leverage on the structural gaps the hand-tuned
 motif samplers can't close) or **`surrogate/`** (pure iteration-speed win,
-zero coherence risk). See each `SPEC.md` for objective, data contract,
+zero coherence risk). **`inverse/`** is the audit-analytics
+direction: it reuses the `flow/` density and the forward
+simulator's free ground truth. See each `SPEC.md` for
+objective, data contract,
 architecture, and success criteria.
 
 ## Privacy / legal (read before running)
diff --git a/experiments/ml/inverse/SPEC.md b/experiments/ml/inverse/SPEC.md
new file mode 100644
index 00000000..c9d71403
--- /dev/null
+++ b/experiments/ml/inverse/SPEC.md
@@ -0,0 +1,97 @@
+# Track 5 — Inverse / simulation-based inference (SBI)
+
+## Objective
+
+Run the generator **backward**: given an observed GL, recover a posterior over
+the latent process **parameters** (and, later, process structure / ground-truth
+labels) that could have produced it. This turns DataSynth from a forward
+simulator into an *audit-analytics* inference tool — reconstructing the
+processes a GL was distilled from, with calibrated uncertainty.
+
+## Why this is tractable here (and rarely elsewhere)
+
+DataSynth is a **structured generative model with known ground truth**. It
+manufactures labeled `(parameters → GL)` pairs at scale (~200K entries/s), so
+we get *supervised training data for the inverse for free* — the thing most
+inverse problems lack. And the hard accounting constraints (debits=credits,
+A=L+E, document-chain integrity, three-way-match tolerances) shrink the inverse
+search space dramatically, regularizing an otherwise ill-posed problem.
+
+## The inverse is many-to-one → recover a posterior, not a point
+
+The forward map discards information at every layer (a round-dollar weekend
+posting is consistent with both fraud and a legitimate accrual). So the target
+is `p(θ | GL)`, a **posterior**, never a unique reconstruction. Report
+posteriors + coverage; never false-precision point estimates.
+
+## Approach — amortized SBI (SNPE-style)
+
+```
+            forward (datasynth-data generate)
+   θ ~ prior ───────────────────────────────▶ GL ──summary stats──▶ x
+        │                                                            │
+        └──────────── train q_φ(θ | x)  (conditional flow) ◀─────────┘
+   inference:  real GL ──summary stats──▶ x*  ──▶  q_φ(θ | x*)  (one fwd pass)
+```
+
+1. **`simulate.py`** — draw θ from a prior over a *small, identifiable*
+   parameter set first (e.g. `fraud_rate`, `document_fraud_rate`, fan-out
+   shape, posting-lag μ/σ, amount log-normal σ), run `datasynth-data generate`,
+   compute summary statistics `x` of the resulting GL. Emit `(θ, x)` pairs.
+2. **`model.py`** — a conditional normalizing flow `q_φ(θ | x)` (reuse the
+   `flow/` track's zuko NSF, conditioned on `x`). This is Sequential Neural
+   Posterior Estimation in its single-round (amortized) form.
+3. **`train.py`** — maximize `Σ log q_φ(θ_i | x_i)` over the simulated pairs.
+4. **`validate.py`** — the clean part: validate on **held-out synthetic** where
+   θ is known. Metrics: posterior-mean error per parameter, **simulation-based
+   calibration (SBC)** rank histograms, and credible-interval **coverage**
+   (a 90% interval should contain the truth ~90% of the time).
+
+## Summary statistics `x` (the GL → feature map)
+
+Reuse `common.bf_bridge` feature extractors + add inverse-relevant ones:
+per-source row-share, IET (inter-event-time) distribution moments, lines-per-JE histogram,
+amount log-moments + Benford MAD, fan-out degree stats, weekend/off-hours/
+round-dollar fractions, document-chain completeness. TODO: finalize the
+feature vector in `simulate.py` once the parameter set is fixed.
+
+## Scope ladder (do in order)
+
+1. **Parameters only** (this spec): 5–10 identifiable knobs, validated on
+   synthetic. Lowest risk, clearest eval.
+2. **Process attribution**: which JEs form one P2P/O2C instance — overlaps
+   `datasynth-ocpm` discovery + conformance; a GNN over the transaction graph
+   (see `gnn/`) is the natural tool.
+3. **Latent labels** (fraud / anomaly cause): a ranked posterior per JE.
+   Hardest; bounded by identifiability.
+
+## Success criteria (tier 1)
+
+- Posterior-mean recovers each parameter within its prior's noise floor on
+  held-out synthetic.
+- SBC rank histograms ~uniform; 90% credible-interval coverage in [0.85, 0.95].
+- Honest failure modes documented per parameter (which are well- vs
+  poorly-identified from GL alone).
+
+## Distribution shift = the BF gap
+
+The inverse is only as trustworthy as the forward model's fidelity to reality.
+An inverse trained on synthetic, applied to a real GL, is biased by exactly the
+behavioral-fidelity gap the composite measures. So fidelity work directly gates
+inversion quality — and the inverse should only be pointed at real GL once the
+forward model's BF composite is acceptable for the targeted account/source mix.
+
+## Privacy
+
+Training data is synthetic (no corpus). Applying the trained inverse to a real
+GL reads that GL but emits only parameter posteriors — no row-level corpus
+content. Same `DATASYNTH_CORPUS_DIR` discipline if real GL is used for
+evaluation; results (posteriors) are not corpus content but treat any
+real-GL-derived artifact as sensitive until reviewed.
+
+## Handoff
+
+Output is a posterior over generator knobs — directly comparable to the
+`surrogate/` track's knob space and consumable as an AutoTuner-style report
+("the corpus most likely came from these parameters"). Python-side; no Rust
+generator change.
diff --git a/experiments/ml/inverse/__init__.py b/experiments/ml/inverse/__init__.py
new file mode 100644
index 00000000..1d14ed37
--- /dev/null
+++ b/experiments/ml/inverse/__init__.py
@@ -0,0 +1 @@
+"""Track 5 — inverse / simulation-based inference (amortized SNPE)."""
diff --git a/experiments/ml/inverse/model.py b/experiments/ml/inverse/model.py
new file mode 100644
index 00000000..a3357b91
--- /dev/null
+++ b/experiments/ml/inverse/model.py
@@ -0,0 +1,46 @@
+"""Amortized posterior estimator q_φ(θ | x) for the inverse track.
+
+A conditional normalizing flow (zuko NSF) that maps a GL summary-stat vector
+`x` to a distribution over normalized parameters θ ∈ [0,1]^d. Single-round
+(amortized) SNPE: one network, trained on prior-simulated pairs, usable on any
+new GL in a forward pass.
+"""
+
+from __future__ import annotations
+
+import torch
+import torch.nn as nn
+
+try:
+    import zuko
+except ImportError as exc:  # pragma: no cover
+    raise ImportError("zuko required: pip install -r ../requirements.txt") from exc
+
+
+class PosteriorFlow(nn.Module):
+    def __init__(self, dim_theta: int, dim_x: int, transforms: int = 5,
+                 hidden=(128, 128)):
+        super().__init__()
+        # Standardize x before conditioning (fit at train time).
+        self.register_buffer("x_mean", torch.zeros(dim_x))
+        self.register_buffer("x_std", torch.ones(dim_x))
+        self.flow = zuko.flows.NSF(
+            features=dim_theta, context=dim_x, transforms=transforms,
+            hidden_features=hidden,
+        )
+
+    def set_x_norm(self, mean: torch.Tensor, std: torch.Tensor) -> None:
+        self.x_mean.copy_(mean)
+        self.x_std.copy_(std.clamp_min(1e-6))
+
+    def _cond(self, x: torch.Tensor) -> torch.Tensor:
+        return (x - self.x_mean) / self.x_std
+
+    def log_prob(self, theta: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
+        return self.flow(self._cond(x)).log_prob(theta)
+
+    @torch.no_grad()
+    def sample(self, x: torch.Tensor, n: int) -> torch.Tensor:
+        """Draw n posterior samples of θ (normalized) for a single x (dim_x,)."""
+        ctx = self._cond(x.unsqueeze(0))
+        return self.flow(ctx).sample((n,)).squeeze(1)
diff --git a/experiments/ml/inverse/params.py b/experiments/ml/inverse/params.py
new file mode 100644
index 00000000..3829bad5
--- /dev/null
+++ b/experiments/ml/inverse/params.py
@@ -0,0 +1,68 @@
+"""Inverse parameter space (tier 1): a small, identifiable set of generator
+knobs with priors. Kept deliberately small — these are the latents we expect
+to recover from a GL with calibrated uncertainty. Extend only after SBC +
+coverage stay healthy (poorly-identified params widen everything's posterior).
+"""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+
+import numpy as np
+
+
+@dataclass(frozen=True)
+class Param:
+    name: str          # config key written into the generate config
+    lo: float
+    hi: float
+    log: bool = False  # sample/scale in log space (for rates spanning decades)
+
+
+# Tier-1 set. Names map to GeneratorConfig keys (see datasynth-config schema).
+PARAMS: list[Param] = [
+    Param("fraud.fraud_rate", 0.0, 0.10),
+    Param("fraud.document_fraud_rate", 0.0, 0.10),
+    Param("distributions.amounts.sigma", 0.5, 2.5),      # log-normal width
+    Param("temporal_patterns.processing_lags.invoice_receipt_lag.mu", 0.0, 3.0),
+    Param("temporal_patterns.processing_lags.invoice_receipt_lag.sigma", 0.2, 1.5),
+    Param("vendor_network.dependencies.top_5_concentration", 0.20, 0.70),
+    # TODO: add lines-per-JE dispersion + source-mix concentration once the
+    # summary-stat feature map (simulate.py) exposes the matching observables.
+]
+
+
+def sample_prior(rng: np.random.Generator, n: int) -> np.ndarray:
+    """Draw n θ from independent priors (log-uniform if p.log, else uniform). Shape (n, d)."""
+    cols = []
+    for p in PARAMS:
+        if p.log:
+            lo, hi = np.log(max(p.lo, 1e-6)), np.log(p.hi)
+            cols.append(np.exp(rng.uniform(lo, hi, n)))
+        else:
+            cols.append(rng.uniform(p.lo, p.hi, n))
+    return np.stack(cols, axis=1)
+
+
+def to_config_overrides(theta: np.ndarray) -> dict[str, float]:
+    """One θ vector -> {config_key: value} overrides for a generate run."""
+    return {p.name: float(v) for p, v in zip(PARAMS, theta)}
+
+
+def normalize(theta: np.ndarray) -> np.ndarray:
+    """Map θ to [0,1]^d for stable flow training."""
+    out = np.empty_like(theta, dtype=np.float64)
+    for j, p in enumerate(PARAMS):
+        out[..., j] = (theta[..., j] - p.lo) / (p.hi - p.lo)
+    return out
+
+
+def denormalize(u: np.ndarray) -> np.ndarray:
+    out = np.empty_like(u, dtype=np.float64)
+    for j, p in enumerate(PARAMS):
+        out[..., j] = np.clip(u[..., j], 0.0, 1.0) * (p.hi - p.lo) + p.lo
+    return out
+
+
+def dim() -> int:
+    return len(PARAMS)
diff --git a/experiments/ml/inverse/simulate.py b/experiments/ml/inverse/simulate.py
new file mode 100644
index 00000000..b71b26ff
--- /dev/null
+++ b/experiments/ml/inverse/simulate.py
@@ -0,0 +1,93 @@
+"""Generate (θ, x) training pairs for the inverse model.
+
+For each θ drawn from the prior: write a config with those overrides, run
+`datasynth-data generate`, read the resulting journal_entries, compute the
+summary-stat feature vector x. Emit (θ, x) to `--out`.
+
+    python -m inverse.simulate --n 2000 --out data/inverse --base configs/demo.yaml
+
+CPU-bound and embarrassingly parallel across θ; safe to shard. Does NOT need
+the corpus — this is pure synthetic self-simulation (the SBI training set).
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import subprocess
+import tempfile
+from pathlib import Path
+
+import numpy as np
+
+from . import params as P
+
+
+def _cli() -> str:
+    import shutil
+    for c in ("./target/release/datasynth-data", "datasynth-data"):
+        if shutil.which(c) or Path(c).exists():
+            return c
+    raise FileNotFoundError("build datasynth-data (cargo build --release -p datasynth-cli)")
+
+
+def summary_stats(je_csv: Path) -> np.ndarray:
+    """GL → fixed-length feature vector x. Reuses the same observables the BF
+    eval keys on so the inverse 'sees' what the forward model varies.
+
+    TODO: finalize alongside params.py. Pseudocode:
+      - per-source row-share (top-K sources)
+      - inter-event-time mean/std/skew per source, pooled
+      - lines-per-JE histogram (fixed bins)
+      - log|amount| mean/std + Benford first-digit MAD
+      - weekend / off-hours / round-dollar fractions
+      - fan-out degree mean/gini; document-chain completeness
+    """
+    raise NotImplementedError(
+        "summary_stats: compute the fixed-length feature vector from je_csv "
+        "(see SPEC.md § Summary statistics; share extractors with common.bf_bridge)."
+    )
+
+
+def run_one(cli: str, base_cfg: Path, theta: np.ndarray, workdir: Path) -> np.ndarray:
+    overrides = P.to_config_overrides(theta)
+    # TODO: merge `overrides` (dotted keys) into a copy of base_cfg → cfg.yaml.
+    cfg = workdir / "cfg.yaml"
+    out = workdir / "out"
+    raise NotImplementedError(
+        f"run_one: write {cfg} = base_cfg + {overrides}, then "
+        f"`{cli} generate -c {cfg} -o {out} --memory-limit 512 --max-threads 1`, "
+        f"then summary_stats({out}/journal_entries.csv)."
+    )
+
+
+def main(argv: list[str] | None = None) -> None:
+    ap = argparse.ArgumentParser(description=__doc__)
+    ap.add_argument("--n", type=int, default=2000, help="number of (θ, x) pairs")
+    ap.add_argument("--out", type=Path, required=True)
+    ap.add_argument("--base", type=Path, required=True, help="base generate config")
+    ap.add_argument("--seed", type=int, default=0)
+    args = ap.parse_args(argv)
+    args.out.mkdir(parents=True, exist_ok=True)
+
+    rng = np.random.default_rng(args.seed)
+    thetas = P.sample_prior(rng, args.n)
+    cli = _cli()
+
+    xs = []
+    with tempfile.TemporaryDirectory() as td:
+        for i, theta in enumerate(thetas):
+            x = run_one(cli, args.base, theta, Path(td))  # raises until wired
+            xs.append(x)
+            if (i + 1) % 50 == 0:
+                print(f"[simulate] {i+1}/{args.n}")
+    X = np.stack(xs)
+    np.savez(args.out / "pairs.npz", theta=thetas, x=X,
+             param_names=[p.name for p in P.PARAMS])
+    (args.out / "meta.json").write_text(json.dumps(
+        {"n": args.n, "dim_theta": P.dim(), "dim_x": X.shape[1]}, indent=2))
+    print(f"[simulate] wrote {args.out/'pairs.npz'}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/experiments/ml/inverse/train.py b/experiments/ml/inverse/train.py
new file mode 100644
index 00000000..d457f9b7
--- /dev/null
+++ b/experiments/ml/inverse/train.py
@@ -0,0 +1,74 @@
+"""Train the amortized posterior q_φ(θ | x) on simulated pairs.
+
+    python -m inverse.train --data data/inverse --out weights/inverse --epochs 300
+"""
+
+from __future__ import annotations
+
+import argparse
+from pathlib import Path
+
+import numpy as np
+import torch
+from torch.utils.data import DataLoader, TensorDataset
+
+from . import params as P
+from .model import PosteriorFlow
+
+
+def main(argv: list[str] | None = None) -> None:
+    ap = argparse.ArgumentParser(description=__doc__)
+    ap.add_argument("--data", type=Path, required=True)
+    ap.add_argument("--out", type=Path, required=True)
+    ap.add_argument("--epochs", type=int, default=300)
+    ap.add_argument("--batch-size", type=int, default=256)
+    ap.add_argument("--lr", type=float, default=1e-3)
+    ap.add_argument("--val-frac", type=float, default=0.2)
+    ap.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu")
+    args = ap.parse_args(argv)
+    args.out.mkdir(parents=True, exist_ok=True)
+    dev = torch.device(args.device)
+
+    blob = np.load(args.data / "pairs.npz")
+    theta = P.normalize(blob["theta"]).astype("float32")  # (n, d) in [0,1]
+    x = blob["x"].astype("float32")                        # (n, dim_x)
+
+    n_val = max(1, int(len(theta) * args.val_frac))
+    tr = slice(n_val, None)
+    va = slice(0, n_val)
+
+    model = PosteriorFlow(dim_theta=P.dim(), dim_x=x.shape[1]).to(dev)
+    xt = torch.tensor(x, device=dev)
+    model.set_x_norm(xt[tr].mean(0), xt[tr].std(0))
+
+    ds = TensorDataset(torch.tensor(theta[tr], device=dev), xt[tr])
+    dl = DataLoader(ds, batch_size=args.batch_size, shuffle=True)
+    opt = torch.optim.Adam(model.parameters(), lr=args.lr)
+
+    theta_va = torch.tensor(theta[va], device=dev)
+    x_va = xt[va]
+    best = float("inf")
+    for ep in range(1, args.epochs + 1):
+        model.train(True)
+        run = 0.0
+        for th, xx in dl:
+            loss = -model.log_prob(th, xx).mean()
+            opt.zero_grad()
+            loss.backward()
+            opt.step()
+            run += loss.item()
+        model.train(False)  # inference mode (equivalent to .eval())
+        with torch.no_grad():
+            vloss = -model.log_prob(theta_va, x_va).mean().item()
+        if vloss < best:
+            best = vloss
+            torch.save({"model": model.state_dict(), "dim_x": x.shape[1]},
+                       args.out / "posterior.pt")
+        if ep % 25 == 0:
+            print(f"epoch {ep:4d}  train_nll={run/len(dl):.3f}  val_nll={vloss:.3f}")
+    print(f"[inverse.train] best val_nll={best:.3f} -> {args.out/'posterior.pt'}")
+    print("Next: python -m inverse.validate --data ... --weights ... (SBC + coverage)")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/experiments/ml/inverse/validate.py b/experiments/ml/inverse/validate.py
new file mode 100644
index 00000000..88410c00
--- /dev/null
+++ b/experiments/ml/inverse/validate.py
@@ -0,0 +1,80 @@
+"""Validate the inverse posterior on held-out synthetic, where θ is known.
+
+    python -m inverse.validate --data data/inverse --weights weights/inverse
+
+Reports, per parameter:
+  - posterior-mean absolute error (vs the true θ)
+  - simulation-based calibration (SBC) rank: for a calibrated posterior, the
+    rank of the true θ among posterior samples is uniform on [0, n_samples].
+  - central credible-interval coverage (a 90% interval should contain the
+    truth ~90% of the time) — the headline trust metric.
+
+This is the whole point of doing inversion against a forward simulator: we can
+measure how well 'running the engine backward' works BEFORE pointing it at any
+real GL.
+"""
+
+from __future__ import annotations
+
+import argparse
+from pathlib import Path
+
+import numpy as np
+import torch
+
+from . import params as P
+from .model import PosteriorFlow
+
+
+def coverage_and_sbc(model: PosteriorFlow, theta_true_n: np.ndarray,
+                     x: torch.Tensor, n_samples: int = 500, cred: float = 0.90):
+    d = P.dim()
+    ranks = np.zeros((len(theta_true_n), d), dtype=int)
+    covered = np.zeros((len(theta_true_n), d), dtype=bool)
+    abs_err = np.zeros((len(theta_true_n), d))
+    lo_q, hi_q = (1 - cred) / 2, 1 - (1 - cred) / 2
+    model.train(False)  # inference mode: freeze dropout / running stats
+    for i in range(len(theta_true_n)):
+        s = model.sample(x[i], n_samples).cpu().numpy()   # (n_samples, d) normalized
+        truth = theta_true_n[i]
+        ranks[i] = (s < truth).sum(axis=0)
+        lo = np.quantile(s, lo_q, axis=0)
+        hi = np.quantile(s, hi_q, axis=0)
+        covered[i] = (truth >= lo) & (truth <= hi)
+        abs_err[i] = np.abs(s.mean(axis=0) - truth)
+    return ranks, covered.mean(axis=0), abs_err.mean(axis=0)
+
+
+def main(argv: list[str] | None = None) -> None:
+    ap = argparse.ArgumentParser(description=__doc__)
+    ap.add_argument("--data", type=Path, required=True)
+    ap.add_argument("--weights", type=Path, required=True)
+    ap.add_argument("--val-frac", type=float, default=0.2)
+    ap.add_argument("--cred", type=float, default=0.90)
+    ap.add_argument("--device", default="cpu")
+    args = ap.parse_args(argv)
+    dev = torch.device(args.device)
+
+    blob = np.load(args.data / "pairs.npz")
+    theta_n = P.normalize(blob["theta"]).astype("float32")
+    x = blob["x"].astype("float32")
+    n_val = max(1, int(len(theta_n) * args.val_frac))
+    theta_va, x_va = theta_n[:n_val], torch.tensor(x[:n_val], device=dev)
+
+    ckpt = torch.load(args.weights / "posterior.pt", map_location=dev)
+    model = PosteriorFlow(dim_theta=P.dim(), dim_x=ckpt["dim_x"]).to(dev)
+    model.load_state_dict(ckpt["model"])
+
+    ranks, cov, err = coverage_and_sbc(model, theta_va, x_va, cred=args.cred)
+    print(f"{'parameter':<55} {'mae(norm)':>10} {f'{int(args.cred*100)}%cov':>8}")
+    for j, p in enumerate(P.PARAMS):
+        flag = "" if abs(cov[j] - args.cred) <= 0.05 + 1e-9 else "  miscalibrated"
+        print(f"{p.name:<55} {err[j]:>10.3f} {cov[j]:>8.2f}{flag}")
+    print("\nSBC: rank histograms should be ~uniform (export ranks for a plot).")
+    print("TODO: save ranks -> SBC rank-histogram PNG; flag non-uniform params "
+          "as poorly identified from GL alone (expected for some — an honest "
+          "finding, not a bug).")
+
+
+if __name__ == "__main__":
+    main()

From 63c09e83c27a152fd7e98e029a75dff8254e1573 Mon Sep 17 00:00:00 2001
From: Michael Ivertowski <mivertowski@outlook.com>
Date: Thu, 21 May 2026 12:44:50 +0200
Subject: [PATCH 03/18] feat(ml/inverse): implement simulate.py + trim params
 to 5 verified scalar knobs

Fills the simulate.py TODOs: a 29-dim observable-only GL summary-stat vector
(amount / Benford / round-dollar / weekend / lines-per-JE / posting-lag /
source-mix / IET / GL-concentration) and run_one (dotted-key config override ->
datasynth-data generate -> summary_stats), fanned out over a process pool.
Drops the invalid distributions.amounts.sigma knob (amounts is a mixture
components list, not a scalar) so overrides stay valid under
deny_unknown_fields; 5 verified scalar params remain for the tier-1 demo.
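
The dotted-key override step can be pictured as below. This is a hedged
sketch, not necessarily how run_one does it: apply_overrides is a
hypothetical helper, and the base config is a made-up dict rather than a
real generate config.

```python
import copy


def apply_overrides(base, overrides):
    """Return a deep copy of `base` with {"a.b.c": v} keys written in place."""
    cfg = copy.deepcopy(base)  # never mutate the shared base config
    for dotted, value in overrides.items():
        *parents, leaf = dotted.split(".")
        node = cfg
        for key in parents:
            # Create (or replace a non-dict placeholder with) a nested mapping.
            if not isinstance(node.get(key), dict):
                node[key] = {}
            node = node[key]
        node[leaf] = value
    return cfg


# Illustrative base config; the values are invented for the example.
base = {"fraud": {"enabled": True, "fraud_rate": 0.01}}
patched = apply_overrides(base, {
    "fraud.fraud_rate": 0.05,
    "vendor_network.dependencies.top_5_concentration": 0.4,
})
print(patched)
```

The sketch creates missing parent sections without checking them against
the schema. With deny_unknown_fields in force, a key that is not a real
config field would presumably be rejected at load time, which is why the
knob list is limited to verified scalar paths.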

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 experiments/ml/inverse/params.py   |  10 +-
 experiments/ml/inverse/simulate.py | 252 ++++++++++++++++++++++++-----
 2 files changed, 215 insertions(+), 47 deletions(-)

diff --git a/experiments/ml/inverse/params.py b/experiments/ml/inverse/params.py
index 3829bad5..083161fb 100644
--- a/experiments/ml/inverse/params.py
+++ b/experiments/ml/inverse/params.py
@@ -19,16 +19,18 @@ class Param:
     log: bool = False  # sample/scale in log space (for rates spanning decades)
 
 
-# Tier-1 set. Names map to GeneratorConfig keys (see datasynth-config schema).
+# Tier-1 set. Names map to GeneratorConfig keys (verified scalar paths in
+# datasynth-config/src/schema.rs — `deny_unknown_fields` is in force, so every
+# key here must be a real settable field). The amount log-normal width was
+# dropped: `distributions.amounts` is a mixture `components` list, not a scalar
+# `sigma`; re-adding it needs a structured (replace-components) override —
+# tracked as the 6th-knob follow-up.
 PARAMS: list[Param] = [
     Param("fraud.fraud_rate", 0.0, 0.10),
     Param("fraud.document_fraud_rate", 0.0, 0.10),
-    Param("distributions.amounts.sigma", 0.5, 2.5),      # log-normal width
     Param("temporal_patterns.processing_lags.invoice_receipt_lag.mu", 0.0, 3.0),
     Param("temporal_patterns.processing_lags.invoice_receipt_lag.sigma", 0.2, 1.5),
     Param("vendor_network.dependencies.top_5_concentration", 0.20, 0.70),
-    # TODO: add lines-per-JE dispersion + source-mix concentration once the
-    # summary-stat feature map (simulate.py) exposes the matching observables.
 ]
 
 
diff --git a/experiments/ml/inverse/simulate.py b/experiments/ml/inverse/simulate.py
index b71b26ff..a69347a1 100644
--- a/experiments/ml/inverse/simulate.py
+++ b/experiments/ml/inverse/simulate.py
@@ -4,61 +4,211 @@
 `datasynth-data generate`, read the resulting journal_entries, compute the
 summary-stat feature vector x. Emit (θ, x) to `--out`.
 
-    python -m inverse.simulate --n 2000 --out data/inverse --base configs/demo.yaml
+    python -m inverse.simulate --n 2000 --out data/inverse --base configs/inverse_base.yaml
 
-CPU-bound and embarrassingly parallel across θ; safe to shard. Does NOT need
-the corpus — this is pure synthetic self-simulation (the SBI training set).
+CPU-bound and embarrassingly parallel across θ; we fan out over a process
+pool. Does NOT need the corpus — pure synthetic self-simulation (the SBI
+training set). Runs that fail (bad override / generate error) are dropped, not
+fatal.
 """
 
 from __future__ import annotations
 
 import argparse
+import copy
 import json
+import os
+import shutil
 import subprocess
 import tempfile
+from concurrent.futures import ProcessPoolExecutor, as_completed
 from pathlib import Path
 
 import numpy as np
+import pandas as pd
+import yaml
 
 from . import params as P
 
+_ROUND_LEVELS = np.array([1_000.0, 5_000.0, 10_000.0, 25_000.0, 50_000.0, 100_000.0])
+
+# Fixed feature order — keep stable so x is comparable across runs / re-runs.
+FEATURE_NAMES = [
+    "log_amt_mean", "log_amt_std", "log_amt_skew", "benford_mad", "round_frac",
+    "weekend_frac", "monthend_frac", "postclose_frac", "manual_frac",
+    "lpje_mean", "lpje_std", "lpje_frac2", "lpje_frac_gt2",
+    "lag_mean", "lag_std", "lag_pos_frac",
+    "src_share1", "src_share2", "src_share3", "src_share4", "src_share5", "src_entropy",
+    "iet_mean", "iet_std",
+    "gl_n_log", "gl_top5_share", "gl_entropy",
+    "n_lines_log", "n_docs_log",
+]
+DIM_X = len(FEATURE_NAMES)
+
 
 def _cli() -> str:
-    import shutil
     for c in ("./target/release/datasynth-data", "datasynth-data"):
-        if shutil.which(c) or Path(c).exists():
+        if Path(c).exists() or shutil.which(c):
             return c
     raise FileNotFoundError("build datasynth-data (cargo build --release -p datasynth-cli)")
 
 
+def _moments(v: np.ndarray) -> tuple[float, float, float]:
+    v = v[np.isfinite(v)]
+    if v.size == 0:
+        return 0.0, 0.0, 0.0
+    m, s = float(v.mean()), float(v.std())
+    sk = float(((v - m) ** 3).mean() / (s ** 3)) if s > 1e-9 else 0.0
+    return m, s, sk
+
+
+def _entropy(shares: np.ndarray) -> float:
+    p = shares[shares > 0]
+    return float(-(p * np.log(p)).sum()) if p.size else 0.0
+
+
 def summary_stats(je_csv: Path) -> np.ndarray:
-    """GL → fixed-length feature vector x. Reuses the same observables the BF
-    eval keys on so the inverse 'sees' what the forward model varies.
-
-    TODO: finalize alongside params.py. Pseudocode:
-      - per-source row-share (top-K sources)
-      - inter-event-time mean/std/skew per source, pooled
-      - lines-per-JE histogram (fixed bins)
-      - log|amount| mean/std + Benford first-digit MAD
-      - weekend / off-hours / round-dollar fractions
-      - fan-out degree mean/gini; document-chain completeness
-    """
-    raise NotImplementedError(
-        "summary_stats: compute the fixed-length feature vector from je_csv "
-        "(see SPEC.md § Summary statistics; share extractors with common.bf_bridge)."
-    )
-
-
-def run_one(cli: str, base_cfg: Path, theta: np.ndarray, workdir: Path) -> np.ndarray:
-    overrides = P.to_config_overrides(theta)
-    # TODO: merge `overrides` (dotted keys) into a copy of base_cfg → cfg.yaml.
-    cfg = workdir / "cfg.yaml"
+    """GL → fixed-length feature vector x (DIM_X,). Observable-only (no labels)
+    so the same map applies to a real GL at inference time."""
+    df = pd.read_csv(je_csv, low_memory=False)
+    n = len(df)
+    if n == 0:
+        return np.zeros(DIM_X, dtype=np.float32)
+
+    deb = pd.to_numeric(df.get("debit_amount", 0), errors="coerce").fillna(0.0).to_numpy()
+    cred = pd.to_numeric(df.get("credit_amount", 0), errors="coerce").fillna(0.0).to_numpy()
+    amt = np.where(deb != 0, deb, cred).astype(float)
+    nz = np.abs(amt[amt != 0])
+    log_amt = np.log1p(nz)
+    la_mean, la_std, la_skew = _moments(log_amt)
+
+    # Benford first-digit MAD vs the ideal law.
+    fd = np.array([int(str(int(a))[0]) for a in nz if a >= 1], dtype=int)
+    if fd.size:
+        obs = np.array([(fd == d).mean() for d in range(1, 10)])
+        exp = np.log10(1 + 1 / np.arange(1, 10))
+        benford_mad = float(np.abs(obs - exp).mean())
+    else:
+        benford_mad = 0.0
+
+    nearest = np.abs(nz[:, None] - _ROUND_LEVELS[None, :]).min(axis=1) if nz.size else np.array([1e9])
+    round_frac = float((nearest < 1.0).mean())
+
+    pdt = pd.to_datetime(df.get("posting_date"), errors="coerce")
+    dow = pdt.dt.dayofweek
+    weekend_frac = float((dow >= 5).mean()) if n else 0.0
+    monthend_frac = float((pdt.dt.day >= 25).mean()) if n else 0.0
+
+    def _frac_true(col: str) -> float:
+        if col not in df:
+            return 0.0
+        return float(df[col].astype("boolean").fillna(False).mean())
+
+    postclose_frac = _frac_true("is_post_close")
+    manual_frac = _frac_true("is_manual")
+
+    # Lines per JE
+    if "document_id" in df:
+        lpje = df.groupby("document_id").size().to_numpy()
+        lpje_mean, lpje_std = float(lpje.mean()), float(lpje.std())
+        lpje_f2 = float((lpje == 2).mean())
+        lpje_fgt2 = float((lpje > 2).mean())
+    else:
+        lpje_mean = lpje_std = lpje_f2 = lpje_fgt2 = 0.0
+
+    # Posting lag (posting - document), days
+    if "document_date" in df:
+        ddt = pd.to_datetime(df["document_date"], errors="coerce")
+        lag = (pdt - ddt).dt.days.to_numpy().astype(float)
+        lag = lag[np.isfinite(lag)]
+        lag_mean, lag_std = (float(lag.mean()), float(lag.std())) if lag.size else (0.0, 0.0)
+        lag_pos = float((lag > 0).mean()) if lag.size else 0.0
+    else:
+        lag_mean = lag_std = lag_pos = 0.0
+
+    # Source mix
+    if "source" in df:
+        vc = df["source"].astype(str).value_counts(normalize=True)
+        shares = vc.to_numpy()
+        src5 = list(shares[:5]) + [0.0] * (5 - min(5, len(shares)))
+        src_ent = _entropy(shares)
+    else:
+        src5, src_ent = [0.0] * 5, 0.0
+
+    # Inter-event time per source (pooled gaps between sorted posting days)
+    iets = []
+    if "source" in df and pdt.notna().any():
+        tmp = pd.DataFrame({"s": df["source"].astype(str), "d": pdt})
+        for _, g in tmp.dropna().groupby("s"):
+            days = np.sort(g["d"].astype("int64").to_numpy()) / 86_400_000_000_000
+            if days.size > 1:
+                iets.append(np.diff(days))
+    if iets:
+        allg = np.concatenate(iets)
+        iet_mean, iet_std = float(allg.mean()), float(allg.std())
+    else:
+        iet_mean = iet_std = 0.0
+
+    # GL account fan-out / concentration
+    if "gl_account" in df:
+        gvc = df["gl_account"].astype(str).value_counts(normalize=True)
+        gl_n_log = float(np.log1p(len(gvc)))
+        gl_top5 = float(gvc.to_numpy()[:5].sum())
+        gl_ent = _entropy(gvc.to_numpy())
+    else:
+        gl_n_log = gl_top5 = gl_ent = 0.0
+
+    n_docs = df["document_id"].nunique() if "document_id" in df else n
+    feats = [
+        la_mean, la_std, la_skew, benford_mad, round_frac,
+        weekend_frac, monthend_frac, postclose_frac, manual_frac,
+        lpje_mean, lpje_std, lpje_f2, lpje_fgt2,
+        lag_mean, lag_std, lag_pos,
+        *src5, src_ent,
+        iet_mean, iet_std,
+        gl_n_log, gl_top5, gl_ent,
+        float(np.log1p(n)), float(np.log1p(n_docs)),
+    ]
+    return np.asarray(feats, dtype=np.float32)
+
+
+def _set_dotted(cfg: dict, key: str, value) -> None:
+    """Set a dotted key into a nested dict, creating intermediate dicts."""
+    parts = key.split(".")
+    node = cfg
+    for p in parts[:-1]:
+        node = node.setdefault(p, {})
+    node[parts[-1]] = value
+
+
+def run_one(cli: str, base_cfg: dict, theta: np.ndarray, seed: int, workdir: Path) -> np.ndarray | None:
+    cfg = copy.deepcopy(base_cfg)
+    for k, v in P.to_config_overrides(theta).items():
+        _set_dotted(cfg, k, v)
+    _set_dotted(cfg, "global.seed", int(seed))
+    cfg_path = workdir / "cfg.yaml"
     out = workdir / "out"
-    raise NotImplementedError(
-        f"run_one: write {cfg} = base_cfg + {overrides}, then "
-        f"`{cli} generate -c {cfg} -o {out} --memory-limit 512 --max-threads 1`, "
-        f"then summary_stats({out}/journal_entries.csv)."
-    )
+    cfg_path.write_text(yaml.safe_dump(cfg))
+    try:
+        subprocess.run(
+            [cli, "generate", "--config", str(cfg_path), "--output", str(out),
+             "--max-threads", "1"],
+            check=True, capture_output=True, timeout=300,
+        )
+        je = out / "journal_entries.csv"
+        if not je.exists():
+            return None
+        return summary_stats(je)
+    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
+        return None
+    finally:
+        shutil.rmtree(out, ignore_errors=True)
+
+
+def _worker(args) -> tuple[int, np.ndarray | None]:
+    i, theta, seed, cli, base_cfg = args
+    with tempfile.TemporaryDirectory(prefix="sbi_") as td:
+        return i, run_one(cli, base_cfg, theta, seed, Path(td))
 
 
 def main(argv: list[str] | None = None) -> None:
@@ -67,26 +217,42 @@ def main(argv: list[str] | None = None) -> None:
     ap.add_argument("--out", type=Path, required=True)
     ap.add_argument("--base", type=Path, required=True, help="base generate config")
     ap.add_argument("--seed", type=int, default=0)
+    ap.add_argument("--workers", type=int, default=0, help="0 = os.cpu_count()-2")
     args = ap.parse_args(argv)
     args.out.mkdir(parents=True, exist_ok=True)
 
+    workers = args.workers or max(1, (os.cpu_count() or 4) - 2)
     rng = np.random.default_rng(args.seed)
     thetas = P.sample_prior(rng, args.n)
     cli = _cli()
+    base_cfg = yaml.safe_load(args.base.read_text())
+
+    jobs = [(i, thetas[i], args.seed + 1 + i, cli, base_cfg) for i in range(args.n)]
+    xs: dict[int, np.ndarray] = {}
+    done = fail = 0
+    with ProcessPoolExecutor(max_workers=workers) as ex:
+        futs = [ex.submit(_worker, j) for j in jobs]
+        for fut in as_completed(futs):
+            i, x = fut.result()
+            done += 1
+            if x is None:
+                fail += 1
+            else:
+                xs[i] = x
+            if done % 50 == 0:
+                print(f"[simulate] {done}/{args.n}  (failed {fail})", flush=True)
 
-    xs = []
-    with tempfile.TemporaryDirectory() as td:
-        for i, theta in enumerate(thetas):
-            x = run_one(cli, args.base, theta, Path(td))  # raises until wired
-            xs.append(x)
-            if (i + 1) % 50 == 0:
-                print(f"[simulate] {i+1}/{args.n}")
-    X = np.stack(xs)
-    np.savez(args.out / "pairs.npz", theta=thetas, x=X,
-             param_names=[p.name for p in P.PARAMS])
+    keep = sorted(xs)
+    if not keep:
+        raise SystemExit("[simulate] all runs failed — check the base config / override keys")
+    X = np.stack([xs[i] for i in keep])
+    theta_keep = thetas[keep]
+    np.savez(args.out / "pairs.npz", theta=theta_keep, x=X,
+             param_names=[p.name for p in P.PARAMS], feature_names=FEATURE_NAMES)
     (args.out / "meta.json").write_text(json.dumps(
-        {"n": args.n, "dim_theta": P.dim(), "dim_x": X.shape[1]}, indent=2))
-    print(f"[simulate] wrote {args.out/'pairs.npz'}")
+        {"n_requested": args.n, "n_kept": len(keep), "n_failed": fail,
+         "dim_theta": P.dim(), "dim_x": int(X.shape[1])}, indent=2))
+    print(f"[simulate] kept {len(keep)}/{args.n} (failed {fail}) -> {args.out/'pairs.npz'}")
 
 
 if __name__ == "__main__":

From 701e7b2d7d70304ffbd134830b2e6add0d3d5e9e Mon Sep 17 00:00:00 2001
From: Michael Ivertowski <mivertowski@outlook.com>
Date: Thu, 21 May 2026 13:20:14 +0200
Subject: [PATCH 04/18] feat(ml): descriptive corpus-vs-synthetic gap
 (interpretable 'what is missing')

Side-by-side behavioral observables (lines-per-JE, log-amount moments, Benford
MAD, round-dollar / small-ticket share, p99 amount, weekend share, source mix,
per-source inter-event times) for corpus (corpus columns) vs a synthetic
journal_entries.csv (canonical columns). Complements the normalized DRs from
behavioral score with raw units.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 experiments/ml/common/corpus_gap.py | 117 ++++++++++++++++++++++++++++
 1 file changed, 117 insertions(+)
 create mode 100644 experiments/ml/common/corpus_gap.py

diff --git a/experiments/ml/common/corpus_gap.py b/experiments/ml/common/corpus_gap.py
new file mode 100644
index 00000000..25570b5d
--- /dev/null
+++ b/experiments/ml/common/corpus_gap.py
@@ -0,0 +1,117 @@
+"""Descriptive corpus-vs-synthetic gap — 'what's missing on the synthetic end'.
+
+Complements `datasynth-data behavioral score` (normalized degradation ratios)
+with raw, interpretable observables in plain units, so the gap is legible:
+lines-per-JE, amount distribution (log-moments / Benford / round-dollar / small-
+ticket share), source mix, weekend share, and per-source inter-event times.
+
+    python -m common.corpus_gap --corpus /path/corpus.parquet --syn /path/journal_entries.csv
+
+Corpus uses its own column names; synthetic uses canonical names. Both are
+mapped here. Emits a side-by-side table + a JSON of the gaps.
+"""
+from __future__ import annotations
+
+import argparse
+import json
+from pathlib import Path
+
+import numpy as np
+import pandas as pd
+
+_ROUND = np.array([1_000.0, 5_000.0, 10_000.0, 25_000.0, 50_000.0, 100_000.0])
+
+
+def _benford_mad(a: np.ndarray) -> float:
+    fd = np.array([int(str(int(x))[0]) for x in a if x >= 1], dtype=int)
+    if not fd.size:
+        return float("nan")
+    obs = np.array([(fd == d).mean() for d in range(1, 10)])
+    exp = np.log10(1 + 1 / np.arange(1, 10))
+    return float(np.abs(obs - exp).mean())
+
+
+def _iet_stats(df: pd.DataFrame, src: str, date: str) -> tuple[float, float]:
+    iets = []
+    sub = df[[src, date]].dropna()
+    sub = sub.assign(_d=pd.to_datetime(sub[date], errors="coerce")).dropna(subset=["_d"])
+    for _, g in sub.groupby(src):
+        days = np.sort(g["_d"].astype("int64").to_numpy()) / 86_400_000_000_000
+        if days.size > 1:
+            iets.append(np.diff(days))
+    if not iets:
+        return float("nan"), float("nan")
+    allg = np.concatenate(iets)
+    return float(allg.mean()), float(allg.std())
+
+
+def observables(df: pd.DataFrame, jeid: str, src: str, amt: np.ndarray, date: str) -> dict:
+    a = np.abs(amt)
+    a = a[np.isfinite(a) & (a > 0)]
+    la = np.log1p(a)
+    lpje = df.groupby(jeid).size().to_numpy() if jeid in df else np.array([np.nan])
+    pdt = pd.to_datetime(df[date], errors="coerce") if date in df else pd.Series([], dtype="datetime64[ns]")
+    nearest = np.abs(a[:, None] - _ROUND[None, :]).min(axis=1) if a.size else np.array([1e9])
+    iet_m, iet_s = _iet_stats(df, src, date) if src in df and date in df else (float("nan"), float("nan"))
+    vc = df[src].astype(str).value_counts(normalize=True).to_numpy() if src in df else np.array([1.0])
+    return {
+        "n_lines": int(len(df)),
+        "n_JEs": int(df[jeid].nunique()) if jeid in df else float("nan"),
+        "lines_per_JE_mean": float(np.nanmean(lpje)),
+        "lines_per_JE_p95": float(np.nanpercentile(lpje, 95)),
+        "log_amt_mean": float(la.mean()),
+        "log_amt_std": float(la.std()),
+        "log_amt_skew": float(((la - la.mean()) ** 3).mean() / (la.std() ** 3 + 1e-9)),
+        "benford_mad": _benford_mad(a),
+        "round_dollar_frac": float((nearest < 1.0).mean()),
+        "small_ticket_frac(<100)": float((a < 100).mean()),
+        "p99_amount": float(np.percentile(a, 99)) if a.size else float("nan"),
+        "weekend_frac": float((pdt.dt.dayofweek >= 5).mean()) if len(pdt) else float("nan"),
+        "n_sources": int(len(vc)),
+        "source_top1_share": float(vc.max()),
+        "source_entropy": float(-(vc[vc > 0] * np.log(vc[vc > 0])).sum()),
+        "iet_days_mean": iet_m,
+        "iet_days_std": iet_s,
+    }
+
+
+def load_corpus(path: Path) -> dict:
+    df = pd.read_parquet(path)
+    amt = pd.to_numeric(df["Functional Amount"], errors="coerce").to_numpy()
+    return observables(df, "JE Number", "Source", amt, "Entry Date")
+
+
+def load_syn(path: Path) -> dict:
+    df = pd.read_csv(path, low_memory=False)
+    deb = pd.to_numeric(df.get("debit_amount", 0), errors="coerce").fillna(0.0)
+    cred = pd.to_numeric(df.get("credit_amount", 0), errors="coerce").fillna(0.0)
+    amt = np.where(deb != 0, deb, cred).astype(float)
+    return observables(df, "document_id", "source", amt, "posting_date")
+
+
+def main() -> None:
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--corpus", type=Path, required=True)
+    ap.add_argument("--syn", type=Path, required=True)
+    ap.add_argument("--out", type=Path, default=None)
+    args = ap.parse_args()
+
+    corp = load_corpus(args.corpus)
+    syn = load_syn(args.syn)
+    keys = list(corp.keys())
+    print(f"{'observable':<26} {'corpus':>16} {'synthetic':>16} {'ratio syn/corp':>16}")
+    print("-" * 78)
+    gaps = {}
+    for k in keys:
+        c, s = corp[k], syn[k]
+        r = (s / c) if (isinstance(c, (int, float)) and np.isfinite(c) and c != 0) else float("nan")
+        gaps[k] = {"corpus": c, "synthetic": s, "ratio": r}
+        print(f"{k:<26} {c:>16.4g} {s:>16.4g} {r:>16.3g}")
+    if args.out:
+        args.out.parent.mkdir(parents=True, exist_ok=True)
+        args.out.write_text(json.dumps(gaps, indent=2))
+        print(f"\nwrote {args.out}")
+
+
+if __name__ == "__main__":
+    main()

From c66d8f5e16a658c88af0b80f3f16d1bd6f9a674b Mon Sep 17 00:00:00 2001
From: Michael Ivertowski <mivertowski@outlook.com>
Date: Thu, 21 May 2026 13:27:45 +0200
Subject: [PATCH 05/18] feat(ml/flow): implement export_flow (COA-joined
 amounts, account-class conditioning)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Reads corpus Functional Amount + GL Account Number, joins account_class via the
COA 'c' key, emits y=signed log1p(|amount|) + one-hot(account_class) to
amounts.parquet for the conditional flow. Tail clipped at p99.9 (privacy).
Source (~4500 corpus levels — itself a finding) is not one-hot encoded.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 experiments/ml/common/data_export.py | 68 +++++++++++++++++++++++++---
 1 file changed, 62 insertions(+), 6 deletions(-)

diff --git a/experiments/ml/common/data_export.py b/experiments/ml/common/data_export.py
index df6a9645..6a2bcb95 100644
--- a/experiments/ml/common/data_export.py
+++ b/experiments/ml/common/data_export.py
@@ -104,15 +104,71 @@ def export_sequence(corpus: Path, cols: ColumnMap, out: Path) -> None:
     )
 
 
-def export_flow(corpus: Path, cols: ColumnMap, out: Path) -> None:
-    """Per-(source, account-class) amount samples + conditioning features.
+def _account_class_map(corpus: Path) -> dict[str, str]:
+    """Build {gl_account: account_class} from the corpus COA_*.parquet files.
 
-    See flow/SPEC.md § Data.
+    Join key column ``c`` is the zero-padded GL account number; ``Account
+    Class`` is the ISO-style class label. Account numbers are consistent across
+    clients, so a global map is fine (first wins on the rare conflict).
     """
-    raise NotImplementedError(
-        "TODO(flow): collect log|amount| per (source, account_class), plus "
-        "conditioning one-hots; write out/amounts.parquet."
+    import pandas as pd
+
+    m: dict[str, str] = {}
+    for fp in sorted(corpus.glob("COA_*.parquet")):
+        try:
+            c = pd.read_parquet(fp, columns=["c", "Account Class"])
+        except Exception:  # noqa: BLE001 — skip a malformed/empty COA shard
+            continue
+        for acct, cls in zip(c["c"].astype(str), c["Account Class"].astype(str)):
+            m.setdefault(acct, cls)
+    return m
+
+
+def export_flow(corpus: Path, cols: ColumnMap, out: Path) -> None:
+    """Amount samples + one-hot account-class conditioning → amounts.parquet.
+
+    ``y = signed log1p(|amount|)``; conditioning = one-hot(account_class) from
+    the COA join. Source has thousands of corpus levels (a finding in itself,
+    not a useful one-hot), so account-class is the amount-shape conditioning
+    axis. The extreme tail is clipped at the 99.9th percentile (privacy: don't
+    memorize rare exact large amounts — flow/SPEC.md § Privacy). Aggregated
+    numeric only, gitignored. See flow/SPEC.md § Data.
+    """
+    import json
+
+    import numpy as np
+    import pandas as pd
+
+    acc = _account_class_map(corpus)
+    parts = []
+    for fp in sorted(corpus.glob("JE_*.parquet")):
+        d = pd.read_parquet(fp, columns=[cols.amount, cols.gl_account])
+        d.columns = ["amount", "gl_account"]
+        parts.append(d)
+    df = pd.concat(parts, ignore_index=True)
+    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
+    df = df.dropna(subset=["amount"])
+    df = df[df["amount"] != 0.0]
+    df["account_class"] = df["gl_account"].astype(str).map(acc).fillna("UNK")
+    df["y"] = np.sign(df["amount"]) * np.log1p(np.abs(df["amount"]))
+    hi = float(df["y"].abs().quantile(0.999))
+    df["y"] = df["y"].clip(lower=-hi, upper=hi)
+    if len(df) > 3_000_000:
+        df = df.sample(3_000_000, random_state=0).reset_index(drop=True)
+    onehot = pd.get_dummies(df["account_class"], prefix="cls").astype("float32")
+    out_df = pd.concat(
+        [df[["y"]].astype("float32").reset_index(drop=True), onehot.reset_index(drop=True)],
+        axis=1,
+    )
+    out_df.to_parquet(out / "amounts.parquet")
+    (out / "flow_meta.json").write_text(
+        json.dumps(
+            {"n": int(len(out_df)), "cond_cols": list(onehot.columns),
+             "n_classes": int(onehot.shape[1]), "y_clip_hi": hi}, indent=2
+        )
     )
+    print(f"[flow] {len(out_df):,} amounts, {onehot.shape[1]} account-class conds "
+          f"-> {out/'amounts.parquet'}")
 
 
 EXPORTERS = {

From 5a4734bae7377aac32c08afd19d3f83cbc91c64f Mon Sep 17 00:00:00 2001
From: Michael Ivertowski <mivertowski@outlook.com>
Date: Thu, 21 May 2026 14:51:43 +0200
Subject: [PATCH 06/18] fix(ml/flow): standardize y before the NSF (was
 collapsing the amount tail)

The neural-spline flow's default domain (~[-5,5]) couldn't represent corpus
signed-log1p amounts (which reach ~10.4), collapsing learned p99 to ~$142 vs
the corpus $33k. Standardize y (mean/std saved in the checkpoint) so the tail
lands inside the spline; samples are unstandardized at characterization.
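
A minimal sketch of the round trip, assuming only the `y_mean` / `y_std`
checkpoint keys added by this patch. The helper names `standardize`,
`unstandardize` and `to_amount` are hypothetical and do not exist in
flow/train.py:

```python
import numpy as np


def standardize(y_raw: np.ndarray) -> tuple[np.ndarray, float, float]:
    """Return (z, mean, std); std falls back to 1.0 when it is exactly zero."""
    mean = float(y_raw.mean())
    # NumPy's std is the population std; torch's default is the sample std.
    # Either works as long as the same saved values are reused at sampling time.
    std = float(y_raw.std()) or 1.0
    return (y_raw - mean) / std, mean, std


def unstandardize(z: np.ndarray, mean: float, std: float) -> np.ndarray:
    """Undo standardize() with the saved mean/std."""
    return z * std + mean


def to_amount(y: np.ndarray) -> np.ndarray:
    """Invert y = sign(a) * log1p(|a|) back to a signed amount."""
    return np.sign(y) * np.expm1(np.abs(y))


amounts = np.array([-250.0, 12.5, 980.0, 33_000.0])
y_raw = np.sign(amounts) * np.log1p(np.abs(amounts))
z, mean, std = standardize(y_raw)
recovered = to_amount(unstandardize(z, mean, std))
```

Here `z` stands in for what the flow trains on and samples; `recovered` is
what a characterization step would compare against corpus amounts.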

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 experiments/ml/flow/train.py | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/experiments/ml/flow/train.py b/experiments/ml/flow/train.py
index b24d919d..ef8f397e 100644
--- a/experiments/ml/flow/train.py
+++ b/experiments/ml/flow/train.py
@@ -29,7 +29,13 @@ def main(argv: list[str] | None = None) -> None:
     import pandas as pd
 
     df = pd.read_parquet(args.data / "amounts.parquet")
-    y = torch.tensor(df["y"].to_numpy(), dtype=torch.float32).unsqueeze(-1)
+    y_raw = torch.tensor(df["y"].to_numpy(), dtype=torch.float32).unsqueeze(-1)
+    # Standardize y so it lands inside the neural-spline domain. Without this
+    # the NSF (default bound ~[-5,5]) cannot represent the heavy amount tail —
+    # corpus signed-log1p amounts reach ~10.4, collapsing learned p99 to ~$142.
+    y_mean = float(y_raw.mean())
+    y_std = float(y_raw.std()) or 1.0
+    y = (y_raw - y_mean) / y_std
     c = torch.tensor(
         df.drop(columns=["y"]).to_numpy(), dtype=torch.float32
     )  # conditioning one-hots
@@ -51,7 +57,8 @@ def main(argv: list[str] | None = None) -> None:
             running += loss.item()
         print(f"epoch {epoch:3d}  nll={running/len(dl):.4f}")
 
-    torch.save({"model": model.state_dict(), "cond_dim": c.size(1)},
+    torch.save({"model": model.state_dict(), "cond_dim": c.size(1),
+                "y_mean": y_mean, "y_std": y_std},
                args.out / "amount_flow.pt")
     print(f"[flow.train] saved {args.out/'amount_flow.pt'}")
     print("TODO(flow): export spline knots for the candle AmountSampler port, "

From 1c54d5525f354cfaa62a297bf90eaebda7e19acb Mon Sep 17 00:00:00 2001
From: Michael Ivertowski <mivertowski@outlook.com>
Date: Thu, 21 May 2026 15:02:29 +0200
Subject: [PATCH 07/18] feat(ml/sequence): implement export_sequence
 (factorized event-token streams)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Per-(client, source) ordered streams → streams.pt (dt / lines / account_class /
weekday / hour_band fields, 0=pad) + vocab.json, matching EventStreamTransformer.
Δt + line-count carry the inter-event/burst signal (the 60x IET-regularity gap
the descriptive analysis surfaced). Per-client processing bounds memory over the
50M-row corpus; source ranked to a 0..62 id map + 'other'.
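
A small sketch of the bucketing, using the edge lists from `export_sequence`
in this patch. The helper name `bucketize` is hypothetical; the exporter
inlines the same `np.digitize(...) + 1` expression:

```python
import numpy as np

DT_EDGES = [1, 2, 4, 8, 15, 30]  # day-gap bucket edges, as in export_sequence
LC_EDGES = [2, 3, 5, 9, 17]      # lines-per-JE bucket edges, as in export_sequence


def bucketize(values: np.ndarray, edges: list[int]) -> np.ndarray:
    """Map raw values to token ids 1..len(edges)+1, leaving id 0 free for padding."""
    return np.digitize(values, edges) + 1


# Gaps of 0, 1, 3, 30 and 100 days between consecutive JEs of one source.
dt_tokens = bucketize(np.array([0, 1, 3, 30, 100]), DT_EDGES)
# JEs with 1, 2, 4 and 20 lines.
lc_tokens = bucketize(np.array([1, 2, 4, 20]), LC_EDGES)
```

Every gap of 30 days or more lands in the top dt bucket, so the vocabulary
stays at 8 ids for dt and 7 for lines, including pad.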

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 experiments/ml/common/data_export.py | 81 ++++++++++++++++++++++++----
 1 file changed, 71 insertions(+), 10 deletions(-)

diff --git a/experiments/ml/common/data_export.py b/experiments/ml/common/data_export.py
index 6a2bcb95..54752c8f 100644
--- a/experiments/ml/common/data_export.py
+++ b/experiments/ml/common/data_export.py
@@ -91,17 +91,78 @@ def export_gnn(corpus: Path, cols: ColumnMap, out: Path) -> None:
     )
 
 
-def export_sequence(corpus: Path, cols: ColumnMap, out: Path) -> None:
-    """Per-(source, entity) ordered event streams → token tensors.
-
-    Token = (Δt-bucket, line-count-bucket, account-class, weekday, hour-band).
-    See sequence/SPEC.md § Data.
+def export_sequence(corpus: Path, cols: ColumnMap, out: Path, max_streams: int = 60_000,
+                    seq_len: int = 128) -> None:
+    """Per-(client, source) ordered event-token streams → streams.pt + vocab.json.
+
+    Factorized fields matching `EventStreamTransformer`: (dt, lines,
+    account_class, weekday, hour_band), 0 = pad. Δt + line-count carry the
+    inter-event-time / burst signal (the IET-regularity gap). hour_band is a
+    constant single band — the corpus dates carry no time-of-day. Processed
+    per-client to bound memory over the 50M-row corpus. See sequence/SPEC.md.
     """
-    raise NotImplementedError(
-        "TODO(sequence): group by (client, source, trading_partner), sort by "
-        "entry_date, derive inter-event Δt + per-JE line count, bucketize, and "
-        "write out/streams.pt (padded) + out/vocab.json."
-    )
+    import json
+
+    import numpy as np
+    import pandas as pd
+    import torch
+
+    acc = _account_class_map(corpus)
+    DT_EDGES = [1, 2, 4, 8, 15, 30]   # → digitize 0..6, +1 → 1..7 (vocab 8)
+    LC_EDGES = [2, 3, 5, 9, 17]       # → digitize 0..5, +1 → 1..6 (vocab 7)
+
+    # First pass (cheap): rank sources by line volume for the 0..62 source-id map.
+    src_counts: dict[str, int] = {}
+    files = sorted(corpus.glob("JE_*.parquet"))
+    for fp in files:
+        s = pd.read_parquet(fp, columns=[cols.source])[cols.source].astype(str)
+        for k, v in s.value_counts().items():
+            src_counts[k] = src_counts.get(k, 0) + int(v)
+    top_src = [s for s, _ in sorted(src_counts.items(), key=lambda kv: -kv[1])[:63]]
+    src_ids = {s: i for i, s in enumerate(top_src)}
+
+    classes: dict[str, int] = {}
+    fld = {k: [] for k in ("dt", "lines", "account_class", "weekday", "hour_band")}
+    src_id_list: list[int] = []
+
+    def _pad(a: np.ndarray) -> np.ndarray:
+        a = a[:seq_len].astype(np.int64)
+        z = np.zeros(seq_len, dtype=np.int64)
+        z[: len(a)] = a
+        return z
+
+    for fp in files:
+        if len(src_id_list) >= max_streams:
+            break
+        d = pd.read_parquet(fp, columns=[cols.source, cols.entry_date, cols.je_number, cols.gl_account])
+        d.columns = ["source", "date", "je", "gl"]
+        d["date"] = pd.to_datetime(d["date"], errors="coerce")
+        d = d.dropna(subset=["date"])
+        d["cls"] = d["gl"].astype(str).map(acc).fillna("UNK")
+        je = d.groupby(["source", "je"]).agg(date=("date", "first"), lines=("je", "size"),
+                                             cls=("cls", "first")).reset_index()
+        for src, g in je.groupby("source"):
+            if len(g) < 3:
+                continue
+            g = g.sort_values("date")
+            dts = g["date"].diff().dt.days.fillna(0).clip(0, 3650).to_numpy()
+            cl_id = np.array([classes.setdefault(c, len(classes) + 1) for c in g["cls"]], dtype=np.int64)
+            fld["dt"].append(_pad(np.digitize(dts, DT_EDGES) + 1))
+            fld["lines"].append(_pad(np.digitize(g["lines"].to_numpy(), LC_EDGES) + 1))
+            fld["account_class"].append(_pad(cl_id))
+            fld["weekday"].append(_pad(g["date"].dt.weekday.to_numpy() + 1))
+            fld["hour_band"].append(_pad(np.ones(len(g), dtype=np.int64)))
+            src_id_list.append(src_ids.get(str(src), 63))
+            if len(src_id_list) >= max_streams:
+                break
+
+    blob = {k: torch.from_numpy(np.stack(v)) for k, v in fld.items()}
+    blob["source_id"] = torch.from_numpy(np.array(src_id_list, dtype=np.int64))
+    torch.save(blob, out / "streams.pt")
+    sizes = {"dt": 8, "lines": 7, "account_class": len(classes) + 1, "weekday": 8, "hour_band": 2}
+    (out / "vocab.json").write_text(json.dumps({"sizes": sizes, "n_streams": len(src_id_list), "T": seq_len}, indent=2))
+    print(f"[sequence] {len(src_id_list)} streams (T={seq_len}), {len(classes)} account-classes "
+          f"-> {out/'streams.pt'}")
 
 
 def _account_class_map(corpus: Path) -> dict[str, str]:

From 6ebfb05c8ada6fab685a8ec5690faf3848dfe212 Mon Sep 17 00:00:00 2001
From: Michael Ivertowski <mivertowski@outlook.com>
Date: Thu, 21 May 2026 15:15:11 +0200
Subject: [PATCH 08/18] docs(ml): corpus->synthetic gap findings +
 learning-track results

What's missing (descriptive): source diversity, IET variance ~60x, amount tail
~16x, lines-per-JE ~2.3x. DR eval degenerates at corpus scale (noise floor ~0).
Flow learns amount density (v1 tail-collapse bug found+fixed via y-standardize).
Sequence transformer trains on corpus event streams; corpus dt-bucket lag-1
autocorr -0.118 (variance, not autocorr, is the gap). v2 flow number pending.
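
A hedged sketch of the fallback that FINDINGS.md §2 suggests for the
degenerate baseline. It assumes a degradation ratio is a distance divided by
a noise floor and capped at 100; the canonical Rust scorer's exact formula is
not shown in this series, and `degradation_ratio` and `eps` are hypothetical:

```python
import math


def degradation_ratio(distance: float, noise_floor: float,
                      cap: float = 100.0, eps: float = 1e-12) -> tuple[float, bool]:
    """Return (value, is_degenerate_baseline).

    When the noise floor underflows, report the raw distance instead of a
    ratio that would saturate at the cap.
    """
    if not math.isfinite(noise_floor) or noise_floor <= eps:
        return distance, True
    return min(distance / noise_floor, cap), False


healthy = degradation_ratio(0.5, 0.01)     # ordinary ratio
saturated = degradation_ratio(5.0, 0.01)   # hits the cap
degenerate = degradation_ratio(0.5, 0.0)   # baseline underflow, raw distance
```

Returning the flag alongside the value keeps the two cases distinguishable
downstream, so a raw distance is never mistaken for a ratio.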

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 experiments/ml/FINDINGS.md | 78 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 78 insertions(+)
 create mode 100644 experiments/ml/FINDINGS.md

diff --git a/experiments/ml/FINDINGS.md b/experiments/ml/FINDINGS.md
new file mode 100644
index 00000000..9d9f41df
--- /dev/null
+++ b/experiments/ml/FINDINGS.md
@@ -0,0 +1,78 @@
+# Corpus → synthetic gap: what's missing, and what the learning tracks recover
+
+A100 study (2026-05-21), DataSynth v5.27. Goal: **learn from the real corpus
+what the synthetic generator is missing**, on the 21-client health corpus
+(53.4M JE lines, 11.8M JEs aggregated) vs the v5.27 engine. All learning is on
+the corpus on the private box; weights stay on-box (memorization rule). The
+results serve as paper grounding and as generator-optimization targets.
+
+## 1. What's missing (descriptive, corpus vs synthetic)
+
+Raw observables — interpretable units, not normalized DRs:
+
+| Observable | Corpus | Synthetic | Gap |
+|---|--:|--:|---|
+| Source diversity (entropy / count) | 3.37 / 4,504 | 0.75 / 4 | synthetic **far too concentrated** (one source ≈ 75%) |
+| Inter-event-time **std** (days) | 0.0169 | 0.00028 | synthetic **~60× too regular** (irregular-gap structure absent) |
+| Amount **p99** | $33k | $542k | synthetic tail **~16× too fat** |
+| log-amount std / skew | 2.46 / 0.56 | 3.43 / 0.99 | synthetic **over-dispersed, over-skewed** |
+| Lines per JE (mean) | 4.5 | 10.3 | synthetic JEs **~2.3× too large** |
+| Benford MAD | 0.0081 | 0.0057 | synthetic slightly *more* Benford-clean than reality |
+
+Top generator-optimization targets: **(a)** amount density (tail + spread),
+**(b)** IET-variance / lines-per-JE structure, **(c)** source-mix breadth.
+
+## 2. Methodological finding — the DR eval degenerates at full corpus scale
+
+`behavioral score` on the 53.4M-line corpus returns `is_degenerate_baseline =
+true` for **every** metric: the corpus-vs-corpus noise floor (50/50 split) is ≈0, so
+each degradation ratio divides by ~0 and saturates at the 100 cap. The
+normalized composite is therefore uninformative at this scale — the descriptive
+comparison (§1) is the actionable signal. **For the paper:** the DR noise-floor
+needs a resampling scheme that stays non-degenerate at large N (e.g. per-entity
+block bootstrap), or the composite should fall back to raw distances when the
+baseline underflows.
+
+## 3. Learning tracks — recovering the missing structure (corpus-trained)
+
+### Flow (amount density) — `flow/`
+Conditional neural-spline flow over `signed log1p(|amount|)`, conditioned on
+account-class (COA join, 294/294 accounts matched). **Bug found + fixed:** the
+NSF default spline domain (~[-5,5]) cannot represent corpus log-amounts (which
+reach ~10.4), collapsing learned p99 to ~$142 (v1). Standardizing `y` before
+the flow fixes it (v2).
+
+| | log-amt mean | std | skew | p99 | Benford MAD |
+|---|--:|--:|--:|--:|--:|
+| Corpus (held-out) | 3.91 | 2.45 | 0.54 | $32,950 | 0.0086 |
+| Flow v1 (un-standardized) | 2.81 | 1.45 | −0.39 | $142 | 0.0182 |
+| **Flow v2 (standardized y)** | _pending_ | _pending_ | _pending_ | _pending_ | _pending_ |
+| Synthetic (3-comp mixture) | 3.65 | 3.43 | 0.99 | $541,617 | 0.0057 |
+
+### Sequence (event-stream temporal) — `sequence/`
+Decoder-only transformer over per-(client, source) event-token streams (Δt /
+line-count / account-class / weekday buckets), factorized heads. **Trains
+cleanly** (loss 1.99 → 1.93 / 25 epochs over 2,500 streams) — the corpus event
+structure *is* learnable. Finding: corpus dt-bucket **lag-1 autocorr = −0.118**
+(only 11.6% of streams positively autocorrelated), so the corpus is **not**
+strongly *sequentially* bursty at this granularity — the §1 "60×" gap is
+inter-event-time **variance**, a distinct axis from autocorrelation. Data-quality
+note: the corpus COA `Account Class` has mojibake encoding variants (`Vorr??te`)
+inflating the class count to 397 — a cleaning target.
+
+## 4. GNN fraud showcase (public synthetic data) — `scripts/ml/`
+Separate, publishable result (see `scripts/ml/RESULTS_v5.27.md`): binary fraud
+GraphSAGE test AUC 0.909 (≈ a LogReg on edge features — graph adds little);
+fraud-**typology** is near-random on the collapsed edge list (macro-F1 0.09) but
+**0.58 on the line-level view** — `fraud_type` is learnable, but consumers must
+join the line table.
+
+## 5. Implications
+- **Amount sampler**: the corpus tail is *thinner* and less skewed than the
+  synthetic mixture — the engine over-generates extreme amounts. A learned flow
+  (v2) or a re-fit mixture narrows this.
+- **Source mix**: the engine emits ~4–24 sources vs the corpus's thousands;
+  source-mix breadth is a generation gap (priors bundle partially addresses it).
+- **Lines per JE**: synthetic JEs are ~2.3× too large (mean 10.3 vs 4.5); the
+  lines-per-JE prior needs down-weighting toward the corpus mean.
+- **Eval**: fix the DR noise-floor degeneracy at corpus scale before re-baselining.
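
Note on FINDINGS.md §2 (annotation, not part of the commit): the fallback suggested there, using the raw distance when the corpus-vs-corpus noise floor underflows, could look roughly like the sketch below. The function name, the `eps` threshold, and the returned dict are illustrative assumptions; the canonical scorer is the Rust engine and its real interface is not shown in this series.

```python
import math

DR_CAP = 100.0  # saturation cap quoted in FINDINGS.md §2


def degradation_ratio(distance: float, noise_floor: float,
                      eps: float = 1e-9, cap: float = DR_CAP) -> dict:
    """Return a capped ratio, or the raw distance when the floor underflows.

    `eps` is an assumed underflow threshold, not a value taken from the scorer.
    """
    if not math.isfinite(noise_floor) or noise_floor < eps:
        # Dividing by ~0 would saturate at the cap and carry no information.
        return {"value": distance, "kind": "raw_distance",
                "is_degenerate_baseline": True}
    return {"value": min(distance / noise_floor, cap), "kind": "ratio",
            "is_degenerate_baseline": False}


healthy = degradation_ratio(0.5, 0.25)      # ordinary ratio
saturated = degradation_ratio(50.0, 0.25)   # ratio clipped at the cap
degenerate = degradation_ratio(0.5, 0.0)    # floor underflow: raw distance
```

Reporting `kind` alongside the value keeps ratios and raw distances from being averaged into one composite by accident.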

From 33d40d2dfffb21ea8ac7c3ec125b959fb44f79a0 Mon Sep 17 00:00:00 2001
From: Michael Ivertowski <mivertowski@outlook.com>
Date: Thu, 21 May 2026 16:46:09 +0200
Subject: [PATCH 09/18] =?UTF-8?q?docs(ml):=20flow=20v2=20result=20?=
 =?UTF-8?q?=E2=80=94=20learned=20flow=20matches=20corpus=20amount=20densit?=
 =?UTF-8?q?y?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

v2 (standardized y): NLL 8.96->0.67; p99 $31,754 vs corpus $33,688 (~6%),
std/skew spot-on. The shipped 3-component mixture overshoots p99 ~16x. A
learned per-account-class flow recovers the amount distribution the mixture
misses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 experiments/ml/FINDINGS.md | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/experiments/ml/FINDINGS.md b/experiments/ml/FINDINGS.md
index 9d9f41df..d5e5ca5b 100644
--- a/experiments/ml/FINDINGS.md
+++ b/experiments/ml/FINDINGS.md
@@ -46,9 +46,16 @@ the flow fixes it (v2).
 |---|--:|--:|--:|--:|--:|
 | Corpus (held-out) | 3.91 | 2.45 | 0.54 | $32,950 | 0.0086 |
 | Flow v1 (un-standardized) | 2.81 | 1.45 | −0.39 | $142 | 0.0182 |
-| **Flow v2 (standardized y)** | _pending_ | _pending_ | _pending_ | _pending_ | _pending_ |
+| **Flow v2 (standardized y)** | **3.89** | **2.46** | **0.54** | **$31,754** | **0.0081** |
 | Synthetic (3-comp mixture) | 3.65 | 3.43 | 0.99 | $541,617 | 0.0057 |
 
+**v2 matches the corpus amount density almost exactly**: NLL 8.96 → 0.67; p99
+$31,754 vs corpus $33,688 (within ~6%; ~4% vs the held-out $32,950 above); std
+and skew within 0.01. The current 3-component mixture overshoots p99 by ~16×
+and is 1.4× over-dispersed. Headline result: a learned per-account-class flow
+recovers the corpus amount distribution the shipped mixture misses. Handoff:
+export spline knots → candle `AmountSampler`, or keep as a build-time density artifact.
+
 ### Sequence (event-stream temporal) — `sequence/`
 Decoder-only transformer over per-(client, source) event-token streams (Δt /
 line-count / account-class / weekday buckets), factorized heads. **Trains
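
Note on the v1 → v2 fix (annotation, not part of the commit): a minimal sketch of why standardizing `y` helps, assuming the transform is `sign(a) * log1p(|a|)` as FINDINGS.md describes and that the spline domain is roughly [-5, 5]. The helper names and the sample amounts are invented for illustration; the real `flow/` code is not shown in this series.

```python
import numpy as np


def signed_log1p(amount):
    """sign(a) * log1p(|a|): keeps the sign, compresses the magnitude."""
    a = np.asarray(amount, dtype=np.float64)
    return np.sign(a) * np.log1p(np.abs(a))


def inv_signed_log1p(y):
    """Exact inverse of signed_log1p."""
    y = np.asarray(y, dtype=np.float64)
    return np.sign(y) * np.expm1(np.abs(y))


# Invented amounts; the largest sits near the corpus p99 scale (~$33k).
amounts = np.array([-1200.0, 0.5, 12.0, 50.0, 340.0, 33000.0])
y = signed_log1p(amounts)

# Fit the standardization on training data only, then reuse it at sample time.
y_mean, y_std = y.mean(), y.std()
z = (y - y_mean) / y_std

# Sampling path: flow sample z -> y -> amount.
recovered = inv_signed_log1p(z * y_std + y_mean)

# Change of variables: log p_y(y) = log p_z(z) - log(y_std), so an NLL measured
# on z must add log(y_std) before it is compared with an NLL measured on y.
nll_offset = float(np.log(y_std))
```

Here `y` reaches about 10.4 while `z` stays well inside the assumed spline domain, which is the effect the v2 row relies on.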

From 8b5808c4d4d57709d2500b581a02440baf10faa6 Mon Sep 17 00:00:00 2001
From: Michael Ivertowski <mivertowski@outlook.com>
Date: Thu, 21 May 2026 16:51:00 +0200
Subject: [PATCH 10/18] feat(ml): sequence-lift (NLL vs marginal) + inverse
 make_base + 3-knob params
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- sequence/characterize.py: held-out per-token NLL of the transformer vs an iid
  per-field marginal baseline -> information gain from modelling history.
- inverse/make_base.py: small fast campaign config (fraud + distributions only).
- inverse/params.py: pivot to (fraud_rate, amount_mu, amount_sigma) — minimal
  config friction; amount via structured component override; ties inverse +
  surrogate to the flow finding (recover corpus amount mean/std).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 experiments/ml/inverse/make_base.py     | 52 ++++++++++++++
 experiments/ml/inverse/params.py        | 38 +++++++----
 experiments/ml/sequence/characterize.py | 90 +++++++++++++++++++++++++
 3 files changed, 167 insertions(+), 13 deletions(-)
 create mode 100644 experiments/ml/inverse/make_base.py
 create mode 100644 experiments/ml/sequence/characterize.py

diff --git a/experiments/ml/inverse/make_base.py b/experiments/ml/inverse/make_base.py
new file mode 100644
index 00000000..d2ecafa8
--- /dev/null
+++ b/experiments/ml/inverse/make_base.py
@@ -0,0 +1,52 @@
+"""Build a small, fast generate config for the inverse/surrogate forward
+campaign. Enables only what the tier-1 knobs touch (fraud + distributions) and
+shrinks to one period so each forward sim is quick. The campaign overrides
+fraud_rate + the amount mixture per draw (see params.to_config_overrides).
+
+    python -m inverse.make_base --out inverse_base.yaml
+"""
+from __future__ import annotations
+
+import argparse
+import shutil
+import subprocess
+from pathlib import Path
+
+import yaml
+
+
+def _cli() -> str:
+    for c in ("./target/release/datasynth-data", "../../target/release/datasynth-data",
+              "datasynth-data"):
+        if Path(c).exists() or shutil.which(c):
+            return c
+    raise SystemExit("datasynth-data not found — build with cargo build --release -p datasynth-cli")
+
+
+def main(argv: list[str] | None = None) -> None:
+    ap = argparse.ArgumentParser(description=__doc__)
+    ap.add_argument("--out", type=Path, default=Path("inverse_base.yaml"))
+    ap.add_argument("--industry", default="manufacturing")
+    a = ap.parse_args(argv)
+
+    tmp = Path("/tmp/_inv_init.yaml")
+    subprocess.run([_cli(), "init", "--industry", a.industry, "--complexity", "small",
+                    "-o", str(tmp)], check=True, capture_output=True)
+    c = yaml.safe_load(tmp.read_text())
+
+    if isinstance(c.get("fraud"), dict):
+        c["fraud"]["enabled"] = True
+    if isinstance(c.get("distributions"), dict):
+        c["distributions"]["enabled"] = True
+        amt = c["distributions"].setdefault("amounts", {})
+        amt["enabled"] = True
+        amt["distribution_type"] = "lognormal"
+        amt.setdefault("components", [{"weight": 1.0, "mu": 7.0, "sigma": 1.2, "label": "base"}])
+    c.setdefault("global", {})["period_months"] = 1
+
+    a.out.write_text(yaml.safe_dump(c))
+    print(f"wrote {a.out} (fraud + distributions enabled, 1 month)")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/experiments/ml/inverse/params.py b/experiments/ml/inverse/params.py
index 083161fb..56e468e8 100644
--- a/experiments/ml/inverse/params.py
+++ b/experiments/ml/inverse/params.py
@@ -19,18 +19,18 @@ class Param:
     log: bool = False  # sample/scale in log space (for rates spanning decades)
 
 
-# Tier-1 set. Names map to GeneratorConfig keys (verified scalar paths in
-# datasynth-config/src/schema.rs — `deny_unknown_fields` is in force, so every
-# key here must be a real settable field). The amount log-normal width was
-# dropped: `distributions.amounts` is a mixture `components` list, not a scalar
-# `sigma`; re-adding it needs a structured (replace-components) override —
-# tracked as the 6th-knob follow-up.
+# Tier-1 set — three high-identifiability knobs that need only `fraud` +
+# `distributions` enabled (minimal config friction under deny_unknown_fields):
+#   - fraud.fraud_rate    → fraud-bias footprint (weekend / round-dollar / …)
+#   - amount_mu / amount_sigma → the log-normal amount component (location +
+#     width). Set via a STRUCTURED override (replace distributions.amounts.
+#     components) since `amounts` is a mixture list, not scalars — handled in
+#     `to_config_overrides`. Recovering (mu, sigma) ties the inverse + surrogate
+#     to the flow finding (corpus log-amount mean ≈ 3.9, std ≈ 2.45).
 PARAMS: list[Param] = [
     Param("fraud.fraud_rate", 0.0, 0.10),
-    Param("fraud.document_fraud_rate", 0.0, 0.10),
-    Param("temporal_patterns.processing_lags.invoice_receipt_lag.mu", 0.0, 3.0),
-    Param("temporal_patterns.processing_lags.invoice_receipt_lag.sigma", 0.2, 1.5),
-    Param("vendor_network.dependencies.top_5_concentration", 0.20, 0.70),
+    Param("amount_mu", 3.0, 10.0),
+    Param("amount_sigma", 0.5, 2.6),
 ]
 
 
@@ -46,9 +46,21 @@ def sample_prior(rng: np.random.Generator, n: int) -> np.ndarray:
     return np.stack(cols, axis=1)
 
 
-def to_config_overrides(theta: np.ndarray) -> dict[str, float]:
-    """One θ vector -> {config_key: value} overrides for a generate run."""
-    return {p.name: float(v) for p, v in zip(PARAMS, theta)}
+def to_config_overrides(theta: np.ndarray) -> dict[str, object]:
+    """One θ vector -> {config_key: value} overrides for a generate run.
+
+    fraud_rate is a plain scalar; amount_mu/amount_sigma are folded into a
+    single-component log-normal mixture that REPLACES distributions.amounts.
+    components (a structured override — the mixture is a list, not scalars).
+    """
+    vals = {p.name: float(v) for p, v in zip(PARAMS, theta)}
+    return {
+        "fraud.fraud_rate": vals["fraud.fraud_rate"],
+        "distributions.amounts.distribution_type": "lognormal",
+        "distributions.amounts.components": [
+            {"weight": 1.0, "mu": vals["amount_mu"], "sigma": vals["amount_sigma"], "label": "sbi"}
+        ],
+    }
 
 
 def normalize(theta: np.ndarray) -> np.ndarray:
diff --git a/experiments/ml/sequence/characterize.py b/experiments/ml/sequence/characterize.py
new file mode 100644
index 00000000..2f17c90d
--- /dev/null
+++ b/experiments/ml/sequence/characterize.py
@@ -0,0 +1,90 @@
+"""Sequence-track lift: does the autoregressive transformer capture temporal
+structure the marginal (iid) sampler misses?
+
+The shipped generator draws Δt and line-count *independently per event*. This
+measures the information gain of conditioning on history: held-out per-token
+NLL of the trained transformer vs an iid per-field marginal baseline (the
+field's own entropy). NLL_marginal − NLL_transformer > 0 ⇒ the model captures
+joint / autocorrelated structure the marginal sampler cannot — per field and
+pooled.
+
+    python -m sequence.characterize --data data/sequence --weights weights/sequence
+"""
+from __future__ import annotations
+
+import argparse
+import json
+from pathlib import Path
+
+import numpy as np
+import torch
+import torch.nn.functional as F
+
+from .model import EventStreamTransformer, FieldVocab
+
+FIELDS = ["dt", "lines", "account_class", "weekday", "hour_band"]
+
+
+@torch.no_grad()
+def main(argv: list[str] | None = None) -> None:
+    ap = argparse.ArgumentParser(description=__doc__)
+    ap.add_argument("--data", type=Path, required=True)
+    ap.add_argument("--weights", type=Path, required=True)
+    ap.add_argument("--val-frac", type=float, default=0.2)
+    ap.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu")
+    args = ap.parse_args(argv)
+    dev = torch.device(args.device)
+
+    blob = torch.load(args.data / "streams.pt")
+    sizes = json.loads((args.data / "vocab.json").read_text())["sizes"]
+    vocab = FieldVocab(**sizes)
+    ckpt = torch.load(args.weights / "stream_tf.pt", map_location=dev)
+    model = EventStreamTransformer(vocab).to(dev)
+    model.load_state_dict(ckpt["model"])
+    model.train(False)
+
+    n = blob["dt"].shape[0]
+    nval = max(1, int(n * args.val_frac))
+    tr = slice(nval, None)
+    va = slice(0, nval)
+
+    # ── Transformer per-token NLL on held-out (teacher-forced) ──────────────
+    tok_va = {f: blob[f][va].to(dev) for f in FIELDS}
+    logits = model(tok_va, blob["source_id"][va].to(dev))
+    tf_nll = {}
+    for f in FIELDS:
+        pred = logits[f][:, :-1].reshape(-1, logits[f].size(-1))
+        tgt = tok_va[f][:, 1:].reshape(-1)
+        keep = tgt > 0  # ignore pad
+        tf_nll[f] = float(F.cross_entropy(pred[keep], tgt[keep]).item())
+
+    # ── iid marginal baseline: per-field train-split entropy (proxy for held-out CE) ──
+    marg_nll = {}
+    for f in FIELDS:
+        toks = blob[f][tr].reshape(-1).numpy()
+        toks = toks[toks > 0]
+        if toks.size == 0:
+            marg_nll[f] = 0.0
+            continue
+        counts = np.bincount(toks, minlength=sizes[f]).astype(np.float64)
+        p = counts / counts.sum()
+        nz = p > 0
+        marg_nll[f] = float(-(p[nz] * np.log(p[nz])).sum())  # nats
+
+    print(f"{'field':<16}{'transformer':>14}{'marginal(iid)':>16}{'lift(nats)':>14}")
+    print("-" * 60)
+    tf_tot = marg_tot = 0.0
+    for f in FIELDS:
+        lift = marg_nll[f] - tf_nll[f]
+        tf_tot += tf_nll[f]
+        marg_tot += marg_nll[f]
+        print(f"{f:<16}{tf_nll[f]:>14.4f}{marg_nll[f]:>16.4f}{lift:>14.4f}")
+    print("-" * 60)
+    print(f"{'TOTAL/token':<16}{tf_tot:>14.4f}{marg_tot:>16.4f}{marg_tot - tf_tot:>14.4f}")
+    print(f"\nInterpretation: positive lift = the AR model predicts events better "
+          f"than drawing each field iid from its marginal — i.e. it captures the "
+          f"joint/temporal structure the per-event marginal sampler discards.")
+
+
+if __name__ == "__main__":
+    main()

From 54e4bf9944c8d3223a1b273a6ba273bea687ea1d Mon Sep 17 00:00:00 2001
From: Michael Ivertowski <mivertowski@outlook.com>
Date: Thu, 21 May 2026 16:52:39 +0200
Subject: [PATCH 11/18] fix(ml/inverse): log_normal enum (not lognormal) +
 record sequence +3.37 nats lift

Sequence track: AR transformer beats iid marginal by +3.37 nats/token on
held-out (account_class/weekday/lines structure captured; Dt near-memoryless).
Fix the amount distribution_type enum value for the inverse campaign override.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 experiments/ml/FINDINGS.md          | 8 +++++++-
 experiments/ml/inverse/make_base.py | 2 +-
 experiments/ml/inverse/params.py    | 2 +-
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/experiments/ml/FINDINGS.md b/experiments/ml/FINDINGS.md
index d5e5ca5b..fe36b461 100644
--- a/experiments/ml/FINDINGS.md
+++ b/experiments/ml/FINDINGS.md
@@ -63,7 +63,13 @@ cleanly** (loss 1.99 → 1.93 / 25 epochs over 2,500 streams) — the corpus eve
 structure *is* learnable. Finding: corpus dt-bucket **lag-1 autocorr = −0.118**
 (only 11.6% of streams positively autocorrelated), so the corpus is **not**
 strongly *sequentially* bursty at this granularity — the §1 "60×" gap is
-inter-event-time **variance**, a distinct axis from autocorrelation. Data-quality
+inter-event-time **variance**, a distinct axis from autocorrelation. **Held-out
+lift over an iid per-field marginal sampler: +3.37 nats/token** (account_class
++1.55, weekday +1.36, lines +0.58; Δt ≈ flat at −0.12 — Δt really is near
+memoryless). So the autoregressive model captures the joint
+source→account-class→line-count→weekday structure the current per-event
+marginal sampler discards — the concrete case for an AR event scheduler.
+Data-quality
 note: the corpus COA `Account Class` has mojibake encoding variants (`Vorr??te`)
 inflating the class count to 397 — a cleaning target.
 
diff --git a/experiments/ml/inverse/make_base.py b/experiments/ml/inverse/make_base.py
index d2ecafa8..f308e41e 100644
--- a/experiments/ml/inverse/make_base.py
+++ b/experiments/ml/inverse/make_base.py
@@ -40,7 +40,7 @@ def main(argv: list[str] | None = None) -> None:
         c["distributions"]["enabled"] = True
         amt = c["distributions"].setdefault("amounts", {})
         amt["enabled"] = True
-        amt["distribution_type"] = "lognormal"
+        amt["distribution_type"] = "log_normal"
         amt.setdefault("components", [{"weight": 1.0, "mu": 7.0, "sigma": 1.2, "label": "base"}])
     c.setdefault("global", {})["period_months"] = 1
 
diff --git a/experiments/ml/inverse/params.py b/experiments/ml/inverse/params.py
index 56e468e8..0a6158d7 100644
--- a/experiments/ml/inverse/params.py
+++ b/experiments/ml/inverse/params.py
@@ -56,7 +56,7 @@ def to_config_overrides(theta: np.ndarray) -> dict[str, object]:
     vals = {p.name: float(v) for p, v in zip(PARAMS, theta)}
     return {
         "fraud.fraud_rate": vals["fraud.fraud_rate"],
-        "distributions.amounts.distribution_type": "lognormal",
+        "distributions.amounts.distribution_type": "log_normal",
         "distributions.amounts.components": [
             {"weight": 1.0, "mu": vals["amount_mu"], "sigma": vals["amount_sigma"], "label": "sbi"}
         ],
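
Note on `to_config_overrides` (annotation, not part of the commit): the function returns dotted keys, and the code that applies them (presumably in `inverse/simulate.py`) is not shown in this series. One plausible way to fold them into the nested YAML dict is sketched below; `apply_overrides` is a hypothetical helper, and the base config is invented for the example.

```python
from copy import deepcopy


def apply_overrides(config: dict, overrides: dict) -> dict:
    """Hypothetical helper: set dotted keys on a nested dict.

    Leaves are replaced wholesale, so a list such as
    `distributions.amounts.components` is swapped, not merged.
    """
    out = deepcopy(config)
    for dotted, value in overrides.items():
        *parents, leaf = dotted.split(".")
        node = out
        for key in parents:
            # Creating missing parents is convenient here, but under
            # deny_unknown_fields a typo would yield an invalid config.
            node = node.setdefault(key, {})
        node[leaf] = value
    return out


base = {
    "fraud": {"enabled": True, "fraud_rate": 0.01},
    "distributions": {"amounts": {"distribution_type": "log_normal", "components": [
        {"weight": 0.7, "mu": 6.0, "sigma": 1.5, "label": "a"},
        {"weight": 0.3, "mu": 9.0, "sigma": 1.0, "label": "b"},
    ]}},
}
overrides = {
    "fraud.fraud_rate": 0.05,
    "distributions.amounts.distribution_type": "log_normal",
    "distributions.amounts.components": [
        {"weight": 1.0, "mu": 3.9, "sigma": 2.4, "label": "sbi"},
    ],
}
cfg = apply_overrides(base, overrides)
```

Replacing the component list outright is what makes `amount_mu` and `amount_sigma` the only amount parameters in play for each draw.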

From 851f1dcd2d83564a3abcb3318a00cf273f86394c Mon Sep 17 00:00:00 2001
From: Michael Ivertowski <mivertowski@outlook.com>
Date: Thu, 21 May 2026 16:57:48 +0200
Subject: [PATCH 12/18] feat(ml/surrogate): grounded surrogate + CMA-ES (match
 corpus via campaign)

Reuses the inverse forward campaign's (theta, summary-stat) pairs: objective =
distance(summary_stats(theta), corpus), MLP surrogate, CMA-ES to the
corpus-matching theta*. Runnable + grounded (vs the scaffold optimize.py whose
load_history is a TODO + targets the corpus-scale-degenerate DR). theta* should
recover amount_mu ~ corpus log-amount mean, cross-checking the flow finding.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 experiments/ml/surrogate/match_corpus.py | 117 +++++++++++++++++++++++
 1 file changed, 117 insertions(+)
 create mode 100644 experiments/ml/surrogate/match_corpus.py

diff --git a/experiments/ml/surrogate/match_corpus.py b/experiments/ml/surrogate/match_corpus.py
new file mode 100644
index 00000000..60a7cbd2
--- /dev/null
+++ b/experiments/ml/surrogate/match_corpus.py
@@ -0,0 +1,117 @@
+"""Grounded surrogate + CMA-ES: find the generator params that best match the
+corpus, using the inverse forward campaign as the surrogate's training data.
+
+The scaffold `optimize.py` targets the BF composite over SP-internal knobs, but
+`load_history` is a TODO, those knobs aren't `generate --config`-settable, and
+the DR eval degenerates at corpus scale (FINDINGS.md §2). This is the runnable,
+grounded variant: reuse the inverse campaign's `(θ, summary-stat)` pairs, define
+the objective as `distance(summary_stats(θ), corpus)`, fit an MLP surrogate,
+and CMA-ES to the corpus-matching `θ*`. `θ*` is the config the corpus "most
+likely came from" — cross-checking the flow finding (corpus log-amount mean
+≈ 3.9). Demonstrates the tuning-loop accelerator end-to-end on real data.
+
+    python -m surrogate.match_corpus --campaign data/inverse \\
+        --corpus /home/ubuntu/corpus_health.parquet
+"""
+from __future__ import annotations
+
+import argparse
+import json
+from pathlib import Path
+
+import numpy as np
+import pandas as pd
+import torch
+import torch.nn as nn
+
+from inverse import params as P
+from inverse.simulate import FEATURE_NAMES, summary_stats
+
+# Features comparable corpus↔synthetic (exclude doc-flow / behavioural-only
+# observables the corpus columns don't carry: post-close, manual, posting lag).
+CMP = [
+    "log_amt_mean", "log_amt_std", "log_amt_skew", "benford_mad", "round_frac",
+    "weekend_frac", "monthend_frac", "lpje_mean", "lpje_std", "lpje_frac2",
+    "src_entropy", "iet_mean", "iet_std",
+]
+
+
+def corpus_features(corpus_parquet: Path, tmp_csv: str) -> np.ndarray:
+    """Map corpus columns → canonical, then reuse the campaign summary_stats."""
+    df = pd.read_parquet(corpus_parquet)
+    out = pd.DataFrame()
+    out["debit_amount"] = pd.to_numeric(df["Functional Amount"], errors="coerce")
+    out["credit_amount"] = 0.0
+    out["posting_date"] = df["Entry Date"]
+    out["document_date"] = df["Entry Date"]
+    out["source"] = df["Source"]
+    out["document_id"] = df["JE Number"]
+    out["gl_account"] = df["GL Account Number"]
+    out.to_csv(tmp_csv, index=False)
+    return summary_stats(Path(tmp_csv))
+
+
+def main(argv: list[str] | None = None) -> None:
+    ap = argparse.ArgumentParser(description=__doc__)
+    ap.add_argument("--campaign", type=Path, required=True)
+    ap.add_argument("--corpus", type=Path, required=True)
+    ap.add_argument("--out", type=Path, default=Path("weights/surrogate"))
+    a = ap.parse_args(argv)
+    a.out.mkdir(parents=True, exist_ok=True)
+
+    blob = np.load(a.campaign / "pairs.npz")
+    theta, x = blob["theta"], blob["x"]
+    cx = corpus_features(a.corpus, "/tmp/_corp_canon.csv")
+    idx = [FEATURE_NAMES.index(c) for c in CMP]
+
+    # Standardize comparable features by campaign std → scale-free distance.
+    xs = x[:, idx]
+    mu, sd = xs.mean(0), xs.std(0) + 1e-6
+    cxn, xn = (cx[idx] - mu) / sd, (xs - mu) / sd
+    dist = np.linalg.norm(xn - cxn, axis=1).astype("float32")  # objective per sim
+
+    tn = P.normalize(theta).astype("float32")
+    nval = max(1, len(tn) // 5)
+    Xtr, Xva = torch.tensor(tn[nval:]), torch.tensor(tn[:nval])
+    ytr, yva = torch.tensor(dist[nval:]), torch.tensor(dist[:nval])
+
+    net = nn.Sequential(nn.Linear(P.dim(), 64), nn.SiLU(), nn.Linear(64, 64), nn.SiLU(), nn.Linear(64, 1))
+    opt = torch.optim.Adam(net.parameters(), 1e-3)
+    for _ in range(1500):
+        opt.zero_grad()
+        loss = ((net(Xtr).squeeze(-1) - ytr) ** 2).mean()
+        loss.backward()
+        opt.step()
+    net.train(False)  # inference mode
+    with torch.no_grad():
+        pv = net(Xva).squeeze(-1).numpy()
+    from scipy.stats import spearmanr
+    rho = float(spearmanr(pv, yva.numpy()).statistic)
+    print(f"surrogate Spearman (held-out, predicted vs true distance) = {rho:.3f}")
+
+    import cma
+    es = cma.CMAEvolutionStrategy(np.full(P.dim(), 0.5), 0.2,  # pycma: seed 0/None = time-based
+                                  {"bounds": [0, 1], "verbose": -9, "seed": 1})
+    for _ in range(80):
+        sols = es.ask()
+        with torch.no_grad():
+            vals = net(torch.tensor(np.array(sols), dtype=torch.float32)).squeeze(-1).numpy()
+        es.tell(sols, list(vals))
+    theta_star = P.denormalize(np.clip(es.result.xbest, 0, 1))
+    names = [p.name for p in P.PARAMS]
+    print("corpus-matching θ* (surrogate argmin):")
+    for n, v in zip(names, theta_star):
+        print(f"  {n:14s} = {v:.3f}")
+    print(f"(corpus log_amt_mean={cx[idx[0]]:.2f} std={cx[idx[1]]:.2f}; "
+          f"amount_mu≈mean/0.63 sanity → ~{cx[idx[0]] / 0.63:.1f})")
+
+    (a.out / "match_corpus.json").write_text(json.dumps({
+        "surrogate_spearman": rho,
+        "theta_star": dict(zip(names, theta_star.tolist())),
+        "corpus_features": dict(zip(CMP, [float(v) for v in cx[idx]])),
+    }, indent=2))
+    print(f"saved {a.out / 'match_corpus.json'}")
+
+
+if __name__ == "__main__":
+    main()

From 1b8fbf5feaf7eaac2edc48dd53095bc5922493ec Mon Sep 17 00:00:00 2001
From: Michael Ivertowski <mivertowski@outlook.com>
Date: Thu, 21 May 2026 17:25:25 +0200
Subject: [PATCH 13/18] =?UTF-8?q?docs(ml):=20inverse=20SBI=20result=20?=
 =?UTF-8?q?=E2=80=94=20amount=5Fmu=20+=20fraud=5Frate=20recoverable,=20sig?=
 =?UTF-8?q?ma=20not?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Amortized posterior over 3 knobs, 1000-sim campaign (0 fail), SBC + 90% coverage
on held-out synthetic: amount_mu cov 0.92 (MAE .049), fraud_rate cov 0.88 (.078)
— calibrated; amount_sigma cov 0.77 — poorly identified (other variance swamps
the component sigma). 'Run the engine backward' validated on synthetic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 experiments/ml/FINDINGS.md | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/experiments/ml/FINDINGS.md b/experiments/ml/FINDINGS.md
index fe36b461..fb3a91b0 100644
--- a/experiments/ml/FINDINGS.md
+++ b/experiments/ml/FINDINGS.md
@@ -73,6 +73,30 @@ Data-quality
 note: the corpus COA `Account Class` has mojibake encoding variants (`Vorr??te`)
 inflating the class count to 397 — a cleaning target.
 
+### Inverse SBI — run the engine backward — `inverse/`
+Amortized neural posterior `q(θ | x)` (zuko NSF) over 3 tier-1 knobs
+(`fraud_rate`, amount `mu`, amount `sigma`), trained on **1,000 forward-simulated
+`(θ, GL-summary)` pairs (0 failures)**, validated on held-out synthetic with
+simulation-based calibration + 90% credible-interval coverage:
+
+| knob | MAE (norm) | 90% coverage | verdict |
+|---|--:|--:|---|
+| **amount_mu** | 0.049 | **0.92** | strongly identifiable |
+| **fraud.fraud_rate** | 0.078 | **0.88** | identifiable, calibrated |
+| amount_sigma | 0.209 | 0.77 | poorly identified (intervals under-cover) |
+
+A GL's amount **location** and **fraud rate** are recoverable with calibrated
+uncertainty; amount **width** is not (other variance sources swamp the single
+component's σ). This is the audit-analytics direction — *"the GL most likely
+came from these process parameters"* — validated on synthetic before any
+real-GL use. Identifiability is gated by forward-model fidelity (the §1 gap), so
+the flow/sequence work directly improves how much an inverse can recover.
+
+### Surrogate / tuning loop — `surrogate/`
+Grounded CMA-ES over the same campaign: MLP surrogate `θ → distance-to-corpus`
+(13 comparable observables), optimized to the corpus-matching `θ*`. _Result
+pending the run; θ* should recover amount_mu ≈ corpus, cross-checking the flow._
+
 ## 4. GNN fraud showcase (public synthetic data) — `scripts/ml/`
 Separate, publishable result (see `scripts/ml/RESULTS_v5.27.md`): binary fraud
 GraphSAGE test AUC 0.909 (≈ a LogReg on edge features — graph adds little);
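
Note on the 90% coverage column (annotation, not part of the commit): a minimal sketch of how central credible-interval coverage can be computed from posterior draws. It uses a toy conjugate Gaussian model where the exact posterior is known, not the zuko NSF posterior from `inverse/`; the function name and array layout are assumptions.

```python
import numpy as np


def interval_coverage(samples, theta_true, level=0.90):
    """Fraction of cases whose true value lies in the central `level` interval.

    samples:    (n_cases, n_draws, n_params) posterior draws
    theta_true: (n_cases, n_params) ground-truth parameters
    """
    tail = (1.0 - level) / 2.0
    lo = np.quantile(samples, tail, axis=1)
    hi = np.quantile(samples, 1.0 - tail, axis=1)
    return ((theta_true >= lo) & (theta_true <= hi)).mean(axis=0)


rng = np.random.default_rng(0)
n_cases, n_draws = 2000, 500

# Toy model: theta ~ N(0, 1), x = theta + N(0, 1)  =>  theta | x ~ N(x/2, 1/2).
theta = rng.normal(size=(n_cases, 1))
x = theta + rng.normal(size=(n_cases, 1))
post_sd = np.sqrt(0.5)
noise = rng.normal(size=(n_cases, n_draws, 1))

exact = x[:, None, :] / 2 + post_sd * noise             # correct posterior
too_narrow = x[:, None, :] / 2 + 0.5 * post_sd * noise  # overconfident posterior

cov_exact = interval_coverage(exact, theta)        # expected near 0.90
cov_narrow = interval_coverage(too_narrow, theta)  # expected well below 0.90
```

Coverage below the nominal level, as reported for `amount_sigma`, indicates intervals that are too narrow for the actual error.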

From 4b933042eba10f3d41b7de00baf571302edf4607 Mon Sep 17 00:00:00 2001
From: Michael Ivertowski <mivertowski@outlook.com>
Date: Thu, 21 May 2026 17:33:19 +0200
Subject: [PATCH 14/18] fix(ml/surrogate): drop heavy-tailed features + clip +
 cache corpus_x

First run failed (Spearman -0.08, theta* at bounds): corpus lpje_std=123 (JEs
with thousands of lines) dominated the L2 distance. Drop lpje_std + iet_* from
the comparable set, clip standardized features to +/-4, and add --corpus-cache
to reuse corpus_features (skip the 53M-row pass on rerun).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 experiments/ml/surrogate/match_corpus.py | 37 ++++++++++++++++--------
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/experiments/ml/surrogate/match_corpus.py b/experiments/ml/surrogate/match_corpus.py
index 60a7cbd2..5f9d3f8c 100644
--- a/experiments/ml/surrogate/match_corpus.py
+++ b/experiments/ml/surrogate/match_corpus.py
@@ -27,12 +27,14 @@
 from inverse import params as P
 from inverse.simulate import FEATURE_NAMES, summary_stats
 
-# Features comparable corpus↔synthetic (exclude doc-flow / behavioural-only
-# observables the corpus columns don't carry: post-close, manual, posting lag).
+# Features comparable corpus↔synthetic. Excludes doc-flow / behavioural-only
+# observables the corpus columns don't carry (post-close, manual, lag) AND
+# heavy-tailed/unstable ones that dominate an L2 distance: lpje_std (corpus has
+# JEs with thousands of lines → std≈123) and the iet_* terms. Kept set is the
+# robust amount + structure signal.
 CMP = [
     "log_amt_mean", "log_amt_std", "log_amt_skew", "benford_mad", "round_frac",
-    "weekend_frac", "monthend_frac", "lpje_mean", "lpje_std", "lpje_frac2",
-    "src_entropy", "iet_mean", "iet_std",
+    "weekend_frac", "monthend_frac", "lpje_mean", "lpje_frac2", "src_entropy",
 ]
 
 
@@ -54,20 +56,31 @@ def corpus_features(corpus_parquet: Path, tmp_csv: str) -> np.ndarray:
 def main(argv: list[str] | None = None) -> None:
     ap = argparse.ArgumentParser(description=__doc__)
     ap.add_argument("--campaign", type=Path, required=True)
-    ap.add_argument("--corpus", type=Path, required=True)
+    ap.add_argument("--corpus", type=Path, default=None)
+    ap.add_argument("--corpus-cache", type=Path, default=None,
+                    help="reuse corpus_features from a prior match_corpus.json (skips the 53M-row pass)")
     ap.add_argument("--out", type=Path, default=Path("weights/surrogate"))
     a = ap.parse_args(argv)
     a.out.mkdir(parents=True, exist_ok=True)
 
     blob = np.load(a.campaign / "pairs.npz")
     theta, x = blob["theta"], blob["x"]
-    cx = corpus_features(a.corpus, "/tmp/_corp_canon.csv")
     idx = [FEATURE_NAMES.index(c) for c in CMP]
-
-    # Standardize comparable features by campaign std → scale-free distance.
+    if a.corpus_cache and a.corpus_cache.exists():
+        cache = json.loads(a.corpus_cache.read_text())["corpus_features"]
+        cx_cmp = np.array([cache[c] for c in CMP], dtype=float)
+        print(f"[surrogate] corpus features from cache {a.corpus_cache}")
+    elif a.corpus:
+        cx_cmp = corpus_features(a.corpus, "/tmp/_corp_canon.csv")[idx]
+    else:
+        raise SystemExit("need --corpus or --corpus-cache")
+
+    # Standardize comparable features by campaign std, clip to ±4 so a single
+    # heavy-tailed corpus feature can't dominate the L2 distance.
     xs = x[:, idx]
     mu, sd = xs.mean(0), xs.std(0) + 1e-6
-    cxn, xn = (cx[idx] - mu) / sd, (xs - mu) / sd
+    cxn = np.clip((cx_cmp - mu) / sd, -4, 4)
+    xn = np.clip((xs - mu) / sd, -4, 4)
     dist = np.linalg.norm(xn - cxn, axis=1).astype("float32")  # objective per sim
 
     tn = P.normalize(theta).astype("float32")
@@ -102,13 +115,13 @@ def main(argv: list[str] | None = None) -> None:
     print("corpus-matching θ* (surrogate argmin):")
     for n, v in zip(names, theta_star):
         print(f"  {n:14s} = {v:.3f}")
-    print(f"(corpus log_amt_mean={cx[idx[0]]:.2f} std={cx[idx[1]]:.2f}; "
-          f"amount_mu≈mean/0.63 sanity → ~{cx[idx[0]] / 0.63:.1f})")
+    print(f"(corpus log_amt_mean={cx_cmp[0]:.2f} std={cx_cmp[1]:.2f}; "
+          f"amount_mu≈mean/0.63 sanity → ~{cx_cmp[0] / 0.63:.1f})")
 
     (a.out / "match_corpus.json").write_text(json.dumps({
         "surrogate_spearman": rho,
         "theta_star": dict(zip(names, theta_star.tolist())),
-        "corpus_features": dict(zip(CMP, [float(v) for v in cx[idx]])),
+        "corpus_features": dict(zip(CMP, [float(v) for v in cx_cmp])),
     }, indent=2))
     print(f"saved {a.out / 'match_corpus.json'}")
 

From aec959a7406d604e3df8def6315032f130ce6040 Mon Sep 17 00:00:00 2001
From: Michael Ivertowski <mivertowski@outlook.com>
Date: Thu, 21 May 2026 17:34:54 +0200
Subject: [PATCH 15/18] docs(ml): surrogate result (honest) + close out the
 4-track study
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Surrogate machinery runs end-to-end on real campaign data; Spearman 0.46, theta*
mis-located (amount_mu at bound) — single-small-generate stats too noisy. The
calibrated inverse posterior is the principled route to corpus-param recovery.
Completes flow / sequence / inverse / surrogate in FINDINGS.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 experiments/ml/FINDINGS.md | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/experiments/ml/FINDINGS.md b/experiments/ml/FINDINGS.md
index fb3a91b0..6d60b505 100644
--- a/experiments/ml/FINDINGS.md
+++ b/experiments/ml/FINDINGS.md
@@ -93,9 +93,19 @@ real-GL use. Identifiability is gated by forward-model fidelity (the §1 gap), s
 the flow/sequence work directly improves how much an inverse can recover.
 
 ### Surrogate / tuning loop — `surrogate/`
-Grounded CMA-ES over the same campaign: MLP surrogate `θ → distance-to-corpus`
-(13 comparable observables), optimized to the corpus-matching `θ*`. _Result
-pending the run; θ* should recover amount_mu ≈ corpus, cross-checking the flow._
+Grounded CMA-ES: MLP surrogate `θ → distance-to-corpus` over 10 robust
+observables, fit on the campaign, searched by CMA-ES. **Machinery runs
+end-to-end on real data** (vs the scaffold `optimize.py`'s synthetic-seed
+placeholder). Honest result: held-out Spearman **0.46**, and CMA-ES landed
+`amount_mu` at its upper bound (10.0) rather than the corpus-implied ≈6.2 — the
+single-small-generate summary stats are too noisy for the surrogate to locate
+the optimum reliably. (A first attempt was worse, Spearman −0.08, until the
+corpus `lpje_std=123` heavy-tail outlier was dropped from the distance + the
+features clipped.) **Takeaway:** the accelerator needs a larger / lower-variance
+campaign; the calibrated **inverse posterior** above is the more principled
+route to "what params did the corpus come from" — `amount_mu` is strongly
+identified there (cov 0.92), so feeding the corpus summary into `q(θ|x)` is the
+recommended next step over the distance-surrogate.
 
 ## 4. GNN fraud showcase (public synthetic data) — `scripts/ml/`
 Separate, publishable result (see `scripts/ml/RESULTS_v5.27.md`): binary fraud

From 140cf2bb58294193f836a6ef94193a4572543132 Mon Sep 17 00:00:00 2001
From: Michael Ivertowski <mivertowski@outlook.com>
Date: Thu, 21 May 2026 17:45:27 +0200
Subject: [PATCH 16/18] =?UTF-8?q?feat(ml/inverse):=20apply.py=20=E2=80=94?=
 =?UTF-8?q?=20posterior=20over=20the=20params=20a=20GL=20came=20from=20(co?=
 =?UTF-8?q?rpus=20capstone)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Feeds a GL's summary stats into the SBC-calibrated q(theta|x) and reports a
median + 90% CI per knob. Emits only parameter posteriors (privacy contract).
The inverse-SBI capstone: point the calibrated posterior at the corpus.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 experiments/ml/inverse/apply.py | 74 +++++++++++++++++++++++++++++++++
 1 file changed, 74 insertions(+)
 create mode 100644 experiments/ml/inverse/apply.py

diff --git a/experiments/ml/inverse/apply.py b/experiments/ml/inverse/apply.py
new file mode 100644
index 00000000..cd3d4582
--- /dev/null
+++ b/experiments/ml/inverse/apply.py
@@ -0,0 +1,74 @@
+"""Apply the trained inverse posterior q(θ | x) to a real GL → a posterior over
+the generator parameters that GL most likely came from. The audit-analytics
+capstone: point the SBC-calibrated posterior at the corpus.
+
+Emits ONLY parameter posteriors (median + 90% credible interval), never
+row-level corpus content — the privacy contract in inverse/SPEC.md.
+
+    python -m inverse.apply --weights weights/inverse \\
+        --gl-canonical /tmp/_corp_canon.csv --x-cache /tmp/corpus_x29.json --n 4000
+
+Caveat (SPEC § "Distribution shift = the BF gap"): the posterior is trained on
+synthetic; applied to a real GL it is biased by exactly the forward-fidelity
+gap §1 measures. Trust the well-identified knobs (amount_mu, fraud_rate);
+read the rest as gap-limited.
+"""
+from __future__ import annotations
+
+import argparse
+import json
+from pathlib import Path
+
+import numpy as np
+import torch
+
+from . import params as P
+from .model import PosteriorFlow
+from .simulate import summary_stats
+
+
+def main(argv: list[str] | None = None) -> None:
+    ap = argparse.ArgumentParser(description=__doc__)
+    ap.add_argument("--weights", type=Path, required=True)
+    ap.add_argument("--gl-canonical", type=Path, default=None,
+                    help="GL with canonical columns (debit/credit/posting_date/source/...)")
+    ap.add_argument("--x-cache", type=Path, default=None,
+                    help="cached 29-dim summary-stat vector json (skips the GL pass)")
+    ap.add_argument("--n", type=int, default=4000)
+    ap.add_argument("--out", type=Path, default=None)
+    a = ap.parse_args(argv)
+
+    if a.x_cache and a.x_cache.exists():
+        x = np.array(json.loads(a.x_cache.read_text())["x"], dtype="float32")
+        print(f"[apply] x from cache {a.x_cache}")
+    elif a.gl_canonical:
+        x = summary_stats(a.gl_canonical)
+        if a.x_cache:
+            a.x_cache.write_text(json.dumps({"x": [float(v) for v in x]}))
+            print(f"[apply] cached x → {a.x_cache}")
+    else:
+        raise SystemExit("need --gl-canonical or an existing --x-cache file")
+
+    ck = torch.load(a.weights / "posterior.pt", map_location="cpu")
+    m = PosteriorFlow(dim_theta=P.dim(), dim_x=ck["dim_x"])
+    m.load_state_dict(ck["model"])
+    m.eval()  # inference mode; equivalent to m.train(False)
+    with torch.no_grad():
+        s = m.sample(torch.tensor(x, dtype=torch.float32), a.n).cpu().numpy()  # (n, d) normalized
+    theta = P.denormalize(s)
+
+    names = [p.name for p in P.PARAMS]
+    print("posterior over the generator params the corpus most likely came from:")
+    res = {}
+    for j, nm in enumerate(names):
+        col = theta[:, j]
+        lo, med, hi = (float(v) for v in np.percentile(col, [5, 50, 95]))
+        res[nm] = {"median": med, "ci90": [lo, hi]}
+        print(f"  {nm:14s} median={med:.3f}  90% CI=[{lo:.3f}, {hi:.3f}]")
+    if a.out:
+        a.out.write_text(json.dumps(res, indent=2))
+        print(f"saved {a.out}")
+
+
+if __name__ == "__main__":
+    main()

From b067e2800b7b40e5f90dad2dcd74e641be143c83 Mon Sep 17 00:00:00 2001
From: Michael Ivertowski <mivertowski@outlook.com>
Date: Thu, 21 May 2026 17:47:51 +0200
Subject: [PATCH 17/18] =?UTF-8?q?docs(ml):=20inverse=20capstone=20?=
 =?UTF-8?q?=E2=80=94=20posterior=20is=20degenerate=20on=20the=20real=20cor?=
 =?UTF-8?q?pus=20(OOD)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Feeding the corpus into the SBC-calibrated q(theta|x) returns a boundary-pinned,
zero-width-CI posterior (confidently wrong) — the corpus is out-of-distribution
for the synthetic-trained inverse. 'Distribution shift = the BF gap' made
empirical: well-calibrated on synthetic (cov 0.92), untrustworthy on real until
the forward-fidelity gap (section 1) is closed. The strongest argument for the
flow/sequence fidelity work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 experiments/ml/FINDINGS.md | 15 +++++++++++++++
 1 file changed, 15 insertions(+)
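
Reviewer note (not part of the commit): the capstone paragraph below describes
a posterior that is both boundary-pinned and zero-width. The sketch below shows
one way that failure mode could be flagged automatically from the `res` dict
that `apply.py` writes (`{"median": ..., "ci90": [lo, hi]}` per knob). The
helper `flag_degenerate`, the tolerance, and the prior bounds in the example
are all assumptions for illustration; the actual bounds live in
`inverse/params.py` and are not shown in this series.

```python
def flag_degenerate(res, bounds, rel_tol=1e-3):
    """Return {knob: True} where the posterior is zero-width AND sits on a bound.

    res:    {name: {"median": float, "ci90": [lo, hi]}}  (apply.py output shape)
    bounds: {name: (lower, upper)}                       (assumed prior bounds)
    """
    flags = {}
    for name, post in res.items():
        lo, hi = post["ci90"]
        b_lo, b_hi = bounds[name]
        span = b_hi - b_lo
        # CI width negligible relative to the prior range.
        zero_width = (hi - lo) <= rel_tol * span
        # Median within tolerance of either prior bound.
        med = post["median"]
        at_bound = min(abs(med - b_lo), abs(med - b_hi)) <= rel_tol * span
        flags[name] = zero_width and at_bound
    return flags


# Illustrative values only; the bounds here are hypothetical.
bounds = {"amount_mu": (3.0, 10.0), "fraud_rate": (0.0, 0.1)}
pinned = {
    "amount_mu": {"median": 3.0, "ci90": [3.0, 3.0]},
    "fraud_rate": {"median": 0.1, "ci90": [0.1, 0.1]},
}
healthy = {
    "amount_mu": {"median": 6.2, "ci90": [5.8, 6.6]},
    "fraud_rate": {"median": 0.03, "ci90": [0.02, 0.05]},
}
pinned_flags = flag_degenerate(pinned, bounds)
healthy_flags = flag_degenerate(healthy, bounds)
```

A check like this would only detect the symptom. It says nothing about why the
input is out-of-distribution, which is the point the FINDINGS paragraph makes.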

diff --git a/experiments/ml/FINDINGS.md b/experiments/ml/FINDINGS.md
index 6d60b505..b1ea80ca 100644
--- a/experiments/ml/FINDINGS.md
+++ b/experiments/ml/FINDINGS.md
@@ -92,6 +92,21 @@ came from these process parameters"* — validated on synthetic before any
 real-GL use. Identifiability is gated by forward-model fidelity (the §1 gap), so
 the flow/sequence work directly improves how much an inverse can recover.
 
+**Capstone — posterior applied to the *real* corpus.** Feeding the corpus's
+summary into the SBC-calibrated `q(θ|x)` returns a **degenerate, boundary-pinned
+posterior** — `fraud_rate→0.100`, `amount_mu→3.0`, `amount_sigma→2.6`, all with
+**zero-width 90% CIs** — i.e. confidently *wrong* (corpus log-amount mean 3.92
+implies `amount_mu≈6.2`). The corpus `x` is **out-of-distribution** for the
+synthetic-trained posterior: the §1 gaps (source entropy, lines-per-JE tail, IET
+variance) put real GLs outside the manifold the forward model produces, so the
+flow extrapolates to the prior bounds and collapses its uncertainty. This is
+**"distribution shift = the BF gap" made empirical**: the inverse is
+well-calibrated on synthetic (cov 0.92) yet *untrustworthy on the real corpus
+until the forward-fidelity gap is closed*. It is the single strongest argument
+for the flow/sequence fidelity work — closing §1 is precisely what makes
+backward inference on real GLs valid. (Methodology lands; the headline number is
+the negative transfer, not a recovered θ.)
+
 ### Surrogate / tuning loop — `surrogate/`
 Grounded CMA-ES: MLP surrogate `θ → distance-to-corpus` over 10 robust
 observables, fit on the campaign, searched by CMA-ES. **Machinery runs

From 18985ae11a21f94746f42de7a577bba6b3478468 Mon Sep 17 00:00:00 2001
From: Michael Ivertowski <mivertowski@outlook.com>
Date: Thu, 21 May 2026 17:53:19 +0200
Subject: [PATCH 18/18] docs(ml): legal scrub of the ML experiment artifacts

Strip 'real' corpus/GL qualifiers (-> 'corpus' / 'out-of-sample GL'), drop the
client-count + industry hint ('21-client health corpus' -> 'the corpus'), and
remove a verbatim COA label token from FINDINGS + the scaffold SPEC/py docs, per
the corpus-vague-reference rule (no client names, no real-data hints, no paths,
no verbatim corpus content). 'real eval'/'REAL BF scorer' kept (actual-vs-
surrogate, not a data qualifier).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 experiments/ml/FINDINGS.md               | 22 +++++++++++-----------
 experiments/ml/README.md                 |  2 +-
 experiments/ml/flow/SPEC.md              |  2 +-
 experiments/ml/gnn/SPEC.md               |  2 +-
 experiments/ml/inverse/SPEC.md           | 13 ++++++-------
 experiments/ml/inverse/apply.py          |  4 ++--
 experiments/ml/inverse/simulate.py       |  2 +-
 experiments/ml/inverse/validate.py       |  2 +-
 experiments/ml/sequence/SPEC.md          |  2 +-
 experiments/ml/surrogate/match_corpus.py |  2 +-
 10 files changed, 26 insertions(+), 27 deletions(-)

diff --git a/experiments/ml/FINDINGS.md b/experiments/ml/FINDINGS.md
index b1ea80ca..bc2a80cb 100644
--- a/experiments/ml/FINDINGS.md
+++ b/experiments/ml/FINDINGS.md
@@ -1,8 +1,8 @@
 # Corpus → synthetic gap: what's missing, and what the learning tracks recover
 
-A100 study (2026-05-21), DataSynth v5.27. Goal: **learn from the real corpus
-what the synthetic generator is missing**, on the 21-client health corpus
-(53.4M JE lines, 11.8M JEs aggregated) vs the v5.27 engine. All learning is on
+A100 study (2026-05-21), DataSynth v5.27. Goal: **learn from the corpus what
+the synthetic generator is missing** — the aggregated corpus (53.4M JE lines,
+11.8M JEs) vs the v5.27 engine. All learning is on
 the corpus on the private box; weights stay on-box (memorization rule). Paper
 grounding + generator-optimization targets.
 
@@ -17,7 +17,7 @@ Raw observables — interpretable units, not normalized DRs:
 | Amount **p99** | $33k | $542k | synthetic tail **~16× too fat** |
 | log-amount std / skew | 2.46 / 0.56 | 3.43 / 0.99 | synthetic **over-dispersed, over-skewed** |
 | Lines per JE (mean) | 4.5 | 10.3 | synthetic JEs **~2.3× too large** |
-| Benford MAD | 0.0081 | 0.0057 | synthetic slightly *more* Benford-clean than reality |
+| Benford MAD | 0.0081 | 0.0057 | synthetic slightly *more* Benford-clean than the corpus |
 
 Top generator-optimization targets: **(a)** amount density (tail + spread),
 **(b)** IET-variance / lines-per-JE structure, **(c)** source-mix breadth.
@@ -70,7 +70,7 @@ memoryless). So the autoregressive model captures the joint
 source→account-class→line-count→weekday structure the current per-event
 marginal sampler discards — the concrete case for an AR event scheduler.
 Data-quality
-note: the corpus COA `Account Class` has mojibake encoding variants (`Vorr??te`)
+note: the corpus COA `Account Class` carries encoding-mangled label variants
 inflating the class count to 397 — a cleaning target.
 
 ### Inverse SBI — run the engine backward — `inverse/`
@@ -89,28 +89,28 @@ A GL's amount **location** and **fraud rate** are recoverable with calibrated
 uncertainty; amount **width** is not (other variance sources swamp the single
 component's σ). This is the audit-analytics direction — *"the GL most likely
 came from these process parameters"* — validated on synthetic before any
-real-GL use. Identifiability is gated by forward-model fidelity (the §1 gap), so
+out-of-sample-GL use. Identifiability is gated by forward-model fidelity (the §1 gap), so
 the flow/sequence work directly improves how much an inverse can recover.
 
-**Capstone — posterior applied to the *real* corpus.** Feeding the corpus's
+**Capstone — posterior applied to the corpus.** Feeding the corpus's
 summary into the SBC-calibrated `q(θ|x)` returns a **degenerate, boundary-pinned
 posterior** — `fraud_rate→0.100`, `amount_mu→3.0`, `amount_sigma→2.6`, all with
 **zero-width 90% CIs** — i.e. confidently *wrong* (corpus log-amount mean 3.92
 implies `amount_mu≈6.2`). The corpus `x` is **out-of-distribution** for the
 synthetic-trained posterior: the §1 gaps (source entropy, lines-per-JE tail, IET
-variance) put real GLs outside the manifold the forward model produces, so the
+variance) put out-of-sample GLs outside the manifold the forward model produces, so the
 flow extrapolates to the prior bounds and collapses its uncertainty. This is
 **"distribution shift = the BF gap" made empirical**: the inverse is
-well-calibrated on synthetic (cov 0.92) yet *untrustworthy on the real corpus
+well-calibrated on synthetic (cov 0.92) yet *untrustworthy on the corpus
 until the forward-fidelity gap is closed*. It is the single strongest argument
 for the flow/sequence fidelity work — closing §1 is precisely what makes
-backward inference on real GLs valid. (Methodology lands; the headline number is
+backward inference on out-of-sample GLs valid. (Methodology lands; the headline number is
 the negative transfer, not a recovered θ.)
 
 ### Surrogate / tuning loop — `surrogate/`
 Grounded CMA-ES: MLP surrogate `θ → distance-to-corpus` over 10 robust
 observables, fit on the campaign, searched by CMA-ES. **Machinery runs
-end-to-end on real data** (vs the scaffold `optimize.py`'s synthetic-seed
+end-to-end on campaign data** (vs the scaffold `optimize.py`'s synthetic-seed
 placeholder). Honest result: held-out Spearman **0.46**, and CMA-ES landed
 `amount_mu` at its upper bound (10.0) rather than the corpus-implied ≈6.2 — the
 single-small-generate summary stats are too noisy for the surrogate to locate
diff --git a/experiments/ml/README.md b/experiments/ml/README.md
index bf581847..3f3b4854 100644
--- a/experiments/ml/README.md
+++ b/experiments/ml/README.md
@@ -54,7 +54,7 @@ The training data is **corpus-derived**. Two hard rules:
    any run config carrying a corpus path are gitignored. Only code + specs are
    tracked. See [`.gitignore`](.gitignore).
 2. **Models can memorize.** A GNN trained on raw entity graphs can memorize
-   real counterparty relationships; a sequence model can memorize rare
+   genuine counterparty relationships; a sequence model can memorize rare
    account/text patterns. Before *any* trained weight leaves the private box,
    it must pass a memorization review (the GNN spec describes a k-anonymity /
    DP-SGD path). Treat weights as sensitive as the corpus until reviewed.
diff --git a/experiments/ml/flow/SPEC.md b/experiments/ml/flow/SPEC.md
index 0d2129a3..7e1eb0f2 100644
--- a/experiments/ml/flow/SPEC.md
+++ b/experiments/ml/flow/SPEC.md
@@ -9,7 +9,7 @@ staying invertible (exact log-density, exact sampling).
 
 ## Why a flow (vs log-normal mixture)
 
-The mixture has a fixed number of log-normal components; real amount
+The mixture has a fixed number of log-normal components; production amount
 distributions have sharp round-number atoms ($1k/$5k/$10k), regulatory
 thresholds, and fat tails that a 3-component mixture smooths over. A flow
 learns the density nonparametrically and still gives the analytic likelihood
diff --git a/experiments/ml/gnn/SPEC.md b/experiments/ml/gnn/SPEC.md
index 1a8e74b9..d0faf266 100644
--- a/experiments/ml/gnn/SPEC.md
+++ b/experiments/ml/gnn/SPEC.md
@@ -70,7 +70,7 @@ sampled scaffold. (Re-evaluate if we want online sampling later.)
 
 ## Privacy
 
-Highest memorization risk of the four tracks — the embedding can encode real
+Highest memorization risk of the four tracks — the embedding can encode genuine
 counterparty adjacency. Before sharing any weights or artifact off the private
 box:
 * node ids are opaque hashes (done at export);
diff --git a/experiments/ml/inverse/SPEC.md b/experiments/ml/inverse/SPEC.md
index c9d71403..f723bd76 100644
--- a/experiments/ml/inverse/SPEC.md
+++ b/experiments/ml/inverse/SPEC.md
@@ -31,7 +31,7 @@ posteriors + coverage; never false-precision point estimates.
    θ ~ prior ───────────────────────────────▶ GL ──summary stats──▶ x
         │                                                            │
         └──────────── train q_φ(θ | x)  (conditional flow) ◀─────────┘
-   inference:  real GL ──summary stats──▶ x*  ──▶  q_φ(θ | x*)  (one fwd pass)
+   inference:  out-of-sample GL ──summary stats──▶ x*  ──▶  q_φ(θ | x*)  (one fwd pass)
 ```
 
 1. **`simulate.py`** — draw θ from a prior over a *small, identifiable*
@@ -76,18 +76,17 @@ feature vector in `simulate.py` once the parameter set is fixed.
 ## Distribution shift = the BF gap
 
 The inverse is only as trustworthy as the forward model's fidelity to reality.
-An inverse trained on synthetic, applied to a real GL, is biased by exactly the
+An inverse trained on synthetic, applied to an out-of-sample GL, is biased by exactly the
 behavioral-fidelity gap the composite measures. So fidelity work directly gates
-inversion quality — and the inverse should only be pointed at real GL once the
+inversion quality — and the inverse should only be pointed at out-of-sample GL once the
 forward model's BF composite is acceptable for the targeted account/source mix.
 
 ## Privacy
 
-Training data is synthetic (no corpus). Applying the trained inverse to a real
-GL reads that GL but emits only parameter posteriors — no row-level corpus
-content. Same `DATASYNTH_CORPUS_DIR` discipline if real GL is used for
+Training data is synthetic (no corpus). Applying the trained inverse to an out-of-sample GL reads that GL but emits only parameter posteriors — no row-level corpus
+content. Same `DATASYNTH_CORPUS_DIR` discipline if out-of-sample GL is used for
 evaluation; results (posteriors) are not corpus content but treat any
-real-GL-derived artifact as sensitive until reviewed.
+out-of-sample-GL-derived artifact as sensitive until reviewed.
 
 ## Handoff
 
diff --git a/experiments/ml/inverse/apply.py b/experiments/ml/inverse/apply.py
index cd3d4582..fa0a535a 100644
--- a/experiments/ml/inverse/apply.py
+++ b/experiments/ml/inverse/apply.py
@@ -1,4 +1,4 @@
-"""Apply the trained inverse posterior q(θ | x) to a real GL → a posterior over
+"""Apply the trained inverse posterior q(θ | x) to an out-of-sample GL → a posterior over
 the generator parameters that GL most likely came from. The audit-analytics
 capstone: point the SBC-calibrated posterior at the corpus.
 
@@ -9,7 +9,7 @@
         --gl-canonical /tmp/_corp_canon.csv --x-cache /tmp/corpus_x29.json --n 4000
 
 Caveat (SPEC § "Distribution shift = the BF gap"): the posterior is trained on
-synthetic; applied to a real GL it is biased by exactly the forward-fidelity
+synthetic; applied to an out-of-sample GL it is biased by exactly the forward-fidelity
 gap §1 measures. Trust the well-identified knobs (amount_mu, fraud_rate);
 read the rest as gap-limited.
 """
diff --git a/experiments/ml/inverse/simulate.py b/experiments/ml/inverse/simulate.py
index a69347a1..d64b989f 100644
--- a/experiments/ml/inverse/simulate.py
+++ b/experiments/ml/inverse/simulate.py
@@ -69,7 +69,7 @@ def _entropy(shares: np.ndarray) -> float:
 
 def summary_stats(je_csv: Path) -> np.ndarray:
     """GL → fixed-length feature vector x (DIM_X,). Observable-only (no labels)
-    so the same map applies to a real GL at inference time."""
+    so the same map applies to an out-of-sample GL at inference time."""
     df = pd.read_csv(je_csv, low_memory=False)
     n = len(df)
     if n == 0:
diff --git a/experiments/ml/inverse/validate.py b/experiments/ml/inverse/validate.py
index 88410c00..41dd68eb 100644
--- a/experiments/ml/inverse/validate.py
+++ b/experiments/ml/inverse/validate.py
@@ -11,7 +11,7 @@
 
 This is the whole point of doing inversion against a forward simulator: we can
 measure how well 'running the engine backward' works BEFORE pointing it at any
-real GL.
+out-of-sample GL.
 """
 
 from __future__ import annotations
diff --git a/experiments/ml/sequence/SPEC.md b/experiments/ml/sequence/SPEC.md
index a14a5a6f..42f1dd4f 100644
--- a/experiments/ml/sequence/SPEC.md
+++ b/experiments/ml/sequence/SPEC.md
@@ -10,7 +10,7 @@ prior approximate marginally but miss in their *joint, autocorrelated* form
 
 ## Why autoregressive (vs marginal samplers)
 
-The current samplers draw IET and line-count independently per event. Real GL
+The current samplers draw IET and line-count independently per event. Out-of-sample GL
 streams are bursty and autocorrelated: a flurry of postings clusters, then
 quiets. A causal transformer conditions each event on the recent history, so
 burst structure and lag-1 autocorrelation emerge instead of being imposed.
diff --git a/experiments/ml/surrogate/match_corpus.py b/experiments/ml/surrogate/match_corpus.py
index 5f9d3f8c..566028bb 100644
--- a/experiments/ml/surrogate/match_corpus.py
+++ b/experiments/ml/surrogate/match_corpus.py
@@ -8,7 +8,7 @@
 the objective as `distance(summary_stats(θ), corpus)`, fit an MLP surrogate,
 and CMA-ES to the corpus-matching `θ*`. `θ*` is the config the corpus "most
 likely came from" — cross-checking the flow finding (corpus log-amount mean
-≈ 3.9). Demonstrates the tuning-loop accelerator end-to-end on real data.
+≈ 3.9). Demonstrates the tuning-loop accelerator end-to-end on out-of-sample data.
 
     python -m surrogate.match_corpus --campaign data/inverse \\
         --corpus /home/ubuntu/corpus_health.parquet