From ab4378cdc9a4885debe7b73e076f8fdd15a4215a Mon Sep 17 00:00:00 2001
From: Michael Ivertowski
Date: Wed, 20 May 2026 16:09:49 +0200
Subject: [PATCH 01/18] feat(ml): scaffold 4 neuro-symbolic realism experiment tracks
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Runnable PyTorch scaffolding + per-track specs under experiments/ml/ for
closing the behavioral-fidelity gap by learning the generator's *proposal
distributions* while leaving the symbolic constraint layer (debits=credits,
A=L+E, document chains, IC matching) untouched. The NN proposes structure;
the existing Rust engine enforces every invariant — coherence stays a hard
guarantee.

Tracks:
  - gnn/       GraphSAGE+GAE relational sampler → P3 ClusteringGap / TriangleLogRatio
  - sequence/  causal transformer over JE event streams → P1 IETD/Autocorr, P2, P4
  - flow/      conditional neural-spline flow for amounts → Benford / multimodal tails
  - surrogate/ MLP eval-surrogate + CMA-ES knob search → faster calibration loop
               (performance only, zero coherence risk — never touches generation)

common/ holds the shared corpus→tensor exporter and the BF-eval bridge
(canonical Rust scorer + a fast Python approximation for in-loop use).

Scaffold only — no model trained. Built to run on an A100 when free; this
dev box OOMs on orchestrator-scale work. Each train.py is runnable; model
bodies carry TODO markers where corpus-schema wiring lands after the first
data export.

Privacy: corpus path via DATASYNTH_CORPUS_DIR (never hard-coded); data/
weights/run-configs gitignored; per-track memorization review (k-anon /
DP-SGD path in the GNN spec) required before any weight leaves the private
box. No corpus content committed.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 experiments/ml/.gitignore             |  32 ++++++
 experiments/ml/README.md              |  88 +++++++++++++++
 experiments/ml/common/__init__.py     |   1 +
 experiments/ml/common/bf_bridge.py    | 139 ++++++++++++++++++++++++
 experiments/ml/common/data_export.py  | 147 ++++++++++++++++++++++++++
 experiments/ml/common/schema.py       |  65 ++++++++++++
 experiments/ml/flow/SPEC.md           |  57 ++++++++++
 experiments/ml/flow/__init__.py       |   1 +
 experiments/ml/flow/model.py          |  52 +++++++++
 experiments/ml/flow/train.py          |  62 +++++++++++
 experiments/ml/gnn/SPEC.md            |  80 ++++++++++++++
 experiments/ml/gnn/__init__.py        |   1 +
 experiments/ml/gnn/model.py           |  83 +++++++++++++++
 experiments/ml/gnn/sample.py          |  38 +++++++
 experiments/ml/gnn/train.py           |  82 ++++++++++++++
 experiments/ml/requirements.txt       |  33 ++++++
 experiments/ml/sequence/SPEC.md       |  62 +++++++++++
 experiments/ml/sequence/__init__.py   |   1 +
 experiments/ml/sequence/model.py      |  85 +++++++++++++++
 experiments/ml/sequence/train.py      |  61 +++++++++++
 experiments/ml/surrogate/SPEC.md      |  59 +++++++++++
 experiments/ml/surrogate/__init__.py  |   1 +
 experiments/ml/surrogate/knobs.py     |  55 ++++++++++
 experiments/ml/surrogate/optimize.py  | 103 ++++++++++++++++++
 experiments/ml/surrogate/surrogate.py |  71 +++++++++++++
 25 files changed, 1459 insertions(+)
 create mode 100644 experiments/ml/.gitignore
 create mode 100644 experiments/ml/README.md
 create mode 100644 experiments/ml/common/__init__.py
 create mode 100644 experiments/ml/common/bf_bridge.py
 create mode 100644 experiments/ml/common/data_export.py
 create mode 100644 experiments/ml/common/schema.py
 create mode 100644 experiments/ml/flow/SPEC.md
 create mode 100644 experiments/ml/flow/__init__.py
 create mode 100644 experiments/ml/flow/model.py
 create mode 100644 experiments/ml/flow/train.py
 create mode 100644 experiments/ml/gnn/SPEC.md
 create mode 100644 experiments/ml/gnn/__init__.py
 create mode 100644 experiments/ml/gnn/model.py
 create mode 100644 experiments/ml/gnn/sample.py
 create mode 100644
experiments/ml/gnn/train.py create mode 100644 experiments/ml/requirements.txt create mode 100644 experiments/ml/sequence/SPEC.md create mode 100644 experiments/ml/sequence/__init__.py create mode 100644 experiments/ml/sequence/model.py create mode 100644 experiments/ml/sequence/train.py create mode 100644 experiments/ml/surrogate/SPEC.md create mode 100644 experiments/ml/surrogate/__init__.py create mode 100644 experiments/ml/surrogate/knobs.py create mode 100644 experiments/ml/surrogate/optimize.py create mode 100644 experiments/ml/surrogate/surrogate.py diff --git a/experiments/ml/.gitignore b/experiments/ml/.gitignore new file mode 100644 index 00000000..2c91598b --- /dev/null +++ b/experiments/ml/.gitignore @@ -0,0 +1,32 @@ +# DataSynth ML experiments — keep ALL corpus-derived artifacts out of git. +# Only code (*.py) and specs (*.md, requirements.txt) are tracked. + +# Exported training tensors / parquet (corpus-derived) +data/ +*.parquet +*.npz +*.npy +*.pt +*.pth +*.ckpt +*.safetensors + +# Trained weights + run outputs (treat as sensitive as the corpus until +# memorization-reviewed — see README § Privacy) +weights/ +runs/ +checkpoints/ +lightning_logs/ +wandb/ +*.log + +# Any run config that carries a corpus path +*.local.yaml +*.local.json +config.local.* + +# Python env +.venv/ +__pycache__/ +*.pyc +.ipynb_checkpoints/ diff --git a/experiments/ml/README.md b/experiments/ml/README.md new file mode 100644 index 00000000..7c306d0b --- /dev/null +++ b/experiments/ml/README.md @@ -0,0 +1,88 @@ +# DataSynth ML experiments — neuro-symbolic realism + +Four experiment tracks that try to close the behavioral-fidelity (BF) gap by +learning the **proposal distributions** of the generator while leaving the +**symbolic constraint layer** (debits = credits, A = L + E, document chains, +IC matching) untouched. + +## The principle + +DataSynth is a probabilistic program with hard constraints. 
The realism gap +lives in *what we propose* (when a JE posts, how many lines, which accounts / +entities interact, what amounts), **not** in the constraints. So every model +here is subordinate to the symbolic engine: + +``` + corpus ──extract──▶ training tensors ──train(A100)──▶ learned proposal + │ + ▼ + NN emits *structure / latent shape* ──▶ symbolic decoder enforces every + (timing, line count, accounts, invariant (balance, A=L+E, chains) + entity edges, amount density) ──▶ coherent synthetic output +``` + +The NN never emits a final balance. It emits shape; the existing Rust +generator projects that shape onto the feasible manifold. Coherence stays a +hard guarantee by construction. + +## The four tracks + +| Dir | Track | Architecture | BF metrics targeted | +|-----|-------|--------------|---------------------| +| [`gnn/`](gnn/SPEC.md) | Relational / interconnectivity sampler | GraphSAGE encoder + edge/degree decoder (GAE-style) | P3 ClusteringGap, TriangleLogRatio (TP / vendor / IC graphs) | +| [`sequence/`](sequence/SPEC.md) | Temporal / behavioral stream | Autoregressive transformer over per-(source, entity) JE token streams | P1 IETD + Autocorr, P2 JELineBurst, P4 MeanGap | +| [`flow/`](flow/SPEC.md) | Amount marginals | Conditional normalizing flow per (source, account-class) | Benford / multimodal amount fidelity | +| [`surrogate/`](surrogate/SPEC.md) | Tuning-loop accelerator | MLP surrogate of the BF composite + CMA-ES over generator knobs | *Performance* of calibration (no coherence risk — never touches generation) | + +Start with **`gnn/`** (highest leverage on the structural gaps the hand-tuned +motif samplers can't close) or **`surrogate/`** (pure iteration-speed win, +zero coherence risk). See each `SPEC.md` for objective, data contract, +architecture, and success criteria. + +## Privacy / legal (read before running) + +The training data is **corpus-derived**. Two hard rules: + +1. 
**Nothing corpus-derived is committed.** `data/`, `weights/`, `runs/`, and
+   any run config carrying a corpus path are gitignored. Only code + specs are
+   tracked. See [`.gitignore`](.gitignore).
+2. **Models can memorize.** A GNN trained on raw entity graphs can memorize
+   real counterparty relationships; a sequence model can memorize rare
+   account/text patterns. Before *any* trained weight leaves the private box,
+   it must pass a memorization review (the GNN spec describes a k-anonymity /
+   DP-SGD path). Treat weights as sensitive as the corpus until reviewed.
+
+The corpus location is supplied via the `DATASYNTH_CORPUS_DIR` environment
+variable — never hard-coded, never logged. Matches the existing
+`scripts/regenerate-industry-priors.sh` convention.
+
+## Setup (on the A100 box, when free)
+
+```bash
+cd experiments/ml
+python -m venv .venv && source .venv/bin/activate
+pip install -r requirements.txt
+export DATASYNTH_CORPUS_DIR=/path/to/private/corpus  # never committed
+
+# 1. export training tensors from the corpus (CPU, ~minutes)
+python -m common.data_export --track gnn --out data/gnn
+
+# 2. train (A100)
+python -m gnn.train --data data/gnn --out weights/gnn
+
+# 3. score the lift against the BF eval baseline
+python -m common.bf_bridge --candidate weights/gnn/samples.parquet
+```
+
+## Handoff to the Rust generator
+
+PyTorch-first: prove the metric lift Python-side, then decide per-track
+whether to (a) port the learned sampler to `candle` for the shipped generator,
+or (b) keep a Python sidecar that emits structure artifacts the Rust generator
+consumes at build time. Recorded per track in its `SPEC.md` § Handoff.
+
+## Status
+
+Scaffold only — no model trained yet. Built while the A100 was occupied with
+another job. Each `train.py` is runnable but the model bodies carry `TODO`
+markers where corpus-schema-specific wiring lands after the first data export.
diff --git a/experiments/ml/common/__init__.py b/experiments/ml/common/__init__.py new file mode 100644 index 00000000..368e9a73 --- /dev/null +++ b/experiments/ml/common/__init__.py @@ -0,0 +1 @@ +"""Shared utilities for the DataSynth ML experiment tracks.""" diff --git a/experiments/ml/common/bf_bridge.py b/experiments/ml/common/bf_bridge.py new file mode 100644 index 00000000..12d84cc5 --- /dev/null +++ b/experiments/ml/common/bf_bridge.py @@ -0,0 +1,139 @@ +"""Bridge to the behavioral-fidelity (BF) eval. + +Two scoring paths: + +1. `score_canonical(candidate_dir)` — shells out to the Rust + `datasynth-data behavioral score`, the single source of truth for the + composite. Use for final lift numbers. + +2. `score_fast(corpus_df, candidate_df)` — a lightweight Python + re-implementation of the headline degradation ratios (IETD, JELineBurst, + MeanGap, clustering) for *in-loop* use where shelling out per iteration is + too slow (the surrogate track trains on these). It is an APPROXIMATION — + always confirm a candidate with `score_canonical` before believing a win. + +Degradation ratio (DR), per the eval: DR = d(synth, corpus) / noise_floor, +where the noise floor is the corpus-vs-corpus distance under resampling. +DR = 1.0 means "indistinguishable from a fresh corpus draw"; higher = worse. +The composite is the (volume-corrected) mean / median of per-metric DRs. +""" + +from __future__ import annotations + +import json +import shutil +import subprocess +from pathlib import Path + +import numpy as np + + +# -------------------------------------------------------------------------- +# 1. Canonical scorer — defer to the Rust eval. 
+# -------------------------------------------------------------------------- +def _find_cli() -> str: + for cand in ("./target/release/datasynth-data", "datasynth-data"): + if shutil.which(cand) or Path(cand).exists(): + return cand + raise FileNotFoundError( + "datasynth-data binary not found — build with " + "`cargo build --release -p datasynth-cli` or add it to PATH" + ) + + +def score_canonical(candidate_dir: Path, profile: str = "gl-source-tp") -> dict: + """Run the Rust BF eval on a generated candidate archive. + + Returns the parsed composite report. The exact subcommand/flags can drift; + confirm with `datasynth-data behavioral --help`. We capture JSON output. + """ + cli = _find_cli() + cmd = [ + cli, "behavioral", "score", + "--synthetic", str(candidate_dir), + "--profile", profile, + "--format", "json", + ] + print(f"[bf_bridge] $ {' '.join(cmd)}") + out = subprocess.run(cmd, capture_output=True, text=True) + if out.returncode != 0: + raise RuntimeError(f"behavioral score failed:\n{out.stderr}") + return json.loads(out.stdout) + + +# -------------------------------------------------------------------------- +# 2. Fast in-loop approximation (Python). APPROXIMATE — see module docstring. 
+# -------------------------------------------------------------------------- +def _hist_distance(a: np.ndarray, b: np.ndarray, bins: int = 64) -> float: + """Symmetric histogram (Jensen-Shannon) distance on a shared support.""" + lo = float(min(a.min(), b.min())) + hi = float(max(a.max(), b.max())) + if hi <= lo: + return 0.0 + edges = np.linspace(lo, hi, bins + 1) + pa, _ = np.histogram(a, edges, density=True) + pb, _ = np.histogram(b, edges, density=True) + pa = pa / (pa.sum() + 1e-12) + pb = pb / (pb.sum() + 1e-12) + m = 0.5 * (pa + pb) + + def _kl(p, q): + mask = p > 0 + return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + 1e-12)))) + + return 0.5 * _kl(pa, m) + 0.5 * _kl(pb, m) + + +def _iet_days(df) -> np.ndarray: + """Inter-event times (days) within (source, entity) streams.""" + g = df.sort_values("entry_date").groupby(["source", "trading_partner"]) + out = [] + for _, sub in g: + dates = sub["entry_date"].values.astype("datetime64[D]") + if len(dates) >= 2: + out.append(np.diff(dates).astype("timedelta64[D]").astype(float)) + return np.concatenate(out) if out else np.array([0.0]) + + +def _lines_per_je(df) -> np.ndarray: + return df.groupby("je_number").size().to_numpy(dtype=float) + + +def score_fast(corpus_df, candidate_df) -> dict: + """Approximate per-metric DRs from two pandas frames. + + Noise floor here is a crude constant per metric; the canonical eval + computes it by corpus resampling. Good enough to give CMA-ES / the + surrogate a smooth, correctly-ordered signal between full evals. + """ + # IETD (P1) and JELineBurst (P2) via JS distance on the relevant dists. + ietd = _hist_distance(_iet_days(corpus_df), _iet_days(candidate_df)) + burst = _hist_distance(_lines_per_je(corpus_df), _lines_per_je(candidate_df)) + # MeanGap (P4): |Δ mean inter-event time|, normalized. 
+ mean_gap = abs(_iet_days(corpus_df).mean() - _iet_days(candidate_df).mean()) + + # TODO(P3 clustering): port the TP co-occurrence triangle/clustering + # distance once the GNN export lands (shares the edge builder). + noise = {"IETD": 0.02, "JELineBurst": 0.05, "MeanGap": 1.0} + return { + "IETD": ietd / noise["IETD"], + "JELineBurst": burst / noise["JELineBurst"], + "MeanGap": mean_gap / noise["MeanGap"], + } + + +def composite(drs: dict) -> dict: + vals = np.array(list(drs.values()), dtype=float) + return {"mean": float(vals.mean()), "median": float(np.median(vals))} + + +if __name__ == "__main__": + import argparse + + ap = argparse.ArgumentParser(description=__doc__) + ap.add_argument("--candidate", type=Path, required=True, + help="generated candidate archive dir for the canonical eval") + ap.add_argument("--profile", default="gl-source-tp") + args = ap.parse_args() + report = score_canonical(args.candidate, args.profile) + print(json.dumps(report, indent=2)) diff --git a/experiments/ml/common/data_export.py b/experiments/ml/common/data_export.py new file mode 100644 index 00000000..df6a9645 --- /dev/null +++ b/experiments/ml/common/data_export.py @@ -0,0 +1,147 @@ +"""Export per-track training tensors from the corpus. + +Reads corpus parquet from `$DATASYNTH_CORPUS_DIR` (never hard-coded), applies +the `ColumnMap`, and writes track-specific artifacts under `--out`. All +outputs are gitignored. + +Usage: + export DATASYNTH_CORPUS_DIR=/path/to/private/corpus + python -m common.data_export --track gnn --out data/gnn + python -m common.data_export --track sequence --out data/sequence + python -m common.data_export --track flow --out data/flow + +Design notes +------------ +* CPU-only, streaming where possible — safe to run on a laptop; does NOT + invoke the orchestrator (which OOMs small boxes). +* Emits ONLY aggregated / structural tensors, never row-level corpus text. 
+ The GNN track in particular emits an *anonymized* edge index (integer node + ids), so committed-by-accident artifacts would still carry no names — but + they are gitignored regardless. +""" + +from __future__ import annotations + +import argparse +import os +import sys +from pathlib import Path + +from .schema import ColumnMap + + +def _corpus_dir() -> Path: + d = os.environ.get("DATASYNTH_CORPUS_DIR") + if not d: + sys.exit( + "ERROR: set DATASYNTH_CORPUS_DIR to the private corpus directory " + "(never hard-code it)." + ) + p = Path(d) + if not p.is_dir(): + sys.exit(f"ERROR: DATASYNTH_CORPUS_DIR={d} is not a directory") + return p + + +def _load_je_frame(corpus: Path, cols: ColumnMap): + """Load the JE-line table as a pandas DataFrame with canonical columns. + + The corpus ships one parquet per client; we concatenate. TODO after the + first run: confirm the per-client file glob + any client-id column to + keep entity namespaces disjoint across clients (see SP3.11 namespace + canonicalisation). + """ + import pandas as pd + import pyarrow.parquet as pq + + files = sorted(corpus.glob("JE_*.parquet")) + if not files: + sys.exit(f"ERROR: no JE_*.parquet under {corpus}") + frames = [] + rename = {getattr(cols, f): f for f in cols.__dataclass_fields__} # noqa: SLF001 + for fp in files: + tbl = pq.read_table(fp) + present = [c for c in tbl.column_names if c in rename] + df = tbl.select(present).to_pandas().rename(columns=rename) + df["__client__"] = fp.stem # keep namespaces disjoint + frames.append(df) + return pd.concat(frames, ignore_index=True) + + +# -------------------------------------------------------------------------- +# Track exporters. Each writes a small set of .pt / .parquet artifacts. +# -------------------------------------------------------------------------- +def export_gnn(corpus: Path, cols: ColumnMap, out: Path) -> None: + """Edge list + node features for the relational graph(s). 
+ + Builds three graphs the symbolic motif samplers approximate today: + * TP co-occurrence (trading partners sharing a JE / source) + * vendor / counterparty network (from gl_account ↔ trading_partner) + * IC bilateral edges (cross-client matched flows) + + Emits anonymized integer node ids + a node-degree / source-mix feature + matrix. See gnn/SPEC.md § Data. + """ + raise NotImplementedError( + "TODO(gnn): build edge_index + node features. Pseudocode in " + "gnn/SPEC.md § Data — node = (client, trading_partner); edge weight = " + "co-occurrence count; node feat = [degree, source-mix histogram, " + "active-window length]. Write out/edge_index.pt, out/node_feat.pt, " + "out/node_ids.parquet (anonymized)." + ) + + +def export_sequence(corpus: Path, cols: ColumnMap, out: Path) -> None: + """Per-(source, entity) ordered event streams → token tensors. + + Token = (Δt-bucket, line-count-bucket, account-class, weekday, hour-band). + See sequence/SPEC.md § Data. + """ + raise NotImplementedError( + "TODO(sequence): group by (client, source, trading_partner), sort by " + "entry_date, derive inter-event Δt + per-JE line count, bucketize, and " + "write out/streams.pt (padded) + out/vocab.json." + ) + + +def export_flow(corpus: Path, cols: ColumnMap, out: Path) -> None: + """Per-(source, account-class) amount samples + conditioning features. + + See flow/SPEC.md § Data. + """ + raise NotImplementedError( + "TODO(flow): collect log|amount| per (source, account_class), plus " + "conditioning one-hots; write out/amounts.parquet." 
+ ) + + +EXPORTERS = { + "gnn": export_gnn, + "sequence": export_sequence, + "flow": export_flow, +} + + +def main(argv: list[str] | None = None) -> None: + ap = argparse.ArgumentParser(description=__doc__) + ap.add_argument("--track", required=True, choices=sorted(EXPORTERS)) + ap.add_argument("--out", required=True, type=Path) + ap.add_argument( + "--column-map", + type=Path, + default=None, + help="local (gitignored) YAML mapping canonical->corpus column names", + ) + args = ap.parse_args(argv) + + cols = ColumnMap.from_yaml(str(args.column_map)) if args.column_map else ColumnMap() + corpus = _corpus_dir() + args.out.mkdir(parents=True, exist_ok=True) + + print(f"[data_export] track={args.track} corpus={corpus} -> {args.out}") + EXPORTERS[args.track](corpus, cols, args.out) + print("[data_export] done") + + +if __name__ == "__main__": + main() diff --git a/experiments/ml/common/schema.py b/experiments/ml/common/schema.py new file mode 100644 index 00000000..2288fe2f --- /dev/null +++ b/experiments/ml/common/schema.py @@ -0,0 +1,65 @@ +"""Canonical JE-line schema shared by every track. + +These are DataSynth's *own* internal model field names (see the `Record` +struct consumed by the behavioral-fidelity eval and CLAUDE.md), NOT +corpus-verbatim column names. The corpus parquet may use different column +names; map them in a local (gitignored) `config.local.yaml` rather than +editing this file, so no corpus-specific naming lands in git. +""" + +from __future__ import annotations + +from dataclasses import dataclass + + +@dataclass(frozen=True) +class ColumnMap: + """Maps canonical field -> corpus column name. + + Defaults are the canonical names; override per-corpus via + `ColumnMap.from_yaml("config.local.yaml")` (gitignored). 
+ """ + + source: str = "source" + gl_account: str = "gl_account" + cost_center: str = "cost_center" + profit_center: str = "profit_center" + trading_partner: str = "trading_partner" + je_number: str = "je_number" + je_line_number: str = "je_line_number" + effective_date: str = "effective_date" + entry_date: str = "entry_date" + created_at: str = "created_at" + amount: str = "functional_amount" + # ISO 21378 account-class label (joined from CoA), used as the + # coherence key by both the symbolic generator and these models. + account_class: str = "account_class" + + @classmethod + def from_yaml(cls, path: str) -> "ColumnMap": + import yaml # local import keeps the module import-light + + with open(path) as fh: + raw = yaml.safe_load(fh) or {} + known = {f for f in cls.__dataclass_fields__} # noqa: SLF001 + return cls(**{k: v for k, v in raw.items() if k in known}) + + def required(self) -> list[str]: + return [ + self.source, + self.gl_account, + self.je_number, + self.je_line_number, + self.entry_date, + self.amount, + ] + + +# Behavioral-fidelity metric families the experiments target. Kept here so +# every track + the surrogate agree on metric identifiers. +BF_METRICS: dict[str, list[str]] = { + "P1": ["IETD", "Autocorr"], + "P2": ["JELineBurst"], + "P3": ["ClusteringGap", "TriangleLogRatio"], + "P4": ["MeanGap"], +} diff --git a/experiments/ml/flow/SPEC.md b/experiments/ml/flow/SPEC.md new file mode 100644 index 00000000..0d2129a3 --- /dev/null +++ b/experiments/ml/flow/SPEC.md @@ -0,0 +1,57 @@ +# Track 3 — Conditional normalizing flow for amount marginals + +## Objective + +Replace the per-(source, account-class) log-normal *mixture* with a learned +**conditional normalizing flow** that captures the exact multimodal amount +density — heavy tails, round-number spikes, threshold clustering — while +staying invertible (exact log-density, exact sampling). 
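
Reviewer sketch (not part of the patch): the "invertible (exact log-density, exact sampling)" property in the Objective can be illustrated with a single hand-written affine transform standing in for the spline flow. This is an assumption-laden toy, not the zuko NSF this track uses; `AffineFlow1D`, `mu`, and `sigma` are hypothetical names chosen for illustration only.

```python
import math
import random


class AffineFlow1D:
    """Toy invertible map y = mu + sigma * z with base z ~ N(0, 1).

    Hypothetical stand-in for a real flow: it only shows how an invertible
    transform yields both exact sampling and an exact log-density.
    """

    def __init__(self, mu: float, sigma: float) -> None:
        if sigma <= 0:
            raise ValueError("sigma must be positive for the map to be invertible")
        self.mu = mu
        self.sigma = sigma

    def forward(self, z: float) -> float:
        # Base space -> data space.
        return self.mu + self.sigma * z

    def inverse(self, y: float) -> float:
        # Data space -> base space (closed form because the map is affine).
        return (y - self.mu) / self.sigma

    def log_prob(self, y: float) -> float:
        # Change of variables: log p(y) = log N(z; 0, 1) + log |dz/dy|,
        # where z = inverse(y) and dz/dy = 1 / sigma.
        z = self.inverse(y)
        log_base = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)
        return log_base - math.log(self.sigma)

    def sample(self, rng: random.Random) -> float:
        # Exact sampling: push a base draw through the forward map.
        return self.forward(rng.gauss(0.0, 1.0))


flow = AffineFlow1D(mu=7.0, sigma=1.5)
rng = random.Random(0)
samples = [flow.sample(rng) for _ in range(10_000)]
nll = -sum(flow.log_prob(y) for y in samples) / len(samples)
print(f"mean NLL over {len(samples)} samples: {nll:.3f}")
```

A spline flow replaces the affine map with a stack of learned, context-conditioned monotone splines, but the same two operations (forward for sampling, inverse plus log-Jacobian for density) are what training on negative log-likelihood relies on.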
+ +## Why a flow (vs log-normal mixture) + +The mixture has a fixed number of log-normal components; real amount +distributions have sharp round-number atoms ($1k/$5k/$10k), regulatory +thresholds, and fat tails that a 3-component mixture smooths over. A flow +learns the density nonparametrically and still gives the analytic likelihood +the eval / Benford checks want. + +## Data (`common.data_export --track flow`) + +Per JE line: `y = signed log1p(|amount|)` (sign kept as a separate Bernoulli +conditioned on account-class), conditioning `c = one_hot(source) ⊕ +one_hot(account_class) ⊕ [is_period_end, is_fraud]`. Artifact (gitignored): +`amounts.parquet` (y, c) — aggregated numeric, no text. + +## Architecture + +`zuko` neural spline flow (NSF), 4 transforms, conditioned on `c`: + +``` +base N(0,1) ──(c-conditioned spline coupling × 4)──▶ y +``` + +Round-number atoms are handled with a **dequantization + atom mixture**: a +small classifier picks "round atom k vs continuous"; the flow models the +continuous part. Keeps the spikes crisp instead of smearing them. + +## Sampling → handoff + +Sample `y | c`, invert the log1p + sign → amount. This is the cleanest port +target: the flow is small and the inverse is closed-form per transform, so +**porting to candle is feasible** — or export the spline knots as a lookup the +Rust `AmountSampler` interpolates. Decide after measuring the Benford / tail +lift. The symbolic balance step still rescales the final line set to enforce +debits = credits (the flow sets the *distributional shape*, balance sets the +*exact values*) — so coherence is untouched. + +## Success criteria + +* Benford MAD and amount-distribution-fit DR improve vs the mixture baseline; + round-number occurrence within the eval's tolerance band. +* Tail quantiles (p99, p99.9) match the corpus within the noise floor. +* Balance / Benford-compliance checks still pass after the symbolic rescale. + +## Privacy + +Low risk (aggregated numeric density). 
Guard against the flow memorizing rare +exact large amounts: clip the training tail at a high quantile and note it. diff --git a/experiments/ml/flow/__init__.py b/experiments/ml/flow/__init__.py new file mode 100644 index 00000000..8887be48 --- /dev/null +++ b/experiments/ml/flow/__init__.py @@ -0,0 +1 @@ +"""Track 3 — conditional normalizing flow for amount marginals (zuko NSF).""" diff --git a/experiments/ml/flow/model.py b/experiments/ml/flow/model.py new file mode 100644 index 00000000..ff636146 --- /dev/null +++ b/experiments/ml/flow/model.py @@ -0,0 +1,52 @@ +"""Conditional neural-spline flow for amounts (Track 3). + +Thin wrapper over zuko's NSF with the round-number atom mixture described in +SPEC.md. The continuous flow is fully runnable; the atom classifier carries a +TODO until the corpus round-number support is exported. +""" + +from __future__ import annotations + +import torch +import torch.nn as nn + +try: + import zuko +except ImportError as exc: # pragma: no cover + raise ImportError("zuko required: pip install -r ../requirements.txt") from exc + +# Canonical round-number atoms (currency-agnostic magnitudes the symbolic +# fraud-bias layer also uses). Continuous flow models everything else. +ROUND_ATOMS = [1_000.0, 5_000.0, 10_000.0, 25_000.0, 50_000.0, 100_000.0] + + +class ConditionalAmountFlow(nn.Module): + def __init__(self, cond_dim: int, transforms: int = 4, hidden=(128, 128)): + super().__init__() + # 1-D target (signed log1p amount), conditioned on c. + self.flow = zuko.flows.NSF( + features=1, context=cond_dim, transforms=transforms, hidden_features=hidden + ) + # P(round atom k | c) vs continuous; index 0 = "continuous". + self.atom_head = nn.Sequential( + nn.Linear(cond_dim, 64), nn.ReLU(), nn.Linear(64, len(ROUND_ATOMS) + 1) + ) + + def log_prob(self, y: torch.Tensor, c: torch.Tensor) -> torch.Tensor: + """Continuous-part log-density. Atom mixture handled in train/sample. 
+ + TODO(flow): combine with the atom classifier into a proper mixture + log-likelihood once round-atom membership labels are exported. + """ + return self.flow(c).log_prob(y) + + def sample(self, c: torch.Tensor) -> torch.Tensor: + atom_logits = self.atom_head(c) + atom = torch.distributions.Categorical(logits=atom_logits).sample() + cont = self.flow(c).sample() # (B, 1) in signed-log1p space + out = cont.squeeze(-1).clone() + for k, val in enumerate(ROUND_ATOMS, start=1): + mask = atom == k + # store atoms in the same signed-log1p space for a uniform inverse + out[mask] = torch.log1p(torch.as_tensor(val, device=out.device)) + return out diff --git a/experiments/ml/flow/train.py b/experiments/ml/flow/train.py new file mode 100644 index 00000000..b24d919d --- /dev/null +++ b/experiments/ml/flow/train.py @@ -0,0 +1,62 @@ +"""Train the conditional amount flow (Track 3). + + python -m flow.train --data data/flow --out weights/flow --epochs 50 +""" + +from __future__ import annotations + +import argparse +from pathlib import Path + +import torch +from torch.utils.data import DataLoader, TensorDataset + +from .model import ConditionalAmountFlow + + +def main(argv: list[str] | None = None) -> None: + ap = argparse.ArgumentParser(description=__doc__) + ap.add_argument("--data", type=Path, required=True) + ap.add_argument("--out", type=Path, required=True) + ap.add_argument("--epochs", type=int, default=50) + ap.add_argument("--batch-size", type=int, default=4096) + ap.add_argument("--lr", type=float, default=1e-3) + ap.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu") + args = ap.parse_args(argv) + args.out.mkdir(parents=True, exist_ok=True) + dev = torch.device(args.device) + + import pandas as pd + + df = pd.read_parquet(args.data / "amounts.parquet") + y = torch.tensor(df["y"].to_numpy(), dtype=torch.float32).unsqueeze(-1) + c = torch.tensor( + df.drop(columns=["y"]).to_numpy(), dtype=torch.float32 + ) # conditioning one-hots + ds = 
TensorDataset(y, c) + dl = DataLoader(ds, batch_size=args.batch_size, shuffle=True) + + model = ConditionalAmountFlow(cond_dim=c.size(1)).to(dev) + opt = torch.optim.Adam(model.parameters(), lr=args.lr) + + model.train() + for epoch in range(1, args.epochs + 1): + running = 0.0 + for yb, cb in dl: + yb, cb = yb.to(dev), cb.to(dev) + loss = -model.log_prob(yb, cb).mean() + opt.zero_grad() + loss.backward() + opt.step() + running += loss.item() + print(f"epoch {epoch:3d} nll={running/len(dl):.4f}") + + torch.save({"model": model.state_dict(), "cond_dim": c.size(1)}, + args.out / "amount_flow.pt") + print(f"[flow.train] saved {args.out/'amount_flow.pt'}") + print("TODO(flow): export spline knots for the candle AmountSampler port, " + "and validate Benford MAD via common.bf_bridge.") + + +if __name__ == "__main__": + main() diff --git a/experiments/ml/gnn/SPEC.md b/experiments/ml/gnn/SPEC.md new file mode 100644 index 00000000..1a8e74b9 --- /dev/null +++ b/experiments/ml/gnn/SPEC.md @@ -0,0 +1,80 @@ +# Track 1 — GNN relational sampler + +## Objective + +Learn the **interconnectivity structure** of the corpus's entity graphs +(trading-partner co-occurrence, vendor/counterparty network, IC bilateral +edges) and sample new graphs with the same motif statistics — closing the +**P3 ClusteringGap** and **TriangleLogRatio** gaps the hand-tuned +`CrossEntityMotifSampler` / TP motif sampler only partially close. + +The symbolic generator keeps ownership of *what posts on each edge* (JEs, +amounts, balance). This model only decides *which entities connect and how +densely* — the relational scaffold. + +## Why a GNN (vs the current motif sampler) + +The current samplers bias draws toward recent cluster-mates — a local +heuristic. They can't represent global structure (community sizes, degree +distribution tails, triangle density) jointly. 
A graph autoencoder learns a +latent node embedding whose inner-product geometry reproduces the corpus's +joint edge structure, so sampled graphs match clustering + triangle counts by +construction rather than by tuning. + +## Data (`common.data_export --track gnn`) + +Per client (namespaces kept disjoint — see SP3.11): + +* **Nodes** = entities `(client, trading_partner)`; anonymized integer ids. +* **Edges** = co-occurrence: two TPs sharing a JE or a (source, period) bucket; + weight = count. +* **Node features** `x`: `[log-degree, source-mix histogram (k sources), + active-window length (days), mean lines-per-JE]`. All aggregated — no names, + no row-level text. + +Artifacts (gitignored): `edge_index.pt`, `edge_weight.pt`, `node_feat.pt`, +`node_ids.parquet` (anonymized id ↔ opaque hash, stays private). + +## Architecture + +GAE-style: + +``` +x, edge_index ─▶ GraphSAGE(2 layers, hidden=128) ─▶ z (node embeddings, d=64) +sample edges: p(i~j) = σ(zᵢ · zⱼ) (inner-product decoder) +``` + +Loss: negative-sampling reconstruction (BCE on observed vs sampled non-edges) ++ a **degree-distribution KL** regularizer and a **triangle-count** penalty so +the embedding geometry matches the corpus's P3 statistics, not just edges. + +Sampling: draw a degree sequence from the fitted tail, then realize edges by +thresholding `σ(zᵢ·zⱼ)` with calibrated sparsity → new anonymized graph. + +## Success criteria + +* P3 ClusteringGap DR and TriangleLogRatio DR (Source + TP) **down ≥ 40%** vs + the v5.26 baseline, measured by `common.bf_bridge.score_canonical` on a full + generate run that consumes the sampled graph. +* No regression > 10% on P1/P2/P4 (relational change shouldn't perturb timing). +* Coherence unaffected — IC matching coverage + balance checks still pass + (they run downstream of edge selection). + +## Handoff to the Rust generator + +The model emits a **graph artifact** (anonymized edge list + per-node +source-mix), not weights the generator must run. 
The Rust generator gains a +"load relational scaffold" path that consumes this artifact at build time and +routes entity selection through it. → keep the NN Python-side; ship the +sampled scaffold. (Re-evaluate if we want online sampling later.) + +## Privacy + +Highest memorization risk of the four tracks — the embedding can encode real +counterparty adjacency. Before sharing any weights or artifact off the private +box: +* node ids are opaque hashes (done at export); +* apply **k-anonymity** on the degree/feature join (drop nodes with degree < + k) and/or **DP-SGD** (ε budget recorded in the run config); +* memorization probe: nearest-neighbour attack on embeddings must not recover + held-out edges above chance. Gate in `train.py --privacy-check`. diff --git a/experiments/ml/gnn/__init__.py b/experiments/ml/gnn/__init__.py new file mode 100644 index 00000000..e4687513 --- /dev/null +++ b/experiments/ml/gnn/__init__.py @@ -0,0 +1 @@ +"""Track 1 — GNN relational sampler (GAE over entity co-occurrence graphs).""" diff --git a/experiments/ml/gnn/model.py b/experiments/ml/gnn/model.py new file mode 100644 index 00000000..e4408d8a --- /dev/null +++ b/experiments/ml/gnn/model.py @@ -0,0 +1,83 @@ +"""Graph autoencoder for the relational sampler (Track 1). + +GraphSAGE encoder + inner-product decoder (Kipf & Welling GAE), with hooks for +the degree-KL and triangle penalties described in SPEC.md. Runnable as-is on +torch-geometric; the structural regularizers carry TODO markers where they +need the exported corpus statistics. 
+""" + +from __future__ import annotations + +import torch +import torch.nn as nn +import torch.nn.functional as F + +try: + from torch_geometric.nn import SAGEConv +except ImportError as exc: # pragma: no cover - import-time guard + raise ImportError( + "torch-geometric required: pip install -r ../requirements.txt" + ) from exc + + +class GraphSAGEEncoder(nn.Module): + def __init__(self, in_dim: int, hidden: int = 128, latent: int = 64): + super().__init__() + self.conv1 = SAGEConv(in_dim, hidden) + self.conv2 = SAGEConv(hidden, latent) + self.dropout = nn.Dropout(0.1) + + def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor: + h = F.relu(self.conv1(x, edge_index)) + h = self.dropout(h) + return self.conv2(h, edge_index) # node embeddings z + + +class InnerProductDecoder(nn.Module): + """p(edge i~j) = sigmoid(z_i . z_j).""" + + def forward(self, z: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor: + src, dst = edge_index + logits = (z[src] * z[dst]).sum(dim=-1) + return logits # caller applies BCEWithLogits / sigmoid + + +class GraphAutoencoder(nn.Module): + def __init__(self, in_dim: int, hidden: int = 128, latent: int = 64): + super().__init__() + self.encoder = GraphSAGEEncoder(in_dim, hidden, latent) + self.decoder = InnerProductDecoder() + + def encode(self, x, edge_index) -> torch.Tensor: + return self.encoder(x, edge_index) + + def recon_loss( + self, + z: torch.Tensor, + pos_edge_index: torch.Tensor, + neg_edge_index: torch.Tensor, + ) -> torch.Tensor: + pos = self.decoder(z, pos_edge_index) + neg = self.decoder(z, neg_edge_index) + logits = torch.cat([pos, neg]) + target = torch.cat([torch.ones_like(pos), torch.zeros_like(neg)]) + return F.binary_cross_entropy_with_logits(logits, target) + + # --- structural regularizers (SPEC.md § Architecture) ----------------- + @staticmethod + def degree_kl(z: torch.Tensor, target_degree_hist: torch.Tensor) -> torch.Tensor: + """KL between sampled expected-degree dist and the corpus 
target. + + TODO(gnn): expected degree of node i ≈ Σ_j σ(z_i·z_j); bucketize and + KL against `target_degree_hist` exported from the corpus. + """ + raise NotImplementedError("degree_kl: see SPEC.md § Architecture") + + @staticmethod + def triangle_penalty(z: torch.Tensor) -> torch.Tensor: + """Penalize deviation of expected triangle count from corpus. + + TODO(gnn): E[triangles] from the soft adjacency σ(ZZ^T); compare to the + corpus TriangleLogRatio target. Keep it batched/sparse for the A100. + """ + raise NotImplementedError("triangle_penalty: see SPEC.md § Architecture") diff --git a/experiments/ml/gnn/sample.py b/experiments/ml/gnn/sample.py new file mode 100644 index 00000000..7dbbfea9 --- /dev/null +++ b/experiments/ml/gnn/sample.py @@ -0,0 +1,38 @@ +"""Sample a new relational scaffold from the trained GAE (Track 1). + + python -m gnn.sample --weights weights/gnn/gae.pt --out weights/gnn/scaffold.parquet + +Emits the artifact the Rust generator consumes: an anonymized edge list + +per-node source-mix. No corpus content — only the learned structure. 
+""" + +from __future__ import annotations + +import argparse +from pathlib import Path + +import torch + + +def main(argv: list[str] | None = None) -> None: + ap = argparse.ArgumentParser(description=__doc__) + ap.add_argument("--weights", type=Path, required=True) + ap.add_argument("--data", type=Path, required=True, + help="dir with node_feat.pt / edge_index.pt used at train time") + ap.add_argument("--out", type=Path, required=True) + ap.add_argument("--target-sparsity", type=float, default=None, + help="calibrated edge density; default = corpus density") + ap.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu") + args = ap.parse_args(argv) + + raise NotImplementedError( + "TODO(gnn): load GAE, encode nodes, draw a degree sequence from the " + "fitted tail, realize edges by thresholding sigmoid(z_i·z_j) to hit " + "--target-sparsity, write out/scaffold.parquet (anonymized edge list + " + "per-node source-mix). Then validate clustering/triangle stats with " + "common.bf_bridge before handing to the generator." + ) + + +if __name__ == "__main__": + main() diff --git a/experiments/ml/gnn/train.py b/experiments/ml/gnn/train.py new file mode 100644 index 00000000..06703586 --- /dev/null +++ b/experiments/ml/gnn/train.py @@ -0,0 +1,82 @@ +"""Train the relational GAE (Track 1). + + python -m gnn.train --data data/gnn --out weights/gnn --epochs 200 + +Runnable skeleton: loads the exported tensors, trains reconstruction, and +checkpoints. The degree-KL / triangle regularizers and the --privacy-check +gate are wired but raise until the corpus targets are exported (see SPEC.md). 
+""" + +from __future__ import annotations + +import argparse +from pathlib import Path + +import torch + +from .model import GraphAutoencoder + + +def negative_sample(num_nodes: int, num_neg: int, device) -> torch.Tensor: + return torch.randint(0, num_nodes, (2, num_neg), device=device) + + +def main(argv: list[str] | None = None) -> None: + ap = argparse.ArgumentParser(description=__doc__) + ap.add_argument("--data", type=Path, required=True) + ap.add_argument("--out", type=Path, required=True) + ap.add_argument("--epochs", type=int, default=200) + ap.add_argument("--lr", type=float, default=1e-3) + ap.add_argument("--latent", type=int, default=64) + ap.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu") + ap.add_argument("--lambda-degree", type=float, default=0.1) + ap.add_argument("--lambda-triangle", type=float, default=0.1) + ap.add_argument("--privacy-check", action="store_true", + help="run the embedding nearest-neighbour memorization probe") + ap.add_argument("--dp-sgd", action="store_true", + help="enable DP-SGD (records epsilon in run config)") + args = ap.parse_args(argv) + + args.out.mkdir(parents=True, exist_ok=True) + dev = torch.device(args.device) + + edge_index = torch.load(args.data / "edge_index.pt").to(dev) + x = torch.load(args.data / "node_feat.pt").to(dev) + num_nodes = x.size(0) + + model = GraphAutoencoder(in_dim=x.size(1), latent=args.latent).to(dev) + opt = torch.optim.Adam(model.parameters(), lr=args.lr) + + if args.dp_sgd: + raise NotImplementedError( + "TODO(gnn): wrap opt with opacus PrivacyEngine; persist (eps, delta) " + "to out/run.json before any weight leaves the box." 
+ ) + + model.train() + for epoch in range(1, args.epochs + 1): + opt.zero_grad() + z = model.encode(x, edge_index) + neg = negative_sample(num_nodes, edge_index.size(1), dev) + loss = model.recon_loss(z, edge_index, neg) + # Structural regularizers (raise until corpus targets exported): + # loss += args.lambda_degree * model.degree_kl(z, target_hist) + # loss += args.lambda_triangle * model.triangle_penalty(z) + loss.backward() + opt.step() + if epoch % 20 == 0: + print(f"epoch {epoch:4d} recon_loss={loss.item():.4f}") + + torch.save({"model": model.state_dict(), "args": vars(args)}, + args.out / "gae.pt") + print(f"[gnn.train] saved {args.out/'gae.pt'}") + + if args.privacy_check: + raise NotImplementedError( + "TODO(gnn): nearest-neighbour membership probe — held-out edges must " + "not be recoverable from embeddings above chance (SPEC.md § Privacy)." + ) + + +if __name__ == "__main__": + main() diff --git a/experiments/ml/requirements.txt b/experiments/ml/requirements.txt new file mode 100644 index 00000000..d0088102 --- /dev/null +++ b/experiments/ml/requirements.txt @@ -0,0 +1,33 @@ +# DataSynth ML experiments — pin major versions; the A100 box installs the +# CUDA build of torch separately (see README). 
+# +# pip install torch --index-url https://download.pytorch.org/whl/cu124 +# then: +# pip install -r requirements.txt + +# Core +torch>=2.4 +numpy>=1.26 +pandas>=2.2 +pyarrow>=16 # read corpus parquet directly +polars>=1.0 # faster columnar pre-aggregation (optional path) + +# GNN track +torch-geometric>=2.6 + +# Flow track — zuko is a small, composable normalizing-flow lib on torch +zuko>=1.1 + +# Sequence track uses plain torch.nn.Transformer (no extra dep); +# tokenizer/util only: +einops>=0.8 + +# Surrogate / tuning track +cma>=3.4 # CMA-ES gradient-free optimizer +scikit-learn>=1.5 # quick baselines + train/val splits + +# Eval bridge + plotting +scipy>=1.13 +matplotlib>=3.9 +tqdm>=4.66 +pyyaml>=6.0 diff --git a/experiments/ml/sequence/SPEC.md b/experiments/ml/sequence/SPEC.md new file mode 100644 index 00000000..a14a5a6f --- /dev/null +++ b/experiments/ml/sequence/SPEC.md @@ -0,0 +1,62 @@ +# Track 2 — Autoregressive temporal stream model + +## Objective + +Model each entity's JE stream as a sequence of discrete event tokens and learn +the temporal dynamics — closing **P1 IETD + Autocorr**, **P2 JELineBurst**, and +**P4 MeanGap**. These are the gaps the per-source IET sampler + lines-per-JE +prior approximate marginally but miss in their *joint, autocorrelated* form +(the W2/W8 autocorr regressions in the project history). + +## Why autoregressive (vs marginal samplers) + +The current samplers draw IET and line-count independently per event. Real GL +streams are bursty and autocorrelated: a flurry of postings clusters, then +quiets. A causal transformer conditions each event on the recent history, so +burst structure and lag-1 autocorrelation emerge instead of being imposed. + +## Data (`common.data_export --track sequence`) + +Group by `(client, source, trading_partner)`, sort by `entry_date`. 
Per event, +emit a token: + +``` +token = (Δt-bucket, line-count-bucket, account-class, weekday, hour-band) +``` + +Δt bucketized log-spaced (0, 1, 2-3, 4-7, 8-14, 15-30, 31+ days); line-count +bucketized (1, 2, 3-4, 5-8, 9-16, 17+). Artifacts (gitignored): +`streams.pt` (padded id sequences), `vocab.json` (bucket edges — structural, +no corpus content). + +## Architecture + +Decoder-only transformer (`torch.nn.TransformerEncoder` with a causal mask), +~4 layers, d_model=256, 4 heads. Factorized head: predict each token field +with its own softmax (Δt, line-count, account-class, weekday, hour-band) so the +per-event joint, given the history `h`, is `p(Δt|h)·p(lines|h)·…` (fields +conditionally independent, as in `sequence/model.py`). Conditioning prefix = +`(source, entity-type)` embedding. + +Loss: sum of per-field cross-entropies. Teacher-forced. + +## Sampling → handoff + +Generate token streams per (source, entity); decode buckets back to concrete +Δt / line-count *ranges*. The Rust generator draws the concrete value +uniformly within the predicted bucket and — crucially — still routes amounts + +balance through the symbolic layer. So the model sets *timing + shape*, the +engine sets *values*. Keep Python-side; emit a per-entity event schedule +artifact, OR port the (small) transformer to candle for online use if the +schedule artifact proves too large. + +## Success criteria + +* P1 IETD DR and Autocorr DR (Source) **down ≥ 30%** vs v5.26; P2 JELineBurst + DR **down ≥ 20%**; P4 MeanGap DR **down ≥ 15%** — via `bf_bridge.score_canonical`. +* Balance / coherence unaffected (amount path unchanged). + +## Privacy + +Lower risk than the GNN (tokens are coarse buckets, no names/text). Still: +rare (source, account-class) combos can be near-unique — drop buckets with +support < k before training; note in run config. 
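A minimal sketch of the bucketing from the Data section of this spec, using only the standard library. The edge lists follow the ranges given there, with the last Δt bucket taken to start at 31 days so that no two ranges overlap. The names `DT_EDGES`, `LINE_EDGES`, `PAD_ID`, `bucket_id`, and `vocab_size` are hypothetical, and the `+ 1` offset is an assumption: it keeps id 0 free for padding, because `EventStreamTransformer.loss` ignores index 0 by default.

```python
from bisect import bisect_right

# Lower bound of every bucket after the first (assumed from the ranges in the spec).
DT_EDGES = [1, 2, 4, 8, 15, 31]   # days:  0 | 1 | 2-3 | 4-7 | 8-14 | 15-30 | 31+
LINE_EDGES = [2, 3, 5, 9, 17]     # lines: 1 | 2 | 3-4 | 5-8 | 9-16 | 17+
PAD_ID = 0                        # assumption: id 0 is reserved for padding


def bucket_id(value: int, edges: list[int]) -> int:
    """Map a non-negative integer to a 1-based bucket id (0 stays free for padding)."""
    return bisect_right(edges, value) + 1


def vocab_size(edges: list[int]) -> int:
    """Number of ids for one field, counting the padding id."""
    return len(edges) + 2
```

With this layout, `vocab_size(DT_EDGES)` and `vocab_size(LINE_EDGES)` would be the values to record as the `dt` and `lines` sizes in `vocab.json`, if the exporter adopts the same padding convention.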
diff --git a/experiments/ml/sequence/__init__.py b/experiments/ml/sequence/__init__.py new file mode 100644 index 00000000..50fcdacb --- /dev/null +++ b/experiments/ml/sequence/__init__.py @@ -0,0 +1 @@ +"""Track 2 — autoregressive temporal stream model (causal transformer).""" diff --git a/experiments/ml/sequence/model.py b/experiments/ml/sequence/model.py new file mode 100644 index 00000000..5160b1a6 --- /dev/null +++ b/experiments/ml/sequence/model.py @@ -0,0 +1,85 @@ +"""Decoder-only transformer over JE event-token streams (Track 2). + +Factorized multi-field head: each event token is the product of independent +softmaxes over (Δt-bucket, line-count-bucket, account-class, weekday, +hour-band). Causal mask makes it autoregressive. Runnable on plain torch. +""" + +from __future__ import annotations + +from dataclasses import dataclass + +import torch +import torch.nn as nn + + +@dataclass +class FieldVocab: + """Vocabulary sizes per token field (filled from vocab.json at load).""" + + dt: int + lines: int + account_class: int + weekday: int = 7 + hour_band: int = 6 + + +class EventStreamTransformer(nn.Module): + def __init__(self, vocab: FieldVocab, d_model: int = 256, n_layers: int = 4, + n_heads: int = 4, max_len: int = 512, n_sources: int = 64): + super().__init__() + self.vocab = vocab + # One embedding per field; summed into the token representation. + self.emb_dt = nn.Embedding(vocab.dt, d_model) + self.emb_lines = nn.Embedding(vocab.lines, d_model) + self.emb_class = nn.Embedding(vocab.account_class, d_model) + self.emb_weekday = nn.Embedding(vocab.weekday, d_model) + self.emb_hour = nn.Embedding(vocab.hour_band, d_model) + self.emb_source = nn.Embedding(n_sources, d_model) # conditioning prefix + self.pos = nn.Parameter(torch.zeros(1, max_len, d_model)) + + layer = nn.TransformerEncoderLayer( + d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True + ) + self.backbone = nn.TransformerEncoder(layer, n_layers) + + # Factorized output heads. 
+ self.head_dt = nn.Linear(d_model, vocab.dt) + self.head_lines = nn.Linear(d_model, vocab.lines) + self.head_class = nn.Linear(d_model, vocab.account_class) + self.head_weekday = nn.Linear(d_model, vocab.weekday) + self.head_hour = nn.Linear(d_model, vocab.hour_band) + + def forward(self, tokens: dict[str, torch.Tensor], source_id: torch.Tensor): + # tokens[field]: (B, T) long. source_id: (B,) long. + h = ( + self.emb_dt(tokens["dt"]) + + self.emb_lines(tokens["lines"]) + + self.emb_class(tokens["account_class"]) + + self.emb_weekday(tokens["weekday"]) + + self.emb_hour(tokens["hour_band"]) + ) + b, t, _ = h.shape + h = h + self.pos[:, :t] + h = h + self.emb_source(source_id).unsqueeze(1) # broadcast prefix + mask = nn.Transformer.generate_square_subsequent_mask(t, device=h.device) + h = self.backbone(h, mask=mask, is_causal=True) + return { + "dt": self.head_dt(h), + "lines": self.head_lines(h), + "account_class": self.head_class(h), + "weekday": self.head_weekday(h), + "hour_band": self.head_hour(h), + } + + @staticmethod + def loss(logits: dict[str, torch.Tensor], target: dict[str, torch.Tensor], + pad_idx: int = 0) -> torch.Tensor: + ce = nn.functional.cross_entropy + total = 0.0 + for field, lg in logits.items(): + # shift: predict token t from < t + pred = lg[:, :-1].reshape(-1, lg.size(-1)) + tgt = target[field][:, 1:].reshape(-1) + total = total + ce(pred, tgt, ignore_index=pad_idx) + return total diff --git a/experiments/ml/sequence/train.py b/experiments/ml/sequence/train.py new file mode 100644 index 00000000..e07437dc --- /dev/null +++ b/experiments/ml/sequence/train.py @@ -0,0 +1,61 @@ +"""Train the event-stream transformer (Track 2). 
+ + python -m sequence.train --data data/sequence --out weights/sequence --epochs 30 +""" + +from __future__ import annotations + +import argparse +import json +from pathlib import Path + +import torch +from torch.utils.data import DataLoader, TensorDataset + +from .model import EventStreamTransformer, FieldVocab + + +def main(argv: list[str] | None = None) -> None: + ap = argparse.ArgumentParser(description=__doc__) + ap.add_argument("--data", type=Path, required=True) + ap.add_argument("--out", type=Path, required=True) + ap.add_argument("--epochs", type=int, default=30) + ap.add_argument("--batch-size", type=int, default=64) + ap.add_argument("--lr", type=float, default=3e-4) + ap.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu") + args = ap.parse_args(argv) + args.out.mkdir(parents=True, exist_ok=True) + dev = torch.device(args.device) + + vocab = FieldVocab(**json.loads((args.data / "vocab.json").read_text())["sizes"]) + model = EventStreamTransformer(vocab).to(dev) + opt = torch.optim.AdamW(model.parameters(), lr=args.lr) + + # streams.pt: dict of (N, T) field tensors + (N,) source_id. TODO(seq): + # confirm packing in data_export; this loader assumes that layout. 
+ blob = torch.load(args.data / "streams.pt") + fields = ["dt", "lines", "account_class", "weekday", "hour_band"] + ds = TensorDataset(*[blob[f] for f in fields], blob["source_id"]) + dl = DataLoader(ds, batch_size=args.batch_size, shuffle=True) + + model.train() + for epoch in range(1, args.epochs + 1): + running = 0.0 + for batch in dl: + *field_tensors, source_id = [t.to(dev) for t in batch] + tokens = dict(zip(fields, field_tensors)) + logits = model(tokens, source_id) + loss = model.loss(logits, tokens) + opt.zero_grad() + loss.backward() + opt.step() + running += loss.item() + print(f"epoch {epoch:3d} loss={running/len(dl):.4f}") + + torch.save({"model": model.state_dict(), "vocab": vars(vocab)}, + args.out / "stream_tf.pt") + print(f"[sequence.train] saved {args.out/'stream_tf.pt'}") + + +if __name__ == "__main__": + main() diff --git a/experiments/ml/surrogate/SPEC.md b/experiments/ml/surrogate/SPEC.md new file mode 100644 index 00000000..dc4f49ca --- /dev/null +++ b/experiments/ml/surrogate/SPEC.md @@ -0,0 +1,59 @@ +# Track 4 — Learned eval surrogate + CMA-ES tuning loop + +## Objective + +Make the **calibration loop fast**. Today, tuning a generator knob means: +edit config → full generate → full BF eval (the hours-long cycle the project +history keeps deferring). Replace most of those full evals with a learned +**surrogate** `f(knobs) → predicted BF composite`, and search the knob space +with **CMA-ES** against the surrogate — validating only the promising points +with the real eval. + +This is the only track that touches **performance, not realism**, and it has +**zero coherence risk** — it never changes generation, only *which config* we +pick. The generator + constraints are untouched. 
+ +## Why a surrogate (vs grid / manual tuning) + +The knob space (bypass share, drift thresholds, motif-bias weights, per-source +IET scales, …) is ~10-20 dimensional with expensive, noisy evaluations — +exactly the regime where Bayesian / surrogate-assisted optimization wins. +A cheap surrogate turns "10 full evals/day" into "thousands of surrogate +queries + a handful of confirmatory evals." + +## Data + +Bootstrapped from the baseline history: every `docs/baselines/*/metrics.csv` +is a `(knobs, composite)` sample. Plus an active-learning loop: each +confirmatory full eval adds a labeled point and retrains the surrogate. +Knob vector schema lives in `surrogate/knobs.py` (TODO: enumerate from +`GeneratorConfig` + the SP-series tuning params). + +## Architecture + +* **Surrogate**: small MLP (or GP for calibrated uncertainty at low data) — + `knobs (d) → [P1, P2, P3, P4 DRs]`, composite = aggregation. Predict the + *vector* of DRs, not just the scalar, so the optimizer can target specific + gaps. +* **Optimizer**: `cma.CMAEvolutionStrategy` over normalized knobs; acquisition + = surrogate-predicted composite + uncertainty bonus (UCB) to keep exploring. +* **Active loop**: every N surrogate-proposed optima → 1 real + `bf_bridge.score_canonical` → append → retrain surrogate. + +## Success criteria + +* Reach the current best composite (v5.26 ≈ 42 mean / 18 median) in **≤ ⅓ the + full evals** a manual sweep needed. +* Surrogate rank-correlation (Spearman) with the real eval > 0.8 on held-out + configs before trusting its proposals. + +## Handoff + +No generator change — output is a **tuned config patch** (same format the +existing `AutoTuner` emits). Drops straight into the regen pipeline. Runs CPU +or A100; the A100 just makes the surrogate retrain + CMA-ES batches instant. + +## Privacy + +None — operates on knob vectors + aggregate composite scores, never corpus +data or row-level output. 
diff --git a/experiments/ml/surrogate/__init__.py b/experiments/ml/surrogate/__init__.py new file mode 100644 index 00000000..1ccc5906 --- /dev/null +++ b/experiments/ml/surrogate/__init__.py @@ -0,0 +1 @@ +"""Track 4 — learned BF-eval surrogate + CMA-ES tuning loop.""" diff --git a/experiments/ml/surrogate/knobs.py b/experiments/ml/surrogate/knobs.py new file mode 100644 index 00000000..043f733d --- /dev/null +++ b/experiments/ml/surrogate/knobs.py @@ -0,0 +1,55 @@ +"""Knob vector schema for the tuning surrogate (Track 4). + +A knob = a normalized generator parameter the optimizer is allowed to move. +Bounds keep CMA-ES inside the validated config envelope. TODO: enumerate the +full set from `GeneratorConfig` + the SP-series tuning params; the entries +below are the ones the project history actually swept (bypass share, drift +thresholds, motif bias). +""" + +from __future__ import annotations + +from dataclasses import dataclass + +import numpy as np + + +@dataclass(frozen=True) +class Knob: + name: str + lo: float + hi: float + default: float + + +# Seed set from the baseline history (extend as more knobs are exposed). 
+KNOBS: list[Knob] = [ + Knob("priors_amount_bypass_share", 0.0, 0.5, 0.25), # SP5.3 sweet spot + Knob("drift_sigma_per_account", 1.0, 3.0, 2.0), # SP5.1 + Knob("drift_aggregate_pct", 0.001, 0.02, 0.005), # SP5.1 + Knob("tp_motif_bias", 0.0, 1.0, 0.5), # SP3.12 W2 + Knob("source_iet_scale", 0.5, 2.0, 1.0), + Knob("lines_per_je_dispersion", 0.5, 2.0, 1.0), + # TODO: append remaining swept params (W7.M bypass, semantic-split rate, …) +] + + +def to_vector(d: dict[str, float]) -> np.ndarray: + """Dict -> normalized [0,1] vector in KNOBS order.""" + return np.array( + [(d.get(k.name, k.default) - k.lo) / (k.hi - k.lo) for k in KNOBS], + dtype=np.float64, + ) + + +def from_vector(x: np.ndarray) -> dict[str, float]: + """Normalized vector -> concrete knob dict (clamped to bounds).""" + out = {} + for k, v in zip(KNOBS, x): + v = float(np.clip(v, 0.0, 1.0)) + out[k.name] = k.lo + v * (k.hi - k.lo) + return out + + +def dim() -> int: + return len(KNOBS) diff --git a/experiments/ml/surrogate/optimize.py b/experiments/ml/surrogate/optimize.py new file mode 100644 index 00000000..1d9f471c --- /dev/null +++ b/experiments/ml/surrogate/optimize.py @@ -0,0 +1,103 @@ +"""CMA-ES tuning loop over generator knobs, surrogate-assisted (Track 4). + + python -m surrogate.optimize --history docs/baselines --out weights/surrogate + +Loop: + 1. seed surrogate from baseline history (knobs, DRs) + 2. CMA-ES proposes knob vectors, scored cheaply by the surrogate (UCB) + 3. every --confirm-every generations, run the REAL BF scorer on the + incumbent, append the labeled point, retrain the surrogate (active learning) + 4. emit the best knobs as a config patch (AutoTuner format) + +The confirmation step shells out to a full generate + canonical BF scorer; +that is the only expensive call, and we make few of them. +""" + +from __future__ import annotations + +import argparse +import json +from pathlib import Path + +import numpy as np + +from . 
import knobs as K +from .surrogate import composite_from_drs, fit + + +def load_history(baselines_dir: Path) -> tuple[np.ndarray, np.ndarray]: + """Parse docs/baselines/*/metrics.csv into (knobs, DR-vector) samples. + + TODO(surrogate): map each baseline's recorded params -> knob vector and its + per-family DRs -> Y. Until wired, returns a tiny synthetic seed so the loop + is runnable end-to-end for smoke-testing the machinery. + """ + rng = np.random.default_rng(0) + X = rng.random((16, K.dim())) + Y = 1.0 + rng.random((16, 4)) * 40.0 # placeholder DRs + print("[optimize] WARNING: using synthetic seed — wire load_history to " + "docs/baselines before trusting results.") + return X, Y + + +def confirm_score(knob_dict: dict) -> np.ndarray: + """Full generate + canonical BF scorer for one knob config -> DR vector. + + TODO(surrogate): write knob_dict into a config patch, run + `datasynth-data generate`, then bf_bridge.score_canonical; return + [P1,P2,P3,P4] DRs. Expensive — called only on incumbents. 
+ """ + raise NotImplementedError( + "confirm_score: wire to the regen pipeline + bf_bridge.score_canonical" + ) + + +def main(argv: list[str] | None = None) -> None: + ap = argparse.ArgumentParser(description=__doc__) + ap.add_argument("--history", type=Path, default=Path("../../docs/baselines")) + ap.add_argument("--out", type=Path, required=True) + ap.add_argument("--generations", type=int, default=50) + ap.add_argument("--confirm-every", type=int, default=10) + ap.add_argument("--sigma0", type=float, default=0.2) + ap.add_argument("--smoke", action="store_true", + help="surrogate-only loop (no confirmation runs) to test machinery") + args = ap.parse_args(argv) + args.out.mkdir(parents=True, exist_ok=True) + + try: + import cma + except ImportError as exc: + raise SystemExit("cma required: pip install -r ../requirements.txt") from exc + + X, Y = load_history(args.history) + model = fit(X, Y) + + es = cma.CMAEvolutionStrategy(np.full(K.dim(), 0.5), args.sigma0, + {"bounds": [0.0, 1.0], "verbose": -1}) + import torch + + best = (np.inf, None) + for gen in range(1, args.generations + 1): + sols = es.ask() + with torch.no_grad(): + pred = model(torch.tensor(np.array(sols), dtype=torch.float32)).numpy() + scores = [composite_from_drs(r) for r in pred] # minimize composite + es.tell(sols, scores) + gbest = min(zip(scores, sols), key=lambda t: t[0]) + if gbest[0] < best[0]: + best = gbest + if gen % args.confirm_every == 0 and not args.smoke: + drs = confirm_score(K.from_vector(best[1])) # expensive, rare + X = np.vstack([X, best[1]]) + Y = np.vstack([Y, drs]) + model = fit(X, Y) # active-learning retrain + print(f"gen {gen}: confirmed composite={composite_from_drs(drs):.2f}") + + patch = K.from_vector(best[1]) + (args.out / "tuned_knobs.json").write_text(json.dumps(patch, indent=2)) + print(f"[optimize] best surrogate composite={best[0]:.2f}") + print(f"[optimize] wrote {args.out/'tuned_knobs.json'} (AutoTuner-compatible)") + + +if __name__ == "__main__": + main() 
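The loop above ranks candidates by the surrogate's mean prediction alone. Below is a minimal sketch of the optimistic score that the module docstring calls UCB, assuming an ensemble of surrogates supplies several composite predictions per candidate. The function name `optimistic_score`, the `kappa` weight, and the ensemble itself are assumptions for illustration, not part of this patch. Because the composite is minimized, the uncertainty bonus is subtracted.

```python
from statistics import fmean, pstdev


def optimistic_score(member_composites: list[float], kappa: float = 1.0) -> float:
    """Score one candidate from the composites predicted by each ensemble member.

    Lower is better. Disagreement between members lowers the score, which
    draws the search toward regions where the surrogate is uncertain.
    """
    mean = fmean(member_composites)
    spread = pstdev(member_composites)  # population std; 0.0 for a single member
    return mean - kappa * spread


# Hypothetical predictions for two candidates from a 3-member ensemble.
confident = optimistic_score([20.0, 20.0, 20.0])
uncertain = optimistic_score([10.0, 20.0, 30.0])
```

Scores computed this way could be passed to `es.tell(sols, scores)` in place of the plain composite; with `kappa=0.0` the behaviour reduces to the mean-only ranking.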
diff --git a/experiments/ml/surrogate/surrogate.py b/experiments/ml/surrogate/surrogate.py new file mode 100644 index 00000000..8d3af30b --- /dev/null +++ b/experiments/ml/surrogate/surrogate.py @@ -0,0 +1,71 @@ +"""MLP surrogate: knob vector -> predicted per-family BF degradation ratios. + +Predicts the DR *vector* [P1, P2, P3, P4] (not just the scalar) so the +optimizer can target specific gaps. Small + CPU-friendly; the A100 just makes +retraining instant inside the active loop. +""" + +from __future__ import annotations + +import numpy as np +import torch +import torch.nn as nn + +DR_FAMILIES = ["P1", "P2", "P3", "P4"] + + +class SurrogateMLP(nn.Module): + def __init__(self, in_dim: int, hidden=(64, 64), out_dim: int = len(DR_FAMILIES)): + super().__init__() + layers: list[nn.Module] = [] + d = in_dim + for h in hidden: + layers += [nn.Linear(d, h), nn.SiLU()] + d = h + layers += [nn.Linear(d, out_dim), nn.Softplus()] # DRs are >= 0 + self.net = nn.Sequential(*layers) + + def forward(self, x: torch.Tensor) -> torch.Tensor: + return self.net(x) + + +def composite_from_drs(drs: np.ndarray, weights: np.ndarray | None = None) -> float: + """Volume-corrected-style mean over DR families (see CHANGELOG composites).""" + w = np.ones(drs.shape[-1]) if weights is None else weights + return float((drs * w).sum() / w.sum()) + + +def fit( + X: np.ndarray, + Y: np.ndarray, + epochs: int = 500, + lr: float = 1e-3, + device: str = "cpu", +) -> SurrogateMLP: + """X: (n, d) normalized knobs. 
Y: (n, 4) per-family DRs.""" + dev = torch.device(device) + model = SurrogateMLP(X.shape[1]).to(dev) + opt = torch.optim.Adam(model.parameters(), lr=lr) + xt = torch.tensor(X, dtype=torch.float32, device=dev) + yt = torch.tensor(Y, dtype=torch.float32, device=dev) + model.train() + for ep in range(epochs): + pred = model(xt) + loss = nn.functional.mse_loss(pred, yt) + opt.zero_grad() + loss.backward() + opt.step() + if (ep + 1) % 100 == 0: + print(f" surrogate epoch {ep+1} mse={loss.item():.4f}") + return model + + +def spearman(model: SurrogateMLP, X: np.ndarray, Y: np.ndarray) -> float: + """Rank-correlation of predicted vs true composite (gate: > 0.8).""" + from scipy.stats import spearmanr + + with torch.no_grad(): + pred = model(torch.tensor(X, dtype=torch.float32)).numpy() + pc = np.array([composite_from_drs(r) for r in pred]) + tc = np.array([composite_from_drs(r) for r in Y]) + return float(spearmanr(pc, tc).correlation) From 044fdfbba897748f686edcdb61dcc0d007116f11 Mon Sep 17 00:00:00 2001 From: Michael Ivertowski Date: Wed, 20 May 2026 21:49:58 +0200 Subject: [PATCH 02/18] feat(ml): add inverse / simulation-based-inference (SBI) track MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Track 5 — run the generator backward. Given an observed GL, recover a posterior over the latent process parameters that could have produced it (audit-analytics direction). Feasible here because DataSynth is a structured generative model with KNOWN ground truth: it manufactures labeled (θ → GL) pairs for free, and the hard accounting constraints regularize the otherwise ill-posed inverse. Amortized SNPE: a conditional normalizing flow q_φ(θ | x) (reuses the flow/ NSF) trained on forward-simulated pairs, where x is a GL summary-stat vector. Many-to-one forward map ⇒ we recover a posterior, not a point. 
Files: - params.py tier-1 identifiable parameter set + priors (fraud rates, amount σ, posting-lag μ/σ, concentration) - simulate.py draw θ ~ prior → datasynth-data generate → summary stats x - model.py PosteriorFlow (zuko NSF conditioned on x) - train.py maximize Σ log q_φ(θ|x) on the simulated pairs - validate.py SBC rank histograms + credible-interval COVERAGE on held-out synthetic — measure how well 'backward' works before pointing at any real GL Scope ladder in SPEC.md: parameters (this) → process attribution (overlaps ocpm + gnn) → latent fraud/anomaly labels. Inversion quality is gated by the forward model's BF fidelity (distribution shift). Privacy: trains on synthetic only; emits parameter posteriors, never row-level content. Co-Authored-By: Claude Opus 4.7 (1M context) --- experiments/ml/README.md | 15 +++-- experiments/ml/inverse/SPEC.md | 97 ++++++++++++++++++++++++++++++ experiments/ml/inverse/__init__.py | 1 + experiments/ml/inverse/model.py | 46 ++++++++++++++ experiments/ml/inverse/params.py | 68 +++++++++++++++++++++ experiments/ml/inverse/simulate.py | 93 ++++++++++++++++++++++++++++ experiments/ml/inverse/train.py | 74 +++++++++++++++++++++++ experiments/ml/inverse/validate.py | 80 ++++++++++++++++++++++++ 8 files changed, 470 insertions(+), 4 deletions(-) create mode 100644 experiments/ml/inverse/SPEC.md create mode 100644 experiments/ml/inverse/__init__.py create mode 100644 experiments/ml/inverse/model.py create mode 100644 experiments/ml/inverse/params.py create mode 100644 experiments/ml/inverse/simulate.py create mode 100644 experiments/ml/inverse/train.py create mode 100644 experiments/ml/inverse/validate.py diff --git a/experiments/ml/README.md b/experiments/ml/README.md index 7c306d0b..bf581847 100644 --- a/experiments/ml/README.md +++ b/experiments/ml/README.md @@ -25,18 +25,25 @@ The NN never emits a final balance. It emits shape; the existing Rust generator projects that shape onto the feasible manifold. 
Coherence stays a hard guarantee by construction. -## The four tracks +## The five tracks -| Dir | Track | Architecture | BF metrics targeted | -|-----|-------|--------------|---------------------| +Tracks 1–4 sharpen the **forward** model (closing the BF gap); track 5 runs it +**backward** (recover the latent parameters from a GL). + +| Dir | Track | Architecture | Targets | +|-----|-------|--------------|---------| | [`gnn/`](gnn/SPEC.md) | Relational / interconnectivity sampler | GraphSAGE encoder + edge/degree decoder (GAE-style) | P3 ClusteringGap, TriangleLogRatio (TP / vendor / IC graphs) | | [`sequence/`](sequence/SPEC.md) | Temporal / behavioral stream | Autoregressive transformer over per-(source, entity) JE token streams | P1 IETD + Autocorr, P2 JELineBurst, P4 MeanGap | | [`flow/`](flow/SPEC.md) | Amount marginals | Conditional normalizing flow per (source, account-class) | Benford / multimodal amount fidelity | | [`surrogate/`](surrogate/SPEC.md) | Tuning-loop accelerator | MLP surrogate of the BF composite + CMA-ES over generator knobs | *Performance* of calibration (no coherence risk — never touches generation) | +| [`inverse/`](inverse/SPEC.md) | Backward inference (SBI) | Amortized neural posterior `q(θ\|GL)` trained on forward-simulated pairs | Recover the latent process *parameters* a GL was distilled from, with calibrated uncertainty (SBC + coverage validated on synthetic) | Start with **`gnn/`** (highest leverage on the structural gaps the hand-tuned motif samplers can't close) or **`surrogate/`** (pure iteration-speed win, -zero coherence risk). See each `SPEC.md` for objective, data contract, +zero coherence risk). **`inverse/`** is the audit-analytics direction — it +reuses the `flow/` density + the forward simulator's free ground truth. See +each `SPEC.md` for objective, data contract, architecture, and success +criteria. architecture, and success criteria. 
## Privacy / legal (read before running) diff --git a/experiments/ml/inverse/SPEC.md b/experiments/ml/inverse/SPEC.md new file mode 100644 index 00000000..c9d71403 --- /dev/null +++ b/experiments/ml/inverse/SPEC.md @@ -0,0 +1,97 @@ +# Track 5 — Inverse / simulation-based inference (SBI) + +## Objective + +Run the generator **backward**: given an observed GL, recover a posterior over +the latent process **parameters** (and, later, process structure / ground-truth +labels) that could have produced it. This turns DataSynth from a forward +simulator into an *audit-analytics* inference tool — reconstructing the +processes a GL was distilled from, with calibrated uncertainty. + +## Why this is tractable here (and rarely elsewhere) + +DataSynth is a **structured generative model with known ground truth**. It +manufactures labeled `(parameters → GL)` pairs at scale (~200K entries/s), so +we get *supervised training data for the inverse for free* — the thing most +inverse problems lack. And the hard accounting constraints (debits=credits, +A=L+E, document-chain integrity, three-way-match tolerances) shrink the inverse +search space dramatically, regularizing an otherwise ill-posed problem. + +## The inverse is many-to-one → recover a posterior, not a point + +The forward map discards information at every layer (a round-dollar weekend +posting is consistent with both fraud and a legitimate accrual). So the target +is `p(θ | GL)`, a **posterior**, never a unique reconstruction. Report +posteriors + coverage; never false-precision point estimates. + +## Approach — amortized SBI (SNPE-style) + +``` + forward (datasynth-data generate) + θ ~ prior ───────────────────────────────▶ GL ──summary stats──▶ x + │ │ + └──────────── train q_φ(θ | x) (conditional flow) ◀─────────┘ + inference: real GL ──summary stats──▶ x* ──▶ q_φ(θ | x*) (one fwd pass) +``` + +1. **`simulate.py`** — draw θ from a prior over a *small, identifiable* + parameter set first (e.g. 
`fraud_rate`, `document_fraud_rate`, fan-out + shape, posting-lag μ/σ, amount log-normal σ), run `datasynth-data generate`, + compute summary statistics `x` of the resulting GL. Emit `(θ, x)` pairs. +2. **`model.py`** — a conditional normalizing flow `q_φ(θ | x)` (reuse the + `flow/` track's zuko NSF, conditioned on `x`). This is Sequential Neural + Posterior Estimation in its single-round (amortized) form. +3. **`train.py`** — maximize `Σ log q_φ(θ_i | x_i)` over the simulated pairs. +4. **`validate.py`** — the clean part: validate on **held-out synthetic** where + θ is known. Metrics: posterior-mean error per parameter, **simulation-based + calibration (SBC)** rank histograms, and credible-interval **coverage** + (a 90% interval should contain the truth ~90% of the time). + +## Summary statistics `x` (the GL → feature map) + +Reuse `common.bf_bridge` feature extractors + add inverse-relevant ones: +per-source row-share, IIET distribution moments, lines-per-JE histogram, +amount log-moments + Benford MAD, fan-out degree stats, weekend/off-hours/ +round-dollar fractions, document-chain completeness. TODO: finalize the +feature vector in `simulate.py` once the parameter set is fixed. + +## Scope ladder (do in order) + +1. **Parameters only** (this spec): 5–10 identifiable knobs, validated on + synthetic. Lowest risk, clearest eval. +2. **Process attribution**: which JEs form one P2P/O2C instance — overlaps + `datasynth-ocpm` discovery + conformance; a GNN over the transaction graph + (see `gnn/`) is the natural tool. +3. **Latent labels** (fraud / anomaly cause): a ranked posterior per JE. + Hardest; bounded by identifiability. + +## Success criteria (tier 1) + +- Posterior-mean recovers each parameter within its prior's noise floor on + held-out synthetic. +- SBC rank histograms ~uniform; 90% credible-interval coverage in [0.85, 0.95]. +- Honest failure modes documented per parameter (which are well- vs + poorly-identified from GL alone). 
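The "SBC rank histograms ~uniform" criterion can be turned into a number. The sketch below is an assumption-labeled illustration, not part of `validate.py`: it bins the SBC ranks of one parameter and computes a Pearson chi-square statistic against the uniform law. The helper name `sbc_chi2` and the 1% threshold are hypothetical choices; the bins are exactly equiprobable only when `n_samples + 1` is divisible by `bins`.

```python
def sbc_chi2(ranks, n_samples, bins=10):
    """Pearson chi-square of SBC ranks against the uniform law on {0..n_samples}.

    ranks: one integer per held-out simulation, the number of posterior
    samples that fall below the true parameter value.
    """
    counts = [0] * bins
    for r in ranks:
        # Map rank in {0..n_samples} to a bin index in {0..bins-1}.
        counts[min(bins - 1, r * bins // (n_samples + 1))] += 1
    expected = len(ranks) / bins
    return sum((c - expected) ** 2 / expected for c in counts)


# Upper 1% point of the chi-square distribution with 9 degrees of freedom
# (10 bins); a statistic above it suggests a non-uniform rank histogram.
CHI2_CRIT_9DOF_1PCT = 21.67

uniform_ranks = list(range(100))   # every rank 0..99 exactly once
collapsed_ranks = [0] * 100        # truth always below every posterior sample

print(sbc_chi2(uniform_ranks, n_samples=99))
print(sbc_chi2(collapsed_ranks, n_samples=99))
```

A per-parameter statistic like this could sit next to the coverage column in the validation report; parameters that fail it are candidates for the "poorly identified from GL alone" list.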
+ +## Distribution shift = the BF gap + +The inverse is only as trustworthy as the forward model's fidelity to reality. +An inverse trained on synthetic, applied to a real GL, is biased by exactly the +behavioral-fidelity gap the composite measures. So fidelity work directly gates +inversion quality — and the inverse should only be pointed at real GL once the +forward model's BF composite is acceptable for the targeted account/source mix. + +## Privacy + +Training data is synthetic (no corpus). Applying the trained inverse to a real +GL reads that GL but emits only parameter posteriors — no row-level corpus +content. Same `DATASYNTH_CORPUS_DIR` discipline if real GL is used for +evaluation; results (posteriors) are not corpus content but treat any +real-GL-derived artifact as sensitive until reviewed. + +## Handoff + +Output is a posterior over generator knobs — directly comparable to the +`surrogate/` track's knob space and consumable as an AutoTuner-style report +("the corpus most likely came from these parameters"). Python-side; no Rust +generator change. diff --git a/experiments/ml/inverse/__init__.py b/experiments/ml/inverse/__init__.py new file mode 100644 index 00000000..1d14ed37 --- /dev/null +++ b/experiments/ml/inverse/__init__.py @@ -0,0 +1 @@ +"""Track 5 — inverse / simulation-based inference (amortized SNPE).""" diff --git a/experiments/ml/inverse/model.py b/experiments/ml/inverse/model.py new file mode 100644 index 00000000..a3357b91 --- /dev/null +++ b/experiments/ml/inverse/model.py @@ -0,0 +1,46 @@ +"""Amortized posterior estimator q_φ(θ | x) for the inverse track. + +A conditional normalizing flow (zuko NSF) that maps a GL summary-stat vector +`x` to a distribution over normalized parameters θ ∈ [0,1]^d. Single-round +(amortized) SNPE: one network, trained on prior-simulated pairs, usable on any +new GL in a forward pass. 
+""" + +from __future__ import annotations + +import torch +import torch.nn as nn + +try: + import zuko +except ImportError as exc: # pragma: no cover + raise ImportError("zuko required: pip install -r ../requirements.txt") from exc + + +class PosteriorFlow(nn.Module): + def __init__(self, dim_theta: int, dim_x: int, transforms: int = 5, + hidden=(128, 128)): + super().__init__() + # Standardize x before conditioning (fit at train time). + self.register_buffer("x_mean", torch.zeros(dim_x)) + self.register_buffer("x_std", torch.ones(dim_x)) + self.flow = zuko.flows.NSF( + features=dim_theta, context=dim_x, transforms=transforms, + hidden_features=hidden, + ) + + def set_x_norm(self, mean: torch.Tensor, std: torch.Tensor) -> None: + self.x_mean.copy_(mean) + self.x_std.copy_(std.clamp_min(1e-6)) + + def _cond(self, x: torch.Tensor) -> torch.Tensor: + return (x - self.x_mean) / self.x_std + + def log_prob(self, theta: torch.Tensor, x: torch.Tensor) -> torch.Tensor: + return self.flow(self._cond(x)).log_prob(theta) + + @torch.no_grad() + def sample(self, x: torch.Tensor, n: int) -> torch.Tensor: + """Draw n posterior samples of θ (normalized) for a single x (dim_x,).""" + ctx = self._cond(x.unsqueeze(0)) + return self.flow(ctx).sample((n,)).squeeze(1) diff --git a/experiments/ml/inverse/params.py b/experiments/ml/inverse/params.py new file mode 100644 index 00000000..3829bad5 --- /dev/null +++ b/experiments/ml/inverse/params.py @@ -0,0 +1,68 @@ +"""Inverse parameter space (tier 1): a small, identifiable set of generator +knobs with priors. Kept deliberately small — these are the latents we expect +to recover from a GL with calibrated uncertainty. Extend only after SBC + +coverage stay healthy (poorly-identified params widen everything's posterior). 
+""" + +from __future__ import annotations + +from dataclasses import dataclass + +import numpy as np + + +@dataclass(frozen=True) +class Param: + name: str # config key written into the generate config + lo: float + hi: float + log: bool = False # sample/scale in log space (for rates spanning decades) + + +# Tier-1 set. Names map to GeneratorConfig keys (see datasynth-config schema). +PARAMS: list[Param] = [ + Param("fraud.fraud_rate", 0.0, 0.10), + Param("fraud.document_fraud_rate", 0.0, 0.10), + Param("distributions.amounts.sigma", 0.5, 2.5), # log-normal width + Param("temporal_patterns.processing_lags.invoice_receipt_lag.mu", 0.0, 3.0), + Param("temporal_patterns.processing_lags.invoice_receipt_lag.sigma", 0.2, 1.5), + Param("vendor_network.dependencies.top_5_concentration", 0.20, 0.70), + # TODO: add lines-per-JE dispersion + source-mix concentration once the + # summary-stat feature map (simulate.py) exposes the matching observables. +] + + +def sample_prior(rng: np.random.Generator, n: int) -> np.ndarray: + """Draw n θ vectors from the (independent, uniform) prior. 
Shape (n, d).""" + cols = [] + for p in PARAMS: + if p.log: + lo, hi = np.log(max(p.lo, 1e-6)), np.log(p.hi) + cols.append(np.exp(rng.uniform(lo, hi, n))) + else: + cols.append(rng.uniform(p.lo, p.hi, n)) + return np.stack(cols, axis=1) + + +def to_config_overrides(theta: np.ndarray) -> dict[str, float]: + """One θ vector -> {config_key: value} overrides for a generate run.""" + return {p.name: float(v) for p, v in zip(PARAMS, theta)} + + +def normalize(theta: np.ndarray) -> np.ndarray: + """Map θ to [0,1]^d for stable flow training.""" + out = np.empty_like(theta, dtype=np.float64) + for j, p in enumerate(PARAMS): + out[..., j] = (theta[..., j] - p.lo) / (p.hi - p.lo) + return out + + +def denormalize(u: np.ndarray) -> np.ndarray: + out = np.empty_like(u, dtype=np.float64) + for j, p in enumerate(PARAMS): + out[..., j] = np.clip(u[..., j], 0.0, 1.0) * (p.hi - p.lo) + p.lo + return out + + +def dim() -> int: + return len(PARAMS) diff --git a/experiments/ml/inverse/simulate.py b/experiments/ml/inverse/simulate.py new file mode 100644 index 00000000..b71b26ff --- /dev/null +++ b/experiments/ml/inverse/simulate.py @@ -0,0 +1,93 @@ +"""Generate (θ, x) training pairs for the inverse model. + +For each θ drawn from the prior: write a config with those overrides, run +`datasynth-data generate`, read the resulting journal_entries, compute the +summary-stat feature vector x. Emit (θ, x) to `--out`. + + python -m inverse.simulate --n 2000 --out data/inverse --base configs/demo.yaml + +CPU-bound and embarrassingly parallel across θ; safe to shard. Does NOT need +the corpus — this is pure synthetic self-simulation (the SBI training set). +""" + +from __future__ import annotations + +import argparse +import json +import subprocess +import tempfile +from pathlib import Path + +import numpy as np + +from . 
import params as P + + +def _cli() -> str: + import shutil + for c in ("./target/release/datasynth-data", "datasynth-data"): + if shutil.which(c) or Path(c).exists(): + return c + raise FileNotFoundError("build datasynth-data (cargo build --release -p datasynth-cli)") + + +def summary_stats(je_csv: Path) -> np.ndarray: + """GL → fixed-length feature vector x. Reuses the same observables the BF + eval keys on so the inverse 'sees' what the forward model varies. + + TODO: finalize alongside params.py. Pseudocode: + - per-source row-share (top-K sources) + - inter-event-time mean/std/skew per source, pooled + - lines-per-JE histogram (fixed bins) + - log|amount| mean/std + Benford first-digit MAD + - weekend / off-hours / round-dollar fractions + - fan-out degree mean/gini; document-chain completeness + """ + raise NotImplementedError( + "summary_stats: compute the fixed-length feature vector from je_csv " + "(see SPEC.md § Summary statistics; share extractors with common.bf_bridge)." + ) + + +def run_one(cli: str, base_cfg: Path, theta: np.ndarray, workdir: Path) -> np.ndarray: + overrides = P.to_config_overrides(theta) + # TODO: merge `overrides` (dotted keys) into a copy of base_cfg → cfg.yaml. + cfg = workdir / "cfg.yaml" + out = workdir / "out" + raise NotImplementedError( + f"run_one: write {cfg} = base_cfg + {overrides}, then " + f"`{cli} generate -c {cfg} -o {out} --memory-limit 512 --max-threads 1`, " + f"then summary_stats({out}/journal_entries.csv)." 
+ ) + + +def main(argv: list[str] | None = None) -> None: + ap = argparse.ArgumentParser(description=__doc__) + ap.add_argument("--n", type=int, default=2000, help="number of (θ, x) pairs") + ap.add_argument("--out", type=Path, required=True) + ap.add_argument("--base", type=Path, required=True, help="base generate config") + ap.add_argument("--seed", type=int, default=0) + args = ap.parse_args(argv) + args.out.mkdir(parents=True, exist_ok=True) + + rng = np.random.default_rng(args.seed) + thetas = P.sample_prior(rng, args.n) + cli = _cli() + + xs = [] + with tempfile.TemporaryDirectory() as td: + for i, theta in enumerate(thetas): + x = run_one(cli, args.base, theta, Path(td)) # raises until wired + xs.append(x) + if (i + 1) % 50 == 0: + print(f"[simulate] {i+1}/{args.n}") + X = np.stack(xs) + np.savez(args.out / "pairs.npz", theta=thetas, x=X, + param_names=[p.name for p in P.PARAMS]) + (args.out / "meta.json").write_text(json.dumps( + {"n": args.n, "dim_theta": P.dim(), "dim_x": X.shape[1]}, indent=2)) + print(f"[simulate] wrote {args.out/'pairs.npz'}") + + +if __name__ == "__main__": + main() diff --git a/experiments/ml/inverse/train.py b/experiments/ml/inverse/train.py new file mode 100644 index 00000000..d457f9b7 --- /dev/null +++ b/experiments/ml/inverse/train.py @@ -0,0 +1,74 @@ +"""Train the amortized posterior q_φ(θ | x) on simulated pairs. + + python -m inverse.train --data data/inverse --out weights/inverse --epochs 300 +""" + +from __future__ import annotations + +import argparse +from pathlib import Path + +import numpy as np +import torch +from torch.utils.data import DataLoader, TensorDataset + +from . 
import params as P +from .model import PosteriorFlow + + +def main(argv: list[str] | None = None) -> None: + ap = argparse.ArgumentParser(description=__doc__) + ap.add_argument("--data", type=Path, required=True) + ap.add_argument("--out", type=Path, required=True) + ap.add_argument("--epochs", type=int, default=300) + ap.add_argument("--batch-size", type=int, default=256) + ap.add_argument("--lr", type=float, default=1e-3) + ap.add_argument("--val-frac", type=float, default=0.2) + ap.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu") + args = ap.parse_args(argv) + args.out.mkdir(parents=True, exist_ok=True) + dev = torch.device(args.device) + + blob = np.load(args.data / "pairs.npz") + theta = P.normalize(blob["theta"]).astype("float32") # (n, d) in [0,1] + x = blob["x"].astype("float32") # (n, dim_x) + + n_val = max(1, int(len(theta) * args.val_frac)) + tr = slice(n_val, None) + va = slice(0, n_val) + + model = PosteriorFlow(dim_theta=P.dim(), dim_x=x.shape[1]).to(dev) + xt = torch.tensor(x, device=dev) + model.set_x_norm(xt[tr].mean(0), xt[tr].std(0)) + + ds = TensorDataset(torch.tensor(theta[tr], device=dev), xt[tr]) + dl = DataLoader(ds, batch_size=args.batch_size, shuffle=True) + opt = torch.optim.Adam(model.parameters(), lr=args.lr) + + theta_va = torch.tensor(theta[va], device=dev) + x_va = xt[va] + best = float("inf") + for ep in range(1, args.epochs + 1): + model.train(True) + run = 0.0 + for th, xx in dl: + loss = -model.log_prob(th, xx).mean() + opt.zero_grad() + loss.backward() + opt.step() + run += loss.item() + model.train(False) # inference mode (equivalent to .eval()) + with torch.no_grad(): + vloss = -model.log_prob(theta_va, x_va).mean().item() + if vloss < best: + best = vloss + torch.save({"model": model.state_dict(), "dim_x": x.shape[1]}, + args.out / "posterior.pt") + if ep % 25 == 0: + print(f"epoch {ep:4d} train_nll={run/len(dl):.3f} val_nll={vloss:.3f}") + print(f"[inverse.train] best val_nll={best:.3f} -> 
{args.out/'posterior.pt'}") + print("Next: python -m inverse.validate --data ... --weights ... (SBC + coverage)") + + +if __name__ == "__main__": + main() diff --git a/experiments/ml/inverse/validate.py b/experiments/ml/inverse/validate.py new file mode 100644 index 00000000..88410c00 --- /dev/null +++ b/experiments/ml/inverse/validate.py @@ -0,0 +1,80 @@ +"""Validate the inverse posterior on held-out synthetic, where θ is known. + + python -m inverse.validate --data data/inverse --weights weights/inverse + +Reports, per parameter: + - posterior-mean absolute error (vs the true θ) + - simulation-based calibration (SBC) rank: for a calibrated posterior, the + rank of the true θ among posterior samples is uniform on [0, n_samples]. + - central credible-interval coverage (a 90% interval should contain the + truth ~90% of the time) — the headline trust metric. + +This is the whole point of doing inversion against a forward simulator: we can +measure how well 'running the engine backward' works BEFORE pointing it at any +real GL. +""" + +from __future__ import annotations + +import argparse +from pathlib import Path + +import numpy as np +import torch + +from . 
import params as P +from .model import PosteriorFlow + + +def coverage_and_sbc(model: PosteriorFlow, theta_true_n: np.ndarray, + x: torch.Tensor, n_samples: int = 500, cred: float = 0.90): + d = P.dim() + ranks = np.zeros((len(theta_true_n), d), dtype=int) + covered = np.zeros((len(theta_true_n), d), dtype=bool) + abs_err = np.zeros((len(theta_true_n), d)) + lo_q, hi_q = (1 - cred) / 2, 1 - (1 - cred) / 2 + model.train(False) # inference mode: freeze dropout / running stats + for i in range(len(theta_true_n)): + s = model.sample(x[i], n_samples).cpu().numpy() # (n_samples, d) normalized + truth = theta_true_n[i] + ranks[i] = (s < truth).sum(axis=0) + lo = np.quantile(s, lo_q, axis=0) + hi = np.quantile(s, hi_q, axis=0) + covered[i] = (truth >= lo) & (truth <= hi) + abs_err[i] = np.abs(s.mean(axis=0) - truth) + return ranks, covered.mean(axis=0), abs_err.mean(axis=0) + + +def main(argv: list[str] | None = None) -> None: + ap = argparse.ArgumentParser(description=__doc__) + ap.add_argument("--data", type=Path, required=True) + ap.add_argument("--weights", type=Path, required=True) + ap.add_argument("--val-frac", type=float, default=0.2) + ap.add_argument("--cred", type=float, default=0.90) + ap.add_argument("--device", default="cpu") + args = ap.parse_args(argv) + dev = torch.device(args.device) + + blob = np.load(args.data / "pairs.npz") + theta_n = P.normalize(blob["theta"]).astype("float32") + x = blob["x"].astype("float32") + n_val = max(1, int(len(theta_n) * args.val_frac)) + theta_va, x_va = theta_n[:n_val], torch.tensor(x[:n_val], device=dev) + + ckpt = torch.load(args.weights / "posterior.pt", map_location=dev) + model = PosteriorFlow(dim_theta=P.dim(), dim_x=ckpt["dim_x"]).to(dev) + model.load_state_dict(ckpt["model"]) + + ranks, cov, err = coverage_and_sbc(model, theta_va, x_va, cred=args.cred) + print(f"{'parameter':<55} {'mae(norm)':>10} {f'{int(args.cred*100)}%cov':>8}") + for j, p in enumerate(P.PARAMS): + flag = "" if 0.85 <= cov[j] <= 0.95 else " 
miscalibrated" + print(f"{p.name:<55} {err[j]:>10.3f} {cov[j]:>8.2f}{flag}") + print("\nSBC: rank histograms should be ~uniform (export ranks for a plot).") + print("TODO: save ranks -> SBC rank-histogram PNG; flag non-uniform params " + "as poorly identified from GL alone (expected for some — an honest " + "finding, not a bug).") + + +if __name__ == "__main__": + main() From 63c09e83c27a152fd7e98e029a75dff8254e1573 Mon Sep 17 00:00:00 2001 From: Michael Ivertowski Date: Thu, 21 May 2026 12:44:50 +0200 Subject: [PATCH 03/18] feat(ml/inverse): implement simulate.py + trim params to 5 verified scalar knobs Fills the simulate.py TODOs: a 29-dim observable-only GL summary-stat vector (amount / Benford / round-dollar / weekend / lines-per-JE / posting-lag / source-mix / IET / GL-concentration) and run_one (dotted-key config override -> datasynth-data generate -> summary_stats), fanned out over a process pool. Drops the invalid distributions.amounts.sigma knob (amounts is a mixture components list, not a scalar) so overrides stay valid under deny_unknown_fields; 5 verified scalar params remain for the tier-1 demo. Co-Authored-By: Claude Opus 4.7 (1M context) --- experiments/ml/inverse/params.py | 10 +- experiments/ml/inverse/simulate.py | 252 ++++++++++++++++++++++++----- 2 files changed, 215 insertions(+), 47 deletions(-) diff --git a/experiments/ml/inverse/params.py b/experiments/ml/inverse/params.py index 3829bad5..083161fb 100644 --- a/experiments/ml/inverse/params.py +++ b/experiments/ml/inverse/params.py @@ -19,16 +19,18 @@ class Param: log: bool = False # sample/scale in log space (for rates spanning decades) -# Tier-1 set. Names map to GeneratorConfig keys (see datasynth-config schema). +# Tier-1 set. Names map to GeneratorConfig keys (verified scalar paths in +# datasynth-config/src/schema.rs — `deny_unknown_fields` is in force, so every +# key here must be a real settable field). 
The amount log-normal width was +# dropped: `distributions.amounts` is a mixture `components` list, not a scalar +# `sigma`; re-adding it needs a structured (replace-components) override — +# tracked as the 6th-knob follow-up. PARAMS: list[Param] = [ Param("fraud.fraud_rate", 0.0, 0.10), Param("fraud.document_fraud_rate", 0.0, 0.10), - Param("distributions.amounts.sigma", 0.5, 2.5), # log-normal width Param("temporal_patterns.processing_lags.invoice_receipt_lag.mu", 0.0, 3.0), Param("temporal_patterns.processing_lags.invoice_receipt_lag.sigma", 0.2, 1.5), Param("vendor_network.dependencies.top_5_concentration", 0.20, 0.70), - # TODO: add lines-per-JE dispersion + source-mix concentration once the - # summary-stat feature map (simulate.py) exposes the matching observables. ] diff --git a/experiments/ml/inverse/simulate.py b/experiments/ml/inverse/simulate.py index b71b26ff..a69347a1 100644 --- a/experiments/ml/inverse/simulate.py +++ b/experiments/ml/inverse/simulate.py @@ -4,61 +4,211 @@ `datasynth-data generate`, read the resulting journal_entries, compute the summary-stat feature vector x. Emit (θ, x) to `--out`. - python -m inverse.simulate --n 2000 --out data/inverse --base configs/demo.yaml + python -m inverse.simulate --n 2000 --out data/inverse --base configs/inverse_base.yaml -CPU-bound and embarrassingly parallel across θ; safe to shard. Does NOT need -the corpus — this is pure synthetic self-simulation (the SBI training set). +CPU-bound and embarrassingly parallel across θ; we fan out over a process +pool. Does NOT need the corpus — pure synthetic self-simulation (the SBI +training set). Runs that fail (bad override / generate error) are dropped, not +fatal. """ from __future__ import annotations import argparse +import copy import json +import os +import shutil import subprocess import tempfile +from concurrent.futures import ProcessPoolExecutor, as_completed from pathlib import Path import numpy as np +import pandas as pd +import yaml from . 
import params as P +_ROUND_LEVELS = np.array([1_000.0, 5_000.0, 10_000.0, 25_000.0, 50_000.0, 100_000.0]) + +# Fixed feature order — keep stable so x is comparable across runs / re-runs. +FEATURE_NAMES = [ + "log_amt_mean", "log_amt_std", "log_amt_skew", "benford_mad", "round_frac", + "weekend_frac", "monthend_frac", "postclose_frac", "manual_frac", + "lpje_mean", "lpje_std", "lpje_frac2", "lpje_frac_gt2", + "lag_mean", "lag_std", "lag_pos_frac", + "src_share1", "src_share2", "src_share3", "src_share4", "src_share5", "src_entropy", + "iet_mean", "iet_std", + "gl_n_log", "gl_top5_share", "gl_entropy", + "n_lines_log", "n_docs_log", +] +DIM_X = len(FEATURE_NAMES) + def _cli() -> str: - import shutil for c in ("./target/release/datasynth-data", "datasynth-data"): - if shutil.which(c) or Path(c).exists(): + if Path(c).exists() or shutil.which(c): return c raise FileNotFoundError("build datasynth-data (cargo build --release -p datasynth-cli)") +def _moments(v: np.ndarray) -> tuple[float, float, float]: + v = v[np.isfinite(v)] + if v.size == 0: + return 0.0, 0.0, 0.0 + m, s = float(v.mean()), float(v.std()) + sk = float(((v - m) ** 3).mean() / (s ** 3)) if s > 1e-9 else 0.0 + return m, s, sk + + +def _entropy(shares: np.ndarray) -> float: + p = shares[shares > 0] + return float(-(p * np.log(p)).sum()) if p.size else 0.0 + + def summary_stats(je_csv: Path) -> np.ndarray: - """GL → fixed-length feature vector x. Reuses the same observables the BF - eval keys on so the inverse 'sees' what the forward model varies. - - TODO: finalize alongside params.py. 
Pseudocode: - - per-source row-share (top-K sources) - - inter-event-time mean/std/skew per source, pooled - - lines-per-JE histogram (fixed bins) - - log|amount| mean/std + Benford first-digit MAD - - weekend / off-hours / round-dollar fractions - - fan-out degree mean/gini; document-chain completeness - """ - raise NotImplementedError( - "summary_stats: compute the fixed-length feature vector from je_csv " - "(see SPEC.md § Summary statistics; share extractors with common.bf_bridge)." - ) - - -def run_one(cli: str, base_cfg: Path, theta: np.ndarray, workdir: Path) -> np.ndarray: - overrides = P.to_config_overrides(theta) - # TODO: merge `overrides` (dotted keys) into a copy of base_cfg → cfg.yaml. - cfg = workdir / "cfg.yaml" + """GL → fixed-length feature vector x (DIM_X,). Observable-only (no labels) + so the same map applies to a real GL at inference time.""" + df = pd.read_csv(je_csv, low_memory=False) + n = len(df) + if n == 0: + return np.zeros(DIM_X, dtype=np.float32) + + deb = pd.to_numeric(df.get("debit_amount", pd.Series(0.0, index=df.index)), errors="coerce").fillna(0.0).to_numpy() + cred = pd.to_numeric(df.get("credit_amount", pd.Series(0.0, index=df.index)), errors="coerce").fillna(0.0).to_numpy() + amt = np.where(deb != 0, deb, cred).astype(float) + nz = np.abs(amt[amt != 0]) + log_amt = np.log1p(nz) + la_mean, la_std, la_skew = _moments(log_amt) + + # Benford first-digit MAD vs the ideal law.
+ fd = np.array([int(str(int(a))[0]) for a in nz if a >= 1], dtype=int) + if fd.size: + obs = np.array([(fd == d).mean() for d in range(1, 10)]) + exp = np.log10(1 + 1 / np.arange(1, 10)) + benford_mad = float(np.abs(obs - exp).mean()) + else: + benford_mad = 0.0 + + nearest = np.abs(nz[:, None] - _ROUND_LEVELS[None, :]).min(axis=1) if nz.size else np.array([1e9]) + round_frac = float((nearest < 1.0).mean()) + + pdt = pd.to_datetime(df.get("posting_date"), errors="coerce") + dow = pdt.dt.dayofweek + weekend_frac = float((dow >= 5).mean()) if n else 0.0 + monthend_frac = float((pdt.dt.day >= 25).mean()) if n else 0.0 + + def _frac_true(col: str) -> float: + if col not in df: + return 0.0 + return float(df[col].astype("boolean").fillna(False).mean()) + + postclose_frac = _frac_true("is_post_close") + manual_frac = _frac_true("is_manual") + + # Lines per JE + if "document_id" in df: + lpje = df.groupby("document_id").size().to_numpy() + lpje_mean, lpje_std = float(lpje.mean()), float(lpje.std()) + lpje_f2 = float((lpje == 2).mean()) + lpje_fgt2 = float((lpje > 2).mean()) + else: + lpje_mean = lpje_std = lpje_f2 = lpje_fgt2 = 0.0 + + # Posting lag (posting - document), days + if "document_date" in df: + ddt = pd.to_datetime(df["document_date"], errors="coerce") + lag = (pdt - ddt).dt.days.to_numpy().astype(float) + lag = lag[np.isfinite(lag)] + lag_mean, lag_std = (float(lag.mean()), float(lag.std())) if lag.size else (0.0, 0.0) + lag_pos = float((lag > 0).mean()) if lag.size else 0.0 + else: + lag_mean = lag_std = lag_pos = 0.0 + + # Source mix + if "source" in df: + vc = df["source"].astype(str).value_counts(normalize=True) + shares = vc.to_numpy() + src5 = list(shares[:5]) + [0.0] * (5 - min(5, len(shares))) + src_ent = _entropy(shares) + else: + src5, src_ent = [0.0] * 5, 0.0 + + # Inter-event time per source (pooled gaps between sorted posting days) + iets = [] + if "source" in df and pdt.notna().any(): + tmp = pd.DataFrame({"s": df["source"].astype(str), "d": 
pdt}) + for _, g in tmp.dropna().groupby("s"): + days = np.sort(g["d"].astype("int64").to_numpy()) / 86_400_000_000_000 + if days.size > 1: + iets.append(np.diff(days)) + if iets: + allg = np.concatenate(iets) + iet_mean, iet_std = float(allg.mean()), float(allg.std()) + else: + iet_mean = iet_std = 0.0 + + # GL account fan-out / concentration + if "gl_account" in df: + gvc = df["gl_account"].astype(str).value_counts(normalize=True) + gl_n_log = float(np.log1p(len(gvc))) + gl_top5 = float(gvc.to_numpy()[:5].sum()) + gl_ent = _entropy(gvc.to_numpy()) + else: + gl_n_log = gl_top5 = gl_ent = 0.0 + + n_docs = df["document_id"].nunique() if "document_id" in df else n + feats = [ + la_mean, la_std, la_skew, benford_mad, round_frac, + weekend_frac, monthend_frac, postclose_frac, manual_frac, + lpje_mean, lpje_std, lpje_f2, lpje_fgt2, + lag_mean, lag_std, lag_pos, + *src5, src_ent, + iet_mean, iet_std, + gl_n_log, gl_top5, gl_ent, + float(np.log1p(n)), float(np.log1p(n_docs)), + ] + return np.asarray(feats, dtype=np.float32) + + +def _set_dotted(cfg: dict, key: str, value) -> None: + """Set a dotted key into a nested dict, creating intermediate dicts.""" + parts = key.split(".") + node = cfg + for p in parts[:-1]: + node = node.setdefault(p, {}) + node[parts[-1]] = value + + +def run_one(cli: str, base_cfg: dict, theta: np.ndarray, seed: int, workdir: Path) -> np.ndarray | None: + cfg = copy.deepcopy(base_cfg) + for k, v in P.to_config_overrides(theta).items(): + _set_dotted(cfg, k, v) + _set_dotted(cfg, "global.seed", int(seed)) + cfg_path = workdir / "cfg.yaml" out = workdir / "out" - raise NotImplementedError( - f"run_one: write {cfg} = base_cfg + {overrides}, then " - f"`{cli} generate -c {cfg} -o {out} --memory-limit 512 --max-threads 1`, " - f"then summary_stats({out}/journal_entries.csv)." 
- ) + cfg_path.write_text(yaml.safe_dump(cfg)) + try: + subprocess.run( + [cli, "generate", "--config", str(cfg_path), "--output", str(out), + "--max-threads", "1"], + check=True, capture_output=True, timeout=300, + ) + je = out / "journal_entries.csv" + if not je.exists(): + return None + return summary_stats(je) + except (subprocess.CalledProcessError, subprocess.TimeoutExpired): + return None + finally: + shutil.rmtree(out, ignore_errors=True) + + +def _worker(args) -> tuple[int, np.ndarray | None]: + i, theta, seed, cli, base_cfg = args + with tempfile.TemporaryDirectory(prefix="sbi_") as td: + return i, run_one(cli, base_cfg, theta, seed, Path(td)) def main(argv: list[str] | None = None) -> None: @@ -67,26 +217,42 @@ def main(argv: list[str] | None = None) -> None: ap.add_argument("--out", type=Path, required=True) ap.add_argument("--base", type=Path, required=True, help="base generate config") ap.add_argument("--seed", type=int, default=0) + ap.add_argument("--workers", type=int, default=0, help="0 = os.cpu_count()-2") args = ap.parse_args(argv) args.out.mkdir(parents=True, exist_ok=True) + workers = args.workers or max(1, (os.cpu_count() or 4) - 2) rng = np.random.default_rng(args.seed) thetas = P.sample_prior(rng, args.n) cli = _cli() + base_cfg = yaml.safe_load(args.base.read_text()) + + jobs = [(i, thetas[i], args.seed + 1 + i, cli, base_cfg) for i in range(args.n)] + xs: dict[int, np.ndarray] = {} + done = fail = 0 + with ProcessPoolExecutor(max_workers=workers) as ex: + futs = [ex.submit(_worker, j) for j in jobs] + for fut in as_completed(futs): + i, x = fut.result() + done += 1 + if x is None: + fail += 1 + else: + xs[i] = x + if done % 50 == 0: + print(f"[simulate] {done}/{args.n} (failed {fail})", flush=True) - xs = [] - with tempfile.TemporaryDirectory() as td: - for i, theta in enumerate(thetas): - x = run_one(cli, args.base, theta, Path(td)) # raises until wired - xs.append(x) - if (i + 1) % 50 == 0: - print(f"[simulate] {i+1}/{args.n}") - X = 
np.stack(xs) - np.savez(args.out / "pairs.npz", theta=thetas, x=X, - param_names=[p.name for p in P.PARAMS]) + keep = sorted(xs) + if not keep: + raise SystemExit("[simulate] all runs failed — check the base config / override keys") + X = np.stack([xs[i] for i in keep]) + theta_keep = thetas[keep] + np.savez(args.out / "pairs.npz", theta=theta_keep, x=X, + param_names=[p.name for p in P.PARAMS], feature_names=FEATURE_NAMES) (args.out / "meta.json").write_text(json.dumps( - {"n": args.n, "dim_theta": P.dim(), "dim_x": X.shape[1]}, indent=2)) - print(f"[simulate] wrote {args.out/'pairs.npz'}") + {"n_requested": args.n, "n_kept": len(keep), "n_failed": fail, + "dim_theta": P.dim(), "dim_x": int(X.shape[1])}, indent=2)) + print(f"[simulate] kept {len(keep)}/{args.n} (failed {fail}) -> {args.out/'pairs.npz'}") if __name__ == "__main__": From 701e7b2d7d70304ffbd134830b2e6add0d3d5e9e Mon Sep 17 00:00:00 2001 From: Michael Ivertowski Date: Thu, 21 May 2026 13:20:14 +0200 Subject: [PATCH 04/18] feat(ml): descriptive corpus-vs-synthetic gap (interpretable 'what is missing') Side-by-side behavioral observables (lines-per-JE, log-amount moments, Benford MAD, round-dollar / small-ticket share, p99 amount, weekend share, source mix, per-source inter-event times) for corpus (corpus columns) vs a synthetic journal_entries.csv (canonical columns). Complements the normalized DRs from behavioral score with raw units. Co-Authored-By: Claude Opus 4.7 (1M context) --- experiments/ml/common/corpus_gap.py | 117 ++++++++++++++++++++++++++++ 1 file changed, 117 insertions(+) create mode 100644 experiments/ml/common/corpus_gap.py diff --git a/experiments/ml/common/corpus_gap.py b/experiments/ml/common/corpus_gap.py new file mode 100644 index 00000000..25570b5d --- /dev/null +++ b/experiments/ml/common/corpus_gap.py @@ -0,0 +1,117 @@ +"""Descriptive corpus-vs-synthetic gap — 'what's missing on the synthetic end'. 
+ +Complements `datasynth-data behavioral score` (normalized degradation ratios) +with raw, interpretable observables in plain units, so the gap is legible: +lines-per-JE, amount distribution (log-moments / Benford / round-dollar / small- +ticket share), source mix, weekend share, and per-source inter-event times. + + python -m common.corpus_gap --corpus /path/corpus.parquet --syn /path/journal_entries.csv + +Corpus uses its own column names; synthetic uses canonical names. Both are +mapped here. Emits a side-by-side table + a JSON of the gaps. +""" +from __future__ import annotations + +import argparse +import json +from pathlib import Path + +import numpy as np +import pandas as pd + +_ROUND = np.array([1_000.0, 5_000.0, 10_000.0, 25_000.0, 50_000.0, 100_000.0]) + + +def _benford_mad(a: np.ndarray) -> float: + fd = np.array([int(str(int(x))[0]) for x in a if x >= 1], dtype=int) + if not fd.size: + return float("nan") + obs = np.array([(fd == d).mean() for d in range(1, 10)]) + exp = np.log10(1 + 1 / np.arange(1, 10)) + return float(np.abs(obs - exp).mean()) + + +def _iet_stats(df: pd.DataFrame, src: str, date: str) -> tuple[float, float]: + iets = [] + sub = df[[src, date]].dropna() + sub = sub.assign(_d=pd.to_datetime(sub[date], errors="coerce")).dropna(subset=["_d"]) + for _, g in sub.groupby(src): + days = np.sort(g["_d"].astype("int64").to_numpy()) / 86_400_000_000_000 + if days.size > 1: + iets.append(np.diff(days)) + if not iets: + return float("nan"), float("nan") + allg = np.concatenate(iets) + return float(allg.mean()), float(allg.std()) + + +def observables(df: pd.DataFrame, jeid: str, src: str, amt: np.ndarray, date: str) -> dict: + a = np.abs(amt) + a = a[np.isfinite(a) & (a > 0)] + la = np.log1p(a) + lpje = df.groupby(jeid).size().to_numpy() if jeid in df else np.array([np.nan]) + pdt = pd.to_datetime(df[date], errors="coerce") if date in df else pd.Series([], dtype="datetime64[ns]") + nearest = np.abs(a[:, None] - _ROUND[None, :]).min(axis=1) if 
a.size else np.array([1e9]) + iet_m, iet_s = _iet_stats(df, src, date) if src in df and date in df else (float("nan"), float("nan")) + vc = df[src].astype(str).value_counts(normalize=True).to_numpy() if src in df else np.array([1.0]) + return { + "n_lines": int(len(df)), + "n_JEs": int(df[jeid].nunique()) if jeid in df else float("nan"), + "lines_per_JE_mean": float(np.nanmean(lpje)), + "lines_per_JE_p95": float(np.nanpercentile(lpje, 95)), + "log_amt_mean": float(la.mean()), + "log_amt_std": float(la.std()), + "log_amt_skew": float(((la - la.mean()) ** 3).mean() / (la.std() ** 3 + 1e-9)), + "benford_mad": _benford_mad(a), + "round_dollar_frac": float((nearest < 1.0).mean()), + "small_ticket_frac(<100)": float((a < 100).mean()), + "p99_amount": float(np.percentile(a, 99)) if a.size else float("nan"), + "weekend_frac": float((pdt.dt.dayofweek >= 5).mean()) if len(pdt) else float("nan"), + "n_sources": int(len(vc)), + "source_top1_share": float(vc.max()), + "source_entropy": float(-(vc[vc > 0] * np.log(vc[vc > 0])).sum()), + "iet_days_mean": iet_m, + "iet_days_std": iet_s, + } + + +def load_corpus(path: Path) -> dict: + df = pd.read_parquet(path) + amt = pd.to_numeric(df["Functional Amount"], errors="coerce").to_numpy() + return observables(df, "JE Number", "Source", amt, "Entry Date") + + +def load_syn(path: Path) -> dict: + df = pd.read_csv(path, low_memory=False) + deb = pd.to_numeric(df.get("debit_amount", 0), errors="coerce").fillna(0.0) + cred = pd.to_numeric(df.get("credit_amount", 0), errors="coerce").fillna(0.0) + amt = np.where(deb != 0, deb, cred).astype(float) + return observables(df, "document_id", "source", amt, "posting_date") + + +def main() -> None: + ap = argparse.ArgumentParser() + ap.add_argument("--corpus", type=Path, required=True) + ap.add_argument("--syn", type=Path, required=True) + ap.add_argument("--out", type=Path, default=None) + args = ap.parse_args() + + corp = load_corpus(args.corpus) + syn = load_syn(args.syn) + keys = 
list(corp.keys()) + print(f"{'observable':<26} {'corpus':>16} {'synthetic':>16} {'ratio syn/corp':>16}") + print("-" * 78) + gaps = {} + for k in keys: + c, s = corp[k], syn[k] + r = (s / c) if (isinstance(c, (int, float)) and c not in (0, float("nan")) and np.isfinite(c) and c != 0) else float("nan") + gaps[k] = {"corpus": c, "synthetic": s, "ratio": r} + print(f"{k:<26} {c:>16.4g} {s:>16.4g} {r:>16.3g}") + if args.out: + args.out.parent.mkdir(parents=True, exist_ok=True) + args.out.write_text(json.dumps(gaps, indent=2)) + print(f"\nwrote {args.out}") + + +if __name__ == "__main__": + main() From c66d8f5e16a658c88af0b80f3f16d1bd6f9a674b Mon Sep 17 00:00:00 2001 From: Michael Ivertowski Date: Thu, 21 May 2026 13:27:45 +0200 Subject: [PATCH 05/18] feat(ml/flow): implement export_flow (COA-joined amounts, account-class conditioning) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Reads corpus Functional Amount + GL Account Number, joins account_class via the COA 'c' key, emits y=signed log1p(|amount|) + one-hot(account_class) to amounts.parquet for the conditional flow. Tail clipped at p99.9 (privacy). Source (~4500 corpus levels — itself a finding) is not one-hot encoded. Co-Authored-By: Claude Opus 4.7 (1M context) --- experiments/ml/common/data_export.py | 68 +++++++++++++++++++++++++--- 1 file changed, 62 insertions(+), 6 deletions(-) diff --git a/experiments/ml/common/data_export.py b/experiments/ml/common/data_export.py index df6a9645..6a2bcb95 100644 --- a/experiments/ml/common/data_export.py +++ b/experiments/ml/common/data_export.py @@ -104,15 +104,71 @@ def export_sequence(corpus: Path, cols: ColumnMap, out: Path) -> None: ) -def export_flow(corpus: Path, cols: ColumnMap, out: Path) -> None: - """Per-(source, account-class) amount samples + conditioning features. +def _account_class_map(corpus: Path) -> dict[str, str]: + """Build {gl_account: account_class} from the corpus COA_*.parquet files. 
- See flow/SPEC.md § Data. + Join key column ``c`` is the zero-padded GL account number; ``Account + Class`` is the ISO-style class label. Account numbers are consistent across + clients, so a global map is fine (first wins on the rare conflict). """ - raise NotImplementedError( - "TODO(flow): collect log|amount| per (source, account_class), plus " - "conditioning one-hots; write out/amounts.parquet." + import pandas as pd + + m: dict[str, str] = {} + for fp in sorted(corpus.glob("COA_*.parquet")): + try: + c = pd.read_parquet(fp, columns=["c", "Account Class"]) + except Exception: # noqa: BLE001 — skip a malformed/empty COA shard + continue + for acct, cls in zip(c["c"].astype(str), c["Account Class"].astype(str)): + m.setdefault(acct, cls) + return m + + +def export_flow(corpus: Path, cols: ColumnMap, out: Path) -> None: + """Amount samples + one-hot account-class conditioning → amounts.parquet. + + ``y = signed log1p(|amount|)``; conditioning = one-hot(account_class) from + the COA join. Source has thousands of corpus levels (a finding in itself, + not a useful one-hot), so account-class is the amount-shape conditioning + axis. The extreme tail is clipped at the 99.9th percentile (privacy: don't + memorize rare exact large amounts — flow/SPEC.md § Privacy). Aggregated + numeric only, gitignored. See flow/SPEC.md § Data. 
+ """ + import json + + import numpy as np + import pandas as pd + + acc = _account_class_map(corpus) + parts = [] + for fp in sorted(corpus.glob("JE_*.parquet")): + d = pd.read_parquet(fp, columns=[cols.amount, cols.gl_account]) + d.columns = ["amount", "gl_account"] + parts.append(d) + df = pd.concat(parts, ignore_index=True) + df["amount"] = pd.to_numeric(df["amount"], errors="coerce") + df = df.dropna(subset=["amount"]) + df = df[df["amount"] != 0.0] + df["account_class"] = df["gl_account"].astype(str).map(acc).fillna("UNK") + df["y"] = np.sign(df["amount"]) * np.log1p(np.abs(df["amount"])) + hi = float(df["y"].quantile(0.999)) + df["y"] = df["y"].clip(upper=hi) + if len(df) > 3_000_000: + df = df.sample(3_000_000, random_state=0).reset_index(drop=True) + onehot = pd.get_dummies(df["account_class"], prefix="cls").astype("float32") + out_df = pd.concat( + [df[["y"]].astype("float32").reset_index(drop=True), onehot.reset_index(drop=True)], + axis=1, + ) + out_df.to_parquet(out / "amounts.parquet") + (out / "flow_meta.json").write_text( + json.dumps( + {"n": int(len(out_df)), "cond_cols": list(onehot.columns), + "n_classes": int(onehot.shape[1]), "y_clip_hi": hi}, indent=2 + ) ) + print(f"[flow] {len(out_df):,} amounts, {onehot.shape[1]} account-class conds " + f"-> {out/'amounts.parquet'}") EXPORTERS = { From 5a4734bae7377aac32c08afd19d3f83cbc91c64f Mon Sep 17 00:00:00 2001 From: Michael Ivertowski Date: Thu, 21 May 2026 14:51:43 +0200 Subject: [PATCH 06/18] fix(ml/flow): standardize y before the NSF (was collapsing the amount tail) The neural-spline flow's default domain (~[-5,5]) couldn't represent corpus signed-log1p amounts (which reach ~10.4), collapsing learned p99 to ~$142 vs the corpus $33k. Standardize y (mean/std saved in the checkpoint) so the tail lands inside the spline; samples are unstandardized at characterization. 
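Sketch of the round trip described above (illustrative only, not the code in train.py). It uses the standard library instead of torch, and the helper names signed_log1p, signed_expm1 and fit_standardizer are hypothetical. statistics.stdev is the sample estimate, which is assumed here to mirror the default of Tensor.std; any estimator works as long as the saved mean/std pair is the one reused when sampling.

```python
import math
import statistics


def signed_log1p(a: float) -> float:
    """y = sign(a) * log1p(|a|), the target the flow is trained on."""
    return math.copysign(math.log1p(abs(a)), a)


def signed_expm1(y: float) -> float:
    """Inverse of signed_log1p: a = sign(y) * expm1(|y|)."""
    return math.copysign(math.expm1(abs(y)), y)


def fit_standardizer(ys: list[float]) -> tuple[float, float]:
    """Return (mean, std); std falls back to 1.0 when it is exactly zero."""
    mean = statistics.fmean(ys)
    std = statistics.stdev(ys) or 1.0  # sample std, needs at least 2 points
    return mean, std


# Made-up amounts for illustration; 33000 gives a raw y of about 10.4,
# well outside a spline domain of roughly [-5, 5].
amounts = [-1200.0, 15.5, 250.0, 33000.0]
y_raw = [signed_log1p(a) for a in amounts]

y_mean, y_std = fit_standardizer(y_raw)
y = [(v - y_mean) / y_std for v in y_raw]  # what the flow would see

# At sampling time: unstandardize first, then undo the signed log1p.
recovered = [signed_expm1(v * y_std + y_mean) for v in y]
```

The point of saving y_mean and y_std next to the weights is that the inverse step needs exactly the statistics used during training, not ones recomputed from the samples.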
Co-Authored-By: Claude Opus 4.7 (1M context) --- experiments/ml/flow/train.py | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/experiments/ml/flow/train.py b/experiments/ml/flow/train.py index b24d919d..ef8f397e 100644 --- a/experiments/ml/flow/train.py +++ b/experiments/ml/flow/train.py @@ -29,7 +29,13 @@ def main(argv: list[str] | None = None) -> None: import pandas as pd df = pd.read_parquet(args.data / "amounts.parquet") - y = torch.tensor(df["y"].to_numpy(), dtype=torch.float32).unsqueeze(-1) + y_raw = torch.tensor(df["y"].to_numpy(), dtype=torch.float32).unsqueeze(-1) + # Standardize y so it lands inside the neural-spline domain. Without this + # the NSF (default bound ~[-5,5]) cannot represent the heavy amount tail — + # corpus signed-log1p amounts reach ~10.4, collapsing p99 to a few hundred. + y_mean = float(y_raw.mean()) + y_std = float(y_raw.std()) or 1.0 + y = (y_raw - y_mean) / y_std c = torch.tensor( df.drop(columns=["y"]).to_numpy(), dtype=torch.float32 ) # conditioning one-hots @@ -51,7 +57,8 @@ def main(argv: list[str] | None = None) -> None: running += loss.item() print(f"epoch {epoch:3d} nll={running/len(dl):.4f}") - torch.save({"model": model.state_dict(), "cond_dim": c.size(1)}, + torch.save({"model": model.state_dict(), "cond_dim": c.size(1), + "y_mean": y_mean, "y_std": y_std}, args.out / "amount_flow.pt") print(f"[flow.train] saved {args.out/'amount_flow.pt'}") print("TODO(flow): export spline knots for the candle AmountSampler port, " From 1c54d5525f354cfaa62a297bf90eaebda7e19acb Mon Sep 17 00:00:00 2001 From: Michael Ivertowski Date: Thu, 21 May 2026 15:02:29 +0200 Subject: [PATCH 07/18] feat(ml/sequence): implement export_sequence (factorized event-token streams) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Per-(client, source) ordered streams → streams.pt (dt / lines / account_class / weekday / hour_band fields, 0=pad) + vocab.json, matching EventStreamTransformer. 
Δt + line-count carry the inter-event/burst signal (the 60x IET-regularity gap the descriptive analysis surfaced). Per-client processing bounds memory over the 50M-row corpus; source ranked to a 0..62 id map + 'other'. Co-Authored-By: Claude Opus 4.7 (1M context) --- experiments/ml/common/data_export.py | 81 ++++++++++++++++++++++++---- 1 file changed, 71 insertions(+), 10 deletions(-) diff --git a/experiments/ml/common/data_export.py b/experiments/ml/common/data_export.py index 6a2bcb95..54752c8f 100644 --- a/experiments/ml/common/data_export.py +++ b/experiments/ml/common/data_export.py @@ -91,17 +91,78 @@ def export_gnn(corpus: Path, cols: ColumnMap, out: Path) -> None: ) -def export_sequence(corpus: Path, cols: ColumnMap, out: Path) -> None: - """Per-(source, entity) ordered event streams → token tensors. - - Token = (Δt-bucket, line-count-bucket, account-class, weekday, hour-band). - See sequence/SPEC.md § Data. +def export_sequence(corpus: Path, cols: ColumnMap, out: Path, max_streams: int = 60_000, + seq_len: int = 128) -> None: + """Per-(client, source) ordered event-token streams → streams.pt + vocab.json. + + Factorized fields matching `EventStreamTransformer`: (dt, lines, + account_class, weekday, hour_band), 0 = pad. Δt + line-count carry the + inter-event-time / burst signal (the autocorrelation gap). hour_band is a + constant single band — the corpus dates carry no time-of-day. Processed + per-client to bound memory over the 50M-row corpus. See sequence/SPEC.md. """ - raise NotImplementedError( - "TODO(sequence): group by (client, source, trading_partner), sort by " - "entry_date, derive inter-event Δt + per-JE line count, bucketize, and " - "write out/streams.pt (padded) + out/vocab.json." 
- ) + import json + + import numpy as np + import pandas as pd + import torch + + acc = _account_class_map(corpus) + DT_EDGES = [1, 2, 4, 8, 15, 30] # → digitize 0..6, +1 → 1..7 (vocab 8) + LC_EDGES = [2, 3, 5, 9, 17] # → digitize 0..5, +1 → 1..6 (vocab 7) + + # First pass (cheap): rank sources by JE volume for the 0..62 source-id map. + src_counts: dict[str, int] = {} + files = sorted(corpus.glob("JE_*.parquet")) + for fp in files: + s = pd.read_parquet(fp, columns=[cols.source])[cols.source].astype(str) + for k, v in s.value_counts().items(): + src_counts[k] = src_counts.get(k, 0) + int(v) + top_src = [s for s, _ in sorted(src_counts.items(), key=lambda kv: -kv[1])[:63]] + src_ids = {s: i for i, s in enumerate(top_src)} + + classes: dict[str, int] = {} + fld = {k: [] for k in ("dt", "lines", "account_class", "weekday", "hour_band")} + src_id_list: list[int] = [] + + def _pad(a: np.ndarray) -> np.ndarray: + a = a[:seq_len].astype(np.int64) + z = np.zeros(seq_len, dtype=np.int64) + z[: len(a)] = a + return z + + for fp in files: + if len(src_id_list) >= max_streams: + break + d = pd.read_parquet(fp, columns=[cols.source, cols.entry_date, cols.je_number, cols.gl_account]) + d.columns = ["source", "date", "je", "gl"] + d["date"] = pd.to_datetime(d["date"], errors="coerce") + d = d.dropna(subset=["date"]) + d["cls"] = d["gl"].astype(str).map(acc).fillna("UNK") + je = d.groupby(["source", "je"]).agg(date=("date", "first"), lines=("je", "size"), + cls=("cls", "first")).reset_index() + for src, g in je.groupby("source"): + if len(g) < 3: + continue + g = g.sort_values("date") + dts = g["date"].diff().dt.days.fillna(0).clip(0, 3650).to_numpy() + cl_id = np.array([classes.setdefault(c, len(classes) + 1) for c in g["cls"]], dtype=np.int64) + fld["dt"].append(_pad(np.digitize(dts, DT_EDGES) + 1)) + fld["lines"].append(_pad(np.digitize(g["lines"].to_numpy(), LC_EDGES) + 1)) + fld["account_class"].append(_pad(cl_id)) + fld["weekday"].append(_pad(g["date"].dt.weekday.to_numpy() 
+ 1)) + fld["hour_band"].append(_pad(np.ones(len(g), dtype=np.int64))) + src_id_list.append(src_ids.get(str(src), 63)) + if len(src_id_list) >= max_streams: + break + + blob = {k: torch.from_numpy(np.stack(v)) for k, v in fld.items()} + blob["source_id"] = torch.from_numpy(np.array(src_id_list, dtype=np.int64)) + torch.save(blob, out / "streams.pt") + sizes = {"dt": 8, "lines": 7, "account_class": len(classes) + 1, "weekday": 8, "hour_band": 2} + (out / "vocab.json").write_text(json.dumps({"sizes": sizes, "n_streams": len(src_id_list), "T": seq_len}, indent=2)) + print(f"[sequence] {len(src_id_list)} streams (T={seq_len}), {len(classes)} account-classes " + f"-> {out/'streams.pt'}") def _account_class_map(corpus: Path) -> dict[str, str]: From 6ebfb05c8ada6fab685a8ec5690faf3848dfe212 Mon Sep 17 00:00:00 2001 From: Michael Ivertowski Date: Thu, 21 May 2026 15:15:11 +0200 Subject: [PATCH 08/18] docs(ml): corpus->synthetic gap findings + learning-track results What's missing (descriptive): source diversity, IET variance ~60x, amount tail ~16x, lines-per-JE ~2.3x. DR eval degenerates at corpus scale (noise floor ~0). Flow learns amount density (v1 tail-collapse bug found+fixed via y-standardize). Sequence transformer trains on corpus event streams; corpus dt-bucket lag-1 autocorr -0.118 (variance, not autocorr, is the gap). v2 flow number pending. Co-Authored-By: Claude Opus 4.7 (1M context) --- experiments/ml/FINDINGS.md | 78 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 78 insertions(+) create mode 100644 experiments/ml/FINDINGS.md diff --git a/experiments/ml/FINDINGS.md b/experiments/ml/FINDINGS.md new file mode 100644 index 00000000..9d9f41df --- /dev/null +++ b/experiments/ml/FINDINGS.md @@ -0,0 +1,78 @@ +# Corpus → synthetic gap: what's missing, and what the learning tracks recover + +A100 study (2026-05-21), DataSynth v5.27. 
Goal: **learn from the real corpus +what the synthetic generator is missing**, on the 21-client health corpus +(53.4M JE lines, 11.8M JEs aggregated) vs the v5.27 engine. All learning is on +the corpus on the private box; weights stay on-box (memorization rule). Paper +grounding + generator-optimization targets. + +## 1. What's missing (descriptive, corpus vs synthetic) + +Raw observables — interpretable units, not normalized DRs: + +| Observable | Corpus | Synthetic | Gap | +|---|--:|--:|---| +| Source diversity (entropy / count) | 3.37 / 4,504 | 0.75 / 4 | synthetic **far too concentrated** (one source ≈ 75%) | +| Inter-event-time **std** (days) | 0.0169 | 0.00028 | synthetic **~60× too regular** (irregular-gap structure absent) | +| Amount **p99** | $33k | $542k | synthetic tail **~16× too fat** | +| log-amount std / skew | 2.46 / 0.56 | 3.43 / 0.99 | synthetic **over-dispersed, over-skewed** | +| Lines per JE (mean) | 4.5 | 10.3 | synthetic JEs **~2.3× too large** | +| Benford MAD | 0.0081 | 0.0057 | synthetic slightly *more* Benford-clean than reality | + +Top generator-optimization targets: **(a)** amount density (tail + spread), +**(b)** IET-variance / lines-per-JE structure, **(c)** source-mix breadth. + +## 2. Methodological finding — the DR eval degenerates at full corpus scale + +`behavioral score` on the 53.4M-line corpus returns `is_degenerate_baseline = +true` for **every** metric: the corpus-vs-corpus 50/50 noise floor is ≈0, so +each degradation ratio divides by ~0 and saturates at the 100 cap. The +normalized composite is therefore uninformative at this scale — the descriptive +comparison (§1) is the actionable signal. **For the paper:** the DR noise-floor +needs a resampling scheme that stays non-degenerate at large N (e.g. per-entity +block bootstrap), or the composite should fall back to raw distances when the +baseline underflows. + +## 3. 
Learning tracks — recovering the missing structure (corpus-trained) + +### Flow (amount density) — `flow/` +Conditional neural-spline flow over `signed log1p(|amount|)`, conditioned on +account-class (COA join, 294/294 accounts matched). **Bug found + fixed:** the +NSF default spline domain (~[-5,5]) cannot represent corpus log-amounts (which +reach ~10.4), collapsing learned p99 to ~$142 (v1). Standardizing `y` before +the flow fixes it (v2). + +| | log-amt mean | std | skew | p99 | Benford MAD | +|---|--:|--:|--:|--:|--:| +| Corpus (held-out) | 3.91 | 2.45 | 0.54 | $32,950 | 0.0086 | +| Flow v1 (un-standardized) | 2.81 | 1.45 | −0.39 | $142 | 0.0182 | +| **Flow v2 (standardized y)** | _pending_ | _pending_ | _pending_ | _pending_ | _pending_ | +| Synthetic (3-comp mixture) | 3.65 | 3.43 | 0.99 | $541,617 | 0.0057 | + +### Sequence (event-stream temporal) — `sequence/` +Decoder-only transformer over per-(client, source) event-token streams (Δt / +line-count / account-class / weekday buckets), factorized heads. **Trains +cleanly** (loss 1.99 → 1.93 / 25 epochs over 2,500 streams) — the corpus event +structure *is* learnable. Finding: corpus dt-bucket **lag-1 autocorr = −0.118** +(only 11.6% of streams positively autocorrelated), so the corpus is **not** +strongly *sequentially* bursty at this granularity — the §1 "60×" gap is +inter-event-time **variance**, a distinct axis from autocorrelation. Data-quality +note: the corpus COA `Account Class` has mojibake encoding variants (`Vorr??te`) +inflating the class count to 397 — a cleaning target. + +## 4. GNN fraud showcase (public synthetic data) — `scripts/ml/` +Separate, publishable result (see `scripts/ml/RESULTS_v5.27.md`): binary fraud +GraphSAGE test AUC 0.909 (≈ a LogReg on edge features — graph adds little); +fraud-**typology** is near-random on the collapsed edge list (macro-F1 0.09) but +**0.58 on the line-level view** — `fraud_type` is learnable, but consumers must +join the line table. + +## 5. 
Implications +- **Amount sampler**: the corpus tail is *thinner* and less skewed than the + synthetic mixture — the engine over-generates extreme amounts. A learned flow + (v2) or a re-fit mixture narrows this. +- **Source mix**: the engine emits ~4–24 sources vs the corpus's thousands; + source-mix breadth is a generation gap (priors bundle partially addresses it). +- **Lines per JE**: synthetic JEs are ~2× too large — the lines-per-JE prior + needs down-weighting toward the corpus mean of ~4.5. +- **Eval**: fix the DR noise-floor degeneracy at corpus scale before re-baselining. From 33d40d2dfffb21ea8ac7c3ec125b959fb44f79a0 Mon Sep 17 00:00:00 2001 From: Michael Ivertowski Date: Thu, 21 May 2026 16:46:09 +0200 Subject: [PATCH 09/18] =?UTF-8?q?docs(ml):=20flow=20v2=20result=20?= =?UTF-8?q?=E2=80=94=20learned=20flow=20matches=20corpus=20amount=20densit?= =?UTF-8?q?y?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit v2 (standardized y): NLL 8.96->0.67; p99 $31,754 vs corpus $33,688 (~6%), std/skew spot-on. The shipped 3-component mixture overshoots p99 ~16x. A learned per-account-class flow recovers the amount distribution the mixture misses. Co-Authored-By: Claude Opus 4.7 (1M context) --- experiments/ml/FINDINGS.md | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/experiments/ml/FINDINGS.md b/experiments/ml/FINDINGS.md index 9d9f41df..d5e5ca5b 100644 --- a/experiments/ml/FINDINGS.md +++ b/experiments/ml/FINDINGS.md @@ -46,9 +46,16 @@ the flow fixes it (v2). 
|---|--:|--:|--:|--:|--:| | Corpus (held-out) | 3.91 | 2.45 | 0.54 | $32,950 | 0.0086 | | Flow v1 (un-standardized) | 2.81 | 1.45 | −0.39 | $142 | 0.0182 | -| **Flow v2 (standardized y)** | _pending_ | _pending_ | _pending_ | _pending_ | _pending_ | +| **Flow v2 (standardized y)** | **3.89** | **2.46** | **0.54** | **$31,754** | **0.0081** | | Synthetic (3-comp mixture) | 3.65 | 3.43 | 0.99 | $541,617 | 0.0057 | +**v2 matches the corpus amount density almost exactly** — NLL 8.96 → 0.67; +p99 $31,754 vs corpus $33,688 (within ~6%), std/skew spot-on — whereas the +current 3-component mixture overshoots p99 by ~16× and is 1.4× over-dispersed. +Headline result: a learned per-account-class flow recovers the corpus amount +distribution the shipped mixture misses. Handoff: export spline knots → candle +`AmountSampler`, or keep as a build-time density artifact. + ### Sequence (event-stream temporal) — `sequence/` Decoder-only transformer over per-(client, source) event-token streams (Δt / line-count / account-class / weekday buckets), factorized heads. **Trains From 8b5808c4d4d57709d2500b581a02440baf10faa6 Mon Sep 17 00:00:00 2001 From: Michael Ivertowski Date: Thu, 21 May 2026 16:51:00 +0200 Subject: [PATCH 10/18] feat(ml): sequence-lift (NLL vs marginal) + inverse make_base + 3-knob params MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - sequence/characterize.py: held-out per-token NLL of the transformer vs an iid per-field marginal baseline -> information gain from modelling history. - inverse/make_base.py: small fast campaign config (fraud + distributions only). - inverse/params.py: pivot to (fraud_rate, amount_mu, amount_sigma) — minimal config friction; amount via structured component override; ties inverse + surrogate to the flow finding (recover corpus amount mean/std). 
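Sketch of how the structured amount override is assumed to land in the config (illustrative; set_dotted mirrors the _set_dotted helper in the simulate script, while overrides_for is a hypothetical stand-in for params.to_config_overrides and leaves out the distribution_type key). The list value replaces the existing mixture components wholesale rather than patching one scalar.

```python
def set_dotted(cfg: dict, key: str, value) -> None:
    """Set a dotted key into a nested dict, creating intermediate dicts."""
    parts = key.split(".")
    node = cfg
    for p in parts[:-1]:
        node = node.setdefault(p, {})
    node[parts[-1]] = value


def overrides_for(fraud_rate: float, amount_mu: float, amount_sigma: float) -> dict:
    """Fold (mu, sigma) into a single-component mixture; fraud_rate stays scalar."""
    return {
        "fraud.fraud_rate": fraud_rate,
        "distributions.amounts.components": [
            {"weight": 1.0, "mu": amount_mu, "sigma": amount_sigma, "label": "sbi"}
        ],
    }


# A made-up base config with a two-component mixture already present.
cfg = {
    "distributions": {
        "amounts": {"components": [{"weight": 0.5}, {"weight": 0.5}]}
    }
}
for key, value in overrides_for(0.02, 3.9, 2.45).items():
    set_dotted(cfg, key, value)
```

After the loop the two placeholder components are gone and a single component carries the sampled (mu, sigma), while the fraud section is created on demand because it was missing from the base dict.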
Co-Authored-By: Claude Opus 4.7 (1M context) --- experiments/ml/inverse/make_base.py | 52 ++++++++++++++ experiments/ml/inverse/params.py | 38 +++++++---- experiments/ml/sequence/characterize.py | 90 +++++++++++++++++++++++++ 3 files changed, 167 insertions(+), 13 deletions(-) create mode 100644 experiments/ml/inverse/make_base.py create mode 100644 experiments/ml/sequence/characterize.py diff --git a/experiments/ml/inverse/make_base.py b/experiments/ml/inverse/make_base.py new file mode 100644 index 00000000..d2ecafa8 --- /dev/null +++ b/experiments/ml/inverse/make_base.py @@ -0,0 +1,52 @@ +"""Build a small, fast generate config for the inverse/surrogate forward +campaign. Enables only what the tier-1 knobs touch (fraud + distributions) and +shrinks to one period so each forward sim is quick. The campaign overrides +fraud_rate + the amount mixture per draw (see params.to_config_overrides). + + python -m inverse.make_base --out inverse_base.yaml +""" +from __future__ import annotations + +import argparse +import shutil +import subprocess +from pathlib import Path + +import yaml + + +def _cli() -> str: + for c in ("./target/release/datasynth-data", "../../target/release/datasynth-data", + "datasynth-data"): + if Path(c).exists() or shutil.which(c): + return c + raise SystemExit("datasynth-data not found — build with cargo build --release -p datasynth-cli") + + +def main(argv: list[str] | None = None) -> None: + ap = argparse.ArgumentParser(description=__doc__) + ap.add_argument("--out", type=Path, default=Path("inverse_base.yaml")) + ap.add_argument("--industry", default="manufacturing") + a = ap.parse_args(argv) + + tmp = Path("/tmp/_inv_init.yaml") + subprocess.run([_cli(), "init", "--industry", a.industry, "--complexity", "small", + "-o", str(tmp)], check=True, capture_output=True) + c = yaml.safe_load(tmp.read_text()) + + if isinstance(c.get("fraud"), dict): + c["fraud"]["enabled"] = True + if isinstance(c.get("distributions"), dict): + 
c["distributions"]["enabled"] = True + amt = c["distributions"].setdefault("amounts", {}) + amt["enabled"] = True + amt["distribution_type"] = "lognormal" + amt.setdefault("components", [{"weight": 1.0, "mu": 7.0, "sigma": 1.2, "label": "base"}]) + c.setdefault("global", {})["period_months"] = 1 + + a.out.write_text(yaml.safe_dump(c)) + print(f"wrote {a.out} (fraud + distributions enabled, 1 month)") + + +if __name__ == "__main__": + main() diff --git a/experiments/ml/inverse/params.py b/experiments/ml/inverse/params.py index 083161fb..56e468e8 100644 --- a/experiments/ml/inverse/params.py +++ b/experiments/ml/inverse/params.py @@ -19,18 +19,18 @@ class Param: log: bool = False # sample/scale in log space (for rates spanning decades) -# Tier-1 set. Names map to GeneratorConfig keys (verified scalar paths in -# datasynth-config/src/schema.rs — `deny_unknown_fields` is in force, so every -# key here must be a real settable field). The amount log-normal width was -# dropped: `distributions.amounts` is a mixture `components` list, not a scalar -# `sigma`; re-adding it needs a structured (replace-components) override — -# tracked as the 6th-knob follow-up. +# Tier-1 set — three high-identifiability knobs that need only `fraud` + +# `distributions` enabled (minimal config friction under deny_unknown_fields): +# - fraud.fraud_rate → fraud-bias footprint (weekend / round-dollar / …) +# - amount_mu / amount_sigma → the log-normal amount component (location + +# width). Set via a STRUCTURED override (replace distributions.amounts. +# components) since `amounts` is a mixture list, not scalars — handled in +# `to_config_overrides`. Recovering (mu, sigma) ties the inverse + surrogate +# to the flow finding (corpus log-amount mean ≈ 3.9, std ≈ 2.45). 
PARAMS: list[Param] = [ Param("fraud.fraud_rate", 0.0, 0.10), - Param("fraud.document_fraud_rate", 0.0, 0.10), - Param("temporal_patterns.processing_lags.invoice_receipt_lag.mu", 0.0, 3.0), - Param("temporal_patterns.processing_lags.invoice_receipt_lag.sigma", 0.2, 1.5), - Param("vendor_network.dependencies.top_5_concentration", 0.20, 0.70), + Param("amount_mu", 3.0, 10.0), + Param("amount_sigma", 0.5, 2.6), ] @@ -46,9 +46,21 @@ def sample_prior(rng: np.random.Generator, n: int) -> np.ndarray: return np.stack(cols, axis=1) -def to_config_overrides(theta: np.ndarray) -> dict[str, float]: - """One θ vector -> {config_key: value} overrides for a generate run.""" - return {p.name: float(v) for p, v in zip(PARAMS, theta)} +def to_config_overrides(theta: np.ndarray) -> dict[str, object]: + """One θ vector -> {config_key: value} overrides for a generate run. + + fraud_rate is a plain scalar; amount_mu/amount_sigma are folded into a + single-component log-normal mixture that REPLACES distributions.amounts. + components (a structured override — the mixture is a list, not scalars). + """ + vals = {p.name: float(v) for p, v in zip(PARAMS, theta)} + return { + "fraud.fraud_rate": vals["fraud.fraud_rate"], + "distributions.amounts.distribution_type": "lognormal", + "distributions.amounts.components": [ + {"weight": 1.0, "mu": vals["amount_mu"], "sigma": vals["amount_sigma"], "label": "sbi"} + ], + } def normalize(theta: np.ndarray) -> np.ndarray: diff --git a/experiments/ml/sequence/characterize.py b/experiments/ml/sequence/characterize.py new file mode 100644 index 00000000..2f17c90d --- /dev/null +++ b/experiments/ml/sequence/characterize.py @@ -0,0 +1,90 @@ +"""Sequence-track lift: does the autoregressive transformer capture temporal +structure the marginal (iid) sampler misses? + +The shipped generator draws Δt and line-count *independently per event*. 
This +measures the information gain of conditioning on history: held-out per-token +NLL of the trained transformer vs an iid per-field marginal baseline (the +field's own entropy). NLL_marginal − NLL_transformer > 0 ⇒ the model captures +joint / autocorrelated structure the marginal sampler cannot — per field and +pooled. + + python -m sequence.characterize --data data/sequence --weights weights/sequence +""" +from __future__ import annotations + +import argparse +import json +from pathlib import Path + +import numpy as np +import torch +import torch.nn.functional as F + +from .model import EventStreamTransformer, FieldVocab + +FIELDS = ["dt", "lines", "account_class", "weekday", "hour_band"] + + +@torch.no_grad() +def main(argv: list[str] | None = None) -> None: + ap = argparse.ArgumentParser(description=__doc__) + ap.add_argument("--data", type=Path, required=True) + ap.add_argument("--weights", type=Path, required=True) + ap.add_argument("--val-frac", type=float, default=0.2) + ap.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu") + args = ap.parse_args(argv) + dev = torch.device(args.device) + + blob = torch.load(args.data / "streams.pt") + sizes = json.loads((args.data / "vocab.json").read_text())["sizes"] + vocab = FieldVocab(**sizes) + ckpt = torch.load(args.weights / "stream_tf.pt", map_location=dev) + model = EventStreamTransformer(vocab).to(dev) + model.load_state_dict(ckpt["model"]) + model.train(False) + + n = blob["dt"].shape[0] + nval = max(1, int(n * args.val_frac)) + tr = slice(nval, None) + va = slice(0, nval) + + # ── Transformer per-token NLL on held-out (teacher-forced) ────────────── + tok_va = {f: blob[f][va].to(dev) for f in FIELDS} + logits = model(tok_va, blob["source_id"][va].to(dev)) + tf_nll = {} + for f in FIELDS: + pred = logits[f][:, :-1].reshape(-1, logits[f].size(-1)) + tgt = tok_va[f][:, 1:].reshape(-1) + keep = tgt > 0 # ignore pad + tf_nll[f] = float(F.cross_entropy(pred[keep], tgt[keep]).item()) + + 
# ── iid marginal baseline: each field's own entropy on the train split ── + marg_nll = {} + for f in FIELDS: + toks = blob[f][tr].reshape(-1).numpy() + toks = toks[toks > 0] + if toks.size == 0: + marg_nll[f] = 0.0 + continue + counts = np.bincount(toks, minlength=sizes[f]).astype(np.float64) + p = counts / counts.sum() + nz = p > 0 + marg_nll[f] = float(-(p[nz] * np.log(p[nz])).sum()) # nats + + print(f"{'field':<16}{'transformer':>14}{'marginal(iid)':>16}{'lift(nats)':>14}") + print("-" * 60) + tf_tot = marg_tot = 0.0 + for f in FIELDS: + lift = marg_nll[f] - tf_nll[f] + tf_tot += tf_nll[f] + marg_tot += marg_nll[f] + print(f"{f:<16}{tf_nll[f]:>14.4f}{marg_nll[f]:>16.4f}{lift:>14.4f}") + print("-" * 60) + print(f"{'TOTAL/token':<16}{tf_tot:>14.4f}{marg_tot:>16.4f}{marg_tot - tf_tot:>14.4f}") + print(f"\nInterpretation: positive lift = the AR model predicts events better " + f"than drawing each field iid from its marginal — i.e. it captures the " + f"joint/temporal structure the per-event marginal sampler discards.") + + +if __name__ == "__main__": + main() From 54e4bf9944c8d3223a1b273a6ba273bea687ea1d Mon Sep 17 00:00:00 2001 From: Michael Ivertowski Date: Thu, 21 May 2026 16:52:39 +0200 Subject: [PATCH 11/18] fix(ml/inverse): log_normal enum (not lognormal) + record sequence +3.37 nats lift Sequence track: AR transformer beats iid marginal by +3.37 nats/token on held-out (account_class/weekday/lines structure captured; Dt near-memoryless). Fix the amount distribution_type enum value for the inverse campaign override. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- experiments/ml/FINDINGS.md | 8 +++++++- experiments/ml/inverse/make_base.py | 2 +- experiments/ml/inverse/params.py | 2 +- 3 files changed, 9 insertions(+), 3 deletions(-) diff --git a/experiments/ml/FINDINGS.md b/experiments/ml/FINDINGS.md index d5e5ca5b..fe36b461 100644 --- a/experiments/ml/FINDINGS.md +++ b/experiments/ml/FINDINGS.md @@ -63,7 +63,13 @@ cleanly** (loss 1.99 → 1.93 / 25 epochs over 2,500 streams) — the corpus eve structure *is* learnable. Finding: corpus dt-bucket **lag-1 autocorr = −0.118** (only 11.6% of streams positively autocorrelated), so the corpus is **not** strongly *sequentially* bursty at this granularity — the §1 "60×" gap is -inter-event-time **variance**, a distinct axis from autocorrelation. Data-quality +inter-event-time **variance**, a distinct axis from autocorrelation. **Held-out +lift over an iid per-field marginal sampler: +3.37 nats/token** (account_class ++1.55, weekday +1.36, lines +0.58; Δt ≈ flat at −0.12 — Δt really is near +memoryless). So the autoregressive model captures the joint +source→account-class→line-count→weekday structure the current per-event +marginal sampler discards — the concrete case for an AR event scheduler. +Data-quality note: the corpus COA `Account Class` has mojibake encoding variants (`Vorr??te`) inflating the class count to 397 — a cleaning target. 
diff --git a/experiments/ml/inverse/make_base.py b/experiments/ml/inverse/make_base.py index d2ecafa8..f308e41e 100644 --- a/experiments/ml/inverse/make_base.py +++ b/experiments/ml/inverse/make_base.py @@ -40,7 +40,7 @@ def main(argv: list[str] | None = None) -> None: c["distributions"]["enabled"] = True amt = c["distributions"].setdefault("amounts", {}) amt["enabled"] = True - amt["distribution_type"] = "lognormal" + amt["distribution_type"] = "log_normal" amt.setdefault("components", [{"weight": 1.0, "mu": 7.0, "sigma": 1.2, "label": "base"}]) c.setdefault("global", {})["period_months"] = 1 diff --git a/experiments/ml/inverse/params.py b/experiments/ml/inverse/params.py index 56e468e8..0a6158d7 100644 --- a/experiments/ml/inverse/params.py +++ b/experiments/ml/inverse/params.py @@ -56,7 +56,7 @@ def to_config_overrides(theta: np.ndarray) -> dict[str, object]: vals = {p.name: float(v) for p, v in zip(PARAMS, theta)} return { "fraud.fraud_rate": vals["fraud.fraud_rate"], - "distributions.amounts.distribution_type": "lognormal", + "distributions.amounts.distribution_type": "log_normal", "distributions.amounts.components": [ {"weight": 1.0, "mu": vals["amount_mu"], "sigma": vals["amount_sigma"], "label": "sbi"} ], From 851f1dcd2d83564a3abcb3318a00cf273f86394c Mon Sep 17 00:00:00 2001 From: Michael Ivertowski Date: Thu, 21 May 2026 16:57:48 +0200 Subject: [PATCH 12/18] feat(ml/surrogate): grounded surrogate + CMA-ES (match corpus via campaign) Reuses the inverse forward campaign's (theta, summary-stat) pairs: objective = distance(summary_stats(theta), corpus), MLP surrogate, CMA-ES to the corpus-matching theta*. Runnable + grounded (vs the scaffold optimize.py whose load_history is a TODO + targets the corpus-scale-degenerate DR). theta* should recover amount_mu ~ corpus log-amount mean, cross-checking the flow finding. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- experiments/ml/surrogate/match_corpus.py | 117 +++++++++++++++++++++++ 1 file changed, 117 insertions(+) create mode 100644 experiments/ml/surrogate/match_corpus.py diff --git a/experiments/ml/surrogate/match_corpus.py b/experiments/ml/surrogate/match_corpus.py new file mode 100644 index 00000000..60a7cbd2 --- /dev/null +++ b/experiments/ml/surrogate/match_corpus.py @@ -0,0 +1,117 @@ +"""Grounded surrogate + CMA-ES: find the generator params that best match the +corpus, using the inverse forward campaign as the surrogate's training data. + +The scaffold `optimize.py` targets the BF composite over SP-internal knobs, but +`load_history` is a TODO, those knobs aren't `generate --config`-settable, and +the DR eval degenerates at corpus scale (FINDINGS.md §2). This is the runnable, +grounded variant: reuse the inverse campaign's `(θ, summary-stat)` pairs, define +the objective as `distance(summary_stats(θ), corpus)`, fit an MLP surrogate, +and CMA-ES to the corpus-matching `θ*`. `θ*` is the config the corpus "most +likely came from" — cross-checking the flow finding (corpus log-amount mean +≈ 3.9). Demonstrates the tuning-loop accelerator end-to-end on real data. + + python -m surrogate.match_corpus --campaign data/inverse \\ + --corpus /home/ubuntu/corpus_health.parquet +""" +from __future__ import annotations + +import argparse +import json +from pathlib import Path + +import numpy as np +import pandas as pd +import torch +import torch.nn as nn + +from inverse import params as P +from inverse.simulate import FEATURE_NAMES, summary_stats + +# Features comparable corpus↔synthetic (exclude doc-flow / behavioural-only +# observables the corpus columns don't carry: post-close, manual, posting lag). 
+CMP = [ + "log_amt_mean", "log_amt_std", "log_amt_skew", "benford_mad", "round_frac", + "weekend_frac", "monthend_frac", "lpje_mean", "lpje_std", "lpje_frac2", + "src_entropy", "iet_mean", "iet_std", +] + + +def corpus_features(corpus_parquet: Path, tmp_csv: str) -> np.ndarray: + """Map corpus columns → canonical, then reuse the campaign summary_stats.""" + df = pd.read_parquet(corpus_parquet) + out = pd.DataFrame() + out["debit_amount"] = pd.to_numeric(df["Functional Amount"], errors="coerce") + out["credit_amount"] = 0.0 + out["posting_date"] = df["Entry Date"] + out["document_date"] = df["Entry Date"] + out["source"] = df["Source"] + out["document_id"] = df["JE Number"] + out["gl_account"] = df["GL Account Number"] + out.to_csv(tmp_csv, index=False) + return summary_stats(Path(tmp_csv)) + + +def main(argv: list[str] | None = None) -> None: + ap = argparse.ArgumentParser(description=__doc__) + ap.add_argument("--campaign", type=Path, required=True) + ap.add_argument("--corpus", type=Path, required=True) + ap.add_argument("--out", type=Path, default=Path("weights/surrogate")) + a = ap.parse_args(argv) + a.out.mkdir(parents=True, exist_ok=True) + + blob = np.load(a.campaign / "pairs.npz") + theta, x = blob["theta"], blob["x"] + cx = corpus_features(a.corpus, "/tmp/_corp_canon.csv") + idx = [FEATURE_NAMES.index(c) for c in CMP] + + # Standardize comparable features by campaign std → scale-free distance. 
+ xs = x[:, idx] + mu, sd = xs.mean(0), xs.std(0) + 1e-6 + cxn, xn = (cx[idx] - mu) / sd, (xs - mu) / sd + dist = np.linalg.norm(xn - cxn, axis=1).astype("float32") # objective per sim + + tn = P.normalize(theta).astype("float32") + nval = max(1, len(tn) // 5) + Xtr, Xva = torch.tensor(tn[nval:]), torch.tensor(tn[:nval]) + ytr, yva = torch.tensor(dist[nval:]), torch.tensor(dist[:nval]) + + net = nn.Sequential(nn.Linear(P.dim(), 64), nn.SiLU(), nn.Linear(64, 64), nn.SiLU(), nn.Linear(64, 1)) + opt = torch.optim.Adam(net.parameters(), 1e-3) + for _ in range(1500): + opt.zero_grad() + loss = ((net(Xtr).squeeze(-1) - ytr) ** 2).mean() + loss.backward() + opt.step() + net.train(False) # inference mode + with torch.no_grad(): + pv = net(Xva).squeeze(-1).numpy() + from scipy.stats import spearmanr + rho = float(spearmanr(pv, yva.numpy()).statistic) + print(f"surrogate Spearman (held-out, predicted vs true distance) = {rho:.3f}") + + import cma + es = cma.CMAEvolutionStrategy(np.full(P.dim(), 0.5), 0.2, + {"bounds": [0, 1], "verbose": -9, "seed": 0}) + for _ in range(80): + sols = es.ask() + with torch.no_grad(): + vals = net(torch.tensor(np.array(sols), dtype=torch.float32)).squeeze(-1).numpy() + es.tell(sols, list(vals)) + theta_star = P.denormalize(np.clip(es.result.xbest, 0, 1)) + names = [p.name for p in P.PARAMS] + print("corpus-matching θ* (surrogate argmin):") + for n, v in zip(names, theta_star): + print(f" {n:14s} = {v:.3f}") + print(f"(corpus log_amt_mean={cx[idx[0]]:.2f} std={cx[idx[1]]:.2f}; " + f"amount_mu≈mean/0.63 sanity → ~{cx[idx[0]] / 0.63:.1f})") + + (a.out / "match_corpus.json").write_text(json.dumps({ + "surrogate_spearman": rho, + "theta_star": dict(zip(names, theta_star.tolist())), + "corpus_features": dict(zip(CMP, [float(v) for v in cx[idx]])), + }, indent=2)) + print(f"saved {a.out / 'match_corpus.json'}") + + +if __name__ == "__main__": + main() From 1b8fbf5feaf7eaac2edc48dd53095bc5922493ec Mon Sep 17 00:00:00 2001 From: Michael Ivertowski Date: 
Thu, 21 May 2026 17:25:25 +0200 Subject: [PATCH 13/18] docs(ml): inverse SBI result — amount_mu + fraud_rate recoverable, sigma not MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Amortized posterior over 3 knobs, 1000-sim campaign (0 fail), SBC + 90% coverage on held-out synthetic: amount_mu cov 0.92 (MAE .049), fraud_rate cov 0.88 (.078) — calibrated; amount_sigma cov 0.77 — poorly identified (other variance swamps the component sigma). 'Run the engine backward' validated on synthetic. Co-Authored-By: Claude Opus 4.7 (1M context) --- experiments/ml/FINDINGS.md | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) diff --git a/experiments/ml/FINDINGS.md b/experiments/ml/FINDINGS.md index fe36b461..fb3a91b0 100644 --- a/experiments/ml/FINDINGS.md +++ b/experiments/ml/FINDINGS.md @@ -73,6 +73,30 @@ Data-quality note: the corpus COA `Account Class` has mojibake encoding variants (`Vorr??te`) inflating the class count to 397 — a cleaning target. +### Inverse SBI — run the engine backward — `inverse/` +Amortized neural posterior `q(θ | x)` (zuko NSF) over 3 tier-1 knobs +(`fraud_rate`, amount `mu`, amount `sigma`), trained on **1,000 forward-simulated +`(θ, GL-summary)` pairs (0 failures)**, validated on held-out synthetic with +simulation-based calibration + 90% credible-interval coverage: + +| knob | MAE (norm) | 90% coverage | verdict | +|---|--:|--:|---| +| **amount_mu** | 0.049 | **0.92** | strongly identifiable | +| **fraud.fraud_rate** | 0.078 | **0.88** | identifiable, calibrated | +| amount_sigma | 0.209 | 0.77 | poorly identified (honest) | + +A GL's amount **location** and **fraud rate** are recoverable with calibrated +uncertainty; amount **width** is not (other variance sources swamp the single +component's σ).
This is the audit-analytics direction — *"the GL most likely +came from these process parameters"* — validated on synthetic before any +real-GL use. Identifiability is gated by forward-model fidelity (the §1 gap), so +the flow/sequence work directly improves how much an inverse can recover. + +### Surrogate / tuning loop — `surrogate/` +Grounded CMA-ES over the same campaign: MLP surrogate `θ → distance-to-corpus` +(13 comparable observables), optimized to the corpus-matching `θ*`. _Result +pending the run; θ* should recover amount_mu ≈ corpus, cross-checking the flow._ + ## 4. GNN fraud showcase (public synthetic data) — `scripts/ml/` Separate, publishable result (see `scripts/ml/RESULTS_v5.27.md`): binary fraud GraphSAGE test AUC 0.909 (≈ a LogReg on edge features — graph adds little); From 4b933042eba10f3d41b7de00baf571302edf4607 Mon Sep 17 00:00:00 2001 From: Michael Ivertowski Date: Thu, 21 May 2026 17:33:19 +0200 Subject: [PATCH 14/18] fix(ml/surrogate): drop heavy-tailed features + clip + cache corpus_x First run failed (Spearman -0.08, theta* at bounds): corpus lpje_std=123 (JEs with thousands of lines) dominated the L2 distance. Drop lpje_std + iet_* from the comparable set, clip standardized features to +/-4, and add --corpus-cache to reuse corpus_features (skip the 53M-row pass on rerun). Co-Authored-By: Claude Opus 4.7 (1M context) --- experiments/ml/surrogate/match_corpus.py | 37 ++++++++++++++++-------- 1 file changed, 25 insertions(+), 12 deletions(-) diff --git a/experiments/ml/surrogate/match_corpus.py b/experiments/ml/surrogate/match_corpus.py index 60a7cbd2..5f9d3f8c 100644 --- a/experiments/ml/surrogate/match_corpus.py +++ b/experiments/ml/surrogate/match_corpus.py @@ -27,12 +27,14 @@ from inverse import params as P from inverse.simulate import FEATURE_NAMES, summary_stats -# Features comparable corpus↔synthetic (exclude doc-flow / behavioural-only -# observables the corpus columns don't carry: post-close, manual, posting lag). 
+# Features comparable corpus↔synthetic. Excludes doc-flow / behavioural-only +# observables the corpus columns don't carry (post-close, manual, lag) AND +# heavy-tailed/unstable ones that dominate an L2 distance: lpje_std (corpus has +# JEs with thousands of lines → std≈123) and the iet_* terms. Kept set is the +# robust amount + structure signal. CMP = [ "log_amt_mean", "log_amt_std", "log_amt_skew", "benford_mad", "round_frac", - "weekend_frac", "monthend_frac", "lpje_mean", "lpje_std", "lpje_frac2", - "src_entropy", "iet_mean", "iet_std", + "weekend_frac", "monthend_frac", "lpje_mean", "lpje_frac2", "src_entropy", ] @@ -54,20 +56,31 @@ def corpus_features(corpus_parquet: Path, tmp_csv: str) -> np.ndarray: def main(argv: list[str] | None = None) -> None: ap = argparse.ArgumentParser(description=__doc__) ap.add_argument("--campaign", type=Path, required=True) - ap.add_argument("--corpus", type=Path, required=True) + ap.add_argument("--corpus", type=Path, default=None) + ap.add_argument("--corpus-cache", type=Path, default=None, + help="reuse corpus_features from a prior match_corpus.json (skips the 53M-row pass)") ap.add_argument("--out", type=Path, default=Path("weights/surrogate")) a = ap.parse_args(argv) a.out.mkdir(parents=True, exist_ok=True) blob = np.load(a.campaign / "pairs.npz") theta, x = blob["theta"], blob["x"] - cx = corpus_features(a.corpus, "/tmp/_corp_canon.csv") idx = [FEATURE_NAMES.index(c) for c in CMP] - - # Standardize comparable features by campaign std → scale-free distance. 
+ if a.corpus_cache and a.corpus_cache.exists(): + cache = json.loads(a.corpus_cache.read_text())["corpus_features"] + cx_cmp = np.array([cache[c] for c in CMP], dtype=float) + print(f"[surrogate] corpus features from cache {a.corpus_cache}") + elif a.corpus: + cx_cmp = corpus_features(a.corpus, "/tmp/_corp_canon.csv")[idx] + else: + raise SystemExit("need --corpus or --corpus-cache") + + # Standardize comparable features by campaign std, clip to ±4 so a single + # heavy-tailed corpus feature can't dominate the L2 distance. xs = x[:, idx] mu, sd = xs.mean(0), xs.std(0) + 1e-6 - cxn, xn = (cx[idx] - mu) / sd, (xs - mu) / sd + cxn = np.clip((cx_cmp - mu) / sd, -4, 4) + xn = np.clip((xs - mu) / sd, -4, 4) dist = np.linalg.norm(xn - cxn, axis=1).astype("float32") # objective per sim tn = P.normalize(theta).astype("float32") @@ -102,13 +115,13 @@ def main(argv: list[str] | None = None) -> None: print("corpus-matching θ* (surrogate argmin):") for n, v in zip(names, theta_star): print(f" {n:14s} = {v:.3f}") - print(f"(corpus log_amt_mean={cx[idx[0]]:.2f} std={cx[idx[1]]:.2f}; " - f"amount_mu≈mean/0.63 sanity → ~{cx[idx[0]] / 0.63:.1f})") + print(f"(corpus log_amt_mean={cx_cmp[0]:.2f} std={cx_cmp[1]:.2f}; " + f"amount_mu≈mean/0.63 sanity → ~{cx_cmp[0] / 0.63:.1f})") (a.out / "match_corpus.json").write_text(json.dumps({ "surrogate_spearman": rho, "theta_star": dict(zip(names, theta_star.tolist())), - "corpus_features": dict(zip(CMP, [float(v) for v in cx[idx]])), + "corpus_features": dict(zip(CMP, [float(v) for v in cx_cmp])), }, indent=2)) print(f"saved {a.out / 'match_corpus.json'}") From aec959a7406d604e3df8def6315032f130ce6040 Mon Sep 17 00:00:00 2001 From: Michael Ivertowski Date: Thu, 21 May 2026 17:34:54 +0200 Subject: [PATCH 15/18] docs(ml): surrogate result (honest) + close out the 4-track study MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Surrogate machinery runs end-to-end on real campaign data; Spearman 0.46, theta* 
mis-located (amount_mu at bound) — single-small-generate stats too noisy. The calibrated inverse posterior is the principled route to corpus-param recovery. Completes flow / sequence / inverse / surrogate in FINDINGS.md. Co-Authored-By: Claude Opus 4.7 (1M context) --- experiments/ml/FINDINGS.md | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/experiments/ml/FINDINGS.md b/experiments/ml/FINDINGS.md index fb3a91b0..6d60b505 100644 --- a/experiments/ml/FINDINGS.md +++ b/experiments/ml/FINDINGS.md @@ -93,9 +93,19 @@ real-GL use. Identifiability is gated by forward-model fidelity (the §1 gap), s the flow/sequence work directly improves how much an inverse can recover. ### Surrogate / tuning loop — `surrogate/` -Grounded CMA-ES over the same campaign: MLP surrogate `θ → distance-to-corpus` -(13 comparable observables), optimized to the corpus-matching `θ*`. _Result -pending the run; θ* should recover amount_mu ≈ corpus, cross-checking the flow._ +Grounded CMA-ES: MLP surrogate `θ → distance-to-corpus` over 10 robust +observables, fit on the campaign, searched by CMA-ES. **Machinery runs +end-to-end on real data** (vs the scaffold `optimize.py`'s synthetic-seed +placeholder). Honest result: held-out Spearman **0.46**, and CMA-ES landed +`amount_mu` at its upper bound (10.0) rather than the corpus-implied ≈6.2 — the +single-small-generate summary stats are too noisy for the surrogate to locate +the optimum reliably. (A first attempt was worse, Spearman −0.08, until the +corpus `lpje_std=123` heavy-tail outlier was dropped from the distance + the +features clipped.) **Takeaway:** the accelerator needs a larger / lower-variance +campaign; the calibrated **inverse posterior** above is the more principled +route to "what params did the corpus come from" — `amount_mu` is strongly +identified there (cov 0.92), so feeding the corpus summary into `q(θ|x)` is the +recommended next step over the distance-surrogate. ## 4. 
GNN fraud showcase (public synthetic data) — `scripts/ml/` Separate, publishable result (see `scripts/ml/RESULTS_v5.27.md`): binary fraud From 140cf2bb58294193f836a6ef94193a4572543132 Mon Sep 17 00:00:00 2001 From: Michael Ivertowski Date: Thu, 21 May 2026 17:45:27 +0200 Subject: [PATCH 16/18] feat(ml/inverse): apply.py — posterior over the params a GL came from (corpus capstone) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Feeds a GL's summary stats into the SBC-calibrated q(theta|x) and reports a median + 90% CI per knob. Emits only parameter posteriors (privacy contract). The inverse-SBI capstone: point the calibrated posterior at the corpus. Co-Authored-By: Claude Opus 4.7 (1M context) --- experiments/ml/inverse/apply.py | 74 +++++++++++++++++++++++++++++++++ 1 file changed, 74 insertions(+) create mode 100644 experiments/ml/inverse/apply.py diff --git a/experiments/ml/inverse/apply.py b/experiments/ml/inverse/apply.py new file mode 100644 index 00000000..cd3d4582 --- /dev/null +++ b/experiments/ml/inverse/apply.py @@ -0,0 +1,74 @@ +"""Apply the trained inverse posterior q(θ | x) to a real GL → a posterior over +the generator parameters that GL most likely came from. The audit-analytics +capstone: point the SBC-calibrated posterior at the corpus. + +Emits ONLY parameter posteriors (median + 90% credible interval), never +row-level corpus content — the privacy contract in inverse/SPEC.md. + + python -m inverse.apply --weights weights/inverse \\ + --gl-canonical /tmp/_corp_canon.csv --x-cache /tmp/corpus_x29.json --n 4000 + +Caveat (SPEC § "Distribution shift = the BF gap"): the posterior is trained on +synthetic; applied to a real GL it is biased by exactly the forward-fidelity +gap §1 measures. Trust the well-identified knobs (amount_mu, fraud_rate); +read the rest as gap-limited.
+""" +from __future__ import annotations + +import argparse +import json +from pathlib import Path + +import numpy as np +import torch + +from . import params as P +from .model import PosteriorFlow +from .simulate import summary_stats + + +def main(argv: list[str] | None = None) -> None: + ap = argparse.ArgumentParser(description=__doc__) + ap.add_argument("--weights", type=Path, required=True) + ap.add_argument("--gl-canonical", type=Path, default=None, + help="GL with canonical columns (debit/credit/posting_date/source/...)") + ap.add_argument("--x-cache", type=Path, default=None, + help="cached 29-dim summary-stat vector json (skips the GL pass)") + ap.add_argument("--n", type=int, default=4000) + ap.add_argument("--out", type=Path, default=None) + a = ap.parse_args(argv) + + if a.x_cache and a.x_cache.exists(): + x = np.array(json.loads(a.x_cache.read_text())["x"], dtype="float32") + print(f"[apply] x from cache {a.x_cache}") + elif a.gl_canonical: + x = summary_stats(a.gl_canonical) + if a.x_cache: + a.x_cache.write_text(json.dumps({"x": [float(v) for v in x]})) + print(f"[apply] cached x → {a.x_cache}") + else: + raise SystemExit("need --gl-canonical or --x-cache") + + ck = torch.load(a.weights / "posterior.pt", map_location="cpu") + m = PosteriorFlow(dim_theta=P.dim(), dim_x=ck["dim_x"]) + m.load_state_dict(ck["model"]) + m.train(False) + with torch.no_grad(): + s = m.sample(torch.tensor(x, dtype=torch.float32), a.n).cpu().numpy() # (n, d) normalized + theta = P.denormalize(s) + + names = [p.name for p in P.PARAMS] + print("posterior over the generator params the corpus most likely came from:") + res = {} + for j, nm in enumerate(names): + col = theta[:, j] + lo, med, hi = (float(v) for v in np.percentile(col, [5, 50, 95])) + res[nm] = {"median": med, "ci90": [lo, hi]} + print(f" {nm:14s} median={med:.3f} 90% CI=[{lo:.3f}, {hi:.3f}]") + if a.out: + a.out.write_text(json.dumps(res, indent=2)) + print(f"saved {a.out}") + + +if __name__ == "__main__": + main() 
From b067e2800b7b40e5f90dad2dcd74e641be143c83 Mon Sep 17 00:00:00 2001 From: Michael Ivertowski Date: Thu, 21 May 2026 17:47:51 +0200 Subject: [PATCH 17/18] docs(ml): inverse capstone — posterior is degenerate on the real corpus (OOD) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Feeding the corpus into the SBC-calibrated q(theta|x) returns a boundary-pinned, zero-width-CI posterior (confidently wrong) — the corpus is out-of-distribution for the synthetic-trained inverse. 'Distribution shift = the BF gap' made empirical: well-calibrated on synthetic (cov 0.92), untrustworthy on real until the forward-fidelity gap (section 1) is closed. The strongest argument for the flow/sequence fidelity work. Co-Authored-By: Claude Opus 4.7 (1M context) --- experiments/ml/FINDINGS.md | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/experiments/ml/FINDINGS.md b/experiments/ml/FINDINGS.md index 6d60b505..b1ea80ca 100644 --- a/experiments/ml/FINDINGS.md +++ b/experiments/ml/FINDINGS.md @@ -92,6 +92,21 @@ came from these process parameters"* — validated on synthetic before any real-GL use. Identifiability is gated by forward-model fidelity (the §1 gap), so the flow/sequence work directly improves how much an inverse can recover. +**Capstone — posterior applied to the *real* corpus.** Feeding the corpus's +summary into the SBC-calibrated `q(θ|x)` returns a **degenerate, boundary-pinned +posterior** — `fraud_rate→0.100`, `amount_mu→3.0`, `amount_sigma→2.6`, all with +**zero-width 90% CIs** — i.e. confidently *wrong* (corpus log-amount mean 3.92 +implies `amount_mu≈6.2`). The corpus `x` is **out-of-distribution** for the +synthetic-trained posterior: the §1 gaps (source entropy, lines-per-JE tail, IET +variance) put real GLs outside the manifold the forward model produces, so the +flow extrapolates to the prior bounds and collapses its uncertainty.
This is +**"distribution shift = the BF gap" made empirical**: the inverse is +well-calibrated on synthetic (cov 0.92) yet *untrustworthy on the real corpus +until the forward-fidelity gap is closed*. It is the single strongest argument +for the flow/sequence fidelity work — closing §1 is precisely what makes +backward inference on real GLs valid. (Methodology lands; the headline number is +the negative transfer, not a recovered θ.) + ### Surrogate / tuning loop — `surrogate/` Grounded CMA-ES: MLP surrogate `θ → distance-to-corpus` over 10 robust observables, fit on the campaign, searched by CMA-ES. **Machinery runs From 18985ae11a21f94746f42de7a577bba6b3478468 Mon Sep 17 00:00:00 2001 From: Michael Ivertowski Date: Thu, 21 May 2026 17:53:19 +0200 Subject: [PATCH 18/18] docs(ml): legal scrub of the ML experiment artifacts Strip 'real' corpus/GL qualifiers (-> 'corpus' / 'out-of-sample GL'), drop the client-count + industry hint ('21-client health corpus' -> 'the corpus'), and remove a verbatim COA label token from FINDINGS + the scaffold SPEC/py docs, per the corpus-vague-reference rule (no client names, no real-data hints, no paths, no verbatim corpus content). 'real eval'/'REAL BF scorer' kept (actual-vs- surrogate, not a data qualifier). 
Co-Authored-By: Claude Opus 4.7 (1M context) --- experiments/ml/FINDINGS.md | 22 +++++++++++----------- experiments/ml/README.md | 2 +- experiments/ml/flow/SPEC.md | 2 +- experiments/ml/gnn/SPEC.md | 2 +- experiments/ml/inverse/SPEC.md | 13 ++++++------- experiments/ml/inverse/apply.py | 4 ++-- experiments/ml/inverse/simulate.py | 2 +- experiments/ml/inverse/validate.py | 2 +- experiments/ml/sequence/SPEC.md | 2 +- experiments/ml/surrogate/match_corpus.py | 2 +- 10 files changed, 26 insertions(+), 27 deletions(-) diff --git a/experiments/ml/FINDINGS.md b/experiments/ml/FINDINGS.md index b1ea80ca..bc2a80cb 100644 --- a/experiments/ml/FINDINGS.md +++ b/experiments/ml/FINDINGS.md @@ -1,8 +1,8 @@ # Corpus → synthetic gap: what's missing, and what the learning tracks recover -A100 study (2026-05-21), DataSynth v5.27. Goal: **learn from the real corpus -what the synthetic generator is missing**, on the 21-client health corpus -(53.4M JE lines, 11.8M JEs aggregated) vs the v5.27 engine. All learning is on +A100 study (2026-05-21), DataSynth v5.27. Goal: **learn from the corpus what +the synthetic generator is missing** — the aggregated corpus (53.4M JE lines, +11.8M JEs) vs the v5.27 engine. All learning is on the corpus on the private box; weights stay on-box (memorization rule). Paper grounding + generator-optimization targets. @@ -17,7 +17,7 @@ Raw observables — interpretable units, not normalized DRs: | Amount **p99** | $33k | $542k | synthetic tail **~16× too fat** | | log-amount std / skew | 2.46 / 0.56 | 3.43 / 0.99 | synthetic **over-dispersed, over-skewed** | | Lines per JE (mean) | 4.5 | 10.3 | synthetic JEs **~2.3× too large** | -| Benford MAD | 0.0081 | 0.0057 | synthetic slightly *more* Benford-clean than reality | +| Benford MAD | 0.0081 | 0.0057 | synthetic slightly *more* Benford-clean than the corpus | Top generator-optimization targets: **(a)** amount density (tail + spread), **(b)** IET-variance / lines-per-JE structure, **(c)** source-mix breadth. 
@@ -70,7 +70,7 @@ memoryless). So the autoregressive model captures the joint source→account-class→line-count→weekday structure the current per-event marginal sampler discards — the concrete case for an AR event scheduler. Data-quality -note: the corpus COA `Account Class` has mojibake encoding variants (`Vorr??te`) +note: the corpus COA `Account Class` carries encoding-mangled label variants inflating the class count to 397 — a cleaning target. ### Inverse SBI — run the engine backward — `inverse/` @@ -89,28 +89,28 @@ A GL's amount **location** and **fraud rate** are recoverable with calibrated uncertainty; amount **width** is not (other variance sources swamp the single component's σ). This is the audit-analytics direction — *"the GL most likely came from these process parameters"* — validated on synthetic before any -real-GL use. Identifiability is gated by forward-model fidelity (the §1 gap), so +out-of-sample-GL use. Identifiability is gated by forward-model fidelity (the §1 gap), so the flow/sequence work directly improves how much an inverse can recover. -**Capstone — posterior applied to the *real* corpus.** Feeding the corpus's +**Capstone — posterior applied to the corpus.** Feeding the corpus's summary into the SBC-calibrated `q(θ|x)` returns a **degenerate, boundary-pinned posterior** — `fraud_rate→0.100`, `amount_mu→3.0`, `amount_sigma→2.6`, all with **zero-width 90% CIs** — i.e. confidently *wrong* (corpus log-amount mean 3.92 implies `amount_mu≈6.2`). The corpus `x` is **out-of-distribution** for the synthetic-trained posterior: the §1 gaps (source entropy, lines-per-JE tail, IET -variance) put real GLs outside the manifold the forward model produces, so the +variance) put out-of-sample GLs outside the manifold the forward model produces, so the flow extrapolates to the prior bounds and collapses its uncertainty. 
This is **"distribution shift = the BF gap" made empirical**: the inverse is -well-calibrated on synthetic (cov 0.92) yet *untrustworthy on the real corpus +well-calibrated on synthetic (cov 0.92) yet *untrustworthy on the corpus until the forward-fidelity gap is closed*. It is the single strongest argument for the flow/sequence fidelity work — closing §1 is precisely what makes -backward inference on real GLs valid. (Methodology lands; the headline number is +backward inference on out-of-sample GLs valid. (Methodology lands; the headline number is the negative transfer, not a recovered θ.) ### Surrogate / tuning loop — `surrogate/` Grounded CMA-ES: MLP surrogate `θ → distance-to-corpus` over 10 robust observables, fit on the campaign, searched by CMA-ES. **Machinery runs -end-to-end on real data** (vs the scaffold `optimize.py`'s synthetic-seed +end-to-end on campaign data** (vs the scaffold `optimize.py`'s synthetic-seed placeholder). Honest result: held-out Spearman **0.46**, and CMA-ES landed `amount_mu` at its upper bound (10.0) rather than the corpus-implied ≈6.2 — the single-small-generate summary stats are too noisy for the surrogate to locate diff --git a/experiments/ml/README.md b/experiments/ml/README.md index bf581847..3f3b4854 100644 --- a/experiments/ml/README.md +++ b/experiments/ml/README.md @@ -54,7 +54,7 @@ The training data is **corpus-derived**. Two hard rules: any run config carrying a corpus path are gitignored. Only code + specs are tracked. See [`.gitignore`](.gitignore). 2. **Models can memorize.** A GNN trained on raw entity graphs can memorize - real counterparty relationships; a sequence model can memorize rare + genuine counterparty relationships; a sequence model can memorize rare account/text patterns. Before *any* trained weight leaves the private box, it must pass a memorization review (the GNN spec describes a k-anonymity / DP-SGD path). Treat weights as sensitive as the corpus until reviewed. 
diff --git a/experiments/ml/flow/SPEC.md b/experiments/ml/flow/SPEC.md index 0d2129a3..7e1eb0f2 100644 --- a/experiments/ml/flow/SPEC.md +++ b/experiments/ml/flow/SPEC.md @@ -9,7 +9,7 @@ staying invertible (exact log-density, exact sampling). ## Why a flow (vs log-normal mixture) -The mixture has a fixed number of log-normal components; real amount +The mixture has a fixed number of log-normal components; production amount distributions have sharp round-number atoms ($1k/$5k/$10k), regulatory thresholds, and fat tails that a 3-component mixture smooths over. A flow learns the density nonparametrically and still gives the analytic likelihood diff --git a/experiments/ml/gnn/SPEC.md b/experiments/ml/gnn/SPEC.md index 1a8e74b9..d0faf266 100644 --- a/experiments/ml/gnn/SPEC.md +++ b/experiments/ml/gnn/SPEC.md @@ -70,7 +70,7 @@ sampled scaffold. (Re-evaluate if we want online sampling later.) ## Privacy -Highest memorization risk of the four tracks — the embedding can encode real +Highest memorization risk of the four tracks — the embedding can encode genuine counterparty adjacency. Before sharing any weights or artifact off the private box: * node ids are opaque hashes (done at export); diff --git a/experiments/ml/inverse/SPEC.md b/experiments/ml/inverse/SPEC.md index c9d71403..f723bd76 100644 --- a/experiments/ml/inverse/SPEC.md +++ b/experiments/ml/inverse/SPEC.md @@ -31,7 +31,7 @@ posteriors + coverage; never false-precision point estimates. θ ~ prior ───────────────────────────────▶ GL ──summary stats──▶ x │ │ └──────────── train q_φ(θ | x) (conditional flow) ◀─────────┘ - inference: real GL ──summary stats──▶ x* ──▶ q_φ(θ | x*) (one fwd pass) + inference: out-of-sample GL ──summary stats──▶ x* ──▶ q_φ(θ | x*) (one fwd pass) ``` 1. **`simulate.py`** — draw θ from a prior over a *small, identifiable* @@ -76,18 +76,17 @@ feature vector in `simulate.py` once the parameter set is fixed. 
## Distribution shift = the BF gap The inverse is only as trustworthy as the forward model's fidelity to reality. -An inverse trained on synthetic, applied to a real GL, is biased by exactly the +An inverse trained on synthetic, applied to an out-of-sample GL, is biased by exactly the behavioral-fidelity gap the composite measures. So fidelity work directly gates -inversion quality — and the inverse should only be pointed at real GL once the +inversion quality — and the inverse should only be pointed at out-of-sample GL once the forward model's BF composite is acceptable for the targeted account/source mix. ## Privacy -Training data is synthetic (no corpus). Applying the trained inverse to a real -GL reads that GL but emits only parameter posteriors — no row-level corpus -content. Same `DATASYNTH_CORPUS_DIR` discipline if real GL is used for +Training data is synthetic (no corpus). Applying the trained inverse to an out-of-sample GL reads that GL but emits only parameter posteriors — no row-level corpus +content. Same `DATASYNTH_CORPUS_DIR` discipline if out-of-sample GL is used for evaluation; results (posteriors) are not corpus content but treat any -real-GL-derived artifact as sensitive until reviewed. +out-of-sample-GL-derived artifact as sensitive until reviewed. ## Handoff diff --git a/experiments/ml/inverse/apply.py b/experiments/ml/inverse/apply.py index cd3d4582..fa0a535a 100644 --- a/experiments/ml/inverse/apply.py +++ b/experiments/ml/inverse/apply.py @@ -1,4 +1,4 @@ -"""Apply the trained inverse posterior q(θ | x) to a real GL → a posterior over +"""Apply the trained inverse posterior q(θ | x) to an out-of-sample GL → a posterior over the generator parameters that GL most likely came from. The audit-analytics capstone: point the SBC-calibrated posterior at the corpus.
@@ -9,7 +9,7 @@ --gl-canonical /tmp/_corp_canon.csv --x-cache /tmp/corpus_x29.json --n 4000 Caveat (SPEC § "Distribution shift = the BF gap"): the posterior is trained on -synthetic; applied to a real GL it is biased by exactly the forward-fidelity +synthetic; applied to an out-of-sample GL it is biased by exactly the forward-fidelity gap §1 measures. Trust the well-identified knobs (amount_mu, fraud_rate); read the rest as gap-limited. """ diff --git a/experiments/ml/inverse/simulate.py b/experiments/ml/inverse/simulate.py index a69347a1..d64b989f 100644 --- a/experiments/ml/inverse/simulate.py +++ b/experiments/ml/inverse/simulate.py @@ -69,7 +69,7 @@ def _entropy(shares: np.ndarray) -> float: def summary_stats(je_csv: Path) -> np.ndarray: """GL → fixed-length feature vector x (DIM_X,). Observable-only (no labels) - so the same map applies to a real GL at inference time.""" + so the same map applies to an out-of-sample GL at inference time.""" df = pd.read_csv(je_csv, low_memory=False) n = len(df) if n == 0: diff --git a/experiments/ml/inverse/validate.py b/experiments/ml/inverse/validate.py index 88410c00..41dd68eb 100644 --- a/experiments/ml/inverse/validate.py +++ b/experiments/ml/inverse/validate.py @@ -11,7 +11,7 @@ This is the whole point of doing inversion against a forward simulator: we can measure how well 'running the engine backward' works BEFORE pointing it at any -real GL. +out-of-sample GL. """ from __future__ import annotations diff --git a/experiments/ml/sequence/SPEC.md b/experiments/ml/sequence/SPEC.md index a14a5a6f..42f1dd4f 100644 --- a/experiments/ml/sequence/SPEC.md +++ b/experiments/ml/sequence/SPEC.md @@ -10,7 +10,7 @@ prior approximate marginally but miss in their *joint, autocorrelated* form ## Why autoregressive (vs marginal samplers) -The current samplers draw IET and line-count independently per event. Real GL +The current samplers draw IET and line-count independently per event. 
Out-of-sample GL streams are bursty and autocorrelated: a flurry of postings clusters, then quiets. A causal transformer conditions each event on the recent history, so burst structure and lag-1 autocorrelation emerge instead of being imposed. diff --git a/experiments/ml/surrogate/match_corpus.py b/experiments/ml/surrogate/match_corpus.py index 5f9d3f8c..566028bb 100644 --- a/experiments/ml/surrogate/match_corpus.py +++ b/experiments/ml/surrogate/match_corpus.py @@ -8,7 +8,7 @@ the objective as `distance(summary_stats(θ), corpus)`, fit an MLP surrogate, and CMA-ES to the corpus-matching `θ*`. `θ*` is the config the corpus "most likely came from" — cross-checking the flow finding (corpus log-amount mean -≈ 3.9). Demonstrates the tuning-loop accelerator end-to-end on real data. +≈ 3.9). Demonstrates the tuning-loop accelerator end-to-end on out-of-sample data. python -m surrogate.match_corpus --campaign data/inverse \\ --corpus /home/ubuntu/corpus_health.parquet