ML experiments: scaffold 5 neuro-symbolic + inverse-SBI tracks (A100-ready) by mivertowski · Pull Request #203 · mivertowski/SyntheticData

mivertowski · 2026-05-20T14:10:07Z

Summary

Runnable PyTorch scaffolding + per-track specs under experiments/ml/ to close the behavioral-fidelity gap by learning the generator's proposal distributions while leaving the symbolic constraint layer (debits=credits, A=L+E, document chains, IC matching) untouched.

Principle: the NN emits structure/shape; the existing Rust engine projects it onto the feasible manifold. Coherence stays a hard guarantee — no model ever emits a final balance.
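To make the "NN proposes, engine projects" split concrete, here is a minimal sketch of one possible projection onto the debits=credits constraint. The function name and the proportional-rescaling rule are assumptions for illustration; the Rust engine's actual projection is not shown in this PR.

```python
def project_to_balanced(debit_cents, credit_cents):
    """Rescale proposed credit lines so that total credits equal total debits.

    Hypothetical illustration only; not the Rust engine's algorithm.
    Amounts are integer cents to avoid floating-point drift.
    """
    total_debit = sum(debit_cents)
    total_credit = sum(credit_cents)
    if total_debit <= 0 or total_credit <= 0:
        raise ValueError("need positive debit and credit totals")
    # Keep the proposed shape (relative line sizes), fix the scale.
    scaled = [c * total_debit // total_credit for c in credit_cents]
    # Integer division can leave a few cents over; put them on the last line.
    scaled[-1] += total_debit - sum(scaled)
    return list(debit_cents), scaled


debits, credits = project_to_balanced([10_000, 5_000], [7_000, 7_000, 2_000])
```

The model only ever influences the relative sizes of the lines; the balance itself is enforced after the fact, which is the sense in which coherence stays a hard guarantee.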

This is scaffold only — no model trained yet. Built to run on an A100 when it frees up; the dev box OOMs on orchestrator-scale work, so verification is by syntax-check + design review here, then training on the GPU box.

Tracks

| Dir | Architecture | BF metrics targeted |
|---|---|---|
| `gnn/` | GraphSAGE encoder + GAE inner-product decoder | P3 ClusteringGap, TriangleLogRatio |
| `sequence/` | causal transformer over JE event-token streams | P1 IETD + Autocorr, P2 JELineBurst, P4 MeanGap |
| `flow/` | conditional neural-spline flow (zuko) + round-number atom mixture | Benford / multimodal amount tails |
| `surrogate/` | MLP eval-surrogate + CMA-ES knob search | calibration-loop speed (zero coherence risk) |

common/ — shared corpus→tensor exporter + BF-eval bridge (canonical Rust scorer + fast Python approximation for in-loop scoring).

Privacy / legal

- Corpus path via DATASYNTH_CORPUS_DIR env: never hard-coded, never logged.
- data/, weights/, runs/, path-bearing configs all gitignored; only code + specs tracked (verified: 0 corpus artifacts staged).
- Per-track memorization review required before any weight leaves the private box (k-anonymity / DP-SGD path documented in gnn/SPEC.md; the GNN has the highest memorization risk).
- No corpus content committed; schema uses DataSynth's own canonical field names, corpus column mapping via a gitignored local YAML.
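A minimal sketch of the "never hard-coded, never logged" rule for the corpus path. The helper name `corpus_dir` is an assumption, not something taken from the PR; the point is that the error messages name the variable but never echo its value.

```python
import os
from pathlib import Path


def corpus_dir() -> Path:
    """Resolve the corpus directory from the environment (hypothetical helper)."""
    raw = os.environ.get("DATASYNTH_CORPUS_DIR")
    if not raw:
        # Name the variable, not the path: nothing path-bearing reaches logs.
        raise RuntimeError("DATASYNTH_CORPUS_DIR is not set")
    path = Path(raw)
    if not path.is_dir():
        raise RuntimeError("DATASYNTH_CORPUS_DIR does not point to a directory")
    return path
```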

Test plan

- All 18 Python modules pass py_compile (syntax).
- .gitignore confirmed to exclude *.pt / *.parquet / data/ / weights/.
- Per track, on the A100: data_export → train → bf_bridge.score_canonical lift vs the v5.26 baseline (success criteria in each SPEC.md).
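The syntax-only check in the first item can be reproduced with the standard library. This is a sketch; the `syntax_check` helper and the directory you point it at are assumptions.

```python
import py_compile
from pathlib import Path


def syntax_check(root):
    """Byte-compile every .py file under root; return (path, message) failures."""
    failures = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as exc:
            failures.append((path, exc.msg))
    return failures
```

Calling `syntax_check("experiments/ml")` from the repository root should return an empty list when every module compiles. This verifies syntax only; it imports nothing and so says nothing about missing dependencies or runtime behavior.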

Independent of SP6 (#202) — disjoint paths, branched off main.

🤖 Generated with Claude Code

Runnable PyTorch scaffolding + per-track specs under experiments/ml/ for closing the behavioral-fidelity gap by learning the generator's *proposal distributions* while leaving the symbolic constraint layer (debits=credits, A=L+E, document chains, IC matching) untouched. The NN proposes structure; the existing Rust engine enforces every invariant, so coherence stays a hard guarantee.

Tracks:
- gnn/ GraphSAGE+GAE relational sampler → P3 ClusteringGap / TriangleLogRatio
- sequence/ causal transformer over JE event streams → P1 IETD/Autocorr, P2, P4
- flow/ conditional neural-spline flow for amounts → Benford / multimodal tails
- surrogate/ MLP eval-surrogate + CMA-ES knob search → faster calibration loop (performance only, zero coherence risk; never touches generation)

common/ holds the shared corpus→tensor exporter and the BF-eval bridge (canonical Rust scorer + a fast Python approximation for in-loop use).

Scaffold only: no model trained. Built to run on an A100 when free; this dev box OOMs on orchestrator-scale work. Each train.py is runnable; model bodies carry TODO markers where corpus-schema wiring lands after the first data export.

Privacy: corpus path via DATASYNTH_CORPUS_DIR (never hard-coded); data/ weights/run-configs gitignored; per-track memorization review (k-anon / DP-SGD path in the GNN spec) required before any weight leaves the private box. No corpus content committed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Track 5: run the generator backward. Given an observed GL, recover a posterior over the latent process parameters that could have produced it (audit-analytics direction). Feasible here because DataSynth is a structured generative model with KNOWN ground truth: it manufactures labeled (θ → GL) pairs for free, and the hard accounting constraints regularize the otherwise ill-posed inverse.

Amortized SNPE: a conditional normalizing flow q_φ(θ | x) (reuses the flow/ NSF) trained on forward-simulated pairs, where x is a GL summary-stat vector. Many-to-one forward map ⇒ we recover a posterior, not a point.

Files:
- params.py tier-1 identifiable parameter set + priors (fraud rates, amount σ, posting-lag μ/σ, concentration)
- simulate.py draw θ ~ prior → datasynth-data generate → summary stats x
- model.py PosteriorFlow (zuko NSF conditioned on x)
- train.py maximize Σ log q_φ(θ|x) on the simulated pairs
- validate.py SBC rank histograms + credible-interval COVERAGE on held-out synthetic; measures how well 'backward' works before pointing at any real GL

Scope ladder in SPEC.md: parameters (this) → process attribution (overlaps ocpm + gnn) → latent fraud/anomaly labels. Inversion quality is gated by the forward model's BF fidelity (distribution shift).

Privacy: trains on synthetic only; emits parameter posteriors, never row-level content.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mivertowski · 2026-05-20T19:50:15Z

Added track 5 — inverse / simulation-based inference (experiments/ml/inverse/): an amortized neural posterior q(θ|GL) that runs the generator backward to recover the latent process parameters a GL was distilled from, with SBC + credible-interval coverage validated on synthetic (where ground truth is known). Reuses the flow/ NSF + the forward simulator's free labels; many-to-one forward map ⇒ posterior, not point estimate.
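For readers unfamiliar with SBC, here is a minimal sketch of how the rank statistic is computed. It uses a toy conjugate-Gaussian model whose exact posterior is known, standing in for the trained q(θ|x). The toy model and the function name are assumptions, not code from validate.py.

```python
import random


def sbc_ranks(n_trials=500, n_post=20, seed=0):
    """Simulation-based calibration ranks for a toy model.

    Toy assumption: theta ~ N(0, 1), x | theta ~ N(theta, 1), so the exact
    posterior is N(x / 2, 1 / 2). A calibrated posterior gives ranks that are
    uniform on 0..n_post.
    """
    rng = random.Random(seed)
    ranks = []
    for _ in range(n_trials):
        theta = rng.gauss(0.0, 1.0)            # draw ground truth from the prior
        x = rng.gauss(theta, 1.0)              # forward-simulate an observation
        draws = [rng.gauss(x / 2.0, 0.5 ** 0.5) for _ in range(n_post)]
        ranks.append(sum(d < theta for d in draws))  # rank of truth among draws
    return ranks
```

A histogram of these ranks that bulges in the middle indicates an over-dispersed posterior; a U-shape indicates over-confidence. In the real track the `draws` line would be replaced by samples from the trained flow.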

…calar knobs Fills the simulate.py TODOs: a 29-dim observable-only GL summary-stat vector (amount / Benford / round-dollar / weekend / lines-per-JE / posting-lag / source-mix / IET / GL-concentration) and run_one (dotted-key config override -> datasynth-data generate -> summary_stats), fanned out over a process pool. Drops the invalid distributions.amounts.sigma knob (amounts is a mixture components list, not a scalar) so overrides stay valid under deny_unknown_fields; 5 verified scalar params remain for the tier-1 demo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
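A dotted-key override of the kind described above can be sketched as follows. The helper name and the example config keys are hypothetical; rejecting unknown keys mirrors the strictness that deny_unknown_fields imposes on the Rust side, so that a typo fails fast in Python rather than at generate time.

```python
import copy


def apply_override(config, dotted_key, value):
    """Return a copy of config with one existing nested key replaced.

    Hypothetical helper: raises KeyError instead of creating new keys.
    """
    out = copy.deepcopy(config)
    *parents, leaf = dotted_key.split(".")
    node = out
    for key in parents:
        if not isinstance(node, dict) or key not in node:
            raise KeyError(dotted_key)
        node = node[key]
    if not isinstance(node, dict) or leaf not in node:
        raise KeyError(dotted_key)
    node[leaf] = value
    return out


# Illustrative config; these key names are made up for the example.
base = {"fraud": {"rate": 0.01}, "seed": 7}
trial = apply_override(base, "fraud.rate", 0.05)
```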

… missing') Side-by-side behavioral observables (lines-per-JE, log-amount moments, Benford MAD, round-dollar / small-ticket share, p99 amount, weekend share, source mix, per-source inter-event times) for corpus (corpus columns) vs a synthetic journal_entries.csv (canonical columns). Complements the normalized DRs from behavioral score with raw units. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
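One of the observables listed above, Benford MAD, can be sketched like this. The sketch assumes the conventional definition (mean absolute deviation between observed first-digit frequencies and Benford's law over digits 1 to 9); the PR does not show the exact definition it uses.

```python
import math

# Benford's law: P(first digit = d) = log10(1 + 1/d)
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def first_digit(amount):
    """First significant digit of a number, or None for zero / non-finite."""
    for ch in repr(abs(amount)):
        if ch in "123456789":
            return int(ch)
        if ch in "eE":  # reached the exponent without a nonzero digit
            break
    return None


def benford_mad(amounts):
    counts = [0] * 9
    for amount in amounts:
        digit = first_digit(amount)
        if digit is not None:
            counts[digit - 1] += 1
    n = sum(counts)
    if n == 0:
        raise ValueError("no non-zero amounts")
    return sum(abs(c / n - p) for c, p in zip(counts, BENFORD)) / 9
```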

…ss conditioning) Reads corpus Functional Amount + GL Account Number, joins account_class via the COA 'c' key, emits y=signed log1p(|amount|) + one-hot(account_class) to amounts.parquet for the conditional flow. Tail clipped at p99.9 (privacy). Source (~4500 corpus levels — itself a finding) is not one-hot encoded. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
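The target transform and conditioning vector described above can be sketched as below. The class names are placeholders (the corpus account classes are not listed in this PR), and the helper names are assumptions.

```python
import math


def signed_log1p(amount):
    """y = sign(amount) * log1p(|amount|); keeps debit/credit sign, tames the tail."""
    return math.copysign(math.log1p(abs(amount)), amount)


def signed_expm1(y):
    """Inverse of signed_log1p, for mapping flow samples back to currency units."""
    return math.copysign(math.expm1(abs(y)), y)


def one_hot(label, classes):
    return [1.0 if label == c else 0.0 for c in classes]


CLASSES = ["class_a", "class_b", "class_c"]  # placeholder account classes
row = (signed_log1p(-250.0), one_hot("class_b", CLASSES))
```

For scale: an amount of about $33.7k maps to |y| ≈ 10.4.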

… tail) The neural-spline flow's default domain (~[-5,5]) couldn't represent corpus signed-log1p amounts (which reach ~10.4), collapsing learned p99 to ~$142 vs the corpus $33k. Standardize y (mean/std saved in the checkpoint) so the tail lands inside the spline; samples are unstandardized at characterization. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
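A minimal sketch of the standardize / unstandardize round trip, assuming a plain mean and population standard deviation (the checkpoint format and helper names are not taken from the PR).

```python
import statistics


def fit_standardizer(ys):
    """Mean and population std of the training targets (stored with the weights)."""
    mean = statistics.fmean(ys)
    std = statistics.pstdev(ys)
    if std == 0:
        raise ValueError("targets are constant")
    return mean, std


def standardize(ys, mean, std):
    return [(y - mean) / std for y in ys]


def unstandardize(zs, mean, std):
    return [z * std + mean for z in zs]


ys = [-10.4, -2.0, 0.0, 3.0, 10.4]   # signed-log1p values spanning the tail
mean, std = fit_standardizer(ys)
zs = standardize(ys, mean, std)
```

After standardization the extreme values sit well inside a [-5, 5] spline domain, which is the property the fix relies on.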

…streams) Per-(client, source) ordered streams → streams.pt (dt / lines / account_class / weekday / hour_band fields, 0=pad) + vocab.json, matching EventStreamTransformer. Δt + line-count carry the inter-event/burst signal (the 60x IET-regularity gap the descriptive analysis surfaced). Per-client processing bounds memory over the 50M-row corpus; source ranked to a 0..62 id map + 'other'. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
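The core of the stream construction, grouping by (client, source), ordering by time, and taking Δt between consecutive events, can be sketched as follows. The tuple layout is an assumption; the real exporter also emits account_class / weekday / hour_band fields and padding, which this sketch omits.

```python
from collections import defaultdict


def build_streams(events):
    """events: iterable of (client, source, timestamp_seconds, n_lines).

    Returns {(client, source): [(dt, n_lines), ...]} in time order, with dt = 0
    for the first event of each stream. Hypothetical record layout.
    """
    grouped = defaultdict(list)
    for client, source, ts, n_lines in events:
        grouped[(client, source)].append((ts, n_lines))
    streams = {}
    for key, rows in grouped.items():
        rows.sort()  # order by timestamp within the stream
        stream, prev_ts = [], None
        for ts, n_lines in rows:
            dt = 0 if prev_ts is None else ts - prev_ts
            stream.append((dt, n_lines))
            prev_ts = ts
        streams[key] = stream
    return streams


events = [("c1", "AP", 100, 2), ("c1", "AP", 40, 3), ("c1", "GL", 10, 1), ("c1", "AP", 160, 5)]
streams = build_streams(events)
```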

What's missing (descriptive): source diversity, IET variance ~60x, amount tail ~16x, lines-per-JE ~2.3x. DR eval degenerates at corpus scale (noise floor ~0). Flow learns amount density (v1 tail-collapse bug found+fixed via y-standardize). Sequence transformer trains on corpus event streams; corpus dt-bucket lag-1 autocorr -0.118 (variance, not autocorr, is the gap). v2 flow number pending. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

v2 (standardized y): NLL 8.96->0.67; p99 $31,754 vs corpus $33,688 (~6%), std/skew spot-on. The shipped 3-component mixture overshoots p99 ~16x. A learned per-account-class flow recovers the amount distribution the mixture misses. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…b params
- sequence/characterize.py: held-out per-token NLL of the transformer vs an iid per-field marginal baseline -> information gain from modelling history.
- inverse/make_base.py: small fast campaign config (fraud + distributions only).
- inverse/params.py: pivot to (fraud_rate, amount_mu, amount_sigma) for minimal config friction; amount via structured component override; ties inverse + surrogate to the flow finding (recover corpus amount mean/std).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

….37 nats lift Sequence track: AR transformer beats iid marginal by +3.37 nats/token on held-out (account_class/weekday/lines structure captured; Dt near-memoryless). Fix the amount distribution_type enum value for the inverse campaign override. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
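The "lift in nats/token" is the baseline's held-out NLL minus the model's. A sketch of the iid marginal baseline for a single field is below; the add-alpha smoothing and the function name are assumptions, and characterize.py may compute it differently (for example, summed across fields).

```python
import math
from collections import Counter


def marginal_nll(train_tokens, heldout_tokens, alpha=1.0):
    """Mean held-out NLL (nats/token) under an iid marginal fit on train.

    Add-alpha smoothing keeps unseen held-out tokens at finite loss.
    """
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(heldout_tokens)
    denom = len(train_tokens) + alpha * len(vocab)
    total = 0.0
    for tok in heldout_tokens:
        total -= math.log((counts[tok] + alpha) / denom)
    return total / len(heldout_tokens)


def lift(baseline_nll, model_nll):
    """Positive when modelling history beats the memoryless baseline."""
    return baseline_nll - model_nll
```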

…paign) Reuses the inverse forward campaign's (theta, summary-stat) pairs: objective = distance(summary_stats(theta), corpus), MLP surrogate, CMA-ES to the corpus-matching theta*. Runnable + grounded (vs the scaffold optimize.py whose load_history is a TODO + targets the corpus-scale-degenerate DR). theta* should recover amount_mu ~ corpus log-amount mean, cross-checking the flow finding. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…gma not Amortized posterior over 3 knobs, 1000-sim campaign (0 fail), SBC + 90% coverage on held-out synthetic: amount_mu cov 0.92 (MAE .049), fraud_rate cov 0.88 (.078) — calibrated; amount_sigma cov 0.77 — poorly identified (other variance swamps the component sigma). 'Run the engine backward' validated on synthetic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
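For reference, 90% coverage is the fraction of held-out simulations whose true parameter falls inside the central 90% credible interval; a calibrated posterior scores close to 0.90. The sketch below uses a toy conjugate-Gaussian posterior in place of the trained flow; the toy model and helper names are assumptions.

```python
import random
import statistics


def summarize(samples):
    """Median and central 90% interval from posterior draws."""
    cuts = statistics.quantiles(samples, n=20)  # cut points at 5%, 10%, ..., 95%
    return statistics.median(samples), cuts[0], cuts[-1]


def coverage_90(n_trials=400, n_post=200, seed=0):
    """Toy assumption: theta ~ N(0,1), x ~ N(theta,1), exact posterior N(x/2, 1/2)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        theta = rng.gauss(0.0, 1.0)
        x = rng.gauss(theta, 1.0)
        draws = [rng.gauss(x / 2.0, 0.5 ** 0.5) for _ in range(n_post)]
        _, lo, hi = summarize(draws)
        hits += lo <= theta <= hi
    return hits / n_trials
```

Read against this yardstick, 0.92 and 0.88 are near nominal, while 0.77 means the intervals for that knob are too narrow.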

First run failed (Spearman -0.08, theta* at bounds): corpus lpje_std=123 (JEs with thousands of lines) dominated the L2 distance. Drop lpje_std + iet_* from the comparable set, clip standardized features to +/-4, and add --corpus-cache to reuse corpus_features (skip the 53M-row pass on rerun). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
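The fix can be sketched as a distance over the comparable feature subset with per-feature clipping. It assumes features are standardized with per-feature mean/std from the simulation campaign; the function names and the `amount_mu` feature key are illustrative.

```python
import math


def comparable(name):
    """Drop the features that dominated the first run."""
    return name != "lpje_std" and not name.startswith("iet_")


def clip(z, bound=4.0):
    return max(-bound, min(bound, z))


def clipped_l2(sim, corpus, means, stds):
    """L2 distance between clipped, standardized feature vectors."""
    total = 0.0
    for name, corpus_value in corpus.items():
        if not comparable(name) or name not in sim:
            continue
        z_sim = clip((sim[name] - means[name]) / stds[name])
        z_corpus = clip((corpus_value - means[name]) / stds[name])
        total += (z_sim - z_corpus) ** 2
    return math.sqrt(total)


sim = {"amount_mu": 0.0, "lpje_std": 2.0}
corpus = {"amount_mu": 100.0, "lpje_std": 123.0}
means = {"amount_mu": 0.0, "lpje_std": 2.0}
stds = {"amount_mu": 1.0, "lpje_std": 1.0}
dist = clipped_l2(sim, corpus, means, stds)
```

With clipping, no single feature can contribute more than 8 standardized units to the distance, so one extreme statistic can no longer swamp the rest.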

Surrogate machinery runs end-to-end on real campaign data; Spearman 0.46, theta* mis-located (amount_mu at bound) — single-small-generate stats too noisy. The calibrated inverse posterior is the principled route to corpus-param recovery. Completes flow / sequence / inverse / surrogate in FINDINGS.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… (corpus capstone) Feeds a GL's summary stats into the SBC-calibrated q(theta|x) and reports a median + 90% CI per knob. Emits only parameter posteriors (privacy contract). The inverse-SBI capstone: point the calibrated posterior at the corpus. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…us (OOD) Feeding the corpus into the SBC-calibrated q(theta|x) returns a boundary-pinned, zero-width-CI posterior (confidently wrong) — the corpus is out-of-distribution for the synthetic-trained inverse. 'Distribution shift = the BF gap' made empirical: well-calibrated on synthetic (cov 0.92), untrustworthy on real until the forward-fidelity gap (section 1) is closed. The strongest argument for the flow/sequence fidelity work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Strip 'real' corpus/GL qualifiers (-> 'corpus' / 'out-of-sample GL'), drop the client-count + industry hint ('21-client health corpus' -> 'the corpus'), and remove a verbatim COA label token from FINDINGS + the scaffold SPEC/py docs, per the corpus-vague-reference rule (no client names, no real-data hints, no paths, no verbatim corpus content). 'real eval'/'REAL BF scorer' kept (actual-vs- surrogate, not a data qualifier). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mivertowski and others added 2 commits May 20, 2026 16:09

mivertowski changed the title ~~ML experiments: scaffold 4 neuro-symbolic realism tracks (A100-ready)~~ ML experiments: scaffold 5 neuro-symbolic + inverse-SBI tracks (A100-ready) May 20, 2026

mivertowski and others added 16 commits May 21, 2026 12:44

mivertowski wants to merge 18 commits into main from ml-experiments-scaffold
