
ML experiments: scaffold 5 neuro-symbolic + inverse-SBI tracks (A100-ready)#203

Open
mivertowski wants to merge 18 commits into main from ml-experiments-scaffold

Conversation

@mivertowski
Owner

Summary

Runnable PyTorch scaffolding + per-track specs under experiments/ml/ to close the behavioral-fidelity gap by learning the generator's proposal distributions while leaving the symbolic constraint layer (debits=credits, A=L+E, document chains, IC matching) untouched.

Principle: the NN emits structure/shape; the existing Rust engine projects it onto the feasible manifold. Coherence stays a hard guarantee — no model ever emits a final balance.
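
A minimal sketch of the "propose, then project" split, for illustration only. It is not the Rust engine's algorithm, which this PR does not show. The helper name `project_balanced` and the proportional-rescale rule are assumptions: a model proposes unsigned line amounts in integer cents, and the projection adjusts the credit side so that debits equal credits exactly.

```python
def project_balanced(debits_cents, credits_cents):
    """Rescale proposed credit lines so sum(credits) == sum(debits), in integer cents.

    Hypothetical helper for illustration; the real projection lives in the Rust engine.
    """
    if not debits_cents or not credits_cents:
        raise ValueError("need at least one debit and one credit line")
    target = sum(debits_cents)
    proposed = sum(credits_cents)
    if proposed <= 0:
        raise ValueError("proposed credit total must be positive")
    # Proportional rescale, rounded down; the rounding remainder goes to the last line
    scaled = [c * target // proposed for c in credits_cents]
    scaled[-1] += target - sum(scaled)
    return list(debits_cents), scaled

debits, credits = project_balanced([12_500, 7_500], [9_000, 9_000, 3_000])
assert sum(debits) == sum(credits)
```

The point of the split is that the balance check never depends on the model: whatever shape is proposed, the output of the projection satisfies the invariant by construction.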

This is scaffold only — no model trained yet. Built to run on an A100 when it frees up; the dev box OOMs on orchestrator-scale work, so verification is by syntax-check + design review here, then training on the GPU box.

Tracks

| Dir | Architecture | BF metrics targeted |
| --- | --- | --- |
| gnn/ | GraphSAGE encoder + GAE inner-product decoder | P3 ClusteringGap, TriangleLogRatio |
| sequence/ | causal transformer over JE event-token streams | P1 IETD + Autocorr, P2 JELineBurst, P4 MeanGap |
| flow/ | conditional neural-spline flow (zuko) + round-number atom mixture | Benford / multimodal amount tails |
| surrogate/ | MLP eval-surrogate + CMA-ES knob search | calibration-loop speed (zero coherence risk) |

common/ — shared corpus→tensor exporter + BF-eval bridge (canonical Rust scorer + fast Python approximation for in-loop scoring).

Privacy / legal

  • Corpus path via DATASYNTH_CORPUS_DIR env — never hard-coded, never logged.
  • data/, weights/, runs/, path-bearing configs all gitignored — only code + specs tracked (verified: 0 corpus artifacts staged).
  • Per-track memorization review required before any weight leaves the private box (k-anonymity / DP-SGD path documented in gnn/SPEC.md — the GNN has the highest memorization risk).
  • No corpus content committed; schema uses DataSynth's own canonical field names, corpus column mapping via a gitignored local YAML.
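
One way the "never hard-coded, never logged" rule for the corpus path could look in code. This is a sketch under assumptions: the function name `corpus_dir` and the error wording are hypothetical; only the `DATASYNTH_CORPUS_DIR` variable name comes from the PR.

```python
import os
from pathlib import Path

def corpus_dir(env=os.environ):
    """Resolve the corpus directory from DATASYNTH_CORPUS_DIR without echoing it."""
    raw = env.get("DATASYNTH_CORPUS_DIR")
    if not raw:
        # The message names the variable, never a path
        raise RuntimeError("DATASYNTH_CORPUS_DIR is not set")
    path = Path(raw)
    if not path.is_dir():
        # Deliberately omit the path itself from the error text
        raise RuntimeError("DATASYNTH_CORPUS_DIR does not point to a directory")
    return path
```

Keeping the path out of exception messages matters because tracebacks tend to end up in run logs and CI output.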

Test plan

  • All 18 Python modules pass py_compile (syntax).
  • .gitignore confirmed to exclude *.pt / *.parquet / data/ / weights/.
  • Per track, on the A100: data_export → train → bf_bridge.score_canonical lift vs the v5.26 baseline (success criteria in each SPEC.md).
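
The syntax-only check in the first bullet can be reproduced with the standard library. A minimal sketch, assuming the modules live under `experiments/ml/`; the helper name `syntax_check` is hypothetical.

```python
import py_compile
from pathlib import Path

def syntax_check(root):
    """Byte-compile every .py file under root; return a list of (path, error) failures."""
    failures = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as exc:
            failures.append((path, exc.msg))
    return failures

# Usage (not executed here): syntax_check("experiments/ml") should return []
```

This only proves the files parse. It does not import them, so missing dependencies or shape errors would surface later on the GPU box.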

Independent of SP6 (#202) — disjoint paths, branched off main.

🤖 Generated with Claude Code

mivertowski and others added 2 commits May 20, 2026 16:09
Runnable PyTorch scaffolding + per-track specs under experiments/ml/ for
closing the behavioral-fidelity gap by learning the generator's *proposal
distributions* while leaving the symbolic constraint layer (debits=credits,
A=L+E, document chains, IC matching) untouched. The NN proposes structure;
the existing Rust engine enforces every invariant — coherence stays a hard
guarantee.

Tracks:
- gnn/       GraphSAGE+GAE relational sampler → P3 ClusteringGap / TriangleLogRatio
- sequence/  causal transformer over JE event streams → P1 IETD/Autocorr, P2, P4
- flow/      conditional neural-spline flow for amounts → Benford / multimodal tails
- surrogate/ MLP eval-surrogate + CMA-ES knob search → faster calibration loop
             (performance only, zero coherence risk — never touches generation)

common/ holds the shared corpus→tensor exporter and the BF-eval bridge
(canonical Rust scorer + a fast Python approximation for in-loop use).

Scaffold only — no model trained. Built to run on an A100 when free; this
dev box OOMs on orchestrator-scale work. Each train.py is runnable; model
bodies carry TODO markers where corpus-schema wiring lands after the first
data export.

Privacy: corpus path via DATASYNTH_CORPUS_DIR (never hard-coded); data/
weights/run-configs gitignored; per-track memorization review (k-anon /
DP-SGD path in the GNN spec) required before any weight leaves the private
box. No corpus content committed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Track 5 — run the generator backward. Given an observed GL, recover a
posterior over the latent process parameters that could have produced it
(audit-analytics direction). Feasible here because DataSynth is a
structured generative model with KNOWN ground truth: it manufactures
labeled (θ → GL) pairs for free, and the hard accounting constraints
regularize the otherwise ill-posed inverse.

Amortized SNPE: a conditional normalizing flow q_φ(θ | x) (reuses the
flow/ NSF) trained on forward-simulated pairs, where x is a GL
summary-stat vector. Many-to-one forward map ⇒ we recover a posterior,
not a point.

Files:
- params.py   tier-1 identifiable parameter set + priors (fraud rates,
              amount σ, posting-lag μ/σ, concentration)
- simulate.py draw θ ~ prior → datasynth-data generate → summary stats x
- model.py    PosteriorFlow (zuko NSF conditioned on x)
- train.py    maximize Σ log q_φ(θ|x) on the simulated pairs
- validate.py SBC rank histograms + credible-interval COVERAGE on
              held-out synthetic — measure how well 'backward' works
              before pointing at any real GL

Scope ladder in SPEC.md: parameters (this) → process attribution
(overlaps ocpm + gnn) → latent fraud/anomaly labels. Inversion quality is
gated by the forward model's BF fidelity (distribution shift). Privacy:
trains on synthetic only; emits parameter posteriors, never row-level
content.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
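
For reference, the train.py objective from the commit above ("maximize Σ log q_φ(θ|x)") written out. The symbols N (number of simulated pairs), p(θ) (the prior in params.py), and s(·) (the summary-stat map) are notation introduced here, not names from the code.

```latex
\hat{\phi} \;=\; \arg\max_{\phi}\; \frac{1}{N}\sum_{i=1}^{N} \log q_{\phi}(\theta_i \mid x_i),
\qquad \theta_i \sim p(\theta), \quad x_i = s\big(\mathrm{sim}(\theta_i)\big)
```

In expectation over simulated pairs, maximizing this is equivalent to minimizing the average forward KL divergence between the true posterior p(θ|x) and q_φ(θ|x), which is why the trained flow approximates a full posterior rather than a point estimate.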
@mivertowski changed the title from "ML experiments: scaffold 4 neuro-symbolic realism tracks (A100-ready)" to "ML experiments: scaffold 5 neuro-symbolic + inverse-SBI tracks (A100-ready)" on May 20, 2026
@mivertowski
Owner Author

Added track 5 — inverse / simulation-based inference (experiments/ml/inverse/): an amortized neural posterior q(θ|GL) that runs the generator backward to recover the latent process parameters a GL was distilled from, with SBC + credible-interval coverage validated on synthetic (where ground truth is known). Reuses the flow/ NSF + the forward simulator's free labels; many-to-one forward map ⇒ posterior, not point estimate.

mivertowski and others added 16 commits May 21, 2026 12:44
…calar knobs

Fills the simulate.py TODOs: a 29-dim observable-only GL summary-stat vector
(amount / Benford / round-dollar / weekend / lines-per-JE / posting-lag /
source-mix / IET / GL-concentration) and run_one (dotted-key config override ->
datasynth-data generate -> summary_stats), fanned out over a process pool.
Drops the invalid distributions.amounts.sigma knob (amounts is a mixture
components list, not a scalar) so overrides stay valid under
deny_unknown_fields; 5 verified scalar params remain for the tier-1 demo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
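
A rough sketch of the dotted-key override idea from the commit above. The function name `apply_overrides` and the config keys in the example are hypothetical; the strict rejection of unknown keys only loosely mirrors what `deny_unknown_fields` enforces on the Rust side.

```python
import copy

def apply_overrides(config, overrides):
    """Return a copy of config with dotted-key overrides applied.

    Unknown keys raise KeyError, loosely mirroring a strict (deny-unknown-fields) schema.
    """
    out = copy.deepcopy(config)
    for dotted, value in overrides.items():
        node = out
        *parents, leaf = dotted.split(".")
        for key in parents:
            if not isinstance(node.get(key), dict):
                raise KeyError(f"unknown config section: {dotted}")
            node = node[key]
        if leaf not in node:
            raise KeyError(f"unknown config key: {dotted}")
        node[leaf] = value
    return out

base = {"fraud": {"enabled": True, "fraud_rate": 0.01}}  # hypothetical keys
cfg = apply_overrides(base, {"fraud.fraud_rate": 0.05})
```

Failing fast on an unknown key in Python is cheaper than discovering it as a generator error inside a process-pool worker.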
… missing')

Side-by-side behavioral observables (lines-per-JE, log-amount moments, Benford
MAD, round-dollar / small-ticket share, p99 amount, weekend share, source mix,
per-source inter-event times) for corpus (corpus columns) vs a synthetic
journal_entries.csv (canonical columns). Complements the normalized DRs from
behavioral score with raw units.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
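
One of the observables listed above, Benford MAD, can be sketched as follows. This is an assumption about the definition (mean absolute deviation between first-digit shares and Benford's law over digits 1 to 9), not a copy of the script in this PR; the function names are hypothetical.

```python
import math
from collections import Counter

# Benford's law: P(first digit = d) = log10(1 + 1/d)
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(amount):
    """Leading non-zero digit of a currency amount (2 decimals), or None for zero."""
    for ch in f"{abs(amount):.2f}":
        if ch in "123456789":
            return int(ch)
    return None

def benford_mad(amounts):
    """Mean absolute deviation between observed first-digit shares and Benford's law."""
    digits = [d for d in map(first_digit, amounts) if d is not None]
    if not digits:
        raise ValueError("no non-zero amounts")
    counts = Counter(digits)
    n = len(digits)
    return sum(abs(counts[d] / n - BENFORD[d]) for d in range(1, 10)) / 9
```

Reading the digit from a fixed two-decimal string avoids floating-point edge cases near powers of ten, at the cost of ignoring amounts that round to 0.00.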
…ss conditioning)

Reads corpus Functional Amount + GL Account Number, joins account_class via the
COA 'c' key, emits y=signed log1p(|amount|) + one-hot(account_class) to
amounts.parquet for the conditional flow. Tail clipped at p99.9 (privacy).
Source (~4500 corpus levels — itself a finding) is not one-hot encoded.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
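
The target transform named above, y = signed log1p(|amount|) with a one-hot account class, sketched in plain Python. The class list is a hypothetical placeholder; the exporter's actual classes come from the COA join.

```python
import math

def signed_log1p(amount):
    """y = sign(a) * log1p(|a|): symmetric log transform that keeps the debit/credit sign."""
    return math.copysign(math.log1p(abs(amount)), amount)

def inv_signed_log1p(y):
    """Inverse transform: a = sign(y) * expm1(|y|)."""
    return math.copysign(math.expm1(abs(y)), y)

def one_hot(label, classes):
    """One-hot vector for label over an ordered class list."""
    if label not in classes:
        raise ValueError(f"unknown class: {label!r}")
    return [1.0 if c == label else 0.0 for c in classes]

CLASSES = ["asset", "liability", "equity", "revenue", "expense"]  # hypothetical
row = [signed_log1p(-1250.0)] + one_hot("expense", CLASSES)
```

The transform is odd and invertible, so samples drawn from the flow map back to signed currency amounts with `inv_signed_log1p`.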
… tail)

The neural-spline flow's default domain (~[-5,5]) couldn't represent corpus
signed-log1p amounts (which reach ~10.4), collapsing learned p99 to ~$142 vs
the corpus $33k. Standardize y (mean/std saved in the checkpoint) so the tail
lands inside the spline; samples are unstandardized at characterization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
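
A sketch of the standardization fix described above, using made-up target values. The helper names and the dict layout of the saved statistics are assumptions; only the idea (store mean/std with the checkpoint, unstandardize samples afterwards) comes from the commit.

```python
import statistics

def fit_standardizer(ys):
    """Mean/std to store alongside the checkpoint."""
    mean = statistics.fmean(ys)
    std = statistics.pstdev(ys)
    if std == 0:
        raise ValueError("constant target; cannot standardize")
    return {"mean": mean, "std": std}

def standardize(y, stats):
    return (y - stats["mean"]) / stats["std"]

def unstandardize(z, stats):
    return z * stats["std"] + stats["mean"]

# Toy targets (made up): signed-log1p values whose tail exceeds the spline's ~[-5, 5] domain
ys = [-9.8, -6.1, -3.0, 2.2, 4.7, 6.9, 8.3, 10.4]
stats = fit_standardizer(ys)
zs = [standardize(y, stats) for y in ys]
assert max(abs(z) for z in zs) < 5  # every point now sits inside the spline domain
```

For scale: log1p(33,688) is about 10.42, so a p99 near $33k sits far outside a [-5, 5] domain before standardization.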
…streams)

Per-(client, source) ordered streams → streams.pt (dt / lines / account_class /
weekday / hour_band fields, 0=pad) + vocab.json, matching EventStreamTransformer.
Δt + line-count carry the inter-event/burst signal (the 60x IET-regularity gap
the descriptive analysis surfaced). Per-client processing bounds memory over the
50M-row corpus; source ranked to a 0..62 id map + 'other'.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
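
A simplified sketch of the per-(client, source) stream construction with 0 as the pad value. Only the `dt` and `lines` fields are shown; the tuple layout, the seconds unit, and the choice to floor gaps at 1 so that 0 stays reserved for padding are assumptions, and the exporter's actual bucketing may differ.

```python
from collections import defaultdict

def build_streams(events, max_len):
    """Group events by (client, source), order by time, and emit padded dt/lines fields.

    events: iterable of (client, source, timestamp_seconds, n_lines).
    Returns {key: {"dt": [...], "lines": [...]}} with 0 used as the pad value.
    """
    grouped = defaultdict(list)
    for client, source, ts, n_lines in events:
        grouped[(client, source)].append((ts, n_lines))
    streams = {}
    for key, rows in grouped.items():
        rows.sort()
        rows = rows[:max_len]
        # The first event has no predecessor; gaps are floored at 1 so 0 means "pad" only
        dts = [1] + [max(1, b[0] - a[0]) for a, b in zip(rows, rows[1:])]
        lines = [n for _, n in rows]
        pad = max_len - len(rows)
        streams[key] = {"dt": dts + [0] * pad, "lines": lines + [0] * pad}
    return streams
```

Grouping one client at a time, as the commit describes, keeps only that client's rows in memory rather than the whole corpus.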
What's missing (descriptive): source diversity, IET variance ~60x, amount tail
~16x, lines-per-JE ~2.3x. DR eval degenerates at corpus scale (noise floor ~0).
Flow learns amount density (v1 tail-collapse bug found+fixed via y-standardize).
Sequence transformer trains on corpus event streams; corpus dt-bucket lag-1
autocorr -0.118 (variance, not autocorr, is the gap). v2 flow number pending.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v2 (standardized y): NLL 8.96->0.67; p99 $31,754 vs corpus $33,688 (~6%),
std/skew spot-on. The shipped 3-component mixture overshoots p99 ~16x. A
learned per-account-class flow recovers the amount distribution the mixture
misses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…b params

- sequence/characterize.py: held-out per-token NLL of the transformer vs an iid
  per-field marginal baseline -> information gain from modelling history.
- inverse/make_base.py: small fast campaign config (fraud + distributions only).
- inverse/params.py: pivot to (fraud_rate, amount_mu, amount_sigma) — minimal
  config friction; amount via structured component override; ties inverse +
  surrogate to the flow finding (recover corpus amount mean/std).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
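
The "information gain from modelling history" measure in sequence/characterize.py can be illustrated on a toy stream. In this sketch a first-order Markov model stands in for the transformer, and add-one smoothing is an assumption; the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def marginal_nll(train, heldout, vocab):
    """Mean per-token NLL (nats) of an iid unigram model with add-one smoothing."""
    counts = Counter(train)
    total = len(train) + len(vocab)
    return -sum(math.log((counts[t] + 1) / total) for t in heldout) / len(heldout)

def bigram_nll(train, heldout, vocab):
    """Mean per-token NLL (nats) of a first-order Markov model; stand-in for the transformer."""
    pair = defaultdict(Counter)
    for prev, cur in zip(train, train[1:]):
        pair[prev][cur] += 1
    nll = 0.0
    for prev, cur in zip(heldout, heldout[1:]):
        row = pair[prev]
        nll -= math.log((row[cur] + 1) / (sum(row.values()) + len(vocab)))
    return nll / (len(heldout) - 1)

vocab = ["A", "B", "C"]
train = ["A", "B", "C"] * 200      # toy stream with strong sequential structure
heldout = ["A", "B", "C"] * 50
gain = marginal_nll(train, heldout, vocab) - bigram_nll(train, heldout, vocab)
```

A positive `gain` means history carries information the marginals miss; a gain near zero for a field would indicate that field is close to memoryless.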
….37 nats lift

Sequence track: AR transformer beats iid marginal by +3.37 nats/token on
held-out (account_class/weekday/lines structure captured; Dt near-memoryless).
Fix the amount distribution_type enum value for the inverse campaign override.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…paign)

Reuses the inverse forward campaign's (theta, summary-stat) pairs: objective =
distance(summary_stats(theta), corpus), MLP surrogate, CMA-ES to the
corpus-matching theta*. Runnable + grounded (vs the scaffold optimize.py whose
load_history is a TODO + targets the corpus-scale-degenerate DR). theta* should
recover amount_mu ~ corpus log-amount mean, cross-checking the flow finding.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gma not

Amortized posterior over 3 knobs, 1000-sim campaign (0 fail), SBC + 90% coverage
on held-out synthetic: amount_mu cov 0.92 (MAE .049), fraud_rate cov 0.88 (.078)
— calibrated; amount_sigma cov 0.77 — poorly identified (other variance swamps
the component sigma). 'Run the engine backward' validated on synthetic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
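
What the SBC and coverage numbers above measure can be shown on a toy model where the exact posterior is known. This is a sketch, not validate.py: the conjugate Gaussian model and the rank-band definition of coverage are assumptions chosen so the example runs with the standard library alone.

```python
import random

def sbc_ranks(n_trials, n_post, rng):
    """SBC on a toy conjugate model: theta ~ N(0,1), x | theta ~ N(theta,1).

    The exact posterior N(x/2, 1/2) stands in for q_phi(theta | x).
    """
    ranks = []
    for _ in range(n_trials):
        theta = rng.gauss(0.0, 1.0)
        x = rng.gauss(theta, 1.0)
        draws = [rng.gauss(x / 2, 0.5 ** 0.5) for _ in range(n_post)]
        ranks.append(sum(d < theta for d in draws))  # rank in 0..n_post
    return ranks

def coverage(ranks, n_post, level=0.90):
    """Share of trials whose true theta falls in the central `level` rank band."""
    tail = round((1 - level) / 2 * (n_post + 1))
    return sum(tail <= r <= n_post - tail for r in ranks) / len(ranks)

ranks = sbc_ranks(n_trials=2000, n_post=199, rng=random.Random(0))
cov90 = coverage(ranks, n_post=199)
```

With a calibrated posterior the ranks are uniform on 0..n_post and `cov90` lands near 0.90. Coverage well below the nominal level, such as the 0.77 reported for amount_sigma, indicates intervals that are too narrow for that parameter.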
First run failed (Spearman -0.08, theta* at bounds): corpus lpje_std=123 (JEs
with thousands of lines) dominated the L2 distance. Drop lpje_std + iet_* from
the comparable set, clip standardized features to +/-4, and add --corpus-cache
to reuse corpus_features (skip the 53M-row pass on rerun).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
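
A sketch of the repaired distance described above: drop the dominating features, standardize, clip to ±4, then take L2. The function name and the feature names other than `lpje_std` and the `iet_` prefix are hypothetical, and the numbers in the usage note are made up.

```python
def robust_distance(sim, corpus, scale, drop=("lpje_std",), clip=4.0):
    """L2 distance over standardized features, each clipped to +/- clip.

    sim, corpus: {feature: value}; scale: {feature: std across the campaign}.
    Features named in `drop` or prefixed `iet_` are excluded.
    """
    total = 0.0
    for name, target in corpus.items():
        if name in drop or name.startswith("iet_"):
            continue
        z = (sim[name] - target) / scale[name]
        z = max(-clip, min(clip, z))
        total += z * z
    return total ** 0.5
```

For example, with two comparable features whose standardized gaps are 1 and 8, the clipped distance is sqrt(1 + 16) rather than sqrt(1 + 64), so a single outlying feature can no longer dominate the objective.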
Surrogate machinery runs end-to-end on real campaign data; Spearman 0.46, theta*
mis-located (amount_mu at bound) — single-small-generate stats too noisy. The
calibrated inverse posterior is the principled route to corpus-param recovery.
Completes flow / sequence / inverse / surrogate in FINDINGS.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (corpus capstone)

Feeds a GL's summary stats into the SBC-calibrated q(theta|x) and reports a
median + 90% CI per knob. Emits only parameter posteriors (privacy contract).
The inverse-SBI capstone: point the calibrated posterior at the corpus.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
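
The per-knob report (median plus 90% CI) reduces to quantiles over posterior draws. A minimal sketch, assuming the draws for one knob are already available as a list of floats; the function name and the linear-interpolation quantile rule are assumptions.

```python
def summarize(samples, level=0.90):
    """Median and central credible interval from posterior draws of one knob."""
    xs = sorted(samples)
    if not xs:
        raise ValueError("no samples")

    def quantile(q):
        # Linear interpolation between order statistics
        pos = q * (len(xs) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

    alpha = (1 - level) / 2
    return {"median": quantile(0.5), "lo": quantile(alpha), "hi": quantile(1 - alpha)}
```

Only these three numbers per knob would leave the process, which is consistent with the stated contract of emitting parameter posteriors and no row-level content.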
…us (OOD)

Feeding the corpus into the SBC-calibrated q(theta|x) returns a boundary-pinned,
zero-width-CI posterior (confidently wrong) — the corpus is out-of-distribution
for the synthetic-trained inverse. 'Distribution shift = the BF gap' made
empirical: well-calibrated on synthetic (cov 0.92), untrustworthy on real until
the forward-fidelity gap (section 1) is closed. The strongest argument for the
flow/sequence fidelity work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
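
Two cheap guards that could flag this failure mode before a posterior is trusted. Both are sketches under assumptions, not part of this PR: the z-score threshold, the tolerance, and the function names are hypothetical, and `summary` is assumed to be a dict with `median`, `lo`, and `hi` keys.

```python
import statistics

def ood_features(x_obs, x_sims, z_max=4.0):
    """Names of summary stats where x_obs lies more than z_max campaign std devs from the mean."""
    flagged = []
    for name, value in x_obs.items():
        col = [x[name] for x in x_sims]
        mu, sd = statistics.fmean(col), statistics.pstdev(col)
        if sd == 0:
            if value != mu:
                flagged.append(name)
        elif abs(value - mu) / sd > z_max:
            flagged.append(name)
    return flagged

def is_degenerate(summary, bounds, tol=1e-3):
    """True if a knob's CI has ~zero width or its median sits on a prior bound."""
    lo_b, hi_b = bounds
    span = hi_b - lo_b
    width = summary["hi"] - summary["lo"]
    pinned = min(summary["median"] - lo_b, hi_b - summary["median"]) <= tol * span
    return width <= tol * span or pinned
```

The first check looks at the input (is the observed summary vector inside the range the simulator ever produced?), the second at the output (is the posterior boundary-pinned or zero-width?). Either firing suggests the amortized posterior is being queried outside its training distribution.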
Strip 'real' corpus/GL qualifiers (-> 'corpus' / 'out-of-sample GL'), drop the
client-count + industry hint ('21-client health corpus' -> 'the corpus'), and
remove a verbatim COA label token from FINDINGS + the scaffold SPEC/py docs, per
the corpus-vague-reference rule (no client names, no real-data hints, no paths,
no verbatim corpus content). 'real eval'/'REAL BF scorer' kept (actual-vs-
surrogate, not a data qualifier).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>