ML experiments: scaffold 5 neuro-symbolic + inverse-SBI tracks (A100-ready) #203
Open
mivertowski wants to merge 18 commits into
Runnable PyTorch scaffolding + per-track specs under experiments/ml/ for
closing the behavioral-fidelity gap by learning the generator's *proposal
distributions* while leaving the symbolic constraint layer (debits=credits,
A=L+E, document chains, IC matching) untouched. The NN proposes structure;
the existing Rust engine enforces every invariant — coherence stays a hard
guarantee.
Tracks:
- gnn/ GraphSAGE+GAE relational sampler → P3 ClusteringGap / TriangleLogRatio
- sequence/ causal transformer over JE event streams → P1 IETD/Autocorr, P2, P4
- flow/ conditional neural-spline flow for amounts → Benford / multimodal tails
- surrogate/ MLP eval-surrogate + CMA-ES knob search → faster calibration loop
(performance only, zero coherence risk — never touches generation)
common/ holds the shared corpus→tensor exporter and the BF-eval bridge
(canonical Rust scorer + a fast Python approximation for in-loop use).
Scaffold only — no model trained. Built to run on an A100 when free; this
dev box OOMs on orchestrator-scale work. Each train.py is runnable; model
bodies carry TODO markers where corpus-schema wiring lands after the first
data export.
Privacy: corpus path via DATASYNTH_CORPUS_DIR (never hard-coded); data/
weights/run-configs gitignored; per-track memorization review (k-anon /
DP-SGD path in the GNN spec) required before any weight leaves the private
box. No corpus content committed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
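The privacy rule above (corpus path from DATASYNTH_CORPUS_DIR, never hard-coded) could look roughly like this. A minimal sketch: the helper name `corpus_dir` is hypothetical and not taken from the PR, and the error messages deliberately avoid echoing any path.

```python
import os
from pathlib import Path


def corpus_dir() -> Path:
    """Resolve the corpus directory from the environment, never from a literal path."""
    raw = os.environ.get("DATASYNTH_CORPUS_DIR")
    if not raw:
        # Keep the message free of any path-like value.
        raise RuntimeError("DATASYNTH_CORPUS_DIR is not set")
    path = Path(raw)
    if not path.is_dir():
        raise RuntimeError("DATASYNTH_CORPUS_DIR does not point to a directory")
    return path
```

Callers would take the returned `Path` and avoid printing or logging it.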
Track 5 — run the generator backward. Given an observed GL, recover a
posterior over the latent process parameters that could have produced it
(audit-analytics direction). Feasible here because DataSynth is a
structured generative model with KNOWN ground truth: it manufactures
labeled (θ → GL) pairs for free, and the hard accounting constraints
regularize the otherwise ill-posed inverse.
Amortized SNPE: a conditional normalizing flow q_φ(θ | x) (reuses the
flow/ NSF) trained on forward-simulated pairs, where x is a GL
summary-stat vector. Many-to-one forward map ⇒ we recover a posterior,
not a point.
Files:
- params.py tier-1 identifiable parameter set + priors (fraud rates,
amount σ, posting-lag μ/σ, concentration)
- simulate.py draw θ ~ prior → datasynth-data generate → summary stats x
- model.py PosteriorFlow (zuko NSF conditioned on x)
- train.py maximize Σ log q_φ(θ|x) on the simulated pairs
- validate.py SBC rank histograms + credible-interval COVERAGE on
held-out synthetic — measure how well 'backward' works
before pointing at any real GL
Scope ladder in SPEC.md: parameters (this) → process attribution
(overlaps ocpm + gnn) → latent fraud/anomaly labels. Inversion quality is
gated by the forward model's BF fidelity (distribution shift). Privacy:
trains on synthetic only; emits parameter posteriors, never row-level
content.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
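To make the SBC + coverage check concrete, here is a toy sketch on a conjugate Gaussian model where the exact posterior is known, so it stands in for q_φ(θ | x). It is not the zuko flow or validate.py; the function names and the model are assumptions for illustration only.

```python
import random


def sbc_ranks(n_sims: int = 2000, n_post: int = 99, seed: int = 0) -> list[int]:
    """Rank of the true theta among posterior draws, one rank per simulation."""
    rng = random.Random(seed)
    ranks = []
    for _ in range(n_sims):
        theta = rng.gauss(0.0, 1.0)   # theta ~ prior N(0, 1)
        x = rng.gauss(theta, 1.0)     # x | theta ~ N(theta, 1)
        # Exact posterior for this toy model: N(x / 2, variance 1/2).
        post = [rng.gauss(x / 2.0, 0.5 ** 0.5) for _ in range(n_post)]
        ranks.append(sum(s < theta for s in post))
    return ranks


def coverage_90(ranks: list[int], n_post: int = 99) -> float:
    """Share of simulations whose true theta falls in the central 90% of draws."""
    lo = round(0.05 * (n_post + 1))   # 5 when n_post = 99
    hi = n_post - lo                  # 94 when n_post = 99
    return sum(lo <= r <= hi for r in ranks) / len(ranks)


ranks = sbc_ranks()
print(coverage_90(ranks))
```

A calibrated posterior gives ranks that are uniform over 0..n_post and coverage near 0.90; a skewed or U-shaped rank histogram signals bias or over/under-confidence.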
Added track 5: inverse / simulation-based inference (
…calar knobs
Fills the simulate.py TODOs: a 29-dim observable-only GL summary-stat
vector (amount / Benford / round-dollar / weekend / lines-per-JE /
posting-lag / source-mix / IET / GL-concentration) and run_one
(dotted-key config override -> datasynth-data generate -> summary_stats),
fanned out over a process pool. Drops the invalid
distributions.amounts.sigma knob (amounts is a mixture components list,
not a scalar) so overrides stay valid under deny_unknown_fields; 5
verified scalar params remain for the tier-1 demo.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
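A dotted-key override that stays strict about unknown fields might be sketched as follows. The helper `apply_overrides` and the config keys in the example are hypothetical; the real schema is enforced on the Rust side via deny_unknown_fields, which this only imitates by refusing to create new keys.

```python
import copy


def apply_overrides(config: dict, overrides: dict) -> dict:
    """Return a copy of config with dotted-key overrides applied.

    Raises KeyError for any path that does not already exist, so a typo or
    an invalid knob fails here instead of producing an unknown field.
    """
    out = copy.deepcopy(config)
    for dotted, value in overrides.items():
        *parents, leaf = dotted.split(".")
        node = out
        for key in parents:
            node = node[key]          # KeyError if an intermediate key is unknown
        if leaf not in node:
            raise KeyError(dotted)
        node[leaf] = value
    return out


# Illustrative config shape only; not the actual DataSynth schema.
base = {"fraud": {"fraud_rate": 0.01}, "distributions": {"amounts": {"components": []}}}
patched = apply_overrides(base, {"fraud.fraud_rate": 0.05})
```

With this shape, an override such as `distributions.amounts.sigma` raises `KeyError` because `amounts` holds a components list and no scalar `sigma`.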
… missing')
Side-by-side behavioral observables (lines-per-JE, log-amount moments,
Benford MAD, round-dollar / small-ticket share, p99 amount, weekend
share, source mix, per-source inter-event times) for corpus (corpus
columns) vs a synthetic journal_entries.csv (canonical columns).
Complements the normalized DRs from behavioral score with raw units.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
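Two of the observables listed above, Benford MAD and round-dollar share, can be computed along these lines. A minimal sketch with hypothetical function names; the PR's own implementation may bin or filter amounts differently.

```python
import math
from collections import Counter

# Benford's law: P(first digit = d) = log10(1 + 1/d).
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(amount: float) -> int:
    # Scientific notation puts the leading significant digit first.
    return int(f"{abs(amount):.15e}"[0])


def benford_mad(amounts) -> float:
    """Mean absolute deviation between observed first-digit shares and Benford's law."""
    digits = [first_digit(a) for a in amounts if a != 0]
    counts = Counter(digits)
    n = len(digits)
    return sum(abs(counts[d] / n - BENFORD[d]) for d in range(1, 10)) / 9


def round_dollar_share(amounts) -> float:
    """Share of amounts with no cents component."""
    return sum(float(a).is_integer() for a in amounts) / len(amounts)
```

Log-uniform amounts give a MAD near zero, while amounts with equally frequent leading digits give a MAD of roughly 0.06.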
…ss conditioning)
Reads corpus Functional Amount + GL Account Number, joins account_class
via the COA 'c' key, emits y=signed log1p(|amount|) +
one-hot(account_class) to amounts.parquet for the conditional flow. Tail
clipped at p99.9 (privacy). Source (~4500 corpus levels, itself a
finding) is not one-hot encoded.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
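The target transform y = signed log1p(|amount|) plus a one-hot class vector can be sketched as below. The class names in `CLASSES` are illustrative placeholders, not the corpus COA labels, and the helper names are hypothetical.

```python
import math


def signed_log1p(amount: float) -> float:
    """y = sign(amount) * log(1 + |amount|); keeps debit/credit sign, compresses the tail."""
    return math.copysign(math.log1p(abs(amount)), amount)


def inverse_signed_log1p(y: float) -> float:
    """Map a model-space value back to a currency amount."""
    return math.copysign(math.expm1(abs(y)), y)


def one_hot(label: str, classes: list[str]) -> list[float]:
    return [1.0 if label == c else 0.0 for c in classes]


CLASSES = ["asset", "liability", "equity", "revenue", "expense"]  # illustrative only
row = [signed_log1p(-1250.0)] + one_hot("expense", CLASSES)
```

Under this transform an amount of about $33k maps to roughly 10.4, which is why the scale of y matters for any model with a bounded input domain.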
… tail)
The neural-spline flow's default domain (~[-5,5]) couldn't represent
corpus signed-log1p amounts (which reach ~10.4), collapsing learned p99
to ~$142 vs the corpus $33k. Standardize y (mean/std saved in the
checkpoint) so the tail lands inside the spline; samples are
unstandardized at characterization.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
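The standardize / unstandardize round trip behind this fix is small enough to show in full. A sketch on toy values (not corpus data) with hypothetical helper names; in the PR the mean/std live in the checkpoint.

```python
import statistics


def fit_standardizer(ys):
    """Mean and population std of the training targets."""
    return statistics.fmean(ys), statistics.pstdev(ys)


def standardize(y, mean, std):
    return (y - mean) / std


def unstandardize(z, mean, std):
    return z * std + mean


ys = [-9.0, -4.0, 0.5, 3.0, 6.0, 10.4]   # toy signed-log1p values
mean, std = fit_standardizer(ys)
zs = [standardize(y, mean, std) for y in ys]
print(max(abs(z) for z in zs))           # well inside [-5, 5]
```

Samples drawn in z-space are mapped back with `unstandardize` before any amount statistic such as p99 is computed.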
…streams)
Per-(client, source) ordered streams → streams.pt (dt / lines /
account_class / weekday / hour_band fields, 0=pad) + vocab.json,
matching EventStreamTransformer. Δt + line-count carry the
inter-event/burst signal (the 60x IET-regularity gap the descriptive
analysis surfaced). Per-client processing bounds memory over the 50M-row
corpus; source ranked to a 0..62 id map + 'other'.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
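A stripped-down version of the per-(client, source) stream build, covering only the dt and lines fields. The log2 bucketing, `max_len`, and the function names are assumptions for illustration; only the 0=pad convention comes from the commit above.

```python
import math
from collections import defaultdict

PAD = 0  # id 0 is reserved for padding


def dt_bucket(dt_seconds: float, n_buckets: int = 16) -> int:
    """Map a non-negative gap to a log-spaced bucket id in 1..n_buckets."""
    return min(n_buckets, 1 + int(math.log2(1.0 + dt_seconds)))


def build_streams(events, max_len: int = 8):
    """events: iterable of (client, source, timestamp_seconds, n_lines)."""
    groups = defaultdict(list)
    for client, source, ts, n_lines in events:
        groups[(client, source)].append((ts, n_lines))
    streams = {}
    for key, rows in groups.items():
        rows.sort()                      # order each stream by timestamp
        dts, lines = [], []
        prev = rows[0][0]
        for ts, n_lines in rows[:max_len]:
            dts.append(dt_bucket(ts - prev))
            lines.append(n_lines)        # assumes every JE has at least one line
            prev = ts
        pad = max_len - len(dts)
        streams[key] = {"dt": dts + [PAD] * pad, "lines": lines + [PAD] * pad}
    return streams
```

Grouping first and processing one key at a time mirrors the per-client approach that keeps memory bounded.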
What's missing (descriptive): source diversity, IET variance ~60x, amount
tail ~16x, lines-per-JE ~2.3x. DR eval degenerates at corpus scale (noise
floor ~0). Flow learns amount density (v1 tail-collapse bug found+fixed
via y-standardize). Sequence transformer trains on corpus event streams;
corpus dt-bucket lag-1 autocorr -0.118 (variance, not autocorr, is the
gap). v2 flow number pending.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
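For reference, the two statistics contrasted here (lag-1 autocorrelation vs a variance ratio) can be computed as below. A plain-Python sketch with hypothetical names; the PR's characterization scripts may normalize differently.

```python
import statistics


def lag1_autocorr(xs) -> float:
    """Lag-1 autocorrelation: covariance of consecutive values over total variance."""
    mean = statistics.fmean(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs, xs[1:]))
    den = sum((a - mean) ** 2 for a in xs)
    return num / den


def variance_ratio(a, b) -> float:
    """How many times more variable series a is than series b."""
    return statistics.pvariance(a) / statistics.pvariance(b)
```

A value near zero from `lag1_autocorr` alongside a large `variance_ratio` matches the pattern described above: the spread differs, the serial dependence does not.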
v2 (standardized y): NLL 8.96->0.67; p99 $31,754 vs corpus $33,688 (~6%),
std/skew spot-on. The shipped 3-component mixture overshoots p99 ~16x. A
learned per-account-class flow recovers the amount distribution the
mixture misses.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…b params
- sequence/characterize.py: held-out per-token NLL of the transformer vs
  an iid per-field marginal baseline -> information gain from modelling
  history.
- inverse/make_base.py: small fast campaign config (fraud + distributions
  only).
- inverse/params.py: pivot to (fraud_rate, amount_mu, amount_sigma) for
  minimal config friction; amount via structured component override;
  ties inverse + surrogate to the flow finding (recover corpus amount
  mean/std).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
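The information-gain comparison in characterize.py can be illustrated with a much simpler history model. In this sketch a smoothed bigram model is swapped in for the transformer, purely to show how "NLL of a history model minus NLL of an iid marginal" is computed; the names and data are toy assumptions.

```python
import math
from collections import Counter


def marginal_nll(train, held_out, vocab_size):
    """Per-token NLL (nats) of an iid marginal fitted on train, add-one smoothed."""
    counts = Counter(train)
    total = len(train) + vocab_size
    return -sum(math.log((counts[t] + 1) / total) for t in held_out) / len(held_out)


def bigram_nll(train, held_out, vocab_size):
    """Per-token NLL (nats) of a first-order history model, add-one smoothed.

    The first held-out token has no history and is skipped.
    """
    pair = Counter(zip(train, train[1:]))
    prev = Counter(train[:-1])
    terms = [math.log((pair[(a, b)] + 1) / (prev[a] + vocab_size))
             for a, b in zip(held_out, held_out[1:])]
    return -sum(terms) / len(terms)


train = [0, 1, 2] * 200       # toy stream with strong sequential structure
held_out = [0, 1, 2] * 50
gain = marginal_nll(train, held_out, 3) - bigram_nll(train, held_out, 3)
print(gain)                   # nats/token gained by modelling history
```

A gain near zero would mean the field is close to memoryless given its marginal.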
….37 nats lift
Sequence track: AR transformer beats iid marginal by +3.37 nats/token on
held-out (account_class/weekday/lines structure captured; Dt
near-memoryless). Fix the amount distribution_type enum value for the
inverse campaign override.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…paign)
Reuses the inverse forward campaign's (theta, summary-stat) pairs:
objective = distance(summary_stats(theta), corpus), MLP surrogate, CMA-ES
to the corpus-matching theta*. Runnable + grounded (vs the scaffold
optimize.py whose load_history is a TODO + targets the
corpus-scale-degenerate DR). theta* should recover amount_mu ~ corpus
log-amount mean, cross-checking the flow finding.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gma not
Amortized posterior over 3 knobs, 1000-sim campaign (0 fail), SBC + 90%
coverage on held-out synthetic: amount_mu cov 0.92 (MAE .049) and
fraud_rate cov 0.88 (.078) are calibrated; amount_sigma cov 0.77 is
poorly identified (other variance swamps the component sigma). 'Run the
engine backward' validated on synthetic.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First run failed (Spearman -0.08, theta* at bounds): corpus lpje_std=123
(JEs with thousands of lines) dominated the L2 distance. Drop lpje_std +
iet_* from the comparable set, clip standardized features to +/-4, and
add --corpus-cache to reuse corpus_features (skip the 53M-row pass on
rerun).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
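The distance fix described above (drop lpje_std and iet_*, standardize, clip to +/-4) might look like this. A sketch under assumptions: the `scale` dict (for example, per-feature std across the simulated campaign) and every feature name other than lpje_std / iet_* are hypothetical.

```python
import math


def clipped_distance(sim: dict, corpus: dict, scale: dict,
                     drop=("lpje_std",), drop_prefixes=("iet_",),
                     clip: float = 4.0) -> float:
    """L2 distance over comparable features, standardized then clipped to +/-clip."""
    total = 0.0
    for name, target in corpus.items():
        if name in drop or name.startswith(tuple(drop_prefixes)):
            continue                      # feature excluded from the comparable set
        z = (sim[name] - target) / scale[name]
        z = max(-clip, min(clip, z))      # one outlier feature cannot dominate
        total += z * z
    return math.sqrt(total)
```

Clipping bounds each feature's contribution at clip squared, so a single heavy-tailed statistic no longer swamps the objective.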
Surrogate machinery runs end-to-end on real campaign data; Spearman 0.46,
theta* mis-located (amount_mu at bound): single-small-generate stats too
noisy. The calibrated inverse posterior is the principled route to
corpus-param recovery. Completes flow / sequence / inverse / surrogate in
FINDINGS.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
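The Spearman figure quoted for the surrogate is a rank correlation between predicted and actual objective values. A dependency-free sketch of that metric, with hypothetical helper names (the PR may well use a library routine instead):

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(a, b) -> float:
    """Pearson correlation computed on the ranks of a and b."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5
```

Because only ranks matter, a surrogate can score well here while still mis-locating the optimum, which is consistent with the result above.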
… (corpus capstone)
Feeds a GL's summary stats into the SBC-calibrated q(theta|x) and reports
a median + 90% CI per knob. Emits only parameter posteriors (privacy
contract). The inverse-SBI capstone: point the calibrated posterior at
the corpus.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
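Reducing posterior draws to a median and a central 90% interval per knob takes only the standard library. A minimal sketch; `summarize` is a hypothetical name and the draws are assumed to be a plain list of floats for one knob.

```python
import statistics


def summarize(samples) -> dict:
    """Median and central 90% interval from posterior draws of one knob."""
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%.
    qs = statistics.quantiles(samples, n=20, method="inclusive")
    return {"median": statistics.median(samples), "lo": qs[0], "hi": qs[-1]}


print(summarize(list(range(101))))
```

Only these three numbers per knob need to leave the process, which keeps the output at the parameter level.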
…us (OOD)
Feeding the corpus into the SBC-calibrated q(theta|x) returns a
boundary-pinned, zero-width-CI posterior (confidently wrong): the corpus
is out-of-distribution for the synthetic-trained inverse. 'Distribution
shift = the BF gap' made empirical: well-calibrated on synthetic (cov
0.92), untrustworthy on real until the forward-fidelity gap (section 1)
is closed. The strongest argument for the flow/sequence fidelity work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
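A boundary-pinned, zero-width interval is easy to flag mechanically. The check below is a crude heuristic sketch, not a real out-of-distribution test; the function name, the tolerance, and the prior bounds in the usage note are assumptions.

```python
def pinned(lo: float, hi: float, median: float,
           prior_lo: float, prior_hi: float, tol: float = 0.01) -> bool:
    """Flag a posterior that sits on a prior bound with a near-zero-width interval."""
    span = prior_hi - prior_lo
    narrow = (hi - lo) < tol * span
    at_bound = min(median - prior_lo, prior_hi - median) < tol * span
    return narrow and at_bound
```

For example, `pinned(0.0999, 0.1, 0.1, 0.0, 0.1)` returns `True`, while a posterior centred in the prior range with a wide interval returns `False`.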
Strip 'real' corpus/GL qualifiers (-> 'corpus' / 'out-of-sample GL'), drop the
client-count + industry hint ('21-client health corpus' -> 'the corpus'), and
remove a verbatim COA label token from FINDINGS + the scaffold SPEC/py docs, per
the corpus-vague-reference rule (no client names, no real-data hints, no paths,
no verbatim corpus content). 'real eval'/'REAL BF scorer' kept (actual-vs-
surrogate, not a data qualifier).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Runnable PyTorch scaffolding + per-track specs under experiments/ml/ to close the behavioral-fidelity gap by learning the generator's proposal distributions while leaving the symbolic constraint layer (debits=credits, A=L+E, document chains, IC matching) untouched.
Principle: the NN emits structure/shape; the existing Rust engine projects it onto the feasible manifold. Coherence stays a hard guarantee: no model ever emits a final balance.
This is scaffold only; no model trained yet. Built to run on an A100 when it frees up; the dev box OOMs on orchestrator-scale work, so verification is by syntax-check + design review here, then training on the GPU box.
Tracks
- gnn/
- sequence/
- flow/
- surrogate/
- common/: shared corpus→tensor exporter + BF-eval bridge (canonical Rust scorer + fast Python approximation for in-loop scoring).
Privacy / legal
- Corpus path via DATASYNTH_CORPUS_DIR env: never hard-coded, never logged.
- data/, weights/, runs/, path-bearing configs all gitignored; only code + specs tracked (verified: 0 corpus artifacts staged).
- Per-track memorization review required before any weight leaves the private box (k-anon / DP-SGD path in gnn/SPEC.md; the GNN has the highest memorization risk).
Test plan
- py_compile (syntax).
- .gitignore confirmed to exclude *.pt / *.parquet / data/ / weights/.
- data_export → train → bf_bridge.score_canonical lift vs the v5.26 baseline (success criteria in each SPEC.md).
Independent of SP6 (#202): disjoint paths, branched off main.
🤖 Generated with Claude Code
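The py_compile item in the test plan could be scripted roughly as follows. A sketch only: `syntax_check` is a hypothetical helper, and the root directory passed in is whatever tree is being checked.

```python
import py_compile
from pathlib import Path


def syntax_check(root: str) -> list[str]:
    """Return the .py files under root that fail to compile."""
    failures = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError:
            failures.append(str(path))
    return failures
```

An empty list means every module parses; this catches syntax errors only and says nothing about imports or runtime behaviour.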