Fix corrupted ONNX checkpoints from save_model_onnx caching by jonbinney · Pull Request #368 · jonbinney/deep_rabbit_hole

jonbinney · 2026-05-14T21:07:36Z

After this change, running train_v2.py with Rust self-play appears to behave correctly (the win_perc graph looks good).

save_model_onnx cached the ONNX protobuf on the first call and only patched initializers on subsequent saves. With do_constant_folding=True, BatchNorm parameters get folded into preceding Conv weights, producing 21 transformed initializers from 55 state_dict tensors. The subsequent-save path then overwrote those folded initializers with the raw state_dict tensors, silently corrupting every checkpoint after model_0.

Rust self-play consumes these ONNX files; the corruption produced garbage NN evaluations and collapsed training (raw_win_perc 18% vs Python's 94%). Always do a full torch.onnx.export instead; the cost is negligible against the seconds-per-training-step budget.
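The failure mode described above can be reproduced in miniature without PyTorch or ONNX. The following is only a sketch under simplifying assumptions: the "conv" is a scalar affine map, BatchNorm is in eval mode, and every function name and parameter value here is hypothetical rather than taken from the repository. It shows that a folded graph gives the right answer only when it is fed folded weights, and that patching it with raw state_dict-style values changes the output.

```python
import math

def conv_then_bn(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Reference model: scalar 'conv' (w * x + b) followed by eval-mode BatchNorm."""
    y = w * x + b
    return gamma * (y - mean) / math.sqrt(var + eps) + beta

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold the BatchNorm parameters into the conv weight and bias.

    Returns (w_folded, b_folded) so that w_folded * x + b_folded
    matches conv_then_bn(x, ...) up to floating-point rounding.
    """
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

def folded_graph(x, w_init, b_init):
    """The exported graph after constant folding: only the conv remains."""
    return w_init * x + b_init

# Hypothetical parameters after one training step (illustrative values only).
params = dict(w=0.8, b=-0.2, gamma=1.5, beta=0.4, mean=0.1, var=9.0)
x = 3.0

expected = conv_then_bn(x, **params)

# Full re-export: fold the *current* parameters, then run the folded graph.
good = folded_graph(x, *fold_bn_into_conv(**params))

# Cached-graph patching: the folded graph receives the raw conv weight and
# bias, so the BatchNorm scale and shift are silently dropped.
bad = folded_graph(x, params["w"], params["b"])

print(f"expected={expected:.6f} re-export={good:.6f} patched={bad:.6f}")
```

In the real model the mismatch is also structural (21 folded initializers versus 55 state_dict tensors), so there is no simple one-to-one patch; re-exporting on every save avoids having to reproduce the folding by hand.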

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

alejandromarcu

Yay, great catch!

jonbinney assigned alejandromarcu May 14, 2026

alejandromarcu approved these changes May 15, 2026

jonbinney merged commit 190b539 into main May 15, 2026
2 checks passed

jonbinney merged 1 commit into main from jdb/rust-selfplay-remove-caching
