Fix corrupted ONNX checkpoints from save_model_onnx caching #368

Merged: jonbinney merged 1 commit into main from jdb/rust-selfplay-remove-caching on May 15, 2026
@jonbinney

Owner

After this change, train_v2.py with Rust self-play appears to behave correctly (the win_perc graph looks good).

save_model_onnx cached the ONNX protobuf on the first call and only patched initializers on subsequent saves. With do_constant_folding=True, BatchNorm parameters get folded into preceding Conv weights, producing 21 transformed initializers from 55 state_dict tensors. The subsequent-save path then overwrote those folded initializers with the raw state_dict tensors, silently corrupting every checkpoint after model_0.
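
To make the failure mode concrete, here is a minimal pure-Python sketch of BatchNorm folding on a single scalar "channel". It is not the project's code: the function names and the numeric values are made up for illustration. It shows that a folded weight is a transformed value, so writing the raw state_dict weight back into a graph that no longer has a BatchNorm node changes the output.

```python
import math


def conv(x, w, b):
    # Stand-in for a 1x1 convolution on one channel: y = w * x + b.
    return w * x + b


def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    # Inference-mode BatchNorm using running statistics.
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    # Constant folding: absorb the BatchNorm affine transform into the conv.
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta


# Illustrative "state_dict" values (hypothetical numbers).
w, b = 0.5, 0.1
gamma, beta, mean, var = 2.0, -0.3, 0.4, 4.0
x = 3.0

# What the PyTorch model computes: conv followed by BatchNorm.
reference = batchnorm(conv(x, w, b), gamma, beta, mean, var)

# What a correctly exported, constant-folded graph computes: one conv.
w_folded, b_folded = fold_bn_into_conv(w, b, gamma, beta, mean, var)
folded = conv(x, w_folded, b_folded)

# What the cached graph computes once the folded initializers are
# overwritten with the raw weights: the BatchNorm step is simply lost.
corrupted = conv(x, w, b)

print(reference, folded, corrupted)
```

The folded graph reproduces the reference output, while the patched one does not, which is consistent with the garbage evaluations described in this PR.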

Rust self-play consumes these ONNX files; the corruption produced garbage NN evaluations and collapsed training (raw_win_perc 18% vs Python's 94%). Always do a full torch.onnx.export instead; the cost is negligible against the seconds-per-training-step budget.
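
A rough sketch of the "always re-export" approach follows. The function name `export_checkpoint_onnx` and its parameters are hypothetical and may not match the real `save_model_onnx` signature; the only library call used is `torch.onnx.export` with its `do_constant_folding` argument. The switch to eval mode is an assumption about how the model should be exported, not something this PR states.

```python
def export_checkpoint_onnx(model, example_input, path):
    """Hypothetical helper: write a fresh ONNX file on every save.

    No protobuf is cached between calls, so constant folding is redone
    from the current weights each time.
    """
    import torch  # imported lazily so this module loads without torch

    was_training = model.training
    model.eval()  # assumption: export with BatchNorm in inference mode
    try:
        torch.onnx.export(
            model,
            (example_input,),
            path,
            do_constant_folding=True,
        )
    finally:
        # Restore the previous mode so training can continue.
        model.train(was_training)
```

A training loop would call something like `export_checkpoint_onnx(model, dummy_input, "model_1.onnx")` after each checkpoint, where `dummy_input` is a tensor with the network's input shape.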

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@alejandromarcu alejandromarcu left a comment

Collaborator

Yay, great catch!

@jonbinney jonbinney merged commit 190b539 into main May 15, 2026
2 checks passed