fix(checkpointing): save/restore SeedableRandomSampler for map-style datasets#4019
Open
Anai-Guo wants to merge 1 commit into
Open
fix(checkpointing): save/restore SeedableRandomSampler for map-style datasets#4019Anai-Guo wants to merge 1 commit into
Anai-Guo wants to merge 1 commit into
Conversation
…datasets DataLoader shuffle sequence replays from epoch 0 after resuming from a checkpoint when using `use_seedable_sampler=True` with map-style datasets. Root cause: the `IterableDatasetShard` guard in `save_state()` and `load_state()` prevents saving/restoring `SeedableRandomSampler.epoch` for non-iterable (map-style) datasets. On resume the epoch counter resets to 0, so the sampler seeds with `initial_seed + 0`, `initial_seed + 1`, ... replaying the exact same shuffle sequence seen before the checkpoint. Fix: remove the `IterableDatasetShard` guard and save/load any `SeedableRandomSampler` regardless of dataset type. On the load side add an `input_sampler_file.exists()` check for backwards-compat with checkpoints written before this change. Fixes huggingface#3996
Contributor
|
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
Author
|
Still relevant — happy to address any review feedback. Please keep open. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Fixes #3996.
When using
use_seedable_sampler=Truewith a map-style (non-iterable) dataset,save_state()/load_state()silently drops theSeedableRandomSampler.epochcounter. On resume the counter resets to 0, so the sampler re-derives the same seed sequence (initial_seed + 0,initial_seed + 1, …) that was already used before the checkpoint — the model trains on the exact same shuffle order it already saw.Root cause
save_state()andload_state()both gate sampler persistence on:SeedableRandomSampleris attached to both iterable and map-style dataloaders (whenuse_seedable_sampler=True), but the guard prevents it from being saved for the map-style case.Fix
Remove the
IterableDatasetShardguard and persist anySeedableRandomSampler, regardless of dataset type. On the load side, add aninput_sampler_file.exists()guard for backwards-compat with older checkpoints that do not contain a sampler file.The fix is identical in shape for both
save_stateandload_state; the load path additionally checksinput_sampler_file.exists()to stay compatible with checkpoints produced before this fix.🤖 Generated with Claude Code