fix: seed timer at training start to avoid AttributeError #941
Merged
Code Review
This pull request adds the on_train_start hook to initialize start_time and ckpt_time in the PyTorch Lightning simulation script. The reviewer suggested adding an explanatory comment to this method to clarify that it prevents potential AttributeError issues during mid-epoch resumption, ensuring future developers understand its necessity.
…g-cpu/helm_chart/llama_3_1_8b_cpu_sim.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@           Coverage Diff           @@
##             main     #941   +/-   ##
=======================================
  Coverage   89.68%   89.68%
=======================================
  Files          16       16
  Lines        3579     3579
=======================================
  Hits         3210     3210
  Misses        369      369
Yonghui-Lee
approved these changes
Jun 30, 2026
yuxin00j
approved these changes
Jun 30, 2026
Problem
When resuming training from a checkpoint mid-epoch, the workload fails with the following AttributeError:
[rank6]:   File "/workload/configs/llama_3_1_8b_cpu_sim.py", line 302, in on_train_batch_end
[rank6]:     step_time = time.perf_counter() - self.start_time - self.ckpt_time
[rank6]:                                       ^^^^^^^^^^^^^^^
[rank6]: AttributeError: 'StepTimeCallback' object has no attribute 'start_time'
Root Cause
The StepTimeCallback was initializing self.start_time in the on_train_epoch_start hook. However, when PyTorch Lightning resumes training mid-epoch (e.g., from a checkpoint saved at step=1), the on_train_epoch_start hook is skipped for that epoch because the epoch is already considered in progress. As a result, self.start_time is never initialized, causing a crash on the first call to on_train_batch_end.
Solution
Initialize self.start_time and self.ckpt_time in the on_train_start hook. This hook is guaranteed to run at the beginning of trainer.fit(), regardless of whether it is a fresh start or a resume.
The on_train_epoch_start hook is retained to reset the timer at the start of subsequent epochs.
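The failure and the fix can be sketched without Lightning installed. This is an illustration only, not the code in llama_3_1_8b_cpu_sim.py: the class names LegacyStepTimeCallback and FixedStepTimeCallback and the simulate_fit driver are hypothetical stand-ins, the hooks take no trainer arguments, and the hook bodies are assumed from the description above. The driver only mimics the hook order described in this PR (on_train_start always runs, on_train_epoch_start is skipped on a mid-epoch resume).

```python
import time


class LegacyStepTimeCallback:
    """Hypothetical pre-fix callback: timers are seeded only per epoch."""

    def on_train_epoch_start(self):
        # Assumed behaviour: reset both timers at the start of an epoch.
        self.start_time = time.perf_counter()
        self.ckpt_time = 0.0

    def on_train_batch_end(self):
        # Same expression as in the traceback; fails if start_time is unset.
        return time.perf_counter() - self.start_time - self.ckpt_time


class FixedStepTimeCallback(LegacyStepTimeCallback):
    """Hypothetical post-fix callback: timers are also seeded once per fit."""

    def on_train_start(self):
        # Seed the timers here so they exist even when the epoch-start hook
        # is skipped on a mid-epoch resume.
        self.start_time = time.perf_counter()
        self.ckpt_time = 0.0


def simulate_fit(callback, resume_mid_epoch, steps=2):
    """Stand-in for trainer.fit() that only reproduces the hook order."""
    if hasattr(callback, "on_train_start"):
        callback.on_train_start()
    if not resume_mid_epoch:
        # On a mid-epoch resume the epoch counts as already in progress,
        # so this hook is not called for that epoch.
        callback.on_train_epoch_start()
    return [callback.on_train_batch_end() for _ in range(steps)]


if __name__ == "__main__":
    try:
        simulate_fit(LegacyStepTimeCallback(), resume_mid_epoch=True)
    except AttributeError as exc:
        print(f"legacy callback on resume: {exc}")
    step_times = simulate_fit(FixedStepTimeCallback(), resume_mid_epoch=True)
    print(f"fixed callback on resume: {len(step_times)} step times recorded")
```

Keeping the epoch-start reset in the fixed class mirrors the PR: on_train_start covers the resume case, while on_train_epoch_start still restarts the timer for later epochs.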