fix: seed timer at training start to avoid AttributeError #941

Merged: zhixiangli merged 2 commits into fsspec:main from zhixiangli:fix-start-time-error on Jun 30, 2026

@zhixiangli zhixiangli commented Jun 29, 2026

Collaborator

Problem

When resuming training from a checkpoint mid-epoch, the workload fails with the following AttributeError:

[rank6]: File "/workload/configs/llama_3_1_8b_cpu_sim.py", line 302, in on_train_batch_end
[rank6]: step_time = time.perf_counter() - self.start_time - self.ckpt_time
[rank6]: ^^^^^^^^^^^^^^^
[rank6]: AttributeError: 'StepTimeCallback' object has no attribute 'start_time'

Root Cause

The StepTimeCallback initialized self.start_time only in the on_train_epoch_start hook. However, when PyTorch Lightning resumes training mid-epoch (e.g., from a checkpoint saved at step=1), the on_train_epoch_start hook is skipped for that epoch because the epoch is already considered in progress. As a result, self.start_time is never set, and the first call to on_train_batch_end crashes.
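
To make the failure mode concrete, here is a minimal sketch that reproduces the same kind of AttributeError without Lightning installed. The class name EpochOnlyTimer and the helper resume_mid_epoch are hypothetical stand-ins, not the code from the workload. The method names and arguments mirror Lightning's callback hooks, but nothing here imports or exercises Lightning itself.

```python
import time


class EpochOnlyTimer:
    """Hypothetical stand-in: the timer is seeded only in the epoch-start hook."""

    def on_train_epoch_start(self, trainer, pl_module):
        # Assumption: both attributes are created here and nowhere else.
        self.start_time = time.perf_counter()
        self.ckpt_time = 0.0

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # Same expression as in the traceback; it needs self.start_time to exist.
        return time.perf_counter() - self.start_time - self.ckpt_time


def resume_mid_epoch(callback):
    """Mimic a mid-epoch resume: the epoch-start hook is never invoked."""
    try:
        callback.on_train_batch_end(None, None, None, None, 1)
    except AttributeError as exc:
        return str(exc)
    return None


message = resume_mid_epoch(EpochOnlyTimer())
```

In this sketch, message holds the text of the AttributeError raised because start_time was never assigned before the first batch ended.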

Solution

Initialize self.start_time and self.ckpt_time in the on_train_start hook. This hook is guaranteed to run at the beginning of trainer.fit(), regardless of whether it is a fresh start or a resume.

The on_train_epoch_start hook is retained to reset the timer at the start of subsequent epochs.
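
The shape of the fix can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged diff: the class is a plain Python object so the snippet runs without Lightning (the real callback would subclass Lightning's Callback), the step_times list and the simulate_mid_epoch_resume helper are hypothetical, and restarting the timer after each batch is assumed rather than taken from the PR.

```python
import time


class StepTimeCallback:
    """Stand-in sketch; the real callback would subclass Lightning's Callback."""

    def __init__(self):
        self.step_times = []  # hypothetical: kept only so the sketch has output

    def on_train_start(self, trainer, pl_module):
        # Seed both attributes once at the start of fit(), so they exist
        # even when the epoch-start hook is skipped on a mid-epoch resume.
        self.start_time = time.perf_counter()
        self.ckpt_time = 0.0

    def on_train_epoch_start(self, trainer, pl_module):
        # Retained: reset the timer at the start of each subsequent epoch.
        self.start_time = time.perf_counter()

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        step_time = time.perf_counter() - self.start_time - self.ckpt_time
        self.step_times.append(step_time)
        # Assumption: restart the timer so the next step is measured on its own.
        self.start_time = time.perf_counter()
        self.ckpt_time = 0.0


def simulate_mid_epoch_resume(callback):
    """Call the hooks in the order described above for a mid-epoch resume."""
    callback.on_train_start(None, None)
    # on_train_epoch_start is deliberately not called here.
    callback.on_train_batch_end(None, None, None, None, 1)
    return callback.step_times


times = simulate_mid_epoch_resume(StepTimeCallback())
```

With on_train_start seeding the attributes, the first on_train_batch_end call completes and records one step time instead of raising.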

TAG=agy
CONV=3319ee1f-f74b-46be-8ac8-2282151c2ff3

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request adds the on_train_start hook to initialize start_time and ckpt_time in the PyTorch Lightning simulation script. The reviewer suggested adding an explanatory comment to this method to clarify that it prevents potential AttributeError issues during mid-epoch resumption, ensuring future developers understand its necessity.

…g-cpu/helm_chart/llama_3_1_8b_cpu_sim.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

codecov Bot commented Jun 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.68%. Comparing base (7e658e0) to head (5dd0b53).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #941   +/-   ##
=======================================
  Coverage   89.68%   89.68%           
=======================================
  Files          16       16           
  Lines        3579     3579           
=======================================
  Hits         3210     3210           
  Misses        369      369           

@zhixiangli zhixiangli merged commit 3fb3782 into fsspec:main Jun 30, 2026
10 checks passed
@zhixiangli zhixiangli deleted the fix-start-time-error branch June 30, 2026 01:07