Enhance CPU simulator logging and throughput metrics #937

Merged
zhixiangli merged 3 commits into fsspec:main from yuxin00j:apply-sim-logging on Jun 26, 2026

@yuxin00j yuxin00j commented Jun 26, 2026

Collaborator

Description

This PR adds logging enhancements and throughput calculation fixes to the macrobenchmark CPU simulator in gcsfs.

Key Changes

  • Checkpoint Overhead Exclusion: Checkpointing duration (both saving and removing) is now tracked and deducted from step_time to prevent skewed throughput numbers during checkpoint steps.
  • Accurate Data Loading Metrics: The step timer is initialized at on_train_epoch_start to ensure the first batch captures the initial data loading delay.
  • Detailed Throughput Logging: Emits both local_throughput and global_throughput per optimizer step.
  • Targeted Profiling Hooks: Adds profiler hooks to isolate and measure FitLoop.setup_data and _PrefetchDataFetcher.__iter__ (worker spawn times).
  • Dataset Load Timing: Injects a timer around datasets.load_dataset to measure HF dataset preparation time.
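The checkpoint exclusion and epoch-start timer described above can be sketched without Lightning. This is a minimal illustration under stated assumptions, not the code merged in this PR: the class name `StepTimer`, its method names, and the injectable `clock` parameter are all hypothetical and exist only for this example.

```python
import time


class StepTimer:
    """Hypothetical helper: measures step time net of checkpoint overhead."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock              # injectable so the timer can be tested
        self.step_start = None
        self.checkpoint_seconds = 0.0

    def start_epoch(self):
        # Start timing at epoch start so the first step also captures
        # the initial data loading delay.
        self.step_start = self.clock()
        self.checkpoint_seconds = 0.0

    def add_checkpoint_time(self, seconds):
        # Accumulate time spent saving or removing checkpoints in this step.
        self.checkpoint_seconds += seconds

    def end_step(self):
        # Net step time = wall-clock time since the last step boundary
        # minus the checkpoint time recorded during the step.
        now = self.clock()
        step_time = (now - self.step_start) - self.checkpoint_seconds
        # The end of this step is the start of the next one.
        self.step_start = now
        self.checkpoint_seconds = 0.0
        return step_time
```

In a Lightning callback, `start_epoch` would presumably be called from `on_train_epoch_start` and `end_step` once per optimizer step. How the checkpoint duration is actually measured in the PR is not shown on this page.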

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces profiling hooks and improves throughput tracking in the Llama 3.1 CPU simulation script by excluding checkpointing overhead from step time calculations and adding local/global throughput metrics. The review feedback highlights critical issues with this implementation: a potential division-by-zero error if the calculated step time is zero or negative, and inaccurate step time tracking on non-zero ranks in DDP environments because checkpoint saving and deletion are primarily executed on rank 0. To address these, the reviewer suggests guarding against non-positive step times and broadcasting checkpoint durations from rank 0 to all other ranks.
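The guard against non-positive step times suggested in the review could look roughly like the sketch below. The function name `step_throughput`, the choice to return `None` rather than `0.0`, and the assumption that global throughput equals local throughput multiplied by `world_size` are illustrative and are not taken from the PR. Broadcasting the checkpoint duration from rank 0 is not shown here.

```python
def step_throughput(samples_per_step, step_time, world_size=1):
    """Return (local, global) throughput in samples per second.

    Returns (None, None) when step_time is zero or negative, which can
    happen if the deducted checkpoint time meets or exceeds the measured
    wall-clock time for the step.
    """
    if step_time <= 0:
        # Skip the metric instead of dividing by zero or logging a
        # negative throughput.
        return None, None
    local = samples_per_step / step_time
    # Assumption: every rank processes the same number of samples per step.
    return local, local * world_size
```

A caller would then log `local_throughput` and `global_throughput` only when the returned values are not `None`.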

@yuxin00j yuxin00j changed the title from "Update cpu sim with logging enhancements and throughput fixes" to "Enhance CPU simulator logging and throughput metrics" on Jun 26, 2026
codecov Bot commented Jun 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.77%. Comparing base (381c33e) to head (817e3d2).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #937   +/-   ##
=======================================
  Coverage   89.77%   89.77%           
=======================================
  Files          16       16           
  Lines        3569     3569           
=======================================
  Hits         3204     3204           
  Misses        365      365           

@yuxin00j yuxin00j marked this pull request as ready for review June 26, 2026 03:34
@yuxin00j yuxin00j requested a review from zhixiangli June 26, 2026 03:34
@zhixiangli zhixiangli merged commit 3bd3383 into fsspec:main Jun 26, 2026
10 checks passed

2 participants