Skip to content

Plot losses against elapsed training time via --x-axis flag#2501

Open
florianscheidl wants to merge 88 commits into
ecmwf:developfrom
florianscheidl:flo/plot-training-time-axis
Open

Plot losses against elapsed training time via --x-axis flag#2501
florianscheidl wants to merge 88 commits into
ecmwf:developfrom
florianscheidl:flo/plot-training-time-axis

Conversation

@florianscheidl

Copy link
Copy Markdown
Contributor

Summary

Adds --x-axis support to plot_train so losses can be plotted against elapsed_training_time_seconds instead of samples.

Changes

  • CLI: Replace --x_type ('step'/'reltime') with --x-axis accepting column names samples (default) or elapsed_training_time
  • plot_loss_avg: Added x_axis param (previously hardcoded num_samples); xlabel is auto-derived
  • Friendly labels: When plotting against elapsed_training_time_seconds, x-axis labels read "elapsed training time [s]" instead of the raw column name
  • All four plotting functions (plot_lr, plot_loss_avg, plot_loss_per_stream, plot_loss_per_run) now respect the x_axis param
  • Removed dead x_type parameter from plot_loss_per_stream

Usage

# Plot against samples (default)
python -m weathergen.utils.plot_training -fy runs.yml -o ./plots

# Plot against elapsed training time
python -m weathergen.utils.plot_training -fy runs.yml -o ./plots --x-axis elapsed_training_time

florianscheidl and others added 30 commits April 17, 2026 14:43
- Track startup_time_seconds: time from run() start to training loop
- Track total_training_time_seconds: time in training/validation cycles
- Track overall_time_seconds: total wall-clock time from launch to finish
- All metrics logged only on root rank to avoid file contention
- Metrics written to metrics.json, automatically uploaded to MLflow
- Console logs show timing summaries for quick monitoring
- Created .hermes/ directory with skills/, tasks/, docs/ subfolders
- Added skills overview (README.md) with task-type skills
- Implemented 'planning' and 'metrics' skills
- Documented timing metrics task in tasks/2026-04-17-timing-metrics/
- Added agent structure documentation
- Updated .gitignore with optional .hermes/ entry
- Added 2-3 month review cycle recommendation
- Defined criteria for skill consolidation
- Included usage frequency thresholds
- Documented when to merge or remove skills
…dl/WeatherGenerator into ekfs/scaling-plots-20260417
…dl/WeatherGenerator into ekfs/scaling-plots-20260417
florianscheidl and others added 9 commits May 4, 2026 11:42
- Replace --x_type ('step'/'reltime') with --x-axis column selector
  ('samples', 'elapsed_training_time')
- Add x_axis param to plot_loss_avg (previously hardcoded num_samples)
- Add friendly x-axis labels: 'elapsed training time [s]' when
  plotting against elapsed_training_time_seconds
- plot_lr, plot_loss_avg, plot_loss_per_stream, plot_loss_per_run all
  now respect x_axis; xlabel is auto-derived from column name
- Remove dead x_type parameter from plot_loss_per_stream

@clessig clessig left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this up. It's important to keep Trainer as clean as possible so I would introduce only what is really necessary.

Comment thread src/weathergen/run_train.py Outdated

try:
trainer.run(cf, devices)
trainer.run(cf, devices, t_start=t_start)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we move this to Trainer?

Comment thread config/streams/era5_georing/era5.yml Outdated
@@ -0,0 +1,36 @@
# (C) Copyright 2024 WeatherGenerator contributors.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove from PR

Comment thread config/streams/era5_georing/geos.yml Outdated
@@ -0,0 +1,77 @@

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove from PR

Comment thread config/config_era5_georing.yml Outdated
@@ -0,0 +1,289 @@
# (C) Copyright 2025 WeatherGenerator contributors.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove from PR

logger.info(f"Startup time: {startup_time:.2f} seconds")

# training loop
self.t_training_start = time.time()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For plot_train, timing should start here. This would also avoid that run_train is modified.

Comment thread src/weathergen/train/trainer.py Outdated
self.validate_before_training()

# Log startup time
if is_root() and t_start is not None:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is needed.

Comment thread src/weathergen/train/trainer.py Outdated
"train",
{
"completed_mini_epoch": mini_epoch,
"training_time_after_mini_epoch_seconds": total_training_time,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we use this explicitly? Otherwise I would remove it. It can always be recovered by the time elapsed up to step k and the samples_per_mini_epoch (which is available through the config).

@florianscheidl

Copy link
Copy Markdown
Contributor Author

Thanks for picking this up. It's important to keep Trainer as clean as possible so I would introduce only what is really necessary.

Sorry, this should not have been a PR, only a commit. The correct PR is in #2231

@clessig

clessig commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Thanks for picking this up. It's important to keep Trainer as clean as possible so I would introduce only what is really necessary.

Sorry, this should not have been a PR, only a commit. The correct PR is in #2231

But we should have plot_train with elapsed time on the x-axis and it should be a separate PR.

@clessig clessig reopened this Jun 12, 2026
@florianscheidl florianscheidl mentioned this pull request Jun 12, 2026
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants