Plot losses against elapsed training time via --x-axis flag by florianscheidl · Pull Request #2501 · ecmwf/WeatherGenerator

florianscheidl · 2026-06-12T15:22:32Z

Summary

Adds --x-axis support to plot_train so losses can be plotted against elapsed_training_time_seconds instead of samples.

Changes

CLI: Replace --x_type ('step'/'reltime') with --x-axis accepting column names samples (default) or elapsed_training_time
plot_loss_avg: Added x_axis param (previously hardcoded num_samples); xlabel is auto-derived
Friendly labels: When plotting against elapsed_training_time_seconds, x-axis labels read "elapsed training time [s]" instead of the raw column name
All four plotting functions (plot_lr, plot_loss_avg, plot_loss_per_stream, plot_loss_per_run) now respect the x_axis param
Removed dead x_type parameter from plot_loss_per_stream
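
A minimal sketch of how the flag and the label derivation described above could fit together. This is not the PR's code: the helper names `build_parser` and `resolve_x_axis`, and the mapping from the `samples` choice to the `num_samples` column, are assumptions made for illustration.

```python
import argparse

# Assumed mapping from the CLI choice to the metrics column that is plotted.
X_AXIS_COLUMNS = {
    "samples": "num_samples",
    "elapsed_training_time": "elapsed_training_time_seconds",
}

# Columns that get a friendlier axis label than their raw name.
FRIENDLY_LABELS = {
    "elapsed_training_time_seconds": "elapsed training time [s]",
}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with only the --x-axis option (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Plot training losses")
    parser.add_argument(
        "--x-axis",
        choices=sorted(X_AXIS_COLUMNS),
        default="samples",
        help="quantity to plot the losses against",
    )
    return parser


def resolve_x_axis(choice: str) -> tuple[str, str]:
    """Return (column name, x-axis label) for a CLI choice."""
    column = X_AXIS_COLUMNS[choice]
    # Fall back to the raw column name when no friendly label is defined.
    return column, FRIENDLY_LABELS.get(column, column)


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(resolve_x_axis(args.x_axis))
```

Each plotting function would then take the resolved column as its `x_axis` argument and use the returned label as xlabel.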

Usage

# Plot against samples (default)
python -m weathergen.utils.plot_training -fy runs.yml -o ./plots

# Plot against elapsed training time
python -m weathergen.utils.plot_training -fy runs.yml -o ./plots --x-axis elapsed_training_time

- Track startup_time_seconds: time from run() start to training loop
- Track total_training_time_seconds: time in training/validation cycles
- Track overall_time_seconds: total wall-clock time from launch to finish
- All metrics logged only on root rank to avoid file contention
- Metrics written to metrics.json, automatically uploaded to MLflow
- Console logs show timing summaries for quick monitoring
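
A rough sketch of the root-only logging idea from this commit message. The function name `log_timing_metrics`, the one-JSON-record-per-line layout, and passing the rank check as a boolean are assumptions; the actual file format and logger in the repository may differ.

```python
import json
from pathlib import Path


def log_timing_metrics(path, stage: str, metrics: dict, is_root: bool) -> bool:
    """Append one timing record to a metrics file, but only on the root rank.

    Returns True if a record was written, False if the call was skipped.
    """
    if not is_root:
        # Non-root ranks skip the write so that only one process touches the file.
        return False
    record = {"stage": stage, **metrics}
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return True
```

Restricting the write to a single rank avoids several processes appending to the same file at once.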

- Created .hermes/ directory with skills/, tasks/, docs/ subfolders
- Added skills overview (README.md) with task-type skills
- Implemented 'planning' and 'metrics' skills
- Documented timing metrics task in tasks/2026-04-17-timing-metrics/
- Added agent structure documentation
- Updated .gitignore with optional .hermes/ entry

- Added 2-3 month review cycle recommendation
- Defined criteria for skill consolidation
- Included usage frequency thresholds
- Documented when to merge or remove skills

- Replace --x_type ('step'/'reltime') with --x-axis column selector ('samples', 'elapsed_training_time')
- Add x_axis param to plot_loss_avg (previously hardcoded num_samples)
- Add friendly x-axis labels: 'elapsed training time [s]' when plotting against elapsed_training_time_seconds
- plot_lr, plot_loss_avg, plot_loss_per_stream, plot_loss_per_run all now respect x_axis; xlabel is auto-derived from column name
- Remove dead x_type parameter from plot_loss_per_stream

clessig

Thanks for picking this up. It's important to keep Trainer as clean as possible so I would introduce only what is really necessary.

clessig · 2026-06-12T15:26:10Z


    try:
-        trainer.run(cf, devices)
+        trainer.run(cf, devices, t_start=t_start)


Can't we move this to Trainer?

clessig · 2026-06-12T16:55:25Z

@@ -0,0 +1,36 @@
+# (C) Copyright 2024 WeatherGenerator contributors.


Remove from PR

clessig · 2026-06-12T16:55:31Z

@@ -0,0 +1,77 @@
+


Remove from PR

clessig · 2026-06-12T16:55:40Z

@@ -0,0 +1,289 @@
+# (C) Copyright 2025 WeatherGenerator contributors.


Remove from PR

clessig · 2026-06-12T16:56:51Z

+            logger.info(f"Startup time: {startup_time:.2f} seconds")
+
        # training loop
+        self.t_training_start = time.time()


For plot_train, timing should start here. This would also avoid modifying run_train.
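
A small sketch of what starting the clock at the training loop could look like. The class `TrainingClock` and its injectable `now` argument are hypothetical; the diff above stores `t_training_start` directly on the Trainer and reads `time.time()`.

```python
import time


class TrainingClock:
    """Measures time elapsed since the training loop started (hypothetical helper)."""

    def __init__(self, now=time.time):
        # `now` is injectable so the clock can be tested without sleeping.
        self._now = now
        self.t_training_start = None

    def start(self) -> None:
        """Call once, immediately before entering the training loop."""
        self.t_training_start = self._now()

    def elapsed_seconds(self) -> float:
        """Seconds since start(); raises if the clock was never started."""
        if self.t_training_start is None:
            raise RuntimeError("start() must be called before elapsed_seconds()")
        return self._now() - self.t_training_start
```

With the start time taken inside the Trainer, nothing needs to be passed in from the calling script.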

clessig · 2026-06-12T17:01:25Z

        self.validate_before_training()

+        # Log startup time
+        if is_root() and t_start is not None:


I don't think this is needed.

clessig · 2026-06-12T17:08:19Z

+                "train",
+                {
+                    "completed_mini_epoch": mini_epoch,
+                    "training_time_after_mini_epoch_seconds": total_training_time,


Do we use this explicitly? Otherwise I would remove it. It can always be recovered from the time elapsed up to step k and the samples_per_mini_epoch (which is available through the config).
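
One way the recovery mentioned here could work, as a sketch only. It assumes the per-step log provides a cumulative sample count and the elapsed time at each logged step; the function name and argument layout are made up for illustration.

```python
from bisect import bisect_left


def time_after_mini_epochs(samples, elapsed_seconds, samples_per_mini_epoch):
    """Elapsed time at the first logged step that reaches each mini-epoch boundary.

    `samples` must be cumulative and sorted ascending, with `elapsed_seconds`
    aligned to it. Only mini-epochs completed within the log are returned.
    """
    if samples_per_mini_epoch <= 0:
        raise ValueError("samples_per_mini_epoch must be positive")
    times = []
    boundary = samples_per_mini_epoch
    while samples and boundary <= samples[-1]:
        # First logged step whose cumulative sample count reaches the boundary.
        idx = bisect_left(samples, boundary)
        times.append(elapsed_seconds[idx])
        boundary += samples_per_mini_epoch
    return times
```

If steps are logged less often than once per mini-epoch boundary, the recovered times are upper bounds rather than exact values.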

florianscheidl · 2026-06-12T17:18:52Z

> Thanks for picking this up. It's important to keep Trainer as clean as possible so I would introduce only what is really necessary.

Sorry, this should not have been a PR, only a commit. The correct PR is in #2231

clessig · 2026-06-12T17:28:03Z

> > Thanks for picking this up. It's important to keep Trainer as clean as possible so I would introduce only what is really necessary.
>
> Sorry, this should not have been a PR, only a commit. The correct PR is in #2231

But we should have plot_train with elapsed time on the x-axis and it should be a separate PR.

florianscheidl and others added 30 commits April 17, 2026 14:43

docs: add skills review cycle for periodic compactification

df69dcb

configs

7f0648f

Merge branch 'feature/timing-metrics' into ekfs/scaling-plots-20260417

8fe45a0

Remove hermes tool tracking for now

dd55fb0

Try duration metrics

09b6e82

Update metrics, store after each mini-epoch

da3c29b

Refactor configs/streams

fc9a111

Extract scaling data

cfc4c62

Script to generate scaling plots

82b503a

Script update

70053b1

Repeat data in mini epoch

0c2df97

corrected time window length

2c79d28

Merge branch 'ekfs/scaling-plots-20260417' of github.com:florianschei…

6374986

…dl/WeatherGenerator into ekfs/scaling-plots-20260417

Lower to 512 samples per mini epoch

b5d70f6

Updated extraction script

f46828c

Merge branch 'ekfs/scaling-plots-20260417' of github.com:florianschei…

89ac519

…dl/WeatherGenerator into ekfs/scaling-plots-20260417

Log time more often

7cad6b5

Fix training start scope

30ac102

Minimal validation

5e7f63e

Increase samples_per_mini_epoch to 1024

2be95c6

Final training duration and terminal/metric logging

93b203b

log metrics after mini-epoch

2b708e3

Log metrics after mini-epoch, change schema

0d8407d

MEtric typo

422fc60

Logging refactor

f63cba9

Update extraction script

b596c14

NNode extraction

42ba646

Logs path

c9fa64d

florianscheidl and others added 9 commits May 4, 2026 11:42

Formatting and style fixes

4f11519

Update config

b02b38f

Avoid duplicate metrics

55d8219

Fix lint issues

904713d

t_training in __init__

9ecd544

Renamed metric

9f02dc1

Merge branch 'develop' into ekfs/scaling-plots-20260417

bfd5424

mv performance package

0785e3b

github-project-automation Bot added this to WeatherGen-dev Jun 12, 2026

florianscheidl closed this Jun 12, 2026

github-project-automation Bot moved this to Done in WeatherGen-dev Jun 12, 2026

clessig reviewed Jun 12, 2026

Fewer changes

367454d

clessig reopened this Jun 12, 2026

florianscheidl and others added 12 commits June 12, 2026 21:05

rm configs

d0f851c

Remove startup time

2f25ee8

Remove startup time

198b542

Merge branch 'develop' into flo/plot-training-time-axis

6f7ff39

Formatting and removed time per epoch

0de8660

Merge branch 'flo/plot-training-time-axis' into ekfs/scaling-plots-20…

11bcd2b

…260417

Undo pyproject change

39f8075

Merge branch 'develop' into ekfs/scaling-plots-20260417

41b1fbf

ploting changes wip

61ad9cf

Undo pyproject changes

e099f84

Merge branch 'ekfs/scaling-plots-20260417' of github.com:florianschei…

a45c255

…dl/WeatherGenerator into ekfs/scaling-plots-20260417

Merge branch 'ekfs/scaling-plots-20260417' into flo/plot-training-tim…

68c0428

…e-axis

florianscheidl mentioned this pull request Jun 12, 2026

[2186] Scaling utils #2231

Open

4 tasks

florianscheidl wants to merge 88 commits into ecmwf:develop from florianscheidl:flo/plot-training-time-axis
