[2186] Scaling utils by florianscheidl · Pull Request #2231 · ecmwf/WeatherGenerator

florianscheidl · 2026-04-17T13:55:37Z

Description

Contains utilities we used for scaling experiments, namely code to capture elapsed-time metrics in the trainer and run_train.
Small code modifications became necessary once we trained on more than 16 nodes. Specifically:

- Lower bound on warmup, cooldown, and decay steps in lr_scheduler.
- Lower bound on beta2 used in the Adam optimizer (negative values are illegal). Set to 0.9 for now, based on a suggestion found online; open to alternatives.
- Lower bound on len_per_rank when setting mini_epoch_base (avoids a zero-division error for small configs).
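
For readers skimming the description, the three lower bounds above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names `SchedulerSteps`, `clamp_scheduler_steps`, `clamp_beta2`, and `clamp_len_per_rank` are hypothetical, the floor of 1 step is an assumption, and how beta2 and len_per_rank are actually derived in the trainer is not shown here.

```python
from dataclasses import dataclass


@dataclass
class SchedulerSteps:
    """Hypothetical container for the three lr_scheduler phase lengths."""

    warmup: int
    decay: int
    cooldown: int


def clamp_scheduler_steps(steps: SchedulerSteps, min_steps: int = 1) -> SchedulerSteps:
    """Keep every phase at least `min_steps` long (floor of 1 is an assumption)."""
    return SchedulerSteps(
        warmup=max(steps.warmup, min_steps),
        decay=max(steps.decay, min_steps),
        cooldown=max(steps.cooldown, min_steps),
    )


def clamp_beta2(beta2: float, floor: float = 0.9) -> float:
    """Floor beta2 at 0.9, the value named in the description."""
    return max(beta2, floor)


def clamp_len_per_rank(dataset_len: int, world_size: int) -> int:
    """Per-rank share of the data, never below 1 so it is safe as a divisor."""
    return max(dataset_len // world_size, 1)


if __name__ == "__main__":
    print(clamp_scheduler_steps(SchedulerSteps(warmup=0, decay=-3, cooldown=5)))
    print(clamp_beta2(-0.2), clamp_len_per_rank(8, 64))
```

The common pattern is a plain `max(value, floor)` applied right where the derived quantity is computed, so that small configs or large node counts cannot push it into an illegal range.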

The data extraction and plot generation scripts are addressed in https://gitlab.jsc.fz-juelich.de/esde/WeatherGenerator-private/-/merge_requests/180.

Example plots and tables:

Issue Number

Fixes #2186

Checklist before asking for review

I have performed a self-review of my code
My changes comply with basic sanity checks:
- I have fixed formatting issues with ./scripts/actions.sh lint
- I have run unit tests with ./scripts/actions.sh unit-test
- I have documented my code and I have updated the docstrings.
- I have added unit tests, if relevant
I have tried my changes with data and code:
- I have run the integration tests with ./scripts/actions.sh integration-test
- (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
- (bigger changes and experiments) I have shared a hedgedoc in the github issue with all the configurations and runs for these experiments
I have informed and aligned with people impacted by my change:
- for config changes: the MatterMost channels and/or a design doc
- for changes of dependencies: the MatterMost software development channel

- Track startup_time_seconds: time from run() start to training loop
- Track total_training_time_seconds: time in training/validation cycles
- Track overall_time_seconds: total wall-clock time from launch to finish
- All metrics logged only on root rank to avoid file contention
- Metrics written to metrics.json, automatically uploaded to MLflow
- Console logs show timing summaries for quick monitoring
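
The timing scheme in the list above could look roughly like the sketch below. It is only an illustration under stated assumptions: `TrainingTimer` and `write_metrics` are hypothetical names, the root rank is assumed to be rank 0, and the MLflow upload and the startup-time metric are left out.

```python
import json
import time
from pathlib import Path


class TrainingTimer:
    """Accumulates wall-clock time spent in training/validation cycles."""

    def __init__(self) -> None:
        # Reference point for overall_time_seconds (taken at construction).
        self.launch_time = time.perf_counter()
        self.total_training_time_seconds = 0.0
        self._cycle_start = None

    def start_cycle(self) -> None:
        self._cycle_start = time.perf_counter()

    def end_cycle(self) -> None:
        # Only add time if a cycle is actually running.
        if self._cycle_start is not None:
            self.total_training_time_seconds += time.perf_counter() - self._cycle_start
            self._cycle_start = None

    def summary(self) -> dict:
        return {
            "total_training_time_seconds": self.total_training_time_seconds,
            "overall_time_seconds": time.perf_counter() - self.launch_time,
        }


def write_metrics(metrics: dict, run_dir: Path, rank: int) -> bool:
    """Write metrics.json on the root rank only; return True if a file was written."""
    if rank != 0:  # assumption: root rank is rank 0
        return False
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "metrics.json").write_text(json.dumps(metrics, indent=2))
    return True
```

Writing from a single rank keeps the ranks from racing on the same file, which is the contention issue the list mentions.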

- Created .hermes/ directory with skills/, tasks/, docs/ subfolders
- Added skills overview (README.md) with task-type skills
- Implemented 'planning' and 'metrics' skills
- Documented timing metrics task in tasks/2026-04-17-timing-metrics/
- Added agent structure documentation
- Updated .gitignore with optional .hermes/ entry

- Added 2-3 month review cycle recommendation
- Defined criteria for skill consolidation
- Included usage frequency thresholds
- Documented when to merge or remove skills

florianscheidl · 2026-06-08T08:21:24Z

@clessig, anything blocking?

clessig

See detailed comments.

clessig · 2026-06-08T08:40:11Z

        stddev_all: dict,
        avg_loss: list[float] = None,
        lr: float = None,
+        elapsed_training_time_seconds: float | None = None,


Can you open a PR that changes plot_train so that we can plot losses against training time?

- Replace --x_type ('step'/'reltime') with --x-axis column selector ('samples', 'elapsed_training_time')
- Add x_axis param to plot_loss_avg (previously hardcoded num_samples)
- Add friendly x-axis labels: 'elapsed training time [s]' when plotting against elapsed_training_time_seconds
- plot_lr, plot_loss_avg, plot_loss_per_stream, plot_loss_per_run all now respect x_axis; xlabel is auto-derived from column name
- Remove dead x_type parameter from plot_loss_per_stream
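
As a rough illustration of the `--x-axis` selector and label derivation described above, a sketch might look like this. The helper names `build_parser` and `x_label`, the default of `samples`, and the fallback rule for unknown columns are assumptions; only the flag name, its two choices, and the friendly label string come from the list.

```python
import argparse

# Friendly labels; the string for elapsed time is the one quoted in the list above.
FRIENDLY_LABELS = {
    "elapsed_training_time": "elapsed training time [s]",
    "elapsed_training_time_seconds": "elapsed training time [s]",
}


def build_parser() -> argparse.ArgumentParser:
    """Parser with a column selector for the x-axis (default is an assumption)."""
    parser = argparse.ArgumentParser(description="x-axis selection sketch")
    parser.add_argument(
        "--x-axis",
        choices=["samples", "elapsed_training_time"],
        default="samples",
        help="column to plot losses against",
    )
    return parser


def x_label(column: str) -> str:
    """Derive the axis label from the column name, preferring a friendly label."""
    return FRIENDLY_LABELS.get(column, column.replace("_", " "))


if __name__ == "__main__":
    args = build_parser().parse_args(["--x-axis", "elapsed_training_time"])
    print(args.x_axis, "->", x_label(args.x_axis))
```

Passing the selected column name through to every plotting function, and deriving the label in one place, is what lets all plots respect the same flag.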

florianscheidl · 2026-06-12T19:24:27Z

I've implemented the changes discussed here and in #2501. The latter will only have the plotting changes.

florianscheidl added 4 commits April 17, 2026 14:43

docs: add skills review cycle for periodic compactification

df69dcb

configs

7f0648f

github-project-automation Bot added this to WeatherGen-dev Apr 17, 2026

florianscheidl and others added 25 commits April 17, 2026 15:58

Merge branch 'feature/timing-metrics' into ekfs/scaling-plots-20260417

8fe45a0

Remove hermes tool tracking for now

dd55fb0

Try duration metrics

09b6e82

Update metrics, store after each mini-epoch

da3c29b

Refactor configs/streams

fc9a111

Extract scaling data

cfc4c62

Script to generate scaling plots

82b503a

Script update

70053b1

Repeat data in mini epoch

0c2df97

corrected time window length

2c79d28

Merge branch 'ekfs/scaling-plots-20260417' of github.com:florianschei…

6374986

…dl/WeatherGenerator into ekfs/scaling-plots-20260417

Lower to 512 samples per mini epoch

b5d70f6

Updated extraction script

f46828c

Merge branch 'ekfs/scaling-plots-20260417' of github.com:florianschei…

89ac519

…dl/WeatherGenerator into ekfs/scaling-plots-20260417

Log time more often

7cad6b5

Fix training start scope

30ac102

Minimal validation

5e7f63e

Increase samples_per_mini_epoch to 1024

2be95c6

Final training duration and terminal/metric logging

93b203b

log metrics after mini-epoch

2b708e3

Log metrics after mini-epoch, change schema

0d8407d

MEtric typo

422fc60

Logging refactor

f63cba9

Update extraction script

b596c14

NNode extraction

42ba646

florianscheidl added 4 commits May 4, 2026 11:41

Refactor logging and move time for mini epoch logging outside loop

dad5462

Formatting and style fixes

4f11519

Update config

b02b38f

Avoid duplicate metrics

55d8219

florianscheidl marked this pull request as ready for review May 4, 2026 09:49

florianscheidl added 2 commits May 4, 2026 11:57

Fix lint issues

904713d

t_training in __init__

9ecd544

github-actions Bot added the performance Work related to performance improvements label May 4, 2026

ekouts suggested changes May 8, 2026

Comment thread src/weathergen/train/trainer.py Outdated

github-project-automation Bot moved this to In Progress in WeatherGen-dev May 8, 2026

Renamed metric

9f02dc1

ekouts approved these changes May 8, 2026

Merge branch 'develop' into ekfs/scaling-plots-20260417

bfd5424

clessig reviewed Jun 8, 2026

florianscheidl added 2 commits June 12, 2026 17:16

mv performance package

0785e3b

florianscheidl mentioned this pull request Jun 12, 2026

Plot losses against elapsed training time via --x-axis flag #2501

Open

florianscheidl and others added 11 commits June 12, 2026 19:26

Fewer changes

367454d

rm configs

d0f851c

Remove startup time

2f25ee8

Remove startup time

198b542

Merge branch 'develop' into flo/plot-training-time-axis

6f7ff39

Formatting and removed time per epoch

0de8660

Merge branch 'flo/plot-training-time-axis' into ekfs/scaling-plots-20…

11bcd2b

…260417

Undo pyproject change

39f8075

Merge branch 'develop' into ekfs/scaling-plots-20260417

41b1fbf

Undo pyproject changes

e099f84

Merge branch 'ekfs/scaling-plots-20260417' of github.com:florianschei…

a45c255

…dl/WeatherGenerator into ekfs/scaling-plots-20260417

[2186] Scaling utils #2231
florianscheidl wants to merge 86 commits into ecmwf:develop from florianscheidl:ekfs/scaling-plots-20260417

3 participants
