Plot losses against elapsed training time via --x-axis flag#2501
Plot losses against elapsed training time via --x-axis flag#2501florianscheidl wants to merge 88 commits into
Conversation
- Track startup_time_seconds: time from run() start to training loop - Track total_training_time_seconds: time in training/validation cycles - Track overall_time_seconds: total wall-clock time from launch to finish - All metrics logged only on root rank to avoid file contention - Metrics written to metrics.json, automatically uploaded to MLflow - Console logs show timing summaries for quick monitoring
- Created .hermes/ directory with skills/, tasks/, docs/ subfolders - Added skills overview (README.md) with task-type skills - Implemented 'planning' and 'metrics' skills - Documented timing metrics task in tasks/2026-04-17-timing-metrics/ - Added agent structure documentation - Updated .gitignore with optional .hermes/ entry
- Added 2-3 month review cycle recommendation - Defined criteria for skill consolidation - Included usage frequency thresholds - Documented when to merge or remove skills
…dl/WeatherGenerator into ekfs/scaling-plots-20260417
…dl/WeatherGenerator into ekfs/scaling-plots-20260417
- Replace --x_type ('step'/'reltime') with --x-axis column selector
('samples', 'elapsed_training_time')
- Add x_axis param to plot_loss_avg (previously hardcoded num_samples)
- Add friendly x-axis labels: 'elapsed training time [s]' when
plotting against elapsed_training_time_seconds
- plot_lr, plot_loss_avg, plot_loss_per_stream, plot_loss_per_run all
now respect x_axis; xlabel is auto-derived from column name
- Remove dead x_type parameter from plot_loss_per_stream
clessig
left a comment
There was a problem hiding this comment.
Thanks for picking this up. It's important to keep Trainer as clean as possible so I would introduce only what is really necessary.
|
|
||
| try: | ||
| trainer.run(cf, devices) | ||
| trainer.run(cf, devices, t_start=t_start) |
There was a problem hiding this comment.
Can't we move this to Trainer?
| @@ -0,0 +1,36 @@ | |||
| # (C) Copyright 2024 WeatherGenerator contributors. | |||
| @@ -0,0 +1,77 @@ | |||
|
|
|||
| @@ -0,0 +1,289 @@ | |||
| # (C) Copyright 2025 WeatherGenerator contributors. | |||
| logger.info(f"Startup time: {startup_time:.2f} seconds") | ||
|
|
||
| # training loop | ||
| self.t_training_start = time.time() |
There was a problem hiding this comment.
For plot_train, timing should start here. This would also avoid that run_train is modified.
| self.validate_before_training() | ||
|
|
||
| # Log startup time | ||
| if is_root() and t_start is not None: |
There was a problem hiding this comment.
I don't think this is needed.
| "train", | ||
| { | ||
| "completed_mini_epoch": mini_epoch, | ||
| "training_time_after_mini_epoch_seconds": total_training_time, |
There was a problem hiding this comment.
Do we use this explicitly? Otherwise I would remove it. It can always be recovered by the time elapsed up to step k and the samples_per_mini_epoch (which is available through the config).
Sorry, this should not have been a PR, only a commit. The correct PR is in #2231 |
But we should have plot_train with elapsed time on the x-axis and it should be a separate PR. |
…dl/WeatherGenerator into ekfs/scaling-plots-20260417
Summary
Adds
--x-axissupport toplot_trainso losses can be plotted againstelapsed_training_time_secondsinstead of samples.Changes
--x_type('step'/'reltime') with--x-axisaccepting column namessamples(default) orelapsed_training_timex_axisparam (previously hardcodednum_samples); xlabel is auto-derivedelapsed_training_time_seconds, x-axis labels read "elapsed training time [s]" instead of the raw column nameplot_lr,plot_loss_avg,plot_loss_per_stream,plot_loss_per_run) now respect thex_axisparamx_typeparameter fromplot_loss_per_streamUsage