Sophiex/dev ssl/main by sophie-xhonneux · Pull Request #2511 · ecmwf/WeatherGenerator

sophie-xhonneux · 2026-06-16T14:05:32Z

Description

This has several small fixes, including:

- Small "feet" at the end of training in the lr
- Updating the configs
- Multiple validation periods
- Predict latent being called during forecasting

## Checklist before asking for review

I have performed a self-review of my code
My changes comply with basic sanity checks:
- I have fixed formatting issues with ./scripts/actions.sh lint
- I have run unit tests with ./scripts/actions.sh unit-test
- I have documented my code and I have updated the docstrings.
- I have added unit tests, if relevant
I have tried my changes with data and code:
- I have run the integration tests with ./scripts/actions.sh integration-test
- (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
- (bigger changes and experiments) I have shared a hedgedoc in the GitHub issue with all the configurations and runs for these experiments
I have informed and aligned with people impacted by my change:
- for config changes: the MatterMost channels and/or a design doc
- for changes of dependencies: the MatterMost software development channel

Fix was the dataset length computation in the workers
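
The description attributes the lr "feet" to the dataset length computation in the workers, but the change itself is not shown here. As an illustration only (this is not the code from this PR), the sketch below shows how a per-worker length that rounds up can overcount samples compared with an exact split. The function names and the numbers are hypothetical. If a scheduler's total step count were derived from one count while the loader yields the other, the schedule and the run would end at different steps; that link to the "feet" is an assumption.

```python
def per_worker_len_ceil(total: int, num_workers: int) -> list[int]:
    """Naive split: every worker claims ceil(total / num_workers) samples."""
    per_worker = -(-total // num_workers)  # ceiling division
    return [per_worker] * num_workers


def per_worker_len_exact(total: int, num_workers: int) -> list[int]:
    """Exact split: the first (total % num_workers) workers take one extra sample."""
    base, extra = divmod(total, num_workers)
    return [base + (1 if w < extra else 0) for w in range(num_workers)]


# hypothetical sizes, chosen so that total is not divisible by the worker count
total_samples, workers = 1003, 8
naive = per_worker_len_ceil(total_samples, workers)
exact = per_worker_len_exact(total_samples, workers)
print(sum(naive), sum(exact))  # the naive sum exceeds the true dataset length
```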

shmh40 · 2026-06-16T14:20:45Z

+        }
+
+        if channels is not None:
+            # Respect the order given in the stream config so the channel layout is identical


Do you think we need this? Are we unsure at the moment whether this is necessary?

We are unsure if it is necessary
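
For reference, a minimal sketch of what "respect the order given in the stream config" could mean in practice. This is not the code from the diff: `select_channels`, the plain-list data layout, and the channel names are all hypothetical and only illustrate selecting channels in the requested order rather than the stored order.

```python
from typing import Optional


def select_channels(available: list[str], requested: Optional[list[str]]) -> list[int]:
    """Return indices into `available`, ordered as in `requested`.

    With `requested=None` every channel is kept in its stored order.
    """
    if requested is None:
        return list(range(len(available)))
    index_of = {name: i for i, name in enumerate(available)}
    missing = [name for name in requested if name not in index_of]
    if missing:
        raise KeyError(f"channels not found in dataset: {missing}")
    # iterate over `requested`, so the output layout follows the config order
    return [index_of[name] for name in requested]


stored = ["2t", "10u", "10v", "msl"]          # illustrative channel names
idx = select_channels(stored, ["msl", "2t"])  # config asks for a different order
row = [280.1, 3.2, -1.5, 101325.0]            # made-up values, one per stored channel
print([row[i] for i in idx])
```

Iterating over the requested list (instead of filtering the stored list) is what makes the layout independent of how the dataset happens to store its channels.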

shmh40 · 2026-06-16T14:26:48Z

-            output = self.predict_decoders(model_params, step, tokens, batch, output)
-            # latent predictions (raw and with SSL heads)
-            output = self.predict_latent(model_params, step, tokens, batch, output, intermediates)
+            if "masking" in self.cf.training_config.training_mode:


shmh40 · 2026-06-16T14:28:54Z


        for module in model.encoder.ae_local_engine.ae_local_blocks.modules():
-            if isinstance(module, modules_to_shard):
+            if isinstance(module, modules_to_shard) and _has_trainable_params(module):


This is the fsdp fix from this morning?
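
The body of `_has_trainable_params` is not part of the snippet above. A plausible sketch, assuming it checks whether any parameter of the module has `requires_grad` set, is given below. To keep it runnable without PyTorch, it uses small stand-in classes (`FakeParam`, `FakeModule`, both hypothetical); with real `torch.nn.Module` objects the same `parameters()` / `requires_grad` access applies.

```python
from dataclasses import dataclass, field


@dataclass
class FakeParam:
    """Stand-in for torch.nn.Parameter; only the requires_grad flag matters here."""
    requires_grad: bool = True


@dataclass
class FakeModule:
    """Stand-in for torch.nn.Module exposing a parameters() iterator."""
    params: list = field(default_factory=list)

    def parameters(self):
        return iter(self.params)


def has_trainable_params(module) -> bool:
    """True if at least one parameter of `module` would receive gradients."""
    return any(p.requires_grad for p in module.parameters())


frozen = FakeModule([FakeParam(False), FakeParam(False)])
mixed = FakeModule([FakeParam(False), FakeParam(True)])
empty = FakeModule([])

# only modules with something to train are selected for sharding
to_shard = [m for m in (frozen, mixed, empty) if has_trainable_params(m)]
print(len(to_shard))
```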

shmh40 · 2026-06-16T14:32:34Z

+                continue  # set disabled, e.g. by a train_continue override
+            stage_label = f"val_{name}"
+            extra_cfg = get_active_stage_config(self.validation_cfg, overrides, cfg_keys_to_filter)
+            # extra sets never write sample output files (would collide with primary val output)


So we only look at these in plot_train? And then if we run inference (using test that I suppose inherits from validation) we only ever write out one of the validation periods? Not important, just want to check the mechanics here.

not sure, didn't check to be honest
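
To make the mechanics in the snippet easier to discuss, here is a rough sketch of the three visible steps: skip disabled sets, label each set `val_{name}`, and turn off sample output for extra sets. Only the `val_{name}` label comes from the diff. The dict-based configs, the `enabled` and `write_samples` keys, the dates, and `plan_extra_validation` are hypothetical, and the real `get_active_stage_config` is deliberately not modelled.

```python
def plan_extra_validation(base_cfg: dict, extra_sets: dict) -> dict:
    """Build one config per enabled extra validation set, keyed by stage label."""
    plans = {}
    for name, overrides in extra_sets.items():
        if overrides is None or not overrides.get("enabled", True):
            continue  # set disabled, e.g. by an override
        cfg = {**base_cfg, **overrides}
        # assumption: extra sets never write sample files, so they cannot
        # collide with the output of the primary validation set
        cfg["write_samples"] = False
        plans[f"val_{name}"] = cfg
    return plans


base = {"start_date": "2021-01-01", "end_date": "2021-12-31", "write_samples": True}
extras = {
    "summer": {"start_date": "2022-06-01", "end_date": "2022-08-31"},
    "winter": {"enabled": False},  # explicitly disabled
    "legacy": None,                # removed by an override
}
plans = plan_extra_validation(base, extras)
print(sorted(plans))
```

Under these assumptions, only the primary validation set would ever produce sample files, which matches the comment in the diff but does not answer what inference does with the extra periods.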

shmh40 · 2026-06-16T14:34:25Z

        self.validate(0, self.test_cfg, self.batch_size_test_per_gpu)
        logger.info(f"Finished inference run with id: {cf.general.run_id}")

+    def _check_channel_order_consistency(


Can we keep this guard and remove the reordering you did above if we don't think we need it maybe?
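
A guard-only variant could look roughly like the sketch below. The body of `_check_channel_order_consistency` is not shown in this thread, so what it compares is an assumption: here two mappings from stream name to channel list (for example, one saved with a previous run and one from the current config) are compared, and any difference in content or order raises. The stream and channel names are illustrative.

```python
def check_channel_order_consistency(expected: dict, actual: dict) -> None:
    """Raise if any stream's channel list differs, in content or order."""
    problems = []
    for stream, channels in expected.items():
        found = actual.get(stream)
        if found is None:
            problems.append(f"{stream}: missing in current run")
        elif list(found) != list(channels):
            problems.append(f"{stream}: expected {list(channels)}, got {list(found)}")
    if problems:
        raise ValueError("channel order mismatch: " + "; ".join(problems))


saved = {"ERA5": ["2t", "10u", "msl"]}
current = {"ERA5": ["10u", "2t", "msl"]}  # same channels, different order

try:
    check_channel_order_consistency(saved, current)
except ValueError as err:
    print(err)
```

Failing loudly keeps the data path unchanged, at the cost of requiring the config to be fixed by hand when the order differs.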

shmh40 · 2026-06-16T14:34:53Z

        self.dataset = MultiStreamDataSampler(cf, self.training_cfg, stage=TRAIN)
        self.dataset_val = MultiStreamDataSampler(cf, self.validation_cfg, stage=VAL)

+        if run_id_contd is not None:


Thank you!!!

Sophie Xhonneux and others added 10 commits June 9, 2026 16:43

Implement first prototype to be tested

9af4d0b

Add geoinfo check and stream dir w/ era5 ch order

7e26241

Fix missing start date in oper forecast finetune

1a0b721

New configs

120ebb4

Add multiple validation periods

ab42cbf

Configs

6d6f347

Update configs

a8f4c5c

Update time_window_step for validation losses

8ccb7fd

Fix predict_latent being called in forecasting

ac53521

Fix small "feet" at the end of training in lr

90e180a

Fix was the dataset length computation in the workers

github-project-automation Bot added this to WeatherGen-dev Jun 16, 2026

shmh40 reviewed Jun 16, 2026


shmh40 approved these changes Jun 16, 2026


github-actions Bot added the infra and model labels Jun 16, 2026

Configs

1433194

sophie-xhonneux wants to merge 11 commits into develop-ssl from sophiex/dev-ssl/main

2 participants
