[2502][evaluation] score bug fixes #2503

Open
moritzhauschulz wants to merge 2 commits into ecmwf:develop from moritzhauschulz:mh/develop-score-bug-fixes

@moritzhauschulz

Contributor

Description

Enables the previously dead probabilistic metrics (spread, ssr, crps, rank_histogram) in the evaluation package, corrects the spread-skill ratio to the standard ensemble-mean definition, and makes the spatial score-map path robust for metrics that collapse the ensemble dimension. This unblocks GenCast-style spread-skill diagnostics over lead time and ensemble-spread maps.

I am not very familiar with the eval pipeline, so it would be good if someone more knowledgeable could review this.

scores/score.py

  • Enable probabilistic-metric dispatch. Replace the dead assert self.ens_dim … / return None
    (undefined self.ens_dim, unconditional return) with a real check: warn and skip when the
    ensemble dim self._ens_dim is absent from the predictions (e.g. deterministic runs), otherwise
    dispatch to the metric function. This activates spread, ssr, crps, and rank_histogram.
  • Correct the spread-skill ratio. calc_ssr now divides the ensemble spread by the RMSE of
    the ensemble mean (the "skill", following the GenCast / WeatherBench2 convention), i.e.
    calc_spread(p) / calc_rmse(p.mean("ens"), gt), instead of by the full-ensemble per-member RMSE.
    SSR is now a single value per variable/level/lead-time with the standard calibration
    interpretation (under-/over-dispersion), consistent with the already ensemble-reduced spread
    numerator.
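
The corrected ratio can be sketched as follows. This is only an illustration, not the code in scores/score.py: the bodies of calc_spread and calc_rmse below are assumptions (spread taken as the square root of the mean ensemble variance with ddof=1, RMSE averaged over all remaining dimensions), and the toy arrays are made up.

```python
import numpy as np
import xarray as xr


def calc_spread(p, ens_dim="ens"):
    # Assumed definition: sqrt of the ensemble variance averaged over all other dims.
    return np.sqrt(p.var(ens_dim, ddof=1).mean())


def calc_rmse(pred, gt):
    # Assumed definition: root-mean-square error over all dims.
    return np.sqrt(((pred - gt) ** 2).mean())


def calc_ssr(p, gt, ens_dim="ens"):
    # Spread divided by the RMSE of the ensemble mean (the "skill").
    return calc_spread(p, ens_dim) / calc_rmse(p.mean(ens_dim), gt)


# Toy data: 2 members, 2 grid points.
p = xr.DataArray(np.array([[0.0, 2.0], [2.0, 4.0]]), dims=("ens", "x"))
gt = xr.DataArray(np.array([2.0, 2.0]), dims=("x",))

ssr = calc_ssr(p, gt)  # scalar: the ensemble dimension is fully reduced
```

With these definitions the result carries no ens dimension. Values below 1 would point to under-dispersion and values above 1 to over-dispersion.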

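The dispatch guard described in the first bullet above might look roughly like the sketch below. The function name dispatch_probabilistic, the module-level ENS_DIM (standing in for self._ens_dim), and toy_spread are hypothetical and chosen for illustration; only the warn-and-skip behaviour is taken from the description.

```python
import logging

import numpy as np
import xarray as xr

logger = logging.getLogger(__name__)

ENS_DIM = "ens"  # assumption: stands in for self._ens_dim


def dispatch_probabilistic(metric_fn, preds, gt, ens_dim=ENS_DIM):
    """Call metric_fn only when preds carries the ensemble dimension."""
    if ens_dim not in preds.dims:
        # Deterministic run: warn and skip instead of failing.
        logger.warning(
            "Skipping %s: no '%s' dimension in predictions.", metric_fn.__name__, ens_dim
        )
        return None
    return metric_fn(preds, gt)


def toy_spread(preds, gt):
    # Hypothetical stand-in for a probabilistic metric: ensemble standard deviation.
    return preds.std(ENS_DIM)


gt = xr.DataArray(np.zeros(3), dims=("x",))
ensemble = xr.DataArray(np.array([[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]), dims=("ens", "x"))
deterministic = ensemble.isel(ens=0, drop=True)  # no ens dim left

skipped = dispatch_probabilistic(toy_spread, deterministic, gt)  # None, with a warning
computed = dispatch_probabilistic(toy_spread, ensemble, gt)  # dims ("x",)
```
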
plotting/plot_orchestration.py (_plot_score_maps_per_stream)

  • Fix CoordinateValidationError crash. Guard the ensemble-label assignment on
    "ens" in plot_metrics.dims (the concatenated result) rather than preds.dims. When every
    selected metric reduces the ensemble dim (e.g. metrics: ["ssr"]), plot_metrics has no ens
    dim and the previous unconditional plot_metrics["ens"] = preds.ens raised.
  • Avoid redundant per-member maps. Track which metrics retain a per-member ens dim in their
    own result (ens_metrics). Iterate ensemble members only for those; ensemble-reduced metrics
    (spread, crps, ssr) get a single map instead of one identical map per member.
  • Collapse broadcast ensemble axis. When xr.concat broadcasts a reduced metric across ens,
    select a single member (isel(ens=0, drop=True)) so the plotted field is 2-D.
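
The three plotting changes above can be illustrated with a small standalone sketch. It does not reproduce _plot_score_maps_per_stream: fields_to_plot, the results dict, and the member_anomaly metric are hypothetical, and the broadcast that xr.concat would introduce is mimicked here with expand_dims.

```python
import numpy as np
import xarray as xr

ENS_DIM = "ens"

preds = xr.DataArray(
    np.arange(12.0).reshape(3, 2, 2),
    dims=(ENS_DIM, "lat", "lon"),
    coords={ENS_DIM: [10, 11, 12]},
)
ens_values = preds[ENS_DIM].values

# Hypothetical per-metric results before concatenation.
results = {
    "spread": preds.std(ENS_DIM),  # ensemble-reduced: (lat, lon)
    "member_anomaly": preds - preds.mean(ENS_DIM),  # per-member: (ens, lat, lon)
}

# Track which metrics keep a per-member ens dim in their own result.
ens_metrics = {name for name, da in results.items() if ENS_DIM in da.dims}


def fields_to_plot(name, field, ens_values):
    """Return (member_label, 2-D field) pairs for one metric."""
    if name in ens_metrics:
        # Restore member labels positionally, then emit one map per member.
        field = field.assign_coords({ENS_DIM: ens_values})
        return [(m, field.sel({ENS_DIM: m})) for m in ens_values]
    if ENS_DIM in field.dims:
        # Reduced metric that was broadcast across ens: keep a single copy.
        field = field.isel({ENS_DIM: 0}, drop=True)
    return [(None, field)]


# Mimic the broadcast a concatenation would add to the reduced metric.
broadcast_spread = results["spread"].expand_dims({ENS_DIM: len(ens_values)})

spread_maps = fields_to_plot("spread", broadcast_spread, ens_values)
member_maps = fields_to_plot("member_anomaly", results["member_anomaly"], ens_values)
```

Here spread yields a single 2-D map, while the per-member metric yields one map per ensemble member.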

Issue Number

Closes #2502

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a hedgedoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

Use the robust ens-detection variant in _plot_score_maps_per_stream so the
block is identical (apart from the branch-specific tag string) to
mh/full-pipeline-diffusion-adjusted-scores, minimising future merge conflicts:

- restore ens labels via assign_coords(ens=preds.ens.values) (positional)
  instead of plot_metrics["ens"] = preds.ens (index-aligned)
- gate has-ens detection on the ens *dimension* (all_ens) rather than the
  coordinate
- compute per-metric ens iteration via metric_has_ens / ens_values

Behaviour is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
github-actions bot added the labels bug (Something isn't working) and eval (anything related to the model evaluation pipeline) on Jun 14, 2026
@clessig

clessig commented Jun 15, 2026

Collaborator

@jpolz : could you have a look?

Development

Successfully merging this pull request may close these issues.

Ensemble wide scores currently disabled

2 participants