[2507][Evaluation] Implement Adjusted SSR#2508

Open
moritzhauschulz wants to merge 3 commits into
ecmwf:develop from
moritzhauschulz:mh/develop-score-bug-fixes-adjusted-ssr

@moritzhauschulz

Contributor

Description

Adds two probabilistic metrics following GenCast (Price et al., 2024, App. A),
leaving the existing spread/ssr untouched.

Changes

  • spread_adj — unbiased ensemble spread sqrt(mean Var_ens(ddof=1)) (GenCast Eq. A.6).
  • ssr_adj — adjusted spread-skill ratio sqrt((M+1)/M) · spread_adj / RMSE(ens_mean)
    (GenCast Eq. A.9), where M is the ensemble size. The sqrt((M+1)/M) factor removes the
    finite-ensemble bias so a perfectly calibrated ensemble gives SSR = 1.
  • Plotting: ssr_adj line plots draw a horizontal reference line at the optimal value of 1.
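
The two formulas above can be sketched in a few lines of standard-library Python. This is only an illustration of the arithmetic, not the code in this PR: the function names match the metric names, but the data layout (a list of M members, each a list of values at the same points) and the helper rmse_ens_mean are assumptions made for the example. statistics.variance divides by M - 1, which corresponds to ddof=1.

```python
import math
import statistics


def spread_adj(ens):
    """Adjusted spread: sqrt of the mean (over points) of the ddof=1 ensemble variance.

    ens is a list of M members, each a list of N values (hypothetical layout).
    """
    n_points = len(ens[0])
    # statistics.variance is the sample variance (divides by M - 1, i.e. ddof=1).
    variances = [
        statistics.variance([member[i] for member in ens]) for i in range(n_points)
    ]
    return math.sqrt(statistics.fmean(variances))


def rmse_ens_mean(ens, obs):
    """RMSE of the ensemble mean against the target values (hypothetical helper)."""
    squared_errors = [
        (statistics.fmean([member[i] for member in ens]) - obs[i]) ** 2
        for i in range(len(obs))
    ]
    return math.sqrt(statistics.fmean(squared_errors))


def ssr_adj(ens, obs):
    """Adjusted spread-skill ratio: sqrt((M+1)/M) * spread_adj / RMSE(ens_mean)."""
    m = len(ens)
    return math.sqrt((m + 1) / m) * spread_adj(ens) / rmse_ens_mean(ens, obs)


# Toy data: M = 2 members, N = 2 points.
ens = [[1.0, 2.0], [3.0, 6.0]]
obs = [2.0, 5.0]
print(spread_adj(ens), ssr_adj(ens, obs))
```

For this toy input the ddof=1 variances are 2 and 8, so spread_adj is sqrt(5); the ensemble-mean RMSE is sqrt(0.5), and ssr_adj evaluates to sqrt(1.5) · sqrt(5) / sqrt(0.5) = sqrt(15).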

Notes

  • Legacy spread (biased, ddof=0) and ssr are unchanged. GenCast does not define an
    uncorrected SSR, so the legacy metrics remain only as non-standard diagnostics.
  • Use via evaluation.metrics in the eval config (e.g. add ssr_adj, spread_adj).
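
As a rough illustration of why the legacy and adjusted quantities differ, the standard argument can be written out as follows. It assumes an idealised, perfectly calibrated ensemble in which the M members x_1, ..., x_M and the truth y are independent draws from the same distribution with variance σ²; the notation is chosen for this sketch and is not copied from the GenCast appendix.

```latex
% Assumption: x_1, ..., x_M and y are i.i.d. with variance \sigma^2;
% \bar{x} is the ensemble mean, s^2 the ensemble variance (requires amsmath).
\begin{align}
\mathbb{E}\big[s^2_{\mathrm{ddof}=1}\big] &= \sigma^2, &
\mathbb{E}\big[s^2_{\mathrm{ddof}=0}\big] &= \tfrac{M-1}{M}\,\sigma^2, \\
\mathbb{E}\big[(\bar{x} - y)^2\big] &= \operatorname{Var}(\bar{x}) + \operatorname{Var}(y)
  = \tfrac{\sigma^2}{M} + \sigma^2 = \tfrac{M+1}{M}\,\sigma^2 .
\end{align}
```

Under these assumptions the expected squared error of the ensemble mean equals (M+1)/M times the expected ddof=1 variance, which is why multiplying spread_adj by sqrt((M+1)/M) brings the ratio to approximately 1. The same reasoning suggests that the legacy ssr, built on the ddof=0 variance without the correction factor, tends towards sqrt((M-1)/(M+1)) and therefore sits below 1 for small ensembles.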

Should be reviewed by someone from the evaluation team

On Santis, run with uv run evaluate --config ./config/evaluate/eval_config_test.yml -run-ids qvim6zb3 using the attached eval config.

Issue Number

Closes #2507

Note this depends on #2503, which should be merged first.

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

moritzhauschulz and others added 2 commits June 14, 2026 11:20
Use the robust ens-detection variant in _plot_score_maps_per_stream so the
block is identical (apart from the branch-specific tag string) to
mh/full-pipeline-diffusion-adjusted-scores, minimising future merge conflicts:

- restore ens labels via assign_coords(ens=preds.ens.values) (positional)
  instead of plot_metrics["ens"] = preds.ens (index-aligned)
- gate has-ens detection on the ens *dimension* (all_ens) rather than the
  coordinate
- compute per-metric ens iteration via metric_has_ens / ens_values

Behaviour is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@moritzhauschulz moritzhauschulz changed the title from "[2507][Evaluation] Implement Correct SSR" to "[2507][Evaluation] Implement Adjusted SSR" on Jun 15, 2026
@github-actions github-actions Bot added the "eval" label (anything related to the model evaluation pipeline) on Jun 15, 2026
Labels

eval anything related to the model evaluation pipeline

Development

Successfully merging this pull request may close these issues.

GenCast Style Corrected SSR
