Hi, for example I am training a job using this [yaml](https://github.com/mosaicml/diffusion/blob/main/yamls/hydra-yamls/SD-2-base-512.yaml), how to do continue training if this job failed? Thanks.