sbatch submission failures do not continue #1

Open
trickytank wants to merge 1 commit into nathanhaigh:master from trickytank:master

@trickytank

This is to prevent errors from sbatch causing trouble.

I sometimes have the following error from sbatch:

sbatch: error: Batch job submission failed: Socket timed out on send/recv operation

This leaves JOBID as an empty string, which later causes an error in sacct. The problem never resolves itself, because the status script assumes the job is still running.

This PR fixes the problem by retrying until the job is properly submitted. There is a 10-second wait between submission attempts, as submission failures appear to cluster in time.
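
As a rough illustration of the wait-and-resubmit idea described above, here is a minimal sketch. It is not the code from this PR: the function name `submit_until_accepted` and the variable `SUBMIT_WAIT` are hypothetical, and it assumes sbatch prints its usual `Submitted batch job <id>` line on success.

```shell
# Hypothetical helper: the names below are assumptions, not taken from the PR.
SUBMIT_WAIT="${SUBMIT_WAIT:-10}"   # seconds to wait between attempts

submit_until_accepted() {
  local out id
  while true; do
    # On success sbatch prints "Submitted batch job <id>".
    if out=$("$@"); then
      id=$(printf '%s\n' "$out" | cut -f4 -d' ')
      case "$id" in
        ''|*[!0-9]*) ;;                        # empty or non-numeric: treat as a failure
        *) printf '%s' "$id"; return 0 ;;      # looks like a job ID: hand it back
      esac
    fi
    echo "WARN: submission failed; retrying in ${SUBMIT_WAIT}s" >&2
    sleep "$SUBMIT_WAIT"
  done
}

# Usage (assumes sbatch is on PATH and job.sh is a placeholder script):
#   JOBID=$(submit_until_accepted sbatch job.sh)
```

Note that this loop never gives up, so it would block indefinitely if the scheduler stayed unreachable.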

@nathanhaigh
Owner

Thanks for the contribution!

I've had similar sporadic failures with the sacct command used in the status script. I dealt with them using a "retry" function. A similar function could be used in the submit script like this:

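function retry {
  local n=1
  local max=5
  local delay=5
  while true; do
    # Run the command; stop retrying as soon as it succeeds.
    "$@" && break || {
      if [[ $n -lt $max ]]; then
        >&2 echo "WARN: Command ($*) failed on attempt $n/$max; retrying in ${delay}s."
        sleep "$delay"
      else
        >&2 echo "ERROR: Command ($*) failed after $n attempts."
        exit 1
      fi
      ((n++))
    }
  done
}

# With pipefail, a failed sbatch fails the whole pipeline, so the script
# can stop instead of echoing an empty JOBID.
set -o pipefail
JOBID=$(retry sbatch ${DEP_STRING} ${SBATCH_ARGS} "$@" | cut -f4 -d' ') || exit 1
echo -n "${JOBID}"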

This also has the advantage of not getting stuck in an infinite loop, as it gives up after 5 failed attempts.

What do you think?

@nathanhaigh
Owner

See now: https://github.com/UofABioinformaticsHub/snakemake-tutorial/blob/master/profiles/slurm/status

@trickytank
Author

It's much nicer to have a generic retry function. For my purposes I'd set the max to a large value, as there have been ~15 minute periods during which submission failed.
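
One way to make the limit adjustable without editing the script might be to read it from the environment. The sketch below is a variant of the retry function discussed above, not the version in the linked status script: `RETRY_MAX` and `RETRY_DELAY` are hypothetical variable names, and it returns a non-zero status rather than calling `exit`, leaving the caller to decide what to do on failure.

```shell
# RETRY_MAX and RETRY_DELAY are hypothetical environment variables (assumptions).
retry() {
  local n=1
  local max="${RETRY_MAX:-5}"       # total attempts before giving up
  local delay="${RETRY_DELAY:-5}"   # seconds to sleep between attempts
  until "$@"; do
    if [ "$n" -ge "$max" ]; then
      echo "ERROR: Command ($*) failed after $n attempts." >&2
      return 1
    fi
    echo "WARN: Command ($*) failed on attempt $n/$max; retrying in ${delay}s." >&2
    sleep "$delay"
    n=$((n + 1))
  done
}

# Usage (job.sh is a placeholder): 100 attempts spaced 10 s apart span
# roughly 16 minutes, enough to ride out a ~15 minute outage.
#   RETRY_MAX=100 RETRY_DELAY=10 retry sbatch job.sh
```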
