Improved Slurm backend for job submission #5177
- Submit each build as an individual Slurm job (same as before)
- Handle dependencies between Slurm jobs (same as before)
- Respect the --job-max-jobs setting, and manage a queue accordingly (new)
- Track the state of the Slurm jobs and print a summary at the end (new)
- Synchronous behavior, i.e. block until the end of the execution (new)
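As an illustration of the behavior listed above, here is a minimal sketch of a synchronous submission loop. It is not the code from this PR: `run_queue`, `submit`, `poll` and the state names are hypothetical, the Slurm calls are replaced by injected callables, and dependencies are handled by holding jobs back on the client side, which may differ from how the backend actually passes dependencies to Slurm.

```python
import time
from collections import Counter

# Simplified assumption: any state outside this set is treated as final.
ACTIVE_STATES = {"PENDING", "RUNNING"}


def run_queue(jobs, deps, submit, poll, max_jobs, polling_interval=30, sleep=time.sleep):
    """Submit jobs while keeping at most max_jobs active; block until all are final.

    jobs:   list of job names (hypothetical identifiers)
    deps:   dict mapping a job name to the set of names it depends on
    submit: callable(name) -> job id (stands in for the real submission call)
    poll:   callable(job id) -> state string (stands in for the real status query)
    """
    if max_jobs < 1:
        raise ValueError("max_jobs must be at least 1")

    pending = list(jobs)
    active = {}  # job name -> job id
    final = {}   # job name -> final state

    while pending or active:
        # Collect jobs that reached a final state since the last poll.
        for name, job_id in list(active.items()):
            state = poll(job_id)
            if state not in ACTIVE_STATES:
                final[name] = state
                del active[name]

        # Submit or cancel pending jobs until nothing changes any more.
        progressed = True
        while progressed:
            progressed = False
            for name in list(pending):
                needed = deps.get(name, set())
                if any(final.get(d) not in (None, "COMPLETED") for d in needed):
                    # A dependency ended badly, so this job can never run.
                    final[name] = "CANCELLED"
                    pending.remove(name)
                    progressed = True
                elif len(active) < max_jobs and all(final.get(d) == "COMPLETED" for d in needed):
                    active[name] = submit(name)
                    pending.remove(name)
                    progressed = True

        if active:
            sleep(polling_interval)
        elif pending:
            raise RuntimeError("unsatisfiable dependencies: %s" % pending)

    return final


def summarize(final):
    """Count jobs per final state, e.g. for a summary printed at the end."""
    return Counter(final.values())
```

Passing `sleep` and `poll` as arguments keeps the sketch testable without a cluster; a real backend would query the scheduler instead.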
super().__init__(*args, **kwargs)

# Add maximum jobs submitted to a queue
self.job_polling_interval = build_option('job_polling_interval')
I see no default value for the polling interval; what is it? It should be impossible to run with e.g. 0, or else the code will spam SLURM. Some number of minutes would seem like a sensible minimum?
The comment above the polling interval line refers to maximum jobs, which is actually set below.
Thanks for checking:
- The default value for job_polling_interval is 30s and it is set in tools/options.py line 951.
- I added an additional check to make sure we don't get anything less than 1s.
- I fixed the mixed-up comments.
I have been running it many times with the default polling interval of 30s without any issue.
I believe it is fine to let users pick a low value if they wish, even though I would not recommend it.
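For illustration, a lower-bound check of the kind described in this reply could look like the sketch below. This is not the code added in the PR: the function name and constant are hypothetical, and whether the real check raises an error or silently raises the value to the minimum is not stated in this thread.

```python
# Hypothetical constant: the 1s lower bound mentioned in the reply above.
MIN_POLLING_INTERVAL = 1.0


def validated_polling_interval(value, minimum=MIN_POLLING_INTERVAL):
    """Return the polling interval in seconds as a float, rejecting values below the minimum."""
    interval = float(value)
    # "not (x >= y)" also rejects NaN, which would slip through "x < y".
    if not (interval >= minimum):
        raise ValueError(
            "job polling interval must be at least %ss, got %ss" % (minimum, interval)
        )
    return interval
```

With this sketch, `validated_polling_interval(30)` returns `30.0`, while `validated_polling_interval(0)` raises a `ValueError` instead of letting the loop poll the scheduler without pause.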
I see now the default value was already there -- I thought this was a new parameter, so was expecting to see it defined in this PR.
I'll resolve this thread (if I can).
50fde24 to 35f2340
This is an improved version of the Slurm backend for job submission (option --job).

Same functionalities

New functionalities
- Respect the --job-max-jobs setting, and manage a submission queue accordingly
- … --job-polling-interval

Change of behavior
- … --job Slurm option will be broken.
- … --job) or to the behavior of other backends like GC3Pie. This differs from the current behavior of the Slurm backend, where EasyBuild exits once all the jobs have been submitted.

If this change of behavior is too important to be merged as it is, I can suggest two options:
- If using --job --job-backend Slurm without --job-max-jobs, then use the old behavior, i.e. asynchronous. If using --job --job-backend Slurm with --job-max-jobs, then use the new behavior, i.e. synchronous. However, I think it could be a bit confusing.
- … SlurmQueue.

Use of AI tools