Use tasks_per_node to split sweep across tasks#2633
Conversation
|
This would be a very useful feature to have! Will this be merged into the main branch at some point? In my particular case, SLURM is configured to allocate a full node per job, and each node comes with 4 GPUs. My models are quite small, though, and easily fit on a single GPU. Having the Hydra sweeper submit a new job per hyperparameter value (which seems to be the default at the moment) is hence very wasteful for me, whereas parallelizing across tasks within a SLURM job (and hence a single node) sounds like exactly what I am looking for. If there is a different solution to this, I would of course also be interested in that. Thank you very much! |
I wouldn't count on it -- I currently don't need it anymore and as I mentioned in the PR description, I think the current implementation may break some existing use cases. So it would require someone to re-work it a bit (for instance with one of my suggestions, but maybe there's a better way too). Note however that I've used it successfully so you should be able to cherry-pick this commit and use it if it's helpful to you. |
|
This is a very useful feature. Can we try to work on this and get it merged? Is anyone else interested in this? |
|
Can I implement option 1) and then can we hope this can be merged to mainline? |
|
@Jasha10 @odelalleau What do you think? Is it possible to get this up as discussed above? |
|
Is |
|
Yes, I agree it seems like a specific use case within Hydra. But it is a wonderful use case when we want to run 5-6 jobs within one node without worrying too much. In particular, my use case is a large multirun where, with a quick arg, I can just run N tasks on a node (this translates to sharing GPU resources when the individual tasks are not GPU intensive). |
|
Yes I totally agree and would need the same thing. What I wanted to ask is what was the effect of |
|
Well, to me it seems like something which cannot technically work with the Hydra framework. But it's a broader question whether to make sure that the |
|
Can we try to merge this? |
Motivation
When running a sweep, someone may want to use the same GPU for multiple jobs in the sweep. This PR makes it possible by leveraging the tasks_per_node argument (if set to 2, for instance, then 2 jobs may share the same GPU).
Discussion
This is currently a draft, open for feedback. I don't think it's actually a good idea to systematically use tasks_per_node for this, because some users may be using this setting for multiprocess jobs.
Two options could be:
1) Add a flag split_sweep_over_tasks (default=False) to enable this behavior (my preferred solution at this time)
2) Add a separate setting (jobs_group_size, default=1) so that it can be combined with multi-task jobs (would be more complex to implement: would need to spawn multiple processes from each SLURM job, instead of just relying on SLURM's tasks mechanism as implemented here)
Feedback and other ideas welcome!
The current implementation also has a small hack for the case where we end up launching a single job. I'm not sure if there's a better way to deal with this situation (basically I would like to force submitit to create a job array even for a single-job array).
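To make the grouping idea concrete, here is a minimal sketch in plain Python of how a sweep could be split so that each SLURM job carries up to tasks_per_node sweep jobs. It is an illustration only and not the code in this PR: the function names chunk_sweep and overrides_for_task are hypothetical, and how each task learns its index within the job (for example from an environment variable such as SLURM_LOCALID) is an assumption left to the caller.

```python
from typing import List, Optional, Sequence


def chunk_sweep(
    job_overrides: Sequence[Sequence[str]], tasks_per_node: int
) -> List[List[List[str]]]:
    """Group sweep jobs so that each group maps to one SLURM job.

    Hypothetical helper: each group holds at most `tasks_per_node` sweep jobs,
    one per task. The last group may be smaller.
    """
    if tasks_per_node < 1:
        raise ValueError("tasks_per_node must be >= 1")
    return [
        [list(overrides) for overrides in job_overrides[start : start + tasks_per_node]]
        for start in range(0, len(job_overrides), tasks_per_node)
    ]


def overrides_for_task(
    group: Sequence[Sequence[str]], task_index: int
) -> Optional[List[str]]:
    """Pick the overrides a given task should run.

    Returns None when the task has nothing to do, which happens for the
    trailing tasks of a partially filled last group.
    """
    if task_index < len(group):
        return list(group[task_index])
    return None


# Example: a 5-point sweep with 2 tasks per node needs 3 SLURM jobs.
sweep = [[f"seed={s}"] for s in range(5)]
groups = chunk_sweep(sweep, tasks_per_node=2)
```

With this layout, 5 sweep points and tasks_per_node=2 yield 3 groups, and the second task of the last group is idle. A sweep with a single point yields a single one-element group, which is the case where the description above mentions needing a hack to still get a job array.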
Have you read the Contributing Guidelines on pull requests?
Yes
Test Plan
TBD
Related Issues and PRs
Fixes #2632