Skip to content

Use tasks_per_node to split sweep across tasks#2633

Draft
odelalleau wants to merge 1 commit into
mainfrom
od/submitit-multi-task
Draft

Use tasks_per_node to split sweep across tasks#2633
odelalleau wants to merge 1 commit into
mainfrom
od/submitit-multi-task

Conversation

@odelalleau

@odelalleau odelalleau commented Apr 4, 2023

Copy link
Copy Markdown
Collaborator

Motivation

When running a sweep, someone may want to be able to use the same GPU for multiple jobs in a sweep. This PR makes it possible by leveraging the tasks_per_node argument (if set to 2 for instance, then 2 jobs may share the same GPU).

Discussion

This is currently a draft, open for feedback. I don't think it's actually a good idea to systematically use tasks_per_node for this, because some users may be using this setting for multiprocess jobs.

Two options could be:

  1. Add another flag split_sweep_over_tasks (default=False) to enable this behavior (my preferred solution at this time)
  2. Make it an entirely different setting (ex: jobs_group_size, default=1) so that it can be combined with multi-task jobs (would be more complex to implement: would need to spawn multiple processes from each SLURM job, instead of just relying on SLURM's tasks mechanism as implemented here)

Feedback and other ideas welcome!

The current implementation also has a small hack when we end up launching a single job => not sure if there's a better way to deal with this situation (basically I would like to force submitit to create a job array even for a single-job array).

Have you read the Contributing Guidelines on pull requests?

Yes

Test Plan

TBD

Related Issues and PRs

Fixes #2632

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 4, 2023
@odelalleau odelalleau marked this pull request as draft April 4, 2023 16:05
@odelalleau

Copy link
Copy Markdown
Collaborator Author

@Jasha10 and @jrapin what do you think?

@odelalleau odelalleau force-pushed the od/submitit-multi-task branch from e119fcd to 016dc28 Compare April 4, 2023 16:08
@soerenab

soerenab commented Feb 29, 2024

Copy link
Copy Markdown

This would be a very useful feature to have! Will this be merged into the main branch at some point?

In my particular case, slurm is configured to allocate a full node per job, where each node comes with 4 gpus. My models are quite small though and easily fit on a single gpu. Having the hydra sweeper submitting a new job (which seems to be the default at the moment) per hyperparameter value is hence very wasteful for me whereas parallelizing within a slurm job (and hence a single node) across tasks sounds exactly like the thing I am looking for.

If there is a different solution to this, I would of course also be interested in that. Thank you very much!

@odelalleau

Copy link
Copy Markdown
Collaborator Author

Will this be merged into the main branch at some point?

I wouldn't count on it -- I currently don't need it anymore and as I mentioned in the PR description, I think the current implementation may break some existing use cases. So it would require someone to re-work it a bit (for instance with one of my suggestions, but maybe there's a better way too).

Note however that I've used it successfully so you should be able to cherry-pick this commit and use it if it's helpful to you.

@chaithyagr

Copy link
Copy Markdown

This is a very useful feature. Can we try to work on this and get it merged? Anyone else is interested in this?

@chaithyagr

Copy link
Copy Markdown

Can I implement option 1) and then can we hope this can be merged to mainline?

@chaithyagr

Copy link
Copy Markdown

@Jasha10 @odelalleau What do you think? Is it possible to get this up as discussed about?

@matteobettini

matteobettini commented Jul 3, 2024

Copy link
Copy Markdown

Is tasks_per_node currently usable for anything else in the hydra plugin? It seems to me that this is the only envisionable use case for it in hydra

@chaithyagr

Copy link
Copy Markdown

Yes, I agree it seems like a specific usecase within hydra. But it is a wonderful usecase when we want to run 5-6 jobs within one node without worrying too much. In particular, my usecase is a large multirun, but with a quick arg, I can just run N number of tasks on a node (this translates to sharing GPU resources, when the indivitual tasks are not GPU intensive).

@matteobettini

Copy link
Copy Markdown

Yes I totally agree and would need the same thing. What I wanted to ask is what was the effect of tasks_per_node before this PR, as this seemed unused in hydra

@chaithyagr

Copy link
Copy Markdown

Well, to me it seems like something which cannot technically work with the hydra framework. But its a broader question whether to make sure that the tasks_per_node still reflects what SLURM defines it as pointed by @odelalleau .

@chaithyagr

Copy link
Copy Markdown

Can we try to merge this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature Request] Let multiple jobs share the same GPU when launching a sweep on SLURM with submitit

5 participants