Adding W&B Plugin for Hydra Sweeper #2160

Draft
captain-pool wants to merge 20 commits into facebookresearch:main from captain-pool:adding-wandb-sweeper

Conversation

captain-pool commented Apr 19, 2022

Motivation

This PR adds a plugin that integrates W&B Sweeps with Hydra Sweeps (aka Multirun). It also includes an example of how the submitit launcher can be used to launch W&B Sweeps on SLURM clusters.

Have you read the Contributing Guidelines on pull requests?

Yes

Original URL of Plugin: https://github.com/captain-pool/hydra-wandb-sweeper

Test Plan

Proper unit tests are on the way.

CC: @tesfaldet

captain-pool and others added 20 commits February 16, 2022 03:55
* Compatibility with hydra experiment overrides. Quantized step for distributions now supported. Sweep parameterization now more aligned with Hydra's standardized approach. Some bug fixes. Sweep overrides now support RangeSweep properly via quantized uniform. Wandb run now directly passed to task function instead of run id to avoid unnecessary init overhead.

* Added some useful logs. Added some comments. Small changes.

* Monkeypatching some wandb functions to allow for using wandb's code save when in a hydra cwd that's diff than the code's directory. Being more explicit with function arg types.

* Can now pass in tags (as list) and notes via wandb_sweep_config yaml

* Logging to .wandb instead of wandb, had to resort to a lot of monkeypatching but it was worth it. Some code cleanup and clarification.

* Updated example. Fixed some comments. Making sure to use abs path for creating and using sweep directory. Updated README but still needs an update in the range section of categorical parameters since now it's using wandb's q_uniform.

* Added agent budget functionality for launching batches of agents until budget reached. Added some comments. Retrieving results from task function via its agent so that it can be visible to the sweeper and launcher for inspecting JobReturn status and return value.

* Returning results and statuses from task function to launcher. Improved task function error handling and propagation to launcher. Fixed bug with renaming wandb dir to .wandb when config_filepath has text 'wandb' anywhere that doesn't relate to a directory.

* Moved main sweeper code to _impl.py so that it doesn't get accidentally loaded along with all its monkeypatching when the user doesn't want to use the sweeper. Upon running a task function, hydra loads all available plugins, even those that won't be used. Improved code typing. Added implicit start to yamlfmt pre-commit since it clashed with hydra's # @package _global_ indicator.

* Improved error handling for agent failures such as sweep not existing anymore, job failures unrelated to the agent, and run failures within an agent.

* Revamped example training script to showcase new pre-emption capabilities via the submitit launcher plugin. Note, not tested with Joblib, Ray, or RQ launcher plugins.

* Support for Hyperband early termination. Graceful response towards remote killing/stopping of sweeps and runs.

* More comments for clarity. Tweaking example so that it's easier to test out wandb alert during preemption. Slight code tweaks elsewhere. Small pre-commit modifications.

* Grid search method now supported for both config-style parameterization and override-style. Validation of metric arguments, early termination args, and search args.
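
Several of the commits above touch the sweep configuration: ranges expressed through a quantized uniform distribution, Hyperband early termination, and validation of the metric and search arguments. As a rough illustration only, here is a minimal sketch of what assembling and validating such a configuration could look like. The helper names `range_to_q_uniform` and `build_sweep_config` are hypothetical and are not taken from the plugin; only the dictionary keys (`method`, `metric`, `parameters`, `early_terminate`, `q_uniform`) follow W&B's sweep configuration format.

```python
from typing import Any, Dict, Optional

# Search strategies and metric goals that W&B sweep configs accept.
VALID_METHODS = {"grid", "random", "bayes"}
VALID_GOALS = {"minimize", "maximize"}


def range_to_q_uniform(start: float, stop: float, step: float) -> Dict[str, Any]:
    """Hypothetical helper: express a stepped range as a q_uniform parameter.

    Assumption: the plugin's real conversion may handle the end point
    differently (Hydra's range excludes `stop`).
    """
    if step <= 0 or stop <= start:
        raise ValueError("expected step > 0 and stop > start")
    return {"distribution": "q_uniform", "min": start, "max": stop, "q": step}


def build_sweep_config(
    method: str,
    metric_name: str,
    goal: str,
    parameters: Dict[str, Dict[str, Any]],
    early_terminate: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: validate the arguments, then assemble the config."""
    if method not in VALID_METHODS:
        raise ValueError(f"unknown search method: {method!r}")
    if goal not in VALID_GOALS:
        raise ValueError(f"unknown metric goal: {goal!r}")
    if not parameters:
        raise ValueError("at least one parameter is required")

    config: Dict[str, Any] = {
        "method": method,
        "metric": {"name": metric_name, "goal": goal},
        "parameters": parameters,
    }
    if early_terminate is not None:
        # This sketch only handles Hyperband, which needs an iteration bound.
        if early_terminate.get("type") != "hyperband":
            raise ValueError("only 'hyperband' early termination is handled here")
        if not any(key in early_terminate for key in ("min_iter", "max_iter")):
            raise ValueError("hyperband needs min_iter or max_iter")
        config["early_terminate"] = dict(early_terminate)
    return config


# Example: random search over a quantized batch size with Hyperband stopping.
sweep_config = build_sweep_config(
    method="random",
    metric_name="val_loss",
    goal="minimize",
    parameters={"batch_size": range_to_q_uniform(16, 128, 16)},
    early_terminate={"type": "hyperband", "min_iter": 3},
)
```

A dictionary of this shape is what `wandb.sweep` takes as its sweep configuration; how the plugin itself builds it from Hydra overrides may well differ from this sketch.
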
@facebook-github-bot added the CLA Signed label Apr 19, 2022

lgtm-com Bot commented Apr 19, 2022

Contributor

This pull request introduces 1 alert when merging 0731be8 into a78d1f6 - view on LGTM.com

new alerts:

  • 1 for Unused import

pixelb commented Apr 19, 2022

Contributor

We may add this to a contrib/ directory
Thanks for your contribution!

pixelb self-assigned this Apr 19, 2022

@tesfaldet


Heads up @captain-pool: a couple of weeks ago I came across what looks like a bug. At some point during a sweep, the logging rate of runs slowed down severely, specifically after 30k train iterations or so (for my specific project). The sweeper agent seems to be heavily rate limiting logging for its associated run, to the point where it's clearly a bug and not an issue with what's being logged. I have some telemetry data I can share. I'm not sure whether the issue is in how I implemented this sweeper or inherent to wandb.agent.

@jieru-hu

Contributor

hi @captain-pool - thanks very much for your contribution!

We've created contrib here for plugin contributions. We've also included a README there.
Could you update this PR and move this plugin to the contrib folder and add integration tests?

pls let us know if you have any question on this. thanks again.

@tesfaldet


> hi @captain-pool - thanks very much for your contribution!
>
> We've created contrib here for plugin contributions. We've also included a README there. Could you update this PR and move this plugin to the contrib folder and add integration tests?
>
> pls let us know if you have any question on this. thanks again.

@captain-pool I can help out with this after the NeurIPS deadline

5 participants