Refactor WLM integration and iterate2 functionality with plugins #60

Merged: romeokienzler merged 12 commits into main from wlm_as_plugin on May 13, 2026
@romeokienzler

This pull request modernizes and clarifies the workflow for running hyperparameter optimization (HPO) with iterate2, focusing on a plugin-based workload manager (WLM) interface, improved documentation, and streamlined example scripts. The changes shift from built-in WLM logic to a flexible plugin system, clarify environment variable usage, and update example configurations and scripts to match the new approach. There are also quality-of-life improvements to the example HPO function and YAML search space.

Key changes include:

Documentation and Workflow Overhaul

  • Replaced the built-in WLM backend system with a plugin-based WLM interface: iterate2 now delegates all workload management to user-supplied plugin scripts, passing configuration via environment variables (e.g., ITERATE_WLM_GPU_COUNT, ITERATE_TRIAL_CMD). This enables support for any cluster or local execution environment and decouples iterate2 from cluster-specific logic.

  • Updated the documentation (docs/iterate2.md) to describe the new plugin system, environment variable interface, and revised command-line options. Removed legacy WLM-specific options in favor of --wlm-plugin, and clarified how to configure resources via the HPO YAML wlm: section.
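
To make the environment variable contract described above more concrete, here is a minimal sketch of what a local plugin could look like if written in Python. Only ITERATE_TRIAL_CMD and ITERATE_WLM_GPU_COUNT come from this PR; the function name `build_launch_command`, the default of zero GPUs, and the decision to split the command with `shlex` are assumptions made for illustration and are not taken from the reference plugins.

```python
import shlex


def build_launch_command(env):
    """Turn the environment passed by iterate2 into an argv list for a local run.

    Hypothetical helper: it is not part of iterate2 or its reference plugins.
    """
    # ITERATE_TRIAL_CMD is named in the PR as the variable carrying the trial command.
    trial_cmd = env.get("ITERATE_TRIAL_CMD")
    if not trial_cmd:
        raise ValueError("ITERATE_TRIAL_CMD is not set")

    # Assumption: a missing GPU count is treated as a CPU-only trial.
    gpu_count = int(env.get("ITERATE_WLM_GPU_COUNT", "0"))

    # A cluster plugin would translate gpu_count into scheduler options here.
    # This local sketch only returns it alongside the split command.
    return shlex.split(trial_cmd), gpu_count
```

A plugin script built on this sketch would call `build_launch_command(os.environ)` and hand the resulting argv to `subprocess.call`. Keeping the translation step in one small function is what would let an LSF or Vela variant swap in scheduler-specific options without touching iterate2 itself.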

Example and Configuration Updates

  • Refactored example cluster submission scripts (examples/run_lsf_gridfm_example_postgres.sh, examples/run_ccc_gridfm_example.sh) to use the new plugin interface, removing embedded cluster logic from the scripts and delegating it to dedicated plugin scripts.

  • Updated the example HPO YAML (examples/bumpy_hpo.yaml) to clarify the structure and ensure correct formatting for static and metric sections, matching the expectations of iterate2.

  • Fixed a typo in the data path in the gridfm HPO config (configs/gridfm_graphkit_hpo.yaml).

Example Trial Script Improvements

  • Simplified and clarified the example trial script (examples/bumpy_function.py): it now reads all parameters and output paths from environment variables as set by iterate2, and writes metric output to the required file. This ensures compatibility with the new plugin-based workflow.
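
The pattern described above (parameters in, metric file out, everything through the environment) can be sketched as follows. The variable names EXAMPLE_PARAM_X, EXAMPLE_PARAM_Y and EXAMPLE_METRIC_FILE are placeholders, since this description does not list the names iterate2 actually sets. The JSON output format and the toy objective are also assumptions and do not reproduce examples/bumpy_function.py.

```python
import json
import math


def run_trial(env):
    """Read two parameters from the environment, evaluate a toy objective,
    and write the result to the metric file named in the environment."""
    # Placeholder variable names: the real names are defined by iterate2.
    x = float(env["EXAMPLE_PARAM_X"])
    y = float(env["EXAMPLE_PARAM_Y"])
    metric_path = env["EXAMPLE_METRIC_FILE"]

    # Toy "bumpy" objective: an oscillating term plus a shallow quadratic bowl.
    value = math.sin(3 * x) * math.cos(3 * y) + 0.1 * (x * x + y * y)

    # Assumption: the metric file is JSON with a single "metric" key.
    with open(metric_path, "w", encoding="utf-8") as fh:
        json.dump({"metric": value}, fh)
    return value
```

A real trial script would call `run_trial(os.environ)` at the bottom of the file. Because the function takes the environment as an argument, it can be exercised with a plain dictionary and without a running optimizer.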

Most important changes:

Plugin-based WLM interface and documentation

  • Replaced legacy WLM backend logic with a plugin system: all cluster or local execution is handled by user-supplied scripts, with configuration passed via environment variables such as ITERATE_WLM_GPU_COUNT.
  • Thoroughly updated documentation to reflect the new plugin interface, environment variable contract, and revised CLI options.

Example scripts and configuration

  • Migrated example cluster submission scripts to use the new plugin interface, removing embedded cluster logic and improving clarity.
  • Updated example HPO YAML and fixed a path typo in the gridfm config.

Example trial script

  • Refactored examples/bumpy_function.py to use environment variables for all parameters and output, ensuring compatibility with the new iterate2 workflow.

… and Vela

- Removed hardcoded WLM options and parameters from the argument parser.
- Added support for a user-defined WLM plugin via `--wlm-plugin` argument.
- Implemented `load_wlm_config` function to read WLM settings from HPO YAML.
- Created reference implementations for LSF and Vela plugins in `examples/wlm_plugins/`.
- Updated `run_and_stream` to handle local execution and WLM plugin invocation.
- Enhanced logging to provide clearer feedback on trial execution and WLM interactions.
- Cleaned up unused functions and parameters related to previous WLM handling.

Signed-off-by: Romeo Kienzler <romeo.kienzler1@ibm.com>
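
For readers who want a picture of how the pieces in this commit could fit together, the sketch below maps a parsed `wlm:` section onto environment variables and streams a plugin's output. It is not the code in `_iterate2.py`: the names `wlm_env` and `run_and_stream_sketch`, the `gpu_count` key, and the rule that every key becomes ITERATE_WLM_<KEY> are assumptions, chosen only because they are consistent with the ITERATE_WLM_GPU_COUNT variable mentioned in the PR description.

```python
import os
import subprocess
import sys


def wlm_env(wlm_section, trial_cmd):
    """Map a parsed `wlm:` section (a dict) onto plugin environment variables."""
    # Assumption: each key maps to ITERATE_WLM_<KEY>. The PR text confirms
    # only ITERATE_WLM_GPU_COUNT and ITERATE_TRIAL_CMD by name.
    env = {f"ITERATE_WLM_{key.upper()}": str(value) for key, value in wlm_section.items()}
    env["ITERATE_TRIAL_CMD"] = trial_cmd
    return env


def run_and_stream_sketch(plugin_argv, extra_env):
    """Run a plugin, echo its output line by line, and return (exit_code, lines)."""
    proc = subprocess.Popen(
        plugin_argv,
        env={**os.environ, **extra_env},  # plugin sees the caller's env plus the WLM settings
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so all plugin output is streamed in order
        text=True,
    )
    lines = []
    for line in proc.stdout:
        line = line.rstrip("\n")
        print(line)
        lines.append(line)
    return proc.wait(), lines
```

With these two helpers, a call such as `run_and_stream_sketch(["./my_plugin.sh"], wlm_env({"gpu_count": 1}, "python trial.py"))` would launch a hypothetical plugin script with the settings exported. Collecting the lines while printing them keeps the output available for later metric extraction.
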
- Updated `run_setter_example.sh` to use `bumpy_function.py` instead of `bumpy_setter.py`, simplifying the example for local trials.
- Modified `run_vela_example.sh` to clarify usage of the Vela/OpenShift job submission, ensuring better documentation and example clarity.
- Refined `lsf_plugin.sh` to streamline job submission for IBM Spectrum LSF, enhancing clarity on environment variable usage and command construction.
- Overhauled `_iterate2.py` to simplify the command-line interface, improve YAML loading, and enhance metric extraction logic.
- Removed deprecated features and improved logging for better traceability during execution.
- Enhanced the objective function to better handle parameter suggestions and metrics extraction.

Signed-off-by: Romeo Kienzler <romeo.kienzler1@ibm.com>
… clusters

Signed-off-by: Romeo Kienzler <romeo.kienzler1@ibm.com>
…parameter

Signed-off-by: Romeo Kienzler <romeo.kienzler1@ibm.com>
… directory, and performance reporting options

Signed-off-by: Romeo Kienzler <romeo.kienzler1@ibm.com>
…o require PostgreSQL URL

Signed-off-by: Romeo Kienzler <romeo.kienzler1@ibm.com>
… across examples

Signed-off-by: Romeo Kienzler <romeo.kienzler1@ibm.com>
@romeokienzler merged commit 9f68a9f into main on May 13, 2026
0 of 3 checks passed