Refactor WLM integration and iterate2 functionality with plugins by romeokienzler · Pull Request #60 · claimed-framework/iterate

romeokienzler · 2026-05-13T14:24:30Z

This pull request modernizes and clarifies the workflow for running hyperparameter optimization (HPO) with iterate2, focusing on a plugin-based workload manager (WLM) interface, improved documentation, and streamlined example scripts. The changes shift from built-in WLM logic to a flexible plugin system, clarify environment variable usage, and update example configurations and scripts to match the new approach. There are also quality-of-life improvements to the example HPO function and YAML search space.

Key changes include:

Documentation and Workflow Overhaul

Replaced the built-in WLM backend system with a plugin-based WLM interface: iterate2 now delegates all workload management to user-supplied plugin scripts, passing configuration via environment variables (e.g., ITERATE_WLM_GPU_COUNT, ITERATE_TRIAL_CMD). This enables support for any cluster or local execution environment and decouples iterate2 from cluster-specific logic. [1] [2] [3] [4] [5] [6]
Updated the documentation (docs/iterate2.md) to describe the new plugin system, environment variable interface, and revised command-line options. Removed legacy WLM-specific options in favor of --wlm-plugin, and clarified how to configure resources via the HPO YAML wlm: section. [1] [2] [3] [4] [5] [6]
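The environment variable contract described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the helper name `build_plugin_env`, the rule that every key of the `wlm:` section becomes `ITERATE_WLM_<KEY>`, and the `queue` key are assumptions. Only `ITERATE_WLM_GPU_COUNT` and `ITERATE_TRIAL_CMD` are named in the description.

```python
from __future__ import annotations

from typing import Mapping


def build_plugin_env(
    wlm_config: Mapping[str, object],
    trial_cmd: str,
    base_env: Mapping[str, str] | None = None,
) -> dict[str, str]:
    """Return the environment a WLM plugin script could be started with."""
    env = dict(base_env or {})
    for key, value in wlm_config.items():
        # Assumed naming rule: wlm key "gpu_count" -> ITERATE_WLM_GPU_COUNT.
        env[f"ITERATE_WLM_{key.upper()}"] = str(value)
    # The command the plugin is expected to run for this trial.
    env["ITERATE_TRIAL_CMD"] = trial_cmd
    return env


# "queue" is a hypothetical key used only for illustration.
env = build_plugin_env(
    {"gpu_count": 2, "queue": "normal"},
    "python examples/bumpy_function.py",
)
print(env["ITERATE_WLM_GPU_COUNT"])  # prints 2
```

A plugin script would then read these variables and wrap `ITERATE_TRIAL_CMD` in whatever submission command its cluster needs.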

Example and Configuration Updates

Refactored example cluster submission scripts (examples/run_lsf_gridfm_example_postgres.sh, examples/run_ccc_gridfm_example.sh) to use the new plugin interface, removing embedded cluster logic from the scripts and delegating it to dedicated plugin scripts. [1] [2]
Updated the example HPO YAML (examples/bumpy_hpo.yaml) to clarify the structure and ensure correct formatting for static and metric sections, matching the expectations of iterate2.
Fixed a typo in the data path in the gridfm HPO config (configs/gridfm_graphkit_hpo.yaml).

Example Trial Script Improvements

Simplified and clarified the example trial script (examples/bumpy_function.py): it now reads all parameters and output paths from environment variables as set by iterate2, and writes metric output to the required file. This ensures compatibility with the new plugin-based workflow. [1] [2]
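A trial script of the kind described here might look roughly like the sketch below. The variable names `TRIAL_PARAM_X`, `TRIAL_PARAM_Y`, and `TRIAL_METRIC_FILE` are hypothetical placeholders (the actual names are defined in docs/iterate2.md), and the `bumpy` function is a stand-in, not the one in examples/bumpy_function.py.

```python
import math
import os
from pathlib import Path
from typing import Mapping


def bumpy(x: float, y: float) -> float:
    """Stand-in objective with several local minima (not the repository's function)."""
    return (x - 1.0) ** 2 + (y + 0.5) ** 2 + 0.3 * math.sin(5.0 * x) * math.cos(5.0 * y)


def run_trial(env: Mapping[str, str] = os.environ) -> float:
    """Read parameters from the environment, evaluate, and write the metric file."""
    # Hypothetical variable names; the real contract is set by iterate2.
    x = float(env.get("TRIAL_PARAM_X", "0.0"))
    y = float(env.get("TRIAL_PARAM_Y", "0.0"))
    metric_file = Path(env.get("TRIAL_METRIC_FILE", "metric.txt"))

    value = bumpy(x, y)

    # Write the metric where the driver expects to find it.
    metric_file.parent.mkdir(parents=True, exist_ok=True)
    metric_file.write_text(f"{value}\n")
    return value
```

A standalone script would end with a call to `run_trial()`. Passing the environment as an argument keeps the function easy to exercise without touching the real process environment.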

Most important changes:

Plugin-based WLM interface and documentation

Replaced legacy WLM backend logic with a plugin system: all cluster or local execution is handled by user-supplied scripts, with configuration passed via environment variables such as ITERATE_WLM_GPU_COUNT. [1] [2] [3] [4]
Thoroughly updated documentation to reflect the new plugin interface, environment variable contract, and revised CLI options. [1] [2] [3] [4] [5] [6]

Example scripts and configuration

Migrated example cluster submission scripts to use the new plugin interface, removing embedded cluster logic and improving clarity. [1] [2]
Updated example HPO YAML and fixed a path typo in the gridfm config. [1] [2]

Example trial script

Refactored examples/bumpy_function.py to use environment variables for all parameters and output, ensuring compatibility with the new iterate2 workflow. [1] [2]

… and Vela
- Removed hardcoded WLM options and parameters from the argument parser.
- Added support for a user-defined WLM plugin via `--wlm-plugin` argument.
- Implemented `load_wlm_config` function to read WLM settings from HPO YAML.
- Created reference implementations for LSF and Vela plugins in `examples/wlm_plugins/`.
- Updated `run_and_stream` to handle local execution and WLM plugin invocation.
- Enhanced logging to provide clearer feedback on trial execution and WLM interactions.
- Cleaned up unused functions and parameters related to previous WLM handling.
Signed-off-by: Romeo Kienzler <romeo.kienzler1@ibm.com>
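The split between local execution and plugin invocation mentioned for `run_and_stream` could be sketched as below. This borrows only the function name from the commit message. The signature, the return value, and the choice to pass the trial command through `ITERATE_TRIAL_CMD` as a single shell-quoted string are assumptions, not the implementation in `_iterate2.py`.

```python
from __future__ import annotations

import os
import shlex
import subprocess
from typing import Mapping, Sequence


def run_and_stream(
    trial_cmd: Sequence[str],
    wlm_plugin: str | None = None,
    extra_env: Mapping[str, str] | None = None,
) -> tuple[int, list[str]]:
    """Run a trial locally or via a plugin, echoing output line by line."""
    env = {**os.environ, **(extra_env or {})}
    if wlm_plugin is None:
        # No plugin given: run the trial command directly on this machine.
        cmd = list(trial_cmd)
    else:
        # Plugin given: hand the trial command over through the environment.
        env["ITERATE_TRIAL_CMD"] = shlex.join(trial_cmd)
        cmd = [wlm_plugin]

    lines: list[str] = []
    with subprocess.Popen(
        cmd, env=env, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    ) as proc:
        assert proc.stdout is not None
        for raw in proc.stdout:
            line = raw.rstrip("\n")
            print(line, flush=True)  # stream to the console as it arrives
            lines.append(line)
    # Leaving the with-block waits for the process, so returncode is set.
    return proc.returncode, lines
```

Merging stderr into stdout keeps the streamed log in the order the trial produced it, at the cost of not being able to tell the two streams apart afterwards.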

- Updated `run_setter_example.sh` to use `bumpy_function.py` instead of `bumpy_setter.py`, simplifying the example for local trials.
- Modified `run_vela_example.sh` to clarify usage of the Vela/OpenShift job submission, ensuring better documentation and example clarity.
- Refined `lsf_plugin.sh` to streamline job submission for IBM Spectrum LSF, enhancing clarity on environment variable usage and command construction.
- Overhauled `_iterate2.py` to simplify the command-line interface, improve YAML loading, and enhance metric extraction logic.
- Removed deprecated features and improved logging for better traceability during execution.
- Enhanced the objective function to better handle parameter suggestions and metrics extraction.
Signed-off-by: Romeo Kienzler <romeo.kienzler1@ibm.com>

romeokienzler added 12 commits April 24, 2026 11:31

fix lsf plugin

be709e2

Signed-off-by: Romeo Kienzler <romeo.kienzler1@ibm.com>

Add example scripts for running gridfm-graphkit HPO on CCC and ZuVela…

b444ac3

… clusters Signed-off-by: Romeo Kienzler <romeo.kienzler1@ibm.com>

Update GPU_COUNT handling in ZuVela and CCC plugins to use HPO group …

560087d

…parameter Signed-off-by: Romeo Kienzler <romeo.kienzler1@ibm.com>

Fix data_path in gridfm_graphkit HPO configuration

a0168b7

Signed-off-by: Romeo Kienzler <romeo.kienzler1@ibm.com>

Enhance training command in zuvela_plugin.sh to include run name, log…

05d14f0

… directory, and performance reporting options Signed-off-by: Romeo Kienzler <romeo.kienzler1@ibm.com>

Update database connection handling in run_zuvela_gridfm_example.sh t…

acf8441

…o require PostgreSQL URL Signed-off-by: Romeo Kienzler <romeo.kienzler1@ibm.com>

Refactor scripts to replace 'iterate2' with 'iterate' for consistency…

e564e2c

… across examples Signed-off-by: Romeo Kienzler <romeo.kienzler1@ibm.com>

Update venv option description in iterate2 documentation for clarity

62334b6

Signed-off-by: Romeo Kienzler <romeo.kienzler1@ibm.com>

Bump version to 0.4rc1 in pyproject.toml

a2143e2

Signed-off-by: Romeo Kienzler <romeo.kienzler1@ibm.com>

Update version to 0.4 in pyproject.toml

b32d1bc

Signed-off-by: Romeo Kienzler <romeo.kienzler1@ibm.com>

romeokienzler requested review from paolofraccaro and rosielickorish May 13, 2026 14:24

romeokienzler merged commit 9f68a9f into main May 13, 2026
0 of 3 checks passed

romeokienzler requested a review from naomi-simumba May 13, 2026 14:25

romeokienzler merged 12 commits into main from wlm_as_plugin
