Merged
2 changes: 1 addition & 1 deletion configs/gridfm_graphkit_hpo.yaml
@@ -46,7 +46,7 @@ hpo:
choices:
case118:
config: ./examples/config/HGNS_PF_datakit_case118.yaml
data_path: /u/rkie/
data_path: /u/rki/

static:
run_name: run1
128 changes: 69 additions & 59 deletions docs/iterate2.md
@@ -6,19 +6,27 @@ Key capabilities:

- **Multi-objective optimisation** — extract and optimise several metrics simultaneously (Pareto front)
- **Five HPO parameter types** — `float`, `int`, `categorical`, `flag` (store-true), `group` (bundled arg sets)
- **Dynamic GPU count per trial** — `gpu_num` in the HPO space controls the WLM resource request per trial
- **Dynamic GPU count per trial** — `gpu_num` in the HPO space is passed to the WLM plugin via `ITERATE_WLM_GPU_COUNT`
- **Null-omission** — `null` in a `categorical` choice causes the flag to be completely absent from the command line
- **Workload manager backends** — LSF, Slurm, or direct local execution
- **WLM plugin system** — any executable (bash, Python, …) can be used as a workload-manager backend; reference implementations for LSF and Vela/OpenShift are in `examples/wlm_plugins/`

## Quick start

```sh
iterate2 \
--script train.py \
--wlm lsf \
--gpu-count 1 \
--cpu-count 20 \
--mem-gb 512 \
--wlm-plugin examples/wlm_plugins/lsf_plugin.sh \
--optuna-study-name my_study \
--optuna-db-path sqlite:///hpo.db \
--optuna-n-trials 50 \
--hpo-yaml hpo_space.yaml # wlm: section sets gpu-count, cpu-count, …
```

For local execution (no cluster), simply omit `--wlm-plugin`:

```sh
iterate2 \
--script train.py \
--optuna-study-name my_study \
--optuna-db-path sqlite:///hpo.db \
--optuna-n-trials 50 \
Expand All @@ -33,29 +41,12 @@ iterate2 \
|---|---|---|
| `--script` | *(required)* | Training script to execute |
| `--root-dir` | `.` | Working directory; derived from `--script` if omitted |
| `--venv` | `.venv` | Virtual-environment directory to activate. Set to empty string to disable |
| `--venv` | *(none)* | Virtual-environment directory to activate. Omit to skip venv activation entirely |
| `--interpreter` | `python` | Python interpreter to invoke |
| `--param-setter` | `None` | Use setter-style argument passing (see [Setter-style arguments](#setter-style-arguments)) |
| `--wlm` | `none` | Workload manager: `lsf`, `slurm`, `vela`, or `none` |
| `--gpu-count` | `1` | Number of GPUs per trial |
| `--cpu-count` | `4` | Number of CPUs per trial |
| `--mem-gb` | `128` | Memory (GB) per trial |
| `--lsf-gpu-config-string` | `None` | Optional verbatim LSF `-gpu` option string (see [GPU configuration](#gpu-configuration-on-lsf)) |
| `--wlm-plugin` | *(local)* | Path to an executable WLM plugin script. When omitted, trials run locally in the current process |
| `--parallelism` | `1` | Number of trials to run in parallel (see [Parallel execution](#parallel-execution)) |

### Vela (OpenShift) options

Required when `--wlm vela`.

| Option | Default | Description |
|---|---|---|
| `--vela-job-template` | *(required)* | Path to the Vela job YAML template. `{{HPO_COMMAND}}` in `setupCommands` is replaced per trial |
| `--vela-chart-path` | *(required)* | Path to the `pytorchjob-generator` helm chart directory |
| `--vela-namespace` | *(current context)* | OpenShift/Kubernetes namespace |
| `--vela-cmd-placeholder` | `{{HPO_COMMAND}}` | String in `setupCommands` that is replaced with the HPO-parametrised CLI call |
| `--vela-pod-ready-timeout` | `600` | Seconds to wait for the trial pod to reach Running state |
| `--vela-job-timeout` | `86400` | Seconds to wait (streaming logs) for the job to complete |

### Optuna options

| Option | Default | Description |
Expand Down Expand Up @@ -189,14 +180,20 @@ Optuna tracks the choice as a single categorical (`dataset = "case2000"`), but t

##### `gpu_num` — dynamic GPU count

The special key `gpu_num` (as `categorical` or `int`) overrides `--gpu-count` for the **WLM resource request** of each individual trial. It is consumed by `iterate2` and never forwarded to the wrapped script.
The special key `gpu_num` (as `categorical` or `int`) is automatically extracted
from the sampled parameters and forwarded to the WLM plugin as
`ITERATE_WLM_GPU_COUNT`. It does **not** appear in the wrapped script's command
line. The WLM plugin uses it to set the cluster resource request for the trial.

```yaml
gpu_num:
type: categorical
choices: [1, 2, 4]
```

Alternatively, set a fixed `gpu-count` in the `wlm:` section of the HPO YAML
when all trials use the same number of GPUs.
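
The `gpu_num` handling described above can be pictured with a small sketch. This is an illustration only, not the iterate2 implementation: the helper name `split_gpu_num` is hypothetical, and rendering the value as a string is an assumption based on environment variables being text.

```python
def split_gpu_num(sampled):
    """Separate gpu_num from the parameters that reach the wrapped script.

    Hypothetical helper: returns (script_params, plugin_env).
    """
    # Everything except gpu_num goes on the script's command line.
    script_params = {k: v for k, v in sampled.items() if k != "gpu_num"}

    # gpu_num, if sampled, is exposed to the plugin instead.
    plugin_env = {}
    if "gpu_num" in sampled:
        plugin_env["ITERATE_WLM_GPU_COUNT"] = str(sampled["gpu_num"])
    return script_params, plugin_env


script_params, plugin_env = split_gpu_num({"lr": 0.01, "gpu_num": 2})
print(script_params)  # {'lr': 0.01}
print(plugin_env)     # {'ITERATE_WLM_GPU_COUNT': '2'}
```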

### Static arguments

Arguments passed unchanged to every trial. Can be supplied inline or via file:
Expand Down Expand Up @@ -270,46 +267,59 @@ iterate2 --param-setter set ...

---

## GPU configuration on LSF
## WLM plugin system

iterate2 has no built-in knowledge of any workload manager. Instead it calls a
user-supplied **plugin script** once per trial. The plugin can be any
executable (bash, Python, …).

When `--wlm lsf` is selected, `iterate2` constructs a `bsub` command for each trial.
### Plugin interface

### Default behaviour
iterate2 calls the plugin with no positional arguments. All information is
delivered through environment variables:

| `--gpu-count` | Generated fragment |
| Variable | Description |
|---|---|
| `> 0` (default `1`) | `-gpu num=<N>` |
| `0` | *(no `-gpu` flag, CPU-only job)* |
| `ITERATE_TRIAL_NUMBER` | Integer trial ID |
| `ITERATE_TRIAL_CMD` | Full shell command (with `cd`, `source venv`) – suited for HPC WLMs |
| `ITERATE_TRIAL_CONTAINER_CMD` | Bare CLI invocation (no `cd`/`source`) – suited for container-based systems |
| `ITERATE_OUT_FILE` | File where **stdout** must be written |
| `ITERATE_ERR_FILE` | File where **stderr** must be written |
| `ITERATE_WLM_<KEY>` | Every key from the YAML `wlm:` section (uppercased, hyphens → underscores) |

### `--lsf-gpu-config-string`
The plugin must exit **0** on success; any other exit code marks the trial as
failed in Optuna.
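
As a rough illustration of this interface, the sketch below is a hypothetical plugin that runs the trial on the local machine rather than submitting it to a cluster. It is not one of the reference plugins. The function names `run_trial` and `main` are made up for this example; only the `ITERATE_*` variable names come from the table above.

```python
#!/usr/bin/env python3
"""Hypothetical minimal WLM plugin: runs the trial command locally."""
import os
import subprocess
import sys


def run_trial(env):
    """Run ITERATE_TRIAL_CMD and return its exit code.

    stdout and stderr go to the files named by iterate2.
    """
    cmd = env["ITERATE_TRIAL_CMD"]
    with open(env["ITERATE_OUT_FILE"], "w") as out, \
            open(env["ITERATE_ERR_FILE"], "w") as err:
        # shell=True because the command is a full shell line (cd, source, ...).
        # Note: `source` is a bash builtin; if /bin/sh is not bash on your
        # system, pass executable="/bin/bash" as well.
        result = subprocess.run(cmd, shell=True, stdout=out, stderr=err)
    return result.returncode


def main():
    try:
        return run_trial(os.environ)
    except KeyError as exc:
        print(f"missing environment variable {exc}", file=sys.stderr)
        return 2  # non-zero exit code marks the trial as failed
```

To use this as an actual plugin script, end the file with `sys.exit(main())` so the trial's exit code becomes the plugin's exit code.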

For advanced LSF GPU scheduling you can supply the full value of the `-gpu` option as a string. When set, it **completely replaces** the auto-generated `-gpu num=<N>` fragment.
### WLM configuration in the HPO YAML

```sh
iterate2 \
--wlm lsf \
--lsf-gpu-config-string "num=1:mode=exclusive_process:mps=yes:gmodel=NVIDIAA100_SXM4_80GB" \
--cpu-count 20 \
--mem-gb 512 \
...
```
All WLM-specific parameters (GPU count, memory, queue, job template path, …)
live in an optional `wlm:` section of the HPO YAML:

This produces a `bsub` submission resembling:
```yaml
hpo:
lr: { type: float, low: 1e-5, high: 1e-2, log: true }

```sh
bsub -n 20 -R "span[hosts=1]" \
-gpu "num=1:mode=exclusive_process:mps=yes:gmodel=NVIDIAA100_SXM4_80GB" \
-M 512G -J hpo_trial_0 \
"cd /my/root && source .venv/bin/activate && python train.py ..."
static:
epochs: 50

# WLM config – forwarded as ITERATE_WLM_* env vars to the plugin
wlm:
gpu-count: 1
cpu-count: 8
mem-gb: 32
lsf-gpu-config: "num=1:mode=exclusive_process:mps=no:gmodel=NVIDIAA100_SXM4_80GB"
```
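
The naming rule for `ITERATE_WLM_<KEY>` (uppercased, hyphens replaced by underscores) can be sketched as follows. The function `wlm_env` is a hypothetical name for illustration, and converting values with `str()` is an assumption, since environment variables can only hold text.

```python
def wlm_env(wlm_section):
    """Map keys of the wlm: section to ITERATE_WLM_* variable names."""
    return {
        "ITERATE_WLM_" + key.upper().replace("-", "_"): str(value)
        for key, value in wlm_section.items()
    }


env = wlm_env({"gpu-count": 1, "cpu-count": 8, "mem-gb": 32})
print(env)
# {'ITERATE_WLM_GPU_COUNT': '1', 'ITERATE_WLM_CPU_COUNT': '8', 'ITERATE_WLM_MEM_GB': '32'}
```

Under this rule, a plugin would read the `lsf-gpu-config` value from the example above as `ITERATE_WLM_LSF_GPU_CONFIG`.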

!!! note
`--gpu-count` is still used for the `rusage` memory/CPU reservation string even when `--lsf-gpu-config-string` is set. Set it to match the `num=` value in your GPU string.
### Reference plugins

!!! tip
Use exclusive process mode (`mode=exclusive_process`) together with MPS (`mps=yes`) to share a single A100 across multiple MPS clients while still pinning the job to one physical GPU.
See `examples/wlm_plugins/` for fully documented reference implementations:

---
| Plugin | WLM |
|---|---|
| `lsf_plugin.sh` | IBM Spectrum LSF (`bsub -K`) |
| `vela_plugin.py` | OpenShift / MLBatch PyTorchJob (`helm template \| oc create`) |

Writing a Slurm plugin follows the same pattern as `lsf_plugin.sh`.

---

Expand All @@ -320,7 +330,7 @@ By default `iterate2` runs one trial at a time. Pass `--parallelism N` to run up
```sh
iterate2 \
--parallelism 4 \
--wlm lsf \
--wlm-plugin examples/wlm_plugins/lsf_plugin.sh \
...
```

Expand All @@ -344,12 +354,12 @@ Output from concurrent trials is prefixed so you can follow individual workers:

### Output files

| WLM | stdout | stderr |
|---|---|---|
| `none` | `trial_N.out` (written by iterate2) | `trial_N.err` (written by iterate2) |
| `lsf` / `slurm` | `trial_N.out` (written by WLM on cluster) | `trial_N.err` (written by WLM on cluster) |
iterate2 tells the plugin where to write output via `ITERATE_OUT_FILE` /
`ITERATE_ERR_FILE`. The plugin is responsible for directing its job's
stdout/stderr to those files. iterate2 extracts metrics from them after the
plugin exits.
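
How metrics might be pulled from the output file can be sketched as below. This assumes one metric per line in `name: value` form with a numeric value. The function `extract_metrics` is hypothetical, and iterate2's actual extraction logic may differ.

```python
import re


def extract_metrics(text, names):
    """Return {name: float} for each requested metric found in text.

    Assumes lines of the form "name: value"; the first match per name wins.
    """
    metrics = {}
    for name in names:
        pattern = rf"^{re.escape(name)}:[ \t]*(\S+)[ \t]*$"
        match = re.search(pattern, text, flags=re.MULTILINE)
        if match:
            metrics[name] = float(match.group(1))
    return metrics


print(extract_metrics("yval: 0.5\nother: 2\n", ["yval"]))  # {'yval': 0.5}
```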

For WLM backends the local WLM tool output (bsub/srun status messages) is written to `trial_N_wlm.out` / `trial_N_wlm.err` so the cluster-managed files are never overwritten.
For local execution (no plugin), iterate2 writes them directly:

### SQLite and parallelism

122 changes: 57 additions & 65 deletions examples/bumpy_function.py
@@ -1,6 +1,23 @@
#!/usr/bin/env python3
import argparse
"""
Bumpy 3-D multimodal function — called by iterate2 as a trial script.

iterate2 sets the following environment variables before calling this script:
ITERATE_TRIAL_NUMBER – integer trial index
ITERATE_OUT_FILE – path where metrics must be written
ITERATE_ERR_FILE – path for error logging
ITERATE_PARAM_X – HPO parameter x
ITERATE_PARAM_Y – HPO parameter y
ITERATE_PARAM_Z – HPO parameter z
ITERATE_PARAM_GLOBAL_MU – static parameter (three space-separated floats)

All output that iterate2 uses to extract metrics must be written to
ITERATE_OUT_FILE (not stdout), one metric per line in "name: value" format.
"""

import math
import os
import sys


def bumpy_function_3d(
Expand All @@ -9,86 +26,61 @@ def bumpy_function_3d(
mu_rest, sigma_rest, amps_rest,
):
"""
3D smooth multimodal function with:
- one global optimum = 1 at global_mu = (mx,my,mz)
- multiple local optima < 1
3D smooth multimodal function.
- one global optimum = 1 at global_mu = (mx, my, mz)
- multiple local optima < 1

f(p) = 1 - Π_k (1 - a_k * exp(-||p - mu_k||^2 / (2 sigma_k^2)))
f(p) = 1 - prod_k (1 - a_k * exp(-||p - mu_k||^2 / (2 sigma_k^2)))
"""

def sqdist(p, q):
return (p[0] - q[0])**2 + (p[1] - q[1])**2 + (p[2] - q[2])**2

p = (x, y, z)

# Global peak (amplitude = 1)
val = 1.0 - math.exp(
-sqdist(p, global_mu) / (2.0 * global_sigma**2)
)
val = 1.0 - math.exp(-sqdist(p, global_mu) / (2.0 * global_sigma**2))

# Local peaks
for mu_k, sig_k, a_k in zip(mu_rest, sigma_rest, amps_rest):
term = 1.0 - a_k * math.exp(
-sqdist(p, mu_k) / (2.0 * sig_k**2)
)
val *= term
val *= 1.0 - a_k * math.exp(-sqdist(p, mu_k) / (2.0 * sig_k**2))

return 1.0 - val


if __name__ == "__main__":
parser = argparse.ArgumentParser("Evaluate the 3D bumpy multimodal function.")

parser.add_argument("--x", type=float, required=True)
parser.add_argument("--y", type=float, required=True)
parser.add_argument("--z", type=float, required=True)
parser.add_argument("--trial-number", type=int, default=0)

parser.add_argument(
"--global-mu",
type=float,
nargs=3,
default=[0.0, 0.0, 0.0],
metavar=("MX", "MY", "MZ"),
)
parser.add_argument("--global-sigma", type=float, default=0.7)

parser.add_argument(
"--mu-rest",
type=float,
nargs="*",
default=[-2.0, 0.0, 0.0, 2.0, 0.0, 0.0],
help="Flat list of (x y z) triplets",
)
parser.add_argument(
"--sigma-rest",
type=float,
nargs="*",
default=[0.6, 0.6],
)
parser.add_argument(
"--amps-rest",
type=float,
nargs="*",
default=[0.5, 0.8],
)

args = parser.parse_args()

mu_rest = [
tuple(args.mu_rest[i:i+3])
for i in range(0, len(args.mu_rest), 3)
]

# --- read parameters from environment ---------------------------------- #
try:
x = float(os.environ["ITERATE_PARAM_X"])
y = float(os.environ["ITERATE_PARAM_Y"])
z = float(os.environ["ITERATE_PARAM_Z"])
global_mu = tuple(map(float, os.environ["ITERATE_PARAM_GLOBAL_MU"].split()))
out_file = os.environ["ITERATE_OUT_FILE"]
trial_num = os.environ.get("ITERATE_TRIAL_NUMBER", "?")
except KeyError as exc:
print(f"ERROR: missing required environment variable {exc}", file=sys.stderr)
sys.exit(1)

if len(global_mu) != 3:
print("ERROR: ITERATE_PARAM_GLOBAL_MU must contain exactly three floats", file=sys.stderr)
sys.exit(1)

# Fixed defaults for the local-optima configuration
mu_rest = [(-2.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
sigma_rest = [0.6, 0.6]
amps_rest = [0.5, 0.8]
global_sigma = 0.7

# --- evaluate ---------------------------------------------------------- #
yval = bumpy_function_3d(
x=args.x,
y=args.y,
z=args.z,
global_mu=tuple(args.global_mu),
global_sigma=args.global_sigma,
x=x, y=y, z=z,
global_mu=global_mu,
global_sigma=global_sigma,
mu_rest=mu_rest,
sigma_rest=args.sigma_rest,
amps_rest=args.amps_rest,
sigma_rest=sigma_rest,
amps_rest=amps_rest,
)

print(f'yval: {yval}, trial_number: {args.trial_number}')
# --- write metrics to ITERATE_OUT_FILE --------------------------------- #
with open(out_file, "w") as fh:
fh.write(f"yval: {yval}\n")

print(f"[trial-{trial_num}] yval={yval:.6f}")
20 changes: 11 additions & 9 deletions examples/bumpy_hpo.yaml
@@ -1,13 +1,15 @@
# =======================
# Static parameters - passed to the underlying training script as is
# =======================

# HPO search space for the bumpy 3-D multimodal function.
#
# Only three sections are recognised by iterate2:
# metrics: – names to extract from the trial script output
# static: – fixed parameters passed to every trial
# hpo: – parameters Optuna will optimise

metrics:
- yval

static:
global-mu: 23 42 66

# ========================
# Training hyperparameters - evaluated by optuna and passed to the underlying training script
# ========================
global-mu: "23 42 66"

hpo:
x: