claimed-framework · romeokienzler · May 13, 2026 · Apr 24, 2026 · Apr 24, 2026 · Apr 24, 2026
diff --git a/configs/gridfm_graphkit_hpo.yaml b/configs/gridfm_graphkit_hpo.yaml
@@ -46,7 +46,7 @@ hpo:
     choices:
       case118:
         config: ./examples/config/HGNS_PF_datakit_case118.yaml
-        data_path: /u/rkie/
+        data_path: /u/rki/
 
 static:
   run_name: run1

diff --git a/docs/iterate2.md b/docs/iterate2.md
@@ -6,19 +6,27 @@ Key capabilities:
 
 - **Multi-objective optimisation** — extract and optimise several metrics simultaneously (Pareto front)
 - **Five HPO parameter types** — `float`, `int`, `categorical`, `flag` (store-true), `group` (bundled arg sets)
-- **Dynamic GPU count per trial** — `gpu_num` in the HPO space controls the WLM resource request per trial
+- **Dynamic GPU count per trial** — `gpu_num` in the HPO space is passed to the WLM plugin via `ITERATE_WLM_GPU_COUNT`
 - **Null-omission** — `null` in a `categorical` choice causes the flag to be completely absent from the command line
-- **Workload manager backends** — LSF, Slurm, or direct local execution
+- **WLM plugin system** — any executable (bash, Python, …) can be used as a workload-manager backend; reference implementations for LSF and Vela/OpenShift are in `examples/wlm_plugins/`
 
 ## Quick start
 
 ```sh
 iterate2 \
   --script train.py \
-  --wlm lsf \
-  --gpu-count 1 \
-  --cpu-count 20 \
-  --mem-gb 512 \
+  --wlm-plugin examples/wlm_plugins/lsf_plugin.sh \
+  --optuna-study-name my_study \
+  --optuna-db-path sqlite:///hpo.db \
+  --optuna-n-trials 50 \
+  --hpo-yaml hpo_space.yaml   # wlm: section sets gpu-count, cpu-count, …
+```
+
+For local execution (no cluster) simply omit `--wlm-plugin`:
+
+```sh
+iterate2 \
+  --script train.py \
   --optuna-study-name my_study \
   --optuna-db-path sqlite:///hpo.db \
   --optuna-n-trials 50 \
@@ -33,29 +41,12 @@ iterate2 \
 |---|---|---|
 | `--script` | *(required)* | Training script to execute |
 | `--root-dir` | `.` | Working directory; derived from `--script` if omitted |
-| `--venv` | `.venv` | Virtual-environment directory to activate. Set to empty string to disable |
+| `--venv` | *(none)* | Virtual-environment directory to activate. Omit to skip venv activation entirely |
 | `--interpreter` | `python` | Python interpreter to invoke |
 | `--param-setter` | `None` | Use setter-style argument passing (see [Setter-style arguments](#setter-style-arguments)) |
-| `--wlm` | `none` | Workload manager: `lsf`, `slurm`, `vela`, or `none` |
-| `--gpu-count` | `1` | Number of GPUs per trial |
-| `--cpu-count` | `4` | Number of CPUs per trial |
-| `--mem-gb` | `128` | Memory (GB) per trial |
-| `--lsf-gpu-config-string` | `None` | Optional verbatim LSF `-gpu` option string (see [GPU configuration](#gpu-configuration-on-lsf)) |
+| `--wlm-plugin` | *(local)* | Path to an executable WLM plugin script. When omitted, trials run locally in the current process |
 | `--parallelism` | `1` | Number of trials to run in parallel (see [Parallel execution](#parallel-execution)) |
 
-### Vela (OpenShift) options
-
-Required when `--wlm vela`.
-
-| Option | Default | Description |
-|---|---|---|
-| `--vela-job-template` | *(required)* | Path to the Vela job YAML template. `{{HPO_COMMAND}}` in `setupCommands` is replaced per trial |
-| `--vela-chart-path` | *(required)* | Path to the `pytorchjob-generator` helm chart directory |
-| `--vela-namespace` | *(current context)* | OpenShift/Kubernetes namespace |
-| `--vela-cmd-placeholder` | `{{HPO_COMMAND}}` | String in `setupCommands` that is replaced with the HPO-parametrised CLI call |
-| `--vela-pod-ready-timeout` | `600` | Seconds to wait for the trial pod to reach Running state |
-| `--vela-job-timeout` | `86400` | Seconds to wait (streaming logs) for the job to complete |
-
 ### Optuna options
 
 | Option | Default | Description |
@@ -189,14 +180,20 @@ Optuna tracks the choice as a single categorical (`dataset = "case2000"`), but t
 
 ##### `gpu_num` — dynamic GPU count
 
-The special key `gpu_num` (as `categorical` or `int`) overrides `--gpu-count` for the **WLM resource request** of each individual trial. It is consumed by `iterate2` and never forwarded to the wrapped script.
+The special key `gpu_num` (as `categorical` or `int`) is automatically extracted
+from the sampled parameters and forwarded to the WLM plugin as
+`ITERATE_WLM_GPU_COUNT`.  It does **not** appear in the wrapped script's command
+line.  The WLM plugin uses it to set the cluster resource request for the trial.
 
 ```yaml
 gpu_num:
   type: categorical
   choices: [1, 2, 4]
 ```
 
+Alternatively, set a fixed `gpu-count` in the `wlm:` section of the HPO YAML
+when all trials use the same number of GPUs.
+
 ### Static arguments
 
 Arguments passed unchanged to every trial. Can be supplied inline or via file:
@@ -270,46 +267,59 @@ iterate2 --param-setter set ...
 
 ---
 
-## GPU configuration on LSF
+## WLM plugin system
+
+iterate2 has no built-in knowledge of any workload manager.  Instead it calls a
+user-supplied **plugin script** once per trial.  The plugin can be any
+executable (bash, Python, …).
 
-When `--wlm lsf` is selected, `iterate2` constructs a `bsub` command for each trial.
+### Plugin interface
 
-### Default behaviour
+iterate2 calls the plugin with no positional arguments.  All information is
+delivered through environment variables:
 
-| `--gpu-count` | Generated fragment |
+| Variable | Description |
 |---|---|
-| `> 0` (default `1`) | `-gpu num=<N>` |
-| `0` | *(no `-gpu` flag, CPU-only job)* |
+| `ITERATE_TRIAL_NUMBER` | Integer trial ID |
+| `ITERATE_TRIAL_CMD` | Full shell command (with `cd`, `source venv`) – suited for HPC WLMs |
+| `ITERATE_TRIAL_CONTAINER_CMD` | Bare CLI invocation (no `cd`/`source`) – suited for container-based systems |
+| `ITERATE_OUT_FILE` | File where **stdout** must be written |
+| `ITERATE_ERR_FILE` | File where **stderr** must be written |
+| `ITERATE_WLM_<KEY>` | Every key from the YAML `wlm:` section (uppercased, hyphens → underscores) |
 
-### `--lsf-gpu-config-string`
+The plugin must exit **0** on success; any other exit code marks the trial as
+failed in Optuna.
 
-For advanced LSF GPU scheduling you can supply the full value of the `-gpu` option as a string. When set, it **completely replaces** the auto-generated `-gpu num=<N>` fragment.
+### WLM configuration in the HPO YAML
 
-```sh
-iterate2 \
-  --wlm lsf \
-  --lsf-gpu-config-string "num=1:mode=exclusive_process:mps=yes:gmodel=NVIDIAA100_SXM4_80GB" \
-  --cpu-count 20 \
-  --mem-gb 512 \
-  ...
-```
+All WLM-specific parameters (GPU count, memory, queue, job template path, …)
+live in an optional `wlm:` section of the HPO YAML:
 
-This produces a `bsub` submission resembling:
+```yaml
+hpo:
+  lr: { type: float, low: 1e-5, high: 1e-2, log: true }
 
-```sh
-bsub -n 20 -R "span[hosts=1]" \
-     -gpu "num=1:mode=exclusive_process:mps=yes:gmodel=NVIDIAA100_SXM4_80GB" \
-     -M 512G -J hpo_trial_0 \
-     "cd /my/root && source .venv/bin/activate && python train.py ..."
+static:
+  epochs: 50
+
+# WLM config – forwarded as ITERATE_WLM_* env vars to the plugin
+wlm:
+  gpu-count: 1
+  cpu-count: 8
+  mem-gb: 32
+  lsf-gpu-config: "num=1:mode=exclusive_process:mps=no:gmodel=NVIDIAA100_SXM4_80GB"
 ```
 
-!!! note
-    `--gpu-count` is still used for the `rusage` memory/CPU reservation string even when `--lsf-gpu-config-string` is set. Set it to match the `num=` value in your GPU string.
+### Reference plugins
 
-!!! tip
-    Use exclusive process mode (`mode=exclusive_process`) together with MPS (`mps=yes`) to share a single A100 across multiple MPS clients while still pinning the job to one physical GPU.
+See `examples/wlm_plugins/` for fully documented reference implementations:
 
----
+| Plugin | WLM |
+|---|---|
+| `lsf_plugin.sh` | IBM Spectrum LSF (`bsub -K`) |
+| `vela_plugin.py` | OpenShift / MLBatch PyTorchJob (`helm template \| oc create`) |
+
+Writing a Slurm plugin follows the same pattern as `lsf_plugin.sh`.
 
 ---
 
@@ -320,7 +330,7 @@ By default `iterate2` runs one trial at a time. Pass `--parallelism N` to run up
 ```sh
 iterate2 \
   --parallelism 4 \
-  --wlm lsf \
+  --wlm-plugin examples/wlm_plugins/lsf_plugin.sh \
   ...
 ```
 
@@ -344,12 +354,12 @@ Output from concurrent trials is prefixed so you can follow individual workers:
 
 ### Output files
 
-| WLM | stdout | stderr |
-|---|---|---|
-| `none` | `trial_N.out` (written by iterate2) | `trial_N.err` (written by iterate2) |
-| `lsf` / `slurm` | `trial_N.out` (written by WLM on cluster) | `trial_N.err` (written by WLM on cluster) |
+iterate2 tells the plugin where to write output via `ITERATE_OUT_FILE` /
+`ITERATE_ERR_FILE`.  The plugin is responsible for directing its job's
+stdout/stderr to those files.  iterate2 extracts metrics from them after the
+plugin exits.
 
-For WLM backends the local WLM tool output (bsub/srun status messages) is written to `trial_N_wlm.out` / `trial_N_wlm.err` so the cluster-managed files are never overwritten.
+For local execution (no plugin) iterate2 writes `trial_N.out` and `trial_N.err` directly.
 
 ### SQLite and parallelism
 

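The plugin interface documented in `docs/iterate2.md` above (environment variables in, exit code out) can be illustrated with a small Python sketch. This is not one of the reference plugins from `examples/wlm_plugins/`; the names `wlm_env_name` and `run_trial` are hypothetical, and running the command in a local shell is only a stand-in for a real cluster submission such as `bsub -K`.

```python
#!/usr/bin/env python3
"""Hypothetical minimal WLM plugin: runs the trial command in a local shell."""
import os
import subprocess
import sys


def wlm_env_name(key: str) -> str:
    """Map a `wlm:` YAML key to its variable name (uppercased, hyphens -> underscores)."""
    return "ITERATE_WLM_" + key.upper().replace("-", "_")


def run_trial(env) -> int:
    """Run ITERATE_TRIAL_CMD, sending stdout/stderr to the files iterate2 names."""
    cmd = env["ITERATE_TRIAL_CMD"]
    # Assumption: a real plugin would read resource settings like this one
    # and turn them into a scheduler request; here the value is only looked up.
    _gpu_count = env.get(wlm_env_name("gpu-count"), "0")
    with open(env["ITERATE_OUT_FILE"], "w") as out, \
         open(env["ITERATE_ERR_FILE"], "w") as err:
        # Stand-in for a cluster submission: execute the command locally.
        result = subprocess.run(cmd, shell=True, stdout=out, stderr=err)
    # Exit 0 on success; any other exit code marks the trial as failed.
    return 0 if result.returncode == 0 else 1


# Only act when the trial variables are present (i.e. when called by iterate2).
if __name__ == "__main__" and "ITERATE_TRIAL_CMD" in os.environ:
    sys.exit(run_trial(os.environ))
```

Saved as an executable file (for example a hypothetical `local_plugin.py`), such a script would be passed via `--wlm-plugin`. Keeping the logic in `run_trial` with an explicit mapping argument makes it easy to exercise without touching the real environment.
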
diff --git a/examples/bumpy_function.py b/examples/bumpy_function.py
@@ -1,6 +1,23 @@
 #!/usr/bin/env python3
-import argparse
+"""
+Bumpy 3-D multimodal function — called by iterate2 as a trial script.
+
+iterate2 sets the following environment variables before calling this script:
+  ITERATE_TRIAL_NUMBER   – integer trial index
+  ITERATE_OUT_FILE       – path where metrics must be written
+  ITERATE_ERR_FILE       – path for error logging
+  ITERATE_PARAM_X        – HPO parameter x
+  ITERATE_PARAM_Y        – HPO parameter y
+  ITERATE_PARAM_Z        – HPO parameter z
+  ITERATE_PARAM_GLOBAL_MU – static parameter (three space-separated floats)
+
+All output that iterate2 uses to extract metrics must be written to
+ITERATE_OUT_FILE (not stdout), one metric per line in "name: value" format.
+"""
+
 import math
+import os
+import sys
 
 
 def bumpy_function_3d(
@@ -9,86 +26,61 @@ def bumpy_function_3d(
     mu_rest, sigma_rest, amps_rest,
 ):
     """
-    3D smooth multimodal function with:
-    - one global optimum = 1 at global_mu = (mx,my,mz)
-    - multiple local optima < 1
+    3D smooth multimodal function.
+      - one global optimum = 1 at global_mu = (mx, my, mz)
+      - multiple local optima < 1
 
-    f(p) = 1 - Π_k (1 - a_k * exp(-||p - mu_k||^2 / (2 sigma_k^2)))
+    f(p) = 1 - prod_k (1 - a_k * exp(-||p - mu_k||^2 / (2 sigma_k^2)))
     """
 
     def sqdist(p, q):
         return (p[0] - q[0])**2 + (p[1] - q[1])**2 + (p[2] - q[2])**2
 
     p = (x, y, z)
 
-    # Global peak (amplitude = 1)
-    val = 1.0 - math.exp(
-        -sqdist(p, global_mu) / (2.0 * global_sigma**2)
-    )
+    val = 1.0 - math.exp(-sqdist(p, global_mu) / (2.0 * global_sigma**2))
 
-    # Local peaks
     for mu_k, sig_k, a_k in zip(mu_rest, sigma_rest, amps_rest):
-        term = 1.0 - a_k * math.exp(
-            -sqdist(p, mu_k) / (2.0 * sig_k**2)
-        )
-        val *= term
+        val *= 1.0 - a_k * math.exp(-sqdist(p, mu_k) / (2.0 * sig_k**2))
 
     return 1.0 - val
 
 
 if __name__ == "__main__":
-    parser = argparse.ArgumentParser("Evaluate the 3D bumpy multimodal function.")
-
-    parser.add_argument("--x", type=float, required=True)
-    parser.add_argument("--y", type=float, required=True)
-    parser.add_argument("--z", type=float, required=True)
-    parser.add_argument("--trial-number", type=int, default=0)
-
-    parser.add_argument(
-        "--global-mu",
-        type=float,
-        nargs=3,
-        default=[0.0, 0.0, 0.0],
-        metavar=("MX", "MY", "MZ"),
-    )
-    parser.add_argument("--global-sigma", type=float, default=0.7)
-
-    parser.add_argument(
-        "--mu-rest",
-        type=float,
-        nargs="*",
-        default=[-2.0, 0.0, 0.0,  2.0, 0.0, 0.0],
-        help="Flat list of (x y z) triplets",
-    )
-    parser.add_argument(
-        "--sigma-rest",
-        type=float,
-        nargs="*",
-        default=[0.6, 0.6],
-    )
-    parser.add_argument(
-        "--amps-rest",
-        type=float,
-        nargs="*",
-        default=[0.5, 0.8],
-    )
-
-    args = parser.parse_args()
-
-    mu_rest = [
-        tuple(args.mu_rest[i:i+3])
-        for i in range(0, len(args.mu_rest), 3)
-    ]
-
+    # --- read parameters from environment ---------------------------------- #
+    try:
+        x = float(os.environ["ITERATE_PARAM_X"])
+        y = float(os.environ["ITERATE_PARAM_Y"])
+        z = float(os.environ["ITERATE_PARAM_Z"])
+        global_mu = tuple(map(float, os.environ["ITERATE_PARAM_GLOBAL_MU"].split()))
+        out_file  = os.environ["ITERATE_OUT_FILE"]
+        trial_num = os.environ.get("ITERATE_TRIAL_NUMBER", "?")
+    except KeyError as exc:
+        print(f"ERROR: missing required environment variable {exc}", file=sys.stderr)
+        sys.exit(1)
+
+    if len(global_mu) != 3:
+        print("ERROR: ITERATE_PARAM_GLOBAL_MU must contain exactly three floats", file=sys.stderr)
+        sys.exit(1)
+
+    # Fixed defaults for the local-optima configuration
+    mu_rest    = [(-2.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
+    sigma_rest = [0.6, 0.6]
+    amps_rest  = [0.5, 0.8]
+    global_sigma = 0.7
+
+    # --- evaluate ---------------------------------------------------------- #
     yval = bumpy_function_3d(
-        x=args.x,
-        y=args.y,
-        z=args.z,
-        global_mu=tuple(args.global_mu),
-        global_sigma=args.global_sigma,
+        x=x, y=y, z=z,
+        global_mu=global_mu,
+        global_sigma=global_sigma,
         mu_rest=mu_rest,
-        sigma_rest=args.sigma_rest,
-        amps_rest=args.amps_rest,
+        sigma_rest=sigma_rest,
+        amps_rest=amps_rest,
     )
 
-    print(f'yval: {yval}, trial_number: {args.trial_number}')
+    # --- write metrics to ITERATE_OUT_FILE --------------------------------- #
+    with open(out_file, "w") as fh:
+        fh.write(f"yval: {yval}\n")
+
+    print(f"[trial-{trial_num}] yval={yval:.6f}")
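The docstring of `bumpy_function.py` above specifies one metric per line in "name: value" format. iterate2's actual extraction code is not part of this diff, so the following is only a sketch of how such a file could be read; `parse_metrics` is a hypothetical helper, and the assumption that every metric value is a float is mine.

```python
def parse_metrics(text, names):
    """Collect `name: value` lines for the requested metric names.

    Lines without a colon, with an unknown name, or with a value that is
    not a float are ignored (an assumption, not documented behaviour).
    """
    found = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        name = name.strip()
        if not sep or name not in names:
            continue
        try:
            found[name] = float(value)
        except ValueError:
            continue
    return found


if __name__ == "__main__":
    sample = "yval: 0.25\n[trial-3] yval=0.250000\n"
    print(parse_metrics(sample, ["yval"]))
```

Under this reading, the progress line that the script prints to stdout (`[trial-N] yval=...`) is not picked up as a metric because it contains no colon-separated pair.
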
diff --git a/examples/bumpy_hpo.yaml b/examples/bumpy_hpo.yaml
@@ -1,13 +1,15 @@
-# =======================
-# Static parameters - passed to the underlying training script as is
-# =======================  
-
+# HPO search space for the bumpy 3-D multimodal function.
+#
+# This file uses three sections recognised by iterate2 (a `wlm:` section is optional):
+#   metrics: – names to extract from the trial script output
+#   static:  – fixed parameters passed to every trial
+#   hpo:     – parameters Optuna will optimise
+
+metrics:
+  - yval
+
 static:
-  global-mu: 23 42 66
-
-# ========================
-# Training hyperparameters - evaluated by optuna and passed to the underlying training script
-# ========================
+  global-mu: "23 42 66"
 
 hpo:
   x: