ML extension: Sweep step not triggered in pipeline (status: not started) and reported as failed after a while #25101

@leonieroos

Dear team,

I have a pipeline with a sweep component that has stopped working. The pipeline reports an overall failure because the sweep step never starts, which leaves me without any error message or logs.

The command, run with the ml extension:
az ml job create --file ./pipelines/pipeline_demandmodel_hp.yml

az version:
{
  "azure-cli": "2.42.0",
  "azure-cli-core": "2.42.0",
  "azure-cli-telemetry": "1.0.8",
  "extensions": {
    "ml": "2.12.1"
  }
}

Within the environment I have azure-ai-ml==1.1.0.

I expect the pipeline to produce child runs and trials for the parameters, as it did a month ago. Instead it gets stuck, never initiating the sweep step at all, and after a while it is marked as 'failed'. I tried with a registered dataset as input as well as with the data passed on from the previous step (which completes with a green tick), and both show the same issue.
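
Since the portal shows no logs for the step, one way to see what state the child jobs are in is to list them through the Python SDK. The following is only a sketch, assuming that `MLClient.jobs.list(parent_job_name=...)` is available in the installed azure-ai-ml version and that the returned jobs expose `name`, `display_name`, `type` and `status`. The helper names `summarize_child_jobs` and `build_client` are hypothetical and not part of any library.

```python
from typing import Any, List, Tuple


def summarize_child_jobs(ml_client: Any, pipeline_job_name: str) -> List[Tuple[str, str, str]]:
    """Return (display name, type, status) for each child job of a pipeline run.

    `ml_client` is expected to be an azure.ai.ml.MLClient. It is passed in so
    this helper does not depend on how authentication is set up.
    """
    rows = []
    for child in ml_client.jobs.list(parent_job_name=pipeline_job_name):
        rows.append((
            # Fall back to the job name when no display name is set.
            getattr(child, "display_name", None) or child.name,
            str(getattr(child, "type", "unknown")),
            str(getattr(child, "status", "unknown")),
        ))
    return rows


def build_client(subscription_id: str, resource_group: str, workspace: str) -> Any:
    """Create an MLClient (assumes azure-ai-ml and azure-identity are installed)."""
    # Imported lazily so summarize_child_jobs can be used without the Azure packages.
    from azure.ai.ml import MLClient
    from azure.identity import DefaultAzureCredential

    return MLClient(
        credential=DefaultAzureCredential(),
        subscription_id=subscription_id,
        resource_group_name=resource_group,
        workspace_name=workspace,
    )
```

Calling `summarize_child_jobs(build_client(...), "<pipeline job name>")` should show whether a job for `sweep_step` was created at all and which status it reports, which may help to tell a step that was never submitted from one that is waiting on compute.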

This is the pipeline definition containing the sweep step:

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json

display_name: Demand Hyperparameter tuning Pipeline
description: Pipeline prepares data and finds best set of parameters
experiment_name: demand_model_demo
type: pipeline

settings:
  default_compute: azureml:train
  default_datastore: azureml:spot_train

inputs:
  model_input:
    type: uri_folder
    path: azureml:test_input_hp@latest
    mode: ro_mount

jobs:
  sweep_step:
    type: sweep
    inputs:
      data: ${{parent.inputs.model_input}}
      start_new_run: True
      register_model: False
      gamma: 0
      sample_weights: True
      reg_alpha: 0
      reg_lambda: 1
    outputs:
      data_out:
        mode: rw_mount
    sampling_algorithm: bayesian
    trial: ../components/component_train_extraparam.yaml
    search_space:
      learning_rate:
        type: choice
        values: [0.05, 0.1, 0.15]
      max_depth:
        type: choice
        values: [5, 7, 10, 15, 20]
      n_estimators:
        type: choice
        values: [70, 100, 120, 150]
      max_delta_step:
        type: uniform
        min_value: 0.0
        max_value: 3.0
    objective:
      goal: minimize
      primary_metric: probability_difference
    limits:
      max_total_trials: 50
      max_concurrent_trials: 4
      timeout: 14400

#######

Component:

$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: train_demandmodel
display_name: Training Demand Model
type: command
inputs:
  data:
    type: uri_folder
  start_new_run:
    type: string
    default: True
  register_model:
    type: string
    default: False
  learning_rate:
    type: number
    default: 0.1
  n_estimators:
    type: integer
    default: 130
  max_depth:
    type: integer
    default: 10
  max_delta_step:
    type: number
    default: 0
  sample_weights:
    type: string
    default: True
  gamma:
    type: number
    default: 0
  reg_alpha:
    type: number
    default: 0
  reg_lambda:
    type: number
    default: 1
outputs:
  data_out:
    type: uri_folder
code: ../
environment: azureml:optimiser@latest
is_deterministic: false
command: >-
  python aml_train.py
    --data ${{inputs.data}}
    --data_out ${{outputs.data_out}}
    --start_new_run ${{inputs.start_new_run}}
    --register_model ${{inputs.register_model}}
    --learning_rate ${{inputs.learning_rate}}
    --n_estimators ${{inputs.n_estimators}}
    --max_depth ${{inputs.max_depth}}
    --max_delta_step ${{inputs.max_delta_step}}
    --sample_weights ${{inputs.sample_weights}}
    --gamma ${{inputs.gamma}}
    --reg_alpha ${{inputs.reg_alpha}}
    --reg_lambda ${{inputs.reg_lambda}}
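
To rule out a mismatch between the sweep step and the component, the two definitions above can be compared locally. The sketch below copies the input names, literal values and declared types by hand from the YAML shown in this issue. The three checks (undeclared names, names set both as a fixed input and in `search_space`, and literals whose YAML type differs from the declared type) are assumptions about what might matter, not documented Azure ML validation rules, and `check_sweep` is a hypothetical helper.

```python
from typing import Any, Dict, Iterable, List

# Literal inputs of sweep_step, as YAML would parse them (True/False become booleans).
SWEEP_INPUTS: Dict[str, Any] = {
    "data": "${{parent.inputs.model_input}}",
    "start_new_run": True,
    "register_model": False,
    "gamma": 0,
    "sample_weights": True,
    "reg_alpha": 0,
    "reg_lambda": 1,
}
# Hyperparameters listed under search_space.
SEARCH_SPACE = ["learning_rate", "max_depth", "n_estimators", "max_delta_step"]
# Input names and declared types of the train_demandmodel component.
COMPONENT_INPUTS: Dict[str, str] = {
    "data": "uri_folder", "start_new_run": "string", "register_model": "string",
    "learning_rate": "number", "n_estimators": "integer", "max_depth": "integer",
    "max_delta_step": "number", "sample_weights": "string", "gamma": "number",
    "reg_alpha": "number", "reg_lambda": "number",
}


def _matches(declared: str, value: Any) -> bool:
    """Check a literal against a declared type; binding expressions are skipped."""
    if isinstance(value, str) and value.startswith("${{"):
        return True  # resolved at runtime, cannot be checked here
    if declared == "string":
        return isinstance(value, str)
    if declared == "integer":
        return isinstance(value, int) and not isinstance(value, bool)
    if declared == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return True  # other types (e.g. uri_folder) are not checked in this sketch


def check_sweep(sweep_inputs: Dict[str, Any], search_space: Iterable[str],
                component_inputs: Dict[str, str]) -> List[str]:
    """Return human-readable findings; an empty list means nothing was flagged."""
    search_space = list(search_space)
    problems = []
    for name in list(sweep_inputs) + search_space:
        if name not in component_inputs:
            problems.append(f"{name}: not declared by the component")
    for name in sorted(set(sweep_inputs) & set(search_space)):
        problems.append(f"{name}: set both as input and in search_space")
    for name, value in sweep_inputs.items():
        declared = component_inputs.get(name)
        if declared is not None and not _matches(declared, value):
            problems.append(f"{name}: declared {declared}, got {type(value).__name__} {value!r}")
    return problems


for finding in check_sweep(SWEEP_INPUTS, SEARCH_SPACE, COMPONENT_INPUTS):
    print(finding)
```

For the definitions in this issue, the sketch flags `start_new_run`, `register_model` and `sample_weights`, which are declared as `string` but receive unquoted `True`/`False` literals. Whether the service tolerates this is not something the sketch can tell, so it is at most a candidate to rule out, for example by quoting the values.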
