ML extension: Sweep step not triggered in pipeline (status: not started) and pipeline indicates failure after a while #25101

Dear team,

I have a pipeline with a sweep component that has stopped working. The pipeline reports an overall failure because the sweep step never starts, which leaves me without any error message or logs.

The command, run with the ml extension:
az ml job create --file ./pipelines/pipeline_demandmodel_hp.yml

az version:
{
  "azure-cli": "2.42.0",
  "azure-cli-core": "2.42.0",
  "azure-cli-telemetry": "1.0.8",
  "extensions": {
    "ml": "2.12.1"
  }
}
Within the environment I have azure-ai-ml==1.1.0.

I expect the pipeline to produce child runs and trials for the parameters, as it did a month ago. Instead, the sweep step is never initiated, and after a while the pipeline is marked as 'failed'. I tried both a registered data set as input and the data passed on from the previous step (which completes with a green tick); both show the same issue.

<img width="881" alt="image" src="https://user-images.githubusercontent.com/22007913/211777032-12d47d8a-94cb-429f-b05a-1c5fb5ff29e4.png">

This is the pipeline definition containing the sweep step:

```
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json

display_name: Demand Hyperparameter tuning Pipeline
description: Pipeline prepares data and finds best set of parameters
experiment_name: demand_model_demo
type: pipeline

settings:
  default_compute: azureml:train
  default_datastore: azureml:spot_train

inputs:
  model_input:
    type: uri_folder
    path: azureml:test_input_hp@latest
    mode: ro_mount

jobs:
  sweep_step:
    type: sweep
    inputs:
      data: ${{parent.inputs.model_input}}
      start_new_run: True
      register_model: False
      gamma: 0
      sample_weights: True
      reg_alpha: 0
      reg_lambda: 1
    outputs:
      data_out:
        mode: rw_mount
    sampling_algorithm: bayesian
    trial: ../components/component_train_extraparam.yaml
    search_space:
      learning_rate:
        type: choice
        values: [0.05, 0.1, 0.15]
      max_depth:
        type: choice
        values: [5, 7, 10, 15, 20]
      n_estimators:
        type: choice
        values: [70, 100, 120, 150]
      max_delta_step:
        type: uniform
        min_value: 0.0
        max_value: 3.0
    objective:
      goal: minimize
      primary_metric: probability_difference
    limits:
      max_total_trials: 50
      max_concurrent_trials: 4
      timeout: 14400
```
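As a local sanity check, independent of the service, the input names used by the sweep step can be compared with the inputs declared by the trial component (listed under "Component" in this report). The sketch below is only an illustration: the name sets are copied by hand from the YAML in this report, and `check_sweep_inputs` is a hypothetical helper, not part of the Azure ML CLI or SDK.

```python
# Names copied by hand from the YAML in this report (assumption: no typos in copying).
search_space = {"learning_rate", "max_depth", "n_estimators", "max_delta_step"}
fixed_inputs = {
    "data", "start_new_run", "register_model", "gamma",
    "sample_weights", "reg_alpha", "reg_lambda",
}
component_inputs = {
    "data", "start_new_run", "register_model", "learning_rate",
    "n_estimators", "max_depth", "max_delta_step", "sample_weights",
    "gamma", "reg_alpha", "reg_lambda",
}


def check_sweep_inputs(search_space, fixed_inputs, component_inputs):
    """Hypothetical helper: report naming mismatches between sweep step and component."""
    return {
        # Names that are both swept and given a fixed value.
        "overlap": sorted(search_space & fixed_inputs),
        # Names used by the sweep step that the component does not declare.
        "unknown": sorted((search_space | fixed_inputs) - component_inputs),
        # Component inputs that the sweep step neither sweeps nor sets.
        "unset": sorted(component_inputs - search_space - fixed_inputs),
    }


report = check_sweep_inputs(search_space, fixed_inputs, component_inputs)
print(report)
```

For the names above all three lists come back empty, so a simple naming mismatch between the sweep step and the component does not appear to explain the step staying in "not started".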

Component:
```
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: train_demandmodel
display_name: Training Demand Model
type: command
inputs:
  data:
    type: uri_folder
  start_new_run:
    type: string
    default: True
  register_model:
    type: string
    default: False
  learning_rate:
    type: number
    default: 0.1
  n_estimators:
    type: integer
    default: 130
  max_depth:
    type: integer
    default: 10
  max_delta_step:
    type: number
    default: 0
  sample_weights:
    type: string
    default: True
  gamma:
    type: number
    default: 0
  reg_alpha:
    type: number
    default: 0
  reg_lambda:
    type: number
    default: 1
outputs:
  data_out:
    type: uri_folder
code: ../
environment: azureml:optimiser@latest
is_deterministic: false
command: >-
  python aml_train.py  
    --data ${{inputs.data}}  
    --data_out ${{outputs.data_out}}
    --start_new_run ${{inputs.start_new_run}}
    --register_model ${{inputs.register_model}}
    --learning_rate ${{inputs.learning_rate}}
    --n_estimators ${{inputs.n_estimators}}
    --max_depth ${{inputs.max_depth}}
    --max_delta_step ${{inputs.max_delta_step}}
    --sample_weights ${{inputs.sample_weights}}
    --gamma ${{inputs.gamma}}
    --reg_alpha ${{inputs.reg_alpha}}
    --reg_lambda ${{inputs.reg_lambda}}
```
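The component declares `start_new_run`, `register_model` and `sample_weights` as `string` inputs, so they reach the script as the text `True` or `False` on the command line. `aml_train.py` itself is not included in this report; the following is a minimal sketch, under that assumption, of how such a script could parse these flags with `argparse`. The helper `str_to_bool` and the reduced set of arguments are hypothetical and only serve as illustration.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Hypothetical helper: turn 'True'/'False' style text into a bool."""
    normalized = value.strip().lower()
    if normalized in {"true", "1", "yes"}:
        return True
    if normalized in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean string, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    # Only a subset of the component inputs is shown; defaults mirror the component YAML.
    parser = argparse.ArgumentParser()
    parser.add_argument("--start_new_run", type=str_to_bool, default=True)
    parser.add_argument("--register_model", type=str_to_bool, default=False)
    parser.add_argument("--sample_weights", type=str_to_bool, default=True)
    parser.add_argument("--learning_rate", type=float, default=0.1)
    parser.add_argument("--max_depth", type=int, default=10)
    return parser


# Example invocation with values as one sweep trial might pass them.
args = build_parser().parse_args(
    ["--start_new_run", "True", "--register_model", "False", "--max_depth", "7"]
)
print(args)
```

Parsing the flags explicitly avoids the common pitfall that `bool("False")` evaluates to `True` in Python.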

