diff --git a/.github/workflows/manual-trigger-job.yml b/.github/workflows/manual-trigger-job.yml index 32dc8fe..6f279b7 100644 --- a/.github/workflows/manual-trigger-job.yml +++ b/.github/workflows/manual-trigger-job.yml @@ -15,7 +15,4 @@ jobs: uses: azure/login@v2 with: creds: ${{secrets.AZURE_CREDENTIALS}} - - name: Run Azure Machine Learning training job - run: az ml job create -f src/job.yml --stream --resource-group ${{vars.AZURE_RESOURCE_GROUP}} --workspace-name ${{vars.AZURE_WORKSPACE_NAME}} - diff --git a/.github/workflows/train-dev.yml b/.github/workflows/train-dev.yml index a2d02e4..2cc5a3f 100644 --- a/.github/workflows/train-dev.yml +++ b/.github/workflows/train-dev.yml @@ -1,13 +1,7 @@ -name: Train model in dev (PR) +name: Train model in dev on: workflow_dispatch: - pull_request: - branches: - - main - paths: - - 'src/train-model-parameters.py' - - 'src/job.yml' permissions: contents: read diff --git a/docs/05-plan-and-prepare.md b/docs/05-plan-and-prepare.md index 8ac1d9a..1ec2eff 100644 --- a/docs/05-plan-and-prepare.md +++ b/docs/05-plan-and-prepare.md @@ -32,34 +32,34 @@ You can manually create necessary resources and assets to work with Azure Machin 1. Check that the correct subscription is specified and that **No storage account required** is selected. Select **Apply**. 1. In the terminal, enter the following commands to clone this repo: - ```azurecli - rm -r mslearn-mlops -f - git clone https://github.com/MicrosoftLearning/mslearn-mlops.git mslearn-mlops - ``` + ```azurecli + rm -r mslearn-mlops -f + git clone https://github.com/MicrosoftLearning/mslearn-mlops.git mslearn-mlops + ``` - > Use `SHIFT + INSERT` to paste your copied code into the Cloud Shell. + > Use `SHIFT + INSERT` to paste your copied code into the Cloud Shell. 1. 
After the repo has been cloned, enter the following commands to change to the `infra` folder and open the setup script: - ```azurecli - cd mslearn-mlops/infra - code setup.sh - ``` + ```azurecli + cd mslearn-mlops/infra + code setup.sh + ``` - > [!NOTE] - > If the `code` command is not available, you are in the new Cloud Shell experience. Switch to Classic Cloud Shell by selecting **Switch to Classic Cloud Shell** in the toolbar and selecting **Confirm**. Then run the commands again. + > [!NOTE] + > If the `code` command is not available, you are in the new Cloud Shell experience. Switch to Classic Cloud Shell by selecting **Switch to Classic Cloud Shell** in the toolbar and selecting **Confirm**. Then run the commands again. 1. Review the script and identify the resources that are created for your current **development** environment: - - A resource group with a randomized suffix, for example `rg-ai300-l...`. - - An Azure Machine Learning workspace, for example `mlw-ai300-l...`. - - A compute instance for interactive work. - - A compute cluster for training jobs. - - Data assets for the diabetes training data in the `data/diabetes-data` folder. + - A resource group with a randomized suffix, for example `rg-ai300-l...`. + - An Azure Machine Learning workspace, for example `mlw-ai300-l...`. + - A compute instance for interactive work. + - A compute cluster for training jobs. + - Data assets for the diabetes training data in the `data/diabetes-data` folder. 1. Note how the script: - - Generates a random suffix to avoid name collisions. - - Registers the **Microsoft.MachineLearningServices** resource provider. - - Sets default values for the resource group and workspace so subsequent `az ml` commands use them automatically. + - Generates a random suffix to avoid name collisions. + - Registers the **Microsoft.MachineLearningServices** resource provider. + - Sets default values for the resource group and workspace so subsequent `az ml` commands use them automatically. 
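For reference, the provider registration called out above is a single Azure CLI command. This is a sketch of the call; the script may wrap it with its own checks or output:

   ```azurecli
   az provider register --namespace Microsoft.MachineLearningServices
   ```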
By understanding what this script does for development, you're ready to think about what you would change or add for production. @@ -96,75 +96,93 @@ Next, you map your target architecture to Azure CLI commands. Instead of running 1. In the Cloud Shell editor, create a new file based on the existing script so you can experiment safely: - ```bash - cp setup.sh setup-prod-design.sh - code setup-prod-design.sh - ``` + ```bash + cp setup.sh setup-prod-design.sh + code setup-prod-design.sh + ``` 1. At the top of the new file, add variables for both environments and the shared registry. For example: - ```bash - # Existing random suffix - guid=$(cat /proc/sys/kernel/random/uuid) - suffix=${guid//[-]/} - suffix=${suffix:0:18} + ```bash + # Existing random suffix + guid=$(cat /proc/sys/kernel/random/uuid) + suffix=${guid//[-]/} + suffix=${suffix:0:18} - # Dev environment - DEV_RESOURCE_GROUP="rg-ai300-dev-${suffix}" - DEV_WORKSPACE_NAME="mlw-ai300-dev-${suffix}" + # Dev environment + DEV_RESOURCE_GROUP="rg-ai300-dev-${suffix}" + DEV_WORKSPACE_NAME="mlw-ai300-dev-${suffix}" - # Prod environment - PROD_RESOURCE_GROUP="rg-ai300-prod-${suffix}" - PROD_WORKSPACE_NAME="mlw-ai300-prod-${suffix}" + # Prod environment + PROD_RESOURCE_GROUP="rg-ai300-prod-${suffix}" + PROD_WORKSPACE_NAME="mlw-ai300-prod-${suffix}" - # Shared registry (one per subscription/region) - REGISTRY_RESOURCE_GROUP="rg-ai300-reg-${suffix}" - REGISTRY_NAME="mlr-ai300-shared-${suffix}" - ``` + # Shared registry (one per subscription/region) + REGISTRY_RESOURCE_GROUP="rg-ai300-reg-${suffix}" + REGISTRY_NAME="mlr-ai300-shared-${suffix}" + ``` + +1. In the `infra` folder, open `registry.yml` and review the values that define the shared registry. The Azure CLI reads this YAML file literally, so the Bash script needs to inject the dynamic registry name and primary region into the file before running the create command. 
In this lab, use placeholders in `registry.yml` like this: + + ```yml + name: REGISTRY_NAME_PLACEHOLDER + tags: + description: Shared registry for approved machine learning assets across workspaces + location: PRIMARY_REGION_PLACEHOLDER + replication_locations: + - location: PRIMARY_REGION_PLACEHOLDER + ``` 1. Plan the commands that would create the **shared registry** in its own resource group. For example: - ```azurecli - # Create a resource group for the shared registry - az group create --name $REGISTRY_RESOURCE_GROUP --location $RANDOM_REGION + ```azurecli + # Create a resource group for the shared registry + az group create --name $REGISTRY_RESOURCE_GROUP --location $RANDOM_REGION + + # Render registry.yml with the dynamic values from the script + sed \ + -e "s|REGISTRY_NAME_PLACEHOLDER|$REGISTRY_NAME|g" \ + -e "s|PRIMARY_REGION_PLACEHOLDER|$RANDOM_REGION|g" \ + registry.yml > registry.generated.yml + + # Create an Azure Machine Learning registry from the rendered YAML file + az ml registry create \ + --file registry.generated.yml \ + --resource-group $REGISTRY_RESOURCE_GROUP + ``` - # Create an Azure Machine Learning registry - az ml registry create \ - --name $REGISTRY_NAME \ - --resource-group $REGISTRY_RESOURCE_GROUP \ - --location $RANDOM_REGION - ``` + The primary registry region appears twice in the YAML definition: once in `location` and again in `replication_locations`. Rendering the YAML from the Bash variables keeps those values consistent. 1. Plan the commands that would create the **production** resource group and workspace. 
They follow the same pattern as the existing dev workspace, but use the prod names: - ```azurecli - # Create the production resource group - az group create --name $PROD_RESOURCE_GROUP --location $RANDOM_REGION + ```azurecli + # Create the production resource group + az group create --name $PROD_RESOURCE_GROUP --location $RANDOM_REGION - # Create the production Azure Machine Learning workspace - az ml workspace create \ - --name $PROD_WORKSPACE_NAME \ - --resource-group $PROD_RESOURCE_GROUP \ - --location $RANDOM_REGION - ``` + # Create the production Azure Machine Learning workspace + az ml workspace create \ + --name $PROD_WORKSPACE_NAME \ + --resource-group $PROD_RESOURCE_GROUP \ + --location $RANDOM_REGION + ``` 1. Finally, plan the data assets that keep dev and prod data separated. Use the **dev** folder for experimentation and the **production** folder for production training: - ```azurecli - # In the dev workspace: data asset that points to experimentation data - az configure --defaults group=$DEV_RESOURCE_GROUP workspace=$DEV_WORKSPACE_NAME - az ml data create \ - --type uri_folder \ - --name diabetes-dev-folder \ - --path ../data/diabetes-data - - # In the prod workspace: data asset that points to production data - az configure --defaults group=$PROD_RESOURCE_GROUP workspace=$PROD_WORKSPACE_NAME - az ml data create \ - --type uri_folder \ - --name diabetes-prod-folder \ - --path ../production/data - ``` + ```azurecli + # In the dev workspace: data asset that points to experimentation data + az configure --defaults group=$DEV_RESOURCE_GROUP workspace=$DEV_WORKSPACE_NAME + az ml data create \ + --type uri_folder \ + --name diabetes-dev-folder \ + --path ../data/diabetes-data + + # In the prod workspace: data asset that points to production data + az configure --defaults group=$PROD_RESOURCE_GROUP workspace=$PROD_WORKSPACE_NAME + az ml data create \ + --type uri_folder \ + --name diabetes-prod-folder \ + --path ../production/data + ``` > [!IMPORTANT] > For 
this lab, you **don't need** to run the new commands that create extra resource groups and workspaces. Focus on understanding how you would structure the script so that dev and prod resources are clearly separated and production data stays out of the development environment. If you do want to run the script, follow the optional steps in the next section. @@ -177,29 +195,29 @@ If you want to see your design in action, you can validate your script against a 1. In the Cloud Shell terminal, make sure you're in the `infra` folder: - ```bash - cd mslearn-mlops/infra - ``` + ```bash + cd mslearn-mlops/infra + ``` 1. The repo includes a reference script `infra/setup-mlops-envs.sh` that shows what the complete script should look like. Compare it with your own `setup-prod-design.sh` to check your work: - ```bash - diff setup-prod-design.sh setup-mlops-envs.sh - ``` + ```bash + diff setup-prod-design.sh setup-mlops-envs.sh + ``` - Review any differences and update your script if needed. + Review any differences and update your script if needed. 1. Once you're satisfied with your script, make it executable and run it: - ```bash - chmod +x setup-prod-design.sh - ./setup-prod-design.sh - ``` + ```bash + chmod +x setup-prod-design.sh + ./setup-prod-design.sh + ``` 1. When the script completes, verify the resources in the Azure portal: - - New resource groups for dev, prod, and the shared registry. - - Separate workspaces for dev and prod. - - Data assets `diabetes-dev-folder` and `diabetes-prod-folder` in the respective workspaces. + - New resource groups for dev, prod, and the shared registry. + - Separate workspaces for dev and prod. + - Data assets `diabetes-dev-folder` and `diabetes-prod-folder` in the respective workspaces. 1. When you're done exploring, be sure to delete any extra resource groups you created so you don't continue to incur charges. @@ -209,29 +227,29 @@ In a real MLOps project, you want a single automation entry point that can provi 1. 
Decide how you would pass the **target environment** into the script. For example, you could accept a parameter such as `dev` or `prod`: - ```bash - ENVIRONMENT=${1:-dev} - ``` + ```bash + ENVIRONMENT=${1:-dev} + ``` 1. Based on the environment, plan how you would set the resource group and workspace variables. For example: - ```bash - if [ "$ENVIRONMENT" = "prod" ]; then - RESOURCE_GROUP=$PROD_RESOURCE_GROUP - WORKSPACE_NAME=$PROD_WORKSPACE_NAME - else - RESOURCE_GROUP=$DEV_RESOURCE_GROUP - WORKSPACE_NAME=$DEV_WORKSPACE_NAME - fi - ``` + ```bash + if [ "$ENVIRONMENT" = "prod" ]; then + RESOURCE_GROUP=$PROD_RESOURCE_GROUP + WORKSPACE_NAME=$PROD_WORKSPACE_NAME + else + RESOURCE_GROUP=$DEV_RESOURCE_GROUP + WORKSPACE_NAME=$DEV_WORKSPACE_NAME + fi + ``` 1. Think through which resources should be **shared** and which should be **isolated**: - - The registry is shared between dev and prod, so you'd create it once and reuse it. - - Workspaces, compute, and data assets are environment-specific so that you can apply different security and access controls. + - The registry is shared between dev and prod, so you'd create it once and reuse it. + - Workspaces, compute, and data assets are environment-specific so that you can apply different security and access controls. 1. Consider how this script would fit into your broader MLOps workflows: - - In **GitHub Actions**, you could call the script with `dev` when validating pull requests and `prod` when deploying approved models. - - In **local development**, data scientists could call the script with `dev` to recreate the experimentation environment from scratch. + - In **GitHub Actions**, you could call the script with `dev` when validating pull requests and `prod` when deploying approved models. + - In **local development**, data scientists could call the script with `dev` to recreate the experimentation environment from scratch. 
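The GitHub Actions idea above can be sketched as a manually dispatched workflow that forwards an environment choice to the script. This is a hypothetical excerpt, not a file in the repo; the input name and script path are assumptions:

   ```yml
   on:
     workflow_dispatch:
       inputs:
         environment:
           type: choice
           options:
             - dev
             - prod
           default: dev

   jobs:
     provision:
       runs-on: ubuntu-latest
       steps:
         - uses: actions/checkout@v4
         - uses: azure/login@v2
           with:
             creds: ${{secrets.AZURE_CREDENTIALS}}
         - name: Provision target environment
           run: bash infra/setup-prod-design.sh ${{ inputs.environment }}
   ```

   Because the script defaults to `dev`, running it with no argument keeps the existing single-environment behavior.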
You now have a clear plan for how to evolve the existing script into a more flexible provisioning tool without changing how earlier labs work. diff --git a/docs/06-automate-model-training.md b/docs/06-automate-model-training.md index fc331e8..cb9b7e9 100644 --- a/docs/06-automate-model-training.md +++ b/docs/06-automate-model-training.md @@ -34,14 +34,14 @@ First, you create the Azure Machine Learning workspace and compute resources you 1. Make sure the correct subscription is selected and that **No storage account required** is selected. Then select **Apply**. 1. In the Cloud Shell terminal, clone the original lab repo and run the setup script: - ```azurecli - rm -r mslearn-mlops -f - git clone https://github.com/MicrosoftLearning/mslearn-mlops.git mslearn-mlops - cd mslearn-mlops/infra - ./setup.sh - ``` + ```azurecli + rm -r mslearn-mlops -f + git clone https://github.com/MicrosoftLearning/mslearn-mlops.git mslearn-mlops + cd mslearn-mlops/infra + ./setup.sh + ``` - > Ignore any messages that say that extensions couldn't be installed. + > Ignore any messages that say that extensions couldn't be installed. 1. Wait for the script to finish. It creates a resource group, an Azure Machine Learning workspace, and compute resources. 1. In the Azure portal, go to **Resource groups** and open the `rg-ai300-...` resource group that was created. @@ -71,11 +71,11 @@ To let GitHub Actions authenticate to Azure Machine Learning, you use a service 1. Make sure the correct subscription is selected for your Azure Machine Learning workspace. 1. In Cloud Shell, create a service principal that has **Contributor** access to the resource group that contains your Azure Machine Learning workspace. Replace ``, ``, and `` with your own values before you run the command. 
Use a descriptive name such as `sp-mslearn-mlops-github`: - ```azurecli - az ad sp create-for-rbac --name "" --role contributor \ - --scopes /subscriptions//resourceGroups/ \ - --sdk-auth - ``` + ```azurecli + az ad sp create-for-rbac --name "" --role contributor \ + --scopes /subscriptions//resourceGroups/ \ + --sdk-auth + ``` 1. Copy the full JSON output of the command to a safe location. You use the values in the next steps and in later challenges. 1. In the GitHub repository you created from the template, navigate to **Settings** > **Secrets and variables** > **Actions**. @@ -112,6 +112,14 @@ Now that you understand the network options, you are ready to automate a trainin In this section, you connect your GitHub workflow to Azure Machine Learning and run a command job to train a model. The workflow uses the `AZURE_CREDENTIALS` secret you created earlier. 1. Clone your `mslearn-mlops` repository that you created from the template to a development environment where you can edit files and push changes back to GitHub. +1. In the cloned repository, open `src/job.yml` and replace the placeholder values for the `training_data` input so the command job uses the single file data asset created by the setup script: + + ```yml + inputs: + training_data: + type: uri_file + path: azureml:diabetes-data@latest + ``` 1. In the cloned repository, locate the `.github/workflows/manual-trigger-job.yml` workflow file. 1. Open `manual-trigger-job.yml` and review the existing steps. The workflow should: - Check out the repository code. @@ -119,10 +127,10 @@ In this section, you connect your GitHub workflow to Azure Machine Learning and - Use the `AZURE_CREDENTIALS` secret to sign in to Azure via `azure/login@v2`. 1. At the end of the workflow, add a new step that submits the Azure Machine Learning job defined in `src/job.yml`. 
The command requires explicit `--resource-group` and `--workspace-name` flags, supplied from the GitHub Actions variables you created: - ```yml - - name: Run Azure Machine Learning training job - run: az ml job create -f src/job.yml --stream --resource-group ${{vars.AZURE_RESOURCE_GROUP}} --workspace-name ${{vars.AZURE_WORKSPACE_NAME}} - ``` + ```yml + - name: Run Azure Machine Learning training job + run: az ml job create -f src/job.yml --stream --resource-group ${{vars.AZURE_RESOURCE_GROUP}} --workspace-name ${{vars.AZURE_WORKSPACE_NAME}} + ``` 1. Save your changes, commit them to your local repository, and push the changes to the **main** branch of your fork. 1. In GitHub, go to the **Actions** tab for your repository. @@ -139,13 +147,13 @@ Running workflows manually is useful for initial testing, but in a team environm 1. In your GitHub repository, open the `.github/workflows/manual-trigger-job.yml` workflow file. 1. Update the `on` section so that the workflow can run both manually and when a pull request targets the **main** branch. For example: - ```yml - on: - workflow_dispatch: - pull_request: - branches: - - main - ``` + ```yml + on: + workflow_dispatch: + pull_request: + branches: + - main + ``` 1. Commit the updated workflow file and push it to the **main** branch of your repository. 1. In GitHub, go to **Settings** > **Branches** and select **Add branch protection rule**. @@ -155,18 +163,18 @@ Running workflows manually is useful for initial testing, but in a team environm 1. Save the branch protection rule. 1. In your local clone of the repository, create a new branch for a feature change. For example: - ```bash - git checkout -b feature/update-parameters - ``` + ```bash + git checkout -b feature/update-parameters + ``` 1. Make a small, safe change to the training configuration. For example, adjust a hyperparameter value in `src/train-model-parameters.py` or in `src/job.yml`. 1. 
Commit your change to the feature branch and push the branch to GitHub: - ```bash - git add . - git commit -m "Adjust training parameters" - git push --set-upstream origin feature/update-parameters - ``` + ```bash + git add . + git commit -m "Adjust training parameters" + git push --set-upstream origin feature/update-parameters + ``` 1. In GitHub, create a pull request from your feature branch into **main**. 1. On the pull request page, observe that the workflow defined in `manual-trigger-job.yml` runs automatically because of the `pull_request` trigger you added. diff --git a/docs/07-deploy-monitor.md b/docs/07-deploy-monitor.md index d914650..c3cddfc 100644 --- a/docs/07-deploy-monitor.md +++ b/docs/07-deploy-monitor.md @@ -127,6 +127,14 @@ Now you use a GitHub Actions workflow in your template-based repository that tra 1. In your local clone, open `src/train-model-parameters.py` and review how it: - Reads training data from a file or folder path. - Trains a logistic regression model and logs metrics such as **Accuracy** and **AUC**. +1. Open `src/job.yml` and replace the placeholder values for the `training_data` input so the command job uses the dev folder data asset by default: + + ```yml + inputs: + training_data: + type: uri_folder + path: azureml:diabetes-dev-folder@latest + ``` 1. Open `src/job.yml` and review how the Azure Machine Learning command job: - Runs `train-model-parameters.py` on the `aml-cluster` compute. - Uses a `training_data` input that points to the `diabetes-dev-folder` data asset by default. @@ -136,13 +144,25 @@ Now you use a GitHub Actions workflow in your template-based repository that tra - Detects the resource group and workspace that the `infra/setup.sh` script created. - Submits the Azure Machine Learning job defined in `src/job.yml`, overriding the `training_data` input to use the `diabetes-dev-folder` data asset. 
- Streams the job logs, parses the **Accuracy** and **AUC** values from the output, and posts them as a comment on the pull request. +1. In your local clone, open `.github/workflows/train-dev.yml` and add a `pull_request` trigger so the workflow only runs automatically when a pull request changes the training code: + + ```yml + on: + workflow_dispatch: + pull_request: + branches: + - main + paths: + - 'src/train-model-parameters.py' + - 'src/job.yml' + ``` 1. In your local clone, create a new feature branch and make a small, safe hyperparameter change. For example, adjust the default value of `--reg_rate` in `src/train-model-parameters.py`. 1. Commit your change and push the new branch to GitHub. 1. In GitHub, create a pull request from your feature branch into `main`. -1. On the pull request page, observe that the **Train model in dev (PR)** workflow runs automatically because of the `pull_request` trigger, and wait for it to complete. +1. On the pull request page, observe that the **Train model in dev** workflow runs automatically because of the `pull_request` trigger you added, and wait for it to complete. 1. When the workflow run has finished, review the comments on the pull request. You should see a comment from the workflow that includes the dev **Accuracy** and **AUC** values from the training job. -The dev workflow now validates training changes against the dev data asset and surfaces key evaluation metrics directly in the pull request so that reviewers can make an informed decision. +The dev workflow stays manual by default and only starts running automatically for pull requests after you add the trigger. That keeps unnecessary runs out of the live repo while still letting you enable PR validation when you're ready. ## Retrain the model on prod data from a pull request comment @@ -219,7 +239,7 @@ In a real system, drift or performance degradation would trigger retraining. In ``` 1. 
In GitHub, create a new pull request from your `feature/drift-retrain` branch into `main`. -1. Observe that the **Train model in dev (PR)** workflow runs automatically for the new pull request. When it completes, review the comment that shows the updated **dev** Accuracy and AUC. +1. Observe that the **Train model in dev** workflow runs automatically for the new pull request because you added the `pull_request` trigger earlier. When it completes, review the comment that shows the updated **dev** Accuracy and AUC. 1. If the dev metrics look acceptable, add a comment `/train-prod` on the pull request to trigger the **Train model in prod (PR comment)** workflow. When it completes, review the comment that shows the updated **prod** Accuracy and AUC. 1. If the prod metrics also meet your expectations, add a comment `/deploy-prod` on the pull request to trigger the **Deploy model to online endpoint (PR comment)** workflow. Wait for it to complete. 1. Finally, in Azure Machine Learning studio, go to **Endpoints** > **Real-time endpoints**, select the `diabetes-endpoint`, and use the **Test** tab to confirm that the endpoint still returns predictions after your retraining and deployment. 
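As an alternative to the studio **Test** tab, you could confirm the endpoint from the Cloud Shell CLI. This is a sketch: `sample-request.json` is a hypothetical file you'd create to match the model's input schema, and the resource group and workspace variable names are assumptions for your prod environment:

   ```azurecli
   az ml online-endpoint invoke \
     --name diabetes-endpoint \
     --request-file sample-request.json \
     --resource-group $PROD_RESOURCE_GROUP \
     --workspace-name $PROD_WORKSPACE_NAME
   ```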
diff --git a/infra/registry.yml b/infra/registry.yml new file mode 100644 index 0000000..89d63f5 --- /dev/null +++ b/infra/registry.yml @@ -0,0 +1,6 @@ +name: REGISTRY_NAME_PLACEHOLDER +tags: + description: Shared registry for approved machine learning assets across workspaces +location: PRIMARY_REGION_PLACEHOLDER +replication_locations: + - location: PRIMARY_REGION_PLACEHOLDER \ No newline at end of file diff --git a/infra/setup-mlops-envs.sh b/infra/setup-mlops-envs.sh index a30ec81..c588ea2 100644 --- a/infra/setup-mlops-envs.sh +++ b/infra/setup-mlops-envs.sh @@ -82,11 +82,17 @@ az ml data create \ echo "Creating registry resource group: $REGISTRY_RESOURCE_GROUP" az group create --name $REGISTRY_RESOURCE_GROUP --location $RANDOM_REGION +echo "Rendering registry.yml with dynamic values..." +sed \ + -e "s|REGISTRY_NAME_PLACEHOLDER|$REGISTRY_NAME|g" \ + -e "s|PRIMARY_REGION_PLACEHOLDER|$RANDOM_REGION|g" \ + registry.yml > registry.generated.yml + echo "Creating shared Azure Machine Learning registry: $REGISTRY_NAME" az ml registry create \ - --name $REGISTRY_NAME \ - --resource-group $REGISTRY_RESOURCE_GROUP \ - --location $RANDOM_REGION + --file registry.generated.yml \ + --resource-group $REGISTRY_RESOURCE_GROUP + # --------------------------------------------------------------------------- # Summary diff --git a/src/job.yml b/src/job.yml index 3310aec..a5cf0b7 100644 --- a/src/job.yml +++ b/src/job.yml @@ -6,8 +6,8 @@ command: >- --reg_rate ${{inputs.reg_rate}} inputs: training_data: - type: uri_folder - path: azureml:diabetes-dev-folder@latest + type: + path: reg_rate: 0.01 environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest compute: azureml:aml-cluster
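The placeholder-rendering technique that `setup-mlops-envs.sh` applies to `registry.yml` can be tried standalone in any Bash shell. The registry name and region below are illustrative stand-ins for the values the script generates:

```shell
# Recreate the template locally (same placeholders as infra/registry.yml)
cat > registry.yml <<'EOF'
name: REGISTRY_NAME_PLACEHOLDER
location: PRIMARY_REGION_PLACEHOLDER
replication_locations:
  - location: PRIMARY_REGION_PLACEHOLDER
EOF

# Illustrative values; the real script derives these from the random suffix
REGISTRY_NAME="mlr-ai300-shared-abc123"
RANDOM_REGION="eastus2"

# Substitute both placeholders in one pass; | delimiters avoid escaping issues
sed \
  -e "s|REGISTRY_NAME_PLACEHOLDER|$REGISTRY_NAME|g" \
  -e "s|PRIMARY_REGION_PLACEHOLDER|$RANDOM_REGION|g" \
  registry.yml > registry.generated.yml

# Show the rendered registry definition
cat registry.generated.yml
```

In the rendered file, the region appears twice, once for `location` and once under `replication_locations`, which is exactly the consistency the `sed` pass guarantees.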