
fix(e2e): drop hardcoded D2s_v3 system pool from MA35D scenarios#8375

Open
ganeshkumarashok wants to merge 1 commit into main from
e2e-fix-ma35d-system-pool-sku

Conversation

@ganeshkumarashok
Contributor

Summary

  • Remove K8sSystemPoolSKU: "Standard_D2s_v3" from Test_AzureLinuxV3_MA35D and Test_AzureLinuxV3_MA35D_Scriptless
  • The system pool falls back to config.DefaultVMSKU (Standard_D2ds_v5), the same SKU every other scenario uses for the AKS system node pool
  • The GPU SKU under test (Standard_NM16ads_MA35D) is unchanged

Background

Both MA35D scenarios pin Location: "eastus" (the only region with MA35D capacity) and override the system node pool to the older v3 SKU. The override predates the v5 default and is no longer necessary.

The pinned SKU is currently subscription-restricted across all eastus availability zones for the AB e2e subscription (8ecadfc9-...), so AKS cluster creation fails with a 400 before the GPU node under test can even be provisioned. Observed in PR #8228, GPU E2E build 161380177:

RESPONSE 400: 400 Bad Request
ERROR CODE: BadRequest
{
  "code": "BadRequest",
  "message": "The VM size of 'Standard_D2s_v3' is currently not available in your subscription in location 'eastus'. All availability zones are restricted for this SKU. Please try another VM size or deploy to a different location."
}

This caused 5 reported failures in that run (the parent plus leaf subtests for both MA35D scenarios), none of which had anything to do with the PR being tested.

Why this fix

  • Standard_D2ds_v5 is the established default for all e2e system pools (config/config.go: DefaultVMSKU)
  • If D2ds_v5 were also restricted in eastus we'd see it across many GPU scenarios, not just MA35D — i.e. failure mode would be loud and broad, not silently misattributed to MA35D
  • Smallest possible diff (2 lines per test)
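The fix relies on the harness defaulting the system pool SKU when a scenario leaves it unset. A minimal sketch of that fallback, with hypothetical names (ScenarioConfig, systemPoolSKU, defaultVMSKU are illustrative, not the actual agentbaker code):

```go
package main

import "fmt"

// defaultVMSKU mirrors config.DefaultVMSKU in this sketch.
const defaultVMSKU = "Standard_D2ds_v5"

// ScenarioConfig is a hypothetical stand-in for the e2e scenario config.
type ScenarioConfig struct {
	Location         string
	K8sSystemPoolSKU string // empty means "use the shared default"
}

// systemPoolSKU returns the explicit override if set, else the default.
func systemPoolSKU(c ScenarioConfig) string {
	if c.K8sSystemPoolSKU != "" {
		return c.K8sSystemPoolSKU
	}
	return defaultVMSKU
}

func main() {
	// Before this PR: the MA35D scenarios pinned the restricted v3 SKU.
	pinned := ScenarioConfig{Location: "eastus", K8sSystemPoolSKU: "Standard_D2s_v3"}
	// After this PR: no override, so the v5 default is used.
	fixed := ScenarioConfig{Location: "eastus"}

	fmt.Println(systemPoolSKU(pinned))
	fmt.Println(systemPoolSKU(fixed))
}
```

Dropping the two K8sSystemPoolSKU lines is therefore equivalent to adopting whatever config.DefaultVMSKU is at the time the tests run, which is why no replacement SKU needs to be hardcoded.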

Test plan

  • Agentbaker GPU E2E (the only pipeline that exercises MA35D)
  • Confirm cluster abe2e-kubenet-v4-* in eastus comes up cleanly with v5 system pool
  • Confirm Test_AzureLinuxV3_MA35D and _Scriptless reach the GPU validators

Both Test_AzureLinuxV3_MA35D and Test_AzureLinuxV3_MA35D_Scriptless
pinned the AKS system node pool to Standard_D2s_v3 in eastus. That SKU
is currently subscription-restricted across all eastus availability zones
("All availability zones are restricted for this SKU"), so cluster
creation fails with 400 BadRequest before the scenario can exercise
the MA35D GPU node under test.

Removing the override falls back to config.DefaultVMSKU
(Standard_D2ds_v5), which every other GPU/non-GPU scenario already uses
successfully for the system pool. The MA35D GPU SKU itself
(Standard_NM16ads_MA35D) is unchanged.

Observed in build 161380177 on PR #8228:
  RESPONSE 400: 400 Bad Request
  The VM size of 'Standard_D2s_v3' is currently not available in your
  subscription in location 'eastus'.
Contributor

Copilot AI left a comment


Pull request overview

Removes a hardcoded Kubernetes system node pool VM SKU from the two AzureLinuxV3 MA35D GPU e2e scenarios so they rely on the standard e2e default (config.Config.DefaultVMSKU), avoiding Standard_D2s_v3 availability/subscription restrictions in eastus while keeping the MA35D GPU SKU under test unchanged.

Changes:

  • Dropped K8sSystemPoolSKU: "Standard_D2s_v3" from Test_AzureLinuxV3_MA35D.
  • Dropped K8sSystemPoolSKU: "Standard_D2s_v3" from Test_AzureLinuxV3_MA35D_Scriptless.
  • Left Location: "eastus" in place for MA35D capacity, allowing system pool SKU to fall back to config.Config.DefaultVMSKU.
