fix(e2e): wait on RegionalReplicationStatus before declaring SIG image usable by ganeshkumarashok · Pull Request #8374 · Azure/AgentBaker

ganeshkumarashok · 2026-04-22T16:57:28Z

Summary

Fix GalleryImageNotFound cascading failures in AgentBaker E2E. ensureReplication now waits on per-region RegionalReplicationStatus.State == Completed instead of trusting the parent provisioning LRO. The vmss.go retry loop classifies the 404 against actual replication state — failing fast on terminal Failed, blocking on Replicating, retrying once on the genuinely transient fabric-consistency case.

Background

On PR #8228's GPU E2E run (build 161141895), 22 of 368 tests failed with the same shape:

failed to create VMSS after 10 retries:
RESPONSE 404: GalleryImageNotFound
"The gallery image .../2404gen2containerd/versions/1.1776726423.26642
 is not available in westus2 region."

The pipeline log shows ensureReplication had already emitted Replication took: 12m2.7s for that region — i.e. it returned success — and then VMSS create kept 404'ing for ~10 more minutes.

Root cause

replicatedToCurrentRegion checks PublishingProfile.TargetRegions, which records intent to replicate, not the actual regional state. The Azure Compute Gallery Get API returns provisioningState=Succeeded once the parent LRO finishes, but RegionalReplicationStatus.State for the new region can remain Replicating for several more minutes before it transitions to Completed. During that window, the imageReference is unresolvable from Microsoft.Compute/virtualMachineScaleSets create — yielding the 404.

The existing retry loop in vmss.go (10 attempts × 5s = ~50 s) was orders of magnitude shorter than that window, and it surfaced the failure as failed to create VMSS after 10 retries — visually indistinguishable from a real test failure. PR authors had no way to tell that the change-under-test was never actually exercised.

Why fix here, not in the pipeline YAML

Considered and rejected:

Per-pipeline YAML pre-flight gate — duplicates the framework's responsibility, and tests in one E2E run pull different images for different regions, so a static gate would either over-replicate or under-gate.
t.Skip on GalleryImageNotFound — silently masks real bugs. PR authors would think their change was validated when it wasn't.
Bumping the retry budget — doesn't distinguish Failed from Replicating, wastes CRP quota, and still produces a misleading error message.

The framework already owns "ensure image is usable in this region" (ensureReplication). The right fix is to make that primitive honor its contract.

What changed

`e2e/config/azure.go`

New waitForRegionalReplicationCompleted: polls the gallery image version with Expand=ReplicationStatusTypesReplicationStatus (15s interval, 20 min cap) until RegionalReplicationStatus.State == Completed. Treats Replicating/Unknown as transient. Fails fast with attached Details on terminal Failed.
ensureReplication now calls it on both paths:
- "Region is already a target": the LRO may have completed but the regional replica may still be Replicating (this is exactly what bit PR feat: skip MIG reboot for H100 and newer GPU architectures #8228).
- "Newly added target": the new replicate LRO can return Succeeded with the regional state still Replicating.
New public WaitForImageVersionReplicatedToRegion for the retry loop in vmss.go — fetches the live version and delegates to the same poller.

`e2e/vmss.go`

CreateVMSSWithRetry's classifier now distinguishes GalleryImageNotFound from AllocationFailed. On the 404 it consults the actual regional replication state via WaitForImageVersionReplicatedToRegion:
- Completed → fall through, retry once (true fabric-consistency race, sub-minute window).
- Replicating → block until Completed under the poller's 20-min cap, then retry.
- Failed → return immediately with the precise infra error wrapped, instead of retrying nine more times.

`e2e/config/azure_replication_test.go`

New table-driven unit tests for findRegionalReplicationStatus covering nil status, case/whitespace normalization both ways (ARM returns "West US 2", our location strings are "westus2"), nil-entry tolerance, and missing-region.

Behavior changes

Before: 22 noisy red failed to create VMSS after 10 retries failures every time a fresh VHD ran multi-region E2E. Test logs blamed the test code.
After: ensureReplication blocks until the regional replica is actually serving — usual case, no retry pressure on CRP at all. If something is genuinely wrong with replication, you get a clear regional replication of <id> to <region> failed: <Azure-provided details> instead of failed to create VMSS after 10 retries. If it's the rare fabric-eventual-consistency case (LRO+ReplicationStatus both Completed but compute fabric not yet propagated), the existing retry catches it without a long fixed sleep.

Test plan

go build ./... clean
go test ./config/ -run TestFindRegionalReplicationStatus -v -vet=off passes (vet-off only because of an unrelated pre-existing nit in config/config.go:182)
CI: full e2e + unit-test gates on this PR
Post-merge: rerun GPU E2E on PR feat: skip MIG reboot for H100 and newer GPU architectures #8228 and confirm the 22 GalleryImageNotFound failures don't recur and the MIG H100 change gets a real signal

…ions E2E was reporting "Replication took: 12m" via ensureReplication then immediately failing on VMSS create with `404 GalleryImageNotFound` for ~10 more minutes. Root cause: replicatedToCurrentRegion only checks PublishingProfile.TargetRegions (intent), and the parent LRO returns Succeeded before per-region replicas are actually serving traffic. The vmss.go retry loop (10 x 5s) was far too short to absorb the gap and surfaced the infra failure as `failed to create VMSS after 10 retries`, indistinguishable from a real test failure. Fix: - ensureReplication now polls Get with Expand=ReplicationStatus until RegionalReplicationStatus.State == Completed (20 min timeout, 15s interval). Fails fast on terminal Failed. - CreateVMSSWithRetry no longer burns fixed retries on 404 GalleryImageNotFound. It calls the replication poller, which either returns quickly when the region is Completed (fabric-eventual- consistency race -> retry once), blocks until Completed, or fails fast on Failed with the real cause attached. - Add a public WaitForImageVersionReplicatedToRegion entrypoint and unit tests for the region-name normalization in findRegionalReplicationStatus. This should eliminate the GalleryImageNotFound class of false E2E failures across Linux VHD, Windows VHD, GPU, and base AgentBaker E2E pipelines.

Copilot

Pull request overview

Improves AgentBaker E2E reliability by ensuring Azure Compute Gallery (SIG) image versions are actually replicated and serving in a target region before tests proceed, reducing noisy GalleryImageNotFound VMSS creation failures caused by eventual consistency.

Changes:

Update ensureReplication to wait on per-region RegionalReplicationStatus.State == Completed rather than relying on the parent LRO/target region intent.
Enhance VMSS create retry behavior to specially classify GalleryImageNotFound and consult the replication-state wait helper before retrying.
Add unit tests for region name normalization and status lookup logic used when interpreting replication status.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
e2e/config/azure.go	Adds replication-status polling helpers and integrates them into replication enforcement + VMSS retry support.
e2e/vmss.go	Extends retry classification to treat `GalleryImageNotFound` specially and consult replication state before retrying.
e2e/config/azure_replication_test.go	Adds table-driven tests for `findRegionalReplicationStatus` normalization and nil tolerance.

Copilot · 2026-04-22T17:00:56Z

+	if err != nil {
+		return fmt.Errorf("create image version client: %w", err)
+	}
+	resp, err := imgVersionClient.Get(ctx, image.Gallery.ResourceGroupName, image.Gallery.Name, image.Name, image.Version, nil)


The doc comment for WaitForImageVersionReplicatedToRegion says it "fetches the live image version with ReplicationStatus expanded", but the initial Get call uses nil options (no Expand). Either update the comment to reflect that the expand happens inside waitForRegionalReplicationCompleted, or pass GalleryImageVersionsClientGetOptions{Expand: ReplicationStatusTypesReplicationStatus} here for consistency.

Suggested change

resp, err := imgVersionClient.Get(ctx, image.Gallery.ResourceGroupName, image.Gallery.Name, image.Name, image.Version, nil)

resp, err := imgVersionClient.Get(ctx, image.Gallery.ResourceGroupName, image.Gallery.Name, image.Name, image.Version, &armcompute.GalleryImageVersionsClientGetOptions{

Expand: to.Ptr(armcompute.ReplicationStatusTypesReplicationStatus),

})

Copilot · 2026-04-22T17:00:56Z

+		if isGalleryImageMissing && s.VHD != nil && s.Runtime != nil && s.Runtime.Cluster != nil && s.Runtime.Cluster.Model != nil && s.Runtime.Cluster.Model.Location != nil {
+			location := *s.Runtime.Cluster.Model.Location
+			toolkit.Logf(ctx, "GalleryImageNotFound on VMSS create in %s; consulting regional replication status before retrying", location)
+			if waitErr := config.Azure.WaitForImageVersionReplicatedToRegion(ctx, s.VHD, location); waitErr != nil {
+				// Surface the precise infra cause instead of a generic "failed after N retries".
+				return vm, fmt.Errorf("VMSS create returned GalleryImageNotFound and regional replication is not usable: %w (original error: %v)", waitErr, err)
+			}
+			toolkit.Logf(ctx, "Regional replication confirmed Completed; retrying VMSS create immediately")
+			continue
+		}


On GalleryImageNotFound, the code waits for replication to be Completed and then immediately continues, which skips the normal backoff delay. If the compute fabric is still eventually consistent even after replication is Completed, this can spin in a tight loop (up to maxAttempts) and hammer the VMSS create API without sleeping. Consider limiting this path to a single immediate retry (track a boolean), or apply a minimal delay/backoff before continuing.

Copilot · 2026-04-22T17:00:57Z

+			got := findRegionalReplicationStatus(tt.status, tt.location)
+			if !tt.wantFound {
+				if got != nil {
+					t.Fatalf("expected nil, got region %q state %v", *got.Region, *got.State)


In the !tt.wantFound branch, the failure message dereferences *got.Region and *got.State without checking for nil. If findRegionalReplicationStatus ever returns an entry with a nil Region or State (which this test data already includes elsewhere), the test will panic instead of producing a helpful assertion failure. Consider guarding those pointers (or formatting the struct with %#v) in the error message.

Suggested change

t.Fatalf("expected nil, got region %q state %v", *got.Region, *got.State)

t.Fatalf("expected nil, got %#v", got)

Copilot · 2026-04-22T17:00:57Z

+		resp, err := imgVersionClient.Get(ctx, image.Gallery.ResourceGroupName, image.Gallery.Name, image.Name, *version.Name, getOpts)
+		if err != nil {
+			// Transient API errors should not abort the wait; the SDK already retries throttling.
+			toolkit.Logf(ctx, "transient error getting image version replication status (will retry): %v", err)
+			return false, nil
+		}


waitForRegionalReplicationCompleted treats any imgVersionClient.Get error as transient and continues polling (return false, nil). This can mask permanent failures (e.g., auth/permission issues, 404 due to wrong RG/name) and will only surface as a timeout, losing the underlying cause. Consider only swallowing clearly transient conditions (e.g., 429/5xx/timeouts) and returning the error for non-retryable 4xx so callers fail fast with the real Azure error.

Copilot AI review requested due to automatic review settings April 22, 2026 16:57

ganeshkumarashok requested review from AbelHu, Devinwong, SriHarsha001, awesomenix, calvin197, cameronmeissner, djsly, junjiezhang1997, lilypan26, mxj220, pdamianov-dev, phealy, r2k1, sulixu, surajssd, timmy-wright and zachary-bailey as code owners April 22, 2026 16:57

ganeshkumarashok temporarily deployed to test April 22, 2026 16:57 — with GitHub Actions Inactive

ganeshkumarashok mentioned this pull request Apr 22, 2026

fix(e2e): wait on RegionalReplicationStatus before declaring SIG image usable #8373

Closed

4 tasks

Copilot started reviewing on behalf of ganeshkumarashok April 22, 2026 16:58 View session

Copilot AI reviewed Apr 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(e2e): wait on RegionalReplicationStatus before declaring SIG image usable#8374

fix(e2e): wait on RegionalReplicationStatus before declaring SIG image usable#8374
ganeshkumarashok wants to merge 1 commit intomainfrom
e2e-fix-gallery-image-replication-wait

ganeshkumarashok commented Apr 22, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 22, 2026

Uh oh!

Copilot AI Apr 22, 2026

Uh oh!

Copilot AI Apr 22, 2026

Uh oh!

Copilot AI Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

-	resp, err := imgVersionClient.Get(ctx, image.Gallery.ResourceGroupName, image.Gallery.Name, image.Name, image.Version, nil)
+	resp, err := imgVersionClient.Get(ctx, image.Gallery.ResourceGroupName, image.Gallery.Name, image.Name, image.Version, &armcompute.GalleryImageVersionsClientGetOptions{
+		Expand: to.Ptr(armcompute.ReplicationStatusTypesReplicationStatus),
+	})

	t.Fatalf("expected nil, got region %q state %v", got.Region, got.State)
	t.Fatalf("expected nil, got %#v", got)

Conversation

ganeshkumarashok commented Apr 22, 2026

Summary

Background

Root cause

Why fix here, not in the pipeline YAML

What changed

e2e/config/azure.go

e2e/vmss.go

e2e/config/azure_replication_test.go

Behavior changes

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`e2e/config/azure.go`

`e2e/vmss.go`

`e2e/config/azure_replication_test.go`