fix(e2e): wait on RegionalReplicationStatus before declaring SIG image usable #8374
ganeshkumarashok wants to merge 1 commit into main
Conversation
…ions E2E was reporting "Replication took: 12m" via `ensureReplication`, then immediately failing on VMSS create with `404 GalleryImageNotFound` for ~10 more minutes. Root cause: `replicatedToCurrentRegion` only checks `PublishingProfile.TargetRegions` (intent), and the parent LRO returns `Succeeded` before per-region replicas are actually serving traffic. The `vmss.go` retry loop (10 × 5s) was far too short to absorb the gap and surfaced the infra failure as `failed to create VMSS after 10 retries`, indistinguishable from a real test failure.

Fix:
- `ensureReplication` now polls Get with `Expand=ReplicationStatus` until `RegionalReplicationStatus.State == Completed` (20 min timeout, 15s interval). Fails fast on terminal `Failed`.
- `CreateVMSSWithRetry` no longer burns fixed retries on `404 GalleryImageNotFound`. It calls the replication poller, which either returns quickly when the region is `Completed` (fabric-eventual-consistency race, so retry once), blocks until `Completed`, or fails fast on `Failed` with the real cause attached.
- Add a public `WaitForImageVersionReplicatedToRegion` entrypoint and unit tests for the region-name normalization in `findRegionalReplicationStatus`.

This should eliminate the `GalleryImageNotFound` class of false E2E failures across Linux VHD, Windows VHD, GPU, and base AgentBaker E2E pipelines.
Pull request overview
Improves AgentBaker E2E reliability by ensuring Azure Compute Gallery (SIG) image versions are actually replicated and serving in a target region before tests proceed, reducing noisy GalleryImageNotFound VMSS creation failures caused by eventual consistency.
Changes:
- Update `ensureReplication` to wait on per-region `RegionalReplicationStatus.State == Completed` rather than relying on the parent LRO/target region intent.
- Enhance VMSS create retry behavior to specially classify `GalleryImageNotFound` and consult the replication-state wait helper before retrying.
- Add unit tests for region name normalization and status lookup logic used when interpreting replication status.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| e2e/config/azure.go | Adds replication-status polling helpers and integrates them into replication enforcement + VMSS retry support. |
| e2e/vmss.go | Extends retry classification to treat GalleryImageNotFound specially and consult replication state before retrying. |
| e2e/config/azure_replication_test.go | Adds table-driven tests for findRegionalReplicationStatus normalization and nil tolerance. |
```go
if err != nil {
	return fmt.Errorf("create image version client: %w", err)
}
resp, err := imgVersionClient.Get(ctx, image.Gallery.ResourceGroupName, image.Gallery.Name, image.Name, image.Version, nil)
```
The doc comment for WaitForImageVersionReplicatedToRegion says it "fetches the live image version with ReplicationStatus expanded", but the initial Get call uses nil options (no Expand). Either update the comment to reflect that the expand happens inside waitForRegionalReplicationCompleted, or pass GalleryImageVersionsClientGetOptions{Expand: ReplicationStatusTypesReplicationStatus} here for consistency.
```diff
-resp, err := imgVersionClient.Get(ctx, image.Gallery.ResourceGroupName, image.Gallery.Name, image.Name, image.Version, nil)
+resp, err := imgVersionClient.Get(ctx, image.Gallery.ResourceGroupName, image.Gallery.Name, image.Name, image.Version, &armcompute.GalleryImageVersionsClientGetOptions{
+	Expand: to.Ptr(armcompute.ReplicationStatusTypesReplicationStatus),
+})
```
```go
if isGalleryImageMissing && s.VHD != nil && s.Runtime != nil && s.Runtime.Cluster != nil && s.Runtime.Cluster.Model != nil && s.Runtime.Cluster.Model.Location != nil {
	location := *s.Runtime.Cluster.Model.Location
	toolkit.Logf(ctx, "GalleryImageNotFound on VMSS create in %s; consulting regional replication status before retrying", location)
	if waitErr := config.Azure.WaitForImageVersionReplicatedToRegion(ctx, s.VHD, location); waitErr != nil {
		// Surface the precise infra cause instead of a generic "failed after N retries".
		return vm, fmt.Errorf("VMSS create returned GalleryImageNotFound and regional replication is not usable: %w (original error: %v)", waitErr, err)
	}
	toolkit.Logf(ctx, "Regional replication confirmed Completed; retrying VMSS create immediately")
	continue
}
```
On GalleryImageNotFound, the code waits for replication to be Completed and then immediately continues, which skips the normal backoff delay. If the compute fabric is still eventually consistent even after replication is Completed, this can spin in a tight loop (up to maxAttempts) and hammer the VMSS create API without sleeping. Consider limiting this path to a single immediate retry (track a boolean), or apply a minimal delay/backoff before continuing.
```go
got := findRegionalReplicationStatus(tt.status, tt.location)
if !tt.wantFound {
	if got != nil {
		t.Fatalf("expected nil, got region %q state %v", *got.Region, *got.State)
```
In the !tt.wantFound branch, the failure message dereferences *got.Region and *got.State without checking for nil. If findRegionalReplicationStatus ever returns an entry with a nil Region or State (which this test data already includes elsewhere), the test will panic instead of producing a helpful assertion failure. Consider guarding those pointers (or formatting the struct with %#v) in the error message.
```diff
-t.Fatalf("expected nil, got region %q state %v", *got.Region, *got.State)
+t.Fatalf("expected nil, got %#v", got)
```
```go
resp, err := imgVersionClient.Get(ctx, image.Gallery.ResourceGroupName, image.Gallery.Name, image.Name, *version.Name, getOpts)
if err != nil {
	// Transient API errors should not abort the wait; the SDK already retries throttling.
	toolkit.Logf(ctx, "transient error getting image version replication status (will retry): %v", err)
	return false, nil
}
```
waitForRegionalReplicationCompleted treats any imgVersionClient.Get error as transient and continues polling (return false, nil). This can mask permanent failures (e.g., auth/permission issues, 404 due to wrong RG/name) and will only surface as a timeout, losing the underlying cause. Consider only swallowing clearly transient conditions (e.g., 429/5xx/timeouts) and returning the error for non-retryable 4xx so callers fail fast with the real Azure error.
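The classification the reviewer suggests can be keyed off HTTP status codes. In the real poller you would extract the status from the Azure SDK's `*azcore.ResponseError` via `errors.As`; the sketch below uses a plain status int to stay self-contained, and the function name `isTransientStatus` is hypothetical:

```go
package main

import "fmt"

// isTransientStatus reports whether a polling error should be swallowed and
// retried: throttling (429), request timeout (408), and server-side 5xx are
// transient; every other 4xx (403 auth, 404 wrong RG/name, ...) is permanent
// and should be returned so callers fail fast with the real Azure error.
func isTransientStatus(status int) bool {
	switch {
	case status == 429, status == 408:
		return true
	case status >= 500 && status <= 599:
		return true
	default:
		return false
	}
}

func main() {
	for _, s := range []int{429, 503, 404, 403} {
		fmt.Println(s, isTransientStatus(s))
	}
}
```

Inside the poll function, a permanent status would translate to `return false, err` (abort with cause) instead of the current `return false, nil` (keep polling).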
Summary

Fixes `GalleryImageNotFound` cascading failures in AgentBaker E2E. `ensureReplication` now waits on per-region `RegionalReplicationStatus.State == Completed` instead of trusting the parent provisioning LRO. The `vmss.go` retry loop classifies the 404 against actual replication state — failing fast on terminal `Failed`, blocking on `Replicating`, retrying once on the genuinely transient fabric-consistency case.

Background
On PR #8228's GPU E2E run (build 161141895), 22 of 368 tests failed with the same shape:
The pipeline log shows `ensureReplication` had already emitted `Replication took: 12m2.7s` for that region — i.e. it returned success — and then VMSS create kept 404'ing for ~10 more minutes.

Root cause
`replicatedToCurrentRegion` checks `PublishingProfile.TargetRegions`, which records intent to replicate, not the actual regional state. The Azure Compute Gallery Get API returns `provisioningState=Succeeded` once the parent LRO finishes, but `RegionalReplicationStatus.State` for the new region can remain `Replicating` for several more minutes before it transitions to `Completed`. During that window, the `imageReference` is unresolvable from `Microsoft.Compute/virtualMachineScaleSets` create — yielding the 404.

The existing retry loop in `vmss.go` (10 attempts × 5s = ~50 s) was orders of magnitude shorter than that window, and it surfaced the failure as `failed to create VMSS after 10 retries` — visually indistinguishable from a real test failure. PR authors had no way to tell that the change-under-test was never actually exercised.

Why fix here, not in the pipeline YAML
Considered and rejected:

- `t.Skip` on `GalleryImageNotFound` — silently masks real bugs. PR authors would think their change was validated when it wasn't.
- A longer fixed retry loop — can't distinguish `Failed` from `Replicating`, wastes CRP quota, and still produces a misleading error message.

The framework already owns "ensure image is usable in this region" (`ensureReplication`). The right fix is to make that primitive honor its contract.

What changed
`e2e/config/azure.go`

- `waitForRegionalReplicationCompleted`: polls the gallery image version with `Expand=ReplicationStatusTypesReplicationStatus` (15s interval, 20 min cap) until `RegionalReplicationStatus.State == Completed`. Treats `Replicating`/`Unknown` as transient. Fails fast with attached `Details` on terminal `Failed`.
- `ensureReplication` now calls it on both paths: when the target region's state is still `Replicating` (this is exactly what bit PR #8228), and when the parent LRO reports `Succeeded` with the regional state still `Replicating`.
- `WaitForImageVersionReplicatedToRegion`: public entrypoint for the retry loop in `vmss.go` — fetches the live version and delegates to the same poller.

`e2e/vmss.go`

- `CreateVMSSWithRetry`'s classifier now distinguishes `GalleryImageNotFound` from `AllocationFailed`. On the 404 it consults the actual regional replication state via `WaitForImageVersionReplicatedToRegion`.

`e2e/config/azure_replication_test.go`

- Table-driven tests for `findRegionalReplicationStatus` covering nil status, case/whitespace normalization both ways (ARM returns `"West US 2"`, our location strings are `"westus2"`), nil-entry tolerance, and missing-region lookups.
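The normalization the tests cover ("West US 2" vs "westus2") boils down to case-folding and stripping spaces before comparing. A minimal sketch, with hypothetical names (`normalizeRegion`, `findStatus`, simplified structs) standing in for the PR's actual helpers:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeRegion folds case and strips spaces so ARM display names like
// "West US 2" compare equal to canonical location strings like "westus2".
func normalizeRegion(region string) string {
	return strings.ToLower(strings.ReplaceAll(region, " ", ""))
}

type regionalStatus struct {
	Region *string
	State  *string
}

// findStatus mirrors the lookup under test: skip nil entries and nil Region
// pointers, and match on the normalized region name.
func findStatus(statuses []*regionalStatus, location string) *regionalStatus {
	want := normalizeRegion(location)
	for _, s := range statuses {
		if s == nil || s.Region == nil {
			continue
		}
		if normalizeRegion(*s.Region) == want {
			return s
		}
	}
	return nil
}

func main() {
	region, state := "West US 2", "Completed"
	statuses := []*regionalStatus{nil, {Region: &region, State: &state}}
	fmt.Println(normalizeRegion("West US 2"), findStatus(statuses, "westus2") != nil)
}
```

Normalizing both sides makes the comparison symmetric, so it works regardless of which spelling ARM or the test config supplies.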
Behavior changes

- Before: `failed to create VMSS after 10 retries` failures every time a fresh VHD ran multi-region E2E. Test logs blamed the test code.
- After: `ensureReplication` blocks until the regional replica is actually serving — the usual case, with no retry pressure on CRP at all. If something is genuinely wrong with replication, you get a clear `regional replication of <id> to <region> failed: <Azure-provided details>` instead of `failed to create VMSS after 10 retries`. If it's the rare fabric-eventual-consistency case (LRO and ReplicationStatus both Completed but the compute fabric not yet propagated), the existing retry catches it without a long fixed sleep.
Test plan

- `go build ./...` clean
- `go test ./config/ -run TestFindRegionalReplicationStatus -v -vet=off` passes (vet-off only because of an unrelated pre-existing nit in `config/config.go:182`)