Skip to content

fix(e2e): wait on RegionalReplicationStatus before declaring SIG image usable#8374

Open
ganeshkumarashok wants to merge 1 commit intomainfrom
e2e-fix-gallery-image-replication-wait
Open

fix(e2e): wait on RegionalReplicationStatus before declaring SIG image usable#8374
ganeshkumarashok wants to merge 1 commit intomainfrom
e2e-fix-gallery-image-replication-wait

Conversation

@ganeshkumarashok
Copy link
Copy Markdown
Contributor

Summary

  • Fix GalleryImageNotFound cascading failures in AgentBaker E2E. ensureReplication now waits on per-region RegionalReplicationStatus.State == Completed instead of trusting the parent provisioning LRO. The vmss.go retry loop classifies the 404 against actual replication state — failing fast on terminal Failed, blocking on Replicating, retrying once on the genuinely transient fabric-consistency case.

Background

On PR #8228's GPU E2E run (build 161141895), 22 of 368 tests failed with the same shape:

failed to create VMSS after 10 retries:
RESPONSE 404: GalleryImageNotFound
"The gallery image .../2404gen2containerd/versions/1.1776726423.26642
 is not available in westus2 region."

The pipeline log shows ensureReplication had already emitted Replication took: 12m2.7s for that region — i.e. it returned success — and then VMSS create kept 404'ing for ~10 more minutes.

Root cause

replicatedToCurrentRegion checks PublishingProfile.TargetRegions, which records intent to replicate, not the actual regional state. The Azure Compute Gallery Get API returns provisioningState=Succeeded once the parent LRO finishes, but RegionalReplicationStatus.State for the new region can remain Replicating for several more minutes before it transitions to Completed. During that window, the imageReference is unresolvable from Microsoft.Compute/virtualMachineScaleSets create — yielding the 404.

The existing retry loop in vmss.go (10 attempts × 5s = ~50 s) was orders of magnitude shorter than that window, and it surfaced the failure as failed to create VMSS after 10 retries — visually indistinguishable from a real test failure. PR authors had no way to tell that the change-under-test was never actually exercised.

Why fix here, not in the pipeline YAML

Considered and rejected:

  • Per-pipeline YAML pre-flight gate — duplicates the framework's responsibility, and tests in one E2E run pull different images for different regions, so a static gate would either over-replicate or under-gate.
  • t.Skip on GalleryImageNotFound — silently masks real bugs. PR authors would think their change was validated when it wasn't.
  • Bumping the retry budget — doesn't distinguish Failed from Replicating, wastes CRP quota, and still produces a misleading error message.

The framework already owns "ensure image is usable in this region" (ensureReplication). The right fix is to make that primitive honor its contract.

What changed

e2e/config/azure.go

  • New waitForRegionalReplicationCompleted: polls the gallery image version with Expand=ReplicationStatusTypesReplicationStatus (15s interval, 20 min cap) until RegionalReplicationStatus.State == Completed. Treats Replicating/Unknown as transient. Fails fast with attached Details on terminal Failed.
  • ensureReplication now calls it on both paths:
  • New public WaitForImageVersionReplicatedToRegion for the retry loop in vmss.go — fetches the live version and delegates to the same poller.

e2e/vmss.go

  • CreateVMSSWithRetry's classifier now distinguishes GalleryImageNotFound from AllocationFailed. On the 404 it consults the actual regional replication state via WaitForImageVersionReplicatedToRegion:
    • Completed → fall through, retry once (true fabric-consistency race, sub-minute window).
    • Replicating → block until Completed under the poller's 20-min cap, then retry.
    • Failed → return immediately with the precise infra error wrapped, instead of retrying nine more times.

e2e/config/azure_replication_test.go

  • New table-driven unit tests for findRegionalReplicationStatus covering nil status, case/whitespace normalization both ways (ARM returns "West US 2", our location strings are "westus2"), nil-entry tolerance, and missing-region.

Behavior changes

  • Before: 22 noisy red failed to create VMSS after 10 retries failures every time a fresh VHD ran multi-region E2E. Test logs blamed the test code.
  • After: ensureReplication blocks until the regional replica is actually serving — usual case, no retry pressure on CRP at all. If something is genuinely wrong with replication, you get a clear regional replication of <id> to <region> failed: <Azure-provided details> instead of failed to create VMSS after 10 retries. If it's the rare fabric-eventual-consistency case (LRO+ReplicationStatus both Completed but compute fabric not yet propagated), the existing retry catches it without a long fixed sleep.

Test plan

  • go build ./... clean
  • go test ./config/ -run TestFindRegionalReplicationStatus -v -vet=off passes (vet-off only because of an unrelated pre-existing nit in config/config.go:182)
  • CI: full e2e + unit-test gates on this PR
  • Post-merge: rerun GPU E2E on PR feat: skip MIG reboot for H100 and newer GPU architectures #8228 and confirm the 22 GalleryImageNotFound failures don't recur and the MIG H100 change gets a real signal

…ions

E2E was reporting "Replication took: 12m" via ensureReplication then
immediately failing on VMSS create with `404 GalleryImageNotFound` for
~10 more minutes. Root cause: replicatedToCurrentRegion only checks
PublishingProfile.TargetRegions (intent), and the parent LRO returns
Succeeded before per-region replicas are actually serving traffic.
The vmss.go retry loop (10 x 5s) was far too short to absorb the gap
and surfaced the infra failure as `failed to create VMSS after 10
retries`, indistinguishable from a real test failure.

Fix:
- ensureReplication now polls Get with Expand=ReplicationStatus until
  RegionalReplicationStatus.State == Completed (20 min timeout, 15s
  interval). Fails fast on terminal Failed.
- CreateVMSSWithRetry no longer burns fixed retries on 404
  GalleryImageNotFound. It calls the replication poller, which either
  returns quickly when the region is Completed (fabric-eventual-
  consistency race -> retry once), blocks until Completed, or fails
  fast on Failed with the real cause attached.
- Add a public WaitForImageVersionReplicatedToRegion entrypoint and
  unit tests for the region-name normalization in
  findRegionalReplicationStatus.

This should eliminate the GalleryImageNotFound class of false E2E
failures across Linux VHD, Windows VHD, GPU, and base AgentBaker E2E
pipelines.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves AgentBaker E2E reliability by ensuring Azure Compute Gallery (SIG) image versions are actually replicated and serving in a target region before tests proceed, reducing noisy GalleryImageNotFound VMSS creation failures caused by eventual consistency.

Changes:

  • Update ensureReplication to wait on per-region RegionalReplicationStatus.State == Completed rather than relying on the parent LRO/target region intent.
  • Enhance VMSS create retry behavior to specially classify GalleryImageNotFound and consult the replication-state wait helper before retrying.
  • Add unit tests for region name normalization and status lookup logic used when interpreting replication status.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
e2e/config/azure.go Adds replication-status polling helpers and integrates them into replication enforcement + VMSS retry support.
e2e/vmss.go Extends retry classification to treat GalleryImageNotFound specially and consult replication state before retrying.
e2e/config/azure_replication_test.go Adds table-driven tests for findRegionalReplicationStatus normalization and nil tolerance.

Comment thread e2e/config/azure.go
if err != nil {
return fmt.Errorf("create image version client: %w", err)
}
resp, err := imgVersionClient.Get(ctx, image.Gallery.ResourceGroupName, image.Gallery.Name, image.Name, image.Version, nil)
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc comment for WaitForImageVersionReplicatedToRegion says it "fetches the live image version with ReplicationStatus expanded", but the initial Get call uses nil options (no Expand). Either update the comment to reflect that the expand happens inside waitForRegionalReplicationCompleted, or pass GalleryImageVersionsClientGetOptions{Expand: ReplicationStatusTypesReplicationStatus} here for consistency.

Suggested change
resp, err := imgVersionClient.Get(ctx, image.Gallery.ResourceGroupName, image.Gallery.Name, image.Name, image.Version, nil)
resp, err := imgVersionClient.Get(ctx, image.Gallery.ResourceGroupName, image.Gallery.Name, image.Name, image.Version, &armcompute.GalleryImageVersionsClientGetOptions{
Expand: to.Ptr(armcompute.ReplicationStatusTypesReplicationStatus),
})

Copilot uses AI. Check for mistakes.
Comment thread e2e/vmss.go
Comment on lines +470 to +479
if isGalleryImageMissing && s.VHD != nil && s.Runtime != nil && s.Runtime.Cluster != nil && s.Runtime.Cluster.Model != nil && s.Runtime.Cluster.Model.Location != nil {
location := *s.Runtime.Cluster.Model.Location
toolkit.Logf(ctx, "GalleryImageNotFound on VMSS create in %s; consulting regional replication status before retrying", location)
if waitErr := config.Azure.WaitForImageVersionReplicatedToRegion(ctx, s.VHD, location); waitErr != nil {
// Surface the precise infra cause instead of a generic "failed after N retries".
return vm, fmt.Errorf("VMSS create returned GalleryImageNotFound and regional replication is not usable: %w (original error: %v)", waitErr, err)
}
toolkit.Logf(ctx, "Regional replication confirmed Completed; retrying VMSS create immediately")
continue
}
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On GalleryImageNotFound, the code waits for replication to be Completed and then immediately continues, which skips the normal backoff delay. If the compute fabric is still eventually consistent even after replication is Completed, this can spin in a tight loop (up to maxAttempts) and hammer the VMSS create API without sleeping. Consider limiting this path to a single immediate retry (track a boolean), or apply a minimal delay/backoff before continuing.

Copilot uses AI. Check for mistakes.
got := findRegionalReplicationStatus(tt.status, tt.location)
if !tt.wantFound {
if got != nil {
t.Fatalf("expected nil, got region %q state %v", *got.Region, *got.State)
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the !tt.wantFound branch, the failure message dereferences *got.Region and *got.State without checking for nil. If findRegionalReplicationStatus ever returns an entry with a nil Region or State (which this test data already includes elsewhere), the test will panic instead of producing a helpful assertion failure. Consider guarding those pointers (or formatting the struct with %#v) in the error message.

Suggested change
t.Fatalf("expected nil, got region %q state %v", *got.Region, *got.State)
t.Fatalf("expected nil, got %#v", got)

Copilot uses AI. Check for mistakes.
Comment thread e2e/config/azure.go
Comment on lines +628 to +633
resp, err := imgVersionClient.Get(ctx, image.Gallery.ResourceGroupName, image.Gallery.Name, image.Name, *version.Name, getOpts)
if err != nil {
// Transient API errors should not abort the wait; the SDK already retries throttling.
toolkit.Logf(ctx, "transient error getting image version replication status (will retry): %v", err)
return false, nil
}
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

waitForRegionalReplicationCompleted treats any imgVersionClient.Get error as transient and continues polling (return false, nil). This can mask permanent failures (e.g., auth/permission issues, 404 due to wrong RG/name) and will only surface as a timeout, losing the underlying cause. Consider only swallowing clearly transient conditions (e.g., 429/5xx/timeouts) and returning the error for non-retryable 4xx so callers fail fast with the real Azure error.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants