fix(replica): don't re-restore an already-verified or in-progress snapshot #91
Merged
…pshot

The snapshot-list result handler only compared the picked snapshot against the *active* restore. For an ephemeral verify replica the active restore is torn down after verification, so a later snapshot-list job resolving to the same snapshot passed the guard, created a second restore, and reported a duplicate restore-verification to canopy.

Skip creation when the snapshot is already recorded in status.verifiedSnapshotId or when any non-failed restore is already working on it (Pending/Restoring/Ready/Switching/Active). Failed restores still allow a retry via the failure backoff path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
🤖 The snapshot-list result handler only compared the picked snapshot against the active restore. For an ephemeral `verify` replica the active restore is torn down after verification, so a later snapshot-list job resolving to the same snapshot passed the guard, created a second restore, and reported a duplicate restore-verification to canopy (observed as two `verify/healthy` reports for the same snapshot ~76s apart).

The switchover path itself cannot double-fire for a single restore CR: the phase is flipped to `Active` before the report is sent, so any reconcile that reaches the report has already left the `Switching` state. The duplicate therefore came from a second restore CR for the same snapshot.

This tightens the create guard: skip creation when the snapshot is already recorded in `status.verifiedSnapshotId` (the ephemeral marker that outlives the torn-down restore) or when any non-failed restore is already working on it (Pending/Restoring/Ready/Switching/Active). Failed restores still allow a retry via the failure backoff path.

Defense-in-depth alongside the canopy-side change to drop verify entries from the worklist once their report is received; this closes the propagation-window race and any non-canopy trigger.
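
For reviewers who want the guard in one place, here is a minimal sketch of the tightened create check as described above. It is not the code from this PR: the `Restore` type, the `shouldCreateRestore` helper, the snapshot IDs, and the `Failed` phase name are all hypothetical, and the diff may structure the check differently.

```go
package main

import "fmt"

// Phase is a hypothetical restore phase type. The non-failed phase names
// are taken from the PR description; "Failed" is an assumed name.
type Phase string

const (
	PhasePending   Phase = "Pending"
	PhaseRestoring Phase = "Restoring"
	PhaseReady     Phase = "Ready"
	PhaseSwitching Phase = "Switching"
	PhaseActive    Phase = "Active"
	PhaseFailed    Phase = "Failed" // assumption: actual name may differ
)

// Restore is a hypothetical, trimmed-down view of a restore CR.
type Restore struct {
	SnapshotID string
	Phase      Phase
}

// shouldCreateRestore reports whether a new restore should be created for
// the picked snapshot. It returns false when the snapshot is already
// recorded as verified, or when any non-failed restore already targets it.
func shouldCreateRestore(picked, verifiedSnapshotID string, restores []Restore) bool {
	// The verified marker outlives the torn-down ephemeral restore.
	if picked == verifiedSnapshotID {
		return false
	}
	for _, r := range restores {
		// Failed restores do not block creation, so a retry stays possible.
		if r.SnapshotID == picked && r.Phase != PhaseFailed {
			return false
		}
	}
	return true
}

func main() {
	restores := []Restore{
		{SnapshotID: "snap-a", Phase: PhaseFailed},
		{SnapshotID: "snap-b", Phase: PhaseRestoring},
	}
	// Only a failed restore exists for snap-a, so creation is allowed.
	fmt.Println(shouldCreateRestore("snap-a", "", restores))
	// A restore for snap-b is in progress, so creation is skipped.
	fmt.Println(shouldCreateRestore("snap-b", "", restores))
	// snap-c is already verified and its restore is gone; still skipped.
	fmt.Println(shouldCreateRestore("snap-c", "snap-c", nil))
}
```

The third call is the case this PR targets: no restore exists any more, so a guard that only looks at the active restore would let the duplicate through, while the verified marker still blocks it.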