fix: purgeOSD retries on any transient failure, not just EBUSY exit code #748
Open
UtkarshBhatthere wants to merge 1 commit into
The EBUSY check used syscall.Errno(exitError.ExitCode()) to detect a busy OSD, but the ceph CLI exits with code 1 on failure, not the errno value 16. As a result the retry loop never fired: every failure broke out immediately at the "unexpected exit error" branch, leaving the OSD purge with zero effective retries when the OSD process had not fully stopped yet.

Fix: retry on any failure from doPurge, matching the wipeDevice pattern. Add purgeRetrySleepFunc for test injection (per the AGENTS.md Func suffix convention). Add TestPurgeOSD covering success, transient failure, and retry exhaustion.

Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>
Summary
- `purgeOSD` had a dead retry loop: it checked `syscall.Errno(exitError.ExitCode()) != syscall.EBUSY` to decide whether to retry, but `ceph osd purge` exits with code 1 on failure, not the errno value 16 (EBUSY). Every failure hit the "unexpected exit error" break and retried zero times.
- Retry on any failure from `doPurge`, matching the existing `wipeDevice` pattern. The OSD process may still be shutting down when the purge is attempted, causing transient errors.
- Add a `purgeRetrySleepFunc` injectable variable (per the AGENTS.md `Func` suffix convention) to allow test-time sleep suppression.
- Add `TestPurgeOSD` covering success-on-first-try, transient-failure-then-success, and retry exhaustion.

Test Plan
- `go test ./ceph/... -run TestOSD -count=1` passes
- `TestPurgeOSD` covers all three cases
- `test_dsl_api_disk_hurl`: OSD purge race no longer flaky