fix: purgeOSD retries on any transient failure, not just EBUSY exit code (#748)

Open: UtkarshBhatthere wants to merge 1 commit into main from fix/osd-purge-retry
@UtkarshBhatthere
Contributor

Summary

  • purgeOSD had a dead retry loop: it checked syscall.Errno(exitError.ExitCode()) != syscall.EBUSY to decide whether to retry, but ceph osd purge exits with code 1 on failure, not the errno value 16 (EBUSY). Every failure hit the "unexpected exit error" break and retried zero times.
  • Fix: retry on any failure from doPurge, matching the existing wipeDevice pattern. The OSD process may still be shutting down when purge is attempted, causing transient errors.
  • Add purgeRetrySleepFunc injectable variable (per AGENTS.md Func suffix convention) to allow test-time sleep suppression.
  • Add TestPurgeOSD covering success-on-first-try, transient-failure-then-success, and retry exhaustion.

Test Plan

  • go test ./ceph/... -run TestOSD -count=1 passes
  • TestPurgeOSD covers all three cases
  • Integration: test_dsl_api_disk_hurl OSD purge race no longer flaky

The EBUSY check used syscall.Errno(exitError.ExitCode()) to detect a busy
OSD, but ceph CLI exits with code 1 on failure, not the errno value 16.
This meant the retry loop never fired: every failure broke out immediately
at the "unexpected exit error" branch, leaving the OSD purge with zero
effective retries when the process had not fully stopped yet.

Fix: retry on any failure from doPurge, matching the wipeDevice pattern.
Add purgeRetrySleepFunc for test injection (per AGENTS.md Func suffix
convention). Add TestPurgeOSD covering success, transient failure, and
retry exhaustion.

Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: Utkarsh Bhatt <utkarsh.bhatt@canonical.com>