ceph: treat already-removed partition entry as cleaned on WAL/DB cleanup#750
Open
johnramsden wants to merge 1 commit into
Open
ceph: treat already-removed partition entry as cleaned on WAL/DB cleanup#750johnramsden wants to merge 1 commit into
johnramsden wants to merge 1 commit into
Conversation
deletePartition removes a generated WAL/DB partition in two steps: "sfdisk --delete" then "partx -d". Because the sfdisk call does not pass --no-reread, the kernel can re-read the partition table and drop the entry before "partx -d" runs, so partx exits non-zero with "error deleting partition". That redundant failure propagated up and intermittently failed "microceph disk remove" when generated partitions shared a WAL/DB carrier (DSL rm2 case). Treat a "partx -d" failure as success when the partition no longer resolves, mirroring the existing idempotency handling on the preceding "sfdisk --delete" step. Add TestDeletePartitionTreatsAlreadyRemovedKernelEntryAsCleaned, which reproduces the race deterministically. Fixes: canonical#749 Assisted-by: claude-code:claude-opus-4-8 Signed-off-by: John Ramsden <john.ramsden@canonical.com>
There was a problem hiding this comment.
Pull request overview
Fixes a flaky WAL/DB partition cleanup failure where partx -d would fail because sfdisk --delete had already triggered a kernel re-read removing the partition entry. The fix mirrors the existing idempotency handling on sfdisk --delete by treating partx -d failure as success when the partition no longer resolves.
Changes:
- Add idempotency check after
partx -dindeletePartition: if the partition no longer resolves, treat as cleaned. - Add regression test reproducing the race deterministically.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| microceph/ceph/osd_waldb.go | Treats partx -d error as success when the partition entry is already gone. |
| microceph/ceph/osd_waldb_test.go | Adds regression test TestDeletePartitionTreatsAlreadyRemovedKernelEntryAsCleaned. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
deletePartition removes a generated WAL/DB partition in two steps: "sfdisk --delete" then "partx -d". Because the sfdisk call does not pass --no-reread, the kernel can re-read the partition table and drop the entry before "partx -d" runs, so partx exits non-zero with "error deleting partition". That redundant failure propagated up and intermittently failed "microceph disk remove" when generated partitions shared a WAL/DB carrier (DSL rm2 case).
Treat a "partx -d" failure as success when the partition no longer resolves, mirroring the existing idempotency handling on the preceding "sfdisk --delete" step. Add TestDeletePartitionTreatsAlreadyRemovedKernelEntryAsCleaned, which reproduces the race deterministically.
Fixes: #749
Type of change
Delete options that are not relevant.
How has this been tested?
Added unit test exercising regression
Contributor checklist
Please check that you have: