ceph: treat already-removed partition entry as cleaned on WAL/DB cleanup #750

Open
johnramsden wants to merge 1 commit into canonical:main from johnramsden:fix/749-waldb-partx-idempotent-cleanup

@johnramsden
Member

Description

deletePartition removes a generated WAL/DB partition in two steps: "sfdisk --delete" then "partx -d". Because the sfdisk call does not pass --no-reread, the kernel can re-read the partition table and drop the entry before "partx -d" runs, so partx exits non-zero with "error deleting partition". That redundant failure propagated up and intermittently failed "microceph disk remove" when generated partitions shared a WAL/DB carrier (DSL rm2 case).

Treat a "partx -d" failure as success when the partition no longer resolves, mirroring the existing idempotency handling on the preceding "sfdisk --delete" step. Add TestDeletePartitionTreatsAlreadyRemovedKernelEntryAsCleaned, which reproduces the race deterministically.
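The control flow described above can be sketched as follows. This is a minimal illustration, not the code in microceph/ceph/osd_waldb.go: the names deletePartitionSketch, runFunc, resolveFunc and simulate are hypothetical, the exact partx arguments are an assumption, and the existing idempotency handling on the sfdisk step is left out for brevity. Commands are injected as functions so the race can be reproduced without touching a real disk.

```go
package main

import (
	"errors"
	"fmt"
)

// errPartx imitates the failure partx reports when the kernel entry is gone.
var errPartx = errors.New("partx: error deleting partition")

// runFunc and resolveFunc are hypothetical seams for running a command and
// for checking whether a partition path still resolves.
type runFunc func(name string, args ...string) error
type resolveFunc func(partPath string) bool

// deletePartitionSketch removes the on-disk entry, then the kernel entry.
// A partx failure counts as success when the partition no longer resolves.
func deletePartitionSketch(run runFunc, resolves resolveFunc, disk, partNum, partPath string) error {
	// Step 1: drop the entry from the partition table. Without --no-reread,
	// the kernel may re-read the table here and drop its own entry too.
	if err := run("sfdisk", "--delete", disk, partNum); err != nil {
		return fmt.Errorf("sfdisk --delete %s %s: %w", disk, partNum, err)
	}

	// Step 2: ask the kernel to forget the partition (arguments assumed).
	if err := run("partx", "-d", "--nr", partNum, disk); err != nil {
		if !resolves(partPath) {
			// The entry is already gone, so the goal has been reached.
			return nil
		}
		return fmt.Errorf("partx -d --nr %s %s: %w", partNum, disk, err)
	}
	return nil
}

// simulate models the race: sfdisk succeeds, partx always fails, and the
// kernel entry either survives the re-read or does not.
func simulate(entrySurvivesReread bool) error {
	present := true
	run := func(name string, args ...string) error {
		switch name {
		case "sfdisk":
			if !entrySurvivesReread {
				present = false // kernel re-read dropped the entry
			}
			return nil
		case "partx":
			return errPartx
		}
		return fmt.Errorf("unexpected command %q", name)
	}
	resolves := func(string) bool { return present }
	return deletePartitionSketch(run, resolves, "/dev/sdx", "1", "/dev/sdx1")
}

func main() {
	fmt.Println("entry already gone:", simulate(false))
	fmt.Println("entry still present:", simulate(true))
}
```

Only the "already gone" case is swallowed. If partx fails while the partition still resolves, the error is still returned, so genuine cleanup failures remain visible.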

Fixes: #749

Type of change

Delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

How has this been tested?

Added a unit test (TestDeletePartitionTreatsAlreadyRemovedKernelEntryAsCleaned) that exercises the regression.

Contributor checklist

Please check that you have:

  • self-reviewed the code in this PR
  • added code comments, particularly in less straightforward areas
  • checked and added or updated relevant documentation
  • added or updated HTML meta descriptions for any new or modified documentation pages (see #643)
  • verified that page title and headings accurately represent page content for new or modified documentation pages
  • checked and added or updated relevant release notes
  • added tests to verify effectiveness of this change

Assisted-by: claude-code:claude-opus-4-8
Signed-off-by: John Ramsden <john.ramsden@canonical.com>

Copilot AI left a comment

Pull request overview

Fixes a flaky WAL/DB partition cleanup failure where partx -d would fail because sfdisk --delete had already triggered a kernel re-read removing the partition entry. The fix mirrors the existing idempotency handling on sfdisk --delete by treating partx -d failure as success when the partition no longer resolves.

Changes:

  • Add idempotency check after partx -d in deletePartition: if the partition no longer resolves, treat as cleaned.
  • Add regression test reproducing the race deterministically.
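How "no longer resolves" is decided is not shown in this conversation. One plausible check, sketched here under that assumption, is to follow any symlinks on the partition path and confirm that the target still exists. The helper name partitionResolves and the example paths are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// partitionResolves reports whether path still leads to an existing node.
// Assumption: a partition path is often a symlink (for example under
// /dev/disk/by-partuuid), so symlinks are followed before the existence check.
// Any error, including a permission error, is reported as "does not resolve";
// a stricter version might treat only "not exist" errors that way.
func partitionResolves(path string) bool {
	target, err := filepath.EvalSymlinks(path)
	if err != nil {
		return false
	}
	_, err = os.Stat(target)
	return err == nil
}

func main() {
	paths := []string{".", "/dev/disk/by-partuuid/hypothetical-missing-entry"}
	for _, p := range paths {
		fmt.Printf("%s resolves: %v\n", p, partitionResolves(p))
	}
}
```

A check like this is only a point-in-time observation, so it is best used the way the PR describes: after a command has already failed, to decide whether the failure still matters.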

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| microceph/ceph/osd_waldb.go | Treats partx -d error as success when the partition entry is already gone. |
| microceph/ceph/osd_waldb_test.go | Adds regression test TestDeletePartitionTreatsAlreadyRemovedKernelEntryAsCleaned. |

Development

Successfully merging this pull request may close these issues.

Flaky: WAL/DB partition cleanup fails with 'partx -d ... error deleting partition' on shared-device OSD removal

2 participants