Skip to content

[Disk manager] correctly handle deletion of non-existent disk; improve test on concurrent overlay disk creation and deletion#4830

Closed
gy2411 wants to merge 3 commits intomainfrom
users/gayurgin/fix_retire_base_disk
Closed

[Disk manager] correctly handle deletion of non-existent disk; improve test on concurrent overlay disk creation and deletion#4830
gy2411 wants to merge 3 commits intomainfrom
users/gayurgin/fix_retire_base_disk

Conversation

@gy2411
Copy link
Copy Markdown
Collaborator

@gy2411 gy2411 commented Dec 12, 2025

In #4339 there was a bug. This bug leads to a leaking base disk slot and, consequenly, to hanging retire base disk task.

Scenario:

  • Overlay disk is creating
  • Concurrenlty, this overlay disk is deleting
  • Disk is deleted successfully
  • CreateOverlayDiskTask successfully acquires base disk slot
  • CreateOverlayDiskTask sees that disk is deleted and starts to cancel
  • On cancel, CreateOverlayDiskTask sees that the disk is deleted and does not proceed with releasing base disk

In this pr we make the following changes:

  • Task does not stop after storage.DeleteDisk call even if the is is already deleted.
  • Improve the test using CheckBaseDiskSlotReleased.
  • Increase DeletedDiskExpirationTimeout parameter in test config. Otherwise, the deleted disk gets cleaned up from the database too quickly and the test does not work.

Without the fix, the improve test fails with error:

common.go:711: 
                Error Trace:    /-S/cloud/disk_manager/internal/pkg/facade/testcommon/common.go:711
                                                        /-S/cloud/disk_manager/internal/pkg/facade/disk_service_test/disk_service_test.go:1273
                Error:          Received unexpected error:
                                Non retriable error, Silent=false: internal inconsistency: slot &{overlayDiskID:TestDiskServiceDeleteDiskWhenCreationIsInFlightAAAA overlayDiskKind:0 overlayDiskSize:134217728 baseDiskID:base-6c1b5f58-9f2f-44fb-8f27-6856e31de178 imageID:TestDiskServiceDeleteDiskWhenCreationIsInFlightAAAA zoneID:zone-a allottedSlots:1 allottedUnits:1 releasedAt:{wall:0 ext:62135596800 loc:0x1fe8200} targetZoneID: targetBaseDiskID: targetAllottedSlots:0 targetAllottedUnits:0 generation:0 status:0} should be released

@gy2411 gy2411 added large-tests Launch large tests for PR disk_manager Add this label to run only cloud/disk_manager build and tests on PR labels Dec 12, 2025
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Dec 12, 2025

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit dabd103.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
1255 1253 0 2 0 0 0

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit dabd103.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
47 46 0 1 0 0 0

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit dabd103.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
45 45 0 0 0 0 0

@gy2411 gy2411 force-pushed the users/gayurgin/fix_retire_base_disk branch from dabd103 to 0838a21 Compare December 15, 2025 09:41
@github-actions
Copy link
Copy Markdown
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit 0838a21.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
1255 1255 0 0 0 0 0

@jkuradobery jkuradobery self-requested a review December 15, 2025 12:34
}
if len(zoneID) == 0 {
return t.storage.DiskDeleted(ctx, diskID, time.Now())
}
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Нужно продолжать удаление диска, даже если диска нет в сторадже.

В частности, таск удаления диска, отсутсвтующего в сторадже, должен отправить в NBS запрос на удаление этого диска.

Поэтому при пустой ZoneID не делаем return, а забираем зону из запроса (как мы это и делали до того, как посадили баг: https://github.com/ydb-platform/nbs/pull/4339/files#diff-1605caf3e850a302b598eb7fed17f3d13cae8a4defd1b6c2f18df892bcbf8060)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Выделил эту правку в отдельный pr #4910

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

И всё же лучше это место вернуть, как было.

@github-actions
Copy link
Copy Markdown
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit 0473505.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
1255 1255 0 0 0 0 0

@SvartMetal SvartMetal changed the title [Disk manager] correctry handle deletion of non-existent disk; improve test on concurrent overlay disk creation and deletion [Disk manager] correctly handle deletion of non-existent disk; improve test on concurrent overlay disk creation and deletion Dec 24, 2025
@gy2411 gy2411 force-pushed the users/gayurgin/fix_retire_base_disk branch from d1c6e49 to d97e0cd Compare January 12, 2026 09:23
@github-actions
Copy link
Copy Markdown
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit d97e0cd.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
1255 1255 0 0 0 0 0

@gy2411
Copy link
Copy Markdown
Collaborator Author

gy2411 commented Mar 20, 2026

We don't want to proceed deletion of disk when disk does not exist. So another approach is proposed, see #5539.

@gy2411 gy2411 closed this Mar 20, 2026
@gy2411
Copy link
Copy Markdown
Collaborator Author

gy2411 commented Mar 20, 2026

Close in favour to #5539.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

disk_manager Add this label to run only cloud/disk_manager build and tests on PR large-tests Launch large tests for PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant