Skip to content

[Disk Manager] Fix leak of base disk slots#5539

Open
gy2411 wants to merge 6 commits intomainfrom
users/gayurgin/fix_leak_of_base_disk_slots
Open

[Disk Manager] Fix leak of base disk slots#5539
gy2411 wants to merge 6 commits intomainfrom
users/gayurgin/fix_leak_of_base_disk_slots

Conversation

@gy2411
Copy link
Copy Markdown
Collaborator

@gy2411 gy2411 commented Mar 20, 2026

In #4339 there was a bug. This bug leads to a leaking base disk slots: the overlay disk is deleted, but its slot remains in the database.

Scenario:

  • Overlay disk is creating
  • Concurrenlty, this overlay disk is deleting
  • Disk is deleted successfully
  • CreateOverlayDiskTask successfully acquires base disk slot
  • CreateOverlayDiskTask sees that disk is deleted and starts to cancel
  • On cancel, CreateOverlayDiskTask sees that the disk is deleted and does not proceed with releasing base disk

Leaking slots result in hanging RetireBaseDisk task: the base disk with leaking slot will never retire.


In this pr we make the following changes:

  • CreateOverlayDiskTask releases base disk slot on cancel if the disk does not exist, but its zone is known. This fixes the bug with leaking slots.
  • Improve the test on concurrent creation/deletion using CheckBaseDiskSlotReleased.
  • Increase DeletedDiskExpirationTimeout parameter in test config. Otherwise, the deleted disk gets cleaned up from the database too quickly and the test does not work.

Without the fix, the improve test fails with error:

common.go:711: 
                Error Trace:    /-S/cloud/disk_manager/internal/pkg/facade/testcommon/common.go:711
                                                        /-S/cloud/disk_manager/internal/pkg/facade/disk_service_test/disk_service_test.go:1273
                Error:          Received unexpected error:
                                Non retriable error, Silent=false: internal inconsistency: slot &{overlayDiskID:TestDiskServiceDeleteDiskWhenCreationIsInFlightAAAA overlayDiskKind:0 overlayDiskSize:134217728 baseDiskID:base-6c1b5f58-9f2f-44fb-8f27-6856e31de178 imageID:TestDiskServiceDeleteDiskWhenCreationIsInFlightAAAA zoneID:zone-a allottedSlots:1 allottedUnits:1 releasedAt:{wall:0 ext:62135596800 loc:0x1fe8200} targetZoneID: targetBaseDiskID: targetAllottedSlots:0 targetAllottedUnits:0 generation:0 status:0} should be released

@gy2411 gy2411 added large-tests Launch large tests for PR disk_manager Add this label to run only cloud/disk_manager build and tests on PR labels Mar 20, 2026
>
DisksConfig: <
DeletedDiskExpirationTimeout: "1s"
DeletedDiskExpirationTimeout: "30s"
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no tests that need small value of this parameter, aren't there?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you change this value it is better to add a comment on why it is needed

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a multi-line string """, we can not add a comment inside it.

Seems that the best we could do here is to use string concatination here like this

long_string = (
    "First part of the string\n" # This is a comment
    "Second part of the string\n" # This is another comment
    "Third part"                 # Comment on the last line
)

Copy link
Copy Markdown
Collaborator

@SvartMetal SvartMetal Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can pass this value as a parameter and add a comment somewhere in the code


// Should wait here because base disk slot might be released on overlay disk
// creation operation cancel (and exact time of this event is unknown).
time.Sleep(time.Second * 10)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be very convinient (here and in some other test) if we new either the task completed its cancellation.

Maybe it's ok to add this information to the Operation message in dm api? https://github.com/ydb-platform/nbs/blob/main/cloud/disk_manager/api/operation.proto#L11.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can look at task state directly using tasks.Storage similar to this code

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good idea. Added new method to tasks storage and used it in the test.

@gy2411 gy2411 marked this pull request as ready for review March 20, 2026 08:48
@gy2411 gy2411 force-pushed the users/gayurgin/fix_leak_of_base_disk_slots branch from 7999977 to 68c64e7 Compare March 20, 2026 09:00
@github-actions
Copy link
Copy Markdown
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo target: cloud/disk_manager/ (test time: 1844s): all tests PASSED for commit 68c64e7.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
1415 1415 0 0 0 0 0

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a race in overlay disk create/cancel that could leak base disk slots (blocking base disk retirement), and strengthens the concurrency regression test to detect slot leaks reliably.

Changes:

  • Ensure CreateOverlayDiskTask.Cancel() releases the acquired base disk slot even if the overlay disk record is already missing but the selected cell/zone is known.
  • Add/adjust unit and integration tests to validate base disk slot release under concurrent create/delete.
  • Update the test recipe to keep deleted disk records long enough for the concurrency test to observe and validate slot release.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
cloud/disk_manager/test/recipe/disk_manager_launcher.py Increases deleted-disk expiration in the test environment to avoid premature cleanup during the race test.
cloud/disk_manager/internal/pkg/services/disks/create_overlay_disk_task.go Refactors overlay disk identification and adds an explicit base-disk-slot release path on cancel when the disk meta is absent but cell is known.
cloud/disk_manager/internal/pkg/services/disks/create_overlay_disk_task_test.go Adds a unit test covering cancel-before-disk-created with slot release; updates existing cancel test state setup.
cloud/disk_manager/internal/pkg/facade/disk_service_test/disk_service_test.go Strengthens the concurrent create/delete test by checking that the base disk slot is released.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@SvartMetal SvartMetal self-requested a review March 20, 2026 13:42
jkuradobery
jkuradobery previously approved these changes Mar 20, 2026
return err
}

_, err = t.scheduler.WaitTask(ctx, execCtx, taskID)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return t.scheduler.WaitTask(...

@@ -170,35 +167,36 @@ func (t *createOverlayDiskTask) Cancel(
}

if diskMeta == nil {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If diskMeta is nil, it means we could not start acquiring the base disk, so releasing it is not necessary

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, is't not true -- this is just the reason why we need a fix. The disk might have been deleted concurrently by DeleteDisk task -- see the scenario in pr description

Copy link
Copy Markdown
Collaborator

@SvartMetal SvartMetal Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The disk might have been deleted concurrently by DeleteDisk task

In this case base disk should be released by DeleteDisk task

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The release by DeleteDisk task could happen earliar then the acquire by CreateOverayDisk. If so, nothing happens on release, then the slot is acquired and never released.

The is an alternative way to fix this bug -- we could change the behaviour of ReleaseBaseDisk in case then there is no slot for this overlay disk. Now it is noop. But we could insert a tombstone or something like that, so the AcquireDisk will see this tombstone and will not acquire a slot. But such approach seems too difficult.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more possible approach is to check the disk state in one transaction with acquiring base disk slot. Then acquire will see that the disk is already deleting/deleted and will not acquire a slot.

But here we need to query the disks table https://github.com/ydb-platform/nbs/blob/main/cloud/disk_manager/internal/pkg/resources/disks.go#L227 from the pool storage https://github.com/ydb-platform/nbs/tree/main/cloud/disk_manager/internal/pkg/services/pools/storage. Is it ok for us?

Copy link
Copy Markdown
Collaborator

@SvartMetal SvartMetal Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The is an alternative way to fix this bug -- we could change the behaviour of ReleaseBaseDisk in case then there is no slot for this overlay disk. Now it is noop. But we could insert a tombstone or something like that, so the AcquireDisk will see this tombstone and will not acquire a slot

This is the correct approach. The CreateOverlayDisk and DeleteDisk task code was designed with this behavior in mind

When releasing a base disk slot, a tombstone should be used, similar to the one used when deleting a disk

state.status = diskStatusDeleted

func (t *createOverlayDiskTask) releaseBaseDisk(
ctx context.Context,
execCtx tasks.ExecutionContext,
overlayDisk *types.Disk,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use getOverlayDisk instead?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But maybe there is no need to extract releaseBaseDisk as a separate method

@github-actions
Copy link
Copy Markdown
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo target: cloud/disk_manager/ (test time: 1904s): all tests PASSED for commit ed788e2.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
1415 1415 0 0 0 0 0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

disk_manager Add this label to run only cloud/disk_manager build and tests on PR large-tests Launch large tests for PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants