Conversation
cloud/disk_manager/internal/pkg/facade/disk_service_test/disk_service_test.go
Show resolved
Hide resolved
| > | ||
| DisksConfig: < | ||
| DeletedDiskExpirationTimeout: "1s" | ||
| DeletedDiskExpirationTimeout: "30s" |
There was a problem hiding this comment.
There are no tests that need small value of this parameter, aren't there?
There was a problem hiding this comment.
If you change this value it is better to add a comment on why it is needed
There was a problem hiding this comment.
There is a multi-line string """, we can not add a comment inside it.
Seems that the best we could do here is to use string concatination here like this
long_string = (
"First part of the string\n" # This is a comment
"Second part of the string\n" # This is another comment
"Third part" # Comment on the last line
)
There was a problem hiding this comment.
You can pass this value as a parameter and add a comment somewhere in the code
|
|
||
| // Should wait here because base disk slot might be released on overlay disk | ||
| // creation operation cancel (and exact time of this event is unknown). | ||
| time.Sleep(time.Second * 10) |
There was a problem hiding this comment.
It would be very convinient (here and in some other test) if we new either the task completed its cancellation.
Maybe it's ok to add this information to the Operation message in dm api? https://github.com/ydb-platform/nbs/blob/main/cloud/disk_manager/api/operation.proto#L11.
There was a problem hiding this comment.
I think we can look at task state directly using tasks.Storage similar to this code
There was a problem hiding this comment.
This is a good idea. Added new method to tasks storage and used it in the test.
7999977 to
68c64e7
Compare
There was a problem hiding this comment.
Pull request overview
Fixes a race in overlay disk create/cancel that could leak base disk slots (blocking base disk retirement), and strengthens the concurrency regression test to detect slot leaks reliably.
Changes:
- Ensure
CreateOverlayDiskTask.Cancel()releases the acquired base disk slot even if the overlay disk record is already missing but the selected cell/zone is known. - Add/adjust unit and integration tests to validate base disk slot release under concurrent create/delete.
- Update the test recipe to keep deleted disk records long enough for the concurrency test to observe and validate slot release.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| cloud/disk_manager/test/recipe/disk_manager_launcher.py | Increases deleted-disk expiration in the test environment to avoid premature cleanup during the race test. |
| cloud/disk_manager/internal/pkg/services/disks/create_overlay_disk_task.go | Refactors overlay disk identification and adds an explicit base-disk-slot release path on cancel when the disk meta is absent but cell is known. |
| cloud/disk_manager/internal/pkg/services/disks/create_overlay_disk_task_test.go | Adds a unit test covering cancel-before-disk-created with slot release; updates existing cancel test state setup. |
| cloud/disk_manager/internal/pkg/facade/disk_service_test/disk_service_test.go | Strengthens the concurrent create/delete test by checking that the base disk slot is released. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
cloud/disk_manager/internal/pkg/services/disks/create_overlay_disk_task.go
Show resolved
Hide resolved
cloud/disk_manager/internal/pkg/facade/disk_service_test/disk_service_test.go
Outdated
Show resolved
Hide resolved
| return err | ||
| } | ||
|
|
||
| _, err = t.scheduler.WaitTask(ctx, execCtx, taskID) |
There was a problem hiding this comment.
return t.scheduler.WaitTask(...
| @@ -170,35 +167,36 @@ func (t *createOverlayDiskTask) Cancel( | |||
| } | |||
|
|
|||
| if diskMeta == nil { | |||
There was a problem hiding this comment.
If diskMeta is nil, it means we could not start acquiring the base disk, so releasing it is not necessary
There was a problem hiding this comment.
Unfortunately, is't not true -- this is just the reason why we need a fix. The disk might have been deleted concurrently by DeleteDisk task -- see the scenario in pr description
There was a problem hiding this comment.
The disk might have been deleted concurrently by DeleteDisk task
In this case base disk should be released by DeleteDisk task
There was a problem hiding this comment.
The release by DeleteDisk task could happen earliar then the acquire by CreateOverayDisk. If so, nothing happens on release, then the slot is acquired and never released.
The is an alternative way to fix this bug -- we could change the behaviour of ReleaseBaseDisk in case then there is no slot for this overlay disk. Now it is noop. But we could insert a tombstone or something like that, so the AcquireDisk will see this tombstone and will not acquire a slot. But such approach seems too difficult.
There was a problem hiding this comment.
One more possible approach is to check the disk state in one transaction with acquiring base disk slot. Then acquire will see that the disk is already deleting/deleted and will not acquire a slot.
But here we need to query the disks table https://github.com/ydb-platform/nbs/blob/main/cloud/disk_manager/internal/pkg/resources/disks.go#L227 from the pool storage https://github.com/ydb-platform/nbs/tree/main/cloud/disk_manager/internal/pkg/services/pools/storage. Is it ok for us?
There was a problem hiding this comment.
The is an alternative way to fix this bug -- we could change the behaviour of ReleaseBaseDisk in case then there is no slot for this overlay disk. Now it is noop. But we could insert a tombstone or something like that, so the AcquireDisk will see this tombstone and will not acquire a slot
This is the correct approach. The CreateOverlayDisk and DeleteDisk task code was designed with this behavior in mind
When releasing a base disk slot, a tombstone should be used, similar to the one used when deleting a disk
| func (t *createOverlayDiskTask) releaseBaseDisk( | ||
| ctx context.Context, | ||
| execCtx tasks.ExecutionContext, | ||
| overlayDisk *types.Disk, |
There was a problem hiding this comment.
Maybe use getOverlayDisk instead?
There was a problem hiding this comment.
But maybe there is no need to extract releaseBaseDisk as a separate method
In #4339 there was a bug. This bug leads to a leaking base disk slots: the overlay disk is deleted, but its slot remains in the database.
Scenario:
Leaking slots result in hanging RetireBaseDisk task: the base disk with leaking slot will never retire.
In this pr we make the following changes:
Without the fix, the improve test fails with error: