Skip to content

issue-5588: [Tasks] support non-cancellable tasks that can only either complete successfully or fail with a NonCancellableError#5589

Open
SvartMetal wants to merge 4 commits intomainfrom
users/svartmetal/issue-5588/1
Open

issue-5588: [Tasks] support non-cancellable tasks that can only either complete successfully or fail with a NonCancellableError#5589
SvartMetal wants to merge 4 commits intomainfrom
users/svartmetal/issue-5588/1

Conversation

@SvartMetal
Copy link
Copy Markdown
Collaborator

@SvartMetal SvartMetal commented Mar 26, 2026

Notes

Needed to mitigate the impact of bugs like #5511.

If a DeleteDisk task encounters a non-retriable error, it persists that error in its task state.
As a result, the corresponding Compute API operation becomes stuck, because it keeps retrying the same Disk Manager task and gets the same error every time.
Cleanup tasks should not be allowed to fail with retriable or non-retriable errors.
The only exception is a non-cancellable error, which means the task itself must not run and therefore must not be cancelled, to avoid further damaging the system.
In other words, cleanup tasks should always finish successfully, regardless of fatal errors returned by underlying services.
Such errors should still be reflected in metrics, but they should not cause the task to fail.
Whenever a cleanup task finishes unsuccessfully, this indicates an abnormal situation that likely requires manual intervention from the on-duty engineer

Issue

#5588

@SvartMetal SvartMetal added the large-tests Launch large tests for PR label Mar 26, 2026
@SvartMetal SvartMetal force-pushed the users/svartmetal/issue-5588/1 branch from c1b8e71 to 91d4b72 Compare March 26, 2026 01:01
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 26, 2026

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo target: cloud/tasks/,cloud/storage/ (test time: 304s): all tests PASSED for commit 91d4b72.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
878 878 0 0 0 0 0

🟢 linux-x86_64-relwithdebinfo target: cloud/disk_manager/ (test time: 1972s): all tests PASSED for commit 91d4b72.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
1414 1414 0 0 0 0 0

🟢 linux-x86_64-relwithdebinfo target: cloud/filestore/ (test time: 7136s): all tests PASSED for commit 91d4b72.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
3558 3558 0 0 0 0 0

🟢 linux-x86_64-relwithdebinfo target: cloud/blockstore/ (test time: 11263s): all tests PASSED for commit 91d4b72.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
6401 6400 0 0 0 1 0

@SvartMetal SvartMetal force-pushed the users/svartmetal/issue-5588/1 branch from 91d4b72 to cd08321 Compare March 26, 2026 23:12
@SvartMetal SvartMetal changed the title issue-5588: [Disk Manager] support non-cancellable tasks that can only either complete successfully or fail with a NonCancellableError issue-5588: [Tasks] support non-cancellable tasks that can only either complete successfully or fail with a NonCancellableError Mar 26, 2026
@SvartMetal SvartMetal force-pushed the users/svartmetal/issue-5588/1 branch from cd08321 to 1b5a223 Compare March 27, 2026 00:06
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 27, 2026

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo target: cloud/filestore/ (test time: 7103s): all tests PASSED for commit 1b5a223.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
3558 3558 0 0 0 0 0

🟢 linux-x86_64-relwithdebinfo target: cloud/blockstore/ (test time: 11225s): all tests PASSED for commit 1b5a223.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
6401 6400 0 0 0 1 0

…r complete successfully or fail with a NonCancellableError
@SvartMetal SvartMetal force-pushed the users/svartmetal/issue-5588/1 branch from 1b5a223 to daf7340 Compare March 27, 2026 10:03
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 27, 2026

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo target: cloud/tasks/,cloud/storage/ (test time: 350s): all tests PASSED for commit daf7340.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
882 882 0 0 0 0 0

🟢 linux-x86_64-relwithdebinfo target: cloud/disk_manager/ (test time: 1951s): all tests PASSED for commit daf7340.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
1414 1414 0 0 0 0 0

🟢 linux-x86_64-relwithdebinfo target: cloud/filestore/ (test time: 7136s): all tests PASSED for commit daf7340.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
3558 3558 0 0 0 0 0

🔴 linux-x86_64-relwithdebinfo target: cloud/blockstore/ (test time: 11296s): some tests FAILED for commit daf7340.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
6401 6399 0 1 0 1 0

🟢 linux-x86_64-relwithdebinfo target: cloud/blockstore/ (test time: 36s): all tests PASSED for commit daf7340.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
2 2 0 0 0 0 0

@jkuradobery jkuradobery self-requested a review March 27, 2026 12:02
@SvartMetal SvartMetal requested a review from gy2411 March 27, 2026 12:08
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 27, 2026

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo target: cloud/tasks/,cloud/storage/ (test time: 402s): all tests PASSED for commit f84230d.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
882 882 0 0 0 0 0

🟢 linux-x86_64-relwithdebinfo target: cloud/disk_manager/ (test time: 2052s): all tests PASSED for commit f84230d.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
1414 1414 0 0 0 0 0

🟢 linux-x86_64-relwithdebinfo target: cloud/filestore/ (test time: 7138s): all tests PASSED for commit f84230d.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
3558 3558 0 0 0 0 0

jkuradobery
jkuradobery previously approved these changes Mar 27, 2026
lastState := states[0]

// A non-cancellable task may transition to the cancelled state, as this
// simply skips the actual cancellation process
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IsCancellationRequested checks that status in not Cancelled -- so why not just

if lastState.NonCancellable && IsCancellationRequested(state.Status) {
    return ...
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed

// It is forbidden to start the cancellation process for a
// non-cancellable task
if IsCancellationRequested(state.Status) {
return TaskState{}, errors.NewNonRetriableErrorf(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems that will we account one and the same error twice for non-cancellable tasks.

E.g. if the task got a non-retriable error, we will account the error inself in onExecutionError https://github.com/ydb-platform/nbs/pull/5589/changes#diff-e17825d18ff2e356fdaf163881058e0cc95b82194123ea7d232049b8e80c2737R178 and also account this error here https://github.com/ydb-platform/nbs/blob/main/cloud/tasks/runner.go#L236. Is it ok?

Maybe we should return InterruptExecutionError here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is acceptable - more alerts in such situations are better.


GetTaskID() string

IsNonCancellable() bool
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe write a comment here https://github.com/ydb-platform/nbs/blob/main/cloud/tasks/execution_context.go#L295 that setError returns an error for non-cancellable tasks? Is doesn't seem obvious and requires diving deep into the storage code to understand it.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment

gy2411
gy2411 previously approved these changes Mar 27, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 27, 2026

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-relwithdebinfo target: cloud/tasks/,cloud/storage/ (test time: 362s): some tests FAILED for commit 896c5ee.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
882 881 0 1 0 0 0

🔴 linux-x86_64-relwithdebinfo target: cloud/tasks/,cloud/storage/ (test time: 9s): some tests FAILED for commit 896c5ee.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
2 1 0 1 0 0 0

🔴 linux-x86_64-relwithdebinfo target: cloud/tasks/,cloud/storage/ (test time: 9s): some tests FAILED for commit 896c5ee.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
2 1 0 1 0 0 0

Copy link
Copy Markdown
Collaborator

@jkuradobery jkuradobery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 27, 2026

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo target: cloud/tasks/,cloud/storage/ (test time: 468s): all tests PASSED for commit bbd126a.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
882 882 0 0 0 0 0

🟢 linux-x86_64-relwithdebinfo target: cloud/disk_manager/ (test time: 2111s): all tests PASSED for commit bbd126a.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
1414 1414 0 0 0 0 0

🟢 linux-x86_64-relwithdebinfo target: cloud/filestore/ (test time: 7159s): all tests PASSED for commit bbd126a.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
3558 3558 0 0 0 0 0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

large-tests Launch large tests for PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants