issue-5588: [Tasks] support non-cancellable tasks that can only either complete successfully or fail with a NonCancellableError#5589
Conversation
c1b8e71 to
91d4b72
Compare
|
Note This is an automated comment that will be appended during run. 🟢 linux-x86_64-relwithdebinfo target: cloud/tasks/,cloud/storage/ (test time: 304s): all tests PASSED for commit 91d4b72.
🟢 linux-x86_64-relwithdebinfo target: cloud/disk_manager/ (test time: 1972s): all tests PASSED for commit 91d4b72.
🟢 linux-x86_64-relwithdebinfo target: cloud/filestore/ (test time: 7136s): all tests PASSED for commit 91d4b72.
🟢 linux-x86_64-relwithdebinfo target: cloud/blockstore/ (test time: 11263s): all tests PASSED for commit 91d4b72.
|
91d4b72 to
cd08321
Compare
cd08321 to
1b5a223
Compare
|
Note This is an automated comment that will be appended during run. 🟢 linux-x86_64-relwithdebinfo target: cloud/filestore/ (test time: 7103s): all tests PASSED for commit 1b5a223.
🟢 linux-x86_64-relwithdebinfo target: cloud/blockstore/ (test time: 11225s): all tests PASSED for commit 1b5a223.
|
…r complete successfully or fail with a NonCancellableError
1b5a223 to
daf7340
Compare
|
Note This is an automated comment that will be appended during run. 🟢 linux-x86_64-relwithdebinfo target: cloud/tasks/,cloud/storage/ (test time: 350s): all tests PASSED for commit daf7340.
🟢 linux-x86_64-relwithdebinfo target: cloud/disk_manager/ (test time: 1951s): all tests PASSED for commit daf7340.
🟢 linux-x86_64-relwithdebinfo target: cloud/filestore/ (test time: 7136s): all tests PASSED for commit daf7340.
🔴 linux-x86_64-relwithdebinfo target: cloud/blockstore/ (test time: 11296s): some tests FAILED for commit daf7340.
🟢 linux-x86_64-relwithdebinfo target: cloud/blockstore/ (test time: 36s): all tests PASSED for commit daf7340.
|
|
Note This is an automated comment that will be appended during run. 🟢 linux-x86_64-relwithdebinfo target: cloud/tasks/,cloud/storage/ (test time: 402s): all tests PASSED for commit f84230d.
🟢 linux-x86_64-relwithdebinfo target: cloud/disk_manager/ (test time: 2052s): all tests PASSED for commit f84230d.
🟢 linux-x86_64-relwithdebinfo target: cloud/filestore/ (test time: 7138s): all tests PASSED for commit f84230d.
|
| lastState := states[0] | ||
|
|
||
| // A non-cancellable task may transition to the cancelled state, as this | ||
| // simply skips the actual cancellation process |
There was a problem hiding this comment.
IsCancellationRequested checks that status in not Cancelled -- so why not just
if lastState.NonCancellable && IsCancellationRequested(state.Status) {
return ...
}
| // It is forbidden to start the cancellation process for a | ||
| // non-cancellable task | ||
| if IsCancellationRequested(state.Status) { | ||
| return TaskState{}, errors.NewNonRetriableErrorf( |
There was a problem hiding this comment.
Seems that will we account one and the same error twice for non-cancellable tasks.
E.g. if the task got a non-retriable error, we will account the error inself in onExecutionError https://github.com/ydb-platform/nbs/pull/5589/changes#diff-e17825d18ff2e356fdaf163881058e0cc95b82194123ea7d232049b8e80c2737R178 and also account this error here https://github.com/ydb-platform/nbs/blob/main/cloud/tasks/runner.go#L236. Is it ok?
Maybe we should return InterruptExecutionError here?
There was a problem hiding this comment.
I think this is acceptable - more alerts in such situations are better.
|
|
||
| GetTaskID() string | ||
|
|
||
| IsNonCancellable() bool |
There was a problem hiding this comment.
Maybe write a comment here https://github.com/ydb-platform/nbs/blob/main/cloud/tasks/execution_context.go#L295 that setError returns an error for non-cancellable tasks? Is doesn't seem obvious and requires diving deep into the storage code to understand it.
There was a problem hiding this comment.
Added a comment
|
Note This is an automated comment that will be appended during run. 🔴 linux-x86_64-relwithdebinfo target: cloud/tasks/,cloud/storage/ (test time: 362s): some tests FAILED for commit 896c5ee.
🔴 linux-x86_64-relwithdebinfo target: cloud/tasks/,cloud/storage/ (test time: 9s): some tests FAILED for commit 896c5ee.
🔴 linux-x86_64-relwithdebinfo target: cloud/tasks/,cloud/storage/ (test time: 9s): some tests FAILED for commit 896c5ee.
|
896c5ee to
bbd126a
Compare
|
Note This is an automated comment that will be appended during run. 🟢 linux-x86_64-relwithdebinfo target: cloud/tasks/,cloud/storage/ (test time: 468s): all tests PASSED for commit bbd126a.
🟢 linux-x86_64-relwithdebinfo target: cloud/disk_manager/ (test time: 2111s): all tests PASSED for commit bbd126a.
🟢 linux-x86_64-relwithdebinfo target: cloud/filestore/ (test time: 7159s): all tests PASSED for commit bbd126a.
|
Notes
Needed to mitigate the impact of bugs like #5511.
If a DeleteDisk task encounters a non-retriable error, it persists that error in its task state.
As a result, the corresponding Compute API operation becomes stuck, because it keeps retrying the same Disk Manager task and gets the same error every time.
Cleanup tasks should not be allowed to fail with retriable or non-retriable errors.
The only exception is a non-cancellable error, which means the task itself must not run and therefore must not be cancelled, to avoid further damaging the system.
In other words, cleanup tasks should always finish successfully, regardless of fatal errors returned by underlying services.
Such errors should still be reflected in metrics, but they should not cause the task to fail.
Whenever a cleanup task finishes unsuccessfully, this indicates an abnormal situation that likely requires manual intervention from the on-duty engineer
Issue
#5588