Prevent sending interrupt signals by lost killer tasks by activity#192
Closed
agabrys wants to merge 1 commit intojenkinsci:masterfrom
Closed
Prevent sending interrupt signals by lost killer tasks by activity#192agabrys wants to merge 1 commit intojenkinsci:masterfrom
agabrys wants to merge 1 commit intojenkinsci:masterfrom
Conversation
Author
|
When the review is done and the code is accepted - I'm going to:
|
db22812 to
283c9cd
Compare
When many activity timeouts are run at the same time, sometimes the "Sending interrupt signal to process" message appears and the build is aborted (JENKINS-58752). The "Cancelling nested steps due to timeout" message is never printed. The code has been refactored to prevent such issues: - the implementation of the activity and absolute timeouts have been separated to improve the code readability - the tasks executed after a delay are always created in synchronized sections, to prevent losing tasks which should be canceled - the timer logic of the activity timeout is changed from always stopping to verifying if the logic should be stopped or continued, so the number of timers is always under control (less instances) - the `Tick` class is replaced by a listener which notifies the step less frequently about the changes. The behavior could be controlled by setting the `org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution.activityNotifyWaitRatio` property. It informs when the earliest the information about new activities should be sent to the timeout (`time * ratio`). When there were no activities in that time, then the next activity will be announced right after it has been reported There are additional changes introduced in this commit: - the timeout id is always printed in the log messages which improves debugging (e.g. when timeouts are nested) - the logic is more precise. The previous implementation allowed exceeding the timeout by 1/10 of the time (JENKINS-63696). The new stops the process right after the timeout is reached. There is a property to allow exceeding it a little bit - `org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution.activityPrecision`. It is necessary to not abort the logic due to delay in the notification process.
283c9cd to
c44488d
Compare
|
The colleague that opened this PR no longer is in the department. To keep this going and keep it up-to-date, I opened a new PR to which I have access and can implement change requests in: #231. This PR will therefore be closed. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The activity timeout quite often stops the builds on our servers. Base on the log messages, it shouldn't do it. I analyzed the plugin source code and I figured out that the problem is caused by "lost" killer tasks. When multiple threads operate on the same
TimeoutStepExecutionobject, it is possible that killers are not stopped when should be. I wanted to fix the current implementation, but it is very confusing, for example method interactions ("→" = calls):resetTimer→setupTimer→ sets timer with:cancel→resetTimer→setupTimer→cancelWe are testing the new implementation on our server and we haven't hit any issues yet.
The introduced changes also made the timer more precise. The current one allowed extending the time by 1/10 time or more.
I'm aware that for some time we will have to use a forked version, so instead of overwriting the class, I introduced a new one -
TimeoutStepExecutionThreadSafe. It is used instead of the original one when theorg.jenkinsci.plugins.workflow.steps.TimeoutStep.threadsafeis set totrue(falseby default). It should also make the first review cycle easier, when diff it simple and the original class is easily available to compare with the new one.I executed the
TimeoutStepTesttests with the new implementation and all tests finished successfully.I attached the built plugin, so you may test it - workflow-basic-steps.zip (941.vdfe1b_a_132c64 + fix) 🙂
git message:
When many activity timeouts are run at the same time, sometimes the "Sending interrupt signal to process" message appears and the build is aborted (JENKINS-58752). The "Cancelling nested steps due to timeout" message is never printed. The code has been refactored to prevent such issues:
Tickclass is replaced by a listener which notifies the step less frequently about the changes. The behavior could be controlled by setting theorg.jenkinsci.plugins.workflow.steps.TimeoutStepExecution.activityNotifyWaitRatioproperty. It informs when the earliest the information about new activities should be sent to the timeout (time * ratio). When there were no activities in that time, then the next activity will be announced right after it has been reportedThere are additional changes introduced in this commit:
org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution.activityPrecision. It is necessary to not abort the logic due to delay in the notification process.