JobMonitor: Align timeout reporting with in-flight Helix cancellation counts#16940
Open
Copilot wants to merge 2 commits into
Open
JobMonitor: Align timeout reporting with in-flight Helix cancellation counts#16940Copilot wants to merge 2 commits into
Copilot wants to merge 2 commits into
Conversation
Copilot
AI
changed the title
[WIP] Fix log message count mismatch for Helix job cancellations
Align timeout reporting with in-flight Helix cancellation counts
Jun 2, 2026
Contributor
There was a problem hiding this comment.
Pull request overview
This PR updates Helix Job Monitor timeout reporting so the “had not finished” count reflects only truly in-flight Helix jobs (and only the latest job attempts), aligning the timeout logs with the subsequent cancel-in-flight behavior and separating out “completed-on-Helix but not yet processed into AzDO” jobs.
Changes:
- Update
ReportTimeoutto consider only latest Helix attempts and split timeout logging into (a) in-flight jobs (error) vs (b) completed-but-unprocessed jobs (warning). - Add unit tests covering in-flight-only, completed-unprocessed-only, and mixed-with-superseded-attempt timeout classification behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/Microsoft.DotNet.Helix/JobMonitor/JobMonitorRunner.cs | Aligns timeout classification with cancel-in-flight semantics by filtering to latest attempts and splitting log categories. |
| src/Microsoft.DotNet.Helix/Sdk.Tests/Microsoft.DotNet.Helix.Sdk.Tests/JobMonitorRunnerTests.cs | Adds focused tests validating the new timeout classification and superseded-attempt filtering behavior. |
mmitche
approved these changes
Jun 2, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Timeout logs were conflating two states: jobs still running on Helix vs jobs finished on Helix but not yet processed into AzDO. That made the reported “had not finished” count diverge from the later cancel-in-flight count and included superseded attempts.
ReportTimeout semantics now match cancellation semantics
ReportTimeoutnow evaluates only latest Helix attempts (GetLatestHelixJobAttempts(...)), excluding superseded resubmissions.unfinishedJobs:!IsCompleted(true Helix in-flight jobs)unprocessedCompletedJobs:IsCompleted && !processedHelixJobs.Contains(JobName)(Helix finished, AzDO processing incomplete)Timeout logging split into operator-meaningful categories
Unit coverage updated for timeout classification behavior
To double check:
Original prompt
Problem
In
src/Microsoft.DotNet.Helix/JobMonitor/JobMonitorRunner.cs, when the Helix Job Monitor hits its overall timeout, the log message claims a count of "Helix job(s) had not finished" that does not agree with the count of jobs the runner then actually tries to cancel. For example:The user reasonably expects these two counts to agree (or, if they intentionally differ, for the log message to make that obvious).
Root cause
ReportTimeoutandCancelInFlightHelixJobsAsyncoperate on the sameassociatedJobsdictionary but apply different filters:ReportTimeout(currently the "N had not finished" message):||— a job is included if it's not completed OR not yet processed (i.e. test results not yet uploaded to AzDO).GetLatestHelixJobAttempts, so superseded resubmitted job attempts are also included.CancelInFlightHelixJobsAsync(the "Attempting to cancel N in-flight Helix job(s)" message):&&— only jobs that are both still running on Helix AND not yet processed.GetLatestHelixJobAttempts.So the "unfinished" set reported by
ReportTimeoutincludes (a) Helix jobs that already finished on the Helix side but whose test-result upload to AzDO hadn't completed when the timeout fired, and (b) superseded job attempts that were already replaced by a resubmission. Neither of those categories is actually still running on Helix, so they aren't included in the cancel call — which is correct, but the log message is misleading.Required fix
Update
ReportTimeoutinsrc/Microsoft.DotNet.Helix/JobMonitor/JobMonitorRunner.csso its message is accurate and aligned withCancelInFlightHelixJobsAsync:latestAssociatedJobsthroughGetLatestHelixJobAttempts(...)first, so superseded resubmitted job attempts are not counted as unfinished.unfinishedJobs:!j.IsCompleted(Helix-side not done). This must match whatCancelInFlightHelixJobsAsyncconsiders in-flight, so the two log lines agree.unprocessedCompletedJobs:j.IsCompleted && !processedHelixJobs.Contains(j.JobName)(Helix-side done, but test results / processing for AzDO didn't finish before the timeout).unfinishedJobs.Count > 0, log an error listing those jobs (this is the line that should match the "Attempting to cancel N in-flight Helix job(s)" count).unprocessedCompletedJobs.Count > 0, log a warning listing those jobs with wording that makes it clear they finished on Helix but their results/processing in AzDO did not complete before the timeout.LogCritical("...No unfinished Helix jobs were tracked at the time of timeout.")message.Preserve the existing log format conventions used elsewhere in this file:
TimeoutMinutes(timeout.TotalMinutes) andTimeout(theTimeSpan) values.{DetailsUri} ({QueueId}), joined withEnvironment.NewLine + "- ".JobNamewithStringComparer.OrdinalIgnoreCase.Tests
ReportTimeoutis covered by unit tests undersrc/Microsoft.DotNet.Helix/JobMonitor/(look in the correspondingtests/directory in the repo, e.g.tests/Microsoft.DotNet.Helix.JobMonitor.Tests/ aJobMonitorRunnerTests.csfile). Please:This pull request was created from Copilot chat.