Skip to content

JobMonitor: Align timeout reporting with in-flight Helix cancellation counts#16940

Open
Copilot wants to merge 2 commits into
mainfrom
copilot/fix-count-mismatch-logs
Open

JobMonitor: Align timeout reporting with in-flight Helix cancellation counts#16940
Copilot wants to merge 2 commits into
mainfrom
copilot/fix-count-mismatch-logs

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Jun 2, 2026

Timeout logs were conflating two states: jobs still running on Helix vs jobs finished on Helix but not yet processed into AzDO. That made the reported “had not finished” count diverge from the later cancel-in-flight count and included superseded attempts.

  • ReportTimeout semantics now match cancellation semantics

    • ReportTimeout now evaluates only latest Helix attempts (GetLatestHelixJobAttempts(...)), excluding superseded resubmissions.
    • It computes two explicit sets:
      • unfinishedJobs: !IsCompleted (true Helix in-flight jobs)
      • unprocessedCompletedJobs: IsCompleted && !processedHelixJobs.Contains(JobName) (Helix finished, AzDO processing incomplete)
  • Timeout logging split into operator-meaningful categories

    • Error: unfinished Helix jobs (this count now aligns with cancel-in-flight intent)
    • Warning: completed-on-Helix but unprocessed-in-AzDO jobs
    • Existing critical “No unfinished Helix jobs...” path is preserved when both sets are empty.
  • Unit coverage updated for timeout classification behavior

    • Added focused tests for:
      • in-flight-only latest attempts
      • completed-but-unprocessed-only latest attempts
      • mixed state with a superseded older attempt
    • Assertions verify category separation and that superseded attempts are excluded from both lists.
IReadOnlyList<HelixJobInfo> latestAttempts = GetLatestHelixJobAttempts(latestAssociatedJobs);

var unfinishedJobs = latestAttempts
    .Where(j => !j.IsCompleted)
    .OrderBy(j => j.JobName, StringComparer.OrdinalIgnoreCase)
    .ToList();

var unprocessedCompletedJobs = latestAttempts
    .Where(j => j.IsCompleted && !processedHelixJobs.Contains(j.JobName))
    .OrderBy(j => j.JobName, StringComparer.OrdinalIgnoreCase)
    .ToList();

To double check:

Original prompt

Problem

In src/Microsoft.DotNet.Helix/JobMonitor/JobMonitorRunner.cs, when the Helix Job Monitor hits its overall timeout, the log message claims a count of "Helix job(s) had not finished" that does not agree with the count of jobs the runner then actually tries to cancel. For example:

fail: Helix Job Monitor timed out after 88 minute(s) (01:28:00). 8 Helix job(s) had not finished:
      - https://helix.dot.net/api/2019-06-17/jobs/145eac13-... (ubuntu.2204.amd64.open)
      - ... (7 more) ...
warn: Cancellation requested. Attempting to cancel 2 in-flight Helix job(s).

The user reasonably expects these two counts to agree (or, if they intentionally differ, for the log message to make that obvious).

Root cause

ReportTimeout and CancelInFlightHelixJobsAsync operate on the same associatedJobs dictionary but apply different filters:

ReportTimeout (currently the "N had not finished" message):

var unfinishedJobs = latestAssociatedJobs
    .Where(j => !j.IsCompleted || !processedHelixJobs.Contains(j.JobName))
    .OrderBy(j => j.JobName, StringComparer.OrdinalIgnoreCase)
    .ToList();
  • Uses || — a job is included if it's not completed OR not yet processed (i.e. test results not yet uploaded to AzDO).
  • Does not call GetLatestHelixJobAttempts, so superseded resubmitted job attempts are also included.

CancelInFlightHelixJobsAsync (the "Attempting to cancel N in-flight Helix job(s)" message):

List<HelixJobInfo> inFlightJobs =
[
    ..GetLatestHelixJobAttempts(associatedJobs)
        .Where(j => !j.IsCompleted && !processedHelixJobs.Contains(j.JobName))
        .OrderBy(j => j.JobName, StringComparer.OrdinalIgnoreCase)
];
  • Uses && — only jobs that are both still running on Helix AND not yet processed.
  • First strips superseded attempts via GetLatestHelixJobAttempts.

So the "unfinished" set reported by ReportTimeout includes (a) Helix jobs that already finished on the Helix side but whose test-result upload to AzDO hadn't completed when the timeout fired, and (b) superseded job attempts that were already replaced by a resubmission. Neither of those categories is actually still running on Helix, so they aren't included in the cancel call — which is correct, but the log message is misleading.

Required fix

Update ReportTimeout in src/Microsoft.DotNet.Helix/JobMonitor/JobMonitorRunner.cs so its message is accurate and aligned with CancelInFlightHelixJobsAsync:

  1. Filter to latest attempts. Pipe latestAssociatedJobs through GetLatestHelixJobAttempts(...) first, so superseded resubmitted job attempts are not counted as unfinished.
  2. Separate "unfinished on Helix" from "completed but not yet processed". Compute two sets from the latest-attempt jobs:
    • unfinishedJobs: !j.IsCompleted (Helix-side not done). This must match what CancelInFlightHelixJobsAsync considers in-flight, so the two log lines agree.
    • unprocessedCompletedJobs: j.IsCompleted && !processedHelixJobs.Contains(j.JobName) (Helix-side done, but test results / processing for AzDO didn't finish before the timeout).
  3. Log them as distinct categories so operators can tell at a glance which jobs are truly still consuming Helix capacity vs. which jobs simply had their AzDO upload cut short:
    • If unfinishedJobs.Count > 0, log an error listing those jobs (this is the line that should match the "Attempting to cancel N in-flight Helix job(s)" count).
    • If unprocessedCompletedJobs.Count > 0, log a warning listing those jobs with wording that makes it clear they finished on Helix but their results/processing in AzDO did not complete before the timeout.
    • If both are zero, keep the existing LogCritical("...No unfinished Helix jobs were tracked at the time of timeout.") message.

Preserve the existing log format conventions used elsewhere in this file:

  • Continue including TimeoutMinutes (timeout.TotalMinutes) and Timeout (the TimeSpan) values.
  • Continue formatting each job line as {DetailsUri} ({QueueId}), joined with Environment.NewLine + "- ".
  • Continue ordering jobs by JobName with StringComparer.OrdinalIgnoreCase.

Tests

ReportTimeout is covered by unit tests under src/Microsoft.DotNet.Helix/JobMonitor/ (look in the corresponding tests/ directory in the repo, e.g. tests/Microsoft.DotNet.Helix.JobMonitor.Tests / a JobMonitorRunnerTests.cs file). Please:

  • Update any existing tests that assert on the old "N Helix job(s) had not finished" message so they still pass with the new wording / split.
  • Add new unit tests that exercise the three scenarios:
    1. Only truly in-flight (not completed) latest-attempt jobs at timeout → reported as unfinished; count matches what would be cancelled.
    2. Only "completed but unprocessed" latest-attempt jobs at timeout → reported in the new "completed but not processed" warning, and not counted as unfinished.
    3. A mix of (1) and (2) plus a supersede...

This pull request was created from Copilot chat.

Copilot AI self-assigned this Jun 2, 2026
Copilot AI review requested due to automatic review settings June 2, 2026 12:44
Copilot AI review requested due to automatic review settings June 2, 2026 12:44
Copilot AI requested review from Copilot and removed request for Copilot June 2, 2026 12:52
Copilot AI changed the title [WIP] Fix log message count mismatch for Helix job cancellations Align timeout reporting with in-flight Helix cancellation counts Jun 2, 2026
Copilot finished work on behalf of premun June 2, 2026 12:56
Copilot AI requested a review from premun June 2, 2026 12:56
@premun premun changed the title Align timeout reporting with in-flight Helix cancellation counts JobMonitor: Align timeout reporting with in-flight Helix cancellation counts Jun 2, 2026
@mmitche mmitche marked this pull request as ready for review June 2, 2026 15:20
Copilot AI review requested due to automatic review settings June 2, 2026 15:20
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates Helix Job Monitor timeout reporting so the “had not finished” count reflects only truly in-flight Helix jobs (and only the latest job attempts), aligning the timeout logs with the subsequent cancel-in-flight behavior and separating out “completed-on-Helix but not yet processed into AzDO” jobs.

Changes:

  • Update ReportTimeout to consider only latest Helix attempts and split timeout logging into (a) in-flight jobs (error) vs (b) completed-but-unprocessed jobs (warning).
  • Add unit tests covering in-flight-only, completed-unprocessed-only, and mixed-with-superseded-attempt timeout classification behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/Microsoft.DotNet.Helix/JobMonitor/JobMonitorRunner.cs Aligns timeout classification with cancel-in-flight semantics by filtering to latest attempts and splitting log categories.
src/Microsoft.DotNet.Helix/Sdk.Tests/Microsoft.DotNet.Helix.Sdk.Tests/JobMonitorRunnerTests.cs Adds focused tests validating the new timeout classification and superseded-attempt filtering behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants