Skip to content

Add elapsed-time telemetry to RunProcessAndGetOutput test helper#13836

Closed
jankratochvilcz wants to merge 3 commits into
dotnet:mainfrom
jankratochvilcz:upstream-pr/nettaskhost-telemetry-v2
Closed

Add elapsed-time telemetry to RunProcessAndGetOutput test helper#13836
jankratochvilcz wants to merge 3 commits into
dotnet:mainfrom
jankratochvilcz:upstream-pr/nettaskhost-telemetry-v2

Conversation

@jankratochvilcz
Copy link
Copy Markdown
Contributor

Re-opening of #13829 to get a fresh license/cla check.

#13829 has been blocked for hours with mergeStateStatus=BLOCKED even though every msbuild-pr leg is green and the PR is approved (@AlesProkop). Root cause: the required license/cla check is missing on the head commit c0801793 — the Microsoft policy service reported on earlier commits in the PR but never on the merge commit. Same code, identical SHA pushed to a new branch to mint a fresh CLA check.

Below is the original description from #13829:


RunProcessAndGetOutput is the helper that almost every cross-process MSBuild test uses (net472/netcore task host launches, multi-proc node smoke tests, etc.). It already enforces a per-call timeout, but on failure we only see "process X is active for more than N sec" — no signal at all on the happy path about how close we ran to the budget.

This is exactly the kind of data we need to retune the three flaky ToolTask tests stabilized in the sibling PRs (#13828, #13830) — they pick their timeouts and slow-vs-fast gaps blind right now.

Changes

  • Capture sw.ElapsedMilliseconds immediately after WaitForExit() returns, before the post-exit drain loop, so the telemetry reports budget-relevant time only.
  • Log one line per process exit:
    • Normal mode: Process {pid} exited in {N}ms (budget {T}ms, {T-N}ms remaining)
    • DebugUnitTests mode (no enforced timeout): Process {pid} exited in {N}ms (DebugUnitTests: budget not enforced)
  • No behavioral change. Test-only file.

Risk

Test-helper-only changes. Single file: src/UnitTests.Shared/RunnerUtilities.cs. No product code touched.

Closes #13829.

Jan Krivanek and others added 3 commits May 21, 2026 11:43
Replaces the global default-timeout bump (reviewed against on PR #45) with
elapsed-time logging in the bootstrap-MSBuild test helper. The Stopwatch
around WaitForExit lets callers see actual ms taken vs the budget, so when
follow-up PRs add per-test timeout overrides (e.g. the unmerged
NetTaskHostTest_FallbackToDotnet tests tracked in #41/#42), the author can
pick a tight value based on data instead of guessing.

Rationale for not bumping the default: the helper is called from 53 test
sites across 11 files; a global bump would silently triple the worst-case
budget for tests that today don't even use 30s, masking perf regressions.
The two flaky tests cited in the original issues don't exist in main yet
(they're on a feature branch), so per-call scoping isn't possible today;
this PR makes the data available so it can be scoped correctly when they
land.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jankratochvilcz
Copy link
Copy Markdown
Contributor Author

@AlesProkop sorry for the re-ping — could I get a re-approval here? This is #13829 reopened verbatim (same head SHA, just on a new branch) because the required license/cla check was stuck pending for hours on the merge commit even though it had reported success on the earlier commits in the PR. Trying to unblock past what looks like a buggy CLA pipeline. The PR is your already-approved version with your wording nit applied ({N}ms remaining).

@jankratochvilcz jankratochvilcz enabled auto-merge (squash) May 21, 2026 16:56
@jankratochvilcz
Copy link
Copy Markdown
Contributor Author

Closing as duplicate — #13829 auto-merged 6 seconds before I opened this one (the auto-merge I'd set up earlier fired the moment license/cla finally reported on the head commit). The content is on main already (merge commit f095875). Sorry for the noise, @AlesProkop — no action needed.

auto-merge was automatically disabled May 21, 2026 16:57

Pull request was closed

@jankratochvilcz jankratochvilcz deleted the upstream-pr/nettaskhost-telemetry-v2 branch May 21, 2026 16:57
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

This is a straightforward, low-risk diagnostic improvement that adds timing output to RunProcessAndGetOutput. The change helps developers investigate timeout-related test failures by showing how much of the timeout budget was consumed. Good addition for debuggability.

No blocking issues found.

Minor observations (all non-blocking):

# Severity Issue
1 Low Stopwatch measures wait time, not total execution time from p.Start(). The log message "exited in Xms" could be slightly misleading.
2 Low sw.Stop() is functionally dead — the stopwatch value was already captured and is never read again.
3 Low Negative "remaining" is theoretically possible due to scheduling jitter after WaitForExit returns.
4 Nit "Telemetry" label in comment is a misnomer for diagnostic log output.

What's good:

  • Correctly captures elapsed before the post-exit drain WaitForExit() so the budget comparison is accurate.
  • Correctly handles the DebugUnitTests path (no misleading budget info when budget isn't enforced).
  • No behavioral change to existing logic — purely additive diagnostic output.
  • Scoped to test infrastructure only — zero risk to production code.

Note

🔒 Integrity filter blocked 2 items

The following items were blocked because they don't meet the GitHub integrity level.

  • #13836 pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #13836 pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

Generated by Expert Code Review (on open) for issue #13836 · ● 1.2M

p.StandardInput.Dispose();

TimeSpan timeout = TimeSpan.FromMilliseconds(timeoutMilliseconds);
Stopwatch sw = Stopwatch.StartNew();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (documentation gap): The stopwatch starts after p.Start() and async output reading has begun (lines 168–172). This means exitElapsedMs measures wait time (time spent in WaitForExit), not total process execution time. The process may have done significant work between Start() and this line.

This is arguably the right thing to measure for timeout-budget purposes (since it's the same duration WaitForExit(timeout) is checking), but the log message says "exited in {exitElapsedMs}ms" which could be misread as total execution time. Consider rewording to something like "Process {pid} waited {exitElapsedMs}ms for exit" or adding a brief inline comment noting it's measuring wait duration, not wall-clock execution time from process start.

pid = p.Id;
successfulExit = p.ExitCode == 0;

// Telemetry: log actual elapsed time so callers tuning their timeouts can see
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (naming): The comment says "Telemetry" but this is diagnostic logging via WriteOutput, not actual telemetry infrastructure (no counters, no structured events, no collection pipeline). Consider using "Diagnostics:" or just removing the label — the comment already explains the purpose clearly without it.

}
else
{
WriteOutput($"Process {pid} exited in {exitElapsedMs}ms (budget {timeoutMilliseconds}ms, {timeoutMilliseconds - exitElapsedMs}ms remaining)");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor robustness: timeoutMilliseconds - exitElapsedMs mixes int and long, producing a long result. In normal operation this should always be non-negative (since WaitForExit(timeout) would have thrown on timeout). However, there's a theoretical window where thread scheduling after WaitForExit returns could make exitElapsedMs slightly exceed timeoutMilliseconds. A negative "remaining" value would be confusing in the log.

Consider using Math.Max(0, timeoutMilliseconds - exitElapsedMs) or simply documenting that negative values indicate measurement jitter. This is very low severity — just a polish item.

// See https://docs.microsoft.com/en-us/dotnet/api/system.diagnostics.process.waitforexit?view=net-6.0#system-diagnostics-process-waitforexit(system-int32).
// The overload WaitForExit() waits for the error and output to be handled. The WaitForExit(int timeout) overload does not, so we could lose the data.
p.WaitForExit();
sw.Stop();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (dead code): sw.Stop() has no observable effect here — exitElapsedMs was already captured on line 190 and the stopwatch is never read again after this point. It could be removed to reduce noise, or if you prefer to keep it for clarity/hygiene that's fine too — just noting it's functionally inert.

@jankratochvilcz jankratochvilcz review requested due to automatic review settings May 21, 2026 17:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant