Stabilize TaskNodesDieAfterBuild: bump WaitForExit timeout + add process-identity telemetry#13828
Conversation
The TaskHost child process is expected to terminate shortly after the build completes, but on slow CI agents (Helix Linux/macOS in particular) the existing 3-second budget has proven insufficient and produces flaky test failures (~3 distinct branches in the past week). Bump the wait to 15 seconds and include the captured pid + HasExited state in the assertion message so future failures are easier to diagnose. Fixes #43 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reviewer (expert-reviewer on PR #44) noted that a bare WaitForExit timeout bump papers over a PID-reuse race: between build-end and GetProcessById, the OS may have recycled the pid to an unrelated process. We capture ProcessName and StartTime up front so a future failure shows the identity rather than looking like the task host hung. The Stopwatch around WaitForExit is mandated by the flaky-test-survey skill workflow: every timeout bump must log actual elapsed ms so a follow-up PR can tune the timeout back down to a tight-but-safe value. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Stabilizes the TaskNodesDieAfterBuild unit test by increasing the post-build task host exit wait budget and adding diagnostic output intended to disambiguate slow-exit vs PID-reuse scenarios on CI.
Changes:
- Increased
WaitForExittimeout from 3s to 15s inTaskNodesDieAfterBuild. - Captured process identity fields (name/start time) best-effort and logged elapsed wait time via
Stopwatch. - Added
SafeGetProcessFieldhelper to prevent diagnostic reads from failing the test.
There was a problem hiding this comment.
Review Summary — 24 Dimensions Evaluated
✅ 24/24 dimensions clean — no findings.
This is a well-crafted test improvement that:
- Correctly addresses the flaky test issue by increasing the timeout to a generous 15s (5× the observed maximum)
- Adds excellent diagnostic information (process name, start time, elapsed ms) that will help distinguish PID-reuse races from genuine hangs
- Handles edge cases gracefully via
SafeGetProcessField(process may exit before diagnostics are captured) - Documents the reasoning clearly with "why" comments and a TELEMETRY note for future tuning
- Maintains the same core assertion (task host must exit after build)
Key dimensions verified clean:
| Dimension | Result |
|---|---|
| Concurrency & Thread Safety | ✅ Single-threaded test, inter-process race acknowledged and handled |
| Correctness & Edge Cases | ✅ HasExited access in assertion message is safe (handle valid after WaitForExit) |
| Resource & Memory Management | ✅ No new resource issues (Process not disposed is pre-existing) |
| Test Coverage & Completeness | ✅ Assertion preserved, diagnostics improved, flakiness addressed |
| Scope & PR Discipline | ✅ Single concern: fix flaky test + improve diagnostics |
| Documentation Accuracy | ✅ Comments explain rationale, not just mechanics |
Note
🔒 Integrity filter blocked 2 items
The following items were blocked because they don't meet the GitHub integrity level.
- #13828
pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved". - #13828
pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
To allow these resources, lower min-integrity in your GitHub frontmatter:
tools:
github:
min-integrity: approved # merged | approved | unapproved | noneGenerated by Expert Code Review (on open) for issue #13828 · ● 4M
|
Thanks — reviewer flagged 24/24 dimensions clean with no findings, so nothing to address here. This PR is test-only ( |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
9df0ded to
9b0b907
Compare
|
24h post-merge telemetry from
The 15 000 ms budget gave ~80x headroom on the worst observed (191 ms). Zero failures. Follow-up to honor the "shrink the budget back" promise from the PR body: draft #13845 lowers it to 2 000 ms (~10x of worst observed, comfortably above the 3 000 ms that originally flaked). |
The
TaskNodesDieAfterBuildtheory insrc/Build.UnitTests/BackEnd/TaskHostFactory_Tests.csflakes intermittently on slow CI agents because the post-buildWaitForExit(3000)budget is too tight for the task host child process to drain stdio and exit.Change
WaitForExitbudget bumped from 3 s -> 15 s.ProcessNameandStartTimeare captured up front so the failure message reveals a PID-reuse race (where the OS recycled the pid to an unrelated process between build-end andGetProcessById) instead of looking like the task host hung.elapsedMsis logged viaStopwatchso a follow-up PR can shrink the budget back to a tight-but-safe value once CI data accumulates.Risk
Test-only change. No production code touched.
Repro signal
The flaky version was observed on multiple distinct branches in dnceng-public pipeline 75 over a 7-day window. The fix has been validated locally (
dotnet test ... --filter-method *TaskNodesDieAfterBuild*-> 4/4 passing).