Update Windows AMI to AMI Windows 2019 GHA CI - 20260325213408 #412

Draft
atalman wants to merge 1 commit into main from atalman-win-20260325

Conversation

@atalman (Contributor) commented Mar 26, 2026

Keeping it as draft for canary testing

@atalman atalman marked this pull request as draft March 26, 2026 18:32
github-merge-queue bot pushed a commit that referenced this pull request Apr 13, 2026
…sive drift, and tight timeouts (#435)

**Impact:** All OSDC clusters — runner job scheduling, node compactor
availability, Karpenter drift behavior
**Risk:** low

## What
Fixes three independent issues that converged to cause recurring pod
scheduling failures on `arc-cbr-production` (4 incidents in 4 days, ~95
pending jobs, trunk red). Adds alerting so the compactor going offline
is detected before it causes capacity loss.

## Why
Investigation of
[#1084](meta-pytorch/pytorch-gha-infra#1084)
identified a cascading failure chain:

1. **Broken node-compactor** — lightkube's `Client()` reads the
projected SA token once at construction and caches it forever. After EKS
rotated OIDC signing keys (~12 days into the pod's life), every API call
returned 401 Unauthorized. The compactor's burst-absorption mechanism
(untainting nodes when Pending pods accumulate) was silently offline for
days.
2. **Unbounded Karpenter drift replacement** — A disk size change across
all 20 NodePools triggered simultaneous `NodeClassDrift` on every node.
With the disruption budget effectively at 100%, Karpenter could cordon
and replace all nodes of a given type at once, leaving zero schedulable
capacity during demand bursts.
3. **Timeout too short for cold nodes** — Fresh nodes require EC2 launch
(1-3 min) + git-cache sync (~112s) + cold CUDA image pull (5-15 min for
images up to 26.8 GB). Total time-to-ready (18-20 min) exceeded the
15-minute `ACTIONS_RUNNER_PREPARE_JOB_TIMEOUT_SECONDS`, causing pods to
hit backoff timeout and fail.
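The worst-case arithmetic in item 3 can be sanity-checked directly (illustrative values taken from the figures above):

```python
# Worst-case time-to-ready for a fresh (cold) node, per the numbers above.
ec2_launch_s = 3 * 60        # EC2 launch: up to ~3 min
git_cache_sync_s = 112       # git-cache sync: ~112 s
cuda_image_pull_s = 15 * 60  # cold CUDA image pull: up to ~15 min

total_s = ec2_launch_s + git_cache_sync_s + cuda_image_pull_s
old_timeout_s = 900          # ACTIONS_RUNNER_PREPARE_JOB_TIMEOUT_SECONDS (15 min)

print(total_s / 60)               # ~19.9 min, comfortably past the old timeout
print(total_s > old_timeout_s)    # True
```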

## How
- Catch 401 at the reconciliation loop level and recreate the `Client()`
to pick up the rotated SA token — matches lightkube's own `ExecAuth`
retry pattern
- Cap Karpenter disruption budget at 20% so at least 80% of nodes remain
schedulable during drift replacement
- Increase runner prepare-job timeout from 15 min to 25 min to cover
worst-case cold-node startup
- Add a PrometheusRule alert that fires after 15 minutes of continuous
compactor reconciliation errors, so silent failures are detected early

## Changes

**Node compactor — token rotation fix**
- `compactor.py`: Add `ApiError` catch before the generic `Exception`
handler in the main loop; on 401, log a warning and recreate the
`Client()`; on other API errors, log and continue
- `test_compactor.py`: Add `test_main_recreates_client_on_401` verifying
the client is reconstructed and the next reconciliation uses the fresh
client
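The recreate-on-401 pattern can be sketched without a cluster. The stub below stands in for lightkube's `Client()` (which caches the projected SA token at construction); `Unauthorized`, `StubClient`, and `list_pending_pods` are hypothetical names for illustration only:

```python
class Unauthorized(Exception):
    """Stands in for lightkube's ApiError with status.code == 401."""

class StubClient:
    """Like lightkube's Client(): reads the SA token once and caches it."""
    def __init__(self, token_source):
        self.token = token_source["token"]   # read at construction, never refreshed

    def list_pending_pods(self, token_source):
        if self.token != token_source["token"]:
            raise Unauthorized("cached token no longer valid")
        return []

token_source = {"token": "old"}
client = StubClient(token_source)
token_source["token"] = "new"        # EKS rotates the OIDC signing key

events = []
for _ in range(2):                   # two reconciliation cycles
    try:
        client.list_pending_pods(token_source)
        events.append("ok")
    except Unauthorized:
        # 401: the token rotated under us. Recreate the client so it
        # re-reads the projected token; the next cycle succeeds.
        events.append("401 -> recreate client")
        client = StubClient(token_source)

print(events)  # ['401 -> recreate client', 'ok']
```

The key design point mirrors the fix: the 401 is handled at the loop level (one recreation, then continue), rather than retried inside the call site.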

**Karpenter disruption budget**
- `clusters.yaml`: Add `gpu_disruption_budget: "20%"` and
`cpu_disruption_budget: "20%"` to defaults (previous effective default
was `100%` from the deploy.sh fallback)
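On the Karpenter side, a 20% budget renders into the NodePool roughly as below (a sketch assuming Karpenter's v1 `NodePool` disruption stanza; the pool name is hypothetical):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: example-gpu
spec:
  disruption:
    # At most 20% of this pool's nodes may be disrupted (cordoned and
    # replaced) at once, so >=80% stay schedulable during a drift rollout.
    budgets:
      - nodes: "20%"
```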

**Runner timeout**
- `runner.yaml.tpl`: Increase
`ACTIONS_RUNNER_PREPARE_JOB_TIMEOUT_SECONDS` from `900` (15 min) to
`1500` (25 min)

**Compactor health alerting**
- `node-compactor-alerts.yaml`: New PrometheusRule —
`NodeCompactorReconcileErrors` fires at `severity: critical` when
`rate(node_compactor_reconcile_cycles_total{status="error"}[5m]) > 0`
persists for 15 minutes
- `kustomization.yaml`: Register the new alert resource
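Assuming the standard prometheus-operator CRD shape, the alert described above would look roughly like this sketch (metadata names abbreviated):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-compactor-alerts
spec:
  groups:
    - name: node-compactor
      rules:
        - alert: NodeCompactorReconcileErrors
          # Fires only after errors persist for 15 minutes, filtering out
          # transient API blips while still catching a silently dead compactor.
          expr: rate(node_compactor_reconcile_cycles_total{status="error"}[5m]) > 0
          for: 15m
          labels:
            severity: critical
```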

## Notes
- The 401 fix is a workaround for lightkube's token caching behavior,
not a permanent solution. A proper `ServiceAccountAuth` class that
re-reads the token file proactively (before expiry) would be more robust
but requires upstream changes or a custom auth wrapper.
- The disruption budget change applies to all clusters via defaults.
Staging already inherits defaults and does not override disruption
budgets.
- The investigation also identified stale manual refresh taints (13
days, 7 nodes) as a contributing factor — that requires an operational
`just untaint-nodes` run, not a code change.

## Testing
```
 $  just smoke arc-staging
Updating kubeconfig for pytorch-arc-staging (us-west-1)...
Updated context pytorch-arc-staging in /Users/jschmidt/.kube/config
Running smoke tests for cluster: arc-staging
Test directories:
  - base/helm/harbor/tests/smoke
  - base/kubernetes/git-cache/tests/smoke
  - base/kubernetes/image-cache-janitor/tests/smoke
  - base/kubernetes/tests/smoke
  - base/node-compactor/tests/smoke
  - modules/eks/tests/smoke
  - modules/karpenter/tests/smoke
  - modules/arc/tests/smoke
  - modules/nodepools/tests/smoke
  - modules/arc-runners/tests/smoke
  - modules/buildkit/tests/smoke
  - modules/pypi-cache/tests/smoke
  - modules/cache-enforcer/tests/smoke
  - modules/monitoring/tests/smoke
  - modules/logging/tests/smoke

================================================================================================================================== test session starts ==================================================================================================================================
platform darwin -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0
rootdir: /Users/jschmidt/meta/ciforge/osdc/upstream/osdc
configfile: pyproject.toml
plugins: anyio-4.12.1, xdist-3.8.0, cov-7.0.0
16 workers [195 items]
................................................................................................................................................................s..................................                                                                               [100%]
================================================================================================================================ short test summary info ================================================================================================================================
SKIPPED [1] modules/monitoring/tests/smoke/test_monitoring.py:173: No dcgm-exporter pods found (no GPU nodes)
====================================================================================================================== 194 passed, 1 skipped in 105.52s (0:01:45) =======================================================================================================================

Smoke tests completed in 1m46s
```
```
 $  just integration-test arc-staging
Updating kubeconfig for pytorch-arc-staging (us-west-1)...
Updated context pytorch-arc-staging in /Users/jschmidt/.kube/config
20:50:04 [INFO] Integration test for cluster: arc-staging (pytorch-arc-staging)
20:50:04 [INFO]   Runner prefix: 'c-mt-'
20:50:04 [INFO]   B200 enabled: False
20:50:04 [INFO]   Release runners: True
20:50:04 [INFO]   Cache enforcer: True
20:50:04 [INFO]   PyPI cache slugs: cpu cu126 cu128 cu130
20:50:04 [INFO]   Smoke tests: skip
20:50:04 [INFO]   Compactor tests: skip
20:50:04 [INFO]   Branch: osdc-integration-test-arc-staging
20:50:04 [INFO] Phase 0: Cleaning up stale PRs...
20:50:07 [INFO] Phase 1: Checking for active runner pods (arc-staging only)...
20:50:10 [INFO]   No runner pods active. Skipping pool clear.
20:50:11 [INFO]   Canary repo already cloned at /Users/jschmidt/meta/ciforge/osdc/upstream/osdc/.scratch/pytorch-canary, fetching...
20:50:12 [INFO] Phase 2: Preparing PR...
20:50:18 [INFO]   PR #412 created: pytorch/pytorch-canary#412
20:50:18 [INFO] Phase 3: Running parallel validation...
20:50:18 [INFO] Phase 4: Waiting for PR workflow runs (timeout: 50 min, buffer: 10 min)...
20:50:18 [INFO]   Filtering to runs created after 2026-04-13T03:50:12.346241+00:00
20:50:20 [INFO]   No runs found yet, waiting...
20:50:52 [INFO]   Run: OSDC Integration Test — https://github.com/pytorch/pytorch-canary/actions/runs/24324789847
20:50:52 [INFO]   1/1 runs still in progress...
20:51:25 [INFO]   1/1 runs still in progress...
20:51:58 [INFO]   1/1 runs still in progress...
20:52:29 [INFO]   1/1 runs still in progress...
20:53:02 [INFO]   1/1 runs still in progress...
20:53:33 [INFO]   1/1 runs still in progress...
20:54:06 [INFO]   1/1 runs still in progress...
20:54:37 [INFO]   1/1 runs still in progress...
20:55:09 [INFO]   1/1 runs still in progress...
20:55:40 [INFO]   1/1 runs still in progress...
20:56:13 [INFO]   All 1 run(s) completed.


============================================================
  OSDC Integration Test Results
============================================================
  Cluster: arc-staging (pytorch-arc-staging)
  Date:    2026-04-13 03:56 UTC

  PR Workflow Jobs:
    ✓ test-pypi-cache-action-cuda    success
    ✓ test-pypi-cache-action-cpu     success
    ✓ test-cpu-x86-avx512            success
    ✓ test-cpu-arm64                 success
    ✓ test-git-cache                 success
    ✓ test-cpu-x86-amx               success
    ✓ test-pypi-cache-defaults       success
    ✓ test-gpu-t4                    success
    ✓ test-gpu-t4-multi              success
    ✓ test-harbor                    success
    ✓ test-release-arm64             success
    ✓ test-cache-enforcer            success
    ✓ build-arm64 / build            success
    ✓ build-amd64 / build            success

  Smoke            ⊘ SKIPPED
  Compactor        ⊘ SKIPPED

  Overall: PASSED
============================================================

20:56:15 [INFO] Phase 5: Closing PR #412...
20:56:17 [INFO] Total integration test time: 6m13s
```

Signed-off-by: Jean Schmidt <contato@jschmidt.me>