Update Windows AMI to AMI Windows 2019 GHA CI - 20260325213408 #412
Draft
Conversation
zxiiro approved these changes Mar 26, 2026
github-merge-queue bot pushed a commit that referenced this pull request Apr 13, 2026
…sive drift, and tight timeouts (#435)

**Impact:** All OSDC clusters — runner job scheduling, node compactor availability, Karpenter drift behavior
**Risk:** low

## What

Fixes three independent issues that converged to cause recurring pod scheduling failures on `arc-cbr-production` (4 incidents in 4 days, ~95 pending jobs, trunk red). Adds alerting so the compactor going offline is detected before it causes capacity loss.

## Why

Investigation of [#1084](meta-pytorch/pytorch-gha-infra#1084) identified a cascading failure chain:

1. **Broken node-compactor** — lightkube's `Client()` reads the projected SA token once at construction and caches it forever. After EKS rotated the OIDC signing keys (~12 days into the pod's life), every API call returned 401 Unauthorized. The compactor's burst-absorption mechanism (untainting nodes when Pending pods accumulate) was silently offline for days.
2. **Unbounded Karpenter drift replacement** — A disk-size change across all 20 NodePools triggered simultaneous `NodeClassDrift` on every node. With the disruption budget effectively at 100%, Karpenter could cordon and replace all nodes of a given type at once, leaving zero schedulable capacity during demand bursts.
3. **Timeout too short for cold nodes** — Fresh nodes require EC2 launch (1-3 min) + git-cache sync (~112 s) + cold CUDA image pull (5-15 min for images up to 26.8 GB). Total time-to-ready (18-20 min) exceeded the 15-minute `ACTIONS_RUNNER_PREPARE_JOB_TIMEOUT_SECONDS`, so pods hit the backoff timeout and failed.
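The token-caching failure in (1) and the client-recreation recovery can be sketched with a stand-in client. All names here (`FakeClient`, `reconcile_loop`, `VALID_TOKEN`) are hypothetical illustrations, not the real code — the actual loop lives in `compactor.py` and uses lightkube's `Client()`:

```python
class ApiError(Exception):
    """Stand-in for lightkube's ApiError, carrying an HTTP status code."""
    def __init__(self, code):
        super().__init__(f"HTTP {code}")
        self.code = code

VALID_TOKEN = "token-v2"  # the token on disk after EKS rotates signing keys

class FakeClient:
    """Mimics the failure mode: reads the SA token once and caches it forever."""
    def __init__(self, token_on_disk):
        self._token = token_on_disk  # cached at construction; never re-read
    def list_nodes(self):
        if self._token != VALID_TOKEN:
            raise ApiError(401)  # stale cached token -> Unauthorized
        return ["node-a", "node-b"]

def reconcile_loop(client, cycles):
    """On 401, recreate the client so it re-reads the rotated token from disk."""
    results = []
    for _ in range(cycles):
        try:
            results.append(client.list_nodes())
        except ApiError as e:
            if e.code == 401:
                # The fix: a fresh client picks up the rotated SA token.
                client = FakeClient(VALID_TOKEN)
                results.append("recreated-client")
            else:
                results.append(f"api-error-{e.code}")
    return results

# A client built before rotation holds a stale token; the loop recovers
# on the next cycle instead of failing every call forever.
print(reconcile_loop(FakeClient("token-v1"), 3))
# → ['recreated-client', ['node-a', 'node-b'], ['node-a', 'node-b']]
```

Without the `ApiError` catch, every cycle after rotation raises 401 and the compactor stays silently offline — exactly the observed incident.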
## How

- Catch 401 at the reconciliation-loop level and recreate the `Client()` to pick up the rotated SA token — matches lightkube's own `ExecAuth` retry pattern
- Cap the Karpenter disruption budget at 20% so at least 80% of nodes remain schedulable during drift replacement
- Increase the runner prepare-job timeout from 15 min to 25 min to cover worst-case cold-node startup
- Add a PrometheusRule alert that fires after 15 minutes of continuous compactor reconciliation errors, so silent failures are detected early

## Changes

**Node compactor — token rotation fix**

- `compactor.py`: Add an `ApiError` catch before the generic `Exception` handler in the main loop; on 401, log a warning and recreate the `Client()`; on other API errors, log and continue
- `test_compactor.py`: Add `test_main_recreates_client_on_401` verifying the client is reconstructed and the next reconciliation uses the fresh client

**Karpenter disruption budget**

- `clusters.yaml`: Add `gpu_disruption_budget: "20%"` and `cpu_disruption_budget: "20%"` to defaults (the previous effective default was `100%` from the deploy.sh fallback)

**Runner timeout**

- `runner.yaml.tpl`: Increase `ACTIONS_RUNNER_PREPARE_JOB_TIMEOUT_SECONDS` from `900` (15 min) to `1500` (25 min)

**Compactor health alerting**

- `node-compactor-alerts.yaml`: New PrometheusRule — `NodeCompactorReconcileErrors` fires at `severity: critical` when `rate(node_compactor_reconcile_cycles_total{status="error"}[5m]) > 0` persists for 15 minutes
- `kustomization.yaml`: Register the new alert resource

## Notes

- The 401 fix is a workaround for lightkube's token-caching behavior, not a permanent solution. A proper `ServiceAccountAuth` class that re-reads the token file proactively (before expiry) would be more robust but requires upstream changes or a custom auth wrapper.
- The disruption budget change applies to all clusters via defaults. Staging already inherits defaults and does not override disruption budgets.
- The investigation also identified stale manual refresh taints (13 days, 7 nodes) as a contributing factor — that requires an operational `just untaint-nodes` run, not a code change.

## Testing

```
$ just smoke arc-staging
Updating kubeconfig for pytorch-arc-staging (us-west-1)...
Updated context pytorch-arc-staging in /Users/jschmidt/.kube/config
Running smoke tests for cluster: arc-staging
Test directories:
  - base/helm/harbor/tests/smoke
  - base/kubernetes/git-cache/tests/smoke
  - base/kubernetes/image-cache-janitor/tests/smoke
  - base/kubernetes/tests/smoke
  - base/node-compactor/tests/smoke
  - modules/eks/tests/smoke
  - modules/karpenter/tests/smoke
  - modules/arc/tests/smoke
  - modules/nodepools/tests/smoke
  - modules/arc-runners/tests/smoke
  - modules/buildkit/tests/smoke
  - modules/pypi-cache/tests/smoke
  - modules/cache-enforcer/tests/smoke
  - modules/monitoring/tests/smoke
  - modules/logging/tests/smoke
=========================== test session starts ===========================
platform darwin -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0
rootdir: /Users/jschmidt/meta/ciforge/osdc/upstream/osdc
configfile: pyproject.toml
plugins: anyio-4.12.1, xdist-3.8.0, cov-7.0.0
16 workers [195 items]
................................................................................................................................................................s..................................
[100%]
======================== short test summary info ========================
SKIPPED [1] modules/monitoring/tests/smoke/test_monitoring.py:173: No dcgm-exporter pods found (no GPU nodes)
=============== 194 passed, 1 skipped in 105.52s (0:01:45) ===============
Smoke tests completed in 1m46s
```

```
$ just integration-test arc-staging
Updating kubeconfig for pytorch-arc-staging (us-west-1)...
Updated context pytorch-arc-staging in /Users/jschmidt/.kube/config
20:50:04 [INFO] Integration test for cluster: arc-staging (pytorch-arc-staging)
20:50:04 [INFO] Runner prefix: 'c-mt-'
20:50:04 [INFO] B200 enabled: False
20:50:04 [INFO] Release runners: True
20:50:04 [INFO] Cache enforcer: True
20:50:04 [INFO] PyPI cache slugs: cpu cu126 cu128 cu130
20:50:04 [INFO] Smoke tests: skip
20:50:04 [INFO] Compactor tests: skip
20:50:04 [INFO] Branch: osdc-integration-test-arc-staging
20:50:04 [INFO] Phase 0: Cleaning up stale PRs...
20:50:07 [INFO] Phase 1: Checking for active runner pods (arc-staging only)...
20:50:10 [INFO] No runner pods active. Skipping pool clear.
20:50:11 [INFO] Canary repo already cloned at /Users/jschmidt/meta/ciforge/osdc/upstream/osdc/.scratch/pytorch-canary, fetching...
20:50:12 [INFO] Phase 2: Preparing PR...
20:50:18 [INFO] PR #412 created: pytorch/pytorch-canary#412
20:50:18 [INFO] Phase 3: Running parallel validation...
20:50:18 [INFO] Phase 4: Waiting for PR workflow runs (timeout: 50 min, buffer: 10 min)...
20:50:18 [INFO] Filtering to runs created after 2026-04-13T03:50:12.346241+00:00
20:50:20 [INFO] No runs found yet, waiting...
20:50:52 [INFO] Run: OSDC Integration Test — https://github.com/pytorch/pytorch-canary/actions/runs/24324789847
20:50:52 [INFO] 1/1 runs still in progress...
20:51:25 [INFO] 1/1 runs still in progress...
20:51:58 [INFO] 1/1 runs still in progress...
20:52:29 [INFO] 1/1 runs still in progress...
20:53:02 [INFO] 1/1 runs still in progress...
20:53:33 [INFO] 1/1 runs still in progress...
20:54:06 [INFO] 1/1 runs still in progress...
20:54:37 [INFO] 1/1 runs still in progress...
20:55:09 [INFO] 1/1 runs still in progress...
20:55:40 [INFO] 1/1 runs still in progress...
20:56:13 [INFO] All 1 run(s) completed.
============================================================
OSDC Integration Test Results
============================================================
Cluster: arc-staging (pytorch-arc-staging)
Date: 2026-04-13 03:56 UTC

PR Workflow Jobs:
  ✓ test-pypi-cache-action-cuda  success
  ✓ test-pypi-cache-action-cpu   success
  ✓ test-cpu-x86-avx512          success
  ✓ test-cpu-arm64               success
  ✓ test-git-cache               success
  ✓ test-cpu-x86-amx             success
  ✓ test-pypi-cache-defaults     success
  ✓ test-gpu-t4                  success
  ✓ test-gpu-t4-multi            success
  ✓ test-harbor                  success
  ✓ test-release-arm64           success
  ✓ test-cache-enforcer          success
  ✓ build-arm64 / build          success
  ✓ build-amd64 / build          success

Smoke     ⊘ SKIPPED
Compactor ⊘ SKIPPED

Overall: PASSED
============================================================
20:56:15 [INFO] Phase 5: Closing PR #412...
20:56:17 [INFO] Total integration test time: 6m13s
```

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
Keeping it as draft for canary testing