Update Windows AMI to AMI Windows 2019 GHA CI - 20260325213408 #412

Draft
atalman wants to merge 1 commit into main from atalman-win-20260325

Conversation

@atalman (Contributor) commented Mar 26, 2026

Keeping it as draft for canary testing

@atalman atalman marked this pull request as draft March 26, 2026 18:32
github-merge-queue bot pushed a commit that referenced this pull request Apr 13, 2026
…sive drift, and tight timeouts (#435)

**Impact:** All OSDC clusters — runner job scheduling, node compactor
availability, Karpenter drift behavior
**Risk:** low

## What
Fixes three independent issues that converged to cause recurring pod
scheduling failures on `arc-cbr-production` (4 incidents in 4 days, ~95
pending jobs, trunk red). Adds alerting so the compactor going offline
is detected before it causes capacity loss.

## Why
Investigation of
[#1084](meta-pytorch/pytorch-gha-infra#1084)
identified a cascading failure chain:

1. **Broken node-compactor** — lightkube's `Client()` reads the
projected SA token once at construction and caches it forever. After EKS
rotated OIDC signing keys (~12 days into the pod's life), every API call
returned 401 Unauthorized. The compactor's burst-absorption mechanism
(untainting nodes when Pending pods accumulate) was silently offline for
days.
2. **Unbounded Karpenter drift replacement** — A disk size change across
all 20 NodePools triggered simultaneous `NodeClassDrift` on every node.
With the disruption budget effectively at 100%, Karpenter could cordon
and replace all nodes of a given type at once, leaving zero schedulable
capacity during demand bursts.
3. **Timeout too short for cold nodes** — Fresh nodes require EC2 launch
(1-3 min) + git-cache sync (~112s) + cold CUDA image pull (5-15 min for
images up to 26.8 GB). Total time-to-ready (18-20 min) exceeded the
15-minute `ACTIONS_RUNNER_PREPARE_JOB_TIMEOUT_SECONDS`, causing pods to
hit backoff timeout and fail.
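The worst-case arithmetic in item 3 can be sanity-checked directly (illustrative values taken from the figures above):

```python
# Worst-case time-to-ready for a fresh (cold) node, per the numbers above.
ec2_launch_s = 3 * 60        # EC2 launch: up to ~3 min
git_cache_sync_s = 112       # git-cache sync: ~112 s
cuda_image_pull_s = 15 * 60  # cold CUDA image pull: up to ~15 min

total_s = ec2_launch_s + git_cache_sync_s + cuda_image_pull_s
old_timeout_s = 900          # ACTIONS_RUNNER_PREPARE_JOB_TIMEOUT_SECONDS (15 min)

print(total_s / 60)               # ~19.9 min, comfortably past the old timeout
print(total_s > old_timeout_s)    # True
```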

## How
- Catch 401 at the reconciliation loop level and recreate the `Client()`
to pick up the rotated SA token — matches lightkube's own `ExecAuth`
retry pattern
- Cap Karpenter disruption budget at 20% so at least 80% of nodes remain
schedulable during drift replacement
- Increase runner prepare-job timeout from 15 min to 25 min to cover
worst-case cold-node startup
- Add a PrometheusRule alert that fires after 15 minutes of continuous
compactor reconciliation errors, so silent failures are detected early

## Changes

**Node compactor — token rotation fix**
- `compactor.py`: Add `ApiError` catch before the generic `Exception`
handler in the main loop; on 401, log a warning and recreate the
`Client()`; on other API errors, log and continue
- `test_compactor.py`: Add `test_main_recreates_client_on_401` verifying
the client is reconstructed and the next reconciliation uses the fresh
client
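The recreate-on-401 pattern can be sketched without a cluster. The stub below stands in for lightkube's `Client()` (which caches the projected SA token at construction); `Unauthorized`, `StubClient`, and `list_pending_pods` are hypothetical names for illustration only:

```python
class Unauthorized(Exception):
    """Stands in for lightkube's ApiError with status.code == 401."""

class StubClient:
    """Like lightkube's Client(): reads the SA token once and caches it."""
    def __init__(self, token_source):
        self.token = token_source["token"]   # read at construction, never refreshed

    def list_pending_pods(self, token_source):
        if self.token != token_source["token"]:
            raise Unauthorized("cached token no longer valid")
        return []

token_source = {"token": "old"}
client = StubClient(token_source)
token_source["token"] = "new"        # EKS rotates the OIDC signing key

events = []
for _ in range(2):                   # two reconciliation cycles
    try:
        client.list_pending_pods(token_source)
        events.append("ok")
    except Unauthorized:
        # 401: the token rotated under us. Recreate the client so it
        # re-reads the projected token; the next cycle succeeds.
        events.append("401 -> recreate client")
        client = StubClient(token_source)

print(events)  # ['401 -> recreate client', 'ok']
```

The key design point mirrors the fix: the 401 is handled at the loop level (one recreation, then continue), rather than retried inside the call site.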

**Karpenter disruption budget**
- `clusters.yaml`: Add `gpu_disruption_budget: "20%"` and
`cpu_disruption_budget: "20%"` to defaults (previous effective default
was `100%` from the deploy.sh fallback)
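On the Karpenter side, a 20% budget renders into the NodePool roughly as below (a sketch assuming Karpenter's v1 `NodePool` disruption stanza; the pool name is hypothetical):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: example-gpu
spec:
  disruption:
    # At most 20% of this pool's nodes may be disrupted (cordoned and
    # replaced) at once, so >=80% stay schedulable during a drift rollout.
    budgets:
      - nodes: "20%"
```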

**Runner timeout**
- `runner.yaml.tpl`: Increase
`ACTIONS_RUNNER_PREPARE_JOB_TIMEOUT_SECONDS` from `900` (15 min) to
`1500` (25 min)

**Compactor health alerting**
- `node-compactor-alerts.yaml`: New PrometheusRule —
`NodeCompactorReconcileErrors` fires at `severity: critical` when
`rate(node_compactor_reconcile_cycles_total{status="error"}[5m]) > 0`
persists for 15 minutes
- `kustomization.yaml`: Register the new alert resource
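Assuming the standard prometheus-operator CRD shape, the alert described above would look roughly like this sketch (metadata names abbreviated):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-compactor-alerts
spec:
  groups:
    - name: node-compactor
      rules:
        - alert: NodeCompactorReconcileErrors
          # Fires only after errors persist for 15 minutes, filtering out
          # transient API blips while still catching a silently dead compactor.
          expr: rate(node_compactor_reconcile_cycles_total{status="error"}[5m]) > 0
          for: 15m
          labels:
            severity: critical
```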

## Notes
- The 401 fix is a workaround for lightkube's token caching behavior,
not a permanent solution. A proper `ServiceAccountAuth` class that
re-reads the token file proactively (before expiry) would be more robust
but requires upstream changes or a custom auth wrapper.
- The disruption budget change applies to all clusters via defaults.
Staging already inherits defaults and does not override disruption
budgets.
- The investigation also identified stale manual refresh taints (13
days, 7 nodes) as a contributing factor — that requires an operational
`just untaint-nodes` run, not a code change.

## Testing
```
 $  just smoke arc-staging
Updating kubeconfig for pytorch-arc-staging (us-west-1)...
Updated context pytorch-arc-staging in /Users/jschmidt/.kube/config
Running smoke tests for cluster: arc-staging
Test directories:
  - base/helm/harbor/tests/smoke
  - base/kubernetes/git-cache/tests/smoke
  - base/kubernetes/image-cache-janitor/tests/smoke
  - base/kubernetes/tests/smoke
  - base/node-compactor/tests/smoke
  - modules/eks/tests/smoke
  - modules/karpenter/tests/smoke
  - modules/arc/tests/smoke
  - modules/nodepools/tests/smoke
  - modules/arc-runners/tests/smoke
  - modules/buildkit/tests/smoke
  - modules/pypi-cache/tests/smoke
  - modules/cache-enforcer/tests/smoke
  - modules/monitoring/tests/smoke
  - modules/logging/tests/smoke

================================================================================================================================== test session starts ==================================================================================================================================
platform darwin -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0
rootdir: /Users/jschmidt/meta/ciforge/osdc/upstream/osdc
configfile: pyproject.toml
plugins: anyio-4.12.1, xdist-3.8.0, cov-7.0.0
16 workers [195 items]
................................................................................................................................................................s..................................                                                                               [100%]
================================================================================================================================ short test summary info ================================================================================================================================
SKIPPED [1] modules/monitoring/tests/smoke/test_monitoring.py:173: No dcgm-exporter pods found (no GPU nodes)
====================================================================================================================== 194 passed, 1 skipped in 105.52s (0:01:45) =======================================================================================================================

Smoke tests completed in 1m46s
```
```
 $  just integration-test arc-staging
Updating kubeconfig for pytorch-arc-staging (us-west-1)...
Updated context pytorch-arc-staging in /Users/jschmidt/.kube/config
20:50:04 [INFO] Integration test for cluster: arc-staging (pytorch-arc-staging)
20:50:04 [INFO]   Runner prefix: 'c-mt-'
20:50:04 [INFO]   B200 enabled: False
20:50:04 [INFO]   Release runners: True
20:50:04 [INFO]   Cache enforcer: True
20:50:04 [INFO]   PyPI cache slugs: cpu cu126 cu128 cu130
20:50:04 [INFO]   Smoke tests: skip
20:50:04 [INFO]   Compactor tests: skip
20:50:04 [INFO]   Branch: osdc-integration-test-arc-staging
20:50:04 [INFO] Phase 0: Cleaning up stale PRs...
20:50:07 [INFO] Phase 1: Checking for active runner pods (arc-staging only)...
20:50:10 [INFO]   No runner pods active. Skipping pool clear.
20:50:11 [INFO]   Canary repo already cloned at /Users/jschmidt/meta/ciforge/osdc/upstream/osdc/.scratch/pytorch-canary, fetching...
20:50:12 [INFO] Phase 2: Preparing PR...
20:50:18 [INFO]   PR #412 created: pytorch/pytorch-canary#412
20:50:18 [INFO] Phase 3: Running parallel validation...
20:50:18 [INFO] Phase 4: Waiting for PR workflow runs (timeout: 50 min, buffer: 10 min)...
20:50:18 [INFO]   Filtering to runs created after 2026-04-13T03:50:12.346241+00:00
20:50:20 [INFO]   No runs found yet, waiting...
20:50:52 [INFO]   Run: OSDC Integration Test — https://github.com/pytorch/pytorch-canary/actions/runs/24324789847
20:50:52 [INFO]   1/1 runs still in progress...
20:51:25 [INFO]   1/1 runs still in progress...
20:51:58 [INFO]   1/1 runs still in progress...
20:52:29 [INFO]   1/1 runs still in progress...
20:53:02 [INFO]   1/1 runs still in progress...
20:53:33 [INFO]   1/1 runs still in progress...
20:54:06 [INFO]   1/1 runs still in progress...
20:54:37 [INFO]   1/1 runs still in progress...
20:55:09 [INFO]   1/1 runs still in progress...
20:55:40 [INFO]   1/1 runs still in progress...
20:56:13 [INFO]   All 1 run(s) completed.


============================================================
  OSDC Integration Test Results
============================================================
  Cluster: arc-staging (pytorch-arc-staging)
  Date:    2026-04-13 03:56 UTC

  PR Workflow Jobs:
    ✓ test-pypi-cache-action-cuda    success
    ✓ test-pypi-cache-action-cpu     success
    ✓ test-cpu-x86-avx512            success
    ✓ test-cpu-arm64                 success
    ✓ test-git-cache                 success
    ✓ test-cpu-x86-amx               success
    ✓ test-pypi-cache-defaults       success
    ✓ test-gpu-t4                    success
    ✓ test-gpu-t4-multi              success
    ✓ test-harbor                    success
    ✓ test-release-arm64             success
    ✓ test-cache-enforcer            success
    ✓ build-arm64 / build            success
    ✓ build-amd64 / build            success

  Smoke            ⊘ SKIPPED
  Compactor        ⊘ SKIPPED

  Overall: PASSED
============================================================

20:56:15 [INFO] Phase 5: Closing PR #412...
20:56:17 [INFO] Total integration test time: 6m13s
```

Signed-off-by: Jean Schmidt <contato@jschmidt.me>