diff --git a/skills/cloud/gke-upgrades/EVAL.txtpb b/skills/cloud/gke-upgrades/EVAL.txtpb new file mode 100644 index 0000000000..0355f7e5e8 --- /dev/null +++ b/skills/cloud/gke-upgrades/EVAL.txtpb @@ -0,0 +1,115 @@ +suite_name: "gke-upgrades" + +cases { + name: "gke_upgrade_plan_standard" + prompt: "We have a Standard GKE cluster on the Regular release channel running 1.28 in us-central1-a. We need to upgrade to 1.30 before end of quarter. We have 3 node pools -- a general-purpose pool, a high-memory pool for our Postgres operator, and a GPU pool for ML inference. Can you put together an upgrade plan?" + expectations: "Recommends sequential upgrade path (1.28→1.29→1.30) rather than skipping" + expectations: "Specifies control plane upgrades before node pool upgrades" + expectations: "Provides different node pool upgrade ordering with rationale for the sequence" + expectations: "Includes specific surge upgrade settings (maxSurge/maxUnavailable) for each pool" + expectations: "Addresses Postgres-specific concerns (PDB, backup, operator compatibility, PV reclaim)" + expectations: "Addresses GPU-specific concerns (driver compatibility, workload coordination)" + expectations: "Includes actual gcloud commands for the upgrade steps" + expectations: "Provides a pre-upgrade checklist" + expectations: "Provides a post-upgrade checklist with validation commands" + expectations: "Includes rollback procedure or guidance" +} + +cases { + name: "gke_autopilot_checklists" + prompt: "I manage 4 Autopilot clusters -- 2 in dev on Rapid channel and 2 in prod on Stable. We've been getting notifications that our prod clusters will auto-upgrade from 1.29 to 1.30 next month. Can you give me pre and post upgrade checklists tailored to our setup?" 
+ expectations: "Distinguishes between dev (Rapid) and prod (Stable) clusters in the checklists" + expectations: "Recognizes Autopilot limitations (no node management, no SSH, mandatory resource requests)" + expectations: "Includes API deprecation checks with kubectl commands" + expectations: "Mentions mandatory resource requests as an Autopilot-specific concern" + expectations: "Pre-upgrade checklist includes workload readiness items (PDBs, bare pods, graceful shutdown)" + expectations: "Post-upgrade checklist includes version verification commands" + expectations: "Post-upgrade checklist includes workload health validation" + expectations: "Recommends testing on dev clusters before prod auto-upgrade" + expectations: "Includes monitoring/observability validation items" +} + +cases { + name: "gke_troubleshoot_stuck_upgrade" + prompt: "Our node pool upgrade has been stuck for 2 hours. It says 3 out of 12 nodes upgraded. We're on a Standard cluster in us-east1, running 1.29, upgrading nodes to 1.30. The pods on the remaining nodes aren't draining. What should we check and how do we fix this?" + expectations: "Identifies PDB as the most likely/common cause of stuck upgrades" + expectations: "Provides kubectl command to check PDBs (kubectl get pdb --all-namespaces or similar)" + expectations: "Provides commands to identify which pods are blocking the drain" + expectations: "Includes resource constraint diagnosis (capacity issues, pending pods)" + expectations: "Provides a fix for restrictive PDBs (patch or temporary adjustment)" + expectations: "Includes guidance on surge upgrade settings as a potential fix" + expectations: "Mentions webhook issues as a possible cause" + expectations: "Includes validation steps to confirm the fix is working" +} + +cases { + name: "gke_maintenance_exclusions" + prompt: "We have a critical business period coming up from Nov 15 to Jan 15. We are running a GKE Standard cluster on the Regular channel. 
How can we ensure GKE doesn't auto-upgrade our control plane or nodes during this time, but still allows emergency security patches?" + expectations: "Recommends using maintenance exclusions instead of 'No channel' (which is deprecated)." + expectations: "Explains the difference between 'No upgrades' and 'No minor or node upgrades' exclusions." + expectations: "Recommends 'No minor or node upgrades' (or equivalent) to allow patch upgrades while blocking minor/node upgrades." + expectations: "Provides the correct gcloud command to add a maintenance exclusion." + expectations: "Verifies the suggested gcloud command uses the correct separate flags for exclusions (e.g., --add-maintenance-exclusion-name, --add-maintenance-exclusion-start, etc.)." + expectations: "Explains that auto-upgrades respect exclusions but manual upgrades bypass them." + expectations: "Mentions the 90-day limit for 'No upgrades' and how to handle longer periods (e.g. persistent exclusions for minor upgrades until End of Support)." +} + +cases { + name: "gke_gpu_node_upgrade" + prompt: "We run large-scale ML training on GKE using H100 GPUs with fixed reservations. We need to upgrade our node pools. We cannot use surge upgrades because we don't have extra GPU quota, and we cannot live migrate GPU workloads. What is the recommended upgrade strategy and plan?" + expectations: "Explains that GPU VMs do not support live migration and upgrades force pod restart." + expectations: "Recommends using maxSurge=0 and maxUnavailable=1 (or failure domain-based batching) due to fixed GPU reservations." + expectations: "Explains why blue-green upgrade is not feasible (requires 2x GPU resources)." + expectations: "Recommends cordoning GPU nodes and waiting for active training jobs to checkpoint/complete before upgrading." + expectations: "Warns about GPU driver coupling with the GKE version and the need to test CUDA compatibility in staging." 
+ expectations: "Recommends using maintenance exclusions to block upgrades during active training campaigns." +} + +cases { + name: "gke_upgrade_quota_exhausted" + prompt: "Our GKE node pool upgrade is stuck. The status is 'updating' for several hours, but no new nodes are coming up. We checked the GCE logs and see errors like `ZONE_RESOURCE_POOL_EXHAUSTED` and `QUOTA_EXCEEDED` for `CPUS_ALL_REGIONS` in our zone. What does this mean and how do we resolve it?" + expectations: "Identifies that the upgrade requires temporary extra resources (surge nodes) which are failing to provision due to GCE quota or zone capacity limits" + expectations: "Suggests checking GCP Quota metrics for the target region/zone" + expectations: "Recommends modifying the node pool's surge configuration (e.g., lowering maxSurge to reduce concurrent resource footprint)" + expectations: "Suggests requesting a quota increase from Google Cloud" + expectations: "Suggests moving workloads or node pools to other zones with available capacity if feasible" +} + +cases { + name: "gke_mandatory_upgrade_override" + prompt: "We configured a maintenance exclusion window to block all upgrades during our peak sales event. However, GKE just upgraded our control plane anyway, causing some brief disruption. Why did GKE ignore our exclusion window? Is this a bug?" 
+ expectations: "Explains that GKE reserves the right to override user-defined maintenance policies for mandatory operations (critical security patches, EOL version enforcement, expiring CAs)" + expectations: "Suggests checking GKE release notes or security bulletins to correlate the upgrade with emergency patches" + expectations: "Explains that mandatory overrides cannot be disabled or blocked by exclusions" + expectations: "Advises designing workloads to be resilient to unexpected node/control plane rotation (multi-zone, replicas > 1, PDBs)" +} + +cases { + name: "gke_post_upgrade_gpu_regression" + prompt: "We upgraded our GPU node pool to a new GKE version. The upgrade completed successfully and the nodes are Ready, but our ML training pods are now stuck in `CrashLoopBackOff` showing `SIGSEGV` or driver initialization errors. What happened and how do we debug this?" + expectations: "Identifies that GKE node upgrades update the underlying node OS image, introducing new Linux Kernels and hardware drivers (NVIDIA GPU drivers)" + expectations: "Explains that ML workloads using CUDA are often tightly coupled to specific driver versions, and driver updates can break compatibility" + expectations: "Suggests comparing OS image, kernel version (uname -r), and driver versions between old and new nodes" + expectations: "Suggests deploying a test pod to verify GPU driver access directly on the new node" + expectations: "Recommends rolling back the node pool to the previous version as a quick mitigation" + expectations: "Advises checking CUDA version compatibility and updating workload dependencies before re-upgrading" +} + +cases { + name: "gke_stockout_cert_expiry" + prompt: "Our GKE cluster control plane certificate is expiring in 10 days, and we need to upgrade to rotate it. However, the upgrade is failing with resource stockout errors in our region. We cannot provision new nodes or upgrade the control plane. We are stuck and facing an outage if the cert expires. 
What should we do?" + expectations: "Recognizes that a stockout is a Google-side capacity issue and cannot be resolved by customer-side tuning alone" + expectations: "Recommends opening a P1/P2 Google Cloud Support case immediately, citing urgent certificate expiration" + expectations: "Suggests checking if the upgrade can be retried in a different zone/region" + expectations: "Suggests performing a control plane credential rotation as a short-term mitigation to renew certs without a GKE version upgrade" + expectations: "Suggests enabling the DNS-based control plane endpoint as a mitigation for client connectivity" +} + +cases { + name: "gke_upgrade_silent_pause" + prompt: "We started upgrading our GKE Standard cluster node pool during our weekly 4-hour maintenance window. The upgrade was only half done when the window closed, and now the status is just sitting there. There are no error logs, and the cluster is in a mixed-version state. Is the upgrade stuck?" + expectations: "Identifies that the upgrade was paused because the maintenance window closed before the rollout could complete" + expectations: "Explains that GKE pauses active rollouts when the window closes to prevent disruption outside allowed times, leaving the cluster in a stable mixed-version state" + expectations: "Explains that the upgrade will automatically resume when the next maintenance window opens" + expectations: "Suggests temporarily widening the maintenance window if the user wants to complete the upgrade immediately" +} diff --git a/skills/cloud/gke-upgrades/SKILL.md b/skills/cloud/gke-upgrades/SKILL.md new file mode 100644 index 0000000000..ad85714ca4 --- /dev/null +++ b/skills/cloud/gke-upgrades/SKILL.md @@ -0,0 +1,196 @@ +--- +name: gke-upgrades +description: >- + Plans, executes, and validates Google Kubernetes Engine (GKE) cluster upgrades + and maintenance operations for both Standard and Autopilot clusters. 
Produces + upgrade plans, pre/post-upgrade checklists, maintenance runbooks with gcloud + commands, release channel strategy, and troubleshooting guides. Handles node + pool upgrade strategies (surge, blue-green), version compatibility, PDB + management, and workload-specific concerns (stateful, GPU, operators). Use this + skill whenever the user mentions GKE upgrades, Kubernetes version bumps, node + pool maintenance, GKE patching, cluster version management, release channel + selection, maintenance windows, surge upgrades, stuck upgrades, or any GKE + lifecycle management task — even casual mentions like "we need to upgrade our + clusters" or "plan our next GKE maintenance" or "our upgrade is stuck." Don't + use for GKE cluster creation, application onboarding, general networking/routing + setup, or security policy configurations (use gke-basics or relevant GKE skills + instead). +--- + +# GKE Upgrades & Maintenance + +Produce clear, actionable documents — upgrade plans, runbooks, or checklists — tailored to the user's environment. Output should be specific to their cluster mode, release channel, version, and workload types rather than generic advice. + +Always frame guidance around the auto-upgrade model: auto-upgrade with maintenance windows and exclusions is the preferred control mechanism. + +## Context Gathering + +Before producing any upgrade artifact, establish: +- **Cluster mode** — Standard or Autopilot? (Autopilot has no node pool management, mandatory resource requests, no SSH) +- **Current and target versions** — Node version skew must be within 2 minor versions of control plane. +- **Release channel** — Rapid, Regular, Stable, or Extended. +- **Environment topology & Rollout Sequencing** — Single vs multi-cluster, dev/staging/prod tiers, and whether Rollout Sequencing is used. +- **Workload sensitivity** — StatefulSets, databases, GPU, long-running batch need special handling. + +If the user provides these upfront, skip straight to the deliverable. 
If they're vague, fill in reasonable defaults and flag assumptions. + +## Core Principles + +GKE versions follow Kubernetes version terminology: **Major.Minor.Patch** (e.g., 1.30.1-gke.1187000). A **Minor** version bump (e.g., 1.29 → 1.30) introduces new features and APIs. A **Patch** version bump (e.g., 1.30.1 → 1.30.2) introduces security and bug fixes. Ensure the user understands this distinction. + +1. **Sequential control plane, skip-level node pools** -- Control plane upgrades are sequential (N → N+1 → N+2). Node pools support skip-level (N+2) upgrades. +2. **Control plane first** -- Control plane must be upgraded before node pools. Nodes can trail by up to 2 minor versions. +3. **Environment progression** -- Always upgrade dev/staging before production. Use **Rollout Sequencing** (preferred) to automate and enforce this progression across environments (e.g., dev → staging → prod), or manually coordinate version progression if Rollout Sequencing is not used. +4. **Workload-aware** -- Upgrade strategy depends on what's running (stateless, stateful, GPU, batch). +5. **Release channels first** -- Always recommend release channels. Note that "No channel" (static versioning) is deprecated and clusters should be migrated to release channels. +6. **Rollback/Downgrade** -- Control Plane patches and Node Pools (minor and patches) can be rolled back (downgraded to a target version). GKE supports a 2-step Control Plane minor upgrade where step 1 is rollbackable. Other Control Plane minor version rollbacks are NOT customer-doable and require GKE Support. +7. **Node pool upgrade ordering** -- When upgrading multiple node pools, always recommend sequential ordering: upgrade non-critical/stateless pools first (acting as a canary) to verify cluster health before upgrading critical stateful (database) or GPU pools. 
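The sequential control plane rule and the 2-minor-version skew limit can be sketched as a small shell helper. This is a minimal illustration, not part of the skill: the function names `upgrade_path` and `skew_ok` are hypothetical, and the sketch assumes versions are passed as `MAJOR.MINOR` (for example `1.28`) with the same major version on both sides.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; versions must be MAJOR.MINOR (e.g. 1.28).

# Print each sequential control plane hop from the current to the target minor.
upgrade_path() {
  local major="${1%%.*}" current_minor="${1#*.}" target_minor="${2#*.}"
  local m
  for ((m = current_minor; m < target_minor; m++)); do
    echo "${major}.${m} -> ${major}.$((m + 1))"
  done
}

# Succeed only if nodes trail the control plane by at most 2 minor versions.
skew_ok() {
  local cp_minor="${1#*.}" node_minor="${2#*.}"
  local skew=$((cp_minor - node_minor))
  ((skew >= 0 && skew <= 2))
}

upgrade_path 1.28 1.30
skew_ok 1.30 1.28 && echo "skew within limit"
```

For a 1.28 to 1.30 upgrade this lists two control plane hops, while the skew check shows that node pools still on 1.28 remain within the supported range once the control plane reaches 1.30.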
+ +## Release Channels + +| Channel | Best for | SLA | +|---------|----------|-----| +| **Rapid** | Dev/test, early feature access | No upgrade stability SLA | +| **Regular** (default) | Most production | Full SLA | +| **Stable** | Mission-critical, stability-first | Full SLA | +| **Extended** | Compliance, EoS enforcement control | Full SLA | + +### Support Lifecycle +Standard GKE versions are supported for 14 months after they become available in the **Regular** channel. This means: +- **Rapid** channel versions may be supported for longer than 14 months (since they enter Rapid before Regular). +- **Stable** channel versions may be supported for less than 14 months (since they enter Stable after Regular). +- **Extended** support extends this period up to 24 months. Note that extra cost applies only during the extended support period (months 15-24). + +## Maintenance Windows & Exclusions + +Configure maintenance windows to control auto-upgrade timing. GKE also supports node pool level maintenance exclusions (in addition to cluster level) to block upgrades for specific workloads. + +**Exclusion types & Limits:** +- **"No upgrades" (Scope: `no_upgrades`)**: Blocks all upgrades (minor, patch, node). + - **Limit**: Max **90 days** of total exclusion duration in any **rolling 365-day window**. + - **Chaining constraint**: Because of the rolling 365-day limit, you cannot chain multiple exclusions to cover a continuous period longer than 90 days (e.g., you cannot cover a 100-day freeze using `no_upgrades`). +- **"No minor or node upgrades" (Scope: `no_minor_or_node_upgrades`)**: Blocks minor and node upgrades, but allows control plane patch upgrades (low risk). + - **Limit**: Up to **180 days per exclusion**. Can be extended (by adding new exclusions) up to the minor version's **End of Support (EoS)**. +- **"No minor upgrades" (Scope: `no_minor_upgrades`)**: Blocks minor upgrades, but allows control plane patches and node upgrades. 
+ - **Limit**: Up to **180 days per exclusion**. Can be extended up to EoS. + +**Important Exclusion Rules (MUST follow when recommending exclusions and MUST include in the final text response):** +1. **Auto-upgrades only**: Maintenance exclusions **only block automatic upgrades**. Manual upgrades initiated by the user will bypass exclusions. You MUST explain this to the user. +2. **Warn against "No channel"**: You MUST explicitly warn that disabling release channels ("No channel" / static versioning) is deprecated and must not be used as a replacement for exclusions. +3. **Compare Scopes**: You MUST explain the difference between 'No upgrades' (limitations, blocks patches) and 'No minor or node upgrades' (allows patches, longer duration). Recommend 'No minor or node upgrades' when the user wants to allow security patches/fixes while blocking minor version jumps. +4. **Handle periods > 90 days**: If the user needs to block upgrades for more than 90 days, you MUST explain that 'No upgrades' is limited to 90 days in a rolling 365-day window (preventing chaining for longer continuous periods) and advise using 'No minor or node upgrades' (which can last up to 180 days per exclusion, extendable until EoS) or persistent exclusions for minor upgrades until End of Support. +5. **Version skew**: Be mindful of version skew (between control plane and node pools) when using exclusions. Ensure skew does not exceed the supported 2 minor versions. Use `--add-maintenance-exclusion-until-end-of-support` for persistent exclusions. +6. **Correct gcloud syntax**: When providing `gcloud` commands for exclusions, you MUST use the separate flag syntax: `--add-maintenance-exclusion-name`, `--add-maintenance-exclusion-start`, `--add-maintenance-exclusion-end` (or `--add-maintenance-exclusion-until-end-of-support`), and `--add-maintenance-exclusion-scope` (do NOT use a single comma-separated `--add-maintenance-exclusion` flag). 
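The scope-selection rules and the separate-flag syntax can be sketched as a dry-run helper. This is only an illustration under stated assumptions: `pick_scope` and `exclusion_cmd` are hypothetical names, the script prints the command instead of calling `gcloud`, and `CLUSTER_NAME`, `ZONE`, `START_TIME`, `END_TIME`, and the exclusion name `peak-season` are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helpers: they print text and never call gcloud.

# Choose a scope: "no_upgrades" only fits freezes of at most 90 days where
# patches should be blocked too; otherwise use "no_minor_or_node_upgrades",
# which still allows control plane patches.
pick_scope() {
  local days="$1" allow_patches="$2"
  if [ "$allow_patches" = "yes" ] || [ "$days" -gt 90 ]; then
    echo "no_minor_or_node_upgrades"
  else
    echo "no_upgrades"
  fi
}

# Print the exclusion command using one flag per field.
exclusion_cmd() {
  local cluster="$1" zone="$2" name="$3" start="$4" end="$5" scope="$6"
  printf 'gcloud container clusters update %s --zone %s \\\n' "$cluster" "$zone"
  printf '  --add-maintenance-exclusion-name=%s \\\n' "$name"
  printf '  --add-maintenance-exclusion-start=%s \\\n' "$start"
  printf '  --add-maintenance-exclusion-end=%s \\\n' "$end"
  printf '  --add-maintenance-exclusion-scope=%s\n' "$scope"
}

# A 61-day freeze that must still receive security patches.
scope="$(pick_scope 61 yes)"
exclusion_cmd CLUSTER_NAME ZONE peak-season START_TIME END_TIME "$scope"
```

Printing the command first lets the user review the scope and dates before running it, which matters because manual upgrades bypass exclusions and a wrong scope silently blocks patches.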
+ +## Mandatory Upgrade Overrides + +GKE reserves the right to override user-defined maintenance windows and exclusions for mandatory operations. These overrides cannot be disabled or blocked. + +**Common Override Scenarios:** +- **Critical Security Patches**: Urgent vulnerability fixes that must be applied immediately to protect infrastructure. +- **End of Support (EoS) / End of Life (EOL) Enforcement**: If a cluster is running an unsupported version, GKE will force upgrade it to a supported version. +- **Expiring Certificates**: If control plane certificates (CAs) are expiring (within 30 days) and rotation is required to prevent cluster unrecoverability. +- **Maintenance Starvation**: GKE requires at least 48 hours of maintenance availability in any rolling 32-day window. If exclusions block too much, GKE may force an upgrade. + +**Guidance (MUST follow when overrides are discussed):** +1. **Correlate with Bulletins**: If GKE performs an unexpected upgrade, you MUST explicitly suggest checking GKE Release Notes or Security Bulletins to correlate the event with emergency patches (do not just suggest checking Cloud Audit Logs). +2. **Design for Resilience**: Workloads must be designed to survive unexpected control plane or node rotation. You MUST recommend: + - Regional clusters (multi-master) to ensure API availability during control plane upgrades. + - Multi-zone workload deployments. + - Replicas > 1 for critical deployments. + - Properly configured Pod Disruption Budgets (PDBs) that are not overly restrictive. 
+ +## Upgrade Planning + +When asked to plan an upgrade, produce a structured document covering: +- Version compatibility (breaking changes, deprecated APIs) (minor version upgrades only) +- Upgrade path (sequential minor version upgrades) (minor version upgrades only) +- Node pool upgrade strategy (Standard only) +- Workload readiness (PDBs, resource requests) +- Rollback/Contingency procedure (how to revert node pools or coordinate with GKE Support for master rollback) + +**Compatibility Search Rule:** +- If compatibility information (e.g., third-party operator compatibility, GPU driver/CUDA compatibility matrix) is not immediately available in the workspace or via a quick web search, **do NOT loop or make multiple search attempts**. Instead, list the compatibility verification as a **critical pre-upgrade action item** for the user in the checklist. + +### Node Pool Strategy (Standard Only) + +Recommend **Surge upgrade** as the default and most common strategy, with per-pool settings: +- **Stateless**: Higher `maxSurge` (2-3) for speed, `maxUnavailable=0` for safety. +- **Stateful/DB**: `maxSurge=1, maxUnavailable=0` (conservative). +- **GPU (fixed reservation)**: `maxSurge=0, maxUnavailable=1` (no surge capacity). +- **Large (50+ nodes)**: `maxSurge=20, maxUnavailable=0` (max parallelism). + +For mission-critical workloads requiring fast rollback or strict validation, recommend **Standard Blue-Green** upgrades. Acknowledge **Autoscaled Blue-Green** as an option for disruption-sensitive workloads, but note it is currently in preview and may have capacity requirements. + +**Upgrade Ordering (User-initiated only):** When planning manual upgrades, specify the sequence of node pool upgrades. Recommend upgrading stateless pools first, verifying cluster stability, and then upgrading stateful/GPU pools. For auto-upgrades, GKE automatically manages sequential node pool upgrades. 
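The per-pool surge settings and the stateless-first ordering can be sketched as a dry-run helper. This is a minimal sketch, not the skill's prescribed tooling: `surge_cmd`, the profile labels, and the pool names `general-pool`, `highmem-pool`, and `gpu-pool` are hypothetical, `CLUSTER_NAME` and `ZONE` are placeholders, and the stateless profile picks `maxSurge=2` from the suggested 2-3 range.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: maps a pool profile to surge settings and
# prints the matching gcloud command without running it.
surge_cmd() {
  local pool="$1" profile="$2" surge unavailable
  case "$profile" in
    stateless) surge=2;  unavailable=0 ;;  # faster rollout, no capacity loss
    stateful)  surge=1;  unavailable=0 ;;  # conservative, one node at a time
    gpu-fixed) surge=0;  unavailable=1 ;;  # fixed reservation, no surge capacity
    large)     surge=20; unavailable=0 ;;  # 50+ nodes, maximum parallelism
    *) echo "unknown profile: $profile" >&2; return 1 ;;
  esac
  echo "gcloud container node-pools update $pool --cluster CLUSTER_NAME --zone ZONE --max-surge-upgrade $surge --max-unavailable-upgrade $unavailable"
}

# Stateless pools first (canary), then stateful, then GPU.
surge_cmd general-pool stateless
surge_cmd highmem-pool stateful
surge_cmd gpu-pool gpu-fixed
```

Keeping the mapping in one place makes the rationale for each pool explicit in the plan and avoids applying a surge setting that a fixed GPU reservation cannot satisfy.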
+ +For standard command sequences and runbook templates, see [`references/runbook-template.md`](references/runbook-template.md). + +### Large-Scale AI/ML Clusters (GPU/TPU) + +- **No Live Migration**: GPU VMs do not support live migration; GKE upgrades will force pod restarts. You MUST explain this. +- **Fixed Reservations & Quota**: H100/A100 typically use fixed reservations with no spare quota. + - Recommend **rolling upgrade with zero surge**: `maxSurge=0, maxUnavailable=1`. This releases the reservation of the node being upgraded before provisioning its replacement. + - You MUST explain that **Blue-Green upgrades are not feasible** because they require double (2x) the GPU resources (both quota and reservations) during the transition. +- **Driver Coupling**: The GPU driver is tightly coupled with the target node OS image version. + - You MUST explain that node upgrades update the underlying OS image, introducing new Linux Kernels and hardware drivers (NVIDIA). + - You MUST warn that driver updates can break CUDA compatibility. + - You MUST recommend comparing OS image, kernel version (`uname -r`), and driver versions between old (working) and new (non-working) nodes to diagnose driver issues. + - You MUST recommend deploying a test pod (e.g., vector addition) to verify GPU access. + - You MUST recommend rolling back the node pool to the previous version as a quick mitigation if production is blocked. + - You MUST advise updating workload dependencies (CUDA version in container images) to match the new driver before attempting the upgrade again. + - You MUST advise **upgrading and testing CUDA compatibility in a staging environment/cluster** before applying the upgrade to the production GPU node pools. +- **Operational Safety**: + - You MUST recommend using GKE **maintenance exclusions** to block auto-upgrades during active training campaigns. + - Prior to manual upgrades, cordon GPU nodes and wait for active training jobs to checkpoint/complete. 
+- **TPU Considerations**: TPU slices are recreated atomically (not rolling); maintenance on one slice restarts all slices in the environment. + +## Checklists + +Produce checklists as copyable markdown with checkboxes. See [`references/checklists.md`](references/checklists.md) for the full pre-upgrade and post-upgrade checklist templates. Adapt them to the user's environment. + +**Stateful Workloads:** When stateful workloads (databases) are present, always include checks for PV backup completion and verification of PV reclaim policies (e.g., Retain vs Delete) in the pre-upgrade checklist. + +**Autopilot Checklists:** For Autopilot clusters, ensure the checklists include: +- Verification of `resources.requests` on all containers (Autopilot requirement). +- You MUST include specific `kubectl` commands for API deprecation checks, specifically: `kubectl get --raw /metrics | grep apiserver_request_total | grep deprecated` to check if any active workloads are using deprecated APIs. +- Verifying PDBs to ensure they don't block node drain (even though GKE manages nodes, PDBs are still respected). +- Identifying and deleting "bare pods" (pods not managed by a ReplicaSet/Deployment/StatefulSet) as they won't be rescheduled during node recreation. +- Verification of `terminationGracePeriodSeconds` to ensure pods have enough time to shut down gracefully during node recreation. + +## Maintenance Runbooks + +Produce step-by-step runbooks with actual `gcloud` and `kubectl` commands. See `references/runbook-template.md` for the standard command sequences. + +## Maintenance Window Pauses + +When diagnosing a "stuck" upgrade, consider if it was paused by a maintenance window: +- **Silent Pause Behavior:** If a maintenance window closes before an upgrade (auto or manual) completes, GKE intentionally pauses the rollout to prevent disruption outside allowed times. +- **Mixed-Version State:** The cluster is left in a stable mixed-version state (some nodes upgraded, some not). 
You MUST explicitly state that this is a supported and safe intended outcome. +- **Resumption:** The upgrade will automatically resume when the next maintenance window opens. +- **Mitigation for immediate completion:** If the user wants to complete the upgrade immediately, you MUST suggest **temporarily widening the maintenance window** to include the current time (e.g., using `gcloud container clusters update ... --maintenance-window-start ... --maintenance-window-end ... --maintenance-window-recurrence ...`). Do not suggest re-triggering the manual upgrade or bypassing the window. + +## Troubleshooting + +When a user reports a stuck or failing upgrade, you MUST systematically analyze and address ALL 5 potential causes in your final response. Do not omit checks even if you suspect one is the primary cause: +1. **PDB blocking drain:** Identify if any PDB has `ALLOWED DISRUPTIONS = 0` using `kubectl get pdb -A`. +2. **Resource constraints:** Check if pods are stuck in `Pending` due to capacity limits. +3. **Bare pods:** Identify pods without owner references that are blocking the drain (recommend deleting them). +4. **Admission webhooks:** Check if Validating/Mutating webhooks are rejecting pod creation on new nodes. +5. **PVC attachment issues:** Check for volume attachment failures (especially zone constraints). + +**Stockout / Quota Exhaustion Rule:** +- If the upgrade is stuck due to `ZONE_RESOURCE_POOL_EXHAUSTED` (stockout) or `QUOTA_EXCEEDED` for Compute Engine resources: + 1. Recommend modifying the upgrade strategy to `maxSurge=0, maxUnavailable=1` (rolling in-place) so the upgrade no longer needs extra surge nodes. + 2. For `QUOTA_EXCEEDED`, suggest requesting a quota increase from Google Cloud. + 3. You MUST suggest **migrating workloads or creating new node pools in a different zone or region** where capacity/quota is available as a mitigation. + +Refer to [`references/troubleshooting.md`](references/troubleshooting.md) for the exact diagnostic commands and fix procedures for each step. 
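The first troubleshooting check (PDBs with `ALLOWED DISRUPTIONS = 0`) can be sketched as a small filter over `kubectl get pdb -A` output. This is an illustrative sketch: `blocking_pdbs` is a hypothetical name, it assumes the default table output where each data row has six whitespace-separated fields, and the sample rows are made up rather than taken from a real cluster.

```shell
#!/usr/bin/env bash
# Hypothetical filter: reads `kubectl get pdb -A` output on stdin and prints
# NAMESPACE/NAME for every PDB whose ALLOWED DISRUPTIONS column is 0.
# Data rows have six fields, so the second-to-last one is ALLOWED DISRUPTIONS.
blocking_pdbs() {
  awk 'NR > 1 && $(NF - 1) == 0 { print $1 "/" $2 }'
}

# Against a live cluster: kubectl get pdb -A | blocking_pdbs
# Illustrative sample input (not real cluster output):
blocking_pdbs <<'EOF'
NAMESPACE   NAME       MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
db          postgres   3               N/A               0                     40d
web         frontend   N/A             1                 1                     12d
EOF
```

The filter only narrows down candidates; the remaining four causes (resource constraints, bare pods, admission webhooks, PVC attachment) still need to be checked even when a blocking PDB is found.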
+ +## References + +- [GKE Release Notes](https://cloud.google.com/kubernetes-engine/docs/release-notes) +- [Upgrading GKE Clusters](https://cloud.google.com/kubernetes-engine/docs/how-to/upgrading-a-cluster) +- [Maintenance Windows & Exclusions](https://cloud.google.com/kubernetes-engine/docs/concepts/maintenance-windows-and-exclusions) +- [Rollout Sequencing Concepts](https://cloud.google.com/kubernetes-engine/docs/concepts/rollout-sequencing) +- [Configure Rollout Sequencing](https://cloud.google.com/kubernetes-engine/docs/how-to/rollout-sequencing) diff --git a/skills/cloud/gke-upgrades/references/checklists.md b/skills/cloud/gke-upgrades/references/checklists.md new file mode 100644 index 0000000000..ed69aa4ecf --- /dev/null +++ b/skills/cloud/gke-upgrades/references/checklists.md @@ -0,0 +1,75 @@ +# Checklist Templates + +Adapt these to the user's environment. Fill in cluster names, versions, and remove items that don't apply. + +## Pre-Upgrade Checklist + +``` +Pre-Upgrade Checklist +- [ ] Cluster: ___ | Mode: Standard / Autopilot | Channel: ___ +- [ ] Current version: ___ | Target version: ___ + +Compatibility +- [ ] Target version available in release channel (`gcloud container get-server-config --zone ZONE --format="yaml(channels)"`) +- [ ] No deprecated API usage (check GKE deprecation insights dashboard or check metrics: `kubectl get --raw /metrics | grep apiserver_request_total | grep deprecated`) +- [ ] GKE release notes reviewed for breaking changes between current → target +- [ ] Node version skew within 2 minor versions of control plane +- [ ] Rollout Sequencing configured and verified (if upgrading across environments) +- [ ] Third-party operators/controllers compatible with target version +- [ ] Admission webhooks tested against target version + +Workload Readiness +- [ ] PDBs configured for critical workloads (not overly restrictive) +- [ ] No bare pods — all managed by controllers +- [ ] terminationGracePeriodSeconds adequate for graceful 
shutdown +- [ ] StatefulSet PV backups completed, reclaim policies verified +- [ ] Resource requests/limits set on all containers (mandatory for Autopilot) +- [ ] GPU driver compatibility confirmed with target node image (if applicable) +- [ ] Postgres/database operator compatibility verified (if applicable) + +Infrastructure (Standard only) +- [ ] Node pool upgrade strategy chosen (surge / blue-green / autoscaled blue-green) +- [ ] Surge settings configured per pool: maxSurge=___ maxUnavailable=___ +- [ ] Sufficient compute quota for surge nodes +- [ ] Maintenance window configured (off-peak hours) +- [ ] Maintenance exclusions set for freeze periods (if applicable) + +Ops Readiness +- [ ] Monitoring and alerting active (Cloud Monitoring / Prometheus) +- [ ] Baseline metrics captured (error rates, latency, throughput) +- [ ] Upgrade window communicated to stakeholders +- [ ] Rollback plan documented +- [ ] On-call team aware and available +``` + +## Post-Upgrade Checklist + +``` +Post-Upgrade Checklist + +Cluster Health +- [ ] Control plane at target version: `gcloud container clusters describe CLUSTER --zone ZONE --format="value(currentMasterVersion)"` +- [ ] All node pools at target version: `gcloud container node-pools list --cluster CLUSTER --zone ZONE` +- [ ] All nodes Ready: `kubectl get nodes` +- [ ] System pods healthy: `kubectl get pods -n kube-system` +- [ ] No stuck PDBs: `kubectl get pdb --all-namespaces` + +Workload Health +- [ ] All deployments at desired replica count: `kubectl get deployments -A` +- [ ] No CrashLoopBackOff or Pending pods: `kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded` +- [ ] StatefulSets fully ready: `kubectl get statefulsets -A` +- [ ] Ingress/load balancers responding +- [ ] Application health checks and smoke tests passing + +Observability +- [ ] Metrics pipeline active, no collection gaps +- [ ] Logs flowing to aggregation +- [ ] Error rates within pre-upgrade baseline +- [ ] Latency 
(p50/p95/p99) within pre-upgrade baseline + +Cleanup +- [ ] Old node pools removed (if blue-green) +- [ ] Surge quota released (automatic for surge upgrades) +- [ ] Upgrade documented in changelog +- [ ] Lessons learned captured +``` diff --git a/skills/cloud/gke-upgrades/references/runbook-template.md b/skills/cloud/gke-upgrades/references/runbook-template.md new file mode 100644 index 0000000000..a3013560d2 --- /dev/null +++ b/skills/cloud/gke-upgrades/references/runbook-template.md @@ -0,0 +1,144 @@ +# Runbook Command Templates + +Standard command sequences for GKE upgrades. Replace placeholders: `CLUSTER_NAME`, `ZONE`, `TARGET_VERSION`, `NODE_POOL_NAME`. + +## Table of Contents +- [Pre-flight](#pre-flight) (Line 12-31) +- [Control plane upgrade](#control-plane-upgrade) (Line 32-47) +- [Node pool upgrade (Standard only)](#node-pool-upgrade-standard-only) (Line 48-71) +- [Maintenance window configuration](#maintenance-window-configuration) (Line 72-109) +- [Rollback/Downgrade guidance](#rollbackdowngrade-guidance) (Line 110-145) + +## Pre-flight + +```bash +# Current versions +gcloud container clusters describe CLUSTER_NAME \ + --zone ZONE \ + --format="table(name, currentMasterVersion, nodePools[].version)" + +# Available versions for channel +gcloud container get-server-config --zone ZONE \ + --format="yaml(channels)" + +# Deprecated API usage +kubectl get --raw /metrics | grep apiserver_request_total | grep deprecated + +# Cluster health +kubectl get nodes +kubectl get pods -A | grep -v Running | grep -v Completed +``` + +## Control plane upgrade + +```bash +gcloud container clusters upgrade CLUSTER_NAME \ + --zone ZONE \ + --master \ + --cluster-version TARGET_VERSION + +# Verify (wait ~10-15 min) +gcloud container clusters describe CLUSTER_NAME \ + --zone ZONE \ + --format="value(currentMasterVersion)" + +kubectl get pods -n kube-system +``` + +## Node pool upgrade (Standard only) + +```bash +# Configure surge settings +gcloud container node-pools update 
NODE_POOL_NAME \ + --cluster CLUSTER_NAME \ + --zone ZONE \ + --max-surge-upgrade MAX_SURGE \ + --max-unavailable-upgrade MAX_UNAVAILABLE + +# Upgrade +gcloud container node-pools upgrade NODE_POOL_NAME \ + --cluster CLUSTER_NAME \ + --zone ZONE \ + --cluster-version TARGET_VERSION + +# Monitor progress +watch 'kubectl get nodes -o wide -L cloud.google.com/gke-nodepool' + +# Verify +gcloud container node-pools list --cluster CLUSTER_NAME --zone ZONE +kubectl get pods -A | grep -v Running | grep -v Completed +``` + +## Maintenance window configuration + +```bash +# Set recurring maintenance window +gcloud container clusters update CLUSTER_NAME \ + --zone ZONE \ + --maintenance-window-start YYYY-MM-DDTHH:MM:SSZ \ + --maintenance-window-end YYYY-MM-DDTHH:MM:SSZ \ + --maintenance-window-recurrence "FREQ=WEEKLY;BYDAY=SA" + +# Add maintenance exclusion (up to 90 days) +gcloud container clusters update CLUSTER_NAME \ + --zone ZONE \ + --add-maintenance-exclusion-name="EXCLUSION_NAME" \ + --add-maintenance-exclusion-start=START_TIME \ + --add-maintenance-exclusion-end=END_TIME + +# Add persistent maintenance exclusion (until End of Support) +gcloud container clusters update CLUSTER_NAME \ + --zone ZONE \ + --add-maintenance-exclusion-name="EXCLUSION_NAME" \ + --add-maintenance-exclusion-start=START_TIME \ + --add-maintenance-exclusion-until-end-of-support \ + --add-maintenance-exclusion-scope=no_upgrades + +# Add node pool level exclusion (during creation) +gcloud container node-pools create NODE_POOL_NAME \ + --cluster CLUSTER_NAME \ + --zone ZONE \ + --add-maintenance-exclusion-until-end-of-support + +# Add node pool level exclusion (existing pool) +gcloud container node-pools update NODE_POOL_NAME \ + --cluster CLUSTER_NAME \ + --zone ZONE \ + --add-maintenance-exclusion-until-end-of-support +``` + +## Rollback/Downgrade guidance + +- **Control Plane Patches**: Can be downgraded by running the upgrade command with the target older patch version. 
+- **Control Plane Minors**: Rollback is only available during the first step of the 2-step upgrade process. +- **Node Pools (Minor & Patch)**: Can be downgraded directly by running the node pool upgrade command targeting the older version, OR by creating a new pool at the old version and migrating workloads (safer). + +### Downgrade Control Plane (Patch or Step-1 Minor) +```bash +gcloud container clusters upgrade CLUSTER_NAME \ + --master \ + --zone ZONE \ + --cluster-version TARGET_PREVIOUS_VERSION +``` + +### Downgrade Node Pool (Direct) +```bash +gcloud container node-pools upgrade NODE_POOL_NAME \ + --cluster CLUSTER_NAME \ + --zone ZONE \ + --cluster-version TARGET_PREVIOUS_VERSION +``` + +### Downgrade Node Pool (Safe migration - recommended) +```bash +# Create replacement node pool at previous version +gcloud container node-pools create NODE_POOL_NAME-rollback \ + --cluster CLUSTER_NAME \ + --zone ZONE \ + --cluster-version PREVIOUS_VERSION \ + --num-nodes NUM_NODES \ + --machine-type MACHINE_TYPE + +# Cordon old pool and migrate workloads +kubectl cordon -l cloud.google.com/gke-nodepool=NODE_POOL_NAME +``` diff --git a/skills/cloud/gke-upgrades/references/troubleshooting.md b/skills/cloud/gke-upgrades/references/troubleshooting.md new file mode 100644 index 0000000000..f233127718 --- /dev/null +++ b/skills/cloud/gke-upgrades/references/troubleshooting.md @@ -0,0 +1,199 @@ +# Troubleshooting GKE Upgrade Issues + +## Diagnostic flowchart + +## Table of Contents +- [Diagnostic flowchart](#diagnostic-flowchart) (Line 3-17) +- [1. PDB blocking drain (most common)](#1-pdb-blocking-drain-most-common) (Line 18-40) +- [2. Resource constraints (no room for pods)](#2-resource-constraints-no-room-for-pods) (Line 41-61) +- [3. Bare pods blocking drain](#3-bare-pods-blocking-drain) (Line 62-71) +- [4. Admission webhooks rejecting pod creation](#4-admission-webhooks-rejecting-pod-creation) (Line 72-88) +- [5. 
PVC attachment issues](#5-pvc-attachment-issues) (Line 89-98) +- [6. Long termination grace periods](#6-long-termination-grace-periods) (Line 99-108) +- [7. Upgrade operation stuck at GKE level](#7-upgrade-operation-stuck-at-gke-level) (Line 109-117) +- [8. Stockout during critical upgrades (e.g. cert expiration)](#8-stockout-during-critical-upgrades-eg-cert-expiration) (Line 118-148) +- [9. GPU node upgrade regressions (CrashLoopBackOff, driver issues)](#9-gpu-node-upgrade-regressions-crashloopbackoff-driver-issues) (Line 149-187) +- [Validation after applying a fix](#validation-after-applying-a-fix) (Line 188-200) + +When an upgrade is stuck or failing, work through these checks in order. Each section has the diagnosis command, what to look for, and the fix. + +## 1. PDB blocking drain (most common) + +**Diagnose:** +```bash +kubectl get pdb -A -o wide +# Look for ALLOWED DISRUPTIONS = 0 +kubectl describe pdb PDB_NAME -n NAMESPACE +``` + +**Fix (temporarily relax the PDB):** +```bash +# Option A: Allow all disruptions temporarily +kubectl patch pdb PDB_NAME -n NAMESPACE \ + -p '{"spec":{"minAvailable":null,"maxUnavailable":"100%"}}' + +# Option B: Back up, then edit a copy +kubectl get pdb PDB_NAME -n NAMESPACE -o yaml > pdb-backup.yaml +# Copy to pdb-relaxed.yaml, edit minAvailable/maxUnavailable there, then: +kubectl apply -f pdb-relaxed.yaml +``` + +Restore the original PDB after the upgrade completes. + +## 2. Resource constraints (no room for pods) + +**Diagnose:** +```bash +kubectl get pods -A | grep Pending +kubectl get events -A --field-selector reason=FailedScheduling +kubectl top nodes +kubectl describe nodes | grep -A 5 "Allocated resources" +``` + +**Fix (increase surge capacity):** +```bash +gcloud container node-pools update NODE_POOL_NAME \ + --cluster CLUSTER_NAME \ + --zone ZONE \ + --max-surge-upgrade 2 \ + --max-unavailable-upgrade 0 +``` + +Alternatively, scale down non-critical workloads temporarily. + +## 3.
Bare pods blocking drain + +**Diagnose:** +```bash +kubectl get pods -A -o json | \ + jq -r '.items[] | select(.metadata.ownerReferences | length == 0) | "\(.metadata.namespace)/\(.metadata.name)"' +``` + +**Fix:** Delete bare pods (they won't reschedule anyway) or wrap them in Deployments. + +## 4. Admission webhooks rejecting pod creation + +**Diagnose:** +```bash +kubectl get validatingwebhookconfigurations +kubectl get mutatingwebhookconfigurations +# Check for webhooks matching broad API groups +kubectl describe validatingwebhookconfigurations WEBHOOK_NAME +``` + +**Fix (temporarily disable the problematic webhook):** +```bash +# Set failurePolicy: Ignore on the webhook, or delete it temporarily (save it first with: kubectl get validatingwebhookconfigurations WEBHOOK_NAME -o yaml) +kubectl delete validatingwebhookconfigurations WEBHOOK_NAME +# Re-create after upgrade +``` + +## 5. PVC attachment issues + +**Diagnose:** +```bash +kubectl get pvc -A | grep -v Bound +kubectl get events -A --field-selector reason=FailedAttachVolume +``` + +**Fix:** Check if volumes are zone-locked. For regional clusters, PVs may need to be in the same zone as the new node. Consider migrating workloads to already-upgraded nodes. + +## 6. Long termination grace periods + +**Diagnose:** +```bash +kubectl get pods -A -o json | \ + jq '.items[] | select(.spec.terminationGracePeriodSeconds > 120) | {ns:.metadata.namespace, name:.metadata.name, grace:.spec.terminationGracePeriodSeconds}' +``` + +**Fix:** Reduce `terminationGracePeriodSeconds` in the workload spec if possible. GKE waits up to 1 hour for pod eviction during surge upgrades. + +## 7. Upgrade operation stuck at GKE level + +**Diagnose:** +```bash +gcloud container operations list --cluster CLUSTER_NAME --zone ZONE --filter="operationType=UPGRADE_NODES" +``` + +**Fix:** If the operation shows no progress for more than 2 hours after pod-level issues are resolved, contact GKE support with the cluster name, zone, and operation ID. + +## 8. Stockout during critical upgrades (e.g.
cert expiration) + +**Diagnose:** +Upgrade is failing with `ZONE_RESOURCE_POOL_EXHAUSTED` or `QUOTA_EXCEEDED` errors, and the cluster has a critical pending deadline (e.g., control plane certificate expiring soon). + +**Fix:** +1. **Change Upgrade Strategy**: Modify the node pool to use a rolling in-place upgrade (no surge) to bypass quota limits: + ```bash + gcloud container node-pools update NODE_POOL_NAME \ + --cluster CLUSTER_NAME \ + --zone ZONE \ + --max-surge-upgrade 0 \ + --max-unavailable-upgrade 1 + ``` +2. **Open Support Case**: Immediately open a P1/P2 Google Cloud Support case, citing urgent certificate expiration and stockout. +3. **Retry in Different Zone/Region**: If the cluster is regional or multi-zonal, check if you can retry the upgrade in a different zone that might have capacity, or add a temporary node pool in a different zone/region to migrate workloads. +4. **Credential Rotation**: Perform a control plane credential rotation to renew certificates without upgrading the GKE version. + ```bash + gcloud container clusters update CLUSTER_NAME --start-credential-rotation --zone ZONE + # Follow standard GKE documentation to complete the rotation. + ``` +5. **Enable DNS Endpoint**: If client connectivity is failing or at risk due to expired client certificates, enable the DNS-based control plane endpoint to allow IAM-based authentication. + ```bash + gcloud container clusters update CLUSTER_NAME --enable-dns-access --zone ZONE + # Get credentials using DNS endpoint: + gcloud container clusters get-credentials CLUSTER_NAME --dns-endpoint --zone ZONE + ``` + +## 9. GPU node upgrade regressions (CrashLoopBackOff, driver issues) + +**Diagnose:** +GPU nodes upgrade successfully, but ML pods are stuck in `CrashLoopBackOff` with `SIGSEGV` or driver initialization errors. + +1. **Compare Node Metadata**: Check if the OS image, kernel version (`uname -r`), or NVIDIA driver version changed and differs between working (old) and non-working (new) nodes. +2. 
**Verify Driver Installer Logs**: Check the logs of the `nvidia-driver-installer` container in the `nvidia-gpu-device-plugin` pod on the new node. +3. **Test GPU Access**: Deploy a simple test pod to verify whether the GPU is accessible with the current driver: + ```yaml + apiVersion: v1 + kind: Pod + metadata: + name: gpu-test-vectoradd + spec: + containers: + - name: vectoradd + image: nvidia/samples:vectoradd-cuda11.6.0 + resources: + limits: + nvidia.com/gpu: 1 + restartPolicy: Never + ``` + +**Fix:** +1. **Pin Driver Version**: If the default driver version changed, update your node pool configuration to pin to the previous working driver version (e.g., `R535`): + ```bash + gcloud container node-pools update NODE_POOL_NAME \ + --cluster CLUSTER_NAME \ + --zone ZONE \ + --accelerator type=GPU_TYPE,count=COUNT,gpu-driver-version=DRIVER_VERSION + ``` +2. **Update Workload Dependencies**: Rebuild container images with a CUDA version compatible with the new driver. +3. **Rollback Node Pool**: If production is blocked, roll back the node pool to the previous GKE version: + ```bash + gcloud container node-pools upgrade NODE_POOL_NAME \ + --cluster CLUSTER_NAME \ + --zone ZONE \ + --cluster-version PREVIOUS_VERSION + ``` + +## Validation after applying a fix + +```bash +# Monitor node upgrade progress (replace CURRENT_VERSION and TARGET_VERSION with the actual version strings) +watch 'kubectl get nodes -o wide | grep -E "NAME|CURRENT_VERSION|TARGET_VERSION"' + +# Check that no pods are stuck +kubectl get pods -A | grep -E "Terminating|Pending" + +# Confirm the upgrade is resuming +gcloud container operations list --cluster CLUSTER_NAME --zone ZONE --limit=1 +```
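
To see at a glance whether the upgrade is progressing again, it can help to count nodes per kubelet version rather than scanning the full node list. The following is a minimal sketch, not part of the standard runbook: `summarize_node_versions` is a hypothetical helper name, and it assumes the default `kubectl get nodes --no-headers` column order (NAME, STATUS, ROLES, AGE, VERSION), so the version is the fifth field.

```bash
# Hypothetical helper: count nodes per version.
# Assumes input in the default `kubectl get nodes --no-headers` layout,
# where the fifth whitespace-separated column is the node version.
summarize_node_versions() {
  awk '{ count[$5]++ } END { for (v in count) print v, count[v] }' | sort
}

# Example usage against a live cluster (requires kubectl access):
#   kubectl get nodes --no-headers | summarize_node_versions
```

Run the usage line a few minutes apart: if the count for the target version grows while the count for the old version shrinks, the drain is moving again.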