115 changes: 115 additions & 0 deletions skills/cloud/gke-upgrades/EVAL.txtpb
@@ -0,0 +1,115 @@
suite_name: "gke-upgrades"

cases {
name: "gke_upgrade_plan_standard"
prompt: "We have a Standard GKE cluster on the Regular release channel running 1.28 in us-central1-a. We need to upgrade to 1.30 before end of quarter. We have 3 node pools -- a general-purpose pool, a high-memory pool for our Postgres operator, and a GPU pool for ML inference. Can you put together an upgrade plan?"
expectations: "Recommends sequential upgrade path (1.28→1.29→1.30) rather than skipping"
expectations: "Specifies control plane upgrades before node pool upgrades"
expectations: "Provides a specific node pool upgrade order with a rationale for the sequence"
expectations: "Includes specific surge upgrade settings (maxSurge/maxUnavailable) for each pool"
expectations: "Addresses Postgres-specific concerns (PDB, backup, operator compatibility, PV reclaim)"
expectations: "Addresses GPU-specific concerns (driver compatibility, workload coordination)"
expectations: "Includes actual gcloud commands for the upgrade steps"
expectations: "Provides a pre-upgrade checklist"
expectations: "Provides a post-upgrade checklist with validation commands"
expectations: "Includes rollback procedure or guidance"
}

cases {
name: "gke_autopilot_checklists"
prompt: "I manage 4 Autopilot clusters -- 2 in dev on Rapid channel and 2 in prod on Stable. We've been getting notifications that our prod clusters will auto-upgrade from 1.29 to 1.30 next month. Can you give me pre- and post-upgrade checklists tailored to our setup?"
expectations: "Distinguishes between dev (Rapid) and prod (Stable) clusters in the checklists"
expectations: "Recognizes Autopilot limitations (no node management, no SSH, mandatory resource requests)"
expectations: "Includes API deprecation checks with kubectl commands"
expectations: "Mentions mandatory resource requests as an Autopilot-specific concern"
expectations: "Pre-upgrade checklist includes workload readiness items (PDBs, bare pods, graceful shutdown)"
expectations: "Post-upgrade checklist includes version verification commands"
expectations: "Post-upgrade checklist includes workload health validation"
expectations: "Recommends testing on dev clusters before prod auto-upgrade"
expectations: "Includes monitoring/observability validation items"
}

cases {
name: "gke_troubleshoot_stuck_upgrade"
prompt: "Our node pool upgrade has been stuck for 2 hours. It says 3 out of 12 nodes upgraded. We're on a Standard cluster in us-east1, running 1.29, upgrading nodes to 1.30. The pods on the remaining nodes aren't draining. What should we check and how do we fix this?"
expectations: "Identifies PDB as the most likely/common cause of stuck upgrades"
expectations: "Provides kubectl command to check PDBs (kubectl get pdb --all-namespaces or similar)"
expectations: "Provides commands to identify which pods are blocking the drain"
expectations: "Includes resource constraint diagnosis (capacity issues, pending pods)"
expectations: "Provides a fix for restrictive PDBs (patch or temporary adjustment)"
expectations: "Includes guidance on surge upgrade settings as a potential fix"
expectations: "Mentions webhook issues as a possible cause"
expectations: "Includes validation steps to confirm the fix is working"
}

cases {
name: "gke_maintenance_exclusions"
prompt: "We have a critical business period coming up from Nov 15 to Jan 15. We are running a GKE Standard cluster on the Regular channel. How can we ensure GKE doesn't auto-upgrade our control plane or nodes during this time, but still allows emergency security patches?"
expectations: "Recommends using maintenance exclusions instead of 'No channel' (which is deprecated)."
expectations: "Explains the difference between 'No upgrades' and 'No minor or node upgrades' exclusions."
expectations: "Recommends 'No minor or node upgrades' (or equivalent) to allow patch upgrades while blocking minor/node upgrades."
expectations: "Provides the correct gcloud command to add a maintenance exclusion."
expectations: "The suggested gcloud command uses the correct separate flags for exclusions (e.g., --add-maintenance-exclusion-name, --add-maintenance-exclusion-start, --add-maintenance-exclusion-end, --add-maintenance-exclusion-scope)."
expectations: "Explains that auto-upgrades respect exclusions but manual upgrades bypass them."
expectations: "Mentions the 30-day limit for 'No upgrades' exclusions and how to handle longer periods (e.g. persistent exclusions for minor upgrades until End of Support)."
}

cases {
name: "gke_gpu_node_upgrade"
prompt: "We run large-scale ML training on GKE using H100 GPUs with fixed reservations. We need to upgrade our node pools. We cannot use surge upgrades because we don't have extra GPU quota, and we cannot live migrate GPU workloads. What is the recommended upgrade strategy and plan?"
expectations: "Explains that GPU VMs do not support live migration and upgrades force pod restart."
expectations: "Recommends using maxSurge=0 and maxUnavailable=1 (or failure domain-based batching) due to fixed GPU reservations."
expectations: "Explains why blue-green upgrade is not feasible (requires 2x GPU resources)."
expectations: "Recommends cordoning GPU nodes and waiting for active training jobs to checkpoint/complete before upgrading."
expectations: "Warns about GPU driver coupling with the GKE version and the need to test CUDA compatibility in staging."
expectations: "Recommends using maintenance exclusions to block upgrades during active training campaigns."
}

cases {
name: "gke_upgrade_quota_exhausted"
prompt: "Our GKE node pool upgrade is stuck. The status is 'updating' for several hours, but no new nodes are coming up. We checked the GCE logs and see errors like `ZONE_RESOURCE_POOL_EXHAUSTED` and `QUOTA_EXCEEDED` for `CPUS_ALL_REGIONS` in our zone. What does this mean and how do we resolve it?"
expectations: "Identifies that the upgrade requires temporary extra resources (surge nodes) which are failing to provision due to GCE quota or zone capacity limits"
expectations: "Suggests checking Google Cloud quota usage and limits for the target region/zone"
expectations: "Recommends modifying the node pool's surge configuration (e.g., lowering maxSurge to reduce concurrent resource footprint)"
expectations: "Suggests requesting a quota increase from Google Cloud"
expectations: "Suggests moving workloads or node pools to other zones with available capacity if feasible"
}

cases {
name: "gke_mandatory_upgrade_override"
prompt: "We configured a maintenance exclusion window to block all upgrades during our peak sales event. However, GKE just upgraded our control plane anyway, causing some brief disruption. Why did GKE ignore our exclusion window? Is this a bug?"
expectations: "Explains that GKE reserves the right to override user-defined maintenance policies for mandatory operations (critical security patches, EOL version enforcement, expiring CAs)"
expectations: "Suggests checking GKE release notes or security bulletins to correlate the upgrade with emergency patches"
expectations: "Explains that mandatory overrides cannot be disabled or blocked by exclusions"
expectations: "Advises designing workloads to be resilient to unexpected node/control plane rotation (multi-zone, replicas > 1, PDBs)"
}

cases {
name: "gke_post_upgrade_gpu_regression"
prompt: "We upgraded our GPU node pool to a new GKE version. The upgrade completed successfully and the nodes are Ready, but our ML training pods are now stuck in `CrashLoopBackOff` showing `SIGSEGV` or driver initialization errors. What happened and how do we debug this?"
expectations: "Identifies that GKE node upgrades update the underlying node OS image, which can introduce a new Linux kernel and new hardware drivers (such as NVIDIA GPU drivers)"
expectations: "Explains that ML workloads using CUDA are often tightly coupled to specific driver versions, and driver updates can break compatibility"
expectations: "Suggests comparing OS image, kernel version (uname -r), and driver versions between old and new nodes"
expectations: "Suggests deploying a test pod to verify GPU driver access directly on the new node"
expectations: "Recommends rolling back the node pool to the previous version as a quick mitigation"
expectations: "Advises checking CUDA version compatibility and updating workload dependencies before re-upgrading"
}

cases {
name: "gke_stockout_cert_expiry"
prompt: "Our GKE cluster control plane certificate is expiring in 10 days, and we need to upgrade to rotate it. However, the upgrade is failing with resource stockout errors in our region. We cannot provision new nodes or upgrade the control plane. We are stuck and facing an outage if the cert expires. What should we do?"
expectations: "Recognizes that a stockout is a Google-side capacity issue and cannot be resolved by customer-side tuning alone"
expectations: "Recommends opening a P1/P2 Google Cloud Support case immediately, citing urgent certificate expiration"
expectations: "Suggests checking if the upgrade can be retried in a different zone/region"
expectations: "Suggests performing a control plane credential rotation as a short-term mitigation to renew certs without a GKE version upgrade"
expectations: "Suggests enabling the DNS-based control plane endpoint as a mitigation for client connectivity"
}

cases {
name: "gke_upgrade_silent_pause"
prompt: "We started upgrading our GKE Standard cluster node pool during our weekly 4-hour maintenance window. The upgrade was only half done when the window closed, and now the status is just sitting there. There are no error logs, and the cluster is in a mixed-version state. Is the upgrade stuck?"
expectations: "Identifies that the upgrade was paused because the maintenance window closed before the rollout could complete"
expectations: "Explains that GKE pauses active rollouts when the window closes to prevent disruption outside allowed times, leaving the cluster in a stable mixed-version state"
expectations: "Explains that the upgrade will automatically resume when the next maintenance window opens"
expectations: "Suggests temporarily widening the maintenance window if the user wants to complete the upgrade immediately"
}