Incident Overview
On Friday, January 23, 2026, the Kubernetes cluster dictycr-dev-staging-dcr-experiments-k8s.local entered a degraded state in which 50% of the worker nodes (two of four) became unresponsive. This led to a cascading failure affecting networking (Cilium) and storage (Persistent Volumes).
Root Cause Analysis
- Zombie Worker Nodes: Two worker nodes (nodes-us-central1-c-1mzf and nodes-us-central1-c-9psh) entered a state where the GCE instances were RUNNING, but the kubelet service was unresponsive. This prevented the control plane from managing or draining them.
- Orchestration Failure: kops rolling-update failed because it could not perform a graceful kubectl drain on the unresponsive nodes.
- CNI Identity Synchronization: Forced node replacement resulted in a network partition. New nodes were unable to communicate with legacy nodes due to out-of-sync Cilium identities and routing tables.
- Persistent Volume Deadlocks: Essential services like redis and logto failed to recover due to "Multi-Attach" errors. The Google Cloud API continued to report disks as attached to the deleted "zombie" instances.
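The stale-attachment condition behind the "Multi-Attach" errors can be checked mechanically. The following is a minimal sketch, not the procedure used during the incident. It assumes disk records shaped like the output of `gcloud compute disks list --format=json`, where each disk has a `name` and an optional `users` list of attached-instance URLs. The function name `stale_attachments` and the `live_instances` argument are hypothetical.

```python
def stale_attachments(disks, live_instances):
    """Map each disk name to the attached instance names that no longer exist.

    disks: list of dicts with "name" and an optional "users" list of
           instance URLs (assumed shape of the gcloud JSON output).
    live_instances: set of instance names that currently exist.
    """
    stale = {}
    for disk in disks:
        gone = []
        for url in disk.get("users", []):
            # The last path segment of the URL is the instance name.
            instance = url.rsplit("/", 1)[-1]
            if instance not in live_instances:
                gone.append(instance)
        if gone:
            stale[disk["name"]] = gone
    return stale
```

Any disk returned here is still marked as attached to an instance that is gone, which is the situation that blocks a new pod from mounting the volume.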
Resolution Steps
- Manual Node Purge: Manually deleted all 4 legacy worker instances in Google Cloud to force a clean cluster state.
- Network Data-Plane Reset: Re-synchronized the CNI by performing a coordinated rollout of cilium-operator, the cilium DaemonSet, and coredns.
- Manual Volume Unlock: Manually terminated lingering pod objects and scaled legacy ReplicaSets to 0 to release GCE Persistent Disks.
- Service Restoration: Performed a coordinated restart of the backend microservices (annotation, content, stock, redis) and the graphql-api-server.
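The data-plane reset and volume unlock steps above can be sketched as an ordered command plan. This is an illustration under stated assumptions, not a record of the exact commands that were run. It assumes cilium-operator and coredns are Deployments, cilium is a DaemonSet, and all three live in the kube-system namespace. The function names, the 5-minute timeout, and the ReplicaSet name and namespace used by a caller are hypothetical.

```python
def data_plane_restart_plan(namespace="kube-system"):
    """Build kubectl commands that restart the network components in order,
    waiting for each rollout to finish before starting the next one."""
    # Assumed workload kinds; verify against the actual cluster.
    targets = [
        "deployment/cilium-operator",
        "daemonset/cilium",
        "deployment/coredns",
    ]
    plan = []
    for target in targets:
        plan.append(["kubectl", "-n", namespace, "rollout", "restart", target])
        plan.append(["kubectl", "-n", namespace, "rollout", "status", target,
                     "--timeout=5m"])
    return plan


def scale_down_command(replicaset, namespace):
    """Build the kubectl command that scales a legacy ReplicaSet to 0
    so its pods release their Persistent Disks."""
    return ["kubectl", "-n", namespace, "scale",
            f"replicaset/{replicaset}", "--replicas=0"]
```

Returning argument lists rather than running the commands keeps the plan reviewable; each list can be passed to `subprocess.run(cmd, check=True)` once an operator has confirmed it.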
Key Insights & Recommendations
- Avoid Deadlock Drains: If a node is NotReady and SSH is unresponsive, bypass kops orchestration and delete the cloud instance immediately.
- Unified Data Plane: For CNI issues involving node rotation, a full worker pool refresh is often faster and more reliable than a piecemeal one.
- Monitoring Improvements: Implement alerting for NotReady node conditions that persist for more than 15 minutes to catch "zombie" states before they impact the entire cluster.
- Storage Management: Be prepared to manually scale down legacy ReplicaSets to 0 during rolling updates involving Persistent Volumes if "Multi-Attach" errors persist.
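The monitoring recommendation above can be prototyped before a full alerting rule is in place. The sketch below is one possible check, assuming input shaped like the output of `kubectl get nodes -o json`, where each node carries a `Ready` condition with a `status` and an RFC 3339 `lastTransitionTime`. The function name `sustained_not_ready` is hypothetical, and the 15-minute default mirrors the threshold suggested above.

```python
from datetime import datetime, timedelta, timezone


def sustained_not_ready(node_list, now, threshold=timedelta(minutes=15)):
    """Return names of nodes whose Ready condition has not been "True"
    for longer than `threshold` as of `now` (a timezone-aware datetime)."""
    flagged = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # Skip nodes that report no Ready condition or are currently healthy.
        if ready is None or ready.get("status") == "True":
            continue
        since = datetime.strptime(
            ready["lastTransitionTime"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if now - since > threshold:
            flagged.append(node["metadata"]["name"])
    return flagged
```

Both `False` and `Unknown` count as not ready here, since an unresponsive kubelet typically surfaces as `Unknown` rather than `False`.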
Analyzed and resolved with help from gemini-cli