# Incident Analysis: Cluster-wide Outage due to Zombie Worker Nodes and CNI/Storage Deadlocks

## Incident Overview
On Friday, January 23, 2026, the Kubernetes cluster `dictycr-dev-staging-dcr-experiments-k8s.local` entered a degraded state in which two of its four worker nodes (50%) became unresponsive. This led to a cascading failure affecting networking (Cilium) and storage (Persistent Volumes).

## Root Cause Analysis
1. **Zombie Worker Nodes:** Two worker nodes (`nodes-us-central1-c-1mzf` and `nodes-us-central1-c-9psh`) entered a state where the GCE instances were `RUNNING`, but the kubelet service was unresponsive. This prevented the control plane from managing or draining them.
2. **Orchestration Failure:** `kops rolling-update` failed because it could not perform a graceful `kubectl drain` on the unresponsive nodes.
3. **CNI Identity Synchronization:** Forced node replacement resulted in a network partition. New nodes were unable to communicate with legacy nodes due to out-of-sync Cilium identities and routing tables.
4. **Persistent Volume Deadlocks:** Essential services like `redis` and `logto` failed to recover due to "Multi-Attach" errors. The Google Cloud API continued to report disks as attached to the deleted "zombie" instances.
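
The "zombie" condition in item 1 can be detected by cross-checking the cloud provider's view of an instance against the node's `Ready` condition. The following is a minimal sketch, not the tooling used during the incident. It assumes the node list comes from `kubectl get nodes -o json`, that instance statuses have been collected into a name-to-status mapping (for example from `gcloud compute instances list --format=json`), and that Kubernetes node names match GCE instance names. The function name `find_zombie_nodes` is hypothetical.

```python
def find_zombie_nodes(node_list: dict, instance_status: dict[str, str]) -> list[str]:
    """Return nodes whose GCE instance is RUNNING but whose kubelet is not reporting Ready.

    node_list: parsed output of `kubectl get nodes -o json`.
    instance_status: mapping of instance name -> GCE status (assumed to share
    names with the Kubernetes nodes).
    """
    zombies = []
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # A healthy kubelet reports Ready=True. A stalled kubelet leaves the
        # condition at "Unknown" or "False", or never posts one at all.
        kubelet_ok = ready is not None and ready.get("status") == "True"
        if instance_status.get(name) == "RUNNING" and not kubelet_ok:
            zombies.append(name)
    return zombies
```

A node that is stopped or deleted on the GCE side is deliberately excluded, since that case is handled by normal node lifecycle rather than being a zombie.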

## Resolution Steps
1. **Manual Node Purge:** Deleted all 4 legacy worker instances in Google Cloud, including the two that were still responsive, to force a clean cluster state.
2. **Network Data-Plane Reset:** Re-synchronized the CNI by performing a coordinated rollout of `cilium-operator`, the `cilium` DaemonSet, and `coredns`.
3. **Manual Volume Unlock:** Terminated lingering pod objects by hand and scaled legacy ReplicaSets to 0 to release the GCE Persistent Disks.
4. **Service Restoration:** Coordinated restart of backend microservices (`annotation`, `content`, `stock`, `redis`) and the `graphql-api-server`.
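
Steps 2 and 3 can be expressed as an ordered command plan. The sketch below only builds the `kubectl` argument lists and does not run them. It assumes the Cilium components and CoreDNS live in the `kube-system` namespace under the names `cilium-operator`, `cilium`, and `coredns`, and that each restart is followed by a `rollout status` wait. The issue does not record the exact commands used, and the function names and the `5m` timeout are illustrative.

```python
def cni_reset_commands(namespace: str = "kube-system", timeout: str = "5m") -> list[list[str]]:
    """Build the restart sequence: cilium-operator, then the cilium DaemonSet, then coredns."""
    targets = ["deployment/cilium-operator", "daemonset/cilium", "deployment/coredns"]
    commands = []
    for target in targets:
        # Restart the workload, then block until its rollout completes so the
        # next component starts against a settled data plane.
        commands.append(["kubectl", "-n", namespace, "rollout", "restart", target])
        commands.append(
            ["kubectl", "-n", namespace, "rollout", "status", target, f"--timeout={timeout}"]
        )
    return commands


def scale_down_command(replicaset: str, namespace: str) -> list[str]:
    """Build the command that scales one legacy ReplicaSet to 0 so its pods release their disks."""
    return ["kubectl", "-n", namespace, "scale", f"replicaset/{replicaset}", "--replicas=0"]
```

Each list can be passed to `subprocess.run(cmd, check=True)` in order. Returning the plan as data first makes it possible to review the commands before anything touches the cluster.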

## Key Insights & Recommendations
- **Avoid Deadlock Drains:** If a node is `NotReady` and SSH is unresponsive, bypass `kops` orchestration and delete the cloud instance immediately.
- **Unified Data Plane:** For CNI issues involving node rotation, a full worker pool refresh is often faster and more reliable than a piecemeal one.
- **Monitoring Improvements:** Implement alerting for sustained `NotReady` node conditions (>15 mins) to catch "zombie" states before they impact the entire cluster.
- **Storage Management:** Be prepared to manually scale down legacy ReplicaSets to 0 during rolling updates involving Persistent Volumes if "Multi-Attach" errors persist.
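
The monitoring recommendation comes down to one check: has a node's `Ready` condition been anything other than `True` for longer than the threshold? The following is a rough sketch of that check, assuming input from `kubectl get nodes -o json` and timestamps in the `YYYY-MM-DDTHH:MM:SSZ` form Kubernetes emits for `lastTransitionTime`. The function name `sustained_not_ready` is hypothetical, and a production setup would more likely express this as an alert rule in the existing monitoring stack.

```python
from datetime import datetime, timedelta, timezone


def sustained_not_ready(
    node_list: dict,
    now: datetime,
    threshold: timedelta = timedelta(minutes=15),
) -> list[str]:
    """Return names of nodes whose Ready condition has not been True for longer than threshold."""
    flagged = []
    for node in node_list.get("items", []):
        for cond in node.get("status", {}).get("conditions", []):
            if cond.get("type") != "Ready" or cond.get("status") == "True":
                continue
            # lastTransitionTime marks when the node left the Ready state.
            since = datetime.strptime(
                cond["lastTransitionTime"], "%Y-%m-%dT%H:%M:%SZ"
            ).replace(tzinfo=timezone.utc)
            if now - since > threshold:
                flagged.append(node["metadata"]["name"])
    return flagged
```

Calling it with `now=datetime.now(timezone.utc)` on a schedule gives a basic alert source. Nodes that have never posted a `Ready` condition are not flagged by this sketch and would need a separate check.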

---
*Analyzed and resolved with help from gemini-cli*
