Incident Analysis: Cluster-wide Outage due to Zombie Worker Nodes and CNI/Storage Deadlocks #9

@cybersiddhu

Incident Overview

On Friday, January 23, 2026, the Kubernetes cluster dictycr-dev-staging-dcr-experiments-k8s.local entered a degraded state in which two of its four worker nodes (50%) became unresponsive. The failure then cascaded into networking (Cilium) and storage (Persistent Volumes).

Root Cause Analysis

  1. Zombie Worker Nodes: Two worker nodes (nodes-us-central1-c-1mzf and nodes-us-central1-c-9psh) entered a state in which the GCE instances were RUNNING but the kubelet service was unresponsive. As a result, the control plane could neither manage nor drain them.
  2. Orchestration Failure: kops rolling-update failed because it could not perform a graceful kubectl drain on the unresponsive nodes.
  3. CNI Identity Synchronization: Forced node replacement resulted in a network partition. New nodes were unable to communicate with legacy nodes due to out-of-sync Cilium identities and routing tables.
  4. Persistent Volume Deadlocks: Essential services like redis and logto failed to recover due to "Multi-Attach" errors. The Google Cloud API continued to report disks as attached to the deleted "zombie" instances.
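
The "zombie" condition from item 1 can be detected by cross-checking the two views that disagreed during the incident: what the cloud provider reports and what the Kubernetes API reports. The following is a minimal sketch, not the tooling used in this incident. It assumes node objects shaped like the items of `kubectl get nodes -o json`, a plain dict mapping instance name to its GCE status string, and that the node name equals the instance name. The function names and the sample node names are hypothetical.

```python
def ready_status(node):
    """Return the status of the node's Ready condition.

    Falls back to "Unknown" when the condition is missing entirely.
    """
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond.get("status", "Unknown")
    return "Unknown"


def find_zombie_nodes(nodes, instance_status):
    """List nodes whose cloud instance is RUNNING but whose kubelet is not Ready.

    nodes: list of Node dicts (assumed shape: `kubectl get nodes -o json` items).
    instance_status: dict of instance name -> status string (assumption:
    node name and instance name are identical).
    """
    zombies = []
    for node in nodes:
        name = node["metadata"]["name"]
        if ready_status(node) != "True" and instance_status.get(name) == "RUNNING":
            zombies.append(name)
    return sorted(zombies)


# Illustrative data only; these are not the statuses recorded in the incident.
sample_nodes = [
    {"metadata": {"name": "worker-a"},
     "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
    {"metadata": {"name": "worker-b"},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
]
sample_instances = {"worker-a": "RUNNING", "worker-b": "RUNNING"}

print(find_zombie_nodes(sample_nodes, sample_instances))  # ['worker-a']
```

A node that is not Ready while its instance is stopped or missing is deliberately excluded, since that is an ordinary node loss rather than the zombie state described above.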

Resolution Steps

  1. Manual Node Purge: Manually deleted all 4 legacy worker instances in Google Cloud to force a clean cluster state.
  2. Network Data-Plane Reset: Re-synchronized the CNI by performing a coordinated rollout of cilium-operator, the cilium DaemonSet, and coredns.
  3. Manual Volume Unlock: Manually terminated lingering pod objects and scaled legacy ReplicaSets to 0 to release GCE Persistent Disks.
  4. Service Restoration: Coordinated restart of backend microservices (annotation, content, stock, redis) and the graphql-api-server.
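
Step 3 depends on knowing which lingering pod objects still hold a Persistent Volume Claim on a node that no longer exists, and which ReplicaSets own them. The sketch below shows one way to compute that list. It assumes pod objects shaped like the items of `kubectl get pods -A -o json` and a list of node names that currently exist. The function name and all sample names are hypothetical, and this is not necessarily how the pods were located during the incident.

```python
def lingering_pvc_pods(pods, live_node_names):
    """Find pods that mount a PVC but are bound to a node that no longer exists.

    Returns one dict per pod with the claims it holds and the ReplicaSets
    that own it, so an operator can decide what to delete or scale down.
    """
    live = set(live_node_names)
    found = []
    for pod in pods:
        meta = pod.get("metadata", {})
        spec = pod.get("spec", {})
        node = spec.get("nodeName")
        claims = [
            vol["persistentVolumeClaim"]["claimName"]
            for vol in spec.get("volumes", [])
            if "persistentVolumeClaim" in vol
        ]
        # Skip unscheduled pods, pods on live nodes, and pods without PVCs.
        if not node or node in live or not claims:
            continue
        owners = [
            ref["name"]
            for ref in meta.get("ownerReferences", [])
            if ref.get("kind") == "ReplicaSet"
        ]
        found.append({
            "namespace": meta.get("namespace", "default"),
            "pod": meta.get("name"),
            "node": node,
            "claims": claims,
            "replicasets": owners,
        })
    return found


# Illustrative data only.
sample_pods = [
    {"metadata": {"name": "cache-0", "namespace": "dev",
                  "ownerReferences": [{"kind": "ReplicaSet", "name": "cache-old"}]},
     "spec": {"nodeName": "deleted-node",
              "volumes": [{"name": "data",
                           "persistentVolumeClaim": {"claimName": "cache-data"}}]}},
]
for entry in lingering_pvc_pods(sample_pods, ["new-node-1"]):
    print(entry["pod"], entry["claims"], entry["replicasets"])
```

Each reported pod could then be removed with, for example, `kubectl delete pod <name> --grace-period=0 --force`, and each listed ReplicaSet scaled with `kubectl scale replicaset <name> --replicas=0`.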

Key Insights & Recommendations

  • Avoid Deadlock Drains: If a node is NotReady and SSH is unresponsive, bypass kops orchestration and delete the cloud instance immediately.
  • Unified Data Plane: For CNI issues involving node rotation, a full worker pool refresh is often faster and more reliable than a piecemeal one.
  • Monitoring Improvements: Implement alerting for nodes that remain NotReady for more than 15 minutes, to catch "zombie" states before they impact the entire cluster.
  • Storage Management: Be prepared to manually scale down legacy ReplicaSets to 0 during rolling updates involving Persistent Volumes if "Multi-Attach" errors persist.
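
The monitoring recommendation above can be expressed as a small check over node conditions. This is a sketch under stated assumptions, not an existing alert rule. It assumes node objects shaped like the items of `kubectl get nodes -o json`, with `lastTransitionTime` in the whole-second UTC form the API emits (for example `2026-01-23T10:00:00Z`). The function names, the sample node names, and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the recommendation above.
NOT_READY_THRESHOLD = timedelta(minutes=15)


def parse_k8s_time(value):
    """Parse a whole-second UTC timestamp such as 2026-01-23T10:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def sustained_not_ready(nodes, now, threshold=NOT_READY_THRESHOLD):
    """Return (node name, duration) for nodes not Ready for longer than threshold.

    Both status "False" and status "Unknown" count as not Ready.
    """
    alerts = []
    for node in nodes:
        for cond in node.get("status", {}).get("conditions", []):
            if cond.get("type") != "Ready":
                continue
            if cond.get("status") != "True":
                duration = now - parse_k8s_time(cond["lastTransitionTime"])
                if duration > threshold:
                    alerts.append((node["metadata"]["name"], duration))
            break  # only one Ready condition per node
    return alerts


# Illustrative data only; timestamps are invented.
sample_nodes = [
    {"metadata": {"name": "worker-a"},
     "status": {"conditions": [{"type": "Ready", "status": "Unknown",
                                "lastTransitionTime": "2026-01-23T10:00:00Z"}]}},
    {"metadata": {"name": "worker-b"},
     "status": {"conditions": [{"type": "Ready", "status": "False",
                                "lastTransitionTime": "2026-01-23T10:10:00Z"}]}},
]
now = datetime(2026, 1, 23, 10, 20, tzinfo=timezone.utc)
for name, duration in sustained_not_ready(sample_nodes, now):
    print(name, duration)  # worker-a 0:20:00
```

Passing `now` in explicitly keeps the check deterministic and easy to test. A real deployment would more likely express the same condition in the cluster's existing alerting system.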

Analyzed and resolved with help from gemini-cli
