Site/product controllers recreate children during foreground deletion → deletion deadlock #139

@stevenolen

Summary

When a Site (core.posit.team/v1beta1) is deleted with foreground propagation, the operator's controllers keep recreating the children of the Site that is being deleted, so the deletion never completes — a hard deadlock that requires manual intervention (scaling the operator to 0).

Root cause: none of the Site or product (Connect/Workbench/PackageManager/Chronicle/Flightdeck) Reconcile functions check DeletionTimestamp. These CRDs carry no operator finalizer, so while an object is Terminating (only Kubernetes' foregroundDeletion finalizer remaining) it is still readable via Get, and every controller takes its "found → (re)create resources" path.

Impact

  • Severity: high. Deleting a Site (e.g. removing a site from infra config and applying) hangs indefinitely. Foreground deletion blocks on owned children that the operator keeps recreating.
  • No clean automated recovery — operators must scale team-operator-controller-manager to 0, let GC drain, then scale back up.

Steps to reproduce

  1. Have a Site with the usual product children (Connect/Workbench/PackageManager/Chronicle/Flightdeck).
  2. Delete the Site with foreground propagation (e.g. kubectl delete site <name> --cascade=foreground, or a Pulumi delete; plain kubectl delete defaults to background propagation). Kubernetes sets deletionTimestamp and a foregroundDeletion finalizer, then begins GC of owned children.
  3. Observe the deletion never completes; child Deployments/StatefulSets/PVCs/Middlewares churn, and child product CRDs are recreated within seconds (e.g. a workbench/<site> CRD recreated roughly every 9s).

Observed behavior (logs, genericized)

INFO  <Product> found; updating resources    {"name":"<site>"}
INFO  successfully created or updated PVC     {"pvc":"<site>-..."}
INFO  updated object  Deployment  <site>-...

The Site and all child product CRDs show deletionTimestamp set with only ["foregroundDeletion"] as the finalizer (the operator adds no finalizer of its own).

Root cause (code refs)

  • SiteReconciler.Reconcile does not inspect DeletionTimestamp; on a successful Get it logs "Site found; updating resources" and unconditionally calls reconcileResources, recreating the child product CRDs (internal/controller/core/site_controller.go:79-96). It relies solely on the IsNotFound → cleanupResources path (:657-705), which can never be reached because children are continually recreated.
  • All five product controllers share the same gap (Get → log "found; updating resources" → recreate, no DeletionTimestamp guard):
    • connect_controller.go:63-80
    • workbench_controller.go:64-82
    • packagemanager_controller.go:57-74
    • chronicle_controller.go:66-83
    • flightdeck_controller.go:59-80
  • Each controller Owns(...) its Deployments/StatefulSets, so GC-deleting an owned object re-enqueues the CRD and triggers immediate recreation.
  • The correct pattern already exists in PostgresDatabaseReconciler.Reconcile, which guards on deletion before any create — postgresdatabase_controller.go:77.

Proposed fix

Add an early-return deletion guard immediately after the successful Get (before any reconcile/create) in each Reconcile, mirroring postgresdatabase_controller.go:77. Since these CRDs carry no operator finalizer, the correct behavior is to stop reconciling and let GC proceed:

if !obj.GetDeletionTimestamp().IsZero() {
    l.Info("<Kind> is being deleted; skipping reconcile")
    return ctrl.Result{}, nil
}

Apply to SiteReconciler (site_controller.go, after the Get ~:85) and all five product reconcilers (connect/workbench/packagemanager/chronicle/flightdeck). The Site-level guard alone breaks the child-CRD recreate loop; adding it to all six is the complete, defense-in-depth fix. (Any future finalizer-backed destructive teardown would slot into this same deletion branch, gated by controllerutil.ContainsFinalizer, as PostgresDatabase does.)

Environment

  • team-operator (github.com/posit-dev/team-operator)
  • API group/version: core.posit.team/v1beta1
  • sigs.k8s.io/controller-runtime v0.22.4
