Summary
When a Site (core.posit.team/v1beta1) is deleted with foreground propagation, the operator's controllers keep recreating the children of the Site that is being deleted, so the deletion never completes — a hard deadlock that requires manual intervention (scaling the operator to 0).
Root cause: none of the Site or product (Connect/Workbench/PackageManager/Chronicle/Flightdeck) Reconcile functions check DeletionTimestamp. These CRDs carry no operator finalizer, so while an object is Terminating (only Kubernetes' foregroundDeletion finalizer remaining) it is still readable via Get, and every controller takes its "found → (re)create resources" path.
Impact
- Severity: high. Deleting a Site (e.g. removing a site from infra config and applying) hangs indefinitely. Foreground deletion blocks on owned children that the operator keeps recreating.
- No clean automated recovery: operators must scale team-operator-controller-manager to 0, let GC drain, then scale back up.
Steps to reproduce
- Have a Site with the usual product children (Connect/Workbench/PackageManager/Chronicle/Flightdeck).
- Delete the Site with foreground propagation (e.g. kubectl delete site <name> --cascade=foreground, or a Pulumi delete; plain kubectl delete defaults to background propagation). Kubernetes sets deletionTimestamp, adds the foregroundDeletion finalizer, and begins GC of owned children.
- Observe that the deletion never completes: child Deployments/StatefulSets/PVCs/Middlewares churn, and child product CRDs are recreated within seconds (e.g. a workbench/<site> CRD recreated roughly every 9s).
Observed behavior (logs, genericized)
INFO <Product> found; updating resources {"name":"<site>"}
INFO successfully created or updated PVC {"pvc":"<site>-..."}
INFO updated object Deployment <site>-...
The Site and all child product CRDs show deletionTimestamp set with only ["foregroundDeletion"] as the finalizer (the operator adds no finalizer of its own).
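The stuck state described above can be recognized mechanically. The following is a minimal sketch in Go, not code from the operator: objectMeta is a hypothetical stand-in for the two metav1.ObjectMeta fields involved, and terminatingOnForegroundOnly and newTerminating are illustrative names introduced here.

```go
package main

import (
	"fmt"
	"time"
)

// objectMeta is a hypothetical stand-in for the two metav1.ObjectMeta
// fields that matter here; real code would read them from the API object.
type objectMeta struct {
	DeletionTimestamp *time.Time
	Finalizers        []string
}

// foregroundDeletion is the finalizer Kubernetes adds for foreground
// cascading deletion.
const foregroundDeletion = "foregroundDeletion"

// newTerminating builds example metadata for an object whose deletion
// started now and which carries the given finalizers (illustrative helper).
func newTerminating(finalizers ...string) objectMeta {
	now := time.Now()
	return objectMeta{DeletionTimestamp: &now, Finalizers: finalizers}
}

// terminatingOnForegroundOnly reports whether the object is being deleted
// and the only finalizer left is foregroundDeletion, i.e. it is waiting
// solely on garbage collection of its owned children.
func terminatingOnForegroundOnly(m objectMeta) bool {
	if m.DeletionTimestamp == nil || m.DeletionTimestamp.IsZero() {
		return false
	}
	return len(m.Finalizers) == 1 && m.Finalizers[0] == foregroundDeletion
}

func main() {
	site := newTerminating(foregroundDeletion)
	fmt.Println(terminatingOnForegroundOnly(site)) // prints "true"
}
```

An object that matches this check for more than a few GC cycles is a candidate for the recreate loop described in this issue.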
Root cause (code refs)
- SiteReconciler.Reconcile does not inspect DeletionTimestamp; on a successful Get it logs "Site found; updating resources" and unconditionally calls reconcileResources, recreating the child product CRDs (internal/controller/core/site_controller.go:79-96). It relies solely on the IsNotFound → cleanupResources path (:657-705), which can never be reached because children are continually recreated.
- All five product controllers share the same gap (Get → log "found; updating resources" → recreate, no DeletionTimestamp guard):
  - connect_controller.go:63-80
  - workbench_controller.go:64-82
  - packagemanager_controller.go:57-74
  - chronicle_controller.go:66-83
  - flightdeck_controller.go:59-80
- Each controller Owns(...) its Deployments/StatefulSets, so GC-deleting an owned object re-enqueues the CRD and triggers immediate recreation.
- The correct pattern already exists in PostgresDatabaseReconciler.Reconcile, which guards on deletion before any create (postgresdatabase_controller.go:77).
Proposed fix
Add an early-return deletion guard immediately after the successful Get (before any reconcile/create) in each Reconcile, mirroring postgresdatabase_controller.go:77. Since these CRDs carry no operator finalizer, the correct behavior is to stop reconciling and let GC proceed:
// No operator finalizer to process: stop reconciling so GC can drain the children.
if !obj.GetDeletionTimestamp().IsZero() {
	l.Info("<Kind> is being deleted; skipping reconcile")
	return ctrl.Result{}, nil
}
Apply to SiteReconciler (site_controller.go, after the Get ~:85) and all five product reconcilers (connect/workbench/packagemanager/chronicle/flightdeck). The Site-level guard alone breaks the child-CRD recreate loop; adding it to all six is the complete, defense-in-depth fix. (Any future finalizer-backed destructive teardown would slot into this same deletion branch, gated by controllerutil.ContainsFinalizer, as PostgresDatabase does.)
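To illustrate why the guard is enough to break the loop, here is a toy simulation in Go. It is a sketch under stated assumptions, not the operator's code: site, reconcile, gcStep, and drains are hypothetical names, the garbage collector is modeled as removing one child per round, and each removal is assumed to trigger one reconcile of the owner.

```go
package main

import "fmt"

// site is a hypothetical in-memory stand-in for a Site and the child
// objects it owns; it is not the operator's real type.
type site struct {
	deleting bool // models a non-zero deletionTimestamp
	children int  // number of owned child objects that currently exist
}

// desiredChildren counts the product children named in this issue:
// Connect, Workbench, PackageManager, Chronicle, Flightdeck.
const desiredChildren = 5

// reconcile models the "found -> (re)create resources" path. With guard
// set, it returns early for an object that is being deleted.
func reconcile(s *site, guard bool) {
	if guard && s.deleting {
		return // skip reconcile and let garbage collection proceed
	}
	s.children = desiredChildren
}

// gcStep models the garbage collector deleting one owned child.
func gcStep(s *site) {
	if s.children > 0 {
		s.children--
	}
}

// drains reports whether all children are gone within maxRounds, where
// each round is one GC deletion followed by the reconcile it triggers.
func drains(guard bool, maxRounds int) bool {
	s := &site{deleting: true, children: desiredChildren}
	for i := 0; i < maxRounds; i++ {
		gcStep(s)
		reconcile(s, guard) // the child's deletion re-enqueues the owner
		if s.children == 0 {
			return true // foreground deletion can now complete
		}
	}
	return false
}

func main() {
	fmt.Println("without guard:", drains(false, 100)) // prints "without guard: false"
	fmt.Println("with guard:", drains(true, 100))     // prints "with guard: true"
}
```

In this model the unguarded reconciler restores every child after each GC step, so the child count never reaches zero, while the guarded one lets the count fall to zero after five rounds.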
Environment
- team-operator (github.com/posit-dev/team-operator)
- API group/version: core.posit.team/v1beta1
- sigs.k8s.io/controller-runtime v0.22.4