Problem
In HA deployments (2+ replicas with leader election), the admission webhook intermittently rejects valid workload creates with:
Error: creating workload: admission webhook "vworkload.kb.io" denied the request: cluster <project-name> not found
This is reproducible in staging today. The error is intermittent because it only occurs on requests routed to a non-leader pod (about 50% of traffic with two replicas).
What's happening
The ValidateCreate webhook handler calls mgr.GetCluster(ctx, clusterName), which delegates to the Milo multicluster provider's Get(). The provider only has a cluster registered in its projects map after Reconcile() calls mcAware.Engage() — and Engage() requires mcAware to be set, which only happens when provider.Start() is called.
The problem: provider.Start() is added to the underlying controller-runtime manager as a manager.RunnableFunc (in mcManager.Start()), which has no NeedLeaderElection() implementation. Controller-runtime therefore treats it as a leader-elected runnable and only calls it on the leader pod.
Result:
- Leader pod: provider.Start() fires → mcAware set → projects reconciled → p.projects populated → GetCluster() works → webhooks succeed
- Non-leader pod: provider.Start() never fires → mcAware nil → every Reconcile() returns "Multicluster manager not yet started" → p.projects empty → GetCluster() always fails → webhooks always fail
The compute-webhook Service selects all pods (same label selector as the manager), so Kubernetes load-balances webhook traffic across both replicas.
This affects every project, not just one. Any workload create or update routed to the non-leader will fail.
Why the obvious fix doesn't work
Making the Milo provider implement NeedLeaderElection() bool { return false } doesn't help because the multicluster manager wraps provider.Start() in an anonymous manager.RunnableFunc before adding it — that wrapper doesn't forward the interface.
Even if it did, it would break controller isolation: provider.Start() on all pods → Engage() called on all pods → per-cluster controllers start on all pods, bypassing leader election entirely.
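The wrapper behaviour described above can be illustrated with a minimal Go sketch. The types here are local stand-ins that mirror the shape of controller-runtime's manager.Runnable, manager.LeaderElectionRunnable and manager.RunnableFunc. The leaderOnly function is a hypothetical simplification of how the manager classifies runnables, and provider is an illustrative type, not the actual Milo provider.

```go
package main

import (
	"context"
	"fmt"
)

// Runnable and LeaderElectionRunnable are local stand-ins for the
// controller-runtime interfaces of the same names.
type Runnable interface {
	Start(ctx context.Context) error
}

type LeaderElectionRunnable interface {
	NeedLeaderElection() bool
}

// RunnableFunc adapts a plain function to Runnable. It only has a Start
// method, so it can never satisfy LeaderElectionRunnable.
type RunnableFunc func(ctx context.Context) error

func (f RunnableFunc) Start(ctx context.Context) error { return f(ctx) }

// provider is a hypothetical provider that opts out of leader election.
type provider struct{}

func (provider) Start(ctx context.Context) error { return nil }
func (provider) NeedLeaderElection() bool        { return false }

// leaderOnly models the classification: a runnable is leader-gated unless it
// implements LeaderElectionRunnable and returns false.
func leaderOnly(r Runnable) bool {
	if le, ok := r.(LeaderElectionRunnable); ok {
		return le.NeedLeaderElection()
	}
	return true
}

func main() {
	p := provider{}
	fmt.Println("added directly, leader-only:", leaderOnly(p))                    // false
	fmt.Println("wrapped in func, leader-only:", leaderOnly(RunnableFunc(p.Start))) // true
}
```

Wrapping p.Start in a function value keeps the Start behaviour but discards every other method on the provider, which is why the opt-out is invisible to the manager.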
Root cause
The webhook and the controllers share a single provider instance that conflates two distinct responsibilities:
- Maintaining the projects map for GetCluster() lookups (needed by the webhook on all pods)
- Calling mcAware.Engage() to start per-cluster controllers (must only happen on the leader)
These are tangled in Reconcile(): p.projects[key] = cl only happens after Engage() succeeds, so the webhook's lookup capability is gated on the same leader election that protects controller startup.
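A simplified model of that ordering, as a sketch rather than the real Milo provider code: the field names mcAware and projects and the error text come from the description above, while the engager interface, the string cluster handle, and noopEngager are assumptions made for illustration. Locking is omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// engager stands in for the multicluster-aware manager (assumption).
type engager interface {
	Engage(project string) error
}

type noopEngager struct{}

func (noopEngager) Engage(string) error { return nil }

type provider struct {
	mcAware  engager           // set only by Start(), so only on the leader
	projects map[string]string // project name -> cluster handle (simplified)
}

var errNotStarted = errors.New("Multicluster manager not yet started")

// Reconcile follows the ordering described above: without a successful
// Engage there is no map entry.
func (p *provider) Reconcile(project string) error {
	if p.mcAware == nil {
		return errNotStarted
	}
	if err := p.mcAware.Engage(project); err != nil {
		return err
	}
	p.projects[project] = "cluster-for-" + project
	return nil
}

// Get is what the webhook ultimately depends on.
func (p *provider) Get(project string) (string, error) {
	cl, ok := p.projects[project]
	if !ok {
		return "", fmt.Errorf("cluster %s not found", project)
	}
	return cl, nil
}

func main() {
	nonLeader := &provider{projects: map[string]string{}}
	fmt.Println(nonLeader.Reconcile("demo"))
	_, err := nonLeader.Get("demo")
	fmt.Println(err)
}
```

In this model the non-leader can never serve a lookup, no matter how many times Reconcile runs, because the map write sits behind the nil check.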
Options to explore
1. Separate webhook and controller deployments (standard operator pattern, cleanest)
Split into two distinct Deployments:
- Webhook deployment: runs its own Milo provider, no leader election, no controllers; all pods serve webhook traffic with a full projects map
- Controller deployment: leader election enabled, runs controllers, no webhook server
2. Two provider instances within the same binary
Instantiate two independent Milo providers in main.go:
- A webhook-only provider: started as a non-leader runnable, populates a cluster map without engaging any controllers (requires Milo provider to support a watch-only/no-engage mode)
- A controller provider: the existing one, fully leader-gated
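One possible shape for the watch-only mode, sketched under assumptions: lookupProvider, engageFunc, and newLookupProvider are hypothetical names, and the cluster handle is reduced to a string. The Milo provider does not necessarily expose anything like this today.

```go
package main

import (
	"fmt"
	"sync"
)

// engageFunc stands in for mcAware.Engage; nil means watch-only (assumption).
type engageFunc func(project string) error

// lookupProvider always records clusters and engages controllers only when
// an engage function was supplied.
type lookupProvider struct {
	mu       sync.RWMutex
	projects map[string]string
	engage   engageFunc
}

func newLookupProvider(engage engageFunc) *lookupProvider {
	return &lookupProvider{projects: map[string]string{}, engage: engage}
}

func (p *lookupProvider) Reconcile(project string) error {
	// Record first, so lookups never depend on leadership.
	p.mu.Lock()
	p.projects[project] = "cluster-for-" + project
	p.mu.Unlock()

	if p.engage == nil {
		return nil // watch-only instance: nothing to start
	}
	return p.engage(project)
}

func (p *lookupProvider) Get(project string) (string, bool) {
	p.mu.RLock()
	defer p.mu.RUnlock()
	cl, ok := p.projects[project]
	return cl, ok
}

func main() {
	webhookSide := newLookupProvider(nil) // would run on every pod
	_ = webhookSide.Reconcile("demo")
	_, ok := webhookSide.Get("demo")
	fmt.Println("watch-only lookup found:", ok)
}
```

The webhook-only instance would be constructed with a nil engage function and registered as a non-leader runnable, while the controller instance keeps the existing leader-gated path.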
3. Webhook-safe GetCluster() path
Give the webhook its own lightweight multicluster.Provider implementation that builds direct REST client connections to project control planes on demand (via https://milo-apiserver/.../projects/{name}/control-plane), without needing the shared provider or leader election at all.
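A rough sketch of the on-demand idea, using only the standard library. It resolves and caches a per-project endpoint instead of building a real REST client, and the base URL is a placeholder because the actual path prefix is not given here. The onDemandClusters type and its fields are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
	"sync"
)

// onDemandClusters resolves project control-plane endpoints lazily and
// caches them. It has no dependency on a shared provider or on leadership.
type onDemandClusters struct {
	mu        sync.Mutex
	base      *url.URL          // placeholder API server base URL (assumption)
	endpoints map[string]string // project name -> control-plane endpoint
}

func newOnDemandClusters(base string) (*onDemandClusters, error) {
	u, err := url.Parse(base)
	if err != nil {
		return nil, err
	}
	return &onDemandClusters{base: u, endpoints: map[string]string{}}, nil
}

// Get returns the cached endpoint for a project, building it on first use.
// A real implementation would construct a client here and would likely
// verify that the project exists.
func (c *onDemandClusters) Get(project string) (string, error) {
	if project == "" {
		return "", fmt.Errorf("empty project name")
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if ep, ok := c.endpoints[project]; ok {
		return ep, nil
	}
	ep := c.base.JoinPath("projects", project, "control-plane").String()
	c.endpoints[project] = ep
	return ep, nil
}

func main() {
	clusters, err := newOnDemandClusters("https://milo-apiserver.example/base")
	if err != nil {
		panic(err)
	}
	fmt.Println(clusters.Get("demo"))
}
```

Because nothing is populated ahead of time, this path behaves the same on leader and non-leader pods. The trade-off is that a lookup for a nonexistent project only fails when the endpoint is first contacted, unless an explicit existence check is added.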
Workaround (staging)
Scale compute-manager to 1 replica. The single pod is always the leader, so the provider is always started and all webhook requests succeed.
kubectl scale deployment compute-manager -n compute-system --replicas=1
Questions to answer
- Do we want to separate the webhook into its own Deployment? What does that mean for cert management and the existing kustomize structure?
- Should the Milo provider support a "watch-only" mode (populate projects without Engage()) to enable option 2?
- Is option 3 simpler to implement while we figure out the longer-term architecture?