Webhook admission failures in HA deployments due to leader-gated cluster provider #117

Description

@scotwells

Problem

In HA deployments (2+ replicas with leader election), the admission webhook intermittently rejects valid workload creates with:

Error: creating workload: admission webhook "vworkload.kb.io" denied the request: cluster <project-name> not found

This is reproducible in staging today. The error is intermittent because it only occurs on requests routed to a non-leader pod (roughly 50% of traffic with two replicas, assuming the Service spreads requests evenly).

What's happening

The ValidateCreate webhook handler calls mgr.GetCluster(ctx, clusterName), which delegates to the Milo multicluster provider's Get(). The provider only has a cluster registered in its projects map after Reconcile() calls mcAware.Engage() — and Engage() requires mcAware to be set, which only happens when provider.Start() is called.

The problem: provider.Start() is added to the underlying controller-runtime manager as a manager.RunnableFunc (in mcManager.Start()), which does not implement NeedLeaderElection(). Controller-runtime treats any runnable without that method as requiring leader election, so it only starts provider.Start() on the pod that holds the leader lease.

Result:

  • Leader pod: provider.Start() fires → mcAware set → projects reconciled → p.projects populated → GetCluster() works → webhooks succeed
  • Non-leader pod: provider.Start() never fires → mcAware nil → every Reconcile() returns "Multicluster manager not yet started" → p.projects empty → GetCluster() always fails → webhooks always fail
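
The two outcomes above can be reproduced with a toy model. This is a minimal sketch, not the Milo provider's code: the `provider`, `engager`, and `noopEngager` types are hypothetical stand-ins that copy only the ordering described in this issue (Start sets mcAware, Reconcile registers the cluster only after Engage succeeds, Get reads the projects map).

```go
package main

import (
	"errors"
	"fmt"
)

// engager is a stand-in for the multicluster-aware manager handed to the provider by Start.
type engager interface {
	Engage(name string) error
}

// noopEngager accepts every cluster; it exists only so the leader has something to call.
type noopEngager struct{}

func (noopEngager) Engage(string) error { return nil }

// provider is a toy model of the behaviour described in this issue.
type provider struct {
	mcAware  engager           // nil until Start runs
	projects map[string]string // cluster name -> placeholder cluster handle
}

func newProvider() *provider { return &provider{projects: map[string]string{}} }

// Start only runs on the leader, because it is registered as a leader-elected runnable.
func (p *provider) Start(e engager) { p.mcAware = e }

// Reconcile registers the cluster only after Engage succeeds.
func (p *provider) Reconcile(name string) error {
	if p.mcAware == nil {
		return errors.New("Multicluster manager not yet started")
	}
	if err := p.mcAware.Engage(name); err != nil {
		return err
	}
	p.projects[name] = "cluster:" + name
	return nil
}

// Get is what the webhook's GetCluster lookup ends up calling.
func (p *provider) Get(name string) (string, error) {
	cl, ok := p.projects[name]
	if !ok {
		return "", fmt.Errorf("cluster %s not found", name)
	}
	return cl, nil
}

func main() {
	leader, follower := newProvider(), newProvider()
	leader.Start(noopEngager{}) // the follower never receives this call

	for _, p := range []*provider{leader, follower} {
		_ = p.Reconcile("my-project")
	}

	_, errLeader := leader.Get("my-project")
	_, errFollower := follower.Get("my-project")
	fmt.Println("leader lookup error:", errLeader)
	fmt.Println("follower lookup error:", errFollower)
}
```

The follower's lookup fails with the same "cluster ... not found" message the webhook returns, even though both pods reconciled the same project.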

The compute-webhook Service selects all pods (same label selector as the manager), so Kubernetes load-balances webhook traffic across both replicas.

This affects every project, not just one. Any workload create or update routed to the non-leader will fail.

Why the obvious fix doesn't work

Making the Milo provider implement NeedLeaderElection() bool { return false } doesn't help because the multicluster manager wraps provider.Start() in an anonymous manager.RunnableFunc before adding it — that wrapper doesn't forward the interface.

Even if it did, it would break controller isolation: provider.Start() on all pods → Engage() called on all pods → per-cluster controllers start on all pods, bypassing leader election entirely.
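
A minimal sketch of why the wrapper hides the opt-out. The `Runnable`, `LeaderElectionRunnable`, and `RunnableFunc` declarations below mirror the shape of the controller-runtime manager types but are redeclared locally so the example has no dependencies; `needsLeaderElection` is a hypothetical helper that models the manager's grouping rule rather than quoting its code, and `provider` is a stand-in for a Milo provider that opts out.

```go
package main

import (
	"context"
	"fmt"
)

// Runnable mirrors the shape of controller-runtime's manager.Runnable.
type Runnable interface {
	Start(ctx context.Context) error
}

// LeaderElectionRunnable mirrors the optional opt-out interface.
type LeaderElectionRunnable interface {
	NeedLeaderElection() bool
}

// RunnableFunc adapts a plain function to Runnable. It has a Start method and nothing else.
type RunnableFunc func(ctx context.Context) error

func (f RunnableFunc) Start(ctx context.Context) error { return f(ctx) }

// needsLeaderElection models the grouping rule: anything that does not
// explicitly opt out is treated as leader-elected.
func needsLeaderElection(r Runnable) bool {
	if le, ok := r.(LeaderElectionRunnable); ok {
		return le.NeedLeaderElection()
	}
	return true
}

// provider stands in for a provider that declares it does not need leader election.
type provider struct{}

func (provider) Start(ctx context.Context) error { return nil }
func (provider) NeedLeaderElection() bool        { return false }

func main() {
	p := provider{}

	// Wrapping the method in a RunnableFunc drops NeedLeaderElection from the method set.
	wrapped := RunnableFunc(p.Start)

	fmt.Println("added directly, needs leader election:", needsLeaderElection(p))
	fmt.Println("wrapped in RunnableFunc, needs leader election:", needsLeaderElection(wrapped))
}
```

The type assertion succeeds for the provider itself but fails for the wrapped function, so the wrapped runnable falls back to the leader-elected default.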

Root cause

The webhook and the controllers share a single provider instance that conflates two distinct responsibilities:

  1. Maintaining the projects map for GetCluster() lookups (needed by the webhook on all pods)
  2. Calling mcAware.Engage() to start per-cluster controllers (must only happen on the leader)

These are tangled in Reconcile(): p.projects[key] = cl only happens after Engage() succeeds, so the webhook's lookup capability is gated on the same leader election that protects controller startup.

Options to explore

1. Separate webhook and controller deployments (standard operator pattern, cleanest)

Split into two distinct Deployments:

  • Webhook deployment: runs its own Milo provider, no leader election, no controllers — all pods serve webhook traffic with full projects map
  • Controller deployment: leader election enabled, runs controllers, no webhook server

2. Two provider instances within the same binary

Instantiate two independent Milo providers in main.go:

  • A webhook-only provider: started as a non-leader runnable, populates a cluster map without engaging any controllers (requires Milo provider to support a watch-only/no-engage mode)
  • A controller provider: the existing one, fully leader-gated
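
One way the watch-only mode for option 2 could look, as a sketch under stated assumptions: the `watchOnly` field is hypothetical (the Milo provider does not expose it today, per the open question below), and `provider`, `engager`, and `recordingEngager` are toy types rather than the real API. The point is only that populating the projects map no longer depends on Engage().

```go
package main

import (
	"errors"
	"fmt"
)

// engager is a stand-in for the multicluster-aware manager.
type engager interface {
	Engage(name string) error
}

// recordingEngager remembers which clusters had controllers started for them.
type recordingEngager struct{ engaged []string }

func (r *recordingEngager) Engage(name string) error {
	r.engaged = append(r.engaged, name)
	return nil
}

// provider is a toy model; watchOnly is a hypothetical option.
type provider struct {
	watchOnly bool
	mcAware   engager
	projects  map[string]string
}

// Reconcile always records the cluster, and engages controllers only when
// the provider is not in watch-only mode.
func (p *provider) Reconcile(name string) error {
	if !p.watchOnly {
		if p.mcAware == nil {
			return errors.New("multicluster manager not yet started")
		}
		if err := p.mcAware.Engage(name); err != nil {
			return err
		}
	}
	p.projects[name] = "cluster:" + name
	return nil
}

func main() {
	ctrl := &recordingEngager{}
	webhookProvider := &provider{watchOnly: true, projects: map[string]string{}}
	controllerProvider := &provider{mcAware: ctrl, projects: map[string]string{}}

	fmt.Println("webhook reconcile error:", webhookProvider.Reconcile("my-project"))
	fmt.Println("controller reconcile error:", controllerProvider.Reconcile("my-project"))
	fmt.Println("webhook knows project:", webhookProvider.projects["my-project"] != "")
	fmt.Println("controllers engaged:", ctrl.engaged)
}
```

The webhook-side instance can run on every pod because it never calls Engage(), while the controller-side instance keeps the existing leader-gated behaviour.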

3. Webhook-safe GetCluster() path

Give the webhook its own lightweight multicluster.Provider implementation that builds direct REST client connections to project control planes on demand (via https://milo-apiserver/.../projects/{name}/control-plane), without needing the shared provider or leader election at all.
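
A rough sketch of the on-demand lookup behind option 3, using only the standard library. Everything here is an assumption for illustration: `lazyClusters` is a hypothetical type, the base URL is supplied by the caller (the path prefix before /projects/ is elided above and is not guessed here), and the cached value is just the endpoint string where a real implementation would presumably cache a client or cluster object built from a REST config pointing at that URL.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"sync"
)

// lazyClusters is a hypothetical webhook-side lookup that resolves a project's
// control-plane endpoint on first use and caches the result.
type lazyClusters struct {
	base  *url.URL // everything that precedes /projects/{name}/control-plane
	mu    sync.Mutex
	cache map[string]string
}

func newLazyClusters(base string) (*lazyClusters, error) {
	u, err := url.Parse(base)
	if err != nil {
		return nil, err
	}
	return &lazyClusters{base: u, cache: map[string]string{}}, nil
}

// Get returns the cached endpoint for a project, building it on the first call.
// It needs no Start() call and no leader election.
func (l *lazyClusters) Get(project string) (string, error) {
	if project == "" {
		return "", errors.New("project name is empty")
	}
	l.mu.Lock()
	defer l.mu.Unlock()

	if endpoint, ok := l.cache[project]; ok {
		return endpoint, nil
	}
	endpoint := l.base.JoinPath("projects", project, "control-plane").String()
	l.cache[project] = endpoint
	return endpoint, nil
}

func main() {
	// The base URL below is a placeholder, not the real Milo API server address.
	clusters, err := newLazyClusters("https://example.invalid/prefix")
	if err != nil {
		panic(err)
	}
	endpoint, err := clusters.Get("my-project")
	fmt.Println(endpoint, err)
}
```

Note that this sketch never checks whether the project exists, so a webhook built this way would see an unknown project as a failed request against the endpoint rather than as a "cluster not found" lookup error.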

Workaround (staging)

Scale compute-manager to 1 replica. The single pod is always the leader, so the provider is always started and all webhook requests succeed.

kubectl scale deployment compute-manager -n compute-system --replicas=1

Questions to answer

  • Do we want to separate the webhook into its own Deployment? What does that mean for cert management and the existing kustomize structure?
  • Should the Milo provider support a "watch-only" mode (populate projects without Engage()) to enable option 2?
  • Is option 3 simpler to implement while we figure out the longer-term architecture?
