Webhook admission failures in HA deployments due to leader-gated cluster provider #117

## Problem

In HA deployments (2+ replicas with leader election), the admission webhook intermittently rejects valid workload creates with:

```
Error: creating workload: admission webhook "vworkload.kb.io" denied the request: cluster <project-name> not found
```

This is reproducible in staging today. The error is intermittent because it only occurs on requests routed to a non-leader pod: roughly half of webhook traffic with two replicas, and a larger share with more.

## What's happening

The `ValidateCreate` webhook handler calls `mgr.GetCluster(ctx, clusterName)`, which delegates to the Milo multicluster provider's `Get()`. The provider only has a cluster registered in its `projects` map after `Reconcile()` calls `mcAware.Engage()` — and `Engage()` requires `mcAware` to be set, which only happens when `provider.Start()` is called.

The problem: `mcManager.Start()` adds `provider.Start()` to the underlying controller-runtime manager as a `manager.RunnableFunc`, a type that does not implement `NeedLeaderElection()`. Controller-runtime defaults any runnable without that method to leader-elected, so `provider.Start()` only runs on the leader pod.
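To make the defaulting rule concrete, here is a minimal, dependency-free sketch. The `Runnable`, `RunnableFunc`, and `LeaderElectionRunnable` types are local mirrors of the controller-runtime types of the same names. `needsLeaderElection` is a hypothetical helper that approximates the manager's decision and is not the actual controller-runtime code; `alwaysOn` and `noop` are made up for illustration.

```go
package main

import (
	"context"
	"fmt"
)

// Runnable mirrors controller-runtime's manager.Runnable.
type Runnable interface {
	Start(ctx context.Context) error
}

// RunnableFunc mirrors manager.RunnableFunc: a function adapter whose only
// method is Start.
type RunnableFunc func(ctx context.Context) error

func (f RunnableFunc) Start(ctx context.Context) error { return f(ctx) }

// LeaderElectionRunnable mirrors manager.LeaderElectionRunnable.
type LeaderElectionRunnable interface {
	NeedLeaderElection() bool
}

// needsLeaderElection is a simplified stand-in (assumption) for the manager's
// choice: runnables that do not implement the opt-out interface are
// leader-gated by default.
func needsLeaderElection(r Runnable) bool {
	if le, ok := r.(LeaderElectionRunnable); ok {
		return le.NeedLeaderElection()
	}
	return true
}

// alwaysOn is a hypothetical runnable that opts out of leader election.
type alwaysOn struct{}

func (alwaysOn) Start(ctx context.Context) error { return nil }
func (alwaysOn) NeedLeaderElection() bool        { return false }

// noop is a hypothetical function-style runnable, like the one wrapping
// provider.Start().
var noop RunnableFunc = func(ctx context.Context) error { return nil }

func main() {
	fmt.Println("RunnableFunc leader-gated:", needsLeaderElection(noop))   // true
	fmt.Println("alwaysOn leader-gated:", needsLeaderElection(alwaysOn{})) // false
}
```

The function-style runnable has no way to opt out, which is why the provider only starts on the leader.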

Result:
- **Leader pod**: `provider.Start()` fires → `mcAware` set → projects reconciled → `p.projects` populated → `GetCluster()` works → webhooks succeed
- **Non-leader pod**: `provider.Start()` never fires → `mcAware` nil → every `Reconcile()` returns "Multicluster manager not yet started" → `p.projects` empty → `GetCluster()` always fails → webhooks always fail

The `compute-webhook` Service selects all pods (same label selector as the manager), so Kubernetes load-balances webhook traffic across both replicas.

This affects _every_ project, not just one. Any workload create or update routed to the non-leader will fail.

## Why the obvious fix doesn't work

Making the Milo provider implement `NeedLeaderElection() bool { return false }` doesn't help because the multicluster manager wraps `provider.Start()` in an anonymous `manager.RunnableFunc` before adding it — that wrapper doesn't forward the interface.

Even if it did, it would break controller isolation: `provider.Start()` on all pods → `Engage()` called on all pods → per-cluster controllers start on all pods, bypassing leader election entirely.
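The interface-hiding behavior can be sketched without any dependencies. The types below are local mirrors of controller-runtime's `manager.Runnable`, `manager.RunnableFunc`, and `manager.LeaderElectionRunnable`; `fixedProvider`, `wrapInFunc`, `forwardingRunnable`, and `optsOut` are hypothetical names used only for illustration, not Milo or multicluster-runtime code.

```go
package main

import (
	"context"
	"fmt"
)

// Local mirrors of the controller-runtime manager types.
type Runnable interface {
	Start(ctx context.Context) error
}

type RunnableFunc func(ctx context.Context) error

func (f RunnableFunc) Start(ctx context.Context) error { return f(ctx) }

type LeaderElectionRunnable interface {
	NeedLeaderElection() bool
}

// fixedProvider is a hypothetical provider with the "obvious fix" applied.
type fixedProvider struct{}

func (fixedProvider) Start(ctx context.Context) error { return nil }
func (fixedProvider) NeedLeaderElection() bool        { return false }

// wrapInFunc imitates wrapping Start in an anonymous RunnableFunc. The
// closure exposes only Start, so the opt-out method is no longer visible.
func wrapInFunc(r Runnable) Runnable {
	return RunnableFunc(func(ctx context.Context) error { return r.Start(ctx) })
}

// forwardingRunnable is a hypothetical wrapper that does forward the opt-out.
type forwardingRunnable struct{ inner Runnable }

func (f forwardingRunnable) Start(ctx context.Context) error { return f.inner.Start(ctx) }

func (f forwardingRunnable) NeedLeaderElection() bool {
	if le, ok := f.inner.(LeaderElectionRunnable); ok {
		return le.NeedLeaderElection()
	}
	return true // keep the leader-gated default when the inner type is silent
}

// optsOut reports whether a runnable visibly opts out of leader election.
func optsOut(r Runnable) bool {
	le, ok := r.(LeaderElectionRunnable)
	return ok && !le.NeedLeaderElection()
}

func main() {
	p := fixedProvider{}
	fmt.Println("direct:", optsOut(p))                                // true
	fmt.Println("closure:", optsOut(wrapInFunc(p)))                   // false
	fmt.Println("forwarding:", optsOut(forwardingRunnable{inner: p})) // true
}
```

A forwarding wrapper would make the opt-out visible, but as noted above that alone would start per-cluster controllers on every pod.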

## Root cause

The webhook and the controllers share a single provider instance that conflates two distinct responsibilities:
1. Maintaining the `projects` map for `GetCluster()` lookups (needed by the webhook on all pods)
2. Calling `mcAware.Engage()` to start per-cluster controllers (must only happen on the leader)

These are tangled in `Reconcile()`: `p.projects[key] = cl` only happens after `Engage()` succeeds, so the webhook's lookup capability is gated on the same leader election that protects controller startup.
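A toy model of the coupling, and of what untangling it might look like. Everything here is an assumption for illustration: `provider`, `engager`, `reconcileCoupled`, `reconcileDecoupled`, and `getCluster` are simplified stand-ins, not the Milo provider's actual types or signatures. Only the two error strings are taken from the behavior described above.

```go
package main

import (
	"errors"
	"fmt"
)

// cluster is a placeholder for the real cluster handle.
type cluster struct{ name string }

// engager stands in for the multicluster-aware manager (mcAware).
type engager interface {
	Engage(name string) error
}

type provider struct {
	mcAware  engager // stays nil unless Start() ran, i.e. on non-leader pods
	projects map[string]cluster
}

var errNotStarted = errors.New("Multicluster manager not yet started")

func newProvider() *provider {
	return &provider{projects: map[string]cluster{}}
}

// reconcileCoupled models the behavior described above: the map entry is
// written only after Engage succeeds, so lookups depend on leadership.
func (p *provider) reconcileCoupled(name string) error {
	if p.mcAware == nil {
		return errNotStarted
	}
	if err := p.mcAware.Engage(name); err != nil {
		return err
	}
	p.projects[name] = cluster{name: name}
	return nil
}

// reconcileDecoupled is a hypothetical alternative: register the cluster for
// lookups first, and engage controllers only where the manager has started.
func (p *provider) reconcileDecoupled(name string) error {
	p.projects[name] = cluster{name: name}
	if p.mcAware == nil {
		return nil // non-leader: lookups work, no controllers are started
	}
	return p.mcAware.Engage(name)
}

func (p *provider) getCluster(name string) (cluster, error) {
	cl, ok := p.projects[name]
	if !ok {
		return cluster{}, fmt.Errorf("cluster %s not found", name)
	}
	return cl, nil
}

func main() {
	coupled, decoupled := newProvider(), newProvider() // both model a non-leader pod

	fmt.Println("coupled reconcile:", coupled.reconcileCoupled("proj-a"))
	_, err := coupled.getCluster("proj-a")
	fmt.Println("coupled lookup:", err)

	fmt.Println("decoupled reconcile:", decoupled.reconcileDecoupled("proj-a"))
	_, err = decoupled.getCluster("proj-a")
	fmt.Println("decoupled lookup:", err)
}
```

The decoupled variant leaves open questions the toy model ignores, such as concurrent access to the map and engaging already-registered clusters when a pod later becomes leader.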

## Options to explore

**1. Separate webhook and controller deployments** (standard operator pattern, cleanest)

Split into two distinct Deployments:
- Webhook deployment: runs its own Milo provider, no leader election, no controllers — all pods serve webhook traffic with full `projects` map
- Controller deployment: leader election enabled, runs controllers, no webhook server

**2. Two provider instances within the same binary**

Instantiate two independent Milo providers in `main.go`:
- A webhook-only provider: started as a non-leader runnable, populates a cluster map without engaging any controllers (requires Milo provider to support a watch-only/no-engage mode)
- A controller provider: the existing one, fully leader-gated

**3. Webhook-safe `GetCluster()` path**

Give the webhook its own lightweight `multicluster.Provider` implementation that builds direct REST client connections to project control planes on demand (via `https://milo-apiserver/.../projects/{name}/control-plane`), without needing the shared provider or leader election at all.
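A rough sketch of the on-demand lookup idea, under stated assumptions: `lazyClusters`, `conn`, and the `connect` callback are hypothetical names, and the sketch does not implement the real `multicluster.Provider` interface or build the actual control-plane URL. It only shows the shape of a per-pod cache that needs no shared provider and no leader election.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// conn is a placeholder for a client connected to one project control plane.
type conn struct{ project string }

// errUnknownProject is a hypothetical error for projects that cannot be reached.
var errUnknownProject = errors.New("unknown project")

// lazyClusters builds connections on first use and caches them per pod.
type lazyClusters struct {
	mu      sync.Mutex
	connect func(project string) (conn, error) // assumption: injected factory
	cache   map[string]conn
}

func newLazyClusters(connect func(string) (conn, error)) *lazyClusters {
	return &lazyClusters{connect: connect, cache: map[string]conn{}}
}

// Get returns the cached connection or builds one. Failures are not cached,
// so a project that appears later is picked up on the next request.
func (l *lazyClusters) Get(project string) (conn, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if c, ok := l.cache[project]; ok {
		return c, nil
	}
	c, err := l.connect(project)
	if err != nil {
		return conn{}, fmt.Errorf("cluster %s not found: %w", project, err)
	}
	l.cache[project] = c
	return c, nil
}

func main() {
	calls := 0
	clusters := newLazyClusters(func(project string) (conn, error) {
		calls++
		if project == "" {
			return conn{}, errUnknownProject
		}
		return conn{project: project}, nil
	})

	clusters.Get("proj-a")
	clusters.Get("proj-a") // served from the cache
	fmt.Println("connect calls:", calls) // 1
}
```

A real implementation would also need to decide how to evict connections for deleted projects and whether holding the lock during `connect` is acceptable for webhook latency.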

## Workaround (staging)

Scale `compute-manager` to 1 replica. The single pod is always the leader, so the provider is always started and all webhook requests succeed. This gives up HA, so it is only a stopgap.

```bash
kubectl scale deployment compute-manager -n compute-system --replicas=1
```

## Questions to answer

- Do we want to separate the webhook into its own Deployment? What does that mean for cert management and the existing kustomize structure?
- Should the Milo provider support a "watch-only" mode (populate `projects` without `Engage()`) to enable option 2?
- Is option 3 simple enough to implement as an interim fix while we figure out the longer-term architecture?
