fix: disable APF feature flag to prevent readyz-blocking informers by scotwells · Pull Request #30 · milo-os/ipam

scotwells · 2026-05-23T16:11:26Z

Problem

The IPAM apiserver pods are stuck 0/1 Ready in staging. The readiness probe returns HTTP 500 indefinitely:

informer-sync failed: 2 informers not started yet: [*v1.FlowSchema *v1.PriorityLevelConfiguration]

Why the previous fix (#29) didn't work

PR #29 moved genericConfig.FlowControl = nil to after ApplyTo, reasoning that ApplyTo re-initializes the field. That was correct but incomplete.

The real problem: FeatureOptions.ApplyTo calls utilflowcontrol.New(informers, ...), which registers FlowSchema and PriorityLevelConfiguration event handlers directly on the SharedInformerFactory. Setting FlowControl = nil afterward removes the controller reference but does nothing to the factory — those informers remain registered and appear in the informer-sync readyz check, where they block readyz because the IPAM apiserver has no flowcontrol.apiserver.k8s.io access.

Fix

Set EnablePriorityAndFairness = false on RecommendedOptions.Features in NewIPAMServerOptions(), before ApplyTo is ever called. This causes FeatureOptions.ApplyTo to skip the utilflowcontrol.New() call entirely — the informers are never registered, and readyz is unblocked.

The now-redundant genericConfig.FlowControl = nil is removed.
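The ordering problem can be illustrated with a small stand-alone model. This is a minimal sketch, not the real k8s.io/apiserver code: `informerFactory`, `config`, `featureOptions`, `applyTo`, and `readyz` are hypothetical stand-ins that only mimic the behavior described above, where registering the flow-control informers is a side effect of `ApplyTo` that clearing `FlowControl` afterward does not undo.

```go
package main

import (
	"fmt"
	"sort"
)

// informerFactory is a hypothetical stand-in for the shared informer factory.
// It only tracks which informer types were registered and whether they synced.
type informerFactory struct {
	registered map[string]bool // type name -> synced
}

func newInformerFactory() *informerFactory {
	return &informerFactory{registered: map[string]bool{}}
}

// register records an informer; in this model it never syncs.
func (f *informerFactory) register(typeName string) {
	f.registered[typeName] = false
}

// unsynced returns the sorted names of informers that have not synced.
func (f *informerFactory) unsynced() []string {
	var names []string
	for name, synced := range f.registered {
		if !synced {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

// flowController and config are hypothetical stand-ins for the APF controller
// and the generic server config.
type flowController struct{}

type config struct {
	flowControl *flowController
	informers   *informerFactory
}

// featureOptions mirrors only the one flag discussed in this PR.
type featureOptions struct {
	EnablePriorityAndFairness bool
}

// applyTo models the described behavior: with APF enabled, building the
// controller registers two informers on the factory as a side effect.
func (o featureOptions) applyTo(c *config) {
	if !o.EnablePriorityAndFairness {
		return
	}
	c.informers.register("*v1.FlowSchema")
	c.informers.register("*v1.PriorityLevelConfiguration")
	c.flowControl = &flowController{}
}

// readyz models the informer-sync check: it fails while any registered
// informer is unsynced, regardless of whether flowControl is set.
func readyz(c *config) error {
	if pending := c.informers.unsynced(); len(pending) > 0 {
		return fmt.Errorf("informer-sync failed: %d informers not started yet: %v", len(pending), pending)
	}
	return nil
}

// previousFix models PR #29: clear the controller after applyTo.
func previousFix() error {
	c := &config{informers: newInformerFactory()}
	featureOptions{EnablePriorityAndFairness: true}.applyTo(c)
	c.flowControl = nil // too late: the informers stay registered
	return readyz(c)
}

// thisFix models this PR: disable the flag before applyTo runs.
func thisFix() error {
	c := &config{informers: newInformerFactory()}
	featureOptions{EnablePriorityAndFairness: false}.applyTo(c)
	return readyz(c)
}

func main() {
	fmt.Println("nil FlowControl after ApplyTo:", previousFix())
	fmt.Println("disable APF before ApplyTo:", thisFix())
}
```

In this model `previousFix` still reports the two pending informers, while `thisFix` returns no error because nothing was ever registered on the factory.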

🤖 Generated with Claude Code

scotwells · 2026-05-24T01:37:46Z

Closing — this fix is not needed. The APF informers only failed because AUTHENTICATION_SKIP_LOOKUP=true was manually set in the staging deployment, which broke the kube-apiserver client setup and left the informers unable to sync. The quota and activity services both run with APF enabled and work correctly in staging. The real fix is reverting the skip-lookup drift to match the base config defaults (false).

scotwells · 2026-05-24T04:32:48Z

Re-opening. Neither the quota nor the activity service has a NetworkPolicy, so they can't confirm the default:443 egress rule is sufficient for APF informers in GKE. IPAM is the only aggregated apiserver behind a NetworkPolicy, and the informer failure is likely a genuine connectivity issue, not a side-effect of skip-lookup=true. Disabling APF is still the correct fix.

IPAM is a delegating aggregated apiserver: API Priority and Fairness is enforced by the main kube-apiserver, not here.

With APF enabled, FeatureOptions.ApplyTo calls utilflowcontrol.New(), which registers FlowSchema and PriorityLevelConfiguration event handlers on the shared informer factory. Those informers are then counted by the informer-sync readyz check and never reliably sync (they require list/watch on flowcontrol.apiserver.k8s.io against the host apiserver), so /readyz returns 500 forever and the pod never becomes Ready. The aggregation layer then never registers the APIService.

The previous fix nil-ed genericConfig.FlowControl after ApplyTo, but that is too late: the informers are already registered by the time ApplyTo returns. Set EnablePriorityAndFairness=false before ApplyTo so utilflowcontrol.New() is never called and the informers are never registered, and drop the now-redundant FlowControl=nil line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

scotwells · 2026-06-17T20:35:13Z

Rebased onto current main and force-pushed (was conflicting — the branch predated the IPPool/IPClaim/IPAllocation refactor). Now a single clean commit against main: serve.go only, +11/-6.

What the fix does: sets RecommendedOptions.Features.EnablePriorityAndFairness = false before ApplyTo, so FeatureOptions.ApplyTo never calls utilflowcontrol.New() and the FlowSchema/PriorityLevelConfiguration informers are never registered. Also drops the prior genericConfig.FlowControl = nil line (PR #29), which was too late to help — the informers are already registered by the time ApplyTo returns, so informer-sync still failed and /readyz stayed 500 forever.

On the NetworkPolicy hypothesis (it's a red herring): config/base/networkpolicy.yaml already permits egress to the kube-apiserver (kube-system 6443/443, default ns 443) plus DNS — no rule is missing. On staging the informers do eventually sync (Caches populated for both types; Deployment currently 2/2 Available), but the flowcontrol watches are flaky (watch ended with error: client connection lost), so readiness depends on a fragile, unnecessary dependency. Disabling APF removes that dependency entirely rather than patching egress. No NetworkPolicy change needed.

Merging this to main triggers release.yml → new ghcr.io/milo-os/ipam image + ipam-kustomize bundle, which staging's OCIRepository (tracking latest *-main-*) auto-pulls.

scotwells · 2026-06-17T20:40:55Z

Added a second commit for the ingress side, which is the reason the APIService is still unavailable even though the pods are now Ready.

The aggregation front-proxy is in a different namespace than the NetworkPolicy assumes. On Datum staging the apiserver that hosts and proxies v1alpha1.ipam.miloapis.com is milo-apiserver, running in datum-system (kubectl -n datum-system get deploy milo-apiserver → 3/3). But the ingress rules only permit 8443 from kube-system and telemetry-system. So milo-apiserver in datum-system can't reach the IPAM Service to proxy aggregated requests → the APIService never reports Available, regardless of pod readiness. (The host GKE apiserver returns NotFound for the APIService because it's registered in Milo, not the host.) Auth is --authentication-skip-lookup=true (pure front-proxy mTLS), so this is the connection path that matters.

Fix adds an ingress rule allowing datum-system → 8443, keeping the kube-system rule for vanilla kind/kubeadm local dev. Egress is unchanged — delegated authz reaches the GKE host apiserver via the SA token and already works (informers sync on staging).
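The effect of the added ingress rule can be sketched with a toy model. This is an illustration under stated assumptions, not the contents of config/base/networkpolicy.yaml: `ingressRule`, `allowed`, `rulesBefore`, and `rulesAfter` are hypothetical names, and the model matches namespaces by name, whereas real NetworkPolicy rules select namespaces by label.

```go
package main

import "fmt"

// ingressRule is a hypothetical, simplified stand-in for one NetworkPolicy
// ingress rule: it admits traffic from the listed namespaces to one port.
type ingressRule struct {
	fromNamespaces []string
	port           int
}

// allowed reports whether any rule admits traffic from ns to port.
// Traffic that matches no rule is dropped in this model.
func allowed(rules []ingressRule, ns string, port int) bool {
	for _, r := range rules {
		if r.port != port {
			continue
		}
		for _, from := range r.fromNamespaces {
			if from == ns {
				return true
			}
		}
	}
	return false
}

// rulesBefore models the ingress rules described above: 8443 is reachable
// only from kube-system and telemetry-system.
func rulesBefore() []ingressRule {
	return []ingressRule{
		{fromNamespaces: []string{"kube-system", "telemetry-system"}, port: 8443},
	}
}

// rulesAfter keeps the existing rule and adds datum-system on 8443.
func rulesAfter() []ingressRule {
	return append(rulesBefore(), ingressRule{
		fromNamespaces: []string{"datum-system"},
		port:           8443,
	})
}

func main() {
	for _, ns := range []string{"kube-system", "telemetry-system", "datum-system"} {
		fmt.Printf("%s: before=%t after=%t\n",
			ns, allowed(rulesBefore(), ns, 8443), allowed(rulesAfter(), ns, 8443))
	}
}
```

Under this model, milo-apiserver in datum-system is rejected by the original rules and admitted once the new rule is present, while the kube-system path used for local dev is unaffected.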

PR now has two commits: (1) disable APF (readyz), (2) allow datum-system ingress (APIService reachability). Both needed for staging to come fully Available.

scotwells closed this May 24, 2026

scotwells reopened this May 24, 2026

yahyafakhroji force-pushed the fix/disable-apf-before-apply-to branch from 1799a10 to f6f7489 Compare May 24, 2026 18:05

yahyafakhroji force-pushed the main branch from e57ce11 to 3c24ae9 Compare May 24, 2026 18:05

scotwells force-pushed the fix/disable-apf-before-apply-to branch from bb60647 to d69270c Compare June 17, 2026 20:34

scotwells force-pushed the fix/disable-apf-before-apply-to branch from dc6fc64 to d69270c Compare June 17, 2026 20:52

scotwells wants to merge 1 commit into main from fix/disable-apf-before-apply-to
