
fix: disable APF feature flag to prevent readyz-blocking informers #30

Open
scotwells wants to merge 1 commit into main from fix/disable-apf-before-apply-to

@scotwells

Contributor

Problem

The IPAM apiserver pods are stuck 0/1 Ready in staging. The readiness probe returns HTTP 500 indefinitely:

informer-sync failed: 2 informers not started yet: [*v1.FlowSchema *v1.PriorityLevelConfiguration]

Why the previous fix (#29) didn't work

PR #29 moved genericConfig.FlowControl = nil to after ApplyTo, reasoning that ApplyTo re-initializes the field. That was correct but incomplete.

The real problem: FeatureOptions.ApplyTo calls utilflowcontrol.New(informers, ...), which registers FlowSchema and PriorityLevelConfiguration event handlers directly on the SharedInformerFactory. Setting FlowControl = nil afterward removes the controller reference but does nothing to the factory. Those informers remain registered and are still counted by the informer-sync readyz check. Because the IPAM apiserver has no access to flowcontrol.apiserver.k8s.io, they never sync, and /readyz keeps failing.

Fix

Set EnablePriorityAndFairness = false on RecommendedOptions.Features in NewIPAMServerOptions(), before ApplyTo is ever called. This causes FeatureOptions.ApplyTo to skip the utilflowcontrol.New() call entirely — the informers are never registered, and readyz is unblocked.

The now-redundant genericConfig.FlowControl = nil is removed.
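
The ordering issue can be illustrated with a minimal, stdlib-only Go sketch. It does not use the real k8s.io/apiserver types: `informerFactory`, `featureOptions`, `serverConfig`, and `readyz` are hypothetical stand-ins that model only the behavior described above. The sketch also assumes that a registered flowcontrol informer never syncs, mirroring what was observed in staging.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// informerFactory is a hypothetical stand-in for a shared informer factory.
// It records which informer types were registered and whether each has synced.
type informerFactory struct {
	synced map[string]bool
}

func newInformerFactory() *informerFactory {
	return &informerFactory{synced: map[string]bool{}}
}

// register adds an informer in the "not synced" state (assumption: it never syncs).
func (f *informerFactory) register(kind string) {
	if _, ok := f.synced[kind]; !ok {
		f.synced[kind] = false
	}
}

// serverConfig models only the FlowControl field discussed in this PR.
type serverConfig struct {
	flowControl *struct{}
}

// featureOptions models only the EnablePriorityAndFairness flag.
type featureOptions struct {
	enablePriorityAndFairness bool
}

// applyTo mimics the described side effect: with APF enabled, it registers
// both flowcontrol informers on the factory and sets the controller reference.
func (o featureOptions) applyTo(c *serverConfig, f *informerFactory) {
	if !o.enablePriorityAndFairness {
		return
	}
	f.register("*v1.FlowSchema")
	f.register("*v1.PriorityLevelConfiguration")
	c.flowControl = &struct{}{}
}

// readyz models the informer-sync check: it inspects the factory only and
// never looks at serverConfig.flowControl.
func readyz(f *informerFactory) error {
	var pending []string
	for kind, ok := range f.synced {
		if !ok {
			pending = append(pending, kind)
		}
	}
	if len(pending) == 0 {
		return nil
	}
	sort.Strings(pending)
	return fmt.Errorf("informer-sync failed: %d informers not started yet: [%s]",
		len(pending), strings.Join(pending, " "))
}

// nilAfterApplyTo reproduces the approach from PR #29.
func nilAfterApplyTo() error {
	cfg, f := &serverConfig{}, newInformerFactory()
	featureOptions{enablePriorityAndFairness: true}.applyTo(cfg, f)
	cfg.flowControl = nil // too late: the factory still holds both informers
	return readyz(f)
}

// disableBeforeApplyTo reproduces the approach in this PR.
func disableBeforeApplyTo() error {
	cfg, f := &serverConfig{}, newInformerFactory()
	featureOptions{enablePriorityAndFairness: false}.applyTo(cfg, f)
	return readyz(f)
}

func main() {
	fmt.Println("nil after ApplyTo:     ", nilAfterApplyTo())
	fmt.Println("disable before ApplyTo:", disableBeforeApplyTo())
}
```

In this model, `nilAfterApplyTo` still returns the informer-sync error quoted in the Problem section, while `disableBeforeApplyTo` returns nil because nothing was ever registered on the factory.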

🤖 Generated with Claude Code

@scotwells

Contributor Author

Closing — this fix is not needed. The APF informers only failed because AUTHENTICATION_SKIP_LOOKUP=true was manually set in the staging deployment, which broke the kube-apiserver client setup and left the informers unable to sync. The quota and activity services both run with APF enabled and work correctly in staging. The real fix is reverting the skip-lookup drift to match the base config defaults (false).

scotwells closed this on May 24, 2026
@scotwells

Contributor Author

Re-opening. Neither the quota nor the activity service has a NetworkPolicy, so the fact that they run with APF enabled does not confirm that the default:443 egress rule is sufficient for APF informers in GKE. IPAM is the only aggregated apiserver behind a NetworkPolicy, and the informer failure is likely a genuine connectivity issue, not a side effect of skip-lookup=true. Disabling APF is still the correct fix.

IPAM is a delegating aggregated apiserver: API Priority and Fairness is
enforced by the main kube-apiserver, not here. With APF enabled,
FeatureOptions.ApplyTo calls utilflowcontrol.New(), which registers
FlowSchema and PriorityLevelConfiguration event handlers on the shared
informer factory. Those informers are then counted by the informer-sync
readyz check and never reliably sync (they require list/watch on
flowcontrol.apiserver.k8s.io against the host apiserver), so /readyz
returns 500 forever and the pod never becomes Ready. The aggregation
layer then never registers the APIService.

The previous fix nil-ed genericConfig.FlowControl after ApplyTo, but that
is too late: the informers are already registered by the time ApplyTo
returns. Set EnablePriorityAndFairness=false before ApplyTo so
utilflowcontrol.New() is never called and the informers are never
registered, and drop the now-redundant FlowControl=nil line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
scotwells force-pushed the fix/disable-apf-before-apply-to branch from bb60647 to d69270c on June 17, 2026 20:34
@scotwells

Contributor Author

Rebased onto current main and force-pushed (was conflicting — the branch predated the IPPool/IPClaim/IPAllocation refactor). Now a single clean commit against main: serve.go only, +11/-6.

What the fix does: sets RecommendedOptions.Features.EnablePriorityAndFairness = false before ApplyTo, so FeatureOptions.ApplyTo never calls utilflowcontrol.New() and the FlowSchema/PriorityLevelConfiguration informers are never registered. Also drops the prior genericConfig.FlowControl = nil line (PR #29), which was too late to help — the informers are already registered by the time ApplyTo returns, so informer-sync still failed and /readyz stayed 500 forever.

On the NetworkPolicy hypothesis (it's a red herring): config/base/networkpolicy.yaml already permits egress to the kube-apiserver (kube-system 6443/443, default ns 443) plus DNS — no rule is missing. On staging the informers do eventually sync (Caches populated for both types; Deployment currently 2/2 Available), but the flowcontrol watches are flaky (watch ended with error: client connection lost), so readiness depends on a fragile, unnecessary dependency. Disabling APF removes that dependency entirely rather than patching egress. No NetworkPolicy change needed.

Merging this to main triggers release.yml → new ghcr.io/milo-os/ipam image + ipam-kustomize bundle, which staging's OCIRepository (tracking latest *-main-*) auto-pulls.

@scotwells

Contributor Author

Added a second commit for the ingress side, which is the reason the APIService is still unavailable even though the pods are now Ready.

The aggregation front-proxy runs in a different namespace than the NetworkPolicy assumes. On Datum staging, the apiserver that hosts and proxies v1alpha1.ipam.miloapis.com is milo-apiserver, running in datum-system (kubectl -n datum-system get deploy milo-apiserver → 3/3). The ingress rules, however, only permit 8443 from kube-system and telemetry-system. As a result, milo-apiserver in datum-system can't reach the IPAM Service to proxy aggregated requests, and the APIService never reports Available, regardless of pod readiness. (The host GKE apiserver returns NotFound for the APIService because it's registered in Milo, not on the host.) Auth is --authentication-skip-lookup=true (pure front-proxy mTLS), so this is the connection path that matters.

Fix adds an ingress rule allowing datum-system → 8443, keeping the kube-system rule for vanilla kind/kubeadm local dev. Egress is unchanged — delegated authz reaches the GKE host apiserver via the SA token and already works (informers sync on staging).
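
The effect of the added rule can be sketched with a small Go model of allow-list matching. This is a simplification under stated assumptions: `ingressRule`, `allowed`, `before`, and `after` are hypothetical names, and sources are matched by namespace name, whereas a real NetworkPolicy selects namespaces by label. The rule sets only mirror the namespaces and port named in this comment, not the actual contents of config/base/networkpolicy.yaml.

```go
package main

import "fmt"

// ingressRule is a hypothetical, simplified model of one ingress rule:
// traffic is admitted when both the source namespace and the port match.
type ingressRule struct {
	namespace string
	port      int
}

// allowed reports whether any rule admits traffic from namespace to port.
// Ingress rules are treated as an additive allow-list, so one match suffices
// and anything unmatched is denied.
func allowed(rules []ingressRule, namespace string, port int) bool {
	for _, r := range rules {
		if r.namespace == namespace && r.port == port {
			return true
		}
	}
	return false
}

// before models the sources described as permitted on 8443 prior to the fix.
func before() []ingressRule {
	return []ingressRule{
		{namespace: "kube-system", port: 8443},
		{namespace: "telemetry-system", port: 8443},
	}
}

// after models the fix: the existing rules plus datum-system on 8443.
func after() []ingressRule {
	return append(before(), ingressRule{namespace: "datum-system", port: 8443})
}

func main() {
	for _, ns := range []string{"kube-system", "telemetry-system", "datum-system"} {
		fmt.Printf("%s: before=%v after=%v\n",
			ns, allowed(before(), ns, 8443), allowed(after(), ns, 8443))
	}
}
```

In this model, traffic from datum-system is denied by the original rule set and admitted once the new rule is appended, while kube-system stays admitted in both cases, which matches the intent of keeping the kube-system rule for local dev.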

PR now has two commits: (1) disable APF (readyz), (2) allow datum-system ingress (APIService reachability). Both needed for staging to come fully Available.

scotwells force-pushed the fix/disable-apf-before-apply-to branch from dc6fc64 to d69270c on June 17, 2026 20:52