fix(cilium): restart agents via a separate v0.19.2 migration#690
Merged
…fig applies

The startup migration (#689) ran `cilium upgrade --values` to flip existing clusters to loadBalancer.acceleration=disabled, but the Cilium chart's agent pod template has no checksum/config annotation, so a ConfigMap-only upgrade does not roll the DaemonSet. The agent reads bpf-lb-acceleration only at startup, so on the testnet blk hosts the ConfigMap flipped to "disabled" while the XDP program stayed attached to the public NIC (the agent pod never restarted). The fix was staged but never applied to the datapath.

After `cilium upgrade`, restart the Cilium agents and wait for the rollout so they re-read the config and detach XDP:

- kube.Client.RolloutRestart (new) stamps the pod template restartedAt annotation via the dynamic client (adds KindDaemonSet + GVR).
- WaitForResource with a daemonSetRolledOut CheckFunc waits until the new spec is observed and every scheduled pod is updated+ready; transient get errors keep polling, since detaching XDP can briefly blip the NIC the API rides on.

A failed restart now fails Execute instead of leaving XDP attached.

Signed-off-by: Nathan Klick <nathan@swirldslabs.com>
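The restart-and-wait step described in this commit can be sketched with the standard library alone. This is a minimal illustration, not the code from the PR: the annotation key is assumed to be the one `kubectl rollout restart` stamps, and `dsState`, `restartPatch`, and `rolledOut` are hypothetical names. The struct fields mirror the DaemonSet metadata and status fields of the same names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// restartedAtKey is the annotation that `kubectl rollout restart` stamps.
// Using the same key here is an assumption; the PR text only says
// "restartedAt annotation".
const restartedAtKey = "kubectl.kubernetes.io/restartedAt"

// restartPatch builds a merge-patch body that sets the annotation on the
// pod template. Any change to the template makes the DaemonSet controller
// replace the pods, which is what forces the agents to re-read their config.
func restartPatch(timestamp string) ([]byte, error) {
	patch := map[string]interface{}{
		"spec": map[string]interface{}{
			"template": map[string]interface{}{
				"metadata": map[string]interface{}{
					"annotations": map[string]string{restartedAtKey: timestamp},
				},
			},
		},
	}
	return json.Marshal(patch)
}

// dsState is a hypothetical, flattened view of the DaemonSet fields a
// rollout check needs.
type dsState struct {
	Generation             int64 // metadata.generation
	ObservedGeneration     int64 // status.observedGeneration
	DesiredNumberScheduled int32 // status.desiredNumberScheduled
	UpdatedNumberScheduled int32 // status.updatedNumberScheduled
	NumberReady            int32 // status.numberReady
}

// rolledOut reports whether the controller has observed the new spec and
// every scheduled pod is both updated and ready.
func rolledOut(s dsState) bool {
	if s.ObservedGeneration < s.Generation {
		return false // status still describes the previous spec
	}
	return s.UpdatedNumberScheduled >= s.DesiredNumberScheduled &&
		s.NumberReady >= s.DesiredNumberScheduled
}

func main() {
	body, err := restartPatch(time.Now().UTC().Format(time.RFC3339))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
	fmt.Println(rolledOut(dsState{
		Generation: 2, ObservedGeneration: 2,
		DesiredNumberScheduled: 3, UpdatedNumberScheduled: 3, NumberReady: 3,
	}))
}
```

Checking `ObservedGeneration` first matters: right after the patch, the status counters can still describe the old rollout and would otherwise look complete.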
✅ Snyk checks have passed. No issues have been found so far.
Move the Cilium agent restart out of the v0.19.1 CiliumAccelerationMigration and into its own CiliumAgentRestartMigration gated on the v0.19.2 boundary.

The v0.19.1 migration already ran on the testnet blk hosts (installed == 0.19.1): it flipped the cilium-config ConfigMap to acceleration=disabled but, because the Cilium chart has no checksum/config annotation, never rolled the DaemonSet, so the agents kept the old config with XDP still attached to the public NIC. Folding the restart into that migration would not help those hosts, since it no longer fires once installed >= 0.19.1.

A separate 0.19.2-gated migration fires on the 0.19.1 -> 0.19.2 upgrade and restarts the agents so they apply the disabled config (detach XDP). It is registered after CiliumAccelerationMigration, so a 0.18.x -> 0.19.2 upgrade flips the config then restarts in one pass. It only restarts when k8s + cilium are installed and acceleration is already disabled; a failed restart fails the migration.

Adds kube.Client.RolloutRestart (+ KindDaemonSet) and a daemonSetRolledOut wait that tolerates the brief NIC blip XDP detach can cause. The v0.19.1 acceleration migration is unchanged.

Signed-off-by: Nathan Klick <nathan@swirldslabs.com>
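The version-gating argument above can be made concrete with a small sketch. It assumes a migration fires when the installed version is strictly below its minimum version and that migrations run in registration order. The 0.19.1 gate for the acceleration migration is inferred from the text, and the types and helper names are hypothetical, not the project's API.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// version is a plain major.minor.patch triple (no pre-release handling).
type version [3]int

// parseVersion accepts "0.19.2" or "v0.19.2".
func parseVersion(s string) (version, error) {
	var v version
	parts := strings.Split(strings.TrimPrefix(s, "v"), ".")
	if len(parts) != 3 {
		return v, fmt.Errorf("expected major.minor.patch, got %q", s)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return v, fmt.Errorf("bad component %q in %q", p, s)
		}
		v[i] = n
	}
	return v, nil
}

// less compares component by component, most significant first.
func (a version) less(b version) bool {
	for i := range a {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return false
}

// migration is a hypothetical stand-in for a registered startup migration.
type migration struct {
	name       string
	minVersion string
}

// registry lists migrations in registration order: acceleration first,
// agent restart second.
var registry = []migration{
	{"CiliumAccelerationMigration", "0.19.1"},
	{"CiliumAgentRestartMigration", "0.19.2"},
}

// pending returns the names of migrations that fire for a host currently at
// the installed version: those with installed < minVersion, in order.
func pending(installed string) ([]string, error) {
	cur, err := parseVersion(installed)
	if err != nil {
		return nil, err
	}
	var names []string
	for _, m := range registry {
		min, err := parseVersion(m.minVersion)
		if err != nil {
			return nil, err
		}
		if cur.less(min) {
			names = append(names, m.name)
		}
	}
	return names, nil
}

func main() {
	for _, installed := range []string{"0.18.4", "0.19.1", "0.19.2"} {
		names, err := pending(installed)
		if err != nil {
			panic(err)
		}
		fmt.Println(installed, names)
	}
}
```

Under these assumptions a host at 0.19.1 gets only the restart migration, while a host at 0.18.x gets both in order, which matches the behavior the commit message describes.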
Problem
Follow-up to #689 (shipped in v0.19.1). That migration runs `cilium upgrade --values` to flip existing clusters from `loadBalancer.acceleration: best-effort` → `disabled` (#669/#674). Verified on the testnet `blk` hosts after rolling v0.19.1:

- `cilium-config` ConfigMap flipped to `disabled` (helm release `cilium` → rev 2) ✅
- XDP still attached to the public NIC (`prog/xdp … cil_xdp_entry` on `eno1`) and the cilium agent never restarted (age 25h, 0 restarts) ❌

Root cause: the Cilium chart's agent pod template has no `checksum/config` annotation, so a ConfigMap-only `helm upgrade` doesn't roll the DaemonSet. The agent reads `bpf-lb-acceleration` only at startup, so the change was staged but never applied.

Fix: a separate migration tied to v0.19.2
The restart is its own migration (`CiliumAgentRestartMigration`, `minVersion = 0.19.2`), not folded into the v0.19.1 migration, because the `blk` hosts already ran the v0.19.1 migration (installed == 0.19.1), so it won't re-fire on them. A 0.19.2-gated migration fires on the 0.19.1 → 0.19.2 upgrade and restarts the staged-but-not-applied agents.

- Registered after `CiliumAccelerationMigration`, so a `0.18.x → 0.19.2` upgrade flips the config then restarts in one pass.
- `Execute` restarts only when k8s + cilium are installed and acceleration is already `disabled` (the acceleration migration owns flipping the config); no-op otherwise.
- `kube.Client.RolloutRestart(ctx, KindDaemonSet, "kube-system", "cilium")` (via the dynamic client; adds `KindDaemonSet` + GVR) + a `WaitForResource`/`daemonSetRolledOut` rollout wait that tolerates a transient NIC blip during XDP detach.
- A failed restart fails `Execute`. The v0.19.1 acceleration migration is unchanged.

Review guide: `docs/claude/reviews/00669-cilium-migration-agent-restart.md`.
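The rollout wait that tolerates a transient NIC blip can be pictured as a polling loop that treats a failed check as "not done yet" and gives up only when its deadline passes. The sketch below uses only the standard library. `checkFunc`, `waitFor`, and `simulateRollout` are hypothetical names, and the real `WaitForResource` signature is not shown in this PR text.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// checkFunc reports whether the awaited condition holds. An error means the
// state could not be read right now, not that the rollout failed.
type checkFunc func(ctx context.Context) (done bool, err error)

// waitFor polls check until it succeeds or ctx expires. Check errors are
// remembered for the final message but never abort the loop, so a brief API
// outage while XDP detaches does not fail the wait.
func waitFor(ctx context.Context, interval time.Duration, check checkFunc) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	var lastErr error
	for {
		done, err := check(ctx)
		if err == nil && done {
			return nil
		}
		if err != nil {
			lastErr = err
		}
		select {
		case <-ctx.Done():
			if lastErr != nil {
				return fmt.Errorf("wait ended: %w (last check error: %v)", ctx.Err(), lastErr)
			}
			return fmt.Errorf("wait ended: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

// simulateRollout drives waitFor with a fake check that fails `failures`
// times, then reports not-ready `notReady` times, then succeeds.
// It returns how many times the check was polled.
func simulateRollout(failures, notReady, timeoutMs int) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()
	polls := 0
	err := waitFor(ctx, time.Millisecond, func(context.Context) (bool, error) {
		polls++
		switch {
		case polls <= failures:
			return false, errors.New("connection refused") // simulated API blip
		case polls <= failures+notReady:
			return false, nil // reachable, rollout still in progress
		default:
			return true, nil
		}
	})
	return polls, err
}

func main() {
	polls, err := simulateRollout(2, 1, 2000)
	fmt.Println(polls, err)
}
```

The deadline on the context is what still turns a restart that never completes into an error, so tolerating transient errors does not make the wait unbounded.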