
fix(cilium): restart agents via a separate v0.19.2 migration #690

Merged
brunodam merged 2 commits into main from fix/cilium-migration-agent-restart on Jun 13, 2026


@nathanklick nathanklick commented Jun 12, 2026

Member

Problem

Follow-up to #689 (shipped in v0.19.1). That migration runs `cilium upgrade --values` to flip existing clusters from `loadBalancer.acceleration: best-effort` to `disabled` (#669/#674). Verified on the testnet blk hosts after rolling v0.19.1:

  • The migration ran and the cilium-config ConfigMap flipped to disabled (helm release cilium → rev 2) ✅
  • But XDP stayed attached to the public NIC (prog/xdp … cil_xdp_entry on eno1) and the cilium agent never restarted (age 25h, 0 restarts) ❌

Root cause: the Cilium chart's agent pod template has no checksum/config annotation, so a ConfigMap-only helm upgrade doesn't roll the DaemonSet. The agent reads bpf-lb-acceleration only at startup — staged but never applied.
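To illustrate the root cause, here is a minimal Go sketch of why a config checksum on the pod template would have triggered a roll and why its absence does not. Everything in it is hypothetical and for illustration only (`configChecksum`, `templateAnnotations`, `needsRollout`); it models the idea rather than the Cilium chart or the DaemonSet controller.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// configChecksum hashes ConfigMap data in sorted-key order so the result is stable.
func configChecksum(data map[string]string) string {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := sha256.New()
	for _, k := range keys {
		fmt.Fprintf(h, "%s=%s\n", k, data[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// templateAnnotations builds the pod template annotations. With withChecksum
// false (the situation described above) the config never reaches the template.
func templateAnnotations(cfg map[string]string, withChecksum bool) map[string]string {
	ann := map[string]string{}
	if withChecksum {
		ann["checksum/config"] = configChecksum(cfg)
	}
	return ann
}

// needsRollout reports whether the pod template annotations differ. A changed
// pod template is what causes pods to be replaced (simplified model).
func needsRollout(before, after map[string]string) bool {
	if len(before) != len(after) {
		return true
	}
	for k, v := range before {
		if after[k] != v {
			return true
		}
	}
	return false
}

func main() {
	oldCfg := map[string]string{"bpf-lb-acceleration": "best-effort"}
	newCfg := map[string]string{"bpf-lb-acceleration": "disabled"}

	noSum := needsRollout(templateAnnotations(oldCfg, false), templateAnnotations(newCfg, false))
	withSum := needsRollout(templateAnnotations(oldCfg, true), templateAnnotations(newCfg, true))
	fmt.Println("rollout without checksum annotation:", noSum)
	fmt.Println("rollout with checksum annotation:", withSum)
}
```

Without the annotation the pod template is byte-for-byte identical before and after the ConfigMap change, so nothing prompts the agents to restart.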

Fix — a separate migration tied to v0.19.2

The restart is its own migration (CiliumAgentRestartMigration, minVersion = 0.19.2), not folded into the v0.19.1 migration — because the blk hosts already ran the v0.19.1 migration (installed == 0.19.1), so it won't re-fire on them. A 0.19.2-gated migration fires on the 0.19.1 → 0.19.2 upgrade and restarts the staged-but-not-applied agents.

  • Registered after CiliumAccelerationMigration, so a 0.18.x → 0.19.2 upgrade flips the config then restarts in one pass.
  • Execute restarts only when k8s + cilium are installed and acceleration is already disabled (the acceleration migration owns flipping the config); no-op otherwise.
  • New kube.Client.RolloutRestart(ctx, KindDaemonSet, "kube-system", "cilium") (via the dynamic client; adds KindDaemonSet + GVR) + a WaitForResource/daemonSetRolledOut rollout wait that tolerates a transient NIC blip during XDP detach.
  • A failed restart fails Execute. The v0.19.1 acceleration migration is unchanged.
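The version gating and registration order described above can be sketched as follows. This is an illustrative model, not the project's migration framework: the `migration` struct, `parseVersion`, `less`, and `pending` are hypothetical names, and the rule "fire when installed < minVersion <= target" is an assumption inferred from the description.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits "0.19.2" (or "v0.19.2") into three integers.
// Hypothetical helper; malformed parts are treated as 0.
func parseVersion(v string) [3]int {
	var out [3]int
	parts := strings.SplitN(strings.TrimPrefix(v, "v"), ".", 3)
	for i := 0; i < len(parts) && i < 3; i++ {
		n, _ := strconv.Atoi(parts[i])
		out[i] = n
	}
	return out
}

// less reports whether version a is strictly older than version b.
func less(a, b string) bool {
	pa, pb := parseVersion(a), parseVersion(b)
	for i := range pa {
		if pa[i] != pb[i] {
			return pa[i] < pb[i]
		}
	}
	return false
}

// migration is a hypothetical stand-in for a registered migration.
type migration struct {
	name       string
	minVersion string
}

// pending returns, in registration order, the migrations that fire when
// upgrading from installed to target (assumed rule: installed < minVersion <= target).
func pending(registered []migration, installed, target string) []string {
	var out []string
	for _, m := range registered {
		if less(installed, m.minVersion) && !less(target, m.minVersion) {
			out = append(out, m.name)
		}
	}
	return out
}

func main() {
	registered := []migration{
		{"CiliumAccelerationMigration", "0.19.1"},
		{"CiliumAgentRestartMigration", "0.19.2"},
	}
	fmt.Println(pending(registered, "0.19.1", "0.19.2")) // hosts that already ran the v0.19.1 migration
	fmt.Println(pending(registered, "0.18.0", "0.19.2")) // config flip, then restart, in one pass
}
```

Under this model, folding the restart into the 0.19.1-gated migration would never reach hosts already at 0.19.1, which is why the restart gets its own 0.19.2 gate.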

Review guide: docs/claude/reviews/00669-cilium-migration-agent-restart.md.

…fig applies

The startup migration (#689) ran `cilium upgrade --values` to flip existing
clusters to loadBalancer.acceleration=disabled, but the Cilium chart's agent pod
template has no checksum/config annotation, so a ConfigMap-only upgrade does not
roll the DaemonSet. The agent reads bpf-lb-acceleration only at startup, so on the
testnet blk hosts the ConfigMap flipped to "disabled" while the XDP program stayed
attached to the public NIC (agent pod never restarted) — the fix was staged but
never applied to the datapath.

After `cilium upgrade`, restart the Cilium agents and wait for the rollout so they
re-read the config and detach XDP:

  - kube.Client.RolloutRestart (new) stamps the pod template restartedAt
    annotation via the dynamic client (adds KindDaemonSet + GVR).
  - WaitForResource with a daemonSetRolledOut CheckFunc waits until the new spec
    is observed and every scheduled pod is updated+ready; transient get errors
    keep polling, since detaching XDP can briefly blip the NIC the API rides on.

A failed restart now fails Execute instead of leaving XDP attached.

Signed-off-by: Nathan Klick <nathan@swirldslabs.com>
@nathanklick nathanklick requested a review from a team as a code owner June 12, 2026 20:02
@nathanklick nathanklick requested a review from boris-bonin June 12, 2026 20:02

swirlds-automation commented Jun 12, 2026


Snyk checks have passed. No issues have been found so far.

| Status | Scan Engine | Critical | High | Medium | Low | Total (0) |
|---|---|---|---|---|---|---|
| | Open Source Security | 0 | 0 | 0 | 0 | 0 issues |
| | Licenses | 0 | 0 | 0 | 0 | 0 issues |


Move the Cilium agent restart out of the v0.19.1 CiliumAccelerationMigration and
into its own CiliumAgentRestartMigration gated on the v0.19.2 boundary.

The v0.19.1 migration already ran on the testnet blk hosts (installed == 0.19.1):
it flipped the cilium-config ConfigMap to acceleration=disabled but, because the
Cilium chart has no checksum/config annotation, never rolled the DaemonSet — so
the agents kept the old config with XDP still attached to the public NIC. Folding
the restart into that migration would not help those hosts, since it no longer
fires once installed >= 0.19.1.

A separate 0.19.2-gated migration fires on the 0.19.1 -> 0.19.2 upgrade and
restarts the agents so they apply the disabled config (detach XDP). Registered
after CiliumAccelerationMigration, so a 0.18.x -> 0.19.2 upgrade flips the config
then restarts in one pass. It only restarts when k8s + cilium are installed and
acceleration is already disabled; a failed restart fails the migration.

Adds kube.Client.RolloutRestart (+ KindDaemonSet) and a daemonSetRolledOut wait
that tolerates the brief NIC blip XDP detach can cause. The v0.19.1 acceleration
migration is unchanged.

Signed-off-by: Nathan Klick <nathan@swirldslabs.com>
@nathanklick changed the title from "fix(cilium): restart agents in acceleration migration so disabled config applies" to "fix(cilium): restart agents via a separate v0.19.2 migration" on Jun 12, 2026

@brunodam brunodam left a comment

Contributor


LGTM

@brunodam brunodam merged commit 505960d into main Jun 13, 2026
21 checks passed
@brunodam brunodam deleted the fix/cilium-migration-agent-restart branch June 13, 2026 04:47

3 participants