
If a primary pod fails to initialize, flagger doesn't always do the right thing #1898

@mtfurlan

Description


Describe the bug

Right now I have a bug, unrelated to flagger, that causes any new pod in the primary deployment to fail to initialize.
Flagger did some things in response to this that surprised me.

With deployment mode Recreate:

  1. update deployment
  2. flagger does tests on canary, and tries to deploy to primary
  3. all traffic is routed to canary
  4. the primary deletes its pod, then tries to deploy a new pod
  5. the new primary pod never initializes
  6. 1 minute of $canary not ready: waiting for rollout to finish: 0 of 1 (readyThreshold 100%) updated replicas are available
  7. flagger marks the canary as failed and deletes the canary

The application is now in a completely broken state; the only pod "running" is the new, broken primary.
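For reference, this is roughly what the target Deployment looks like in the Recreate case (a minimal sketch; the podinfo name and image are hypothetical stand-ins). Flagger clones the target Deployment's spec into the -primary Deployment, so the primary inherits the same strategy:

```yaml
# Minimal sketch of a target Deployment using the Recreate strategy.
# With Recreate, the old primary pod is deleted before the replacement
# starts, so a failing replacement leaves no working pod behind.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: podinfo          # hypothetical app name
spec:
  replicas: 1
  strategy:
    type: Recreate       # vs. the default RollingUpdate
  selector:
    matchLabels:
      app: podinfo
  template:
    metadata:
      labels:
        app: podinfo
    spec:
      containers:
        - name: app
          image: ghcr.io/stefanprodan/podinfo:6.5.0   # hypothetical image
```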

With deployment mode RollingUpdate

  1. update deployment
  2. flagger does tests on canary, and tries to deploy to primary
  3. all traffic is routed to canary
  4. the primary tries to deploy a new pod
  5. the pod never successfully initializes
  6. 10 minutes of $canary not ready: waiting for rollout to finish: 1 old replicas are pending termination
  7. flagger marks the canary as failed and deletes the canary

Both the working primary pod and the new broken primary pod are running, but at least the old one is still handling traffic.

I also have no idea why it only waits for 1 minute in Recreate mode but 10 minutes in RollingUpdate mode; I'm fairly sure the config was the same for both tests.
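One knob that may be relevant to the wait time is Flagger's documented progressDeadlineSeconds field on the Canary resource, which bounds how long Flagger waits for the target/primary rollout to make progress and defaults to 600 seconds (10 minutes) per the Flagger docs. Why the Recreate case gave up after 1 minute is unclear. A minimal Canary sketch, with podinfo as a hypothetical target:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo            # hypothetical name
spec:
  # Max seconds for the deployment rollout to make progress before
  # Flagger considers it failed (Flagger default: 600).
  progressDeadlineSeconds: 600
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  service:
    port: 9898             # hypothetical app port
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
```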

To Reproduce

  • Find some way to cause any new pod in the primary deployment to fail to initialize
  • Update the canary deployment
  • Observe the behavior described above
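One hypothetical way to force the first step, assuming an HTTP app: give the pod template a readiness probe that can never succeed, so every new pod stays unready indefinitely. The path and port are made up for illustration:

```yaml
# Container-level fragment for the Deployment's pod template:
# a readiness probe against a path that always fails keeps the
# new pod permanently unready, so the rollout never completes.
readinessProbe:
  httpGet:
    path: /does-not-exist   # hypothetical always-failing endpoint
    port: 9898              # hypothetical app port
  periodSeconds: 5
  failureThreshold: 3
```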

Expected behavior

The two things that surprise me about the current behavior:

  • in Recreate deployment mode, flagger knows the primary is bad, but deletes the canary anyway
    • maybe just don't use Recreate mode with flagger?
  • in RollingUpdate mode, the failing replica is not deleted after the canary gives up, so the canary says the update failed, but if the updated replica ever stops failing, the update will roll out anyway, even though the canary reported failure.

I'm not really sure what the correct behavior is.
If flagger kept the working canary around instead of the bad primary, everything would be in a weird state according to flagger's model, so I suspect that's a bad idea.

Additional context

  • Flagger version: 1.42.0
  • Kubernetes version: v1.34.4-eks-3a10415
  • Service Mesh provider: gatewayapi:v1
  • Ingress provider: kong 2.47.0
