Describe the bug
Right now I have a bug unrelated to flagger that causes any updated pod in the primary deployment to fail to initialize.
Flagger did some things that surprised me in response to this.
With deployment mode Recreate:
- update deployment
- flagger does tests on canary, and tries to deploy to primary
- all traffic is routed to canary
- the primary deletes its pod, then tries to deploy a new pod
- the new primary pod never initializes
- 1 minute of
  `$canary not ready: waiting for rollout to finish: 0 of 1 (readyThreshold 100%) updated replicas are available`
- flagger marks the canary as failed, deletes the canary
The application is now in a completely broken state; the only pod "running" is the new broken primary.
With deployment mode RollingUpdate:
- update deployment
- flagger does tests on canary, and tries to deploy to primary
- all traffic is routed to canary
- the primary tries to deploy a new pod
- the pod never successfully initializes
- 10 minutes of
  `$canary not ready: waiting for rollout to finish: 1 old replicas are pending termination`
- flagger marks the canary as failed, deletes the canary
Both the working primary pod and the new broken primary pod are running, but at least the old one is still handling traffic.
I also have no idea why it only tries for 1 minute in Recreate mode but 10 minutes in RollingUpdate mode; I'm pretty sure the config was the same for both tests.
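For reference, my understanding from the docs is that Flagger's `progressDeadlineSeconds` (default 600s) is what should bound the rollout wait in both modes, which makes the 1-minute Recreate timeout even more puzzling. A sketch of the fields I'd expect to be relevant (the name and values here are assumptions for illustration, not my exact config):

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-app            # hypothetical name
spec:
  # how long Flagger waits for a deployment rollout
  # before marking it failed (default 600 = 10 minutes)
  progressDeadlineSeconds: 600
  analysis:
    interval: 1m          # assumed value; checks run at this cadence
    threshold: 10         # assumed value; failed checks before rollback
```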
To Reproduce
- Find some way to cause any update to the primary deployment to fail to initialize
- Update the canary deployment
- Observe the behavior described above
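One way to force the first step (a sketch of how it could be reproduced, not what actually broke in my cluster): give the target deployment a readiness probe that can never succeed, so any new pod starts but never becomes Ready.

```yaml
# hypothetical deployment for the canary target;
# any updated pod will run but never pass readiness
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app              # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: nginx:1.27      # any image works; readiness is what fails
          readinessProbe:
            exec:
              command: ["false"] # always exits non-zero -> never Ready
            periodSeconds: 5
```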
Expected behavior
The two things that surprise me about the current behavior:
- in Recreate deployment mode, flagger knows the primary is bad, but will delete the canary anyway
- maybe just don't use Recreate mode with flagger?
- in RollingUpdate mode, the failing pod replica doesn't get deleted after the canary gives up; the canary says the update failed, but if the updated replica ever stops failing, the update will roll out anyway, even though the canary reported failure.
I'm not really sure what the correct behavior is.
If flagger tried to keep the working canary around instead of the bad primary, that would put everything in a weird state according to flagger's model, so I suspect that's a bad idea.
Additional context
- Flagger version: 1.42.0
- Kubernetes version: v1.34.4-eks-3a10415
- Service Mesh provider: gatewayapi:v1
- Ingress provider: kong 2.47.0