
If a primary pod fails to initialize, flagger doesn't always do the right thing #1898

@mtfurlan

Description


Describe the bug

Right now I have a bug, unrelated to flagger, that causes any new pod in the primary deployment to fail to initialize.
Flagger did some things in response to this that surprised me.

With deployment mode Recreate:

  1. update deployment
  2. flagger does tests on canary, and tries to deploy to primary
  3. all traffic is routed to canary
  4. the primary deletes its pod, then tries to deploy a new pod
  5. the new primary pod never initializes
  6. 1 minute of $canary not ready: waiting for rollout to finish: 0 of 1 (readyThreshold 100%) updated replicas are available
  7. flagger marks the canary as failed and deletes the canary

The application is now in a completely broken state; the only pod "running" is the new, broken primary.
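For reference, this is roughly what the target Deployment looks like in the Recreate case (a minimal sketch; the podinfo name and image are hypothetical stand-ins). Flagger clones the target Deployment's spec into the -primary Deployment, so the primary inherits the same strategy:

```yaml
# Minimal sketch of a target Deployment using the Recreate strategy.
# With Recreate, the old primary pod is deleted before the replacement
# starts, so a failing replacement leaves no working pod behind.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: podinfo          # hypothetical app name
spec:
  replicas: 1
  strategy:
    type: Recreate       # vs. the default RollingUpdate
  selector:
    matchLabels:
      app: podinfo
  template:
    metadata:
      labels:
        app: podinfo
    spec:
      containers:
        - name: app
          image: ghcr.io/stefanprodan/podinfo:6.5.0   # hypothetical image
```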

With deployment mode RollingUpdate

  1. update deployment
  2. flagger does tests on canary, and tries to deploy to primary
  3. all traffic is routed to canary
  4. the primary tries to deploy a new pod
  5. the pod never successfully initializes
  6. 10 minutes of $canary not ready: waiting for rollout to finish: 1 old replicas are pending termination
  7. flagger marks the canary as failed and deletes the canary

Both the working primary pod and the new broken primary pod are running, but at least the old one is still handling traffic.

I also have no idea why it only waits for 1 minute in Recreate mode but 10 minutes in RollingUpdate mode; I'm fairly sure the config was the same for both tests.
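One knob that may be relevant to the wait time is Flagger's documented progressDeadlineSeconds field on the Canary resource, which bounds how long Flagger waits for the target/primary rollout to make progress and defaults to 600 seconds (10 minutes) per the Flagger docs. Why the Recreate case gave up after 1 minute is unclear. A minimal Canary sketch, with podinfo as a hypothetical target:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo            # hypothetical name
spec:
  # Max seconds for the deployment rollout to make progress before
  # Flagger considers it failed (Flagger default: 600).
  progressDeadlineSeconds: 600
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  service:
    port: 9898             # hypothetical app port
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
```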

To Reproduce

  • Find some way to cause any new pod in the primary deployment to fail to initialize
  • Update the canary deployment
  • Observe the behavior described above
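One hypothetical way to force the first step, assuming an HTTP app: give the pod template a readiness probe that can never succeed, so every new pod stays unready indefinitely. The path and port are made up for illustration:

```yaml
# Container-level fragment for the Deployment's pod template:
# a readiness probe against a path that always fails keeps the
# new pod permanently unready, so the rollout never completes.
readinessProbe:
  httpGet:
    path: /does-not-exist   # hypothetical always-failing endpoint
    port: 9898              # hypothetical app port
  periodSeconds: 5
  failureThreshold: 3
```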

Expected behavior

The two things that surprise me about the current behavior:

  • in Recreate deployment mode, flagger knows the primary is bad, but deletes the canary anyway
    • maybe just don't use Recreate mode with flagger?
  • in RollingUpdate mode, the failing replica is not deleted after the canary gives up, so the canary says the update failed, but if the updated replica ever stops failing, the update will roll out anyway, even though the canary reported failure.

I'm not really sure what the correct behavior is.
If flagger kept the working canary around instead of the bad primary, everything would be in a weird state according to flagger's model, so I suspect that's a bad idea.

Additional context

  • Flagger version: 1.42.0
  • Kubernetes version: v1.34.4-eks-3a10415
  • Service Mesh provider: gatewayapi:v1
  • Ingress provider: kong 2.47.0
