fix(alb): improve capacity checks and add ingress rollback on reconciliation failure#174
Draft
pablovilas wants to merge 5 commits intobetafrom
Draft
fix(alb): improve capacity checks and add ingress rollback on reconciliation failure#174pablovilas wants to merge 5 commits intobetafrom
pablovilas wants to merge 5 commits intobetafrom
Conversation
…liation failure - Fix rule count to check HTTPS (443) listener only instead of summing all listeners - Add estimation of rules/TGs this scope will add before comparing against thresholds - Lower default ALB_MAX_TARGET_GROUPS from 98 to 90 for blue-green safety margin - Add rule capacity check to deployment workflow (was only at scope creation) - Add rollback_failed_ingress script that deletes broken ingresses on reconciliation failure to prevent sync poisoning of the entire ALB group - Wire rollback into verify_ingress_reconciliation at cert error, event error, and timeout - Add ALB_ROLLBACK_ON_RECONCILIATION_FAILURE config (default: true)
…ment step Merge validate_alb_capacity and validate_alb_target_group_capacity into a single deployment/validate_alb_capacity script that checks both in one pass, sharing the ALB ARN lookup and DNS_TYPE guard. The scope-level validate_alb_capacity (rules only) remains for create.yaml where no deployment context exists.
… escape When rollback_failed_ingress is sourced inside handle_reconciliation_failure, its `return 0` (from skip guards) would exit the enclosing function, skipping the critical `exit 1`. Running in a subshell isolates the return boundary.
…w config vars - Remove validate_alb_target_group_capacity and its tests (superseded by the combined deployment/validate_alb_capacity script) - Document ALB_MAX_CAPACITY, ALB_MAX_TARGET_GROUPS, and ALB_ROLLBACK_ON_RECONCILIATION_FAILURE in k8s/README.md
- Document that scope-level domain estimation works correctly when .scope.domain is null during scope creation ([null] has jq length 1) - Document that blue-green TG estimation intentionally overcounts (existing blue TGs are already in TARGET_GROUP_COUNT) - Document that ADDITIONAL_PORT_COUNT is re-parsed independently of the rule check which may have been skipped
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Addresses three critical ALB Ingress Controller issues: rule exhaustion, target group exhaustion, and sync poisoning when a broken ingress blocks reconciliation for the entire ALB group.
Bug fixes in existing capacity checks
resolve_balancer.ALB_MAX_TARGET_GROUPSfrom 98 to 90: Blue-green deployments temporarily double TGs. A threshold of 98/100 left only 2 TGs of headroom — insufficient for a blue-green deployment with additional ports.New: combined deployment capacity check
validate_alb_capacity(rules) andvalidate_alb_target_group_capacity(TGs) into a singledeployment/validate_alb_capacityscript that checks both in one pass, sharing the ALB ARN lookup and reducing duplicate AWS API calls. The standalone TG script is removed as dead code.New: ingress rollback on reconciliation failure (sync poisoning fix)
verify_ingress_reconciliationdetects a failure (certificate error, controller error, timeout), it now automatically deletes the broken ingress to prevent sync poisoning of the entire ALB group.group.name.ALB_ROLLBACK_ON_RECONCILIATION_FAILURE(default:true).return 0in skip guards from escaping the failure handler.Documentation
ALB_MAX_CAPACITY,ALB_MAX_TARGET_GROUPS, andALB_ROLLBACK_ON_RECONCILIATION_FAILUREtok8s/README.md.Changed files
k8s/deployment/validate_alb_capacityk8s/deployment/rollback_failed_ingressk8s/deployment/verify_ingress_reconciliationk8s/deployment/workflows/initial.yamlk8s/scope/validate_alb_capacityk8s/scope/build_contextALB_ROLLBACK_ON_RECONCILIATION_FAILUREk8s/values.yamlk8s/README.mdk8s/deployment/validate_alb_target_group_capacityTest plan
X current + N new = P/limit)ALB_ROLLBACK_ON_RECONCILIATION_FAILURE=false— verify rollback is skipped but deployment still failsDNS_TYPE=azure) — verify none of this activatesblue_green.yamlinherits the validation step frominitial.yaml🤖 Generated with Claude Code