fix(alb): improve capacity checks and add ingress rollback on reconciliation failure#174

Draft
pablovilas wants to merge 5 commits into beta from feat/alb-ingress-rollback-and-fixes

pablovilas commented Apr 15, 2026

Summary

Addresses three critical ALB Ingress Controller issues: rule exhaustion, target group exhaustion, and sync poisoning when a broken ingress blocks reconciliation for the entire ALB group.

Bug fixes in existing capacity checks

  • Fix rule count to check per-listener (HTTPS 443 only): The check previously summed rules across all listeners, overcounting by roughly 2x, since the AWS limit of 100 is per-listener, not per-ALB. It now counts only the HTTPS (443) listener, matching the approach used by resolve_balancer.
  • Add capacity estimation before comparing thresholds: Both the rule and TG checks now estimate what the current scope/deployment will add (based on domain count, additional ports, and deployment strategy) before comparing against limits. This prevents pass-then-exceed scenarios, where a scope with 5 domains passes the check at 74/75 but then pushes the count to 79.
  • Lower default ALB_MAX_TARGET_GROUPS from 98 to 90: Blue-green deployments temporarily double TGs. A threshold of 98/100 left only 2 TGs of headroom, which is insufficient for a blue-green deployment with additional ports.
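
The estimate-then-compare flow in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names, the per-domain and per-port formulas, the `blue_green` strategy label, and the strict "projected must not exceed limit" comparison are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating "estimate first, then compare".
# None of these names or formulas are taken from the PR's scripts.

# Assumed formula: one rule per domain on the main port,
# plus one rule per domain for each additional port.
estimate_new_rules() {
  local domains="$1" additional_ports="$2"
  echo $(( domains * (1 + additional_ports) ))
}

# Assumed formula: one TG for the main port plus one per additional port,
# doubled while a blue-green deployment keeps both versions alive.
estimate_new_target_groups() {
  local additional_ports="$1" strategy="$2"
  local per_version=$(( 1 + additional_ports ))
  if [ "$strategy" = "blue_green" ]; then
    echo $(( per_version * 2 ))
  else
    echo "$per_version"
  fi
}

# Prints "X current + N new = P/limit" and returns non-zero
# when the projected count would exceed the limit.
check_capacity() {
  local label="$1" current="$2" new="$3" limit="$4"
  local projected=$(( current + new ))
  echo "${label}: ${current} current + ${new} new = ${projected}/${limit}"
  [ "$projected" -le "$limit" ]
}
```

With the numbers from the description, `check_capacity rules 74 "$(estimate_new_rules 5 0)" 75` reports `rules: 74 current + 5 new = 79/75` and fails, where a check on the current count alone (74) would have passed.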

New: combined deployment capacity check

  • Merged validate_alb_capacity (rules) and validate_alb_target_group_capacity (TGs) into a single deployment/validate_alb_capacity script that checks both in one pass, sharing the ALB ARN lookup and reducing duplicate AWS API calls. The standalone TG script is removed as dead code.
  • Added rule capacity check to the deployment workflow (was only at scope creation), closing a gap where other scopes could consume rules between creation and deployment.
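
For reviewers unfamiliar with the underlying calls, a single-pass lookup along these lines might look like the sketch below. It assumes the ALB ARN has already been resolved once and is passed in. The function names are hypothetical and the actual script may query AWS differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: count rules on the HTTPS (443) listener and
# target groups for one ALB, reusing a single ARN for both queries.

# Counts rules on the port-443 listener only.
# Note: describe-rules also returns the listener's default rule.
count_https_rules() {
  local alb_arn="$1" listener_arn
  listener_arn=$(aws elbv2 describe-listeners \
    --load-balancer-arn "$alb_arn" \
    --query 'Listeners[?Port==`443`].ListenerArn | [0]' \
    --output text) || return 1
  # The CLI prints "None" when the query matches nothing.
  if [ -z "$listener_arn" ] || [ "$listener_arn" = "None" ]; then
    echo 0
    return 0
  fi
  aws elbv2 describe-rules \
    --listener-arn "$listener_arn" \
    --query 'length(Rules)' \
    --output json
}

# Counts target groups attached to the ALB.
count_target_groups() {
  local alb_arn="$1"
  aws elbv2 describe-target-groups \
    --load-balancer-arn "$alb_arn" \
    --query 'length(TargetGroups)' \
    --output json
}
```

Both functions take the same ARN, so a caller can resolve it once and feed the two counts into the rule and TG threshold checks.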

New: ingress rollback on reconciliation failure (sync poisoning fix)

  • When verify_ingress_reconciliation detects a failure (certificate error, controller error, timeout), it now automatically deletes the broken ingress to prevent sync poisoning of the entire ALB group.
  • Previously, the broken ingress was left in the cluster, blocking the ALB controller from reconciling ALL other ingresses sharing the same group.name.
  • Rollback is configurable via ALB_ROLLBACK_ON_RECONCILIATION_FAILURE (default: true).
  • Rollback runs in a subshell to prevent return 0 in skip guards from escaping the failure handler.
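
The subshell isolation can be illustrated with a small standalone sketch. Everything here is a stand-in: the rollback body is modeled as a string evaluated in the handler's own context (an assumption about how the step is loaded), the handler names are hypothetical, and `return 1` stands in for the real handler's `exit 1` so that the example can be run safely.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the rollback step, including its skip guard.
ROLLBACK_BODY='
if [ "${ALB_ROLLBACK_ON_RECONCILIATION_FAILURE:-true}" != "true" ]; then
  echo "rollback skipped"
  return 0   # skip guard
fi
echo "ingress deleted"   # a real script would delete the ingress here
'

# Without isolation: the skip guard returns from the handler itself,
# so the failure status below is never reached.
handle_failure_inline() {
  eval "$ROLLBACK_BODY"
  echo "deployment marked failed"
  return 1
}

# With isolation: the return only ends the subshell, and the handler
# continues on to report the failure.
handle_failure_isolated() {
  ( eval "$ROLLBACK_BODY" )
  echo "deployment marked failed"
  return 1
}
```

With `ALB_ROLLBACK_ON_RECONCILIATION_FAILURE=false`, the inline variant returns 0 and never marks the deployment as failed, while the isolated variant skips the rollback and still returns 1.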

Documentation

  • Added ALB_MAX_CAPACITY, ALB_MAX_TARGET_GROUPS, and ALB_ROLLBACK_ON_RECONCILIATION_FAILURE to k8s/README.md.

Changed files

| File | Change |
| --- | --- |
| k8s/deployment/validate_alb_capacity | New: combined rules + TG capacity check |
| k8s/deployment/rollback_failed_ingress | New: deletes broken ingresses on reconciliation failure |
| k8s/deployment/verify_ingress_reconciliation | Call rollback on cert error, event error, and timeout |
| k8s/deployment/workflows/initial.yaml | Replace 2 validation steps with 1 combined step |
| k8s/scope/validate_alb_capacity | Fix per-listener counting + add rule estimation |
| k8s/scope/build_context | Export ALB_ROLLBACK_ON_RECONCILIATION_FAILURE |
| k8s/values.yaml | Lower TG threshold to 90, add rollback config |
| k8s/README.md | Document new config variables |
| k8s/deployment/validate_alb_target_group_capacity | Removed: superseded by combined script |

Test plan

  • 54 bats tests pass (11 new combined + 7 new rollback + 25 scope capacity + 11 reconciliation)
  • Deploy a scope when ALB is near rule capacity — verify it fails with projected count (X current + N new = P/limit)
  • Deploy with blue-green strategy near TG capacity — verify estimation accounts for 2x TGs
  • Simulate reconciliation failure (e.g., invalid cert domain) — verify ingress is automatically rolled back and other scopes resume
  • Set ALB_ROLLBACK_ON_RECONCILIATION_FAILURE=false — verify rollback is skipped but deployment still fails
  • Deploy on Azure (DNS_TYPE=azure) — verify none of this activates
  • Verify blue_green.yaml inherits the validation step from initial.yaml

🤖 Generated with Claude Code

…liation failure

- Fix rule count to check HTTPS (443) listener only instead of summing all listeners
- Add estimation of rules/TGs this scope will add before comparing against thresholds
- Lower default ALB_MAX_TARGET_GROUPS from 98 to 90 for blue-green safety margin
- Add rule capacity check to deployment workflow (was only at scope creation)
- Add rollback_failed_ingress script that deletes broken ingresses on reconciliation
  failure to prevent sync poisoning of the entire ALB group
- Wire rollback into verify_ingress_reconciliation at cert error, event error, and timeout
- Add ALB_ROLLBACK_ON_RECONCILIATION_FAILURE config (default: true)
…ment step

Merge validate_alb_capacity and validate_alb_target_group_capacity into
a single deployment/validate_alb_capacity script that checks both in one
pass, sharing the ALB ARN lookup and DNS_TYPE guard.

The scope-level validate_alb_capacity (rules only) remains for create.yaml
where no deployment context exists.
… escape

When rollback_failed_ingress is sourced inside handle_reconciliation_failure,
its `return 0` (from skip guards) would exit the enclosing function, skipping
the critical `exit 1`. Running in a subshell isolates the return boundary.
…w config vars

- Remove validate_alb_target_group_capacity and its tests (superseded by
  the combined deployment/validate_alb_capacity script)
- Document ALB_MAX_CAPACITY, ALB_MAX_TARGET_GROUPS, and
  ALB_ROLLBACK_ON_RECONCILIATION_FAILURE in k8s/README.md
- Document that scope-level domain estimation works correctly when
  .scope.domain is null during scope creation ([null] has jq length 1)
- Document that blue-green TG estimation intentionally overcounts
  (existing blue TGs are already in TARGET_GROUP_COUNT)
- Document that ADDITIONAL_PORT_COUNT is re-parsed independently of
  the rule check which may have been skipped
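
The `[null]` note above relies on how jq counts array elements. A minimal sketch, assuming the estimate wraps `.scope.domain` in an array before taking its length (the function name and the exact jq expression are hypothetical):

```shell
#!/usr/bin/env bash
# Hypothetical helper: counts domains from a JSON context read as $1.
# A null .scope.domain still yields a one-element array, so the
# estimate stays at 1 during scope creation.
count_domains() {
  printf '%s' "$1" | jq '[.scope.domain] | length'
}
```

For example, `count_domains '{"scope":{"domain":null}}'` prints `1`, the same as it would for a populated domain string.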