feat(istio): expose istiod_replicas to guarantee HA for node drains#292

Open
agustincelentano wants to merge 3 commits into main from feat/istiod-replicas

agustincelentano (Collaborator) commented Apr 17, 2026

Summary

Single-replica istiod + chart-default PDB (minAvailable=1) yields disruptionsAllowed=0, which blocks every EKS node rolling update with PodEvictionFailure: Reached max retries. In clusters that use this module, tofu apply fails as soon as any change triggers a node group replacement (AMI bumps, instance_type changes, etc.).
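
The arithmetic behind that zero can be sketched in a few lines of HCL. This is only an illustration, not part of the module: the local and output names are hypothetical, and it assumes every replica is healthy, so a PDB with minAvailable allows roughly max(0, healthy replicas - minAvailable) voluntary evictions.

```hcl
# Illustration only: these names are hypothetical and not part of this module.
locals {
  # Chart-default PDB floor, as described above.
  pdb_min_available = 1

  # Evictions the PDB would allow for a given replica count,
  # assuming all replicas are healthy.
  allowed_disruptions = {
    for replicas in [1, 2, 3] :
    tostring(replicas) => max(0, replicas - local.pdb_min_available)
  }
}

output "allowed_disruptions" {
  value = local.allowed_disruptions
}
```

With one replica the result is 0, so a node drain can never evict the pod; with two replicas it is 1, which is enough for a rolling node replacement to proceed.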

Change

  • Add a new istiod_replicas variable (default 1, validated >= 1).
  • Wire it into both pilot.replicaCount and pilot.autoscaleMin on the helm_release "istiod".
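
A minimal sketch of what the two bullets above amount to, for readers who do not have the diff open. The variable name, default, validation and the two chart keys come from this PR. The description text, error message, repository URL and remaining helm_release arguments are assumptions for illustration and may differ from the module's actual code. It uses the hashicorp/helm v3 `set` attribute (a list of objects) rather than `set {}` blocks.

```hcl
variable "istiod_replicas" {
  # Description and error message are illustrative wording.
  description = "Number of istiod replicas, also used as the HPA minimum."
  type        = number
  default     = 1

  validation {
    condition     = var.istiod_replicas >= 1
    error_message = "istiod_replicas must be at least 1."
  }
}

resource "helm_release" "istiod" {
  # Assumed release settings; the real module may pin a version,
  # pass extra values, or use a different namespace.
  name       = "istiod"
  repository = "https://istio-release.storage.googleapis.com/charts"
  chart      = "istiod"
  namespace  = "istio-system"

  # Both keys get the same value so the HPA floor matches the replica count.
  set = [
    {
      name  = "pilot.replicaCount"
      value = tostring(var.istiod_replicas)
    },
    {
      name  = "pilot.autoscaleMin"
      value = tostring(var.istiod_replicas)
    },
  ]
}
```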

Why default = 1

Backwards compatibility. Existing consumers see no behavior change after upgrading the module. Callers that need HA (recommended for clusters doing node rolling updates) opt in explicitly:

module "istio" {
  source = "...//infrastructure/commons/istio?ref=v1.52.0"

  istiod_replicas = 2
}

Why both replicaCount AND autoscaleMin

The upstream istiod chart enables the HPA by default (pilot.autoscaleEnabled=true, pilot.autoscaleMin=1). Setting only pilot.replicaCount is insufficient — the HPA would scale the deployment back to 1 replica shortly after install, leaving us with the same problem. Overriding autoscaleMin locks in the floor.

Test plan

  • Apply on a dev cluster with istiod_replicas = 2 and verify kubectl -n istio-system get deploy istiod shows READY 2/2.
  • Verify kubectl -n istio-system get hpa istiod shows MINPODS=2.
  • Verify kubectl -n istio-system get pdb istiod shows ALLOWED DISRUPTIONS=1.
  • Trigger a node rolling update and confirm the drain succeeds.
  • Apply on another cluster without setting the variable and verify the deployment stays at 1 replica (backwards compat).

Single-replica istiod + PDB minAvailable=1 (chart default) yields
disruptionsAllowed=0, which blocks every EKS node rolling update with
'PodEvictionFailure: Reached max retries'.

Expose a new istiod_replicas variable (default 2) and wire it into both
pilot.replicaCount and pilot.autoscaleMin on the helm_release. Setting
only replicaCount is insufficient because the chart enables the HPA by
default with autoscaleMin=1, and the HPA would scale back to 1 replica
shortly after install.

The hashicorp/helm v3 provider replaced the 'set {}' block with a 'set'
attribute taking a list of objects.
…atibility

Flip the default from 2 to 1 so existing consumers of this module see
no behavior change after upgrading. Callers that need HA (recommended
for clusters doing node rolling updates) opt in explicitly with
istiod_replicas = 2.