Modernize EKS Auto Mode: foundation updates, new examples, observability #7

Draft: utkarpun wants to merge 8 commits into main from modernize-auto-mode

@utkarpun commented May 21, 2026

Summary

  • Remove the redundant manual IAM tagging policy (the EKS module already creates it via enable_auto_mode_custom_tags)
  • Bump K8s default 1.33→1.34, tighten provider version constraints
  • Add disruption budgets (WhenEmptyOrUnderutilized + 10% node budget) to all NodePools
  • Add optional KMS encryption for ephemeral storage
  • Expand GPU families (p5, p5e) and Neuron families (trn2)
  • Add 5 new educational examples: cost-optimization, capacity-reservation, static-capacity, batch-jobs, disruption-budgets
  • Add CloudWatch Container Insights addon (opt-in via enable_observability)
  • Rewrite README with educational structure, configuration table, categorized examples
  • Add Docusaurus documentation site (misc/website/) with GitHub Pages deployment, landing page, all examples + architecture docs surfaced
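
For reviewers unfamiliar with the disruption settings, a minimal sketch of what one updated NodePool could look like is shown here. It is illustrative, not a copy of the manifests in this PR: the NodePool name is hypothetical, the nodeClassRef assumes the built-in `default` Auto Mode NodeClass, and the 60s consolidateAfter value is taken from the commit message for this change.

```yaml
# Illustrative sketch only; names are hypothetical, not the PR's actual manifests.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose-example   # hypothetical name
spec:
  template:
    spec:
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default             # assumes the built-in Auto Mode NodeClass
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
  disruption:
    # Consolidate nodes that are empty or underutilized, not only empty ones.
    consolidationPolicy: WhenEmptyOrUnderutilized
    # Wait 60s after the last pod change before considering consolidation.
    consolidateAfter: 60s
    budgets:
      # Allow at most 10% of this pool's nodes to be disrupted at once.
      - nodes: "10%"
```

The percentage budget scales with pool size, so a small pool still allows some voluntary disruption while a large pool is not drained all at once.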

Test plan

  • Deploy fresh cluster WITHOUT base_domain — verify all NodePools apply, tagging works on EC2/EBS, consolidation behaves correctly
  • Deploy fresh cluster WITH base_domain — verify external-dns, ACM cert, ALB Ingress, public hostnames all work
  • Deploy each new example and verify it schedules pods correctly
  • Enable observability (-var='enable_observability=true') and confirm CloudWatch agent pods + dashboard
  • Destroy cluster via ./scripts/cleanup.sh and confirm no orphaned resources remain
  • Verify terraform plan on an existing live cluster shows a clean diff (IAM policy removal + local_file changes only)
  • Verify the docs site builds cleanly (cd misc/website && npm ci && npm run build)

utkarpun added 8 commits May 21, 2026 03:13
The EKS module (v20.37.2) already creates the custom-tags IAM policy
via enable_auto_mode_custom_tags (default true). The manual 138-line
inline policy in tagging.tf was a duplicate. Remove it and make the
module flag explicit in eks.tf for documentation clarity.

Also bumps: K8s default 1.33→1.34, Terraform >=1.5, AWS provider
>=5.79, loosens Helm/kubectl patch pins.

…e families

- Upgrade consolidation from WhenEmpty/30s to WhenEmptyOrUnderutilized/60s
  with 10% node budget across all 4 NodePools
- Add optional ephemeralStorage KMS encryption (gated by variable)
- Expand GPU families to include p5, p5e (large model training)
- Expand Neuron families to include trn2 (Trainium2)

…terns

New examples with teaching-oriented READMEs:
- cost-optimization: OD/Spot split via topology spread + overprovision headroom
- capacity-reservation: ODCR targeting for GPU workloads
- static-capacity: fixed-node pools via spec.replicas
- batch-jobs: do-not-disrupt annotation for long-running jobs
- disruption-budgets: time-windowed, reason-specific budget patterns
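
As a rough illustration of the batch-jobs pattern listed above, the sketch below shows a Job whose pods carry the `karpenter.sh/do-not-disrupt` annotation. The Job name, image, command, and resource requests are placeholders chosen for this sketch and are not taken from the example in this PR.

```yaml
# Illustrative sketch only; name, image, and command are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: long-running-batch-example   # hypothetical name
spec:
  backoffLimit: 2
  template:
    metadata:
      annotations:
        # Asks the node autoscaler not to voluntarily disrupt the node
        # (for example, for consolidation) while this pod is running.
        karpenter.sh/do-not-disrupt: "true"
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: public.ecr.aws/docker/library/busybox:stable   # placeholder image
          command: ["sh", "-c", "echo working; sleep 3600"]
          resources:
            requests:
              cpu: 500m
              memory: 256Mi
```

The annotation belongs on the pod template rather than on the Job object itself, since it is read from the pods scheduled onto the node.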

Adds amazon-cloudwatch-observability EKS addon gated by
enable_observability variable (default false). Includes IAM policy
attachment and educational README covering metrics, logs, Application
Signals, and cost awareness.

Reorganize examples by category (compute, cost, scheduling, observability).
Add What's New section, full configuration variables table, Learn More
links. Expand Components section with Auto Mode architecture explanation
and 5-layer tagging flow.

Reorder: Examples + Cleanup moved before Configuration/Components.
Remove the "What's New" section and 5-layer tagging detail from README.
Fix "How EKS Auto Mode Works" to cover full scope (not just 3 controllers).
Extract security considerations into SECURITY_CONSIDERATIONS.md.

Adds a GitHub Pages site under misc/website/ with:
- Landing page with hero + feature cards
- All 11 example READMEs surfaced as docs pages
- Architecture section (5-layer tagging, cleanup playbook, security)
- Getting started guide with configuration reference
- CopyMarkdownButton component for agent users
- GitHub Actions workflow for Pages deployment