Description
Is this related to an existing feature request or issue?
No response
Summary
The AWS Resiliency Plugin gives your AI agent a failure-mode-first approach to architecture reviews. Share your IaC or application code, and the agent reviews it across 7 resiliency domains — surfacing single points of failure, missing failover configurations, and application-level gaps that cause production incidents.
Use case
Target users: Cloud Engineers, SREs, Platform Engineers, and Cloud Architects preparing production deployments, operational readiness reviews (ORRs), game-day exercises, or Well-Architected Reviews (WARs).
Key use cases:
- Pre-production hardening: Review a CDK/Terraform stack before it goes to production and find the missing Multi-AZ, the absent DLQ, and the Lambda timeout that will silently drop requests.
- ORR preparation: Structured review across all seven resiliency domains to validate readiness criteria before launch.
- Incident post-mortem: Share the IaC of a service that just had an outage, and the agent finds the architectural gap that caused it.
- Game-day planning: Identify blast radius and failure modes to design meaningful chaos experiments.
- WAR prep: Map findings to Well-Architected REL pillar questions with evidence from the actual code.
### Before / After user experience
Before (without the plugin):
User: Is this CDK stack resilient?
Agent: Here are some general resiliency best practices:
1. Use Multi-AZ deployments
2. Implement retry logic
3. Consider disaster recovery
[...10 generic bullet points with no reference to the actual code...]
After (with aws-resiliency):
User: Is this CDK stack resilient?
[pastes CDK TypeScript stack]
Agent: Two-layer resiliency review complete across 7 domains.
| Domain | Finding | Severity | Blast Radius | RTO Impact | Fix |
|-------------|--------------------------------------|----------|---------------------------|-------------|----------------------------------------|
| COMPUTE | ECS desiredCount=1, no multi-AZ | Critical | Full service outage | ~5 min | desiredCount≥2 + AZ placement strategy |
| DATA | RDS missing multiAz | Critical | Read+write outage | 10–30 min | multiAz: true |
| COMPUTE | Lambda timeout=3s on payments fn | High | Silent 504s to ALB | Per-request | timeout: 29 (ALB max) |
| MESSAGING | SQS consumer missing DLQ | High | Message loss on failure | N/A | deadLetterQueue, maxReceiveCount: 3 |
| NETWORKING | Single NAT Gateway (one AZ) | Medium | Outbound loss if AZ fails | ~5 min | One NAT GW per AZ |
Layer 2 (application code): The RDS client is correctly initialised outside the handler, but a connection pool with maxConnections=10 will exhaust database connections during the reconnection storm that follows an RDS failover. Recommend maxConnections=2 with exponential backoff on connect.
Top risk: The missing RDS Multi-AZ and the single-task ECS service together mean any AZ failure causes a full outage with manual recovery. Fix both before production.
WAF refs: REL 6 (REL_FAULT_ISOLATION), REL 9 (REL_MITIGATE_BULK_FAILURES).
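To make the "exponential backoff on connect" recommendation in the sample output concrete, here is a minimal sketch using the common "full jitter" variant. The function name `connect_with_backoff` and its parameters (`base`, `cap`, `attempts`) are hypothetical and are not part of the plugin; `connect` stands in for whatever the database driver uses to open a connection.

```python
import random
import time


def connect_with_backoff(connect, attempts=8, base=0.5, cap=30.0,
                         sleep=time.sleep, rng=random.random):
    """Call connect() until it succeeds or `attempts` tries are used up.

    Hypothetical helper for illustration only. Between tries it waits a
    random time in [0, min(cap, base * 2**attempt)] seconds ("full
    jitter"), so many clients reconnecting after a failover do not all
    retry at the same instant.
    """
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error to the caller
            sleep(rng() * min(cap, base * 2 ** attempt))
```

The `sleep` and `rng` parameters are injectable so the retry schedule can be checked in a unit test without real waiting.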
Proposal
Plugin structure
plugins/aws-resiliency/
aws-resiliency-plugin/
├── .claude-plugin/
│   └── plugin.json
├── skills/
│   ├── SKILL.md                ← orchestrator (loads references on demand)
│   └── references/
│       ├── compute-resiliency.md
│       ├── data-resiliency.md
│       ├── networking-resiliency.md
│       ├── storage-resiliency.md
│       ├── messaging-resiliency.md
│       ├── observability-resiliency.md
│       ├── multi-region-dr.md
│       ├── service-failure-modes.md
│       └── well-architected-reliability.md
├── mcp.json                    ← 4 AWS MCP server configs
└── README.md
MCP server dependencies
| Server | Type | Purpose | Required? |
|---|---|---|---|
| awslabs.aws-documentation-mcp-server | stdio | Well-Architected REL pillar references, AWS service documentation, service limits, SLA definitions | Required |
| awslabs.aws-iac-mcp-server | stdio | CDK + CloudFormation resource schema lookups, cfn-lint validation, cfn-guard compliance checks, IaC best-practice patterns | When CDK/CFn detected |
| awslabs.terraform-mcp-server | stdio | Terraform provider registry, module validation, HCL resource schema checks | When Terraform HCL detected |
| awslabs.aws-knowledge-mcp-server | stdio | Architecture best practices, cross-service integration patterns, service-specific failure behaviour FAQs | On-demand for service behaviour lookups |
awslabs.aws-documentation-mcp-server is the only required server —
it provides Well-Architected REL pillar grounding for all findings. The IaC and
Terraform servers are conditionally invoked based on detected input format. The
knowledge server is invoked when the agent needs to confirm specific service
behaviours (e.g., exact RDS DNS failover propagation window, DynamoDB eventual
consistency semantics under partition).
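The conditional invocation described above could be sketched roughly as follows. This is an illustration under assumptions, not the plugin's implementation: the server names come from the table, but the detection regexes, the format labels, and the functions `detect_formats` and `select_servers` are hypothetical.

```python
import re

DOCS = "awslabs.aws-documentation-mcp-server"   # the only required server
IAC = "awslabs.aws-iac-mcp-server"              # CDK + CloudFormation
TERRAFORM = "awslabs.terraform-mcp-server"      # Terraform HCL
KNOWLEDGE = "awslabs.aws-knowledge-mcp-server"  # on-demand behaviour lookups


def detect_formats(filename, text):
    """Guess the IaC formats in one file from its extension and syntax.

    The patterns are illustrative heuristics, not an exhaustive detector.
    """
    formats = set()
    if filename.endswith(".tf") or re.search(
            r'^\s*resource\s+"aws_\w+"\s+"[\w-]+"\s*\{', text, re.M):
        formats.add("terraform")
    if "AWSTemplateFormatVersion" in text or re.search(
            r'"?Type"?\s*:\s*["\']?AWS::\w+::\w+', text):
        formats.add("cloudformation")
    if re.search(r"aws-cdk-lib|aws_cdk", text):
        formats.add("cdk")
    return formats


def select_servers(formats, needs_service_behaviour=False):
    """Return the MCP servers to invoke for the detected formats."""
    servers = [DOCS]
    if formats & {"cdk", "cloudformation"}:
        servers.append(IAC)
    if "terraform" in formats:
        servers.append(TERRAFORM)
    if needs_service_behaviour:
        servers.append(KNOWLEDGE)
    return servers
```

For example, a file named `main.tf` would select the documentation and Terraform servers only, while a CDK stack with a pending question about RDS failover timing would select the documentation, IaC, and knowledge servers.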
Defaults
| Setting | Default | How to Override |
|---|---|---|
| IaC languages | CDK (TypeScript, Python, Java, Go), Terraform HCL, CloudFormation YAML/JSON, SAM YAML | Auto-detected from syntax and file extension |
| Severity: Critical | Single point of failure, no automated recovery, estimated RTO > 4 hours | State: "my RTO requirement is X" |
| Severity: High | Significant reliability gap, degraded service, RTO 1–4 hours | |
| Severity: Medium | Well-Architected best-practice violation, limited blast radius, RTO < 1 hour | |
| Severity: Low | Improvement opportunity, no immediate blast radius, affects future scale | |
| Default RTO threshold | 4 hours (Critical if exceeded) | State target RTO explicitly |
| Default RPO threshold | 1 hour (Critical if exceeded) | State target RPO explicitly |
| Review scope | All IaC files in the current directory + application code shared in the conversation | Specify: "review only the data layer" |
| Output format | Markdown findings table (Domain / Finding / Severity / Blast Radius / RTO Impact / Fix / WAF Ref) + Layer 2 narrative + top-risk summary | |
| Fix format | Corrected code/config snippet in the same IaC language as the input | |
| WAF references | Mapped to Well-Architected REL pillar question titles (stable across versions) | |
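As a rough illustration of how the severity defaults above might combine, the sketch below classifies a finding from its estimated RTO and two flags. Several things here are assumptions rather than the plugin's rules: the `Finding` fields, the precedence given to single points of failure, and the idea that the High band scales with a user-supplied target (a quarter of the target, which reproduces the 1 to 4 hour band at the 4-hour default).

```python
from dataclasses import dataclass

DEFAULT_RTO_TARGET_HOURS = 4.0  # "Default RTO threshold" row in the table


@dataclass(frozen=True)
class Finding:
    """Hypothetical finding record; field names are illustrative."""
    description: str
    estimated_rto_hours: float
    single_point_of_failure: bool = False
    has_blast_radius: bool = True


def classify(finding, rto_target_hours=DEFAULT_RTO_TARGET_HOURS):
    """Map a finding to Critical / High / Medium / Low."""
    # A single point of failure or an RTO above the target is Critical.
    if (finding.single_point_of_failure
            or finding.estimated_rto_hours > rto_target_hours):
        return "Critical"
    # No immediate blast radius: an improvement opportunity.
    if not finding.has_blast_radius:
        return "Low"
    # Assumption: the High band starts at a quarter of the target,
    # which is 1 hour when the target is the 4-hour default.
    if finding.estimated_rto_hours >= rto_target_hours / 4:
        return "High"
    return "Medium"
```

Passing `rto_target_hours=1.0` models the "my RTO requirement is X" override: a finding with a 2-hour estimated RTO moves from High to Critical.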
Dependencies and Integrations
MCP dependencies (all from AWS Labs):
- awslabs.aws-documentation-mcp-server: Well-Architected REL pillar, service docs, SLAs
- awslabs.aws-iac-mcp-server: CDK + CloudFormation validation, cfn-lint, cfn-guard
- awslabs.terraform-mcp-server: Terraform provider/module validation
- awslabs.aws-knowledge-mcp-server: Architecture patterns, service failure behaviour
Integration with existing plugins:
- Complements deploy-on-aws: deploy-on-aws generates dev-sized IaC; aws-resiliency reviews it for production hardening. Recommended workflow: generate → review → harden → redeploy.
- Complements aws-observability (RFC: Add AWS Observability plugin #67): Resiliency review surfaces missing CloudWatch alarms, absent X-Ray tracing, and health check gaps, findings that feed directly into aws-observability workflows.
Reference implementation: A working version of the skill and reference files is available at https://github.com/nirmal84/aws-resiliency-plugin
Potential Challenges
- IaC language detection in monorepos: Multi-file CDK projects with multiple stacks require heuristic detection. Mitigation: SKILL.md instructs the agent to ask the user to identify the main stack file if auto-detection is ambiguous.
- Two-layer correlation: Connecting an IaC finding (RDS Multi-AZ enabled) with an application finding (connection pool not handling the 60s DNS failover window) requires both layers to be present. Mitigation: SKILL.md instructs the agent to flag missing layers and deliver a partial review with explicit caveats.
- WAF question ID currency: REL pillar question IDs change across framework versions. Mitigation: References use stable question titles rather than volatile IDs; aws-documentation-mcp-server fetches current content when available.
- Reference file size: Domain reference files covering IaC checks, application code checks, and failure modes will approach the 100-line guideline. Mitigation: Files load only when the relevant domain is detected, so a Lambda-only review never loads multi-region-dr.md. SKILL.md stays under 200 lines.
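The on-demand loading mitigation could look something like the sketch below. The reference file names are taken from the plugin structure above; the mapping from CloudFormation resource-type prefixes to domains and the function `references_to_load` are hypothetical, since in the actual plugin this decision is made by the agent following SKILL.md rather than by code.

```python
# Domain -> reference file, using the file names from the plugin structure.
REFERENCES = {
    "compute": "compute-resiliency.md",
    "data": "data-resiliency.md",
    "networking": "networking-resiliency.md",
    "storage": "storage-resiliency.md",
    "messaging": "messaging-resiliency.md",
    "observability": "observability-resiliency.md",
    "multi-region": "multi-region-dr.md",
}

# Hypothetical, deliberately incomplete mapping of resource-type prefixes
# to domains; a real detector would cover many more services.
RESOURCE_DOMAINS = {
    "AWS::Lambda::": "compute",
    "AWS::ECS::": "compute",
    "AWS::RDS::": "data",
    "AWS::DynamoDB::": "data",
    "AWS::EC2::NatGateway": "networking",
    "AWS::S3::": "storage",
    "AWS::SQS::": "messaging",
    "AWS::SNS::": "messaging",
    "AWS::CloudWatch::": "observability",
    "AWS::Route53::": "multi-region",
}


def references_to_load(resource_types):
    """Return the sorted reference files needed for the given resources."""
    domains = {
        domain
        for resource_type in resource_types
        for prefix, domain in RESOURCE_DOMAINS.items()
        if resource_type.startswith(prefix)
    }
    return sorted(REFERENCES[d] for d in domains)
```

With this mapping, a stack containing only `AWS::Lambda::Function` resources yields just `compute-resiliency.md`, which matches the Lambda-only case described in the last bullet.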