RFC: AWS Resiliency Plugin #75

@nirmal84

Is this related to an existing feature request or issue?

No response

Summary

The AWS Resiliency Plugin gives your AI agent a failure-mode-first approach to architecture reviews. Share your IaC or application code, and the agent reviews it across 7 resiliency domains — surfacing single points of failure, missing failover configurations, and application-level gaps that cause production incidents.

Use case

Target users: Cloud Engineers, SREs, Platform Engineers, and Cloud Architects preparing production deployments, operational readiness reviews (ORRs), game-day exercises, or Well-Architected Reviews (WARs).

Key use cases:

  • Pre-production hardening: Review a CDK/Terraform stack before it goes to production and find the missing Multi-AZ, the absent DLQ, the Lambda timeout that will silently drop requests.
  • ORR preparation: Structured review across all seven resiliency domains to validate readiness criteria before launch.
  • Incident post-mortem: Share the IaC of a service that just had an outage; the agent finds the architectural gap that caused it.
  • Game-day planning: Identify blast radius and failure modes to design meaningful chaos experiments.
  • WAR prep: Map findings to Well-Architected REL pillar questions with evidence from the actual code.

Before / After user experience:

Before (without the plugin):

User: Is this CDK stack resilient?

Agent: Here are some general resiliency best practices:
1. Use Multi-AZ deployments
2. Implement retry logic
3. Consider disaster recovery
[...10 generic bullet points with no reference to the actual code...]

After (with aws-resiliency):

User: Is this CDK stack resilient?
[pastes CDK TypeScript stack]

Agent: Two-layer resiliency review complete across 7 domains.

   | Domain      | Finding                              | Severity | Blast Radius              | RTO Impact  | Fix                                    |
   |-------------|--------------------------------------|----------|---------------------------|-------------|----------------------------------------|
   | COMPUTE     | ECS desiredCount=1, no multi-AZ      | Critical | Full service outage        | ~5 min      | desiredCount≥2 + AZ placement strategy |
   | DATA        | RDS missing multiAz                  | Critical | Read+write outage          | 10–30 min   | multiAz: true                          |
   | COMPUTE     | Lambda timeout=3s on payments fn     | High     | Silent 504s to ALB         | Per-request | timeout: 29 (< ALB 60 s idle default)  |
   | MESSAGING   | SQS consumer missing DLQ             | High     | Message loss on failure    | N/A         | deadLetterQueue, maxReceiveCount: 3    |
   | NETWORKING  | Single NAT Gateway (one AZ)          | Medium   | Outbound loss if AZ fails  | ~5 min      | One NAT GW per AZ                      |

Layer 2 (application code): RDS client is correctly initialised outside the handler, but connection pool maxConnections=10 will exhaust during the RDS failover reconnection storm. Recommend maxConnections=2 with exponential backoff on connect.

Top risk: the missing RDS Multi-AZ and the single ECS task together mean any AZ failure causes a full outage with manual recovery. Fix both before production.

WAF references: REL 6 (REL_FAULT_ISOLATION), REL 9 (REL_MITIGATE_BULK_FAILURES).
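The "exponential backoff on connect" recommendation in the sample output could look roughly like the sketch below. It is not taken from the plugin; `connect_with_backoff`, its parameters, and the use of `ConnectionError` are illustrative assumptions, and a real driver would raise its own exception type.

```python
import random
import time


def connect_with_backoff(connect, max_attempts=5, base_delay=0.5, max_delay=8.0,
                         sleep=time.sleep):
    """Call `connect()` until it succeeds, backing off exponentially with jitter.

    `connect` is any zero-argument callable that returns a connection or raises
    ConnectionError (an assumption; substitute your driver's exception type).
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Full jitter: wait a random time up to the capped exponential delay
            # so that many clients do not reconnect at the same instant.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
```

The jitter matters as much as the backoff: during a failover every client sees the disconnect at about the same time, and identical retry schedules would recreate the reconnection storm on each attempt.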

Proposal

Plugin structure

plugins/aws-resiliency/
aws-resiliency-plugin/
├── .claude-plugin/
│   └── plugin.json          
├── skills/
│   ├── SKILL.md             ← orchestrator (loads references on demand)
│   └── references/
│       ├── compute-resiliency.md
│       ├── data-resiliency.md
│       ├── networking-resiliency.md
│       ├── storage-resiliency.md
│       ├── messaging-resiliency.md
│       ├── observability-resiliency.md
│       ├── multi-region-dr.md
│       ├── service-failure-modes.md
│       └── well-architected-reliability.md
├── mcp.json                 ← 4 AWS MCP server configs
└── README.md
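To illustrate what "loads references on demand" could mean for the layout above, here is a minimal sketch. The file names come from the tree; the domain keys and the `references_for` helper are hypothetical. The two cross-cutting files (service-failure-modes.md, well-architected-reliability.md) are left out because the RFC does not say when they load.

```python
# Hypothetical mapping from review domain to reference file. File names are the
# ones in the plugin tree; the domain keys are assumptions for illustration.
REFERENCES = {
    "compute": "compute-resiliency.md",
    "data": "data-resiliency.md",
    "networking": "networking-resiliency.md",
    "storage": "storage-resiliency.md",
    "messaging": "messaging-resiliency.md",
    "observability": "observability-resiliency.md",
    "multi-region": "multi-region-dr.md",
}


def references_for(detected_domains):
    """Return the reference paths to load for the detected domains, in a stable order."""
    return [
        f"skills/references/{filename}"
        for domain, filename in REFERENCES.items()
        if domain in detected_domains
    ]
```

With this shape, a review that only detects compute and messaging resources never pulls multi-region-dr.md into context.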

MCP server dependencies

| Server | Type | Purpose | Required? |
|--------|------|---------|-----------|
| awslabs.aws-documentation-mcp-server | stdio | Well-Architected REL pillar references, AWS service documentation, service limits, SLA definitions | Required |
| awslabs.aws-iac-mcp-server | stdio | CDK + CloudFormation resource schema lookups, cfn-lint validation, cfn-guard compliance checks, IaC best-practice patterns | When CDK/CFn detected |
| awslabs.terraform-mcp-server | stdio | Terraform provider registry, module validation, HCL resource schema checks | When Terraform HCL detected |
| awslabs.aws-knowledge-mcp-server | stdio | Architecture best practices, cross-service integration patterns, service-specific failure behaviour FAQs | On-demand for service behaviour lookups |

awslabs.aws-documentation-mcp-server is the only required server — it provides Well-Architected REL pillar grounding for all findings. The IaC and Terraform servers are conditionally invoked based on detected input format. The knowledge server is invoked when the agent needs to confirm specific service behaviours (e.g., exact RDS DNS failover propagation window, DynamoDB eventual consistency semantics under partition).
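The conditional invocation described above can be sketched as a small selection function. The server names are the ones listed in this RFC; the `select_mcp_servers` function, the format labels (`"cdk"`, `"cloudformation"`, `"terraform"`), and the `needs_service_lookup` flag are assumptions made for illustration, not part of the plugin.

```python
def select_mcp_servers(detected_formats, needs_service_lookup=False):
    """Pick the MCP servers a review would call, given the detected input formats."""
    formats = set(detected_formats)
    # The documentation server is the only one that is always required.
    servers = ["awslabs.aws-documentation-mcp-server"]
    if formats & {"cdk", "cloudformation"}:
        servers.append("awslabs.aws-iac-mcp-server")
    if "terraform" in formats:
        servers.append("awslabs.terraform-mcp-server")
    if needs_service_lookup:
        # Only when the agent must confirm a specific service behaviour.
        servers.append("awslabs.aws-knowledge-mcp-server")
    return servers
```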

Defaults

| Setting | Default | How to Override |
|---------|---------|-----------------|
| IaC languages | CDK (TypeScript, Python, Java, Go), Terraform HCL, CloudFormation YAML/JSON, SAM YAML | Auto-detected from syntax and file extension |
| Severity: Critical | Single point of failure, no automated recovery, estimated RTO > 4 hours | State: "my RTO requirement is X" |
| Severity: High | Significant reliability gap, degraded service, RTO 1–4 hours | |
| Severity: Medium | Well-Architected best-practice violation, limited blast radius, RTO < 1 hour | |
| Severity: Low | Improvement opportunity, no immediate blast radius, affects future scale | |
| Default RTO threshold | 4 hours (Critical if exceeded) | State target RTO explicitly |
| Default RPO threshold | 1 hour (Critical if exceeded) | State target RPO explicitly |
| Review scope | All IaC files in the current directory + application code shared in the conversation | Specify: "review only the data layer" |
| Output format | Markdown findings table (Domain / Finding / Severity / Blast Radius / RTO Impact / Fix / WAF Ref) + Layer 2 narrative + top-risk summary | |
| Fix format | Corrected code/config snippet in the same IaC language as the input | |
| WAF references | Mapped to Well-Architected REL pillar question titles (stable across versions) | |
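As a rough illustration of two of these defaults, the sketch below bands a finding by estimated RTO with an overridable Critical threshold, and renders findings in the stated column order. All names (`severity_from_rto`, `Finding`, `render_findings`) are hypothetical. The banding covers only the RTO component of the severity definitions; the other criteria (single point of failure, blast radius) and the Low level are not RTO-based and are not modelled here.

```python
from dataclasses import dataclass

COLUMNS = ["Domain", "Finding", "Severity", "Blast Radius", "RTO Impact", "Fix", "WAF Ref"]
SEVERITY_ORDER = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}


def severity_from_rto(estimated_rto_hours, critical_threshold_hours=4.0):
    """Band a finding by estimated RTO only (assumed reading of the defaults)."""
    if estimated_rto_hours > critical_threshold_hours:
        return "Critical"
    if estimated_rto_hours >= 1.0:
        return "High"
    return "Medium"


@dataclass
class Finding:
    domain: str
    finding: str
    severity: str
    blast_radius: str
    rto_impact: str
    fix: str
    waf_ref: str


def render_findings(findings):
    """Render findings as a Markdown table, most severe first."""
    rows = sorted(findings, key=lambda f: SEVERITY_ORDER[f.severity])
    lines = [
        "| " + " | ".join(COLUMNS) + " |",
        "|" + "|".join("---" for _ in COLUMNS) + "|",
    ]
    for f in rows:
        cells = [f.domain, f.finding, f.severity, f.blast_radius,
                 f.rto_impact, f.fix, f.waf_ref]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

A user who states a 30-minute RTO requirement would correspond to `critical_threshold_hours=0.5`, which promotes a 45-minute recovery from Medium to Critical.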

Dependencies and Integrations

MCP dependencies (all from AWS Labs): the servers listed under "MCP server dependencies" above.

Integration with existing plugins:

  • Complements deploy-on-aws: deploy-on-aws generates dev-sized IaC; aws-resiliency reviews it for production hardening. Recommended workflow: generate → review → harden → redeploy.
  • Complements aws-observability (RFC: Add AWS Observability plugin #67): Resiliency review surfaces missing CloudWatch alarms, absent X-Ray tracing, and health check gaps; these findings feed directly into aws-observability workflows.

Reference implementation: A working version of the skill and reference files is available at https://github.com/nirmal84/aws-resiliency-plugin

 

Potential Challenges

  • IaC language detection in monorepos: Multi-file CDK projects with multiple stacks require heuristic detection. Mitigation: SKILL.md instructs the agent to ask the user to identify the main stack file if auto-detection is ambiguous.
  • Two-layer correlation: Connecting an IaC finding (RDS Multi-AZ enabled) with an application finding (connection pool not handling the 60s DNS failover window) requires both layers to be present. Mitigation: SKILL.md instructs the agent to flag missing layers and deliver a partial review with explicit caveats.
  • WAF question ID currency: REL pillar question IDs change across framework versions. Mitigation: References use stable question titles rather than volatile IDs; aws-documentation-mcp-server fetches current content when available.
  • Reference file size: Domain reference files covering IaC checks, application code checks, and failure modes will approach the 100-line guideline. Mitigation: Files load only when the relevant domain is detected — a Lambda-only review never loads multi-region-dr.md. SKILL.md stays under 200 lines.
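The detection challenge in the first bullet can be made concrete with a small heuristic. This is only a sketch under stated assumptions: the marker strings (CDK import names, `AWS::Serverless`, `AWSTemplateFormatVersion`) and the "contains the word Stack" test are illustrative guesses, not the rules SKILL.md actually uses. Returning `None` stands for the ambiguous case where the agent asks the user.

```python
import re
from pathlib import Path

# Import/package names used by CDK in TypeScript, Python, Java, and Go.
CDK_MARKERS = ("aws-cdk-lib", "aws_cdk", "software.amazon.awscdk", "aws-cdk-go")


def detect_iac_format(filename, text):
    """Guess the IaC format of one file from its extension and content."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".tf":
        return "terraform"
    if suffix in {".yaml", ".yml", ".json"}:
        # Check SAM first: SAM templates may also carry AWSTemplateFormatVersion.
        if "AWS::Serverless" in text:
            return "sam"
        if "AWSTemplateFormatVersion" in text or "AWS::" in text:
            return "cloudformation"
        return None
    if suffix in {".ts", ".py", ".java", ".go"} and any(m in text for m in CDK_MARKERS):
        return "cdk"
    return None


def find_main_stack(files):
    """Return the single CDK file that mentions a Stack, or None if ambiguous.

    `files` maps file name to file content. None means "ask the user".
    """
    candidates = [
        name for name, text in files.items()
        if detect_iac_format(name, text) == "cdk" and re.search(r"\bStack\b", text)
    ]
    return candidates[0] if len(candidates) == 1 else None
```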
