RFC: AWS Resiliency Plugin #75

### Is this related to an existing feature request or issue?

_No response_

### Summary

The AWS Resiliency Plugin gives your AI agent a failure-mode-first approach to architecture reviews. Share your IaC or application code, and the agent reviews it across 7 resiliency domains — surfacing single points of failure, missing failover configurations, and application-level gaps that cause production incidents.



### Use case


Target users: Cloud Engineers, SREs, Platform Engineers, and Cloud Architects preparing production deployments, operational readiness reviews (ORRs), game-day exercises, or Well-Architected Reviews (WARs).

**Key use cases:**

- **Pre-production hardening:** Review a CDK/Terraform stack before it goes to production and find the missing Multi-AZ, the absent DLQ, and the Lambda timeout that will silently drop requests.
- **ORR preparation:** Run a structured review across all seven resiliency domains to validate readiness criteria before launch.
- **Incident post-mortem:** Share the IaC of a service that just had an outage, and the agent looks for the architectural gap that caused it.
- **Game-day planning:** Identify blast radius and failure modes to design meaningful chaos experiments.
- **WAR prep:** Map findings to Well-Architected REL pillar questions with evidence from the actual code.

### Before / After user experience

**Before (without the plugin):**

User: Is this CDK stack resilient?

Agent: Here are some general resiliency best practices:
 1. Use Multi-AZ deployments
 2. Implement retry logic
 3. Consider disaster recovery
 [...10 generic bullet points with no reference to the actual code...]

**After (with aws-resiliency):**

User: Is this CDK stack resilient?
 [pastes CDK TypeScript stack]


Agent: Two-layer resiliency review complete across 7 domains.

 | Domain | Finding | Severity | Blast Radius | RTO Impact | Fix |
 |-------------|--------------------------------------|----------|---------------------------|-------------|----------------------------------------|
 | COMPUTE | ECS desiredCount=1, no multi-AZ | Critical | Full service outage | ~5 min | desiredCount≥2 + AZ placement strategy |
 | DATA | RDS missing multiAz | Critical | Read+write outage | 10–30 min | multiAz: true |
 | COMPUTE | Lambda timeout=3s on payments fn | High | Silent 504s to ALB | Per-request | timeout: 29 (below the ALB's 60 s default idle timeout) |
 | MESSAGING | SQS consumer missing DLQ | High | Message loss on failure | N/A | deadLetterQueue, maxReceiveCount: 3 |
 | NETWORKING | Single NAT Gateway (one AZ) | Medium | Outbound loss if AZ fails | ~5 min | One NAT GW per AZ |


Layer 2 (application code): The RDS client is correctly initialised outside the handler, but a connection pool with maxConnections=10 will exhaust database connections during the reconnection storm that follows an RDS failover. Recommend maxConnections=2 with exponential backoff on connect.

Top risk: single-AZ RDS and a single-task ECS service together mean any AZ failure causes a full outage with manual recovery. Fix both before production. REL 6 (REL_FAULT_ISOLATION), REL 9 (REL_MITIGATE_BULK_FAILURES).
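
The Layer 2 recommendation above ("exponential backoff on connect") could look roughly like the following. This is a minimal sketch, not the plugin's output: the function name, parameter names, and defaults are all hypothetical, and `connect` stands in for whatever database driver call the application uses.

```python
import random
import time


def connect_with_backoff(connect, *, attempts=6, base=0.5, cap=30.0,
                         sleep=time.sleep, rng=random.random):
    """Call connect() until it succeeds, sleeping a jittered, exponentially
    growing delay between failed attempts.

    All names and defaults here are hypothetical. Jitter spreads reconnects
    out so clients do not retry in lockstep after a failover.
    """
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # retry budget exhausted: surface the last error
            # Full jitter: uniform in [0, min(cap, base * 2**attempt)).
            sleep(rng() * min(cap, base * 2 ** attempt))
```

The `sleep` and `rng` parameters exist only so the delay schedule can be tested without waiting. In practice, `attempts`, `base`, and `cap` would be tuned so the total retry budget spans the failover window of the database in question.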


<h2>Proposal</h2>
<h3>Plugin structure</h3>
<pre><code>plugins/aws-resiliency/
aws-resiliency-plugin/
├── .claude-plugin/
│   └── plugin.json
├── skills/
│   ├── SKILL.md                              ← orchestrator (loads references on demand)
│   └── references/
│       ├── compute-resiliency.md
│       ├── data-resiliency.md
│       ├── networking-resiliency.md
│       ├── storage-resiliency.md
│       ├── messaging-resiliency.md
│       ├── observability-resiliency.md
│       ├── multi-region-dr.md
│       ├── service-failure-modes.md
│       └── well-architected-reliability.md
├── mcp.json                                  ← 6 AWS MCP server configs
└── README.md
</code></pre>
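
To illustrate what "loads references on demand" means for the orchestrator, here is a rough sketch of the selection logic. In the plugin itself this decision is made by the agent following SKILL.md, not by code. The keyword lists below are hypothetical, and the two cross-cutting files (`service-failure-modes.md`, `well-architected-reliability.md`) are left out because the RFC does not say when they load.

```python
import re

# Hypothetical keyword hints per domain; the file names come from the tree above.
DOMAIN_HINTS = {
    "compute-resiliency.md": {"lambda", "ecs", "ec2", "fargate"},
    "data-resiliency.md": {"rds", "dynamodb", "aurora"},
    "networking-resiliency.md": {"vpc", "route53", "elbv2"},
    "storage-resiliency.md": {"s3", "efs", "ebs"},
    "messaging-resiliency.md": {"sqs", "sns", "kinesis"},
    "observability-resiliency.md": {"cloudwatch", "xray"},
    "multi-region-dr.md": {"globaltable", "crossregion"},
}


def references_to_load(source):
    """Return the reference files whose domain appears in the shared code."""
    # Split on anything that is not a letter or digit, so both
    # `lambda.Function` (CDK) and `aws_lambda_function` (HCL) yield "lambda".
    tokens = set(re.findall(r"[a-z0-9]+", source.lower()))
    return sorted(name for name, hints in DOMAIN_HINTS.items() if tokens & hints)
```

With this sketch, a Lambda-only snippet selects only `compute-resiliency.md`, which keeps the context loaded for a review proportional to the services actually present.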

<h3>MCP server dependencies</h3>

Server | Type | Purpose | Required?
-- | -- | -- | --
awslabs.aws-documentation-mcp-server | stdio | Well-Architected REL pillar references, AWS service documentation, service limits, SLA definitions | Required
awslabs.aws-iac-mcp-server | stdio | CDK + CloudFormation resource schema lookups, cfn-lint validation, cfn-guard compliance checks, IaC best-practice patterns | When CDK/CFn detected
awslabs.terraform-mcp-server | stdio | Terraform provider registry, module validation, HCL resource schema checks | When Terraform HCL detected
awslabs.aws-knowledge-mcp-server | stdio | Architecture best practices, cross-service integration patterns, service-specific failure behaviour FAQs | On-demand for service behaviour lookups





<body lang=EN-US link="#0563C1" vlink="#96607D" style='tab-interval:.5in;
word-wrap:break-word'>


<code>awslabs.aws-documentation-mcp-server</code> is the only required server; it grounds every finding in the Well-Architected REL pillar. The IaC and Terraform servers are invoked conditionally, based on the detected input format. The knowledge server is invoked when the agent needs to confirm specific service behaviours (e.g., the exact RDS DNS failover propagation window, or DynamoDB eventual consistency semantics under partition).
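
The conditional invocation described above can be sketched as follows. The server names come from the table in this section; the detection heuristics (file extension, marker strings) and both function names are assumptions made for illustration, not the plugin's actual rules.

```python
DOCS = "awslabs.aws-documentation-mcp-server"
IAC = "awslabs.aws-iac-mcp-server"
TERRAFORM = "awslabs.terraform-mcp-server"
KNOWLEDGE = "awslabs.aws-knowledge-mcp-server"


def detect_format(filename, content):
    """Guess the IaC format from hypothetical filename and content markers."""
    name = filename.lower()
    if name.endswith((".tf", ".tf.json")):
        return "terraform"
    if "AWSTemplateFormatVersion" in content or "AWS::Serverless" in content:
        return "cloudformation"
    if "aws-cdk-lib" in content or "aws_cdk" in content:
        return "cdk"
    return None  # ambiguous: the agent would ask the user instead


def servers_for(filename, content, needs_service_lookup=False):
    """Return the MCP servers to invoke for one input file."""
    servers = [DOCS]  # the documentation server is always required
    fmt = detect_format(filename, content)
    if fmt in ("cdk", "cloudformation"):
        servers.append(IAC)
    elif fmt == "terraform":
        servers.append(TERRAFORM)
    if needs_service_lookup:
        servers.append(KNOWLEDGE)
    return servers
```

For example, `servers_for("main.tf", 'resource "aws_instance" "x" {}')` selects the documentation and Terraform servers only, so a Terraform review never starts the CDK/CloudFormation server.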

<h3>Defaults</h3>


Setting | Default | How to Override
-- | -- | --
IaC languages | CDK (TypeScript, Python, Java, Go), Terraform HCL, CloudFormation YAML/JSON, SAM YAML | Auto-detected from syntax and file extension
Severity: Critical | Single point of failure, no automated recovery, estimated RTO > 4 hours | State: "my RTO requirement is X"
Severity: High | Significant reliability gap, degraded service, RTO 1–4 hours |  
Severity: Medium | Well-Architected best-practice violation, limited blast radius, RTO < 1 hour |  
Severity: Low | Improvement opportunity, no immediate blast radius, affects future scale |  
Default RTO threshold | 4 hours (Critical if exceeded) | State target RTO explicitly
Default RPO threshold | 1 hour (Critical if exceeded) | State target RPO explicitly
Review scope | All IaC files in the current directory + application code shared in the conversation | Specify: "review only the data layer"
Output format | Markdown findings table (Domain / Finding / Severity / Blast Radius / RTO Impact / Fix / WAF Ref) + Layer 2 narrative + top-risk summary |  
Fix format | Corrected code/config snippet in the same IaC language as the input |  
WAF references | Mapped to Well-Architected REL pillar question titles (stable across versions) |  
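
As a rough illustration of the RTO bands in the table above, the sketch below classifies a finding from its estimated RTO alone. It is deliberately partial: the function name is hypothetical, the non-RTO criteria in the table (single point of failure, blast radius) are not modelled, "Low" has no RTO band in the table so it is never returned, and the assumption that a user-stated target moves only the Critical threshold is mine, not the RFC's.

```python
def classify_by_rto(estimated_rto_hours, target_rto_hours=4.0):
    """Map an estimated RTO (hours) to a severity band.

    target_rto_hours defaults to the 4-hour threshold from the Defaults
    table; a user-stated RTO requirement would be passed here instead.
    """
    if estimated_rto_hours > target_rto_hours:
        return "Critical"  # exceeds the RTO target
    if estimated_rto_hours >= 1.0:
        return "High"      # the 1-4 hour band under the default target
    return "Medium"        # under one hour


print(classify_by_rto(5))                      # Critical
print(classify_by_rto(2))                      # High
print(classify_by_rto(2, target_rto_hours=1))  # Critical
```

The last call shows the override path: once the user states "my RTO requirement is 1 hour", a two-hour recovery that was High under the default becomes Critical.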




<h2>Dependencies and Integrations</h2>

MCP dependencies (all from AWS Labs):

<ul>
 <li><a href="https://github.com/awslabs/mcp"><code>awslabs.aws-documentation-mcp-server</code></a>: Well-Architected REL pillar, service docs, SLAs</li>
 <li><a href="https://github.com/awslabs/mcp"><code>awslabs.aws-iac-mcp-server</code></a>: CDK + CloudFormation validation, cfn-lint, cfn-guard</li>
 <li><a href="https://github.com/awslabs/mcp"><code>awslabs.terraform-mcp-server</code></a>: Terraform provider/module validation</li>
 <li><a href="https://github.com/awslabs/mcp"><code>awslabs.aws-knowledge-mcp-server</code></a>: Architecture patterns, service failure behaviour</li>
</ul>

Integration with existing plugins:

<ul>
 <li>Complements <code>deploy-on-aws</code>: <code>deploy-on-aws</code> generates dev-sized IaC; <code>aws-resiliency</code> reviews it for production hardening. Recommended workflow: generate → review → harden → redeploy.</li>
 <li>Complements <code>aws-observability</code> (RFC #67): The resiliency review surfaces missing CloudWatch alarms, absent X-Ray tracing, and health check gaps. These findings feed directly into <code>aws-observability</code> workflows.</li>
</ul>

Reference implementation: A working version of the skill and reference files is available at <a href="https://github.com/nirmal84/aws-resiliency-plugin">https://github.com/nirmal84/aws-resiliency-plugin</a>.

<h2>Potential Challenges</h2>

<ul>
 <li>IaC language detection in monorepos: Multi-file CDK projects with multiple stacks require heuristic detection. Mitigation: SKILL.md instructs the agent to ask the user to identify the main stack file if auto-detection is ambiguous.</li>
 <li>Two-layer correlation: Connecting an IaC finding (RDS Multi-AZ enabled) with an application finding (connection pool not handling the 60s DNS failover window) requires both layers to be present. Mitigation: SKILL.md instructs the agent to flag missing layers and deliver a partial review with explicit caveats.</li>
 <li>WAF question ID currency: REL pillar question IDs change across framework versions. Mitigation: References use stable question titles rather than volatile IDs; <code>aws-documentation-mcp-server</code> fetches current content when available.</li>
 <li>Reference file size: Domain reference files covering IaC checks, application code checks, and failure modes will approach the 100-line guideline. Mitigation: Files load only when the relevant domain is detected, so a Lambda-only review never loads <code>multi-region-dr.md</code>. SKILL.md stays under 200 lines.</li>
</ul>


</body>

