Note: All documentation in this repo is available as rendered, searchable HTML at aws.github.io/modern-data-architecture-accelerator.
The Modern Data Architecture Accelerator (MDAA) helps organizations deploy secure, compliant data analytics and AI environments on Amazon Web Services (AWS) through simple YAML configuration files. Whether you need a basic data lake, a full data science platform, a SageMaker Unified Studio environment, or a generative AI solution, MDAA provides prepackaged starter kits and reusable infrastructure components that handle security compliance out of the box. It supports teams of all sizes, from small organizations looking for code-free deployment to large enterprises building complex Lake House or Data Mesh architectures.
- Who Is This For?
- Key Features
- Architecture
- Security
- Quick Start
- Implementation Guide
- Workshops and Learning Resources
- Starter Kits
- Sample Configurations
- Available Modules
- Using and Extending MDAA
- For Developers
- Contributing
- License
- Data and Cloud Architects: Design and govern enterprise data platforms with standardized, compliance-ready building blocks.
- Data Engineers: Build and manage data pipelines, lakes, and warehouses with pre-configured, compliant infrastructure.
- Data Scientists and ML Engineers: Get a ready-to-use SageMaker Unified Studio environment with governed data access so you can focus on models, not infrastructure.
- Business Analysts: Access governed data through Athena, QuickSight, and other analytics tools deployed by your platform team.
- Compliance Officers: Gain confidence that deployed infrastructure aligns with NIST 800-53, HIPAA, and PCI-DSS security control requirements.
- Security compliance built in: Modules are designed for compliance with AWS Solutions, NIST 800-53 Rev5, HIPAA, PCI-DSS, and ITSG-33 CDK Nag rulesets.
- Configuration-driven deployment: Define your entire modern data and analytics environment in YAML files and deploy with a single CLI command. No custom code required.
- Starter kits for common use cases: Prepackaged configurations for data lakes, data science platforms, generative AI, governed lakehouses, and healthcare data.
- Multi-account and multi-region: Deploy across multiple AWS accounts and regions with built-in cross-account trust and governance.
- Multi-language support: Reusable CDK L2 constructs available in TypeScript, Python, Java, and .NET via JSII (JavaScript Interop Interface). L3 constructs are currently TypeScript-only.
MDAA is designed as a set of modules. Each module configures and deploys a set of resources which constitute the data analytics environment. Modules may have dependencies on each other, and may also leverage non-MDAA resources deployed within the environment.
While MDAA can be used to implement a comprehensive, end-to-end modern data architecture, it does not result in a closed system. MDAA may be freely integrated with non-MDAA deployed platform elements and data capabilities. Any individual module of MDAA can be replaced by a non-MDAA component, and the remaining modules will continue to function.
See SECURITY.md for details on MDAA's security design principles and compliance approach.
See CONTRIBUTING.md for information on reporting security issues.
Deploy your first data lake in minutes using the Basic DataLake starter kit, or begin with one of the other starter kits listed under Starter Kits.
Prerequisites:
- Node.js 22.x and npm 10.x
- AWS credentials configured with appropriate permissions (AWS CLI setup)
- AWS CDK (Cloud Development Kit) bootstrapped in your target account (CDK bootstrap guide)
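The prerequisites above can be checked before a first deployment with a short script. The following is a minimal sketch, not part of MDAA: the helper names (`major_version`, `check_major`, `preflight`) are hypothetical, and it assumes the AWS CLI is installed so that `aws sts get-caller-identity` can confirm that credentials are active.

```shell
#!/usr/bin/env bash
# Pre-flight sketch for the prerequisites listed above.
# Assumption: the AWS CLI is installed; helper names are illustrative only.

# Extract the major version from strings such as "v22.11.0" or "10.9.2".
major_version() {
  local v="${1#v}"
  echo "${v%%.*}"
}

# Report whether a tool's major version matches the documented one.
check_major() {
  local name="$1" actual="$2" expected="$3"
  if [ "$(major_version "$actual")" = "$expected" ]; then
    echo "ok: $name $actual"
  else
    echo "mismatch: $name $actual (expected ${expected}.x)"
    return 1
  fi
}

# Run all checks; stops at the first failure.
preflight() {
  check_major "node" "$(node --version)" 22 || return 1
  check_major "npm" "$(npm --version)" 10 || return 1
  # Fails when no valid AWS credentials are configured.
  aws sts get-caller-identity --query Account --output text >/dev/null || return 1
}
```

Calling `preflight` before the deploy command surfaces a version or credential problem early rather than partway through a deployment.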
1. Clone the repo and navigate to the Basic DataLake starter kit:

       git clone https://github.com/aws/modern-data-architecture-accelerator.git
       cd modern-data-architecture-accelerator/starter_kits/basic_datalake

2. Edit `mdaa.yaml` to specify an organization name. This must be globally unique, as it is used in the naming of all deployed resources (including globally named resources such as S3 buckets).
3. If required, edit `mdaa.yaml` to specify `context:` values specific to your environment.
4. Ensure you are authenticated to your target AWS account.
5. Bootstrap your AWS account for CDK (if not already done):

       npx cdk bootstrap

6. Deploy using npx (no installation required):

       npx @aws-mdaa/cli deploy -c mdaa.yaml

   Or install the CLI globally and then deploy:

       npm install -g @aws-mdaa/cli
       mdaa deploy -c mdaa.yaml

Estimated deployment time: ~15–20 minutes
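Because the organization name ends up inside S3 bucket names, it helps to pick a value that already fits S3 naming rules: lowercase letters, digits, and hyphens, starting and ending with a letter or digit. The check below is a sketch under that assumption only. `valid_org_name` is a hypothetical helper, and MDAA's own validation may be stricter or different.

```shell
#!/usr/bin/env bash
# Sketch: test a candidate organization name against S3-style naming rules.
# Assumption: this mirrors S3 bucket naming only, not MDAA's own validation.

valid_org_name() {
  # Use the C locale so that a-z matches lowercase ASCII only.
  local LC_ALL=C
  case "$1" in
    # Reject empty names, disallowed characters, and leading or trailing hyphens.
    '' | *[!a-z0-9-]* | -* | *-) return 1 ;;
    *) return 0 ;;
  esac
}
```

For example, `valid_org_name acme-data-01` succeeds, while `valid_org_name Acme_Data` fails because of the uppercase letters and the underscore. Passing this check does not guarantee global uniqueness, which only the deployment itself can confirm.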
For full deployment details, see the Basic DataLake starter kit README.
The Basic DataLake starter kit creates a secure, encrypted Amazon S3 data lake with AWS Glue databases and crawlers, AWS Identity and Access Management (IAM) roles with least-privilege policies, and AWS Key Management Service (KMS) encryption keys, all configured for compliance with standard security rulesets.
Looking for a different starting point? See Starter Kits for other prepackaged options including data science platforms, generative AI, and more.
MDAA follows a five-phase deployment lifecycle: Architecture (define your target platform design), Configuration (author YAML config files for each module), Customization (optionally extend via code-based escape hatches), Predeployment (bootstrap AWS accounts), and Deployment (deploy via the MDAA CLI). Each phase builds on the previous one, and starter kits can accelerate the first two phases significantly.
| Phase | Description | Time Estimate |
|---|---|---|
| Architecture | Define your target platform design and select modules | 1–2 days |
| Configuration | Author YAML config files for each module | 1–3 days |
| Customization | Optionally extend via code-based escape hatches | 0–2 days |
| Predeployment | Bootstrap AWS accounts with CDK | 2–10 min |
| Deployment | Deploy via the MDAA CLI | 15 min – 1 hour |
For the full step-by-step guide, see the MDAA Implementation Guide. Sample configurations offer further ready-made starting points for the Architecture and Configuration phases.
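For the Predeployment phase in a multi-account or multi-region setup, each target account and region needs to be bootstrapped. The sketch below loops over account and region pairs using the `aws://ACCOUNT/REGION` form that `cdk bootstrap` accepts. The account IDs, regions, and helper names are placeholders, and any additional bootstrap flags MDAA expects (for example, for cross-account trust) are not shown; the Predeployment Guide is the authority on those.

```shell
#!/usr/bin/env bash
# Sketch: bootstrap several account/region pairs before an MDAA deployment.
# Account IDs, regions, and function names are placeholders, not MDAA values.

# Build the environment identifier that `cdk bootstrap` accepts.
cdk_env() {
  echo "aws://$1/$2"
}

# Bootstrap every "account region" pair passed as an argument.
bootstrap_all() {
  local target account region
  for target in "$@"; do
    account="${target%% *}"
    region="${target##* }"
    npx cdk bootstrap "$(cdk_env "$account" "$region")" || return 1
  done
}

# Example call (credentials for each account must be active when it runs):
# bootstrap_all "111111111111 us-east-1" "222222222222 eu-west-1"
```

The loop stops at the first failed bootstrap so that a credential problem in one account is noticed before moving on to the next.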
- MDAA Hands-On Workshop: A guided, hands-on workshop that walks you through deploying and configuring MDAA from scratch.
- External Sample Configurations: A community-maintained repository of additional MDAA configurations for various use cases and architectures.
- Starter Kits: Prepackaged, secure MDAA configurations for common use cases, included in this repository.
Browse the full documentation, module references, and configuration schemas at aws.github.io/modern-data-architecture-accelerator.
- Architecture and Design Guide: Reference architectures and design patterns for MDAA deployments.
- Configuration Guide: How to author MDAA YAML configuration files.
- Customization Guide: How to extend MDAA modules with code-based escape hatches.
- Predeployment Guide: How to prepare your AWS accounts for MDAA deployment.
- Deployment Guide: Step-by-step deployment instructions using the MDAA CLI.
Starter kits provide secure, prepackaged foundations for common use cases:
| Starter Kit | Description | Est. Deploy Time |
|---|---|---|
| Basic DataLake | A secure S3 data lake with Glue databases and crawlers | ~15–20 min |
| Basic DataScience Platform | A standalone SageMaker AI Studio data science environment | ~20–30 min |
| Governed Lakehouse | DataZone-governed lakehouse with fine-grained access control | ~20–25 min |
| Health Data Accelerator | Healthcare data lake with DMS (Database Migration Service) integration | ~30–45 min |
| SMUS Research Environment | A SageMaker Unified Studio-enabled architecture suitable for facilitating team-based research activities | ~20–25 min |
| SMUS Data Mesh | Multi-account SageMaker Unified Studio deployment with cross-account data sharing and custom blueprints | ~30–45 min |
Additional sample configurations are available in a dedicated repository for easier community contribution and faster updates.
MDAA is implemented as a set of compliant modules deployed via a unified orchestration layer. For detailed module documentation, configuration schemas, and API references, see the MDAA Documentation Site.
- SageMaker Unified Studio - Deploy SageMaker Unified Studio domains and associated resources.
- DataZone - Deploy DataZone domains and environment blueprints.
- Macie Session - Deploy Macie sessions at the account level.
- LakeFormation Data Lake Settings - Administer LakeFormation settings using IaC.
- LakeFormation Access Controls - Administer LakeFormation access controls using IaC.
- Glue Catalog - Configure Glue Catalog encryption and cross-account access.
- IAM Roles and Policies - Generate IAM roles for the data environment.
- Audit - Generate audit resources for data capture and Athena querying.
- Audit Trail - Generate CloudTrail for S3 data events.
- Service Catalog - Deploy Service Catalog portfolios and grant access.
- SageMaker Projects - Deploy SageMaker Unified Studio projects and associated resources.
- Datalake KMS and Buckets - Generate encrypted data lake buckets with compliant policies.
- Athena Workgroup - Generate Athena workgroups for data lake querying.
- Data Ops Project - Shared secure resources for data ops pipelines.
- Data Ops Crawlers - Glue crawlers for data ops pipelines.
- Data Ops Jobs - Glue jobs for data ops pipelines.
- Data Ops Workflows - Glue workflows for orchestrating pipelines.
- Data Ops Step Functions - Step Functions for pipeline orchestration.
- Data Ops Lambda - Lambda functions for data event processing.
- Data Ops DataBrew - Glue DataBrew for data profiling and cleansing.
- Data Ops Nifi - Apache NiFi clusters for event-driven data flows.
- Data Ops DMS - DMS replication instances, endpoints, and tasks.
- Data Ops Dashboard - CloudWatch dashboards for MDAA observability.
- Data Ops Data Quality - Glue Data Quality rulesets for automated data validation.
- Data Ops DynamoDB - DynamoDB tables for data operations.
- Redshift Data Warehouse - Secure Redshift data warehouse clusters.
- OpenSearch Domain - Secure OpenSearch domains and dashboards.
- QuickSight Account - Deploy QuickSight account resources.
- QuickSight Namespace - QuickSight namespaces for multi-tenancy.
- QuickSight Project - QuickSight shared folders and permissions.
- SageMaker Unified Studio - Secured SageMaker Unified Studio.
- SageMaker Notebooks - Secured SageMaker notebooks.
- Data Science Team/Project - Resources for team data science activities.
- Generative AI Accelerator - Authenticated GenAI chatbot with WebSocket streaming, Bedrock Knowledge Base RAG, and admin and client UIs.
- Generative AI Accelerator v1 (deprecated) - Previous generation of the GenAI chatbot. New deployments should use `@aws-mdaa/gaia-v2`; v1 remains published for existing deployments and will be removed in a future release. See the migration guide.
- SageMaker Ground Truth - Automated continuous data labeling pipeline with Ground Truth and Feature Store integration.
- SageMaker MLOps - Unified ML training and deployment pipeline with cross-account model promotion.
- SageMaker Pipeline - Declarative SageMaker Pipeline defined in CDK/CloudFormation with no seed code required.
- SageMaker Endpoint - Real-time SageMaker inference endpoint from an approved model package.
- SageMaker Model Monitoring - Continuous monitoring of production inference endpoints for drift, degradation, and bias.
- Bedrock AgentCore Runtime - Deploy Amazon Bedrock AgentCore Runtimes with custom Docker containers.
- Bedrock Builder - Deploy secure Bedrock Agents, Knowledge Bases, and associated resources.
- Bedrock Settings - Configure Bedrock model invocation audit logging to S3 and CloudWatch.
- EC2 - Secure EC2 instances and security groups.
- SFTP Transfer Family Server - SFTP Transfer Family for data lake ingestion.
- SFTP Transfer Family User Admin - Administer SFTP Transfer Family users.
- DataSync - DataSync for on-premises to cloud data movement.
- EventBridge - EventBridge resources such as event buses.
- Machine to Machine API - REST API for programmatic data lake interaction.
MDAA can be used and extended in three ways:
- Deploy compliant, end-to-end analytics environments using YAML config files and the MDAA CLI. No code is required, which makes this path accessible to all roles for both simple and complex deployments, with high compliance assurance.
- Build custom analytics environments using MDAA's reusable CDK constructs. L2 constructs support TypeScript, Python, Java, and .NET; L3 constructs are currently TypeScript-only.
- Integrate independently developed workloads (CDK or CloudFormation) with MDAA-deployed resources through the standard set of SSM (Systems Manager) parameters published by all MDAA modules.
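As a rough illustration of the last approach, a workload's deployment script could read a value published by an MDAA module with the AWS CLI. The parameter path below is a hypothetical placeholder, not a documented MDAA name; check the module documentation or the SSM console for the names your deployment actually publishes.

```shell
#!/usr/bin/env bash
# Sketch: read a value that an MDAA module published to SSM Parameter Store.
# The parameter path is a hypothetical placeholder; substitute a real one.

PARAM_NAME="/my-org/my-domain/my-module/example/name"

# Print the value of the given SSM parameter.
read_mdaa_param() {
  aws ssm get-parameter --name "$1" --query 'Parameter.Value' --output text
}

# Example call (requires credentials with ssm:GetParameter permission):
# bucket_name="$(read_mdaa_param "$PARAM_NAME")"
```

Reading the value at deploy time keeps the workload decoupled from MDAA's internals: it depends only on the parameter name, not on how the module created the resource.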
This solution collects anonymous operational metrics to help AWS improve quality and features. For more information, including how to disable this capability, see the CDK version reporting documentation.
For detailed guides, see:
- CONTRIBUTING.md - Project architecture, coding guidelines, and pull request process.
- DEVELOPMENT.md - Development environment setup, build process, and tooling.
- TESTING.md - Testing standards, architecture, and coverage requirements.
Full documentation and module references are available at aws.github.io/modern-data-architecture-accelerator. To generate the docs locally, run `mkdocs serve` from the project root (requires MkDocs).
We welcome contributions from the community. See CONTRIBUTING.md for guidelines on how to get started, set up your development environment, and submit pull requests.
This project is licensed under the Apache-2.0 License.


