Production‑Grade AWS Data Platform (ETL/ELT · Redshift · IaC)

A production-grade data engineering platform demonstrating end-to-end ETL/ELT pipelines on AWS, featuring advanced Python scripting, complex SQL analytics, and comprehensive data architecture.


🎯 Project Overview

This project showcases a complete data platform that:

  • Ingests data from multiple sources into a data lake
  • Transforms data through scalable ETL pipelines
  • Loads analytics-ready datasets into a data warehouse
  • Demonstrates advanced Python, SQL, and AWS skills

Skills Demonstrated:

  • ✅ Advanced Python (ETL frameworks, data generators, AWS automation)
  • ✅ Complex SQL (window functions, CTEs, star schema design)
  • ✅ AWS Services (S3, Glue, Redshift, Lambda, CloudWatch)
  • ✅ Infrastructure as Code (Terraform)
  • ✅ Data Architecture (Data Lake + Data Warehouse)

🏗️ Architecture

┌─────────────────┐
│  Data Sources   │
│  (Synthetic)    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Data Lake     │
│   (Amazon S3)   │
│  Raw → Stage →  │
│    Curated      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Processing    │
│  (AWS Glue)     │
│   PySpark ETL   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Data Warehouse │
│ (Amazon Redshift│
│   Star Schema)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Analytics &   │
│     BI Tools    │
└─────────────────┘

📁 Project Structure

├── scripts/              # Management scripts
│   ├── install.sh        # Install dependencies
│   ├── deploy.sh         # Deploy infrastructure
│   ├── start.sh          # Start resources (resume Redshift)
│   ├── stop.sh           # Stop resources (pause Redshift)
│   ├── status.sh         # Check resource status
│   ├── run_pipeline.sh   # Execute data pipeline
│   ├── load_sample_data.sh  # Load sample data
│   └── destroy.sh        # Destroy infrastructure
├── terraform/            # Infrastructure as Code
│   ├── modules/          # Terraform modules
│   │   ├── s3/           # Data lake buckets
│   │   ├── iam/          # IAM roles & policies
│   │   ├── glue/         # Glue catalog & jobs
│   │   ├── redshift/     # Redshift cluster
│   │   └── monitoring/   # CloudWatch dashboards
│   └── environments/     # Environment configs
├── python/               # Python modules
│   ├── etl_framework/    # Reusable ETL base classes
│   ├── data_generators/  # Synthetic data generators
│   ├── data_quality/     # Validation framework
│   ├── transformations/  # Business logic
│   ├── aws_utils/        # AWS helper functions
│   └── monitoring/       # Metrics & alerting
├── etl/                  # ETL jobs
│   ├── glue_jobs/        # PySpark jobs for Glue
│   └── lambda_functions/ # Lambda functions
├── sql/                  # SQL scripts
│   ├── ddl/              # Schema definitions
│   ├── dml/              # Data manipulation
│   ├── analytics/        # Analytical queries
│   └── data_quality/     # Data quality checks
└── tests/                # Test suite
    ├── unit/             # Unit tests
    └── integration/      # Integration tests
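The python/etl_framework package above centers on reusable ETL base classes. A minimal sketch of what such a base class might look like, using the template-method pattern (class and method names here are illustrative, not the project's actual API):

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseETLJob(ABC):
    """Template-method base class: subclasses supply the three stages."""

    def __init__(self, job_name: str) -> None:
        self.job_name = job_name

    @abstractmethod
    def extract(self) -> Any:
        """Read raw records from the source."""

    @abstractmethod
    def transform(self, data: Any) -> Any:
        """Apply cleansing and business logic."""

    @abstractmethod
    def load(self, data: Any) -> int:
        """Write results to the target; return the row count."""

    def run(self) -> int:
        """Execute extract -> transform -> load in order."""
        return self.load(self.transform(self.extract()))


class UppercaseNamesJob(BaseETLJob):
    """Toy subclass demonstrating the contract."""

    def __init__(self) -> None:
        super().__init__("uppercase_names")
        self.sink: list[dict] = []

    def extract(self) -> list[dict]:
        return [{"name": "ada"}, {"name": "grace"}]

    def transform(self, data: list[dict]) -> list[dict]:
        return [{**row, "name": row["name"].upper()} for row in data]

    def load(self, data: list[dict]) -> int:
        self.sink.extend(data)
        return len(data)
```

Keeping `run()` in the base class means every job shares one orchestration path, so cross-cutting concerns (logging, metrics, retries) can be added in a single place.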

🚀 Quick Start

Prerequisites

  • Python 3.11+
  • Terraform 1.5+
  • AWS CLI configured
  • AWS Account with appropriate permissions

Installation

  1. Clone the repository:
git clone https://github.com/omaranda/production-grade-data-pipeline-demo.git
cd production-grade-data-pipeline-demo
  2. Run the installation script:
./scripts/install.sh

This will:

  • Check dependencies
  • Create Python virtual environment
  • Install Python packages
  • Validate AWS credentials
  • Create .env configuration file
  3. Configure environment: edit the .env file with your settings:
AWS_REGION=us-east-1
PROJECT_NAME=data-engineering-portfolio
ENVIRONMENT=dev
REDSHIFT_MASTER_PASSWORD=YourSecurePassword123!

Deployment

  1. Deploy infrastructure:
./scripts/deploy.sh dev

This will:

  • Initialize Terraform
  • Create S3 buckets (raw, stage, curated)
  • Provision Redshift cluster
  • Set up Glue catalog and jobs
  • Configure IAM roles
  • Create CloudWatch dashboards

Deployment takes approximately 5-10 minutes.

  2. Check status:
./scripts/status.sh

Running the Pipeline

  1. Load sample data:
./scripts/load_sample_data.sh

This generates synthetic data:

  • 10,000 customers
  • 500 products
  • 50,000 transactions
  2. Execute the pipeline:
./scripts/run_pipeline.sh

This runs the complete ETL pipeline:

  • Uploads data to S3 raw zone
  • Triggers Glue job (raw → stage)
  • Triggers Glue job (stage → curated)
  • Loads data into Redshift
  • Runs sample analytical queries
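The load_sample_data.sh step relies on the python/data_generators package. A rough sketch of how a reproducible customer generator could work, seeded so repeated runs produce identical data (field names and value ranges are assumptions, not the project's actual schema):

```python
import random
import uuid


def generate_customers(count: int, seed: int = 42) -> list[dict]:
    """Generate deterministic synthetic customer records.

    Seeding a private random.Random instance makes the output
    reproducible without touching global random state.
    """
    rng = random.Random(seed)
    segments = ["consumer", "corporate", "home_office"]
    return [
        {
            "customer_id": str(uuid.UUID(int=rng.getrandbits(128))),
            "segment": rng.choice(segments),
            "lifetime_value": round(rng.uniform(10.0, 5000.0), 2),
        }
        for _ in range(count)
    ]
```

Determinism matters here: rerunning the pipeline against the same seed lets ETL output be diffed across runs, which makes integration tests meaningful.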

Cost Management

Stop resources when not in use:

./scripts/stop.sh

This pauses the Redshift cluster, saving ~$6/day.

Resume resources:

./scripts/start.sh

This resumes the Redshift cluster (takes 2-3 minutes).

💰 Cost Estimates

Development Environment (with automation)

  • Redshift (paused 16 hrs/day): ~$60/month
  • S3 Storage (100GB): ~$2.30/month
  • Glue (10 DPU-hours/day): ~$140/month
  • Lambda (1M invocations): ~$0.20/month
  • Data Transfer: ~$5-20/month

Total: ~$50-100/month in practice. Note that the line items above assume daily usage; with pause automation and only occasional Glue job runs, the monthly bill lands well below the sum of those estimates.

Tips to Minimize Costs

  • Run ./scripts/stop.sh when not actively developing
  • Use enable_glue_jobs = false in tfvars during setup
  • Consider using Glue Dev Endpoints for testing
  • Set up budget alerts in AWS Cost Explorer

🛠️ Technologies Used

Category       Technologies
Languages      Python 3.11, SQL (Redshift), PySpark
Cloud          AWS (S3, Glue, Redshift, Lambda, CloudWatch)
IaC            Terraform 1.5+
Testing        pytest, moto
Code Quality   black, flake8, mypy

📊 Data Architecture

Data Lake (S3)

Three-zone architecture:

  • Raw Zone: Landing zone for ingested data (7-day retention)
  • Stage Zone: Cleansed and validated data (30-day retention)
  • Curated Zone: Analytics-ready datasets (permanent storage)

Data Warehouse (Redshift)

Star schema design:

  • Fact Tables: fact_transactions, fact_clickstream
  • Dimension Tables: dim_customers, dim_products, dim_time, dim_location

Optimization strategies:

  • Distribution keys for co-located joins
  • Sort keys on frequently filtered columns
  • Materialized views for common aggregations
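As an illustration of the distribution- and sort-key strategy, the DDL for a fact table might look like the following, rendered here as a string from Python (the table and column names are hypothetical, not the project's actual schema):

```python
def fact_transactions_ddl() -> str:
    """Render Redshift DDL: DISTKEY on customer_key co-locates rows with
    a dim_customers table distributed on the same key, and SORTKEY on
    time_key speeds up the common date-range filters."""
    return """
CREATE TABLE IF NOT EXISTS fact_transactions (
    transaction_id BIGINT NOT NULL,
    customer_key   BIGINT NOT NULL,
    product_key    BIGINT NOT NULL,
    time_key       INTEGER NOT NULL,
    amount         DECIMAL(12, 2)
)
DISTSTYLE KEY
DISTKEY (customer_key)
SORTKEY (time_key);
""".strip()
```

Choosing the join column as the distribution key avoids network shuffles during fact-to-dimension joins, which is typically the dominant cost in star-schema queries.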

📚 Documentation

🧪 Testing

Run tests:

# Activate virtual environment
source venv/bin/activate

# Run all tests
pytest

# Run with coverage
pytest --cov=python --cov=etl --cov-report=html

# Run specific test suite
pytest tests/unit/
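A unit test under tests/unit/ might exercise the data-quality framework along these lines (the validator shown is a stand-in sketch, not the project's actual API):

```python
def check_not_null(rows: list[dict], column: str) -> list[int]:
    """Return the indices of rows where the column is absent or null."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


def test_check_not_null_flags_missing_values():
    rows = [{"id": 1}, {"id": None}, {}]
    assert check_not_null(rows, "id") == [1, 2]
```

Keeping checks as pure functions over plain records makes them trivially unit-testable, while AWS-facing code paths can be covered separately with moto.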

🤝 Contributing

This is a portfolio project, but feedback and suggestions are welcome!

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/improvement)
  3. Commit your changes (git commit -m 'Add improvement')
  4. Push to the branch (git push origin feature/improvement)
  5. Open a Pull Request

📝 License

Copyright 2024 Omar Miranda

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this project except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

See the LICENSE file for the full license text.

👤 Author

Omar Miranda

🙏 Acknowledgments

  • AWS Documentation
  • Terraform Registry
  • Python Data Engineering Community

📌 Project Status

Current Phase: Foundation Complete ✅

Completed:

  • ✅ Project structure and configuration
  • ✅ Terraform infrastructure modules
  • ✅ Shell management scripts
  • ✅ Python ETL framework base classes
  • ✅ Environment configuration

In Progress:

  • 🔄 Data generator implementations
  • 🔄 Glue ETL jobs
  • 🔄 Redshift SQL scripts

Upcoming:

  • ⏳ Unit and integration tests
  • ⏳ Power BI dashboards
  • ⏳ CI/CD pipeline
  • ⏳ Documentation and architecture diagrams
