Production‑Grade AWS Data Platform (ETL/ELT · Redshift · IaC)

A production-grade data engineering platform demonstrating end-to-end ETL/ELT pipelines on AWS, featuring advanced Python scripting, complex SQL analytics, and comprehensive data architecture.


🎯 Project Overview

This project showcases a complete data platform that:

  • Ingests data from multiple sources into a data lake
  • Transforms data through scalable ETL pipelines
  • Loads analytics-ready datasets into a data warehouse
  • Demonstrates advanced Python, SQL, and AWS skills

Skills Demonstrated:

  • ✅ Advanced Python (ETL frameworks, data generators, AWS automation)
  • ✅ Complex SQL (window functions, CTEs, star schema design)
  • ✅ AWS Services (S3, Glue, Redshift, Lambda, CloudWatch)
  • ✅ Infrastructure as Code (Terraform)
  • ✅ Data Architecture (Data Lake + Data Warehouse)

🏗️ Architecture

┌─────────────────┐
│  Data Sources   │
│  (Synthetic)    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Data Lake     │
│   (Amazon S3)   │
│  Raw → Stage →  │
│    Curated      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Processing    │
│  (AWS Glue)     │
│   PySpark ETL   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Data Warehouse │
│ (Amazon Redshift│
│   Star Schema)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Analytics &   │
│     BI Tools    │
└─────────────────┘

📁 Project Structure

├── scripts/              # Management scripts
│   ├── install.sh        # Install dependencies
│   ├── deploy.sh         # Deploy infrastructure
│   ├── start.sh          # Start resources (resume Redshift)
│   ├── stop.sh           # Stop resources (pause Redshift)
│   ├── status.sh         # Check resource status
│   ├── run_pipeline.sh   # Execute data pipeline
│   ├── load_sample_data.sh  # Load sample data
│   └── destroy.sh        # Destroy infrastructure
├── terraform/            # Infrastructure as Code
│   ├── modules/          # Terraform modules
│   │   ├── s3/           # Data lake buckets
│   │   ├── iam/          # IAM roles & policies
│   │   ├── glue/         # Glue catalog & jobs
│   │   ├── redshift/     # Redshift cluster
│   │   └── monitoring/   # CloudWatch dashboards
│   └── environments/     # Environment configs
├── python/               # Python modules
│   ├── etl_framework/    # Reusable ETL base classes
│   ├── data_generators/  # Synthetic data generators
│   ├── data_quality/     # Validation framework
│   ├── transformations/  # Business logic
│   ├── aws_utils/        # AWS helper functions
│   └── monitoring/       # Metrics & alerting
├── etl/                  # ETL jobs
│   ├── glue_jobs/        # PySpark jobs for Glue
│   └── lambda_functions/ # Lambda functions
├── sql/                  # SQL scripts
│   ├── ddl/              # Schema definitions
│   ├── dml/              # Data manipulation
│   ├── analytics/        # Analytical queries
│   └── data_quality/     # Data quality checks
└── tests/                # Test suite
    ├── unit/             # Unit tests
    └── integration/      # Integration tests
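The python/etl_framework package above centers on reusable ETL base classes. A minimal sketch of what such a base class might look like, using the template-method pattern (class and method names here are illustrative, not the project's actual API):

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseETLJob(ABC):
    """Template-method base class: subclasses supply the three stages."""

    def __init__(self, job_name: str) -> None:
        self.job_name = job_name

    @abstractmethod
    def extract(self) -> Any:
        """Read raw records from the source."""

    @abstractmethod
    def transform(self, data: Any) -> Any:
        """Apply cleansing and business logic."""

    @abstractmethod
    def load(self, data: Any) -> int:
        """Write results to the target; return the row count."""

    def run(self) -> int:
        """Execute extract -> transform -> load in order."""
        return self.load(self.transform(self.extract()))


class UppercaseNamesJob(BaseETLJob):
    """Toy subclass demonstrating the contract."""

    def __init__(self) -> None:
        super().__init__("uppercase_names")
        self.sink: list[dict] = []

    def extract(self) -> list[dict]:
        return [{"name": "ada"}, {"name": "grace"}]

    def transform(self, data: list[dict]) -> list[dict]:
        return [{**row, "name": row["name"].upper()} for row in data]

    def load(self, data: list[dict]) -> int:
        self.sink.extend(data)
        return len(data)
```

Keeping `run()` in the base class means every job shares one orchestration path, so cross-cutting concerns (logging, metrics, retries) can be added in a single place.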

🚀 Quick Start

Prerequisites

  • Python 3.11+
  • Terraform 1.5+
  • AWS CLI configured
  • AWS Account with appropriate permissions

Installation

  1. Clone the repository:
git clone https://github.com/omaranda/production-grade-data-pipeline-demo.git
cd production-grade-data-pipeline-demo
  2. Run the installation script:
./scripts/install.sh

This will:

  • Check dependencies
  • Create Python virtual environment
  • Install Python packages
  • Validate AWS credentials
  • Create .env configuration file
  3. Configure environment: edit the .env file with your settings:
AWS_REGION=us-east-1
PROJECT_NAME=data-engineering-portfolio
ENVIRONMENT=dev
REDSHIFT_MASTER_PASSWORD=YourSecurePassword123!

Deployment

  1. Deploy infrastructure:
./scripts/deploy.sh dev

This will:

  • Initialize Terraform
  • Create S3 buckets (raw, stage, curated)
  • Provision Redshift cluster
  • Set up Glue catalog and jobs
  • Configure IAM roles
  • Create CloudWatch dashboards

Deployment takes approximately 5-10 minutes.

  2. Check status:
./scripts/status.sh

Running the Pipeline

  1. Load sample data:
./scripts/load_sample_data.sh

This generates synthetic data:

  • 10,000 customers
  • 500 products
  • 50,000 transactions
  2. Execute the pipeline:
./scripts/run_pipeline.sh

This runs the complete ETL pipeline:

  • Uploads data to S3 raw zone
  • Triggers Glue job (raw → stage)
  • Triggers Glue job (stage → curated)
  • Loads data into Redshift
  • Runs sample analytical queries
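The load_sample_data.sh step relies on the python/data_generators package. A rough sketch of how a reproducible customer generator could work, seeded so repeated runs produce identical data (field names and value ranges are assumptions, not the project's actual schema):

```python
import random
import uuid


def generate_customers(count: int, seed: int = 42) -> list[dict]:
    """Generate deterministic synthetic customer records.

    Seeding a private random.Random instance makes the output
    reproducible without touching global random state.
    """
    rng = random.Random(seed)
    segments = ["consumer", "corporate", "home_office"]
    return [
        {
            "customer_id": str(uuid.UUID(int=rng.getrandbits(128))),
            "segment": rng.choice(segments),
            "lifetime_value": round(rng.uniform(10.0, 5000.0), 2),
        }
        for _ in range(count)
    ]
```

Determinism matters here: rerunning the pipeline against the same seed lets ETL output be diffed across runs, which makes integration tests meaningful.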

Cost Management

Stop resources when not in use:

./scripts/stop.sh

This pauses the Redshift cluster, saving ~$6/day.

Resume resources:

./scripts/start.sh

This resumes the Redshift cluster (takes 2-3 minutes).

💰 Cost Estimates

Development Environment (with automation)

  • Redshift (paused 16 hrs/day): ~$60/month
  • S3 Storage (100GB): ~$2.30/month
  • Glue (10 DPU-hours/day): ~$140/month
  • Lambda (1M invocations): ~$0.20/month
  • Data Transfer: ~$5-20/month

Total: ~$50-100/month in practice. Note that the line items above assume daily usage; with pause automation and only occasional Glue job runs, the monthly bill lands well below the sum of those estimates.

Tips to Minimize Costs

  • Run ./scripts/stop.sh when not actively developing
  • Use enable_glue_jobs = false in tfvars during setup
  • Consider using Glue Dev Endpoints for testing
  • Set up budget alerts in AWS Cost Explorer

🛠️ Technologies Used

Category       Technologies
Languages      Python 3.11, SQL (Redshift), PySpark
Cloud          AWS (S3, Glue, Redshift, Lambda, CloudWatch)
IaC            Terraform 1.5+
Testing        pytest, moto
Code Quality   black, flake8, mypy

📊 Data Architecture

Data Lake (S3)

Three-zone architecture:

  • Raw Zone: Landing zone for ingested data (7-day retention)
  • Stage Zone: Cleansed and validated data (30-day retention)
  • Curated Zone: Analytics-ready datasets (permanent storage)

Data Warehouse (Redshift)

Star schema design:

  • Fact Tables: fact_transactions, fact_clickstream
  • Dimension Tables: dim_customers, dim_products, dim_time, dim_location

Optimization strategies:

  • Distribution keys for co-located joins
  • Sort keys on frequently filtered columns
  • Materialized views for common aggregations
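As an illustration of the distribution- and sort-key strategy, the DDL for a fact table might look like the following, rendered here as a string from Python (the table and column names are hypothetical, not the project's actual schema):

```python
def fact_transactions_ddl() -> str:
    """Render Redshift DDL: DISTKEY on customer_key co-locates rows with
    a dim_customers table distributed on the same key, and SORTKEY on
    time_key speeds up the common date-range filters."""
    return """
CREATE TABLE IF NOT EXISTS fact_transactions (
    transaction_id BIGINT NOT NULL,
    customer_key   BIGINT NOT NULL,
    product_key    BIGINT NOT NULL,
    time_key       INTEGER NOT NULL,
    amount         DECIMAL(12, 2)
)
DISTSTYLE KEY
DISTKEY (customer_key)
SORTKEY (time_key);
""".strip()
```

Choosing the join column as the distribution key avoids network shuffles during fact-to-dimension joins, which is typically the dominant cost in star-schema queries.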

📚 Documentation

🧪 Testing

Run tests:

# Activate virtual environment
source venv/bin/activate

# Run all tests
pytest

# Run with coverage
pytest --cov=python --cov=etl --cov-report=html

# Run specific test suite
pytest tests/unit/
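A unit test under tests/unit/ might exercise the data-quality framework along these lines (the validator shown is a stand-in sketch, not the project's actual API):

```python
def check_not_null(rows: list[dict], column: str) -> list[int]:
    """Return the indices of rows where the column is absent or null."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


def test_check_not_null_flags_missing_values():
    rows = [{"id": 1}, {"id": None}, {}]
    assert check_not_null(rows, "id") == [1, 2]
```

Keeping checks as pure functions over plain records makes them trivially unit-testable, while AWS-facing code paths can be covered separately with moto.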

🤝 Contributing

This is a portfolio project, but feedback and suggestions are welcome!

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/improvement)
  3. Commit your changes (git commit -m 'Add improvement')
  4. Push to the branch (git push origin feature/improvement)
  5. Open a Pull Request

📝 License

Copyright 2024 Omar Miranda

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this project except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

See the LICENSE file for the full license text.

👤 Author

Omar Miranda

🙏 Acknowledgments

  • AWS Documentation
  • Terraform Registry
  • Python Data Engineering Community

📌 Project Status

Current Phase: Foundation Complete ✅

Completed:

  • ✅ Project structure and configuration
  • ✅ Terraform infrastructure modules
  • ✅ Shell management scripts
  • ✅ Python ETL framework base classes
  • ✅ Environment configuration

In Progress:

  • 🔄 Data generator implementations
  • 🔄 Glue ETL jobs
  • 🔄 Redshift SQL scripts

Upcoming:

  • ⏳ Unit and integration tests
  • ⏳ Power BI dashboards
  • ⏳ CI/CD pipeline
  • ⏳ Documentation and architecture diagrams
