A production-grade data engineering platform demonstrating end-to-end ETL/ELT pipelines on AWS, featuring advanced Python scripting, complex SQL analytics, and comprehensive data architecture.
This project showcases a complete data platform that:
- Ingests data from multiple sources into a data lake
- Transforms data through scalable ETL pipelines
- Loads analytics-ready datasets into a data warehouse
- Demonstrates advanced Python, SQL, and AWS skills
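The ingest → transform → load flow above can be sketched as a small pipeline base class. This is a hypothetical simplification for illustration, not the actual `etl_framework` API (class and method names are assumptions):

```python
from abc import ABC, abstractmethod
from typing import Any


class Pipeline(ABC):
    """Minimal ETL pipeline contract: extract -> transform -> load."""

    @abstractmethod
    def extract(self) -> Any: ...

    @abstractmethod
    def transform(self, data: Any) -> Any: ...

    @abstractmethod
    def load(self, data: Any) -> None: ...

    def run(self) -> None:
        # The template method ties the three stages together.
        self.load(self.transform(self.extract()))


class WordCountPipeline(Pipeline):
    """Toy concrete pipeline: count words from an in-memory source."""

    def __init__(self, source: str):
        self.source = source
        self.sink: dict[str, int] = {}

    def extract(self) -> list[str]:
        return self.source.split()

    def transform(self, words: list[str]) -> dict[str, int]:
        counts: dict[str, int] = {}
        for w in words:
            counts[w] = counts.get(w, 0) + 1
        return counts

    def load(self, counts: dict[str, int]) -> None:
        self.sink = counts


pipe = WordCountPipeline("raw stage curated raw")
pipe.run()
print(pipe.sink)  # {'raw': 2, 'stage': 1, 'curated': 1}
```

A real subclass would read from S3 in `extract` and write to Redshift in `load`; the `run` template method stays the same.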
Skills Demonstrated:
- ✅ Advanced Python (ETL frameworks, data generators, AWS automation)
- ✅ Complex SQL (window functions, CTEs, star schema design)
- ✅ AWS Services (S3, Glue, Redshift, Lambda, CloudWatch)
- ✅ Infrastructure as Code (Terraform)
- ✅ Data Architecture (Data Lake + Data Warehouse)
```
┌─────────────────┐
│  Data Sources   │
│   (Synthetic)   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Data Lake    │
│   (Amazon S3)   │
│  Raw → Stage →  │
│     Curated     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Processing    │
│   (AWS Glue)    │
│   PySpark ETL   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Data Warehouse  │
│ (Amazon Redshift│
│  Star Schema)   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Analytics &   │
│    BI Tools     │
└─────────────────┘
```
```
├── scripts/                  # Management scripts
│   ├── install.sh            # Install dependencies
│   ├── deploy.sh             # Deploy infrastructure
│   ├── start.sh              # Start resources (resume Redshift)
│   ├── stop.sh               # Stop resources (pause Redshift)
│   ├── status.sh             # Check resource status
│   ├── run_pipeline.sh       # Execute data pipeline
│   ├── load_sample_data.sh   # Load sample data
│   └── destroy.sh            # Destroy infrastructure
├── terraform/                # Infrastructure as Code
│   ├── modules/              # Terraform modules
│   │   ├── s3/               # Data lake buckets
│   │   ├── iam/              # IAM roles & policies
│   │   ├── glue/             # Glue catalog & jobs
│   │   ├── redshift/         # Redshift cluster
│   │   └── monitoring/       # CloudWatch dashboards
│   └── environments/         # Environment configs
├── python/                   # Python modules
│   ├── etl_framework/        # Reusable ETL base classes
│   ├── data_generators/      # Synthetic data generators
│   ├── data_quality/         # Validation framework
│   ├── transformations/      # Business logic
│   ├── aws_utils/            # AWS helper functions
│   └── monitoring/           # Metrics & alerting
├── etl/                      # ETL jobs
│   ├── glue_jobs/            # PySpark jobs for Glue
│   └── lambda_functions/     # Lambda functions
├── sql/                      # SQL scripts
│   ├── ddl/                  # Schema definitions
│   ├── dml/                  # Data manipulation
│   ├── analytics/            # Analytical queries
│   └── data_quality/         # Data quality checks
└── tests/                    # Test suite
    ├── unit/                 # Unit tests
    └── integration/          # Integration tests
```
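The `data_quality` validation framework can be illustrated with a couple of table-level checks. These function names are hypothetical, not the project's actual API:

```python
def check_not_null(rows, column):
    """Return the indices of rows where `column` is missing or None."""
    return [i for i, r in enumerate(rows) if r.get(column) is None]


def check_unique(rows, column):
    """Return the values of `column` that appear more than once."""
    seen, dupes = set(), set()
    for r in rows:
        v = r.get(column)
        if v in seen:
            dupes.add(v)
        seen.add(v)
    return sorted(dupes)


rows = [{"id": 1}, {"id": 1}, {"id": None}]
print(check_not_null(rows, "id"))  # [2]
print(check_unique(rows, "id"))    # [1]
```

Checks like these typically run between the raw and stage zones, so bad records never reach the warehouse.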
- Python 3.11+
- Terraform 1.5+
- AWS CLI configured
- AWS Account with appropriate permissions
- Clone the repository:

```bash
git clone https://github.com/omaranda/production-grade-data-pipeline-demo.git
cd production-grade-data-pipeline-demo
```

- Run the installation script:

```bash
./scripts/install.sh
```

This will:
- Check dependencies
- Create Python virtual environment
- Install Python packages
- Validate AWS credentials
- Create a `.env` configuration file
- Configure environment:

Edit the `.env` file with your settings:

```
AWS_REGION=us-east-1
PROJECT_NAME=data-engineering-portfolio
ENVIRONMENT=dev
REDSHIFT_MASTER_PASSWORD=YourSecurePassword123!
```

- Deploy infrastructure:

```bash
./scripts/deploy.sh dev
```

This will:
- Initialize Terraform
- Create S3 buckets (raw, stage, curated)
- Provision Redshift cluster
- Set up Glue catalog and jobs
- Configure IAM roles
- Create CloudWatch dashboards
Deployment takes approximately 5-10 minutes.
- Check status:

```bash
./scripts/status.sh
```

- Load sample data:

```bash
./scripts/load_sample_data.sh
```

This generates synthetic data:
- 10,000 customers
- 500 products
- 50,000 transactions
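The transaction generator can be sketched with the standard library alone. This is an illustrative stand-in for the `data_generators` package (field names and ranges are assumptions); seeding the RNG makes the dataset reproducible across runs:

```python
import random
from datetime import datetime, timedelta


def generate_transactions(n, n_customers=10_000, n_products=500, seed=42):
    """Yield `n` synthetic transaction records with a fixed seed for reproducibility."""
    rng = random.Random(seed)
    start = datetime(2024, 1, 1)
    for i in range(n):
        yield {
            "transaction_id": i,
            "customer_id": rng.randrange(n_customers),
            "product_id": rng.randrange(n_products),
            "quantity": rng.randint(1, 5),
            "unit_price": round(rng.uniform(1.0, 500.0), 2),
            # Random timestamp within one calendar year of the start date.
            "ts": (start + timedelta(seconds=rng.randrange(365 * 86400))).isoformat(),
        }


rows = list(generate_transactions(3))
print(len(rows))  # 3
```

Writing the records as newline-delimited JSON to the raw zone keeps them directly crawlable by Glue.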
- Execute pipeline:

```bash
./scripts/run_pipeline.sh
```

This runs the complete ETL pipeline:
- Uploads data to S3 raw zone
- Triggers Glue job (raw → stage)
- Triggers Glue job (stage → curated)
- Loads data into Redshift
- Runs sample analytical queries
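Orchestrating the Glue steps above boils down to starting each job and polling until it finishes. A minimal sketch, using the real boto3 Glue calls `start_job_run` and `get_job_run` but with the client injected so it can be tested without AWS (the job names here are hypothetical):

```python
import time


def run_glue_job(glue, job_name, poll_seconds=15):
    """Start a Glue job and block until it reaches a terminal state.

    `glue` is a boto3 Glue client, passed in so a stub can be
    substituted in tests. Returns the final JobRunState.
    """
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        state = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
        if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
            return state
        time.sleep(poll_seconds)


def run_pipeline(glue, jobs=("raw_to_stage", "stage_to_curated"), poll_seconds=15):
    """Run the zone-to-zone jobs in order, stopping at the first failure."""
    for job in jobs:
        if run_glue_job(glue, job, poll_seconds=poll_seconds) != "SUCCEEDED":
            return False
    return True
```

Running the jobs sequentially matters here: the stage-to-curated job reads the output that raw-to-stage has just written.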
Stop resources when not in use:

```bash
./scripts/stop.sh
```

This pauses the Redshift cluster, saving ~$6/day.

Resume resources:

```bash
./scripts/start.sh
```

This resumes the Redshift cluster (takes 2-3 minutes).
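Behind these scripts, pausing and resuming map onto the real boto3 Redshift calls `pause_cluster`, `resume_cluster`, and `describe_clusters`. A sketch with the client injected for testability (the state-checking logic is an assumption about how the scripts behave, not their actual code):

```python
def set_cluster_state(redshift, cluster_id, running):
    """Pause or resume a Redshift cluster, skipping calls that would be no-ops.

    `redshift` is a boto3 Redshift client, injected so a stub can stand in
    during tests. Returns the action taken or the current status.
    """
    status = redshift.describe_clusters(ClusterIdentifier=cluster_id)[
        "Clusters"
    ][0]["ClusterStatus"]
    if running and status == "paused":
        redshift.resume_cluster(ClusterIdentifier=cluster_id)
        return "resuming"
    if not running and status == "available":
        redshift.pause_cluster(ClusterIdentifier=cluster_id)
        return "pausing"
    # Already in (or transitioning to) the requested state.
    return status
```

Checking the current status first makes the scripts idempotent: running `stop.sh` twice does not raise an error on the second call.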
Estimated monthly costs if resources run on the schedule shown:

- Redshift (paused 16 hrs/day): ~$60/month
- S3 Storage (100 GB): ~$2.30/month
- Glue (10 DPU-hours/day): ~$140/month
- Lambda (1M invocations): ~$0.20/month
- Data Transfer: ~$5-20/month

Total: ~$50-100/month with proper pause automation (the Glue figure above assumes daily runs; triggering jobs only on demand brings the total down substantially)
- Run `./scripts/stop.sh` when not actively developing
- Set `enable_glue_jobs = false` in tfvars during setup
- Consider using Glue Dev Endpoints for testing
- Set up budget alerts in AWS Budgets
| Category | Technologies |
|---|---|
| Languages | Python 3.11, SQL (Redshift), PySpark |
| Cloud | AWS (S3, Glue, Redshift, Lambda, CloudWatch) |
| IaC | Terraform 1.5+ |
| Testing | pytest, moto |
| Code Quality | black, flake8, mypy |
Three-zone architecture:
- Raw Zone: Landing zone for ingested data (7-day retention)
- Stage Zone: Cleansed and validated data (30-day retention)
- Curated Zone: Analytics-ready datasets (permanent storage)
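The three zones map naturally onto S3 key prefixes. A sketch of one possible key convention (the layout is an assumption, though Hive-style `key=value` partitions are what Glue crawlers recognize for automatic partition discovery):

```python
from datetime import date

ZONES = {"raw", "stage", "curated"}


def zone_key(zone, dataset, run_date, filename):
    """Build an S3 object key like raw/transactions/year=2024/month=01/day=15/part-0.json."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return (
        f"{zone}/{dataset}/year={run_date:%Y}/month={run_date:%m}/"
        f"day={run_date:%d}/{filename}"
    )


print(zone_key("raw", "transactions", date(2024, 1, 15), "part-0.json"))
# raw/transactions/year=2024/month=01/day=15/part-0.json
```

Keeping the zone as the top-level prefix also lets each retention policy above be expressed as a single S3 lifecycle rule per prefix.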
Star schema design:
- Fact Tables: `fact_transactions`, `fact_clickstream`
- Dimension Tables: `dim_customers`, `dim_products`, `dim_time`, `dim_location`
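A date dimension like `dim_time` is typically pre-populated rather than derived from facts. A minimal generator sketch (the column set is illustrative, not the project's actual schema):

```python
from datetime import date, timedelta


def build_dim_time(start, end):
    """Generate one dim_time row per calendar day in [start, end]."""
    rows, d = [], start
    while d <= end:
        rows.append({
            "date_key": int(d.strftime("%Y%m%d")),  # surrogate key, e.g. 20240115
            "date": d.isoformat(),
            "year": d.year,
            "quarter": (d.month - 1) // 3 + 1,
            "month": d.month,
            "day_of_week": d.isoweekday(),          # 1 = Monday ... 7 = Sunday
            "is_weekend": d.isoweekday() >= 6,
        })
        d += timedelta(days=1)
    return rows


rows = build_dim_time(date(2024, 1, 1), date(2024, 1, 7))
print(len(rows))  # 7
```

The integer `date_key` doubles as a natural sort key, which fits the sort-key strategy described below.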
Optimization strategies:
- Distribution keys for co-located joins
- Sort keys on frequently filtered columns
- Materialized views for common aggregations
- Architecture Diagrams: Detailed architecture documentation (coming soon)
- Setup Guide: Step-by-step setup instructions (coming soon)
Run tests:

```bash
# Activate virtual environment
source venv/bin/activate

# Run all tests
pytest

# Run with coverage
pytest --cov=python --cov=etl --cov-report=html

# Run a specific test suite
pytest tests/unit/
```

This is a portfolio project, but feedback and suggestions are welcome!
- Fork the repository
- Create a feature branch (`git checkout -b feature/improvement`)
- Commit your changes (`git commit -m 'Add improvement'`)
- Push to the branch (`git push origin feature/improvement`)
- Open a Pull Request
Copyright 2024 Omar Miranda
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this project except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
See the LICENSE file for the full license text.
Omar Miranda
- Portfolio: github.com/omaranda
- LinkedIn: linkedin.com/in/omar-miranda-aranda
- AWS Documentation
- Terraform Registry
- Python Data Engineering Community
Current Phase: Foundation Complete ✅
Completed:
- ✅ Project structure and configuration
- ✅ Terraform infrastructure modules
- ✅ Shell management scripts
- ✅ Python ETL framework base classes
- ✅ Environment configuration
In Progress:
- 🔄 Data generator implementations
- 🔄 Glue ETL jobs
- 🔄 Redshift SQL scripts
Upcoming:
- ⏳ Unit and integration tests
- ⏳ Power BI dashboards
- ⏳ CI/CD pipeline
- ⏳ Documentation and architecture diagrams