Production-style ETL pipeline that extracts from Postgres and CSV sources, stores dated artifacts, loads curated tables into a final Postgres database, and materializes analysis-ready outputs.
Make ingestion and load steps reproducible, rerunnable by execution date, and safe through explicit quality gates before loading final tables.
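As a rough illustration of "rerunnable by execution date" and "quality gates", the sketch below keys each artifact on its execution date, overwrites it on rerun, and refuses to pass an empty or incomplete file. It is not taken from dags/data_engineering_pipeline.py: the function names, the `<base>/<date>/<table>.csv` layout, and the column names in the usage note are all assumptions made for the example.

```python
import csv
from datetime import date
from pathlib import Path


def artifact_path(base_dir: Path, table: str, execution_date: date) -> Path:
    """Return a deterministic path keyed by table name and execution date (assumed layout)."""
    return base_dir / execution_date.isoformat() / f"{table}.csv"


def write_artifact(path: Path, header: list[str], rows: list[list[str]]) -> None:
    """Overwrite the artifact so a rerun for the same date replaces the previous output."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


def quality_gate(path: Path, required_columns: list[str]) -> int:
    """Raise ValueError if a required column is missing or the file has no data rows.

    Returns the number of data rows when the artifact passes.
    """
    with path.open(newline="") as handle:
        reader = csv.DictReader(handle)
        present = reader.fieldnames or []
        missing = [col for col in required_columns if col not in present]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        row_count = sum(1 for _ in reader)
    if row_count == 0:
        raise ValueError(f"no rows in {path}")
    return row_count
```

For example, writing a hypothetical `orders` table twice for the same date leaves a single file at `<base>/2024-01-02/orders.csv`, and `quality_gate(path, ["order_id"])` must succeed before a load step would be allowed to run.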
- Apache Airflow
- PostgreSQL
- Docker / Docker Compose
- Python (Pandas, SQLAlchemy, Psycopg2)
- dags/data_engineering_pipeline.py: DAG definitions and ETL logic
- data/: source SQL/CSV and generated output folders
- docker-compose.yml: local platform (Airflow + Postgres + Redis)
- main.ipynb: notebook execution and inspection
- utils/health_check.py: database container health check
- Create env files:
cp .env.example .env
cp dags/.env.example dags/.env
- Start services:
make start
- Open Airflow at http://localhost:8080 (airflow/airflow).
- Trigger DAGs in order: data_extraction_and_local_storage, then data_loading_to_final_database.
- Stop services:
make stop
Local checks:
python -m compileall dags utils
docker compose -f docker-compose.yml config > /dev/null
python utils/validate_data_contracts.py \
--contract contracts/order_details.contract.json \
--csv data/order_details.csv
CI (.github/workflows/ci.yml) validates:
- Python syntax for DAG/util modules
- Docker Compose configuration with env templates
- Required environment template files
- Fixture data validation against contract rules
- Two-step DAG flow is date-parameterized and rerunnable.
- Extraction output is validated before final DB load.
- Load DAG runs table-level quality checks before finishing.
- Notebook-first exploration remains part of the workflow.
- Data quality rules are still minimal and table-specific.
- No performance benchmarking for larger source volumes yet.
- Expand data contracts (schema/value constraints per table).
- Add fixture-based integration tests for ETL paths.
- Publish a sample dashboard/queries over the final dataset.
- Contract definitions live in contracts/.
- Current contract coverage includes order_details via contracts/order_details.contract.json.
- Detailed notes: docs/data-contracts.md.
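To give a feel for what a contract check involves, here is a minimal sketch of validating CSV rows against per-column rules. The contract shape (`columns` mapping to `type` and `required`), the column names, and the `validate_rows` helper are hypothetical; the actual format used by contracts/order_details.contract.json and utils/validate_data_contracts.py is described in docs/data-contracts.md and may differ.

```python
import csv
import io
import json

# Casts used to test whether a raw CSV string fits the declared type.
CASTS = {"int": int, "float": float, "str": str}


def validate_rows(contract: dict, rows: list[dict]) -> list[str]:
    """Return human-readable violations; an empty list means the rows pass."""
    errors = []
    for line_no, row in enumerate(rows, start=2):  # line 1 is the CSV header
        for name, rule in contract["columns"].items():
            value = row.get(name)
            if value is None or value == "":
                if rule.get("required", False):
                    errors.append(f"line {line_no}: {name} is required")
                continue
            try:
                CASTS[rule["type"]](value)
            except ValueError:
                errors.append(f"line {line_no}: {name}={value!r} is not {rule['type']}")
    return errors


# Hypothetical contract and data, inlined so the sketch runs on its own.
contract = json.loads(
    '{"columns": {"order_id": {"type": "int", "required": true},'
    ' "quantity": {"type": "int", "required": true}}}'
)
sample_csv = "order_id,quantity\n1,3\n2,abc\n,4\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))
violations = validate_rows(contract, rows)
```

Collecting every violation instead of stopping at the first one keeps a failing CI run informative: one pass reports all the rows that need attention.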
See CONTRIBUTING.md.