This repository contains a beginner-friendly Databricks training project designed to teach fundamental data engineering concepts using PySpark and Delta Lake.
You are working as a data engineer at an e-commerce company.
The company already stores order data in a Delta table, but customer information is delivered as a new CSV file. Your mission:
- Ingest the customer dataset
- Join it with the existing Delta orders
- Overwrite the original Delta table with enriched data
- Perform analytical queries
- Explore Delta Lake versioning and rollback features
- `orders.csv` – Main dataset (~50k rows of order data)
- `customers.csv` – Supplementary dataset (~10k rows of customer info)
- `DatabricksPractice.ipynb` – Full guided notebook with instructions, code, and exercises
By the end of this project, learners will be able to:
- Read/write Delta tables using PySpark
- Load CSV files from DBFS
- Perform `join` operations between datasets
- Run aggregations (`groupBy`, `count`, `sum`, etc.)
- Use Delta Lake's time travel and rollback features
- Practice building real-world data pipelines
- Find the most popular product category
- Identify top spending customers
- Extract temporal insights (e.g., peak order month)
- Create new columns using conditional logic
- Count high-value orders per region