A Data Engineering project using Databricks | Medallion Architecture | Lakeflow Spark Declarative Pipelines
- Project Overview
- Problem Statement
- Technology Stack
- Project Architecture
- Key Features
- Project Structure
- How to Run
- Gold Layer Tables
- Dashboard & Analytics
- Exporting the Dashboard
- Project Outcomes
RouteFlow is a data engineering project built for Good Cap, a fast-growing cab service company operating across multiple cities and regions. The project addresses a critical operational challenge: regional managers were not receiving timely, region-specific data to run daily operations.
The solution adopts a declarative pipeline approach using Databricks Lakeflow Spark Declarative Pipelines to replace slow, manually orchestrated Spark jobs โ delivering faster regional insights, eliminating manual rework, and restoring leadership confidence in the data platform.
Good Cap's existing platform relied on tightly coupled, procedural Spark pipelines with manual orchestration. This caused three critical failures:
- ๐ Regional managers received generic dashboards that were difficult to act on
- ๐ Teams were forced to manually export and rework data for their specific regions
- ๐ซ Platform innovation had stalled, eroding leadership trust in the data team
The data team was challenged to run a focused pilot proving that a new declarative approach could reduce manual effort, accelerate regional delivery, and demonstrate platform innovation.
| Component | Technology | Purpose |
|---|---|---|
| Data Platform | Databricks Free Edition | Core processing, notebooks, pipelines |
| Pipeline Framework | Lakeflow Spark Declarative Pipelines | Declarative Bronze โ Silver โ Gold |
| Storage Layer | Databricks Volumes (DBFS) | Replaces AWS S3 for file ingestion |
| Data Governance | Unity Catalog | Catalog, schema, and access management |
| File Format | Delta Lake | ACID transactions, time travel |
| Ingestion | Auto Loader (cloudFiles) | Incremental CSV file loading |
| Version Control | GitHub + Databricks Repos | Notebook and pipeline versioning |
| BI & Reporting | Databricks Dashboards + Genie | Regional analytics and alerting |
| Orchestration | Databricks Jobs & Workflows | Scheduled pipeline execution |
Trip Data (Simulated CSV Files)
โ
โผ
Databricks Volumes (routeflow_raw_data)
โ
โผ
Auto Loader (cloudFiles)
โ
โผ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Databricks Workspace โ
โ โ
โ Bronze โโโบ Silver โโโบ Gold โ
โ โ
โ Unity Catalog + Delta Lake โ
โ Lakeflow Spark Declarative Pipelines โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โผ
BI Dashboards + Genie + Alerts
| Layer | Tables | Description |
|---|---|---|
| ๐ฅ Bronze | raw_trips, raw_cities, raw_calendar | Raw ingestion with Auto Loader, schema rescue, no transformations |
| ๐ฅ Silver | trips, cities, calendar | Cleaned data, type casting, date keys, deduplication, SCD logic |
| ๐ฅ Gold | fact_trips, fact_trips_<city> | BI-ready aggregations, region-partitioned fact tables for dashboards |
routeflow_catalog
โโโ routeflow_schema
โโโ routeflow_raw_data (Volume - stores CSV files)
โโโ bronze (Schema - raw Delta tables)
โโโ silver (Schema - cleaned Delta tables)
โโโ gold (Schema - BI-ready fact tables)
- Pipelines declare what to do, not how to do it
- Spark handles execution planning, dependency management, and incremental processing
- Eliminates manual orchestration from the previous procedural approach
df = (
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "csv")
.option("cloudFiles.inferColumnTypes", "true")
.option("cloudFiles.schemaEvolutionMode", "rescue")
.option("cloudFiles.maxFilesPerTrigger", 100)
.option("cloudFiles.schemaLocation", SCHEMA_PATH)
.load(SOURCE_PATH)
)- Reads CSV files directly from Databricks Volumes
- Automatically infers column types and handles schema evolution
- Supports incremental loading โ only processes new files on each run
- Individual fact tables generated per city
- Regional managers get fast, pre-aggregated data without manual exports
- Directly connected to Databricks BI Dashboards
- End-to-end pipeline scheduled via Databricks Jobs
- BI Dashboards built natively inside the Databricks workspace
- Alerts configured to trigger notifications on pipeline completion or data thresholds
routeflow_transportation_pipeline/
โโโ transformations/
โ โโโ bronze/ # Auto Loader ingestion notebooks
โ โ โโโ city.py
โ โ โโโ trips.py
โ โโโ silver/
โ โ โโโ cities.py # City table transformations
โ โ โโโ calender.py # Calendar table with date keys
โ โ โโโ trips.py # Trips table cleaning & SCD
โ โโโ gold/
โ โโโ fact_trips.py # BI-ready aggregations per region
โ โโโ fact_....py
โ โโโ fact_....py
โโโ project_setup.py
-- Run in Databricks notebook
CREATE CATALOG transportation;
CREATE SCHEMA transportation.gold;
CREATE VOLUME transportation.gold.raw_data;- Go to Data โ Volumes โ transportation.gold.raw_data
- Upload your CSV files:
trips.csv,cities.csv,calendar.csv
- Go to Repos in Databricks workspace
- Add your GitHub repository URL
- Pull the project notebooks into your workspace
SOURCE_PATH = "/Volumes/transportation.gold.raw_data/"
SCHEMA_PATH = "/Volumes/transportation.gold.raw_data/_schema"- Open the Lakeflow Declarative Pipeline configuration
- Run Bronze โ Silver โ Gold in sequence
- Verify tables appear under
transportation โ gold
- Open Dashboards in the Databricks workspace
- Connect to Gold layer fact tables
- Configure alerts to trigger notifications on data thresholds
Region-specific fact tables generated in the Gold layer:
| Table | Region |
|---|---|
fact_trips |
All regions combined |
fact_trips_chandigarh |
Chandigarh |
fact_trips_coimbatore |
Coimbatore |
fact_trips_indore |
Indore |
fact_trips_jaipur |
Jaipur |
fact_trips_kochi |
Kochi |
fact_trips_lucknow |
Lucknow |
fact_trips_mysore |
Mysore |
fact_trips_surat |
Surat |
fact_trips_vadodara |
Vadodara |
fact_trips_visakhapatnam |
Visakhapatnam |
The RouteFlow dashboard is built natively inside Databricks Dashboards, connected directly to the Gold layer fact tables. It gives regional managers a real-time, city-specific view of trip analytics โ no manual exports needed.
All Gold layer fact tables share the following schema:
| Column | Description |
|---|---|
id |
Unique trip identifier |
business_date |
Date of the trip |
city_id |
Unique city identifier |
city_name |
Name of the city |
passenger_category |
Category of the passenger |
distance_kms |
Trip distance in kilometres |
sales_amt |
Fare / revenue generated for the trip |
passenger_rating |
Rating given by the passenger (1โ5) |
driver_rating |
Rating given to the driver (1โ5) |
month |
Month number |
day_of_month |
Day of the month |
day_of_week |
Day of the week |
month_name |
Full month name |
month_year |
Month and year combined |
quarter |
Quarter number (1โ4) |
quarter_year |
Quarter and year combined |
week_of_year |
Week number of the year |
is_weekday |
Boolean โ true if weekday |
is_weekend |
Boolean โ true if weekend |
national_holiday |
Boolean โ true if national holiday |
A single all_cities_trips dataset powers every widget on the dashboard. It combines all city-specific Gold tables via a UNION ALL query:
SELECT 'Chandigarh' AS city, * FROM transportation.gold.fact_trips_chandigarh
UNION ALL
SELECT 'Coimbatore' AS city, * FROM transportation.gold.fact_trips_coimbatore
UNION ALL
SELECT 'Indore' AS city, * FROM transportation.gold.fact_trips_indore
UNION ALL
SELECT 'Jaipur' AS city, * FROM transportation.gold.fact_trips_jaipur
UNION ALL
SELECT 'Kochi' AS city, * FROM transportation.gold.fact_trips_kochi
UNION ALL
SELECT 'Lucknow' AS city, * FROM transportation.gold.fact_trips_lucknow
UNION ALL
SELECT 'Mysore' AS city, * FROM transportation.gold.fact_trips_mysore
UNION ALL
SELECT 'Surat' AS city, * FROM transportation.gold.fact_trips_surat
UNION ALL
SELECT 'Vadodara' AS city, * FROM transportation.gold.fact_trips_vadodara
UNION ALL
SELECT 'Visakhapatnam' AS city, * FROM transportation.gold.fact_trips_visakhapatnamA city dropdown filter is applied on top โ each regional manager sees only their city's data from this one shared dataset.
| Widget | Type | Metric |
|---|---|---|
| Total Trips | Counter | COUNT(id) |
| Total Revenue | Counter | SUM(sales_amt) |
| Avg Passenger Rating | Counter | AVG(passenger_rating) |
| Avg Distance (km) | Counter | AVG(distance_kms) |
| Revenue by Month | Bar Chart | SUM(sales_amt) grouped by month_name |
| Daily Trip Trend | Line Chart | COUNT(id) grouped by business_date |
| Weekday vs Weekend Performance | Bar Chart | is_weekday / is_weekend flag |
| Revenue by Passenger Category | Bar Chart | SUM(sales_amt) grouped by passenger_category |
| Revenue Heatmap | Heatmap | day_of_week ร month_name |
| Quarterly Revenue Trend | Line Chart | SUM(sales_amt) grouped by quarter_year |
| Trip Detail Log | Table | Full row-level detail with city filter |
Since RouteFlow runs on Databricks Free Edition, the dashboard can be exported as a JSON definition file for backup and portability.
- Go to Dashboards in the Databricks workspace
- Find the RouteFlow dashboard
- Click โฎ (three dots) โ Export
- A
.lvdash.jsonfile is downloaded โ this is the full dashboard definition
The exported file is saved in the /dashboard folder in this repository.
To restore or share the dashboard in another Databricks workspace:
- Go to Dashboards โ Import
- Upload the
.lvdash.jsonfile - The full dashboard is restored with all widgets and queries intact
To download the raw data behind any chart:
- Click on any widget in the dashboard
- Click the โฎ menu on the widget โ Download data
- Choose CSV or Excel
| Success Criteria | Result |
|---|---|
| Regional managers get faster insights | โ Region-partitioned Gold tables per city |
| Manual data rework eliminated | โ Automated pipeline, no exports needed |
| Declarative approach demonstrated | โ Lakeflow SDP replaces procedural Spark |
| Platform innovation shown to leadership | โ Dashboards + Genie + Automated alerts |
| Zero cloud cost solution | โ 100% on Databricks Free Edition |
| City-level dashboard delivered | โ 10 regional views from one unified dataset |
| Dashboard portability | โ
Exported as .lvdash.json for backup & sharing |
Project: RouteFlow โ Regional Data Pipeline for Transportation Analytics
Domain: Transportation / Cab Services
Platform: Databricks Free Edition (No cloud account required)
Architecture: Medallion (Bronze โ Silver โ Gold) with Lakeflow Spark Declarative Pipelines
Built as a learning and portfolio project to demonstrate real-world data engineering skills using modern Databricks tooling.


