Skip to content

Pratik5767/RouteFlow

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

14 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

๐Ÿš– RouteFlow

Regional Data Pipeline for Transportation Analytics

A Data Engineering project using Databricks | Medallion Architecture | Lakeflow Spark Declarative Pipelines


๐Ÿ“Œ Table of Contents


๐Ÿ“– Project Overview

RouteFlow is a data engineering project built for Good Cap, a fast-growing cab service company operating across multiple cities and regions. The project addresses a critical operational challenge: regional managers were not receiving timely, region-specific data to run daily operations.

The solution adopts a declarative pipeline approach using Databricks Lakeflow Spark Declarative Pipelines to replace slow, manually orchestrated Spark jobs โ€” delivering faster regional insights, eliminating manual rework, and restoring leadership confidence in the data platform.


โ— Problem Statement

Good Cap's existing platform relied on tightly coupled, procedural Spark pipelines with manual orchestration. This caused three critical failures:

  • ๐Ÿ“Š Regional managers received generic dashboards that were difficult to act on
  • ๐Ÿ”„ Teams were forced to manually export and rework data for their specific regions
  • ๐Ÿšซ Platform innovation had stalled, eroding leadership trust in the data team

The data team was challenged to run a focused pilot proving that a new declarative approach could reduce manual effort, accelerate regional delivery, and demonstrate platform innovation.


โš™๏ธ Technology Stack

Component Technology Purpose
Data Platform Databricks Free Edition Core processing, notebooks, pipelines
Pipeline Framework Lakeflow Spark Declarative Pipelines Declarative Bronze โ†’ Silver โ†’ Gold
Storage Layer Databricks Volumes (DBFS) Replaces AWS S3 for file ingestion
Data Governance Unity Catalog Catalog, schema, and access management
File Format Delta Lake ACID transactions, time travel
Ingestion Auto Loader (cloudFiles) Incremental CSV file loading
Version Control GitHub + Databricks Repos Notebook and pipeline versioning
BI & Reporting Databricks Dashboards + Genie Regional analytics and alerting
Orchestration Databricks Jobs & Workflows Scheduled pipeline execution

๐Ÿ—๏ธ Project Architecture

High-Level Data Flow

Trip Data (Simulated CSV Files)
        โ”‚
        โ–ผ
Databricks Volumes (routeflow_raw_data)
        โ”‚
        โ–ผ
Auto Loader (cloudFiles)
        โ”‚
        โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚           Databricks Workspace            โ”‚
โ”‚                                           โ”‚
โ”‚   Bronze โ”€โ”€โ–บ Silver โ”€โ”€โ–บ Gold              โ”‚
โ”‚                                           โ”‚
โ”‚   Unity Catalog + Delta Lake              โ”‚
โ”‚   Lakeflow Spark Declarative Pipelines    โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
        โ”‚
        โ–ผ
BI Dashboards + Genie + Alerts

Medallion Architecture

Layer Tables Description
๐Ÿฅ‰ Bronze raw_trips, raw_cities, raw_calendar Raw ingestion with Auto Loader, schema rescue, no transformations
๐Ÿฅˆ Silver trips, cities, calendar Cleaned data, type casting, date keys, deduplication, SCD logic
๐Ÿฅ‡ Gold fact_trips, fact_trips_<city> BI-ready aggregations, region-partitioned fact tables for dashboards

Unity Catalog Structure

routeflow_catalog
    โ””โ”€โ”€ routeflow_schema
            โ””โ”€โ”€ routeflow_raw_data  (Volume - stores CSV files)
            โ””โ”€โ”€ bronze              (Schema - raw Delta tables)
            โ””โ”€โ”€ silver              (Schema - cleaned Delta tables)
            โ””โ”€โ”€ gold                (Schema - BI-ready fact tables)

Project Flow

project flow

Pipeline Graph

pipeline graph


โœจ Key Features

๐Ÿ”ท Declarative Pipelines (Lakeflow SDP)

  • Pipelines declare what to do, not how to do it
  • Spark handles execution planning, dependency management, and incremental processing
  • Eliminates manual orchestration from the previous procedural approach

๐Ÿ”ท Auto Loader Ingestion

df = (
    spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.inferColumnTypes", "true")
        .option("cloudFiles.schemaEvolutionMode", "rescue")
        .option("cloudFiles.maxFilesPerTrigger", 100)
        .option("cloudFiles.schemaLocation", SCHEMA_PATH)
        .load(SOURCE_PATH)
)
  • Reads CSV files directly from Databricks Volumes
  • Automatically infers column types and handles schema evolution
  • Supports incremental loading โ€” only processes new files on each run

๐Ÿ”ท Region-Partitioned Gold Layer

  • Individual fact tables generated per city
  • Regional managers get fast, pre-aggregated data without manual exports
  • Directly connected to Databricks BI Dashboards

๐Ÿ”ท Jobs, Workflows & Alerts

  • End-to-end pipeline scheduled via Databricks Jobs
  • BI Dashboards built natively inside the Databricks workspace
  • Alerts configured to trigger notifications on pipeline completion or data thresholds

๐Ÿ“ Project Structure

routeflow_transportation_pipeline/
  โ”œโ”€โ”€ transformations/
  โ”‚     โ”œโ”€โ”€ bronze/                     # Auto Loader ingestion notebooks
  โ”‚     โ”‚     โ””โ”€โ”€ city.py
  โ”‚     โ”‚     โ””โ”€โ”€ trips.py
  โ”‚     โ”œโ”€โ”€ silver/
  โ”‚     โ”‚     โ”œโ”€โ”€ cities.py             # City table transformations
  โ”‚     โ”‚     โ”œโ”€โ”€ calender.py           # Calendar table with date keys
  โ”‚     โ”‚     โ””โ”€โ”€ trips.py              # Trips table cleaning & SCD
  โ”‚     โ””โ”€โ”€ gold/
  โ”‚           โ””โ”€โ”€ fact_trips.py         # BI-ready aggregations per region
  โ”‚           โ””โ”€โ”€ fact_....py
  โ”‚           โ””โ”€โ”€ fact_....py
  โ””โ”€โ”€ project_setup.py

๐Ÿš€ How to Run

Step 1: Setup Catalog, Schema & Volume

-- Run in Databricks notebook
CREATE CATALOG transportation;
CREATE SCHEMA transportation.gold;
CREATE VOLUME transportation.gold.raw_data;

Step 2: Upload CSV Files

  • Go to Data โ†’ Volumes โ†’ transportation.gold.raw_data
  • Upload your CSV files: trips.csv, cities.csv, calendar.csv

Step 3: Connect GitHub Repo

  • Go to Repos in Databricks workspace
  • Add your GitHub repository URL
  • Pull the project notebooks into your workspace

Step 4: Configure Source Path

SOURCE_PATH  = "/Volumes/transportation.gold.raw_data/"
SCHEMA_PATH  = "/Volumes/transportation.gold.raw_data/_schema"

Step 5: Run Pipelines

  • Open the Lakeflow Declarative Pipeline configuration
  • Run Bronze โ†’ Silver โ†’ Gold in sequence
  • Verify tables appear under transportation โ†’ gold

Step 6: Launch Dashboards & Alerts

  • Open Dashboards in the Databricks workspace
  • Connect to Gold layer fact tables
  • Configure alerts to trigger notifications on data thresholds

๐Ÿ“Š Gold Layer Tables

Region-specific fact tables generated in the Gold layer:

Table Region
fact_trips All regions combined
fact_trips_chandigarh Chandigarh
fact_trips_coimbatore Coimbatore
fact_trips_indore Indore
fact_trips_jaipur Jaipur
fact_trips_kochi Kochi
fact_trips_lucknow Lucknow
fact_trips_mysore Mysore
fact_trips_surat Surat
fact_trips_vadodara Vadodara
fact_trips_visakhapatnam Visakhapatnam

๐Ÿ“ˆ Dashboard & Analytics

The RouteFlow dashboard is built natively inside Databricks Dashboards, connected directly to the Gold layer fact tables. It gives regional managers a real-time, city-specific view of trip analytics โ€” no manual exports needed.

Gold Layer Schema

All Gold layer fact tables share the following schema:

Column Description
id Unique trip identifier
business_date Date of the trip
city_id Unique city identifier
city_name Name of the city
passenger_category Category of the passenger
distance_kms Trip distance in kilometres
sales_amt Fare / revenue generated for the trip
passenger_rating Rating given by the passenger (1โ€“5)
driver_rating Rating given to the driver (1โ€“5)
month Month number
day_of_month Day of the month
day_of_week Day of the week
month_name Full month name
month_year Month and year combined
quarter Quarter number (1โ€“4)
quarter_year Quarter and year combined
week_of_year Week number of the year
is_weekday Boolean โ€” true if weekday
is_weekend Boolean โ€” true if weekend
national_holiday Boolean โ€” true if national holiday

Unified Dataset (All Cities)

A single all_cities_trips dataset powers every widget on the dashboard. It combines all city-specific Gold tables via a UNION ALL query:

SELECT 'Chandigarh'     AS city, * FROM transportation.gold.fact_trips_chandigarh
UNION ALL
SELECT 'Coimbatore'     AS city, * FROM transportation.gold.fact_trips_coimbatore
UNION ALL
SELECT 'Indore'         AS city, * FROM transportation.gold.fact_trips_indore
UNION ALL
SELECT 'Jaipur'         AS city, * FROM transportation.gold.fact_trips_jaipur
UNION ALL
SELECT 'Kochi'          AS city, * FROM transportation.gold.fact_trips_kochi
UNION ALL
SELECT 'Lucknow'        AS city, * FROM transportation.gold.fact_trips_lucknow
UNION ALL
SELECT 'Mysore'         AS city, * FROM transportation.gold.fact_trips_mysore
UNION ALL
SELECT 'Surat'          AS city, * FROM transportation.gold.fact_trips_surat
UNION ALL
SELECT 'Vadodara'       AS city, * FROM transportation.gold.fact_trips_vadodara
UNION ALL
SELECT 'Visakhapatnam'  AS city, * FROM transportation.gold.fact_trips_visakhapatnam

A city dropdown filter is applied on top โ€” each regional manager sees only their city's data from this one shared dataset.

Dashboard Widgets

Widget Type Metric
Total Trips Counter COUNT(id)
Total Revenue Counter SUM(sales_amt)
Avg Passenger Rating Counter AVG(passenger_rating)
Avg Distance (km) Counter AVG(distance_kms)
Revenue by Month Bar Chart SUM(sales_amt) grouped by month_name
Daily Trip Trend Line Chart COUNT(id) grouped by business_date
Weekday vs Weekend Performance Bar Chart is_weekday / is_weekend flag
Revenue by Passenger Category Bar Chart SUM(sales_amt) grouped by passenger_category
Revenue Heatmap Heatmap day_of_week ร— month_name
Quarterly Revenue Trend Line Chart SUM(sales_amt) grouped by quarter_year
Trip Detail Log Table Full row-level detail with city filter

RouteFlow Dashboard

routeflow dashboard


๐Ÿ’พ Exporting the Dashboard

Since RouteFlow runs on Databricks Free Edition, the dashboard can be exported as a JSON definition file for backup and portability.

Export as JSON

  1. Go to Dashboards in the Databricks workspace
  2. Find the RouteFlow dashboard
  3. Click โ‹ฎ (three dots) โ†’ Export
  4. A .lvdash.json file is downloaded โ€” this is the full dashboard definition

The exported file is saved in the /dashboard folder in this repository.

Re-import the Dashboard

To restore or share the dashboard in another Databricks workspace:

  1. Go to Dashboards โ†’ Import
  2. Upload the .lvdash.json file
  3. The full dashboard is restored with all widgets and queries intact

Export Data per Widget (CSV)

To download the raw data behind any chart:

  1. Click on any widget in the dashboard
  2. Click the โ‹ฎ menu on the widget โ†’ Download data
  3. Choose CSV or Excel

๐ŸŽฏ Project Outcomes

Success Criteria Result
Regional managers get faster insights โœ… Region-partitioned Gold tables per city
Manual data rework eliminated โœ… Automated pipeline, no exports needed
Declarative approach demonstrated โœ… Lakeflow SDP replaces procedural Spark
Platform innovation shown to leadership โœ… Dashboards + Genie + Automated alerts
Zero cloud cost solution โœ… 100% on Databricks Free Edition
City-level dashboard delivered โœ… 10 regional views from one unified dataset
Dashboard portability โœ… Exported as .lvdash.json for backup & sharing

๐Ÿ‘ค Author

Project: RouteFlow โ€” Regional Data Pipeline for Transportation Analytics
Domain: Transportation / Cab Services
Platform: Databricks Free Edition (No cloud account required)
Architecture: Medallion (Bronze โ†’ Silver โ†’ Gold) with Lakeflow Spark Declarative Pipelines


Built as a learning and portfolio project to demonstrate real-world data engineering skills using modern Databricks tooling.

About

RouteFlow : A Databricks project built for Good Cap, a fast-growing cab service company operating across multiple cities, regions and addresses a critical operational challenge: regional managers were not receiving timely, region-specific data to run daily operations

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages