GCP-based data pipeline for NYC TLC taxi trip data: ingestion, transformation, ML (fleet recommender, anomaly detection), and a static dashboard.
```text
saithi thesis code/
├── README.md
├── requirements.txt
├── config/                      # Configuration
│   ├── gcs_cors.json
│   └── gcp/                     # GCP deployment (Composer DAG, deploy.sh)
├── dashboard/                   # Static web dashboard
├── docs/
│   ├── GCP_OPTIMIZATION_PLAN.md
│   ├── LEGACY.md
│   ├── gcp/                     # GCP implementation guide
│   ├── aws/                     # AWS replication guide
│   └── azure/                   # Azure replication guide
├── legacy/                      # Original notebooks (00–04, Gradio)
├── pipeline_utils/              # Shared utilities
│   ├── config.py
│   ├── spark_utils.py
│   ├── bq_utils.py
│   ├── gcs_utils.py
│   ├── schemas.py
│   └── logging_utils.py
└── pipeline/                    # Optimized pipeline scripts
    ├── ingest_tlc.py            # Stage 00
    ├── 01_gcs_to_bronze.py      # Stage 01 (PySpark)
    ├── 02_bronze_to_silver.py   # Stage 02 (PySpark)
    ├── 03_silver_to_preml.py    # Stage 03 (PySpark)
    ├── 04a_fleet_recommender.py
    ├── 04b_anomalies.py
    ├── export_dashboard_impl.py
    ├── 05_ExportDashboardData.py
    └── run_all.py
```
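`pipeline_utils/config.py` centralizes the project settings shared by all stages. A minimal sketch of such a module, assuming a frozen dataclass holds the resources listed in this README (the field names and structure here are illustrative assumptions, not the actual file):

```python
# Hypothetical sketch of centralized pipeline settings; the field names and
# structure are assumptions for illustration, not the actual config.py.
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    project_id: str = "nyctaxi-467111"           # GCP project (see resources table)
    raw_bucket: str = "nyc_raw_data_bucket"      # landing bucket for raw TLC files
    dashboard_bucket: str = "nyc_dashboard_bucket"
    region: str = "us-central1"
    # BigQuery datasets, in medallion order
    bq_datasets: tuple = ("RawBronze", "CleanSilver", "PreMlGold", "PostMlGold")


# Single shared instance imported by the pipeline stages
CONFIG = PipelineConfig()
```

Keeping bucket, project, and dataset names in one frozen object means a stage script never hard-codes a resource name, and swapping environments is a one-file change.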
```shell
# Python stages (00, 04a, 04b, 05)
python pipeline/ingest_tlc.py
python pipeline/04a_fleet_recommender.py
python pipeline/04b_anomalies.py
python pipeline/05_ExportDashboardData.py

# PySpark stages (01, 02, 03) — run on Dataproc
gcloud dataproc jobs submit pyspark pipeline/01_gcs_to_bronze.py --cluster=CLUSTER --region=us-central1
gcloud dataproc jobs submit pyspark pipeline/02_bronze_to_silver.py --cluster=CLUSTER --region=us-central1
gcloud dataproc jobs submit pyspark pipeline/03_silver_to_preml.py --cluster=CLUSTER --region=us-central1
```

To run the legacy version instead, use the notebooks in legacy/ in order: 00 → 01 → 02a–d → 03 → 04a → 04b.
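`pipeline/run_all.py` presumably chains these stages in order. A minimal sketch of such an orchestrator, assuming plain `subprocess` calls and the stage list above (the stage table and command construction are illustrative assumptions, not the actual script):

```python
# Hypothetical orchestrator sketch mirroring run_all.py's role; the stage
# table and command construction are assumptions, not the actual script.
import subprocess

# (stage script, runs on Dataproc?)
STAGES = [
    ("pipeline/ingest_tlc.py", False),            # 00
    ("pipeline/01_gcs_to_bronze.py", True),       # 01 (PySpark)
    ("pipeline/02_bronze_to_silver.py", True),    # 02 (PySpark)
    ("pipeline/03_silver_to_preml.py", True),     # 03 (PySpark)
    ("pipeline/04a_fleet_recommender.py", False), # 04a
    ("pipeline/04b_anomalies.py", False),         # 04b
    ("pipeline/05_ExportDashboardData.py", False),# 05
]


def build_command(script, on_dataproc, cluster="CLUSTER", region="us-central1"):
    """Return the shell command (as an argv list) for one stage."""
    if on_dataproc:
        return ["gcloud", "dataproc", "jobs", "submit", "pyspark", script,
                f"--cluster={cluster}", f"--region={region}"]
    return ["python", script]


def run_all():
    # Stop the whole pipeline on the first failing stage.
    for script, on_dataproc in STAGES:
        subprocess.run(build_command(script, on_dataproc), check=True)


if __name__ == "__main__":
    run_all()
```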
Open dashboard/index.html in a browser. Check "Use demo data" to test without GCS, or configure GCS access and uncheck it to load real data.
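For the dashboard to fetch exported data from GCS directly in the browser, the bucket needs a CORS policy; `config/gcs_cors.json` likely resembles the following sketch (the origin, methods, and headers shown are assumptions, not the actual file):

```json
[
  {
    "origin": ["*"],
    "method": ["GET", "HEAD"],
    "responseHeader": ["Content-Type"],
    "maxAgeSeconds": 3600
  }
]
```

It can be applied with `gsutil cors set config/gcs_cors.json gs://nyc_dashboard_bucket`.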
| Resource | Name |
|---|---|
| Project | nyctaxi-467111 |
| Raw bucket | nyc_raw_data_bucket |
| Dashboard bucket | nyc_dashboard_bucket |
| BigQuery | RawBronze, CleanSilver, PreMlGold, PostMlGold |

| Doc | Description |
|---|---|
| Docs Index | Documentation overview |
| Pipeline README | Optimized pipeline stages, run commands, config |
| GCP Implementation | Full GCP replication steps, scheduling, config |
| GCP Optimization Plan | Architecture, optimization rationale, implementation status |
| Legacy Code | Original notebooks and migration path |
| AWS Replication | AWS setup (S3, EMR, MWAA) |
| Azure Replication | Azure setup (Blob, Databricks, Data Factory) |
Academic / thesis use.