Akshith-github/NYC_taxi_Data_analysis

NYC Taxi Data Pipeline

GCP-based data pipeline for NYC TLC taxi trip data, covering ingestion, transformation, ML (fleet recommender, anomaly detection), and a static dashboard.

Project Structure

saithi thesis code/
├── README.md
├── requirements.txt
├── config/                   # Configuration
│   ├── gcs_cors.json
│   └── gcp/                  # GCP deployment (Composer DAG, deploy.sh)
├── dashboard/                # Static web dashboard
├── docs/
│   ├── GCP_OPTIMIZATION_PLAN.md
│   ├── LEGACY.md
│   ├── gcp/                  # GCP implementation guide
│   ├── aws/                  # AWS replication guide
│   └── azure/                # Azure replication guide
├── legacy/                   # Original notebooks (00–04, Gradio)
├── pipeline_utils/            # Shared utilities
│   ├── config.py
│   ├── spark_utils.py
│   ├── bq_utils.py
│   ├── gcs_utils.py
│   ├── schemas.py
│   └── logging_utils.py
└── pipeline/                 # Optimized pipeline scripts
    ├── ingest_tlc.py         # Stage 00
    ├── 01_gcs_to_bronze.py   # Stage 01 (PySpark)
    ├── 02_bronze_to_silver.py # Stage 02 (PySpark)
    ├── 03_silver_to_preml.py # Stage 03 (PySpark)
    ├── 04a_fleet_recommender.py
    ├── 04b_anomalies.py
    ├── export_dashboard_impl.py
    ├── 05_ExportDashboardData.py
    └── run_all.py

Quick Start

Option A: Optimized Pipeline (recommended)

# Python stages (00, 04a, 04b, 05)
python pipeline/ingest_tlc.py
python pipeline/04a_fleet_recommender.py
python pipeline/04b_anomalies.py
python pipeline/05_ExportDashboardData.py

# PySpark stages (01, 02, 03) — run on Dataproc
gcloud dataproc jobs submit pyspark pipeline/01_gcs_to_bronze.py --cluster=CLUSTER --region=us-central1
gcloud dataproc jobs submit pyspark pipeline/02_bronze_to_silver.py --cluster=CLUSTER --region=us-central1
gcloud dataproc jobs submit pyspark pipeline/03_silver_to_preml.py --cluster=CLUSTER --region=us-central1
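
The commands above can also be assembled programmatically. The following is a minimal sketch, not the contents of pipeline/run_all.py (whose behavior is not documented here). It assumes the stages are meant to run in numeric order (00, then 01 to 03, then 04a, 04b, 05), and the helper names `build_commands` and `run_pipeline` are hypothetical.

```python
import shlex
import subprocess

# Stage scripts as listed in the project structure.
PYTHON_STAGES_BEFORE = ["pipeline/ingest_tlc.py"]  # Stage 00
SPARK_STAGES = [  # Stages 01-03, submitted to Dataproc
    "pipeline/01_gcs_to_bronze.py",
    "pipeline/02_bronze_to_silver.py",
    "pipeline/03_silver_to_preml.py",
]
PYTHON_STAGES_AFTER = [  # Stages 04a, 04b, 05
    "pipeline/04a_fleet_recommender.py",
    "pipeline/04b_anomalies.py",
    "pipeline/05_ExportDashboardData.py",
]


def build_commands(cluster, region="us-central1"):
    """Return one argv list per stage, in assumed numeric stage order."""
    commands = [["python", script] for script in PYTHON_STAGES_BEFORE]
    for script in SPARK_STAGES:
        commands.append([
            "gcloud", "dataproc", "jobs", "submit", "pyspark", script,
            f"--cluster={cluster}", f"--region={region}",
        ])
    commands += [["python", script] for script in PYTHON_STAGES_AFTER]
    return commands


def run_pipeline(cluster, region="us-central1", dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands(cluster, region):
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the run at the first failing stage.
            subprocess.run(cmd, check=True)
```

Calling `run_pipeline("CLUSTER")` only prints the seven commands; pass `dry_run=False` to execute them. Keeping command construction separate from execution makes the stage order easy to inspect before anything is submitted.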

Option B: Legacy Notebooks

Use the notebooks in legacy/ in order: 00 → 01 → 02a–d → 03 → 04a → 04b.

Static Dashboard

Open dashboard/index.html in a browser. Check "Use demo data" to test without GCS. To load real data, configure GCS and uncheck that option.

GCP Resources

Resource            Name
Project             nyctaxi-467111
Raw bucket          nyc_raw_data_bucket
Dashboard bucket    nyc_dashboard_bucket
BigQuery            RawBronze, CleanSilver, PreMlGold, PostMlGold
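
As an illustration, the resource names above could be collected in one place and used to build fully qualified identifiers. This is a sketch only: it is not the contents of pipeline_utils/config.py, it assumes the four BigQuery names are datasets, and the class name `GcpResources`, the table name `trips`, and the object path in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GcpResources:
    """Resource names from the GCP Resources table (hypothetical container)."""

    project: str = "nyctaxi-467111"
    raw_bucket: str = "nyc_raw_data_bucket"
    dashboard_bucket: str = "nyc_dashboard_bucket"
    # Assumption: these four names are BigQuery datasets.
    bronze_dataset: str = "RawBronze"
    silver_dataset: str = "CleanSilver"
    preml_dataset: str = "PreMlGold"
    postml_dataset: str = "PostMlGold"

    def gcs_uri(self, bucket, path):
        """Build a gs:// URI for an object path inside a bucket."""
        return f"gs://{bucket}/{path.lstrip('/')}"

    def table_id(self, dataset, table):
        """Build a project.dataset.table identifier for BigQuery."""
        return f"{self.project}.{dataset}.{table}"
```

For example, `GcpResources().table_id("CleanSilver", "trips")` yields `nyctaxi-467111.CleanSilver.trips`, and `gcs_uri` turns a bucket and an object path into a `gs://` URI.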

Documentation

Doc                      Description
Docs Index               Documentation overview
Pipeline README          Optimized pipeline stages, run commands, config
GCP Implementation       Full GCP replication steps, scheduling, config
GCP Optimization Plan    Architecture, optimization rationale, implementation status
Legacy Code              Original notebooks and migration path
AWS Replication          AWS setup (S3, EMR, MWAA)
Azure Replication        Azure setup (Blob, Databricks, Data Factory)

License

Academic / thesis use.
