GCP-based data pipeline for NYC TLC taxi trip data: ingestion, transformation, ML (fleet recommender, anomaly detection), and a static dashboard.
```text
saithi thesis code/
├── README.md
├── requirements.txt
├── config/                      # Configuration
│   ├── gcs_cors.json
│   └── gcp/                     # GCP deployment (Composer DAG, deploy.sh)
├── dashboard/                   # Static web dashboard
├── docs/
│   ├── GCP_OPTIMIZATION_PLAN.md
│   ├── LEGACY.md
│   ├── gcp/                     # GCP implementation guide
│   ├── aws/                     # AWS replication guide
│   └── azure/                   # Azure replication guide
├── legacy/                      # Original notebooks (00–04, Gradio)
├── pipeline_utils/              # Shared utilities
│   ├── config.py
│   ├── spark_utils.py
│   ├── bq_utils.py
│   ├── gcs_utils.py
│   ├── schemas.py
│   └── logging_utils.py
└── pipeline/                    # Optimized pipeline scripts
    ├── ingest_tlc.py            # Stage 00
    ├── 01_gcs_to_bronze.py      # Stage 01 (PySpark)
    ├── 02_bronze_to_silver.py   # Stage 02 (PySpark)
    ├── 03_silver_to_preml.py    # Stage 03 (PySpark)
    ├── 04a_fleet_recommender.py
    ├── 04b_anomalies.py
    ├── export_dashboard_impl.py
    ├── 05_ExportDashboardData.py
    └── run_all.py
```
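`pipeline_utils/config.py` centralizes the project settings shared by all stages. A minimal sketch of such a module, assuming a frozen dataclass holds the resources listed in this README (the field names and structure here are illustrative assumptions, not the actual file):

```python
# Hypothetical sketch of centralized pipeline settings; the field names and
# structure are assumptions for illustration, not the actual config.py.
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    project_id: str = "nyctaxi-467111"           # GCP project (see resources table)
    raw_bucket: str = "nyc_raw_data_bucket"      # landing bucket for raw TLC files
    dashboard_bucket: str = "nyc_dashboard_bucket"
    region: str = "us-central1"
    # BigQuery datasets, in medallion order
    bq_datasets: tuple = ("RawBronze", "CleanSilver", "PreMlGold", "PostMlGold")


# Single shared instance imported by the pipeline stages
CONFIG = PipelineConfig()
```

Keeping bucket, project, and dataset names in one frozen object means a stage script never hard-codes a resource name, and swapping environments is a one-file change.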
```shell
# Python stages (00, 04a, 04b, 05)
python pipeline/ingest_tlc.py
python pipeline/04a_fleet_recommender.py
python pipeline/04b_anomalies.py
python pipeline/05_ExportDashboardData.py

# PySpark stages (01, 02, 03) — run on Dataproc
gcloud dataproc jobs submit pyspark pipeline/01_gcs_to_bronze.py --cluster=CLUSTER --region=us-central1
gcloud dataproc jobs submit pyspark pipeline/02_bronze_to_silver.py --cluster=CLUSTER --region=us-central1
gcloud dataproc jobs submit pyspark pipeline/03_silver_to_preml.py --cluster=CLUSTER --region=us-central1
```

To run the legacy version instead, use the notebooks in legacy/ in order: 00 → 01 → 02a–d → 03 → 04a → 04b.
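`pipeline/run_all.py` presumably chains these stages in order. A minimal sketch of such an orchestrator, assuming plain `subprocess` calls and the stage list above (the stage table and command construction are illustrative assumptions, not the actual script):

```python
# Hypothetical orchestrator sketch mirroring run_all.py's role; the stage
# table and command construction are assumptions, not the actual script.
import subprocess

# (stage script, runs on Dataproc?)
STAGES = [
    ("pipeline/ingest_tlc.py", False),            # 00
    ("pipeline/01_gcs_to_bronze.py", True),       # 01 (PySpark)
    ("pipeline/02_bronze_to_silver.py", True),    # 02 (PySpark)
    ("pipeline/03_silver_to_preml.py", True),     # 03 (PySpark)
    ("pipeline/04a_fleet_recommender.py", False), # 04a
    ("pipeline/04b_anomalies.py", False),         # 04b
    ("pipeline/05_ExportDashboardData.py", False),# 05
]


def build_command(script, on_dataproc, cluster="CLUSTER", region="us-central1"):
    """Return the shell command (as an argv list) for one stage."""
    if on_dataproc:
        return ["gcloud", "dataproc", "jobs", "submit", "pyspark", script,
                f"--cluster={cluster}", f"--region={region}"]
    return ["python", script]


def run_all():
    # Stop the whole pipeline on the first failing stage.
    for script, on_dataproc in STAGES:
        subprocess.run(build_command(script, on_dataproc), check=True)


if __name__ == "__main__":
    run_all()
```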
Open dashboard/index.html in a browser. Check "Use demo data" to test without GCS, or configure GCS access and uncheck it to load real data.
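For the dashboard to fetch exported data from GCS directly in the browser, the bucket needs a CORS policy; `config/gcs_cors.json` likely resembles the following sketch (the origin, methods, and headers shown are assumptions, not the actual file):

```json
[
  {
    "origin": ["*"],
    "method": ["GET", "HEAD"],
    "responseHeader": ["Content-Type"],
    "maxAgeSeconds": 3600
  }
]
```

It can be applied with `gsutil cors set config/gcs_cors.json gs://nyc_dashboard_bucket`.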
| Resource | Name |
|---|---|
| Project | nyctaxi-467111 |
| Raw bucket | nyc_raw_data_bucket |
| Dashboard bucket | nyc_dashboard_bucket |
| BigQuery | RawBronze, CleanSilver, PreMlGold, PostMlGold |

| Doc | Description |
|---|---|
| Docs Index | Documentation overview |
| Pipeline README | Optimized pipeline stages, run commands, config |
| GCP Implementation | Full GCP replication steps, scheduling, config |
| GCP Optimization Plan | Architecture, optimization rationale, implementation status |
| Legacy Code | Original notebooks and migration path |
| AWS Replication | AWS setup (S3, EMR, MWAA) |
| Azure Replication | Azure setup (Blob, Databricks, Data Factory) |
Academic / thesis use.