
SP Bus Data - DE Project

A data pipeline built on a streaming dataset from the SPTRANS API, collecting data from buses currently running in some regions of São Paulo (SPTRANS is the company that manages the bus transportation system in São Paulo, Brazil). This project was built for study purposes.

Overview

In this project, cloud resources (Google Cloud) were created and managed with Terraform. Workflow orchestration is handled by Airflow, which coordinates the integration between GCS (data lake), dbt (data transformation), and BigQuery (data warehouse). The Kafka, Spark, and Airflow instances were containerized with Docker and hosted on Google Compute Engine. The final data is served on Looker Studio.

The Kafka producer streams events generated from the SPTRANS API to the target topics every two minutes, and PySpark handles the stream processing of the real-time data. The processed data is stored in the data lake periodically (also every two minutes). From there, Airflow DAGs are triggered every three minutes to create the tables in BigQuery and run the data transformations with dbt, so that in the end the data is available in Looker Studio for visualization.

I also scraped data from the SPTRANS API (src/scrap_data.py) to map each bus line to its operating company, so the final information is easier to understand.
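The core of that mapping step might look like this. The field names (`line_sign`, `company`) and the CSV output are illustrative only; the actual logic lives in src/scrap_data.py and may use different routes and keys.

```python
import csv


def build_company_mapping(lines: list[dict]) -> dict[str, str]:
    """Map each line sign (e.g. '8000-10') to its operating company.

    Skips entries where either field is missing, since the scraped
    data may be incomplete.
    """
    return {
        entry["line_sign"]: entry["company"]
        for entry in lines
        if entry.get("line_sign") and entry.get("company")
    }


def save_mapping(mapping: dict[str, str], path: str) -> None:
    """Persist the mapping as a two-column CSV for later enrichment."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["line_sign", "company"])
        for sign, company in sorted(mapping.items()):
            writer.writerow([sign, company])
```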

Pipeline Flow

(Pipeline flow diagram)

Looker Studio Visualization

(Looker Studio dashboard screenshot)

Documentation

Tools and Technologies Used
