A data pipeline using a streaming dataset from the SPTrans API to collect data from the buses currently running in some regions of São Paulo (SPTrans is the company that manages the bus transportation system of São Paulo, Brazil). This project was built for study purposes.
In this project, the creation and management of cloud resources (Google Cloud) was done with Terraform. The workflow orchestration was managed by Airflow, which coordinates the integration with GCS (data lake), dbt (data transformation), and BigQuery (data warehouse). The Kafka, Spark, and Airflow instances were containerized with Docker and hosted on Google Compute Engine. The final data is served in Looker Studio.
The Kafka producer streams events generated from the SPTrans API every two minutes to the target topics, and PySpark handles the stream processing of the real-time data. The processed data is stored in the data lake periodically (also every two minutes). From there, the Airflow DAGs are triggered every three minutes to create the tables in BigQuery and run the data transformation with dbt, so that, in the end, the data is available in Looker Studio for visualization.
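The producer side of this flow can be sketched as below. This is a minimal, hedged sketch rather than the project's actual code: the field names (`l`, `vs`, `p`, `px`, `py`, `ta`, `cl`, `c`) follow the SPTrans Olho Vivo `/Posicao` response as I understand it, while the broker address, topic name, and omitted authentication are placeholder assumptions.

```python
# Sketch: poll the SPTrans position endpoint and publish one event per bus.
# Broker/topic are placeholders; SPTrans token authentication is omitted.
import json
import time


def flatten_positions(payload: dict) -> list[dict]:
    """Turn one /Posicao response into flat per-bus events."""
    events = []
    for line in payload.get("l", []):          # "l" = list of bus lines
        for bus in line.get("vs", []):         # "vs" = vehicles on that line
            events.append({
                "line_code": line.get("cl"),   # internal line code
                "line_sign": line.get("c"),    # line sign, e.g. "8000-10"
                "bus_prefix": bus.get("p"),    # vehicle prefix
                "lat": bus.get("py"),
                "lon": bus.get("px"),
                "captured_at": bus.get("ta"),  # position timestamp
            })
    return events


def main():
    # Assumes the kafka-python and requests libraries.
    import requests
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    session = requests.Session()
    while True:
        resp = session.get("http://api.olhovivo.sptrans.com.br/v2.1/Posicao")
        for event in flatten_positions(resp.json()):
            producer.send("bus-positions", value=event)
        producer.flush()
        time.sleep(120)  # stream every two minutes, as described above


if __name__ == "__main__":
    main()
```

Keeping the flattening in a pure function makes the API-to-event mapping easy to test without a broker; the PySpark job then only has to parse one flat JSON schema per message.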
I also scraped the SPTrans API (src/scrap_data.py) to map which company each bus line belongs to, so the final information is easier to understand.
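The result of that scraping is essentially a lookup table from line sign to operating company. The sketch below only illustrates the shape of that table; the function name and the records are illustrative assumptions, not the contents of `src/scrap_data.py`.

```python
# Sketch of the line-to-company lookup produced by the scraping step.
# The records below are made up for illustration.

def build_line_company_map(records: list[dict]) -> dict[str, str]:
    """Map each bus-line sign (e.g. '8000-10') to its operating company."""
    return {r["line_sign"]: r["company"] for r in records if r.get("line_sign")}


records = [
    {"line_sign": "8000-10", "company": "Example Operator A"},
    {"line_sign": "9000-21", "company": "Example Operator B"},
]
line_to_company = build_line_company_map(records)
# line_to_company["8000-10"] gives "Example Operator A"
```

During transformation, joining the streamed events against this table attaches a human-readable company name to every bus position.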
- Python
- Stream Processing: Apache Kafka and Apache Spark (PySpark)
- GCP - Google Cloud Platform
- Infrastructure as Code (IaC): Terraform
- Data Lake: Google Cloud Storage
- Data Warehouse: BigQuery
- Google Compute Engine
- Containerization: Docker
- Workflow Orchestration: Apache Airflow
- Data Transformation: dbt
- Data Visualization: Looker Studio
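The orchestration described above (DAGs triggered every three minutes to load data into BigQuery and then run dbt) could be sketched as the Airflow DAG below. The DAG id, bucket, dataset, object paths, and dbt project directory are all placeholder assumptions, not the project's real names.

```python
# Sketch: three-minute Airflow DAG that appends the latest data-lake files
# into BigQuery and then runs the dbt models. All names are placeholders.
from datetime import datetime, timedelta


def build_dag():
    # Assumes Airflow 2.x with the Google provider installed.
    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
        GCSToBigQueryOperator,
    )

    with DAG(
        dag_id="sptrans_bus_positions",
        start_date=datetime(2024, 1, 1),
        schedule_interval=timedelta(minutes=3),
        catchup=False,
    ) as dag:
        load = GCSToBigQueryOperator(
            task_id="gcs_to_bigquery",
            bucket="sptrans-data-lake",  # placeholder bucket
            source_objects=["bus_positions/*.parquet"],
            destination_project_dataset_table="project.raw.bus_positions",
            source_format="PARQUET",
            write_disposition="WRITE_APPEND",
        )
        transform = BashOperator(
            task_id="dbt_run",
            bash_command="dbt run --project-dir /opt/dbt/sptrans",
        )
        load >> transform  # load raw tables first, then transform with dbt
    return dag
```

With `catchup=False`, a missed three-minute window is simply skipped instead of backfilled, which suits an append-only stream where the next run picks up the newest files anyway.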