A data pipeline using a streaming dataset from the SPTrans API to collect data from the buses currently running in some regions of São Paulo (SPTrans is the company that manages the bus transportation system of São Paulo, Brazil). This project was built for study purposes.
In this project, the creation and management of cloud resources (Google Cloud) was done with Terraform. The workflow orchestration was managed by Airflow, which coordinates the integration with GCS (data lake), dbt (data transformation), and BigQuery (data warehouse). The Kafka, Spark, and Airflow instances were containerized with Docker and hosted on Google Compute Engine. The final data is served in Looker Studio.
The Kafka producer streams events generated from the SPTrans API every two minutes to the target topics, and PySpark handles the stream processing of the real-time data. The processed data is stored in the data lake periodically (also every two minutes). From there, the Airflow DAGs are triggered every three minutes to create the tables in BigQuery and run the data transformation with dbt, so that, in the end, the data is available in Looker Studio for visualization.
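The producer side of this flow can be sketched as below. This is a minimal, hedged sketch rather than the project's actual code: the field names (`l`, `vs`, `p`, `px`, `py`, `ta`, `cl`, `c`) follow the SPTrans Olho Vivo `/Posicao` response as I understand it, while the broker address, topic name, and omitted authentication are placeholder assumptions.

```python
# Sketch: poll the SPTrans position endpoint and publish one event per bus.
# Broker/topic are placeholders; SPTrans token authentication is omitted.
import json
import time


def flatten_positions(payload: dict) -> list[dict]:
    """Turn one /Posicao response into flat per-bus events."""
    events = []
    for line in payload.get("l", []):          # "l" = list of bus lines
        for bus in line.get("vs", []):         # "vs" = vehicles on that line
            events.append({
                "line_code": line.get("cl"),   # internal line code
                "line_sign": line.get("c"),    # line sign, e.g. "8000-10"
                "bus_prefix": bus.get("p"),    # vehicle prefix
                "lat": bus.get("py"),
                "lon": bus.get("px"),
                "captured_at": bus.get("ta"),  # position timestamp
            })
    return events


def main():
    # Assumes the kafka-python and requests libraries.
    import requests
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    session = requests.Session()
    while True:
        resp = session.get("http://api.olhovivo.sptrans.com.br/v2.1/Posicao")
        for event in flatten_positions(resp.json()):
            producer.send("bus-positions", value=event)
        producer.flush()
        time.sleep(120)  # stream every two minutes, as described above


if __name__ == "__main__":
    main()
```

Keeping the flattening in a pure function makes the API-to-event mapping easy to test without a broker; the PySpark job then only has to parse one flat JSON schema per message.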
I also scraped the SPTrans API (src/scrap_data.py) to map which company each bus line belongs to, so the final information is easier to understand.
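The result of that scraping is essentially a lookup table from line sign to operating company. The sketch below only illustrates the shape of that table; the function name and the records are illustrative assumptions, not the contents of `src/scrap_data.py`.

```python
# Sketch of the line-to-company lookup produced by the scraping step.
# The records below are made up for illustration.

def build_line_company_map(records: list[dict]) -> dict[str, str]:
    """Map each bus-line sign (e.g. '8000-10') to its operating company."""
    return {r["line_sign"]: r["company"] for r in records if r.get("line_sign")}


records = [
    {"line_sign": "8000-10", "company": "Example Operator A"},
    {"line_sign": "9000-21", "company": "Example Operator B"},
]
line_to_company = build_line_company_map(records)
# line_to_company["8000-10"] gives "Example Operator A"
```

During transformation, joining the streamed events against this table attaches a human-readable company name to every bus position.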
- Python
- Stream Processing: Apache Kafka and Apache Spark (PySpark)
- GCP - Google Cloud Platform
- Infrastructure as Code (IaC): Terraform
- Data Lake: Google Cloud Storage
- Data Warehouse: BigQuery
- Google Compute Engine
- Containerization: Docker
- Workflow Orchestration: Apache Airflow
- Data Transformation: dbt
- Data Visualization: Looker Studio
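The orchestration described above (DAGs triggered every three minutes to load data into BigQuery and then run dbt) could be sketched as the Airflow DAG below. The DAG id, bucket, dataset, object paths, and dbt project directory are all placeholder assumptions, not the project's real names.

```python
# Sketch: three-minute Airflow DAG that appends the latest data-lake files
# into BigQuery and then runs the dbt models. All names are placeholders.
from datetime import datetime, timedelta


def build_dag():
    # Assumes Airflow 2.x with the Google provider installed.
    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
        GCSToBigQueryOperator,
    )

    with DAG(
        dag_id="sptrans_bus_positions",
        start_date=datetime(2024, 1, 1),
        schedule_interval=timedelta(minutes=3),
        catchup=False,
    ) as dag:
        load = GCSToBigQueryOperator(
            task_id="gcs_to_bigquery",
            bucket="sptrans-data-lake",  # placeholder bucket
            source_objects=["bus_positions/*.parquet"],
            destination_project_dataset_table="project.raw.bus_positions",
            source_format="PARQUET",
            write_disposition="WRITE_APPEND",
        )
        transform = BashOperator(
            task_id="dbt_run",
            bash_command="dbt run --project-dir /opt/dbt/sptrans",
        )
        load >> transform  # load raw tables first, then transform with dbt
    return dag
```

With `catchup=False`, a missed three-minute window is simply skipped instead of backfilled, which suits an append-only stream where the next run picks up the newest files anyway.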