evans25575/RealTime_Data_Pipeline
βš™οΈ Real-Time Data Pipeline

A real-time data engineering pipeline that ingests simulated stock data using Apache Kafka, processes it using Apache Spark, stores the results in MySQL, and schedules tasks using Apache Airflow.


📌 Objectives

  • Simulate real-time data streaming from a Python producer.
  • Ingest data via Kafka and process it in real time using Spark Structured Streaming.
  • Store transformed data in MySQL for querying and reporting.
  • Automate the end-to-end process using Apache Airflow.

🛠 Tech Stack

| Component         | Tool/Framework                        |
|-------------------|---------------------------------------|
| Data source       | Python generator                      |
| Messaging queue   | Apache Kafka                          |
| Stream processing | Apache Spark (Structured Streaming)   |
| Storage           | MySQL                                 |
| Orchestration     | Apache Airflow                        |
| Language          | Python                                |

📂 Project Structure

```
.
├── kafka_producer.py          # Simulates streaming data into Kafka
├── kafka_consumer_spark.py    # Consumes Kafka messages and processes in Spark
├── create_mysql_db.sql        # MySQL schema setup
├── airflow_dag.py             # Airflow DAG to schedule tasks
├── requirements.txt           # Dependencies
└── README.md                  # Project documentation
```


---

🚀 How to Run the Project

1. Setup MySQL

```sql
CREATE DATABASE real_time_db;
USE real_time_db;

CREATE TABLE data_stream (
  id INT AUTO_INCREMENT PRIMARY KEY,
  timestamp DOUBLE,   -- Unix epoch seconds
  value INT
);
```

2. Start Kafka Broker

Ensure a Kafka broker is running on localhost:9092. You can use a local Kafka installation or a Docker container.

3. Run Kafka Producer

```bash
python kafka_producer.py
```

This script will send random stock data (timestamp + value) to the Kafka topic.
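The producer's core logic can be sketched without a broker. The snippet below builds and serializes one simulated tick; the function name, value range, and topic name are illustrative assumptions, not the actual contents of kafka_producer.py:

```python
import json
import random
import time

def build_tick() -> dict:
    """Build one simulated stock tick: current epoch timestamp plus a random value."""
    return {"timestamp": time.time(), "value": random.randint(0, 100)}

# Serialize the tick to the JSON bytes a Kafka producer would send.
payload = json.dumps(build_tick()).encode("utf-8")
print(payload)

# In the real script, the payload would be published with a Kafka client, e.g.
# (kafka-python, topic name assumed):
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   producer.send("stock_data", payload)
```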

4. Run Spark Consumer

```bash
python kafka_consumer_spark.py
```

This script will read messages from Kafka, process them using Spark, and store the results in the MySQL database.
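The parsing step the Spark job performs can be illustrated without a cluster. The plain-Python snippet below mimics how one JSON message body from Kafka is decoded into the (timestamp, value) row written to MySQL; the exact schema is an assumption based on the table defined above:

```python
import json

def parse_message(raw: bytes) -> tuple:
    """Decode one Kafka message body into the (timestamp, value) row stored in MySQL."""
    record = json.loads(raw)
    return float(record["timestamp"]), int(record["value"])

row = parse_message(b'{"timestamp": 1722001944.174284, "value": 45}')
print(row)  # (1722001944.174284, 45)
```

In the actual Spark Structured Streaming job, the equivalent work is done on a DataFrame read from the `kafka` source, with the value column parsed against a schema rather than record by record.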

5. Schedule with Airflow (Optional but Recommended)

Use airflow_dag.py to automate the Kafka producer and Spark consumer jobs every 5 minutes.
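A minimal shape for such a DAG is sketched below. This is a configuration sketch, not the repository's actual airflow_dag.py: it assumes Airflow 2.x, BashOperator tasks, and illustrative task and DAG ids.

```python
# Sketch of an Airflow 2.x DAG running the pipeline every 5 minutes (ids are assumptions).
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="real_time_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(minutes=5),  # every 5 minutes
    catchup=False,
) as dag:
    produce = BashOperator(task_id="kafka_producer", bash_command="python kafka_producer.py")
    consume = BashOperator(task_id="spark_consumer", bash_command="python kafka_consumer_spark.py")
    produce >> consume  # run the producer, then the consumer
```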


---

🧠 Sample Output

```
Sent data: {'timestamp': 1722001944.174284, 'value': 45}
+------------------+-------+
|     timestamp    | value |
+------------------+-------+
| 1722001944.174284|   45  |
+------------------+-------+
```


---

🧱 Architecture Overview

```mermaid
graph TD;
  PythonProducer -->|Stream JSON| Kafka[Kafka Topic];
  Kafka --> Spark[Apache Spark Consumer];
  Spark --> MySQL[MySQL Database];
  Airflow -->|Schedules| PythonProducer;
  Airflow -->|Schedules| Spark;
```



πŸ‘¨β€πŸ’» Author

Evans Kiplangat
🌐 Portfolio
πŸ™ GitHub
πŸ’Ό LinkedIn



📜 License

MIT License
