
πŸš€ Big Data Management - Spark & Data Architecture

πŸ‘©πŸ»β€πŸ’»πŸ§‘β€πŸ’» Project by Lucia Sauer & Pablo FernΓ‘ndez
πŸŽ“ Master in Data Science – Barcelona School of Economics

Tech stack: Python · Spark · Delta Lake · MLflow · Streamlit · Parquet


🧠 Project Overview

This project demonstrates a full data lake architecture and end-to-end data pipeline using Apache Spark, Delta Lake, and MLflow. Our use case centers around real estate market analysis in Barcelona, integrating data on:

  • πŸ“Š Idealista Property Listings (2020–Mar 2021) – JSON
  • πŸ›οΈ Cultural Sites in Barcelona – CSV from Open Data BCN
  • πŸ’° Income & Population Statistics (2017, used as 2020 proxy) – CSV from Open Data BCN

We implement a modular Spark-based pipeline that handles data ingestion, cleaning, feature engineering, and modeling, following a modern layered architecture: Landing Zone → Formatted Zone → Exploitation Zone → Descriptive & Predictive Analysis. From the Exploitation Zone, the pipeline branches into two paths: one for descriptive analysis and another for predictive modeling.

The project culminates in two main outputs:

  1. An interactive dashboard built with Streamlit for descriptive analysis.
  2. A machine learning pipeline using Spark ML and MLflow to predict housing prices.


πŸ“ Project Structure

πŸ“¦ project-root
β”œβ”€β”€ 1_landing_zone/               # Raw files from all data sources
β”œβ”€β”€ 2_formatted_zone/             # Cleaned + typed data (Delta/Parquet)
β”‚   β”œβ”€β”€ format_zone.py            # Spark pipeline: raw β†’ formatted
β”‚   └── validate_format_zone.py   # Validation script for formatted data
β”œβ”€β”€ 3_exploitation_zone/          # Final features, aggregated views
β”‚   β”œβ”€β”€ exploitation_zone.py      # Spark pipeline: formatted β†’ enriched
β”‚   └── validate_exploitation_zone.py # Validation script for exploitation data
β”œβ”€β”€ 4_dashboard/                  # Streamlit dashboard interface
β”‚   └── streamlit.py              # Dashboard app
β”œβ”€β”€ 5_predictive_analysis/        # ML pipeline for price prediction
β”‚   └── prediction.py             # Spark ML + MLflow script
└── README.md

πŸ“Œ Key Objectives

  • Structure data using a multi-zone lakehouse architecture.
  • Perform transformations and persist data using Spark and Delta Lake.
  • Validate and enrich datasets with features and KPIs.
  • Provide an interactive data analysis experience with Streamlit.
  • Train and evaluate machine learning models to predict housing prices.
  • Track experiments, log metrics, and manage models using MLflow.
  • Align implementation with real-world scalable data practices.

πŸ—‚οΈ Data Lake Zones

| Zone | Description | Format |
|------|-------------|--------|
| 🛬 Landing | Raw, untouched files as downloaded (one folder per dataset) | JSON, CSV |
| 🏗️ Formatted | Typed, normalized, and partitioned data in canonical schema | Delta, Parquet |
| 📈 Exploitation | Aggregated, cleaned datasets optimized for dashboards and analytics | Parquet |

πŸ“Š Analysis & Outputs

Descriptive Analysis: Interactive Dashboard

The 4_dashboard/streamlit.py script launches a web-based application that consumes the final datasets from the exploitation zone. It provides interactive visualizations, maps, and filters, allowing users to explore housing market trends, cultural site distribution, and socioeconomic indicators across Barcelona's districts.
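The dashboard's filters reduce to simple predicates over the exploitation-zone tables. A minimal sketch of one such helper (the column name `district` and the parquet path are assumptions):

```python
import pandas as pd

def filter_by_district(listings: pd.DataFrame, districts: list) -> pd.DataFrame:
    """Restrict listings to the selected districts; an empty selection means all."""
    if not districts:
        return listings
    return listings[listings["district"].isin(districts)]

# Inside streamlit.py this helper would back a sidebar widget, roughly:
#   import streamlit as st
#   df = pd.read_parquet("3_exploitation_zone/listings.parquet")  # hypothetical path
#   chosen = st.multiselect("District", sorted(df["district"].unique()))
#   view = filter_by_district(df, chosen)
#   st.bar_chart(view.groupby("district")["price"].mean())
```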

Predictive Analysis: Price Prediction Model

The 5_predictive_analysis/prediction.py script implements a complete machine learning pipeline to predict housing prices.

  • Frameworks: Uses Spark ML for modeling and MLflow for experiment tracking.
  • Models: Trains and compares GBTRegressor and LinearRegression.
  • Evaluation: Logs detailed metrics, including overall RMSE/MAE and per-district/property-type scores.
  • Feature Importance: Calculates and logs the most influential features for price prediction.
  • Deployment: Automatically identifies the best-performing model (lowest RMSE) and registers it in the MLflow Model Registry, transitioning it to the "Production" stage.

🎯 Key Performance Indicators (KPIs)

For descriptive analysis

Housing Market Insights

  • Average price per mΒ² by district/neighborhood - Monitor spatial price variations
  • Housing offer by property type (apartment, house, studio) - Understand market composition
  • Property size distribution trends - Monitor housing supply characteristics

Cultural Amenities Impact

  • Cultural site density by neighborhood - Measure cultural richness
  • Cultural diversity index - Variety of cultural categories per area

Socioeconomic Patterns

  • Population density indicator - Measure average population per neighborhood for each district
  • Purchasing power indicator - Track average household disposable income per district

For predictive analysis

Price Prediction Accuracy

  • Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) of price predictions by property type
  • Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) of price predictions per district

Feature Importance Metrics

  • Cultural proximity impact - Coefficient/importance of distance to cultural sites and amenities
  • Income level predictive power - How well district (and, indirectly, average income index) predicts prices
  • Property characteristics importance - Rooms, size, exterior features impact

πŸš€ How to Run

0. Prerequisites

Make sure you have uv installed on your machine, then sync the environment by running:

uv sync

This will install all the required dependencies in your environment. Then, activate the environment:

source .venv/bin/activate

1. Data Lake Zones

Within a terminal with the environment activated, run the following commands to create the data lake zones and populate them with the raw data files:

python 2_formatted_zone/format_zone.py
python 2_formatted_zone/validate_format_zone.py
python 3_exploitation_zone/exploitation_zone.py
python 3_exploitation_zone/validate_exploitation_zone.py

2. Run the MLflow Experiment

This will create the mlruns directory with the experiment results.

python 5_predictive_analysis/prediction.py

3. Run the MLflow Server

To initialize the MLflow server, run:

mlflow server --port 5005

4. Run the Streamlit Dashboard

In another terminal (with the environment activated and the MLflow server running), run the following command to start the Streamlit dashboard:

streamlit run 4_dashboard/streamlit.py

To access the dashboard, open your browser and go to http://localhost:8501.

You can view the experiment results embedded in the Streamlit dashboard, or browse runs, metrics, and registered models in the MLflow UI (at http://localhost:5005).

The code will compute disaggregated metrics by district and property type (flat, chalet, penthouse, etc.).
