
πŸš€ Big Data Management - Spark & Data Architecture

πŸ‘©πŸ»β€πŸ’»πŸ§‘β€πŸ’» Project by Lucia Sauer & Pablo FernΓ‘ndez
πŸŽ“ Master in Data Science – Barcelona School of Economics

Tech stack: Python · Spark · Delta Lake · MLflow · Streamlit · Parquet


🧠 Project Overview

This project demonstrates a full data lake architecture and end-to-end data pipeline using Apache Spark, Delta Lake, and MLflow. Our use case centers around real estate market analysis in Barcelona, integrating data on:

  • πŸ“Š Idealista Property Listings (2020–Mar 2021) – JSON
  • πŸ›οΈ Cultural Sites in Barcelona – CSV from Open Data BCN
  • πŸ’° Income & Population Statistics (2017, used as 2020 proxy) – CSV from Open Data BCN

We implement a modular Spark-based pipeline that handles data ingestion, cleaning, feature engineering, and modeling, following a modern layered architecture: Landing Zone → Formatted Zone → Exploitation Zone → Descriptive & Predictive Analysis. From the Exploitation Zone, the pipeline branches into two paths: one for descriptive analysis and another for predictive modeling.

The project culminates in two main outputs:

  1. An interactive dashboard built with Streamlit for descriptive analysis.
  2. A machine learning pipeline using Spark ML and MLflow to predict housing prices.


πŸ“ Project Structure

πŸ“¦ project-root
β”œβ”€β”€ 1_landing_zone/               # Raw files from all data sources
β”œβ”€β”€ 2_formatted_zone/             # Cleaned + typed data (Delta/Parquet)
β”‚   β”œβ”€β”€ format_zone.py            # Spark pipeline: raw β†’ formatted
β”‚   └── validate_format_zone.py   # Validation script for formatted data
β”œβ”€β”€ 3_exploitation_zone/          # Final features, aggregated views
β”‚   β”œβ”€β”€ exploitation_zone.py      # Spark pipeline: formatted β†’ enriched
β”‚   └── validate_exploitation_zone.py # Validation script for exploitation data
β”œβ”€β”€ 4_dashboard/                  # Streamlit dashboard interface
β”‚   └── streamlit.py              # Dashboard app
β”œβ”€β”€ 5_predictive_analysis/        # ML pipeline for price prediction
β”‚   └── prediction.py             # Spark ML + MLflow script
└── README.md

πŸ“Œ Key Objectives

  • Structure data using a multi-zone lakehouse architecture.
  • Perform transformations and persist data using Spark and Delta Lake.
  • Validate and enrich datasets with features and KPIs.
  • Provide an interactive data analysis experience with Streamlit.
  • Train and evaluate machine learning models to predict housing prices.
  • Track experiments, log metrics, and manage models using MLflow.
  • Align implementation with real-world scalable data practices.

πŸ—‚οΈ Data Lake Zones

| Zone | Description | Format |
|------|-------------|--------|
| 🛬 Landing | Raw, untouched files as downloaded (one folder per dataset) | JSON, CSV |
| 🏗️ Formatted | Typed, normalized, and partitioned data in canonical schema | Delta, Parquet |
| 📈 Exploitation | Aggregated, cleaned datasets optimized for dashboards and analytics | Parquet |

πŸ“Š Analysis & Outputs

Descriptive Analysis: Interactive Dashboard

The 4_dashboard/streamlit.py script launches a web-based application that consumes the final datasets from the exploitation zone. It provides interactive visualizations, maps, and filters, allowing users to explore housing market trends, cultural site distribution, and socioeconomic indicators across Barcelona's districts.
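The dashboard's filters reduce to simple predicates over the exploitation-zone tables. A minimal sketch of one such helper (the column name `district` and the parquet path are assumptions):

```python
import pandas as pd

def filter_by_district(listings: pd.DataFrame, districts: list) -> pd.DataFrame:
    """Restrict listings to the selected districts; an empty selection means all."""
    if not districts:
        return listings
    return listings[listings["district"].isin(districts)]

# Inside streamlit.py this helper would back a sidebar widget, roughly:
#   import streamlit as st
#   df = pd.read_parquet("3_exploitation_zone/listings.parquet")  # hypothetical path
#   chosen = st.multiselect("District", sorted(df["district"].unique()))
#   view = filter_by_district(df, chosen)
#   st.bar_chart(view.groupby("district")["price"].mean())
```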

Predictive Analysis: Price Prediction Model

The 5_predictive_analysis/prediction.py script implements a complete machine learning pipeline to predict housing prices.

  • Frameworks: Uses Spark ML for modeling and MLflow for experiment tracking.
  • Models: Trains and compares GBTRegressor and LinearRegression.
  • Evaluation: Logs detailed metrics, including overall RMSE/MAE and per-district/property-type scores.
  • Feature Importance: Calculates and logs the most influential features for price prediction.
  • Deployment: Automatically identifies the best-performing model (lowest RMSE) and registers it in the MLflow Model Registry, transitioning it to the "Production" stage.

🎯 Key Performance Indicators (KPIs)

For descriptive analysis

Housing Market Insights

  • Average price per mΒ² by district/neighborhood - Monitor spatial price variations
  • Housing offer by property type (apartment, house, studio) - Understand market composition
  • Property size distribution trends - Monitor housing supply characteristics

Cultural Amenities Impact

  • Cultural site density by neighborhood - Measure cultural richness
  • Cultural diversity index - Variety of cultural categories per area

Socioeconomic Patterns

  • Population density indicator - Measure average population per neighborhood for each district
  • Purchasing power indicator - Track average household disposable income per district

For predictive analysis

Price Prediction Accuracy

  • Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) of price predictions by property type
  • Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) of price predictions per district

Feature Importance Metrics

  • Cultural proximity impact - Coefficient/importance of distance to cultural sites and amenities
  • Income level predictive power - How well district (and, indirectly, average income index) predicts prices
  • Property characteristics importance - Rooms, size, exterior features impact

πŸš€ How to Run

0. Prerequisites

Make sure you have uv installed on your machine, then sync the environment by running:

uv sync

This will install all the required dependencies in your environment. Then, activate the environment:

source .venv/bin/activate

1. Data Lake Zones

Within a terminal with the environment activated, run the following commands to create the data lake zones and populate them with the raw data files:

python 2_formatted_zone/format_zone.py
python 2_formatted_zone/validate_format_zone.py
python 3_exploitation_zone/exploitation_zone.py
python 3_exploitation_zone/validate_exploitation_zone.py

2. Run the MLflow Experiment

This will create the mlruns directory with the experiment results.

python 5_predictive_analysis/prediction.py

3. Run the MLflow Server

To initialize the MLflow server, run:

mlflow server --port 5005

4. Run the Streamlit Dashboard

In another terminal (with the environment activated and the MLflow server running), run the following command to start the Streamlit dashboard:

streamlit run 4_dashboard/streamlit.py

To access the dashboard, open your browser and go to http://localhost:8501.

You can view the experiment results embedded in the Streamlit dashboard, or browse runs, metrics, and registered models in the MLflow UI (at http://localhost:5005).

The code will compute disaggregated metrics by district and property type (flat, chalet, penthouse, etc.).
