Project by Lucia Sauer & Pablo Fernández
Master in Data Science, Barcelona School of Economics
This project demonstrates a full data lake architecture and end-to-end data pipeline using Apache Spark, Delta Lake, and MLflow. Our use case centers around real estate market analysis in Barcelona, integrating data on:
- Idealista Property Listings (2020 to Mar 2021) - JSON
- Cultural Sites in Barcelona - CSV from Open Data BCN
- Income & Population Statistics (2017, used as 2020 proxy) - CSV from Open Data BCN
We implement a modular Spark-based pipeline that handles data ingestion, cleaning, feature engineering, and modeling, adhering to a modern layered architecture. The pipeline branches from the Exploitation Zone into two paths: one for descriptive analysis and another for predictive modeling.
Landing Zone → Formatted Zone → Exploitation Zone → Descriptive & Predictive Analysis.
The project culminates in two main outputs:
- An interactive dashboard built with Streamlit for descriptive analysis.
- A machine learning pipeline using Spark ML and MLflow to predict housing prices.
project-root
├── 1_landing_zone/                        # Raw files from all data sources
├── 2_formatted_zone/                      # Cleaned + typed data (Delta/Parquet)
│   ├── format_zone.py                     # Spark pipeline: raw → formatted
│   └── validate_format_zone.py            # Validation script for formatted data
├── 3_exploitation_zone/                   # Final features, aggregated views
│   ├── exploitation_zone.py               # Spark pipeline: formatted → enriched
│   └── validate_exploitation_zone.py      # Validation script for exploitation data
├── 4_dashboard/                           # Streamlit dashboard interface
│   └── streamlit.py                       # Dashboard app
├── 5_predictive_analysis/                 # ML pipeline for price prediction
│   └── prediction.py                      # Spark ML + MLflow script
└── README.md
- Structure data using a multi-zone lakehouse architecture.
- Perform transformations and persist data using Spark and Delta Lake.
- Validate and enrich datasets with features and KPIs.
- Provide an interactive data analysis experience with Streamlit.
- Train and evaluate machine learning models to predict housing prices.
- Track experiments, log metrics, and manage models using MLflow.
- Align implementation with real-world scalable data practices.
| Zone | Description | Format |
|---|---|---|
| Landing | Raw, untouched files as downloaded (one folder per dataset) | JSON, CSV |
| Formatted | Typed, normalized, and partitioned data in canonical schema | Delta, Parquet |
| Exploitation | Aggregated, cleaned datasets optimized for dashboards and analytics | Parquet |
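As a rough illustration of what "typed and normalized" means for the Formatted zone, the sketch below converts raw CSV text into typed records using only the standard library. It is not the project's Spark code: the column names, the `IncomeRecord` class, the `format_rows` helper, and the sample values are all hypothetical.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional

# Made-up sample rows; the column names are hypothetical, not the real schema.
SAMPLE_CSV = (
    "district,neighborhood,year,population,income_index\n"
    " ciutat vella ,el Raval,2017,1000,80.5\n"
    "gràcia,Vila de Gràcia,2017,,110.0\n"
)


@dataclass
class IncomeRecord:
    district: str
    neighborhood: str
    year: int
    population: Optional[int]      # None when the raw field is empty
    income_index: Optional[float]  # None when the raw field is empty


def _to_int(raw: str) -> Optional[int]:
    raw = raw.strip()
    return int(raw) if raw else None


def _to_float(raw: str) -> Optional[float]:
    raw = raw.strip()
    return float(raw) if raw else None


def format_rows(raw_csv: str) -> List[IncomeRecord]:
    """Cast every raw string field to its target type and tidy the names."""
    records = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        records.append(
            IncomeRecord(
                district=row["district"].strip().title(),
                neighborhood=row["neighborhood"].strip(),
                year=int(row["year"]),
                population=_to_int(row["population"]),
                income_index=_to_float(row["income_index"]),
            )
        )
    return records


records = format_rows(SAMPLE_CSV)
```

In the actual pipeline the same idea would be expressed with an explicit Spark schema and column casts before writing to Delta or Parquet.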
The 4_dashboard/streamlit.py script launches a web-based application that consumes the final datasets from the exploitation zone. It provides interactive visualizations, maps, and filters, allowing users to explore housing market trends, cultural site distribution, and socioeconomic indicators across Barcelona's districts.
The 5_predictive_analysis/prediction.py script implements a complete machine learning pipeline to predict housing prices.
- Frameworks: Uses Spark ML for modeling and MLflow for experiment tracking.
- Models: Trains and compares `GBTRegressor` and `LinearRegression`.
- Evaluation: Logs detailed metrics, including overall RMSE/MAE and per-district/property-type scores.
- Feature Importance: Calculates and logs the most influential features for price prediction.
- Deployment: Automatically identifies the best-performing model (lowest RMSE) and registers it in the MLflow Model Registry, transitioning it to the "Production" stage.
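The selection rule described in the list above (keep the model with the lowest RMSE) can be sketched in plain Python. This is only an illustration under assumptions: the helper names and the toy prices are made up, and the real script evaluates Spark ML models and registers the winner through MLflow, which is not shown here.

```python
import math
from typing import Dict, Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error over paired observations."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error over paired observations."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pick_best_model(metrics: Dict[str, Dict[str, float]]) -> str:
    """Return the name of the model with the lowest RMSE."""
    return min(metrics, key=lambda name: metrics[name]["rmse"])


# Toy values for illustration only (not project results).
actual = [100.0, 200.0, 300.0]
predictions = {
    "GBTRegressor": [110.0, 190.0, 310.0],
    "LinearRegression": [130.0, 200.0, 270.0],
}

metrics = {
    name: {"rmse": rmse(actual, preds), "mae": mae(actual, preds)}
    for name, preds in predictions.items()
}
best = pick_best_model(metrics)
```

With these toy numbers the first model has the smaller RMSE, so `best` is `"GBTRegressor"`; in the real pipeline that name would be the one registered and promoted.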
- Average price per m² by district/neighborhood - Monitor spatial price variations
- Housing offer by property type (apartment, house, studio) - Understand market composition
- Property size distribution trends - Monitor housing supply characteristics
- Cultural site density by neighborhood - Measure cultural richness
- Cultural diversity index - Variety of cultural categories per area
- Population density indicator - Measure average population per neighborhood for each district
- Purchasing power indicator - Track average household disposable income per district
- Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) of price predictions by property type
- Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) of price predictions per district
- Cultural proximity impact - Coefficient/importance of distance to cultural sites and amenities
- Income level predictive power - How well district (and, indirectly, average income index) predicts prices
- Property characteristics importance - Rooms, size, exterior features impact
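Two of the descriptive KPIs listed above can be sketched in a few lines of plain Python. The sketch makes assumptions that the source does not confirm: the average price per m² is taken as the mean of per-listing ratios, the cultural diversity index is taken as the number of distinct site categories per area, and the field names and sample values are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def avg_price_per_m2(listings: Iterable[dict]) -> Dict[str, float]:
    """Mean of per-listing price/size ratios, grouped by district."""
    ratios: Dict[str, List[float]] = defaultdict(list)
    for item in listings:
        if item["size_m2"] > 0:  # skip listings with no usable size
            ratios[item["district"]].append(item["price"] / item["size_m2"])
    return {district: sum(vals) / len(vals) for district, vals in ratios.items()}


def cultural_diversity(sites: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Number of distinct cultural categories per area (assumed definition)."""
    categories = defaultdict(set)
    for area, category in sites:
        categories[area].add(category)
    return {area: len(cats) for area, cats in categories.items()}


# Invented sample data for illustration only.
listings = [
    {"district": "Gràcia", "price": 300000.0, "size_m2": 100.0},
    {"district": "Gràcia", "price": 200000.0, "size_m2": 50.0},
    {"district": "Eixample", "price": 450000.0, "size_m2": 90.0},
]
sites = [
    ("Gràcia", "museum"),
    ("Gràcia", "theatre"),
    ("Gràcia", "museum"),
    ("Eixample", "library"),
]

price_kpi = avg_price_per_m2(listings)
diversity_kpi = cultural_diversity(sites)
```

A different definition, such as total price divided by total surface, or an entropy-based diversity index, would change the numbers, so the definition should be checked against the exploitation-zone code.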
Make sure you have uv installed on your computer, then sync the environment by running:
uv sync
This will install all the required dependencies in your environment. Then, activate the environment:
source .venv/bin/activate
In a terminal with the environment activated, run the following commands to create the data lake zones and populate them with the raw data files:
python 2_formatted_zone/format_zone.py
python 2_formatted_zone/validate_format_zone.py
python 3_exploitation_zone/exploitation_zone.py
python 3_exploitation_zone/validate_exploitation_zone.py
Then run the predictive analysis. This will create the mlruns directory with the experiment results:
mlflow run 5_predictive_analysis/prediction.py
To start the MLflow server, run:
mlflow server --port 5005
In another terminal, with the environment activated and the MLflow server running, start the Streamlit dashboard:
streamlit run 4_dashboard/streamlit.py
To access the dashboard, open your browser and go to http://localhost:8501.
You can view the experiment results embedded in the Streamlit dashboard, or inspect runs, metrics, and registered models in the MLflow UI (at http://localhost:5005).
The code will compute disaggregated metrics by district and property type (flat, chalet, penthouse, etc.).
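A minimal sketch of what "disaggregated metrics" could look like, assuming each scored row carries the true price, the prediction, and the grouping columns. The field names (`district`, `propertyType`, `price`, `prediction`), the `grouped_errors` helper, and the toy rows are assumptions, and the project itself computes these metrics with Spark.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List


def grouped_errors(rows: Iterable[dict], key: str) -> Dict[str, Dict[str, float]]:
    """Compute RMSE and MAE separately for each value of `key`."""
    errors: Dict[str, List[float]] = defaultdict(list)
    for row in rows:
        errors[row[key]].append(row["prediction"] - row["price"])
    return {
        group: {
            "rmse": math.sqrt(sum(e * e for e in errs) / len(errs)),
            "mae": sum(abs(e) for e in errs) / len(errs),
        }
        for group, errs in errors.items()
    }


# Toy scored rows for illustration only (not project results).
scored = [
    {"district": "A", "propertyType": "flat", "price": 100.0, "prediction": 110.0},
    {"district": "A", "propertyType": "chalet", "price": 200.0, "prediction": 170.0},
    {"district": "B", "propertyType": "flat", "price": 300.0, "prediction": 300.0},
    {"district": "B", "propertyType": "flat", "price": 400.0, "prediction": 440.0},
]

by_district = grouped_errors(scored, "district")
by_type = grouped_errors(scored, "propertyType")
```

Calling the same helper with a different key gives the per-district and per-property-type views from one set of predictions.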

