This project uses environmental and energy-related indicators from Eurostat to predict life expectancy for European countries.
Goal: Predict life expectancy across European countries using environmental, energy, and air quality data as features.
Data Sources: Eurostat (European Statistical Office)
- Air quality (PM10 & PM2.5 particulate matter)
- Renewable energy share
- Energy productivity
- Greenhouse gas emissions
- Urban environmental indicators (cities)
dianaPhd/
├── datasets/
│   ├── raw/                                  # Original Eurostat data files
│   │   ├── Eurostat-sdg_11_50 time series.csv
│   │   ├── share_of_Renewabled.xlsx
│   │   ├── energy_productivity.xlsx
│   │   ├── greenhouse_gas.xlsx
│   │   └── env_cities_And_greater_cities.xlsx
│   └── processed/                            # Cleaned, curated CSV files
│       ├── country/                          # Country-level features
│       │   ├── air_quality.csv
│       │   ├── renewable_energy_share.csv
│       │   ├── energy_productivity.csv
│       │   ├── greenhouse_gas_emissions.csv
│       │   ├── life_expectancy.csv           # Target variable (all ages/sex)
│       │   └── life_expectancy_at_birth.csv  # Target (simplified)
│       ├── city/                             # City-level features
│       │   └── environmental_cities.csv
│       └── metadata/                         # Quality reports & metadata
├── src/
│   └── data_processing/
│       ├── utils.py                          # Shared processing utilities
│       └── __init__.py
├── scripts/
│   ├── process_air_quality.py                # Process air quality data
│   ├── process_renewable_share.py            # Process renewable energy data
│   ├── process_energy_productivity.py        # Process energy productivity
│   ├── process_greenhouse_gas.py             # Process GHG emissions
│   ├── process_life_expectancy.py            # Process life expectancy (target)
│   ├── process_env_cities.py                 # Process city environmental data
│   └── process_all.py                        # Master script - runs all processing
├── notebooks/                                # Jupyter notebooks for analysis
├── requirements.txt
└── README.md
source venv/bin/activate
pip install -r requirements.txt

To process all raw data files at once:
# Process all datasets (including sparse city data)
python scripts/process_all.py
# Or skip city-level data (recommended if focusing on country-level prediction)
python scripts/process_all.py --skip-cities

This will:
- Read raw Eurostat Excel/CSV files
- Extract metadata and clean data
- Transform from wide format to long format
- Standardize country names and years
- Generate quality reports
- Save processed CSVs to datasets/processed/
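The wide-to-long step in the list above can be sketched with `pandas.DataFrame.melt`. This is only an illustration under stated assumptions, not the code in `src/data_processing/utils.py`: the function name `wide_to_long`, the `":"` missing-value marker, the file name `example.csv`, and most of the sample values are made up for the example.

```python
import pandas as pd

# Hypothetical wide-format table: one row per country, one column per year.
# ":" stands in for a missing-value marker; the real raw files may differ.
wide = pd.DataFrame({
    "country": ["Belgium", "Austria"],
    "2019": ["18.9", "17.1"],
    "2020": ["18.3", ":"],
})


def wide_to_long(df, metric_name, unit, source_file):
    """Melt year columns into rows and attach the standard metadata columns."""
    long_df = df.melt(id_vars="country", var_name="year", value_name="value")
    long_df["year"] = long_df["year"].astype(int)
    # Non-numeric markers become NaN instead of raising an error
    long_df["value"] = pd.to_numeric(long_df["value"], errors="coerce")
    long_df["metric_name"] = metric_name
    long_df["unit"] = unit
    long_df["source_file"] = source_file
    return long_df.sort_values(["country", "year"]).reset_index(drop=True)


long_df = wide_to_long(wide, "air_quality", "µg/m³", "example.csv")
print(long_df)
```

The result has one row per country-year pair and the column order `country, year, value, metric_name, unit, source_file`.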
You can also process datasets individually:
# Air quality (PM10 & PM2.5)
python scripts/process_air_quality.py
# Renewable energy share
python scripts/process_renewable_share.py
# Energy productivity
python scripts/process_energy_productivity.py
# Greenhouse gas emissions
python scripts/process_greenhouse_gas.py
# Life expectancy (target variable)
python scripts/process_life_expectancy.py --birth-only
# Environmental cities (sparse data)
python scripts/process_env_cities.py

All country-level datasets follow this standardized format:
country,year,value,metric_name,unit,source_file
Belgium,2020,18.3,air_quality,µg/m³,Eurostat-sdg_11_50 time series.csv
Belgium,2020,13.0,renewable_energy_share,percent,share_of_Renewabled.xlsx

City-level data includes additional columns:
city,country,year,value,area_type,metric_name,unit,source_file
Brussels,Belgium,2020,42.3,city,urban_environment,unknown,env_cities_And_greater_cities.xlsx

| File | Metric | Description | Unit | Years | Coverage |
|---|---|---|---|---|---|
| air_quality.csv | PM10, PM2.5 | Particulate matter concentration | µg/m³ | 2000-2019 | 64 countries |
| renewable_energy_share.csv | Renewable energy | % of renewable energy in gross final consumption | % | 2004-2024 | 43 countries |
| energy_productivity.csv | Energy efficiency | Economic output per unit of energy | euro/kgoe | 2000-2024 | 43 countries |
| greenhouse_gas_emissions.csv | GHG emissions | Absolute & indexed emissions | various | 1990-2023 | 31 countries |
| life_expectancy_at_birth.csv | Life expectancy | Life expectancy at birth (TARGET) | years | 1960-2024 | 56 countries |

| File | Description | Cities | Sparsity |
|---|---|---|---|
| environmental_cities.csv | Urban environmental indicators | 876 | 85.7% missing |
Note: City data is very sparse and may be better used aggregated to country level or excluded from initial models.
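One way to aggregate city data to country level, as the note suggests, is a per-country, per-year mean over the cities that actually report a value. The sketch below uses made-up rows in the city-level column layout. The `n_cities_reporting` column is an addition for the example, and averaging across cities is an assumption that only makes sense when all rows of a metric share the same unit.

```python
import pandas as pd

# Made-up rows in the city-level layout; values are illustrative only.
cities = pd.DataFrame({
    "city": ["Brussels", "Antwerp", "Vienna", "Graz"],
    "country": ["Belgium", "Belgium", "Austria", "Austria"],
    "year": [2020, 2020, 2020, 2020],
    "value": [42.3, 38.1, None, None],
    "area_type": ["city"] * 4,
    "metric_name": ["urban_environment"] * 4,
})

# Mean over reporting cities, plus how many cities contributed to it
country_level = (
    cities.groupby(["country", "year", "metric_name"])["value"]
    .agg(value="mean", n_cities_reporting="count")
    .reset_index()
)

# Drop country-years where no city reported a value
country_level = country_level.dropna(subset=["value"])
print(country_level)
```

Keeping the count of reporting cities makes it possible to filter out country-years whose average rests on very few observations.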
Each processed dataset includes a quality report in datasets/processed/metadata/:
{
  "file": "air_quality.csv",
  "processed_date": "2026-01-06",
  "rows": 1280,
  "countries": 64,
  "year_range": [2000, 2019],
  "completeness": {
    "overall": 0.80,
    "by_country": {...},
    "by_year": {...}
  },
  "missing_values": {
    "count": 256,
    "percentage": 20.0
  }
}

import pandas as pd
# Load processed country-level features
air_quality = pd.read_csv('datasets/processed/country/air_quality.csv')
renewable = pd.read_csv('datasets/processed/country/renewable_energy_share.csv')
energy_prod = pd.read_csv('datasets/processed/country/energy_productivity.csv')
ghg = pd.read_csv('datasets/processed/country/greenhouse_gas_emissions.csv')
# Load target variable (life expectancy at birth)
life_exp = pd.read_csv('datasets/processed/country/life_expectancy_at_birth.csv')
# For air quality, pivot to have PM10 and PM2.5 as separate columns
air_pivot = air_quality.pivot_table(
    index=['country', 'year'],
    columns='particle_type',
    values='value'
).reset_index()
air_pivot.columns.name = None
# Merge all features
features = air_pivot.merge(renewable[['country', 'year', 'value']],
                           on=['country', 'year'], how='outer',
                           suffixes=('', '_renewable'))
features = features.merge(energy_prod[['country', 'year', 'value']],
                          on=['country', 'year'], how='outer',
                          suffixes=('', '_energy'))
# For GHG, filter to get only one metric type (e.g., absolute values from Sheet 1)
ghg_filtered = ghg[(ghg['metric_type'] == 'absolute') & (ghg['sheet_name'] == 'Sheet 1')]
features = features.merge(ghg_filtered[['country', 'year', 'value']],
                          on=['country', 'year'], how='outer',
                          suffixes=('', '_ghg'))
# Rename value columns for clarity
features.rename(columns={
    'value': 'renewable_share',
    'value_energy': 'energy_productivity',
    'value_ghg': 'ghg_emissions'
}, inplace=True)
# Merge with target variable (life expectancy) - use inner join to keep only complete cases
data = features.merge(life_exp[['country', 'year', 'value']],
                      on=['country', 'year'], how='inner')
data.rename(columns={'value': 'life_expectancy'}, inplace=True)
# Remove rows with too many missing features
data_clean = data.dropna(thresh=len(data.columns) - 2) # Allow max 2 missing features
print(f"Final dataset shape: {data_clean.shape}")
print(f"Countries: {data_clean['country'].nunique()}")
print(f"Year range: {data_clean['year'].min()} - {data_clean['year'].max()}")
print(f"\nFeature columns: {[col for col in data_clean.columns if col not in ['country', 'year']]}")

Final dataset shape: (450, 8)
Countries: 25
Year range: 2004 - 2019
Feature columns: ['PM10', 'PM2.5', 'renewable_share', 'energy_productivity', 'ghg_emissions', 'life_expectancy']
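The merges in the example above silently multiply rows if a dataset has more than one row per country and year, which is why the GHG data is filtered to a single metric type first. A minimal sketch of checking this before merging follows. The helper `duplicated_keys` and all sample values are hypothetical; `validate="one_to_one"` is a standard `DataFrame.merge` argument that raises `pandas.errors.MergeError` when keys are not unique.

```python
import pandas as pd

KEYS = ["country", "year"]


def duplicated_keys(df, keys=KEYS):
    """Return all rows whose key combination occurs more than once."""
    return df[df.duplicated(subset=keys, keep=False)]


# Made-up long-format frame with two metric types for one country-year
ghg = pd.DataFrame({
    "country": ["Belgium", "Belgium", "Austria"],
    "year": [2020, 2020, 2020],
    "value": [100.0, 80.5, 75.0],
    "metric_type": ["absolute", "indexed", "absolute"],
})
# Made-up target values, for illustration only
target = pd.DataFrame({
    "country": ["Belgium", "Austria"],
    "year": [2020, 2020],
    "value": [81.0, 81.5],
})

print(f"Rows with duplicated keys before filtering: {len(duplicated_keys(ghg))}")

# After filtering to one metric type, each country-year appears once
ghg_one = ghg[ghg["metric_type"] == "absolute"]
merged = target.merge(
    ghg_one[KEYS + ["value"]],
    on=KEYS,
    how="left",
    suffixes=("", "_ghg"),
    validate="one_to_one",  # fails loudly if either side has duplicate keys
)
print(merged)
```

Running the same merge against the unfiltered frame would raise an error instead of quietly producing extra rows.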
jupyter notebook

Or use JupyterLab:

jupyter lab

- ✅ Data Collection & Processing - COMPLETE
  - All environmental features processed
  - Life expectancy target variable processed
- Data Integration - Use the example code above to merge datasets
- Exploratory Data Analysis
  - Visualize correlations between features and life expectancy
  - Check for temporal trends
  - Identify outliers and anomalies
- Feature Engineering
  - Create lagged features (previous years' values)
  - Calculate year-over-year changes
  - Compute rolling averages (3-year, 5-year windows)
  - Add interaction terms (e.g., PM2.5 × renewable_share)
- Handle Missing Values
  - Analyze missingness patterns
  - Choose imputation strategy (mean, median, KNN, forward-fill)
  - Or remove rows/countries with excessive missing data
- Build Prediction Models
  - Baseline: Linear regression
  - Tree-based: Random Forest, XGBoost, LightGBM
  - Neural Networks: Multi-layer perceptron
  - Time series: ARIMA, LSTM for temporal patterns
- Evaluate and Iterate
  - Train/test split (or time-based split for temporal validation)
  - Metrics: RMSE, MAE, R²
  - Cross-validation by country or year
  - Feature importance analysis
  - Model interpretation (SHAP values)
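The feature engineering ideas listed above (lags, year-over-year changes, rolling averages, interaction terms) could look roughly like this in pandas. The sketch uses a made-up two-country panel, and the new column names are hypothetical. It also assumes each country has one row per consecutive year, since `shift` and `diff` work on row position and would otherwise span gaps.

```python
import pandas as pd

# Made-up panel: two countries, five consecutive years each
data = pd.DataFrame({
    "country": ["A"] * 5 + ["B"] * 5,
    "year": list(range(2015, 2020)) * 2,
    "PM2.5": [10.0, 12.0, 11.0, 9.0, 8.0, 20.0, 18.0, 19.0, 17.0, 15.0],
    "renewable_share": [8.0, 8.5, 9.0, 9.5, 10.0, 30.0, 31.0, 32.0, 33.0, 34.0],
})

# Sort so that position-based operations follow time within each country
data = data.sort_values(["country", "year"]).reset_index(drop=True)
grouped = data.groupby("country")["PM2.5"]

# Previous year's value (NaN for each country's first year)
data["PM2.5_lag1"] = grouped.shift(1)

# Year-over-year change
data["PM2.5_yoy"] = grouped.diff()

# 3-year rolling average, computed separately per country
data["PM2.5_roll3"] = grouped.transform(
    lambda s: s.rolling(window=3, min_periods=3).mean()
)

# Interaction term
data["PM2.5_x_renewable"] = data["PM2.5"] * data["renewable_share"]

print(data)
```

Grouping by country before shifting keeps one country's last year from leaking into the next country's first row.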
- Temporal alignment: Different datasets cover different year ranges
- Missing values: 6-20% missing values in country-level data, 86% in city data
- Country coverage: Varies by dataset (31-64 countries)
- Units: Ensure proper interpretation when combining features
- City data: Very sparse; consider aggregating or excluding
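To make the temporal alignment point concrete, the nominal overlap of the country-level datasets can be computed from the year ranges in the dataset table above. This is a small sketch in plain Python. It only intersects the documented ranges, so individual countries may still have gaps inside the resulting window.

```python
# Year ranges copied from the country-level dataset table
year_ranges = {
    "air_quality": (2000, 2019),
    "renewable_energy_share": (2004, 2024),
    "energy_productivity": (2000, 2024),
    "greenhouse_gas_emissions": (1990, 2023),
    "life_expectancy_at_birth": (1960, 2024),
}

# The shared window starts at the latest first year and ends at the earliest last year
start = max(first for first, _ in year_ranges.values())
end = min(last for _, last in year_ranges.values())

# Identify which datasets set each bound
limits_start = [name for name, (first, _) in year_ranges.items() if first == start]
limits_end = [name for name, (_, last) in year_ranges.items() if last == end]

print(f"Common window: {start}-{end}")
print(f"Start limited by: {limits_start}")
print(f"End limited by: {limits_end}")
```

The window is bounded by the renewable energy share data at the start and the air quality data at the end, which matches the 2004 - 2019 year range in the merged example output.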
Research project - data sourced from Eurostat public datasets.