Diana PhD Project: Life Expectancy Prediction Using Environmental Data

This project uses environmental and energy-related indicators from Eurostat to predict life expectancy for European countries.

Project Overview

Goal: Predict life expectancy across European countries using environmental, energy, and air quality data as features.

Data Sources: Eurostat (European Statistical Office)

  • Air quality (PM10 & PM2.5 particulate matter)
  • Renewable energy share
  • Energy productivity
  • Greenhouse gas emissions
  • Urban environmental indicators (cities)

Project Structure

dianaPhd/
├── datasets/
│   ├── raw/                          # Original Eurostat data files
│   │   ├── Eurostat-sdg_11_50 time series.csv
│   │   ├── share_of_Renewabled.xlsx
│   │   ├── energy_productivity.xlsx
│   │   ├── greenhouse_gas.xlsx
│   │   └── env_cities_And_greater_cities.xlsx
│   └── processed/                    # Cleaned, curated CSV files
│       ├── country/                  # Country-level features
│       │   ├── air_quality.csv
│       │   ├── renewable_energy_share.csv
│       │   ├── energy_productivity.csv
│       │   ├── greenhouse_gas_emissions.csv
│       │   ├── life_expectancy.csv  # Target variable (all ages/sex)
│       │   └── life_expectancy_at_birth.csv # Target (simplified)
│       ├── city/                     # City-level features
│       │   └── environmental_cities.csv
│       └── metadata/                 # Quality reports & metadata
├── src/
│   └── data_processing/
│       ├── utils.py                  # Shared processing utilities
│       └── __init__.py
├── scripts/
│   ├── process_air_quality.py        # Process air quality data
│   ├── process_renewable_share.py    # Process renewable energy data
│   ├── process_energy_productivity.py # Process energy productivity
│   ├── process_greenhouse_gas.py     # Process GHG emissions
│   ├── process_life_expectancy.py    # Process life expectancy (target)
│   ├── process_env_cities.py         # Process city environmental data
│   └── process_all.py                # Master script - runs all processing
├── notebooks/                        # Jupyter notebooks for analysis
├── requirements.txt
└── README.md

Setup

1. Create and Activate Virtual Environment

python -m venv venv        # first time only
source venv/bin/activate

2. Install Dependencies

pip install -r requirements.txt

Data Processing

Quick Start: Process All Datasets

To process all raw data files at once:

# Process all datasets (including sparse city data)
python scripts/process_all.py

# Or skip city-level data (recommended if focusing on country-level prediction)
python scripts/process_all.py --skip-cities

This will:

  1. Read raw Eurostat Excel/CSV files
  2. Extract metadata and clean data
  3. Transform from wide format to long format
  4. Standardize country names and years
  5. Generate quality reports
  6. Save processed CSVs to datasets/processed/
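Step 3 above (wide-to-long transformation) is the core reshaping move for Eurostat extracts, which typically store one column per year. A minimal sketch with pandas, using a hypothetical two-row extract (the real scripts in scripts/ handle more metadata):

```python
import pandas as pd

# Hypothetical wide-format Eurostat extract: one column per year
wide = pd.DataFrame({
    "country": ["Belgium", "Denmark"],
    "2018": [18.9, 10.1],
    "2019": [18.3, 9.8],
})

# Melt year columns into (country, year, value) rows
long = wide.melt(id_vars="country", var_name="year", value_name="value")
long["year"] = long["year"].astype(int)

# Attach the metadata columns used by the processed CSVs
long["metric_name"] = "air_quality"
long["unit"] = "µg/m³"
```

Each year column becomes a row, so two countries over two years yield four long-format rows ready to write to datasets/processed/.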

Process Individual Datasets

You can also process datasets individually:

# Air quality (PM10 & PM2.5)
python scripts/process_air_quality.py

# Renewable energy share
python scripts/process_renewable_share.py

# Energy productivity
python scripts/process_energy_productivity.py

# Greenhouse gas emissions
python scripts/process_greenhouse_gas.py

# Life expectancy (target variable)
python scripts/process_life_expectancy.py --birth-only

# Environmental cities (sparse data)
python scripts/process_env_cities.py

Processed Data Format

All country-level datasets follow this standardized format:

country,year,value,metric_name,unit,source_file
Belgium,2020,18.3,air_quality,µg/m³,Eurostat-sdg_11_50 time series.csv
Belgium,2020,13.0,renewable_energy_share,percent,share_of_Renewabled.xlsx

City-level data includes additional columns:

city,country,year,value,area_type,metric_name,unit,source_file
Brussels,Belgium,2020,42.3,city,urban_environment,unknown,env_cities_And_greater_cities.xlsx
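Because every country-level file shares this long format, different metrics can be combined by pivoting metric_name into columns. A sketch using the two sample rows above (io.StringIO stands in for the real CSV files):

```python
import io
import pandas as pd

# The two country-level sample rows above, in the standardized format
csv = """country,year,value,metric_name,unit,source_file
Belgium,2020,18.3,air_quality,µg/m³,Eurostat-sdg_11_50 time series.csv
Belgium,2020,13.0,renewable_energy_share,percent,share_of_Renewabled.xlsx
"""
long = pd.read_csv(io.StringIO(csv))

# One column per metric, indexed by country-year
wide = long.pivot_table(index=["country", "year"],
                        columns="metric_name", values="value").reset_index()
wide.columns.name = None
```

The result has one row per country-year with air_quality and renewable_energy_share as separate columns.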

Data Dictionary

Country-Level Features

| File | Metric | Description | Unit | Years | Coverage |
|------|--------|-------------|------|-------|----------|
| air_quality.csv | PM10, PM2.5 | Particulate matter concentration | µg/m³ | 2000-2019 | 64 countries |
| renewable_energy_share.csv | Renewable energy | % of renewable energy in gross final consumption | % | 2004-2024 | 43 countries |
| energy_productivity.csv | Energy efficiency | Economic output per unit of energy | euro/kgoe | 2000-2024 | 43 countries |
| greenhouse_gas_emissions.csv | GHG emissions | Absolute & indexed emissions | various | 1990-2023 | 31 countries |
| life_expectancy_at_birth.csv | Life expectancy | Life expectancy at birth (TARGET) | years | 1960-2024 | 56 countries |

City-Level Features

| File | Description | Cities | Sparsity |
|------|-------------|--------|----------|
| environmental_cities.csv | Urban environmental indicators | 876 | 85.7% missing |

Note: City data is very sparse; it may be better aggregated to the country level or excluded from initial models.
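One way to make the sparse city data usable is to aggregate it to country level before merging. A minimal sketch, with hypothetical rows standing in for environmental_cities.csv (the mean-plus-count aggregation is one option, not the project's prescribed method):

```python
import pandas as pd

# Hypothetical slice of environmental_cities.csv (column subset assumed)
cities = pd.DataFrame({
    "city": ["Brussels", "Antwerp", "Copenhagen"],
    "country": ["Belgium", "Belgium", "Denmark"],
    "year": [2020, 2020, 2020],
    "value": [42.3, 38.1, 29.5],
})

# Aggregate to one row per country-year, tracking how many
# cities back each mean (useful for weighting or filtering later)
country_level = (
    cities.groupby(["country", "year"], as_index=False)
          .agg(value=("value", "mean"), n_cities=("value", "count"))
)
```

Keeping n_cities lets you drop country-years backed by too few cities instead of trusting a mean of one observation.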

Quality Reports

Each processed dataset includes a quality report in datasets/processed/metadata/:

{
  "file": "air_quality.csv",
  "processed_date": "2026-01-06",
  "rows": 1280,
  "countries": 64,
  "year_range": [2000, 2019],
  "completeness": {
    "overall": 0.80,
    "by_country": {...},
    "by_year": {...}
  },
  "missing_values": {
    "count": 256,
    "percentage": 20.0
  }
}
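The report fields can be computed directly from a processed DataFrame. A sketch of how the counts in the JSON above might be derived, using a tiny hypothetical frame (the actual report generation lives in the processing scripts):

```python
import json
import pandas as pd

# Hypothetical processed frame; the report fields mirror the JSON above
df = pd.DataFrame({
    "country": ["Belgium", "Belgium", "Denmark", "Denmark"],
    "year": [2018, 2019, 2018, 2019],
    "value": [18.9, 18.3, None, 9.8],
})

missing = int(df["value"].isna().sum())
report = {
    "file": "air_quality.csv",
    "rows": len(df),
    "countries": df["country"].nunique(),
    "year_range": [int(df["year"].min()), int(df["year"].max())],
    "completeness": {"overall": round(1 - missing / len(df), 2)},
    "missing_values": {"count": missing,
                       "percentage": round(100 * missing / len(df), 1)},
}
print(json.dumps(report, indent=2))
```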

Using the Data for ML

Example: Load and Merge Features with Target Variable

import pandas as pd

# Load processed country-level features
air_quality = pd.read_csv('datasets/processed/country/air_quality.csv')
renewable = pd.read_csv('datasets/processed/country/renewable_energy_share.csv')
energy_prod = pd.read_csv('datasets/processed/country/energy_productivity.csv')
ghg = pd.read_csv('datasets/processed/country/greenhouse_gas_emissions.csv')

# Load target variable (life expectancy at birth)
life_exp = pd.read_csv('datasets/processed/country/life_expectancy_at_birth.csv')

# For air quality, pivot to have PM10 and PM2.5 as separate columns
air_pivot = air_quality.pivot_table(
    index=['country', 'year'],
    columns='particle_type',
    values='value'
).reset_index()
air_pivot.columns.name = None

# Merge all features
features = air_pivot.merge(renewable[['country', 'year', 'value']],
                          on=['country', 'year'], how='outer',
                          suffixes=('', '_renewable'))
features = features.merge(energy_prod[['country', 'year', 'value']],
                         on=['country', 'year'], how='outer',
                         suffixes=('', '_energy'))

# For GHG, filter to get only one metric type (e.g., absolute values from Sheet 1)
ghg_filtered = ghg[(ghg['metric_type'] == 'absolute') & (ghg['sheet_name'] == 'Sheet 1')]
features = features.merge(ghg_filtered[['country', 'year', 'value']],
                         on=['country', 'year'], how='outer',
                         suffixes=('', '_ghg'))

# Rename value columns for clarity
features.rename(columns={
    'value': 'renewable_share',
    'value_energy': 'energy_productivity',
    'value_ghg': 'ghg_emissions'
}, inplace=True)

# Merge with target variable (life expectancy) - use inner join to keep only complete cases
data = features.merge(life_exp[['country', 'year', 'value']],
                      on=['country', 'year'], how='inner')
data.rename(columns={'value': 'life_expectancy'}, inplace=True)

# Remove rows with too many missing features
data_clean = data.dropna(thresh=len(data.columns) - 2)  # Allow max 2 missing features

print(f"Final dataset shape: {data_clean.shape}")
print(f"Countries: {data_clean['country'].nunique()}")
print(f"Year range: {data_clean['year'].min()} - {data_clean['year'].max()}")
print(f"\nFeature columns: {[col for col in data_clean.columns if col not in ['country', 'year']]}")

Example Output:

Final dataset shape: (450, 8)
Countries: 25
Year range: 2004 - 2019
Feature columns: ['PM10', 'PM2.5', 'renewable_share', 'energy_productivity', 'ghg_emissions', 'life_expectancy']

Run Jupyter

jupyter notebook

Or use JupyterLab:

jupyter lab

Next Steps for ML Model

  1. Data Collection & Processing - COMPLETE

    • All environmental features processed
    • Life expectancy target variable processed
  2. Data Integration - Use the example code above to merge datasets

  3. Exploratory Data Analysis

    • Visualize correlations between features and life expectancy
    • Check for temporal trends
    • Identify outliers and anomalies
  4. Feature Engineering

    • Create lagged features (previous years' values)
    • Calculate year-over-year changes
    • Compute rolling averages (3-year, 5-year windows)
    • Add interaction terms (e.g., PM2.5 × renewable_share)
  5. Handle Missing Values

    • Analyze missingness patterns
    • Choose imputation strategy (mean, median, KNN, forward-fill)
    • Or remove rows/countries with excessive missing data
  6. Build Prediction Models

    • Baseline: Linear regression
    • Tree-based: Random Forest, XGBoost, LightGBM
    • Neural Networks: Multi-layer perceptron
    • Time series: ARIMA, LSTM for temporal patterns
  7. Evaluate and Iterate

    • Train/test split (or time-based split for temporal validation)
    • Metrics: RMSE, MAE, R²
    • Cross-validation by country or year
    • Feature importance analysis
    • Model interpretation (SHAP values)
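The feature-engineering steps above (lagged values, year-over-year changes, rolling means) can be sketched with grouped pandas operations. Column names and sample values here are illustrative; the key point is grouping by country so lags never leak across borders:

```python
import pandas as pd

# Hypothetical merged panel for one country and one feature
data = pd.DataFrame({
    "country": ["Belgium"] * 4,
    "year": [2016, 2017, 2018, 2019],
    "renewable_share": [8.7, 9.1, 9.5, 9.9],
}).sort_values(["country", "year"])

g = data.groupby("country")["renewable_share"]
data["renewable_lag1"] = g.shift(1)                  # previous year's value
data["renewable_yoy"] = g.diff()                     # year-over-year change
data["renewable_roll3"] = g.transform(
    lambda s: s.rolling(3, min_periods=1).mean())    # 3-year rolling mean
```

Note that each country's first year necessarily gets NaN for the lag and diff features, which feeds into the missing-value strategy in step 5.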

Data Caveats

  • Temporal alignment: Different datasets cover different year ranges
  • Missing values: 6-20% missing in country-level data, ~86% in city data
  • Country coverage: Varies by dataset (31-64 countries)
  • Units: Ensure proper interpretation when combining features
  • City data: Very sparse; consider aggregating or excluding
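The temporal-alignment caveat can be checked mechanically: the usable window for a complete-case model is the intersection of all datasets' year ranges. A sketch using the ranges from the data dictionary above:

```python
# Year ranges from the data dictionary above
ranges = {
    "air_quality": (2000, 2019),
    "renewable_energy_share": (2004, 2024),
    "energy_productivity": (2000, 2024),
    "greenhouse_gas_emissions": (1990, 2023),
    "life_expectancy_at_birth": (1960, 2024),
}

# The usable window is the intersection of all ranges
start = max(lo for lo, _ in ranges.values())
end = min(hi for _, hi in ranges.values())
print(f"Overlapping window: {start}-{end}")
```

This yields 2004-2019, which matches the year range in the example ML output above; air quality (ending 2019) and renewable share (starting 2004) are the binding constraints.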

License

Research project - data sourced from Eurostat public datasets.
