Diana PhD Project: Life Expectancy Prediction Using Environmental Data

This project uses environmental and energy-related indicators from Eurostat to predict life expectancy for European countries.

Project Overview

Goal: Predict life expectancy across European countries using environmental, energy, and air quality data as features.

Data Sources: Eurostat (European Statistical Office)

  • Air quality (PM10 & PM2.5 particulate matter)
  • Renewable energy share
  • Energy productivity
  • Greenhouse gas emissions
  • Urban environmental indicators (cities)

Project Structure

dianaPhd/
├── datasets/
│   ├── raw/                          # Original Eurostat data files
│   │   ├── Eurostat-sdg_11_50 time series.csv
│   │   ├── share_of_Renewabled.xlsx
│   │   ├── energy_productivity.xlsx
│   │   ├── greenhouse_gas.xlsx
│   │   └── env_cities_And_greater_cities.xlsx
│   └── processed/                    # Cleaned, curated CSV files
│       ├── country/                  # Country-level features
│       │   ├── air_quality.csv
│       │   ├── renewable_energy_share.csv
│       │   ├── energy_productivity.csv
│       │   ├── greenhouse_gas_emissions.csv
│       │   ├── life_expectancy.csv  # Target variable (all ages/sex)
│       │   └── life_expectancy_at_birth.csv # Target (simplified)
│       ├── city/                     # City-level features
│       │   └── environmental_cities.csv
│       └── metadata/                 # Quality reports & metadata
├── src/
│   └── data_processing/
│       ├── utils.py                  # Shared processing utilities
│       └── __init__.py
├── scripts/
│   ├── process_air_quality.py        # Process air quality data
│   ├── process_renewable_share.py    # Process renewable energy data
│   ├── process_energy_productivity.py # Process energy productivity
│   ├── process_greenhouse_gas.py     # Process GHG emissions
│   ├── process_life_expectancy.py    # Process life expectancy (target)
│   ├── process_env_cities.py         # Process city environmental data
│   └── process_all.py                # Master script - runs all processing
├── notebooks/                        # Jupyter notebooks for analysis
├── requirements.txt
└── README.md

Setup

1. Create and Activate Virtual Environment

python -m venv venv        # first time only
source venv/bin/activate

2. Install Dependencies

pip install -r requirements.txt

Data Processing

Quick Start: Process All Datasets

To process all raw data files at once:

# Process all datasets (including sparse city data)
python scripts/process_all.py

# Or skip city-level data (recommended if focusing on country-level prediction)
python scripts/process_all.py --skip-cities

This will:

  1. Read raw Eurostat Excel/CSV files
  2. Extract metadata and clean data
  3. Transform from wide format to long format
  4. Standardize country names and years
  5. Generate quality reports
  6. Save processed CSVs to datasets/processed/
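Step 3 above (wide-to-long transformation) is the core reshaping move for Eurostat extracts, which typically store one column per year. A minimal sketch with pandas, using a hypothetical two-row extract (the real scripts in scripts/ handle more metadata):

```python
import pandas as pd

# Hypothetical wide-format Eurostat extract: one column per year
wide = pd.DataFrame({
    "country": ["Belgium", "Denmark"],
    "2018": [18.9, 10.1],
    "2019": [18.3, 9.8],
})

# Melt year columns into (country, year, value) rows
long = wide.melt(id_vars="country", var_name="year", value_name="value")
long["year"] = long["year"].astype(int)

# Attach the metadata columns used by the processed CSVs
long["metric_name"] = "air_quality"
long["unit"] = "µg/m³"
```

Each year column becomes a row, so two countries over two years yield four long-format rows ready to write to datasets/processed/.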

Process Individual Datasets

You can also process datasets individually:

# Air quality (PM10 & PM2.5)
python scripts/process_air_quality.py

# Renewable energy share
python scripts/process_renewable_share.py

# Energy productivity
python scripts/process_energy_productivity.py

# Greenhouse gas emissions
python scripts/process_greenhouse_gas.py

# Life expectancy (target variable)
python scripts/process_life_expectancy.py --birth-only

# Environmental cities (sparse data)
python scripts/process_env_cities.py

Processed Data Format

All country-level datasets follow this standardized format:

country,year,value,metric_name,unit,source_file
Belgium,2020,18.3,air_quality,µg/m³,Eurostat-sdg_11_50 time series.csv
Belgium,2020,13.0,renewable_energy_share,percent,share_of_Renewabled.xlsx

City-level data includes additional columns:

city,country,year,value,area_type,metric_name,unit,source_file
Brussels,Belgium,2020,42.3,city,urban_environment,unknown,env_cities_And_greater_cities.xlsx
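Because every country-level file shares this long format, different metrics can be combined by pivoting metric_name into columns. A sketch using the two sample rows above (io.StringIO stands in for the real CSV files):

```python
import io
import pandas as pd

# The two country-level sample rows above, in the standardized format
csv = """country,year,value,metric_name,unit,source_file
Belgium,2020,18.3,air_quality,µg/m³,Eurostat-sdg_11_50 time series.csv
Belgium,2020,13.0,renewable_energy_share,percent,share_of_Renewabled.xlsx
"""
long = pd.read_csv(io.StringIO(csv))

# One column per metric, indexed by country-year
wide = long.pivot_table(index=["country", "year"],
                        columns="metric_name", values="value").reset_index()
wide.columns.name = None
```

The result has one row per country-year with air_quality and renewable_energy_share as separate columns.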

Data Dictionary

Country-Level Features

| File | Metric | Description | Unit | Years | Coverage |
|------|--------|-------------|------|-------|----------|
| air_quality.csv | PM10, PM2.5 | Particulate matter concentration | µg/m³ | 2000-2019 | 64 countries |
| renewable_energy_share.csv | Renewable energy | % of renewable energy in gross final consumption | % | 2004-2024 | 43 countries |
| energy_productivity.csv | Energy efficiency | Economic output per unit of energy | euro/kgoe | 2000-2024 | 43 countries |
| greenhouse_gas_emissions.csv | GHG emissions | Absolute & indexed emissions | various | 1990-2023 | 31 countries |
| life_expectancy_at_birth.csv | Life expectancy | Life expectancy at birth (TARGET) | years | 1960-2024 | 56 countries |

City-Level Features

| File | Description | Cities | Sparsity |
|------|-------------|--------|----------|
| environmental_cities.csv | Urban environmental indicators | 876 | 85.7% missing |

Note: City data is very sparse; it may be better aggregated to the country level or excluded from initial models.
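One way to make the sparse city data usable is to aggregate it to country level before merging. A minimal sketch, with hypothetical rows standing in for environmental_cities.csv (the mean-plus-count aggregation is one option, not the project's prescribed method):

```python
import pandas as pd

# Hypothetical slice of environmental_cities.csv (column subset assumed)
cities = pd.DataFrame({
    "city": ["Brussels", "Antwerp", "Copenhagen"],
    "country": ["Belgium", "Belgium", "Denmark"],
    "year": [2020, 2020, 2020],
    "value": [42.3, 38.1, 29.5],
})

# Aggregate to one row per country-year, tracking how many
# cities back each mean (useful for weighting or filtering later)
country_level = (
    cities.groupby(["country", "year"], as_index=False)
          .agg(value=("value", "mean"), n_cities=("value", "count"))
)
```

Keeping n_cities lets you drop country-years backed by too few cities instead of trusting a mean of one observation.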

Quality Reports

Each processed dataset includes a quality report in datasets/processed/metadata/:

{
  "file": "air_quality.csv",
  "processed_date": "2026-01-06",
  "rows": 1280,
  "countries": 64,
  "year_range": [2000, 2019],
  "completeness": {
    "overall": 0.80,
    "by_country": {...},
    "by_year": {...}
  },
  "missing_values": {
    "count": 256,
    "percentage": 20.0
  }
}
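The report fields can be computed directly from a processed DataFrame. A sketch of how the counts in the JSON above might be derived, using a tiny hypothetical frame (the actual report generation lives in the processing scripts):

```python
import json
import pandas as pd

# Hypothetical processed frame; the report fields mirror the JSON above
df = pd.DataFrame({
    "country": ["Belgium", "Belgium", "Denmark", "Denmark"],
    "year": [2018, 2019, 2018, 2019],
    "value": [18.9, 18.3, None, 9.8],
})

missing = int(df["value"].isna().sum())
report = {
    "file": "air_quality.csv",
    "rows": len(df),
    "countries": df["country"].nunique(),
    "year_range": [int(df["year"].min()), int(df["year"].max())],
    "completeness": {"overall": round(1 - missing / len(df), 2)},
    "missing_values": {"count": missing,
                       "percentage": round(100 * missing / len(df), 1)},
}
print(json.dumps(report, indent=2))
```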

Using the Data for ML

Example: Load and Merge Features with Target Variable

import pandas as pd

# Load processed country-level features
air_quality = pd.read_csv('datasets/processed/country/air_quality.csv')
renewable = pd.read_csv('datasets/processed/country/renewable_energy_share.csv')
energy_prod = pd.read_csv('datasets/processed/country/energy_productivity.csv')
ghg = pd.read_csv('datasets/processed/country/greenhouse_gas_emissions.csv')

# Load target variable (life expectancy at birth)
life_exp = pd.read_csv('datasets/processed/country/life_expectancy_at_birth.csv')

# For air quality, pivot to have PM10 and PM2.5 as separate columns
air_pivot = air_quality.pivot_table(
    index=['country', 'year'],
    columns='particle_type',
    values='value'
).reset_index()
air_pivot.columns.name = None

# Merge all features
features = air_pivot.merge(renewable[['country', 'year', 'value']],
                          on=['country', 'year'], how='outer',
                          suffixes=('', '_renewable'))
features = features.merge(energy_prod[['country', 'year', 'value']],
                         on=['country', 'year'], how='outer',
                         suffixes=('', '_energy'))

# For GHG, filter to get only one metric type (e.g., absolute values from Sheet 1)
ghg_filtered = ghg[(ghg['metric_type'] == 'absolute') & (ghg['sheet_name'] == 'Sheet 1')]
features = features.merge(ghg_filtered[['country', 'year', 'value']],
                         on=['country', 'year'], how='outer',
                         suffixes=('', '_ghg'))

# Rename value columns for clarity
features.rename(columns={
    'value': 'renewable_share',
    'value_energy': 'energy_productivity',
    'value_ghg': 'ghg_emissions'
}, inplace=True)

# Merge with target variable (life expectancy) - use inner join to keep only complete cases
data = features.merge(life_exp[['country', 'year', 'value']],
                      on=['country', 'year'], how='inner')
data.rename(columns={'value': 'life_expectancy'}, inplace=True)

# Remove rows with too many missing features
data_clean = data.dropna(thresh=len(data.columns) - 2)  # Allow max 2 missing features

print(f"Final dataset shape: {data_clean.shape}")
print(f"Countries: {data_clean['country'].nunique()}")
print(f"Year range: {data_clean['year'].min()} - {data_clean['year'].max()}")
print(f"\nFeature columns: {[col for col in data_clean.columns if col not in ['country', 'year']]}")

Example Output:

Final dataset shape: (450, 8)
Countries: 25
Year range: 2004 - 2019
Feature columns: ['PM10', 'PM2.5', 'renewable_share', 'energy_productivity', 'ghg_emissions', 'life_expectancy']

Run Jupyter

jupyter notebook

Or use JupyterLab:

jupyter lab

Next Steps for ML Model

  1. Data Collection & Processing - COMPLETE

    • All environmental features processed
    • Life expectancy target variable processed
  2. Data Integration - Use the example code above to merge datasets

  3. Exploratory Data Analysis

    • Visualize correlations between features and life expectancy
    • Check for temporal trends
    • Identify outliers and anomalies
  4. Feature Engineering

    • Create lagged features (previous years' values)
    • Calculate year-over-year changes
    • Compute rolling averages (3-year, 5-year windows)
    • Add interaction terms (e.g., PM2.5 × renewable_share)
  5. Handle Missing Values

    • Analyze missingness patterns
    • Choose imputation strategy (mean, median, KNN, forward-fill)
    • Or remove rows/countries with excessive missing data
  6. Build Prediction Models

    • Baseline: Linear regression
    • Tree-based: Random Forest, XGBoost, LightGBM
    • Neural Networks: Multi-layer perceptron
    • Time series: ARIMA, LSTM for temporal patterns
  7. Evaluate and Iterate

    • Train/test split (or time-based split for temporal validation)
    • Metrics: RMSE, MAE, R²
    • Cross-validation by country or year
    • Feature importance analysis
    • Model interpretation (SHAP values)
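The feature-engineering steps above (lagged values, year-over-year changes, rolling means) can be sketched with grouped pandas operations. Column names and sample values here are illustrative; the key point is grouping by country so lags never leak across borders:

```python
import pandas as pd

# Hypothetical merged panel for one country and one feature
data = pd.DataFrame({
    "country": ["Belgium"] * 4,
    "year": [2016, 2017, 2018, 2019],
    "renewable_share": [8.7, 9.1, 9.5, 9.9],
}).sort_values(["country", "year"])

g = data.groupby("country")["renewable_share"]
data["renewable_lag1"] = g.shift(1)                  # previous year's value
data["renewable_yoy"] = g.diff()                     # year-over-year change
data["renewable_roll3"] = g.transform(
    lambda s: s.rolling(3, min_periods=1).mean())    # 3-year rolling mean
```

Note that each country's first year necessarily gets NaN for the lag and diff features, which feeds into the missing-value strategy in step 5.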

Data Caveats

  • Temporal alignment: Different datasets cover different year ranges
  • Missing values: 6-20% missing in country-level data, ~86% in city data
  • Country coverage: Varies by dataset (31-64 countries)
  • Units: Ensure proper interpretation when combining features
  • City data: Very sparse; consider aggregating or excluding
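The temporal-alignment caveat can be checked mechanically: the usable window for a complete-case model is the intersection of all datasets' year ranges. A sketch using the ranges from the data dictionary above:

```python
# Year ranges from the data dictionary above
ranges = {
    "air_quality": (2000, 2019),
    "renewable_energy_share": (2004, 2024),
    "energy_productivity": (2000, 2024),
    "greenhouse_gas_emissions": (1990, 2023),
    "life_expectancy_at_birth": (1960, 2024),
}

# The usable window is the intersection of all ranges
start = max(lo for lo, _ in ranges.values())
end = min(hi for _, hi in ranges.values())
print(f"Overlapping window: {start}-{end}")
```

This yields 2004-2019, which matches the year range in the example ML output above; air quality (ending 2019) and renewable share (starting 2004) are the binding constraints.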

License

Research project - data sourced from Eurostat public datasets.
