[Phase 4] Extreme Event Analysis - EM-DAT Correlation #5

@Sakeeb91

Summary

Link atmospheric patterns to extreme weather events using EM-DAT disaster records, enabling analysis of how teleconnection phases modulate extreme event probability.

Parent Issue: #1
Depends On: Phase 3 (Pattern Classification)

Objectives

  • Integrate EM-DAT international disaster database
  • Compute lagged correlations between patterns and disasters
  • Estimate conditional probability of extremes given pattern phase
  • Perform statistical significance testing
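
The conditional-probability objective could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function name `conditional_event_probability`, the three-phase split, and the `threshold` of 1.0 (for a standardized index) are all assumptions.

```python
import numpy as np


def conditional_event_probability(pattern_index, event_occurred, threshold=1.0):
    """Estimate P(event | phase) for positive, neutral, and negative phases.

    `threshold` is a hypothetical cutoff on a standardized pattern index.
    """
    pattern_index = np.asarray(pattern_index, dtype=float)
    event_occurred = np.asarray(event_occurred, dtype=bool)
    if pattern_index.shape != event_occurred.shape:
        raise ValueError("inputs must have the same length")

    phases = {
        "positive": pattern_index >= threshold,
        "neutral": np.abs(pattern_index) < threshold,
        "negative": pattern_index <= -threshold,
    }
    # Unconditional event frequency, for comparison with each phase
    result = {"climatology": float(event_occurred.mean())}
    for name, mask in phases.items():
        # NaN when a phase never occurs, so an empty phase is not read as 0
        result[name] = float(event_occurred[mask].mean()) if mask.any() else float("nan")
    return result
```

Comparing each phase's value against `climatology` shows whether a phase raises or lowers event frequency; whether the difference is significant still has to be tested separately.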

System Context

src/analysis/
├── extremes.py        # Extreme event detection
├── correlation.py     # Pattern-event correlation
└── significance.py    # Bootstrap confidence intervals

Files to Create/Modify

File                           Action   Description
src/data/emdat.py              Create   EM-DAT data loader and parser
src/analysis/extremes.py       Create   Extreme event detector
src/analysis/correlation.py    Create   Pattern-event correlation
src/analysis/significance.py   Create   Bootstrap confidence intervals
tests/test_analysis.py         Create   Analysis tests

Implementation Checklist

EM-DAT Integration

  • Register for EM-DAT access (free for academic use)
  • Implement Excel/CSV parser for EM-DAT format
  • Map disaster locations to grid coordinates
  • Filter by event type (floods, storms, droughts)

Extreme Event Detection

  • Define event categories aligned with EM-DAT
  • Create time series of event counts per month
  • Aggregate events by region
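
A possible shape for the monthly, per-region counts is sketched below. It assumes the `date` column built by the loader and uses the `Country` column as the region key; any coarser regional grouping would need a country-to-region mapping that this issue does not define. The function name `monthly_counts_by_region` is hypothetical.

```python
import pandas as pd


def monthly_counts_by_region(events, region_col="Country"):
    """Count events per calendar month and region.

    Returns a wide DataFrame: one row per month, one column per region.
    """
    month = events["date"].dt.to_period("M")
    counts = (
        events.groupby([month, events[region_col]])
        .size()
        .unstack(fill_value=0)
    )
    # Reindex on the full month range so event-free months appear as zeros
    full_range = pd.period_range(counts.index.min(), counts.index.max(), freq="M")
    return counts.reindex(full_range, fill_value=0)
```

Keeping the zero-event months matters for the correlation step, since dropping them would silently shorten and misalign the event series relative to the pattern index.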

Correlation Analysis

  • Compute lagged cross-correlation (0-60 day lags; this needs daily series, since monthly counts resolve lags only in whole months)
  • Identify optimal lag for each event-pattern pair
  • Compute Pearson and Spearman correlations

Statistical Testing

  • Implement bootstrap confidence intervals
  • Test significance against null distribution
  • Correct for multiple comparisons (FDR)
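
The FDR step is not covered by the snippets in this issue. A minimal sketch of the Benjamini-Hochberg step-up procedure is given below; the function name `benjamini_hochberg` and the default `alpha` are assumptions, and the input is assumed to contain no NaN p-values (drop them first).

```python
import numpy as np


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level alpha.

    Assumes p_values contains no NaNs.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]

    # Find the largest rank k with p_(k) <= (k / m) * alpha
    below = ranked <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        # Step-up rule: reject every hypothesis ranked at or below k
        reject[order[: k + 1]] = True
    return reject
```

The mask is returned in the original order of `p_values`, so it can be applied directly to a list of pattern-event pairs.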

Code Snippets

EM-DAT Loader

# src/data/emdat.py
from pathlib import Path
from typing import Optional

import pandas as pd


class EMDATLoader:
    """Load and parse the EM-DAT disaster database."""

    EVENT_TYPES = ["Flood", "Storm", "Drought", "Extreme temperature"]

    def __init__(self, data_path: str):
        self.data_path = Path(data_path)

    def load(
        self,
        event_types: Optional[list] = None,
        start_year: int = 1950,
        end_year: int = 2024,
    ) -> pd.DataFrame:
        """Load EM-DAT data filtered by event type and year.

        Expected columns: Disaster Type, Country, Start Year,
        Start Month, Total Deaths, Total Affected, Latitude, Longitude
        """
        # The export may be Excel or CSV; choose the reader by extension
        if self.data_path.suffix.lower() == ".csv":
            df = pd.read_csv(self.data_path)
        else:
            df = pd.read_excel(self.data_path)

        # Filter by year (inclusive on both ends)
        mask = df["Start Year"].between(start_year, end_year)

        # Filter by event type
        if event_types:
            mask &= df["Disaster Type"].isin(event_types)

        # Copy so that adding a column does not write into a slice of df
        df = df[mask].copy()

        # Build a first-of-month timestamp; records with no start month
        # are assigned to January
        df["date"] = pd.to_datetime({
            "year": df["Start Year"].astype(int),
            "month": df["Start Month"].fillna(1).astype(int),
            "day": 1,
        })

        return df

    def aggregate_monthly(self, df: pd.DataFrame) -> pd.DataFrame:
        """Aggregate events to monthly counts."""
        # "MS" labels each bin by month start, matching the dates from load()
        return df.groupby(pd.Grouper(key="date", freq="MS")).agg({
            "Disaster Type": "count",
            "Total Deaths": "sum",
            "Total Affected": "sum"
        }).rename(columns={"Disaster Type": "event_count"})

Lagged Correlation

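# src/analysis/correlation.py
import numpy as np
from scipy.stats import pearsonr, spearmanr


def lagged_correlation(
    pattern_ts: np.ndarray,
    event_ts: np.ndarray,
    max_lag: int = 60,
    method: str = "pearson",
) -> dict:
    """Compute correlation at multiple lags.

    A positive lag pairs pattern_ts[t] with event_ts[t + lag], i.e. the
    pattern leads the events.

    Args:
        pattern_ts: Pattern index time series
        event_ts: Event count time series (same length and sampling)
        max_lag: Maximum lag to test, in time steps of the input series
            (days for daily data, months for monthly data)
        method: "pearson" or "spearman"

    Returns:
        Dict with lag values, correlations, p-values, and optimal lag
    """
    pattern_ts = np.asarray(pattern_ts, dtype=float)
    event_ts = np.asarray(event_ts, dtype=float)
    if len(pattern_ts) != len(event_ts):
        raise ValueError("pattern_ts and event_ts must have the same length")
    corr_func = {"pearson": pearsonr, "spearman": spearmanr}[method]

    lags = list(range(-max_lag, max_lag + 1))
    correlations = []
    p_values = []

    for lag in lags:
        if lag > 0:
            x, y = pattern_ts[:-lag], event_ts[lag:]
        elif lag < 0:
            x, y = pattern_ts[-lag:], event_ts[:lag]
        else:
            x, y = pattern_ts, event_ts

        # Skip lags that leave too few overlapping samples
        if len(x) > 10:
            r, p = corr_func(x, y)
            correlations.append(r)
            p_values.append(p)
        else:
            correlations.append(np.nan)
            p_values.append(np.nan)

    abs_corr = np.abs(correlations)
    if np.all(np.isnan(abs_corr)):
        raise ValueError("no lag has more than 10 overlapping samples")
    best_idx = int(np.nanargmax(abs_corr))

    # The p-value at the optimal lag is selected after the fact and is
    # optimistic; see significance.py for testing
    return {
        "lags": lags,
        "correlations": correlations,
        "p_values": p_values,
        "optimal_lag": lags[best_idx],
        "max_correlation": correlations[best_idx]
    }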

Bootstrap Significance

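# src/analysis/significance.py
import numpy as np
from scipy.stats import pearsonr


def bootstrap_correlation_ci(
    x: np.ndarray,
    y: np.ndarray,
    n_bootstrap: int = 10000,
    ci: float = 0.95
) -> tuple:
    """Compute a bootstrap confidence interval and permutation p-value.

    Resampling treats (x, y) pairs as independent, so intervals can be
    too narrow for strongly autocorrelated series.

    Returns:
        (lower_bound, upper_bound, p_value)
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    observed_r, _ = pearsonr(x, y)

    # Bootstrap: resample (x, y) pairs with replacement
    boot_correlations = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        idx = np.random.choice(n, size=n, replace=True)
        r, _ = pearsonr(x[idx], y[idx])
        boot_correlations[b] = r

    # Percentile confidence interval; nanpercentile ignores resamples
    # where one series was constant and the correlation is undefined
    alpha = 1 - ci
    lower = np.nanpercentile(boot_correlations, 100 * alpha / 2)
    upper = np.nanpercentile(boot_correlations, 100 * (1 - alpha / 2))

    # Permutation test: shuffling y breaks any x-y association, which
    # gives the null distribution of the correlation
    null_correlations = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        r, _ = pearsonr(x, np.random.permutation(y))
        null_correlations[b] = r

    # Two-sided p-value; the +1 terms keep it from being exactly zero
    n_extreme = np.sum(np.abs(null_correlations) >= np.abs(observed_r))
    p_value = (n_extreme + 1) / (n_bootstrap + 1)

    return lower, upper, p_value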

Verification

# Test EM-DAT loading
python -c "
from src.data.emdat import EMDATLoader
loader = EMDATLoader('data/external/emdat.xlsx')
df = loader.load(event_types=['Flood'], start_year=2000)
print(f'Loaded {len(df)} flood events')
"

# Run correlation analysis
python -m src.analysis.correlation --pattern NAO --event flood --output reports/

# Run tests
pytest tests/test_analysis.py -v

Technical Challenges

Challenge                  Mitigation
EM-DAT location encoding   Use country centroids, fuzzy matching
Small sample sizes         Bootstrap CIs, focus on large N events
Multiple comparisons       FDR correction (Benjamini-Hochberg)
Lag selection              Domain knowledge (2-4 weeks typical)
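
The centroid-plus-fuzzy-matching mitigation could be sketched with the standard library's `difflib`. The function name `match_country`, the `cutoff` of 0.8, and the centroid values in the example are illustrative assumptions; a real run would load centroids from an authoritative dataset.

```python
from difflib import get_close_matches


def match_country(name, centroids, cutoff=0.8):
    """Fuzzy-match a country name against a centroid table.

    Returns (canonical_name, (lat, lon)), or None if nothing is close enough.
    """
    # Compare case-insensitively but return the table's own spelling
    lookup = {key.lower(): key for key in centroids}
    hits = get_close_matches(name.strip().lower(), list(lookup), n=1, cutoff=cutoff)
    if not hits:
        return None
    canonical = lookup[hits[0]]
    return canonical, centroids[canonical]


# Illustrative, rounded centroid values (not authoritative)
EXAMPLE_CENTROIDS = {"France": (46.0, 2.0), "Germany": (51.0, 10.0)}
print(match_country("Germny", EXAMPLE_CENTROIDS))
print(match_country("Atlantis", EXAMPLE_CENTROIDS))
```

Returning `None` for unmatched names, rather than the best available guess, makes it easier to review the records that need manual geocoding.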

Definition of Done

  • EM-DAT loads and filters by event type and region
  • Lagged correlation computed for NAO-flood pairs
  • At least one statistically significant correlation found
  • Confidence intervals computed via bootstrap
  • All tests pass with pytest tests/test_analysis.py

Metadata

Assignees: No one assigned
Labels: enhancement (New feature or request)