Summary
Link atmospheric patterns to extreme weather events using EM-DAT disaster records, enabling analysis of how teleconnection phases modulate extreme event probability.
Parent Issue: #1
Depends On: Phase 3 (Pattern Classification)
Objectives
- Integrate EM-DAT international disaster database
- Compute lagged correlations between patterns and disasters
- Estimate conditional probability of extremes given pattern phase
- Perform statistical significance testing
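The conditional-probability objective could be approached along these lines. This is a minimal sketch, not the planned implementation: the function name `conditional_event_probability`, the ±1 threshold on a standardized index, and the three-phase split are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def conditional_event_probability(pattern_index, event_flag, threshold=1.0):
    """Estimate P(event | phase) for positive, neutral, and negative phases.

    Assumptions (not taken from the issue): phases are defined by a
    +/- threshold on a standardized pattern index, and event_flag is
    truthy for every time step with at least one recorded event.
    """
    index = np.asarray(pattern_index, dtype=float)
    flag = np.asarray(event_flag, dtype=bool)

    # Label each time step with its phase
    phase = np.where(
        index >= threshold,
        "positive",
        np.where(index <= -threshold, "negative", "neutral"),
    )

    table = pd.DataFrame({"phase": phase, "event": flag})
    result = table.groupby("phase")["event"].agg(["mean", "count"])
    result = result.rename(columns={"mean": "p_event", "count": "n_steps"})

    # Ratio of the conditional probability to the unconditional base rate
    result["risk_ratio"] = result["p_event"] / flag.mean()
    return result
```

A risk ratio well above 1 for a phase suggests that phase is associated with more frequent events, but with few time steps per phase the estimate is noisy, so it should be read together with the significance tests.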
System Context
src/analysis/
├── extremes.py # Extreme event detection
├── correlation.py # Pattern-event correlation
└── significance.py # Bootstrap confidence intervals
Files to Create/Modify
| File | Action | Description |
|------|--------|-------------|
| src/data/emdat.py | Create | EM-DAT data loader and parser |
| src/analysis/extremes.py | Create | Extreme event detector |
| src/analysis/correlation.py | Create | Pattern-event correlation |
| tests/test_analysis.py | Create | Analysis tests |
Implementation Checklist
EM-DAT Integration
Extreme Event Detection
Correlation Analysis
Statistical Testing
Code Snippets
EM-DAT Loader
# src/data/emdat.py
from pathlib import Path
from typing import Optional

import pandas as pd


class EMDATLoader:
    """Load and parse EM-DAT disaster database."""

    EVENT_TYPES = ["Flood", "Storm", "Drought", "Extreme temperature"]

    def __init__(self, data_path: str):
        self.data_path = Path(data_path)

    def load(
        self,
        event_types: Optional[list] = None,
        start_year: int = 1950,
        end_year: int = 2024,
    ) -> pd.DataFrame:
        """Load EM-DAT data filtered by event type and year.

        Expected columns: Disaster Type, Country, Start Year,
        Start Month, Total Deaths, Total Affected, Latitude, Longitude
        """
        df = pd.read_excel(self.data_path)

        # Filter by year; copy so the column assignment below does not
        # write into a view of the original frame
        df = df[(df["Start Year"] >= start_year) & (df["Start Year"] <= end_year)].copy()

        # Filter by event type
        if event_types:
            df = df[df["Disaster Type"].isin(event_types)].copy()

        # Create datetime; events with no recorded start month default to January
        df["date"] = pd.to_datetime(
            df["Start Year"].astype(str) + "-" +
            df["Start Month"].fillna(1).astype(int).astype(str) + "-01"
        )
        return df

    def aggregate_monthly(self, df: pd.DataFrame) -> pd.DataFrame:
        """Aggregate events to monthly counts."""
        # "MS" (month start) matches the day-01 dates built in load() and
        # avoids the "M" alias, which is deprecated in recent pandas releases
        return df.groupby(pd.Grouper(key="date", freq="MS")).agg({
            "Disaster Type": "count",
            "Total Deaths": "sum",
            "Total Affected": "sum"
        }).rename(columns={"Disaster Type": "event_count"})
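Before the monthly counts can be passed to `lagged_correlation`, they need to share a time axis with the pattern index, because `pearsonr` requires equal-length inputs. The helper below is a sketch of one way to do that. `align_monthly` is a hypothetical name, and it assumes both inputs are pandas Series with datetime-like indexes at monthly resolution.

```python
import pandas as pd


def align_monthly(pattern: pd.Series, event_counts: pd.Series):
    """Return (pattern_values, event_values) on a shared monthly index.

    Hypothetical helper: assumes one value per month in each series.
    """
    # Normalize both indexes to month-start timestamps so that
    # month-end and mid-month labels still line up
    p = pattern.copy()
    p.index = pd.DatetimeIndex(p.index).to_period("M").to_timestamp()
    e = event_counts.copy()
    e.index = pd.DatetimeIndex(e.index).to_period("M").to_timestamp()

    # A month with no recorded disaster is treated as a zero count,
    # not a missing value
    e = e.reindex(p.index, fill_value=0)
    return p.to_numpy(dtype=float), e.to_numpy(dtype=float)
```

Treating absent months as zeros is a modelling choice: it assumes EM-DAT coverage is complete for the selected period and region, which may not hold for early decades.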
Lagged Correlation
# src/analysis/correlation.py
import numpy as np
from scipy.stats import pearsonr


def lagged_correlation(
    pattern_ts: np.ndarray,
    event_ts: np.ndarray,
    max_lag: int = 60
) -> dict:
    """Compute correlation at multiple lags.

    A positive lag means the pattern leads the events by that many steps;
    a negative lag means the events lead the pattern.

    Args:
        pattern_ts: Pattern index time series
        event_ts: Event count time series (same length and time step as pattern_ts)
        max_lag: Maximum lag to test, in time steps of the input series
            (days or months)

    Returns:
        Dict with lag values, correlations, p-values, and optimal lag
    """
    pattern_ts = np.asarray(pattern_ts, dtype=float)
    event_ts = np.asarray(event_ts, dtype=float)

    lags = list(range(-max_lag, max_lag + 1))
    correlations = []
    p_values = []

    for lag in lags:
        # Shift the two series against each other, keeping only the overlap
        if lag > 0:
            x, y = pattern_ts[:-lag], event_ts[lag:]
        elif lag < 0:
            x, y = pattern_ts[-lag:], event_ts[:lag]
        else:
            x, y = pattern_ts, event_ts

        # Skip lags that leave too few overlapping points
        if len(x) > 10:
            r, p = pearsonr(x, y)
            correlations.append(r)
            p_values.append(p)
        else:
            correlations.append(np.nan)
            p_values.append(np.nan)

    # Lag with the largest absolute correlation (raises ValueError if all are NaN)
    best_idx = int(np.nanargmax(np.abs(correlations)))

    return {
        "lags": lags,
        "correlations": correlations,
        "p_values": p_values,
        "optimal_lag": lags[best_idx],
        "max_correlation": correlations[best_idx]
    }
Bootstrap Significance
# src/analysis/significance.py
import numpy as np
from scipy.stats import pearsonr


def bootstrap_correlation_ci(
    x: np.ndarray,
    y: np.ndarray,
    n_bootstrap: int = 10000,
    ci: float = 0.95
) -> tuple:
    """Compute bootstrap confidence interval for correlation.

    Returns:
        (lower_bound, upper_bound, p_value)
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    observed_r, _ = pearsonr(x, y)

    # Bootstrap samples: resample (x, y) pairs with replacement
    boot_correlations = []
    for _ in range(n_bootstrap):
        idx = np.random.choice(n, size=n, replace=True)
        r, _ = pearsonr(x[idx], y[idx])
        boot_correlations.append(r)

    # Percentile confidence interval
    alpha = 1 - ci
    lower = np.percentile(boot_correlations, 100 * alpha / 2)
    upper = np.percentile(boot_correlations, 100 * (1 - alpha / 2))

    # Two-sided permutation p-value: shuffling y breaks any pairing with x,
    # so the shuffled correlations approximate the null distribution
    null_correlations = []
    for _ in range(n_bootstrap):
        y_shuffled = np.random.permutation(y)
        r, _ = pearsonr(x, y_shuffled)
        null_correlations.append(r)

    # Proportion of null correlations at least as extreme as the observed one
    p_value = np.mean(np.abs(null_correlations) >= np.abs(observed_r))

    return lower, upper, p_value
Verification
# Test EM-DAT loading
python -c "
from src.data.emdat import EMDATLoader
loader = EMDATLoader('data/external/emdat.xlsx')
df = loader.load(event_types=['Flood'], start_year=2000)
print(f'Loaded {len(df)} flood events')
"
# Run correlation analysis
python -m src.analysis.correlation --pattern NAO --event flood --output reports/
# Run tests
pytest tests/test_analysis.py -v
Technical Challenges
| Challenge | Mitigation |
|-----------|------------|
| EM-DAT location encoding | Use country centroids, fuzzy matching |
| Small sample sizes | Bootstrap CIs, focus on large N events |
| Multiple comparisons | FDR correction (Benjamini-Hochberg) |
| Lag selection | Domain knowledge (2-4 weeks typical) |
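For the multiple-comparisons row, the Benjamini-Hochberg step-up procedure can be written in a few lines of NumPy. The sketch below is illustrative only. The name `benjamini_hochberg` is hypothetical, and it assumes NaN p-values (for example from lags with too little overlap) have been dropped beforehand.

```python
import numpy as np


def benjamini_hochberg(p_values, alpha=0.05):
    """Return (reject, adjusted_p) under Benjamini-Hochberg FDR control.

    Assumes p_values contains no NaNs.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size

    # Scale each sorted p-value by m / rank
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)

    # Enforce monotonicity: each adjusted value is the minimum of the
    # scaled values at its own rank and all higher ranks
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted_sorted = np.clip(adjusted_sorted, 0.0, 1.0)

    # Map the adjusted values back to the original order
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted <= alpha, adjusted
```

Applied to the p-values across all tested lags, patterns, and event types, this keeps the expected share of false discoveries at or below `alpha`. Neighbouring lags are strongly correlated, so the independence or positive-dependence assumption behind the procedure should be kept in mind.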
Definition of Done
pytest tests/test_analysis.py