[Phase 4] Extreme Event Analysis - EM-DAT Correlation

## Summary

Link atmospheric patterns to extreme weather events using EM-DAT disaster records, enabling analysis of how teleconnection phases modulate extreme event probability.

**Parent Issue:** #1
**Depends On:** Phase 3 (Pattern Classification)

## Objectives

- Integrate EM-DAT international disaster database
- Compute lagged correlations between patterns and disasters
- Estimate conditional probability of extremes given pattern phase
- Perform statistical significance testing
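
The conditional-probability objective can be sketched as follows. This is a minimal illustration, not the planned implementation: the function name `conditional_event_probability`, the ±1 threshold used to define phases, and the assumption that the pattern index is standardized are all hypothetical choices made for the example.

```python
import numpy as np
import pandas as pd


def conditional_event_probability(
    pattern_index: pd.Series,
    event_count: pd.Series,
    threshold: float = 1.0,
) -> pd.DataFrame:
    """Estimate P(at least one event | pattern phase) for each phase.

    Phases come from a hypothetical +/- threshold rule applied to an
    (assumed standardized) pattern index.
    """
    # Align the two series on their shared time index
    df = pd.concat(
        {"pattern": pattern_index, "events": event_count}, axis=1, join="inner"
    ).dropna()

    # Assign each time step to a phase
    phase = np.select(
        [df["pattern"] >= threshold, df["pattern"] <= -threshold],
        ["positive", "negative"],
        default="neutral",
    )
    has_event = df["events"] > 0

    out = has_event.groupby(phase).agg(n_steps="size", p_event="mean")
    # Ratio relative to the unconditional (climatological) probability
    out["ratio_vs_climatology"] = out["p_event"] / has_event.mean()
    return out
```

A ratio above 1 for a phase suggests events are more frequent in that phase than on average; whether the difference is meaningful still depends on the significance testing listed above.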

## System Context

```
src/analysis/
├── extremes.py        # Extreme event detection
├── correlation.py     # Pattern-event correlation
└── significance.py    # Bootstrap confidence intervals
```

## Files to Create/Modify

| File | Action | Description |
|------|--------|-------------|
| `src/data/emdat.py` | Create | EM-DAT data loader and parser |
| `src/analysis/extremes.py` | Create | Extreme event detector |
| `src/analysis/correlation.py` | Create | Pattern-event correlation |
| `src/analysis/significance.py` | Create | Bootstrap confidence intervals |
| `tests/test_analysis.py` | Create | Analysis tests |

## Implementation Checklist

### EM-DAT Integration
- [ ] Register for EM-DAT access (free for academic use)
- [ ] Implement Excel/CSV parser for EM-DAT format
- [ ] Map disaster locations to grid coordinates
- [ ] Filter by event type (floods, storms, droughts)
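
One way to map disaster locations to grid coordinates is a nearest-cell lookup. The sketch below assumes a regular latitude/longitude grid with longitudes in [0, 360) and event coordinates in decimal degrees; the function name `nearest_grid_index` is hypothetical. Rows with missing coordinates should be dropped (or replaced with country centroids) before calling it, because NaN inputs would silently map to index 0.

```python
import numpy as np


def nearest_grid_index(
    lat: np.ndarray,
    lon: np.ndarray,
    grid_lat: np.ndarray,
    grid_lon: np.ndarray,
) -> tuple:
    """Return (lat_idx, lon_idx) of the nearest grid cell for each point.

    Assumes a regular lat/lon grid with longitudes in [0, 360).
    """
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float) % 360.0  # convert -180..180 to 0..360

    # Nearest latitude: plain absolute difference
    lat_idx = np.abs(lat[:, None] - grid_lat[None, :]).argmin(axis=1)

    # Nearest longitude: circular distance, so 359.9 maps to the cell at 0
    dlon = np.abs(lon[:, None] - grid_lon[None, :])
    dlon = np.minimum(dlon, 360.0 - dlon)
    lon_idx = dlon.argmin(axis=1)
    return lat_idx, lon_idx
```

For example, with a 2.5° grid, `nearest_grid_index([51.5], [-0.1], np.arange(-90, 91, 2.5), np.arange(0, 360, 2.5))` resolves to the cell centered at 52.5°N, 0°E.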

### Extreme Event Detection
- [ ] Define event categories aligned with EM-DAT
- [ ] Create time series of event counts per month
- [ ] Aggregate events by region
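
The monthly time series and regional aggregation could be combined in a single pivot, sketched below. It relies on the `date` column produced by the loader in this issue; the `Region` column name and the function name `monthly_counts_by_region` are assumptions for illustration and may need adjusting to the actual EM-DAT export.

```python
import pandas as pd


def monthly_counts_by_region(
    df: pd.DataFrame,
    region_col: str = "Region",  # assumed column name
) -> pd.DataFrame:
    """Return a months x regions table of event counts (0 where no events)."""
    counts = (
        df.groupby([df["date"].dt.to_period("M"), region_col])
        .size()
        .unstack(fill_value=0)
    )
    # Reindex to a continuous monthly range so event-free months appear as 0
    full_range = pd.period_range(counts.index.min(), counts.index.max(), freq="M")
    return counts.reindex(full_range, fill_value=0)
```

Filling event-free months with explicit zeros matters for the correlation step: without it, the event series would skip months and no longer line up with the pattern index.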

### Correlation Analysis
- [ ] Compute lagged cross-correlation (0-60 day lags)
- [ ] Identify optimal lag for each event-pattern pair
- [ ] Compute Pearson and Spearman correlations

### Statistical Testing
- [ ] Implement bootstrap confidence intervals
- [ ] Test significance against null distribution
- [ ] Correct for multiple comparisons (FDR)
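
The FDR correction can be illustrated with a small Benjamini-Hochberg step-up routine. This is a sketch that assumes independent (or positively dependent) tests and p-values without NaNs; the function name `benjamini_hochberg` is hypothetical.

```python
import numpy as np


def benjamini_hochberg(p_values, alpha: float = 0.05) -> np.ndarray:
    """Return a boolean mask of hypotheses rejected at FDR level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]

    # Find the largest k with p_(k) <= (k / m) * alpha; reject hypotheses 1..k
    below = ranked <= (np.arange(1, m + 1) / m) * alpha
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        reject[order[: k + 1]] = True
    return reject
```

In practice, `statsmodels.stats.multitest.multipletests` with `method="fdr_bh"` provides the same procedure along with adjusted p-values, so a hand-written version is mainly useful for avoiding the extra dependency.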

## Code Snippets

### EM-DAT Loader

```python
# src/data/emdat.py
from pathlib import Path
from typing import Optional

import pandas as pd


class EMDATLoader:
    """Load and parse EM-DAT disaster database."""

    EVENT_TYPES = ["Flood", "Storm", "Drought", "Extreme temperature"]

    def __init__(self, data_path: str):
        self.data_path = Path(data_path)

    def load(
        self,
        event_types: Optional[list] = None,
        start_year: int = 1950,
        end_year: int = 2024
    ) -> pd.DataFrame:
        """Load EM-DAT data filtered by event type and year.

        Expected columns: Disaster Type, Country, Start Year,
        Start Month, Total Deaths, Total Affected, Latitude, Longitude
        """
        # Support both Excel and CSV exports; pick the reader by extension
        if self.data_path.suffix.lower() == ".csv":
            df = pd.read_csv(self.data_path)
        else:
            df = pd.read_excel(self.data_path)

        # Filter by year (inclusive on both ends)
        df = df[df["Start Year"].between(start_year, end_year)]

        # Filter by event type
        if event_types:
            df = df[df["Disaster Type"].isin(event_types)]

        # Copy so the column assignment below does not write to a view
        df = df.copy()

        # Create datetime: first day of the start month.
        # Events with a missing month are assigned to January.
        df["date"] = pd.to_datetime(pd.DataFrame({
            "year": df["Start Year"].astype(int),
            "month": df["Start Month"].fillna(1).astype(int),
            "day": 1,
        }))

        return df

    def aggregate_monthly(self, df: pd.DataFrame) -> pd.DataFrame:
        """Aggregate events to monthly counts."""
        # "MS" (month start) matches the first-of-month dates created in load()
        # and avoids the "M" alias, which newer pandas versions deprecate.
        return df.groupby(pd.Grouper(key="date", freq="MS")).agg({
            "Disaster Type": "count",
            "Total Deaths": "sum",
            "Total Affected": "sum"
        }).rename(columns={"Disaster Type": "event_count"})
```

### Lagged Correlation

```python
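# src/analysis/correlation.py
import numpy as np
from scipy.stats import pearsonr, spearmanr

def lagged_correlation(
    pattern_ts: np.ndarray,
    event_ts: np.ndarray,
    max_lag: int = 60,
    method: str = "pearson"
) -> dict:
    """Compute correlation at multiple lags.

    Positive lags mean the pattern leads the events.

    Args:
        pattern_ts: Pattern index time series
        event_ts: Event count time series (same length and sampling as pattern_ts)
        max_lag: Maximum lag to test, in time steps of the input series
        method: "pearson" or "spearman"

    Returns:
        Dict with lag values, correlations, p-values, and optimal lag
    """
    if method not in ("pearson", "spearman"):
        raise ValueError("method must be 'pearson' or 'spearman'")
    corr_func = pearsonr if method == "pearson" else spearmanr

    pattern_ts = np.asarray(pattern_ts, dtype=float)
    event_ts = np.asarray(event_ts, dtype=float)
    if len(pattern_ts) != len(event_ts):
        raise ValueError("pattern_ts and event_ts must have the same length")

    lags = list(range(-max_lag, max_lag + 1))
    correlations = []
    p_values = []

    for lag in lags:
        # Positive lag pairs pattern[t] with event[t + lag] (pattern leads)
        if lag > 0:
            x, y = pattern_ts[:-lag], event_ts[lag:]
        elif lag < 0:
            x, y = pattern_ts[-lag:], event_ts[:lag]
        else:
            x, y = pattern_ts, event_ts

        # Skip lags that leave too few overlapping samples
        if len(x) > 10:
            r, p = corr_func(x, y)
            correlations.append(r)
            p_values.append(p)
        else:
            correlations.append(np.nan)
            p_values.append(np.nan)

    # nanargmax raises on all-NaN input, so handle that case explicitly
    if np.all(np.isnan(correlations)):
        best_lag, best_r = None, np.nan
    else:
        best_idx = int(np.nanargmax(np.abs(correlations)))
        best_lag, best_r = lags[best_idx], correlations[best_idx]

    return {
        "lags": lags,
        "correlations": correlations,
        "p_values": p_values,
        "optimal_lag": best_lag,
        "max_correlation": best_r
    }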
```

### Bootstrap Significance

```python
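# src/analysis/significance.py
from typing import Optional

import numpy as np
from scipy.stats import pearsonr

def bootstrap_correlation_ci(
    x: np.ndarray,
    y: np.ndarray,
    n_bootstrap: int = 10000,
    ci: float = 0.95,
    seed: Optional[int] = None
) -> tuple:
    """Compute bootstrap confidence interval and permutation p-value for correlation.

    Both resampling schemes treat samples as independent, so strongly
    autocorrelated series can yield intervals that are too narrow.

    Returns:
        (lower_bound, upper_bound, p_value)
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)  # seeded generator for reproducibility
    n = len(x)
    observed_r, _ = pearsonr(x, y)

    # Bootstrap samples: resample (x, y) pairs with replacement
    boot_correlations = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        idx = rng.integers(0, n, size=n)
        r, _ = pearsonr(x[idx], y[idx])
        boot_correlations[i] = r

    # Confidence interval (nanpercentile ignores degenerate constant resamples)
    alpha = 1 - ci
    lower = np.nanpercentile(boot_correlations, 100 * alpha / 2)
    upper = np.nanpercentile(boot_correlations, 100 * (1 - alpha / 2))

    # Null distribution: shuffle y to break any association with x
    null_correlations = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        r, _ = pearsonr(x, rng.permutation(y))
        null_correlations[i] = r

    # Two-sided permutation p-value; the +1 terms keep it strictly above zero
    n_extreme = np.sum(np.abs(null_correlations) >= np.abs(observed_r))
    p_value = (n_extreme + 1) / (n_bootstrap + 1)

    return lower, upper, p_value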
```

## Verification

```bash
# Test EM-DAT loading
python -c "
from src.data.emdat import EMDATLoader
loader = EMDATLoader('data/external/emdat.xlsx')
df = loader.load(event_types=['Flood'], start_year=2000)
print(f'Loaded {len(df)} flood events')
"

# Run correlation analysis
python -m src.analysis.correlation --pattern NAO --event flood --output reports/

# Run tests
pytest tests/test_analysis.py -v
```

## Technical Challenges

| Challenge | Mitigation |
|-----------|------------|
| EM-DAT location encoding | Use country centroids, fuzzy matching |
| Small sample sizes | Bootstrap CIs; focus on event types with large N |
| Multiple comparisons | FDR correction (Benjamini-Hochberg) |
| Lag selection | Domain knowledge (2-4 weeks typical) |

## Definition of Done

- [ ] EM-DAT loads and filters by event type and region
- [ ] Lagged correlation computed for NAO-flood pairs
- [ ] At least one statistically significant correlation found
- [ ] Confidence intervals computed via bootstrap
- [ ] All tests pass with `pytest tests/test_analysis.py`

