Summary
Implement an EOF (Empirical Orthogonal Function) analysis and pattern extraction pipeline to identify the dominant modes of atmospheric variability in reanalysis data.
Parent Issue: #1
Depends On: Phase 1 (Data Infrastructure)
Objectives
- Implement area-weighted EOF/PCA analysis on spatial fields
- Extract principal component time series
- Create region masking for Atlantic, Pacific, Arctic sectors
- Cache features to Parquet format for fast loading
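The caching objective could be sketched roughly as follows. This is only an illustration: the helper names `pcs_to_frame` and `load_or_compute`, the `pc1`, `pc2`, ... column naming, and the use of pandas are assumptions, not part of the planned API. Writing Parquet from pandas also requires an engine such as pyarrow to be installed.

```python
from pathlib import Path

import numpy as np
import pandas as pd


def pcs_to_frame(pcs: np.ndarray, times) -> pd.DataFrame:
    """Wrap a (time, n_components) PC array in a DataFrame indexed by time."""
    columns = [f"pc{i + 1}" for i in range(pcs.shape[1])]
    return pd.DataFrame(pcs, index=pd.Index(times, name="time"), columns=columns)


def load_or_compute(path: Path, compute) -> pd.DataFrame:
    """Return cached features if the Parquet file exists, else compute and cache."""
    if path.exists():
        return pd.read_parquet(path)
    frame = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_parquet(path)  # needs a Parquet engine such as pyarrow
    return frame


# Synthetic stand-in for PCs from three components over four months.
times = pd.date_range("2000-01-01", periods=4, freq="MS")
frame = pcs_to_frame(np.arange(12.0).reshape(4, 3), times)
```

In a pipeline, `load_or_compute` would be called with a cache path and a zero-argument function that runs the EOF fit and returns the frame, so repeated runs skip the decomposition.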
System Context
src/features/
├── eof.py # EOF analysis core
├── patterns.py # Pattern extraction wrapper
└── regions.py # Geographic region masks
Files to Create/Modify
| File | Action | Description |
| src/features/eof.py | Create | EOFAnalyzer class using SVD |
| src/features/patterns.py | Create | PatternExtractor wrapper |
| src/features/regions.py | Create | RegionMask for spatial subsetting |
| tests/test_features.py | Create | Feature extraction tests |
Implementation Checklist
EOF Analysis
- EOFAnalyzer class
Pattern Extractor
- PatternExtractor wrapper for full pipeline
Region Masking
Testing
Code Snippets
Area-Weighted EOF
# src/features/eof.py
import numpy as np
import xarray as xr
from sklearn.decomposition import PCA


class EOFAnalyzer:
    def __init__(self, n_components: int = 10):
        self.n_components = n_components
        self.pca = PCA(n_components=n_components)
        self._lat_weights = None
        self._shape = None

    def fit(self, data: xr.DataArray) -> "EOFAnalyzer":
        """Fit EOF to data with (time, lat, lon) dimensions."""
        # Compute latitude weights
        lat = data.coords["lat"].values
        weights = np.sqrt(np.cos(np.deg2rad(lat)))
        self._lat_weights = weights
        # Apply weights and flatten spatial dimensions
        weighted = data * xr.DataArray(weights, dims=["lat"])
        self._shape = (len(data.lat), len(data.lon))
        # Reshape to (time, space)
        X = weighted.values.reshape(len(data.time), -1)
        # Handle NaN values
        X = np.nan_to_num(X, nan=0.0)
        # Fit PCA
        self.pca.fit(X)
        return self

    def transform(self, data: xr.DataArray) -> np.ndarray:
        """Project data onto EOF patterns."""
        weighted = data * xr.DataArray(self._lat_weights, dims=["lat"])
        X = weighted.values.reshape(len(data.time), -1)
        X = np.nan_to_num(X, nan=0.0)
        return self.pca.transform(X)

    @property
    def explained_variance_ratio(self) -> np.ndarray:
        return self.pca.explained_variance_ratio_

    def get_patterns(self) -> np.ndarray:
        """Return EOF patterns as (n_components, lat, lon)."""
        return self.pca.components_.reshape(
            self.n_components, *self._shape
        )
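The files table describes EOFAnalyzer as "using SVD", while the snippet above delegates to scikit-learn's PCA. For comparison, here is a minimal NumPy-only sketch of the same area-weighted decomposition done with an explicit SVD. The function name `weighted_eof` and the synthetic test field are illustrative assumptions, not part of the planned module.

```python
import numpy as np


def weighted_eof(field: np.ndarray, lat: np.ndarray, n_modes: int):
    """EOFs of a (time, lat, lon) anomaly field via SVD.

    Returns (patterns, pcs, variance_fraction).
    """
    n_time, n_lat, n_lon = field.shape
    # sqrt(cos(lat)) on the data gives cos(lat) weighting of the covariance.
    weights = np.sqrt(np.cos(np.deg2rad(lat)))[None, :, None]
    X = (field * weights).reshape(n_time, -1)
    X = X - X.mean(axis=0)  # remove the time mean at each grid point
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    variance_fraction = s**2 / np.sum(s**2)
    pcs = U[:, :n_modes] * s[:n_modes]
    patterns = Vt[:n_modes].reshape(n_modes, n_lat, n_lon)
    return patterns, pcs, variance_fraction[:n_modes]


# Synthetic data: one planted standing wave plus weak noise.
rng = np.random.default_rng(0)
lat = np.linspace(-60.0, 60.0, 13)
lon = np.linspace(0.0, 350.0, 36)
t = np.arange(120)
wave = np.cos(np.deg2rad(lat))[:, None] * np.sin(np.deg2rad(lon))[None, :]
field = np.sin(2 * np.pi * t / 12.0)[:, None, None] * wave
field = field + 0.01 * rng.standard_normal(field.shape)

patterns, pcs, frac = weighted_eof(field, lat, n_modes=3)
```

With a single planted mode, the leading EOF should capture nearly all of the variance, which makes this kind of synthetic field a convenient fixture for a unit test. Note that the returned patterns are in weighted space; dividing by the weights would recover physical-space patterns.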
Region Mask
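# src/features/regions.py
import xarray as xr

# Bounds are (min, max) in degrees. Note that the entries mix longitude
# conventions: "atlantic" and "arctic" use -180..180, while "pacific" and
# "nino34" use 0..360. The dataset's lon coordinate must match the
# convention of the region being selected.
REGIONS = {
    "atlantic": {"lon": (-80, 0), "lat": (20, 80)},
    "pacific": {"lon": (120, 240), "lat": (-20, 60)},
    "arctic": {"lon": (-180, 180), "lat": (60, 90)},
    "nino34": {"lon": (190, 240), "lat": (-5, 5)},  # ENSO region
}


def apply_mask(data: xr.DataArray, region: str) -> xr.DataArray:
    """Subset data to specified region."""
    bounds = REGIONS[region]
    # slice() selection assumes lat and lon are stored in ascending order;
    # a descending coordinate returns an empty selection.
    return data.sel(
        lon=slice(*bounds["lon"]),
        lat=slice(*bounds["lat"])
    )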
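Because the REGIONS table mixes -180..180 and 0..360 longitudes, and slice-based selection depends on coordinate ordering, one option is a boolean mask that normalizes longitudes first. The sketch below is a NumPy-only illustration under that assumption; `to_0_360` and `region_mask` are hypothetical helper names, not part of the planned RegionMask API.

```python
import numpy as np


def to_0_360(lon):
    """Map longitudes from any convention onto [0, 360)."""
    return np.mod(lon, 360.0)


def region_mask(lat, lon, lat_bounds, lon_bounds):
    """Boolean (lat, lon) mask for a box; handles boxes crossing 0 deg E."""
    lat = np.asarray(lat, dtype=float)
    lon = to_0_360(np.asarray(lon, dtype=float))
    west, east = lon_bounds
    full_circle = (east - west) >= 360.0
    west, east = to_0_360(west), to_0_360(east)
    if full_circle:
        lon_ok = np.ones(lon.shape, dtype=bool)
    elif west <= east:
        lon_ok = (lon >= west) & (lon <= east)
    else:  # box wraps through the prime meridian
        lon_ok = (lon >= west) | (lon <= east)
    lat_ok = (lat >= min(lat_bounds)) & (lat <= max(lat_bounds))
    return lat_ok[:, None] & lon_ok[None, :]


# Example grid with descending latitude and 0..360 longitude.
lat = np.arange(90.0, -91.0, -10.0)   # 90, 80, ..., -90
lon = np.arange(0.0, 360.0, 10.0)     # 0, 10, ..., 350
atlantic = region_mask(lat, lon, (20, 80), (-80, 0))
arctic = region_mask(lat, lon, (60, 90), (-180, 180))
```

A mask like this works regardless of whether latitude is stored ascending or descending, and it could be applied to an xarray object with `where` if the rectangular subset from `sel` is not required.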
Verification
# Run EOF analysis
python -c "
from src.features.eof import EOFAnalyzer
import xarray as xr
data = xr.open_dataset('data/processed/slp_anomalies.nc')['msl']
eof = EOFAnalyzer(n_components=10).fit(data)
print('Explained variance:', eof.explained_variance_ratio[:3])
"
# Run tests
pytest tests/test_features.py -v
Technical Challenges
| Challenge | Mitigation |
| Memory for large grids | Use incremental PCA or randomized SVD |
| Land mask handling | Set land to NaN, use nan_to_num(0) |
| Proper weighting | sqrt(cos(lat)) weighting before PCA |
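As a rough illustration of the incremental PCA mitigation, the sketch below fits scikit-learn's `IncrementalPCA` in time batches on a synthetic matrix. The batch size, array sizes, and rank-3 test signal are arbitrary assumptions chosen for the example; in practice each batch would be a weighted, flattened chunk of the field loaded from disk.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
n_time, n_space, n_components = 240, 500, 5

# Synthetic (time, space) matrix standing in for the weighted, flattened field.
X = rng.standard_normal((n_time, 3)) @ rng.standard_normal((3, n_space))
X += 0.01 * rng.standard_normal(X.shape)

ipca = IncrementalPCA(n_components=n_components)
batch = 60  # each batch must contain at least n_components samples
for start in range(0, n_time, batch):
    # Only one batch needs to be in memory at a time.
    ipca.partial_fit(X[start:start + batch])

pcs = ipca.transform(X)
```

The result approximates a full PCA, so explained-variance values may differ slightly from those of the in-memory fit.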
Definition of Done
pytest tests/test_features.py