[Phase 2] Feature Engineering - EOF Analysis & Pattern Extraction #3

@Sakeeb91

Summary

Implement EOF (Empirical Orthogonal Function) analysis and pattern extraction pipeline to identify dominant modes of atmospheric variability from reanalysis data.

Parent Issue: #1
Depends On: Phase 1 (Data Infrastructure)

Objectives

  • Implement area-weighted EOF/PCA analysis on spatial fields
  • Extract principal component time series
  • Create region masking for Atlantic, Pacific, Arctic sectors
  • Cache features to Parquet format for fast loading

System Context

src/features/
├── eof.py            # EOF analysis core
├── patterns.py       # Pattern extraction wrapper
└── regions.py        # Geographic region masks

Files to Create/Modify

| File | Action | Description |
|------|--------|-------------|
| src/features/eof.py | Create | EOFAnalyzer class using SVD |
| src/features/patterns.py | Create | PatternExtractor wrapper |
| src/features/regions.py | Create | RegionMask for spatial subsetting |
| tests/test_features.py | Create | Feature extraction tests |

Implementation Checklist

EOF Analysis

  • Implement EOFAnalyzer class
  • Add latitude weighting (multiply the data by sqrt(cos(lat)) so the covariance is weighted by cos(lat), i.e. by grid-cell area)
  • Handle missing values (land mask)
  • Compute explained variance ratio
  • Extract spatial patterns and PC time series
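
One way to handle the land mask, sketched here as an alternative to zero-filling, is to drop grid points with missing values before the decomposition and scatter the patterns back onto the full grid afterwards. This is a minimal sketch under stated assumptions: `fit_eof_masked` is a hypothetical helper, it uses a plain NumPy SVD in place of scikit-learn's PCA, and the synthetic field stands in for real reanalysis data.

```python
import numpy as np

def fit_eof_masked(field: np.ndarray, lat: np.ndarray, n_modes: int = 3):
    """EOFs of a (time, lat, lon) anomaly field, ignoring points that are ever NaN."""
    n_time, n_lat, n_lon = field.shape
    # sqrt(cos(lat)) on the data weights the covariance by grid-cell area
    w = np.sqrt(np.clip(np.cos(np.deg2rad(lat)), 0.0, None))
    X = (field * w[None, :, None]).reshape(n_time, -1)

    valid = ~np.isnan(X).any(axis=0)      # keep points with a complete record
    Xv = X[:, valid]
    Xv = Xv - Xv.mean(axis=0)             # remove the time mean at each point

    U, s, Vt = np.linalg.svd(Xv, full_matrices=False)
    var_ratio = s**2 / np.sum(s**2)

    # Scatter the patterns back onto the full grid; masked points stay NaN
    patterns = np.full((n_modes, n_lat * n_lon), np.nan)
    patterns[:, valid] = Vt[:n_modes]
    pcs = U[:, :n_modes] * s[:n_modes]
    return patterns.reshape(n_modes, n_lat, n_lon), pcs, var_ratio[:n_modes]

# Synthetic example: random anomalies with one "land" point
rng = np.random.default_rng(0)
lat = np.linspace(-60, 60, 5)
field = rng.standard_normal((40, 5, 8))
field[:, 2, 3] = np.nan
patterns, pcs, var_ratio = fit_eof_masked(field, lat)
```

Dropping masked columns keeps land points from contributing artificial zero-variance columns to the decomposition, at the cost of tracking the `valid` index so patterns can be mapped back to the grid.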

Pattern Extractor

  • Create PatternExtractor wrapper for full pipeline
  • Integrate preprocessing and EOF
  • Add pattern caching to Parquet
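
The caching step could look roughly like the following. It is a sketch, not the planned PatternExtractor: `pcs_to_frame` and `load_or_compute` are hypothetical helper names, the column naming (PC1, PC2, ...) is an assumption, and writing Parquet through pandas requires an engine such as pyarrow or fastparquet to be installed.

```python
from pathlib import Path

import numpy as np
import pandas as pd

def pcs_to_frame(pcs: np.ndarray, times) -> pd.DataFrame:
    """Wrap a (time, n_components) PC array in a DataFrame with PC1..PCn columns."""
    columns = [f"PC{i + 1}" for i in range(pcs.shape[1])]
    return pd.DataFrame(pcs, index=pd.Index(times, name="time"), columns=columns)

def load_or_compute(path: Path, compute) -> pd.DataFrame:
    """Return cached PCs if the Parquet file exists, else compute and cache them."""
    if path.exists():
        return pd.read_parquet(path)
    frame = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_parquet(path)  # needs pyarrow or fastparquet
    return frame

# Small illustration of the frame layout (no file is written here)
times = pd.date_range("2000-01-01", periods=4, freq="MS")
frame = pcs_to_frame(np.arange(8.0).reshape(4, 2), times)
```

A call such as `load_or_compute(Path("cache/pcs.parquet"), lambda: frame)` would write the file on the first run and read it back on later runs; the cache path shown is illustrative only.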

Region Masking

  • Define Atlantic sector (80W-0, 20N-80N)
  • Define Pacific sector (120E-120W, 20S-60N)
  • Define Arctic sector (north of 60N)
  • Implement mask application

Testing

  • Test EOF produces orthogonal patterns
  • Test explained variance ratios are non-increasing and sum to at most 1
  • Test first EOF resembles known patterns (e.g. NAO/AO for SLP anomalies)

Code Snippets

Area-Weighted EOF

# src/features/eof.py
import numpy as np
import xarray as xr
from sklearn.decomposition import PCA


class EOFAnalyzer:
    """Area-weighted EOF analysis of a (time, lat, lon) field via PCA."""

    def __init__(self, n_components: int = 10):
        self.n_components = n_components
        self.pca = PCA(n_components=n_components)
        self._lat_weights = None
        self._shape = None

    def _to_matrix(self, data: xr.DataArray) -> np.ndarray:
        """Weight by latitude and flatten to a (time, space) matrix."""
        # Enforce dimension order so the reshape below is well defined
        data = data.transpose("time", "lat", "lon")
        weighted = data * xr.DataArray(self._lat_weights, dims=["lat"])
        X = weighted.values.reshape(data.sizes["time"], -1)
        # Zero-fill missing values (e.g. land); assumes the input is anomalies,
        # so that zero means "no departure from the mean"
        return np.nan_to_num(X, nan=0.0)

    def fit(self, data: xr.DataArray) -> "EOFAnalyzer":
        """Fit EOFs to data with (time, lat, lon) dimensions."""
        lat = data.coords["lat"].values
        # sqrt(cos(lat)) on the data gives cos(lat) weighting of the covariance;
        # the clip guards against round-off producing a negative cosine
        self._lat_weights = np.sqrt(np.clip(np.cos(np.deg2rad(lat)), 0.0, None))
        self._shape = (data.sizes["lat"], data.sizes["lon"])
        self.pca.fit(self._to_matrix(data))
        return self

    def transform(self, data: xr.DataArray) -> np.ndarray:
        """Project data onto the fitted EOFs; returns (time, n_components)."""
        if self._lat_weights is None:
            raise RuntimeError("Call fit() before transform().")
        return self.pca.transform(self._to_matrix(data))

    @property
    def explained_variance_ratio(self) -> np.ndarray:
        return self.pca.explained_variance_ratio_

    def get_patterns(self) -> np.ndarray:
        """Return the latitude-weighted EOF patterns as (n_components, lat, lon)."""
        return self.pca.components_.reshape(-1, *self._shape)

Region Mask

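# src/features/regions.py
import xarray as xr

# Longitudes use the 0-360 convention throughout; latitudes are (south, north).
REGIONS = {
    "atlantic": {"lon": (280, 360), "lat": (20, 80)},   # 80W-0
    "pacific": {"lon": (120, 240), "lat": (-20, 60)},   # 120E-120W
    "arctic": {"lon": (0, 360), "lat": (60, 90)},
    "nino34": {"lon": (190, 240), "lat": (-5, 5)},      # ENSO region
}


def apply_mask(data: xr.DataArray, region: str) -> xr.DataArray:
    """Subset data to the specified region."""
    bounds = REGIONS[region]
    # Convert longitudes to 0-360 and sort both axes ascending so that
    # label-based slices work regardless of the input grid's conventions
    # (e.g. -180..180 longitudes or north-to-south latitudes)
    data = data.assign_coords(lon=data.lon % 360).sortby(["lon", "lat"])
    return data.sel(
        lon=slice(*bounds["lon"]),
        lat=slice(*bounds["lat"]),
    )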
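
Sectors such as the Pacific (120E-120W) cross the dateline, and the Atlantic bounds cross the 0/360 seam when expressed in the 0-360 convention, so a single label-based slice cannot always express them. A boolean longitude mask is one alternative. The sketch below is illustrative: `lon_in_range` is a hypothetical helper, not part of the planned `regions.py`, and it works on plain NumPy longitude arrays.

```python
import numpy as np

def lon_in_range(lon: np.ndarray, west: float, east: float) -> np.ndarray:
    """Boolean mask of longitudes inside [west, east], in either lon convention."""
    if east - west >= 360.0:
        # Full circle (e.g. -180..180): every longitude qualifies
        return np.ones(lon.shape, dtype=bool)
    lon360 = np.mod(lon, 360.0)
    west360, east360 = np.mod(west, 360.0), np.mod(east, 360.0)
    if west360 <= east360:
        return (lon360 >= west360) & (lon360 <= east360)
    # Sector wraps across the 0/360 seam (e.g. 80W to 0)
    return (lon360 >= west360) | (lon360 <= east360)

lon = np.arange(-180.0, 180.0, 30.0)          # grid in the -180..180 convention
atlantic = lon[lon_in_range(lon, -80, 0)]     # 80W to 0
pacific = lon[lon_in_range(lon, 120, -120)]   # 120E to 120W, across the dateline
```

With an xarray object, a mask like this could be applied through `data.isel(lon=mask)` or `data.where(...)`, which avoids having to re-sort the longitude axis first.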

Verification

# Run EOF analysis
python -c "
from src.features.eof import EOFAnalyzer
import xarray as xr
data = xr.open_dataset('data/processed/slp_anomalies.nc')['msl']
eof = EOFAnalyzer(n_components=10).fit(data)
print('Explained variance:', eof.explained_variance_ratio[:3])
"

# Run tests
pytest tests/test_features.py -v

Technical Challenges

| Challenge | Mitigation |
|-----------|------------|
| Memory for large grids | Use incremental PCA or randomized SVD |
| Land mask handling | Set land to NaN, then zero-fill with np.nan_to_num(X, nan=0.0) |
| Proper weighting | Apply sqrt(cos(lat)) weighting before PCA |
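
The memory mitigation can be sketched with scikit-learn's own solvers: `PCA(svd_solver="randomized")` approximates only the leading modes, and `IncrementalPCA.partial_fit` consumes the time axis in chunks. The matrix below is synthetic low-rank data standing in for a flattened, weighted field, and the chunk size of 100 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

rng = np.random.default_rng(0)
# Synthetic (time, space) matrix: rank-5 signal plus weak noise
X = rng.standard_normal((300, 5)) @ rng.standard_normal((5, 2000))
X += 0.01 * rng.standard_normal(X.shape)

# Randomized SVD: approximates only the requested leading modes
rand_pca = PCA(n_components=5, svd_solver="randomized", random_state=0).fit(X)

# Incremental PCA: feed the time axis in chunks so the full matrix
# does not need to be held in memory at once
inc_pca = IncrementalPCA(n_components=5)
for start in range(0, X.shape[0], 100):
    inc_pca.partial_fit(X[start:start + 100])

# Exact solver, used here only as a reference for comparison
full_pca = PCA(n_components=5, svd_solver="full").fit(X)
```

Each chunk passed to `partial_fit` must contain at least `n_components` samples. With real data the chunks would typically be read lazily from disk rather than sliced from an in-memory array.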

Definition of Done

  • EOF extracts 10 leading modes from SLP anomalies
  • First EOF visually resembles NAO/AO pattern
  • Explained variance ratios of the 10 modes sum to at most 100% and decrease monotonically
  • Region masks correctly subset data
  • All tests pass with pytest tests/test_features.py

Metadata

Labels: enhancement (New feature or request)
