[Phase 3] Pattern Classification - ML Classifiers & Validation #4

@Sakeeb91

Summary

Build ML classifiers to identify teleconnection phases (positive/neutral/negative) from atmospheric patterns, validated against official NOAA indices.

Parent Issue: #1
Depends On: Phase 2 (Feature Engineering)

Objectives

  • Implement unsupervised clustering (K-Means, SOM) for pattern discovery
  • Build supervised classifier for teleconnection phase prediction
  • Validate against official NOAA indices
  • Achieve >70% classification accuracy on held-out data

System Context

src/models/
├── clustering.py      # K-Means, SOM implementations
├── classifier.py      # Supervised teleconnection classifier
└── validation.py      # Validation against NOAA indices

Files to Create/Modify

| File | Action | Description |
| --- | --- | --- |
| src/models/clustering.py | Create | K-Means and SOM for pattern discovery |
| src/models/classifier.py | Create | Random Forest classifier for phases |
| src/models/validation.py | Create | Index correlation and accuracy metrics |
| tests/test_models.py | Create | Model tests |

Implementation Checklist

Clustering

  • Implement K-Means with elbow method for optimal k
  • Add silhouette score computation
  • Implement Self-Organizing Map (using minisom)
  • Visualize cluster centers as spatial patterns
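The K-Means items above could be sketched roughly as follows. This is a minimal illustration on synthetic 2-D data, not the project's implementation: the helper name `scan_cluster_counts` and the candidate range of k are assumptions, and in practice `X` would be the feature matrix produced in Phase 2.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def scan_cluster_counts(X: np.ndarray, k_values=range(2, 9), seed: int = 42) -> dict:
    """Fit K-Means for each candidate k and record inertia and silhouette.

    Hypothetical helper for illustration only.
    """
    results = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        results[k] = {
            # Within-cluster sum of squares: plot against k to find the elbow.
            "inertia": float(km.inertia_),
            # Mean silhouette coefficient in [-1, 1]; higher is better.
            "silhouette": float(silhouette_score(X, km.labels_)),
        }
    return results


# Synthetic stand-in for the feature matrix: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((50, 2)) for c in centers])

scores = scan_cluster_counts(X, k_values=range(2, 7))
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
for k, s in scores.items():
    print(f"k={k}: inertia={s['inertia']:.1f}, silhouette={s['silhouette']:.3f}")
print("best k by silhouette:", best_k)
```

The elbow curve and the silhouette score can disagree on real atmospheric data, so the issue's suggestion of combining them with domain knowledge (4-8 regimes) still applies.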

Supervised Classifier

  • Discretize NOAA indices into phases (-1, 0, +1)
  • Train Random Forest on EOF features
  • Implement cross-validation with stratified splits
  • Handle class imbalance with SMOTE or class weights
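To make the class-weight option concrete, the sketch below reproduces the weights that scikit-learn derives for `class_weight="balanced"`, namely `n_samples / (n_classes * count_per_class)`. The helper name `balanced_class_weights` and the label counts are made up for illustration.

```python
import numpy as np


def balanced_class_weights(y: np.ndarray) -> dict:
    """Return per-class weights inversely proportional to class frequency.

    Hypothetical helper mirroring the "balanced" heuristic.
    """
    classes, counts = np.unique(y, return_counts=True)
    weights = y.size / (classes.size * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}


# Made-up phase labels in which the neutral phase dominates.
y = np.array([-1] * 20 + [0] * 60 + [1] * 20)
weights = balanced_class_weights(y)
print(weights)
```

Rare phases receive larger weights, so misclassifying them costs more during training. SMOTE, the alternative named above, instead synthesizes extra minority samples and requires the separate imbalanced-learn package.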

Validation

  • Compute correlation between predicted and actual indices
  • Generate confusion matrix for phase classification
  • Calculate precision, recall, F1 per class
  • Compare PC time series to official indices
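For reference, the per-class metrics in this checklist follow directly from the confusion matrix. The sketch below computes them with NumPy only; it assumes rows are true classes and columns are predicted classes (the convention of `sklearn.metrics.confusion_matrix`), and both the helper name `per_class_metrics` and the counts are invented for illustration.

```python
import numpy as np


def per_class_metrics(cm: np.ndarray) -> dict:
    """Precision, recall, and F1 per class from a confusion matrix.

    Rows are assumed to be true classes, columns predicted classes.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm).copy()
    predicted = cm.sum(axis=0)  # column sums: how often each class was predicted
    actual = cm.sum(axis=1)     # row sums: how often each class truly occurred
    # Guard against division by zero when a class is never predicted or never occurs.
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented counts; class order is negative, neutral, positive.
cm = np.array([
    [8, 2, 0],
    [1, 16, 3],
    [0, 2, 8],
])
metrics = per_class_metrics(cm)
accuracy = np.trace(cm) / cm.sum()
print(metrics, accuracy)
```

In this toy matrix every class has recall 0.8 and overall accuracy is 0.8, yet precision differs by class, which is why the checklist asks for per-class numbers rather than accuracy alone.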

Testing

  • Test classifier accuracy on held-out data
  • Test clustering produces reasonable cluster counts
  • Test validation metrics computed correctly
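A held-out accuracy test might look something like the sketch below. It uses synthetic features whose first column determines the phase, so it only checks that the pipeline works end to end; the function names and the data generator are hypothetical, and a real test in tests/test_models.py would load project fixtures instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def make_synthetic_phases(n: int = 600, seed: int = 0):
    """Toy features whose first column drives the phase label (made-up data)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, 5))
    y = np.zeros(n, dtype=int)
    y[X[:, 0] > 0.5] = 1
    y[X[:, 0] < -0.5] = -1
    return X, y


def held_out_accuracy() -> float:
    """Train on 75% of the data and score on the stratified 25% hold-out."""
    X, y = make_synthetic_phases()
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0
    )
    clf = RandomForestClassifier(
        n_estimators=100, max_depth=10, class_weight="balanced", random_state=42
    )
    clf.fit(X_train, y_train)
    return float(clf.score(X_test, y_test))


def test_classifier_accuracy_on_held_out_data():
    # Mirrors the >70% target from the objectives.
    assert held_out_accuracy() > 0.7


accuracy = held_out_accuracy()
print(f"held-out accuracy: {accuracy:.3f}")
```

For monthly climate series, a chronological split may be a more honest hold-out than a shuffled one because neighbouring months are correlated; that choice is left open here.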

Code Snippets

Phase Discretization

# src/models/classifier.py
import numpy as np


def discretize_index(values: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Convert continuous index to phase labels.

    Args:
        values: Continuous index values (standardized)
        threshold: Threshold for positive/negative phases

    Returns:
        Labels: -1 (negative), 0 (neutral), +1 (positive)
    """
    labels = np.zeros_like(values, dtype=int)
    labels[values > threshold] = 1
    labels[values < -threshold] = -1
    return labels
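A quick usage example shows how the boundaries behave. The function is repeated so the snippet runs on its own; the input values are made up.

```python
import numpy as np


def discretize_index(values: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Same logic as the snippet above, repeated for a self-contained example."""
    labels = np.zeros_like(values, dtype=int)
    labels[values > threshold] = 1
    labels[values < -threshold] = -1
    return labels


values = np.array([-1.2, -0.5, 0.0, 0.5, 0.8])
labels = discretize_index(values)
print(labels.tolist())  # [-1, 0, 0, 0, 1]
```

Because both comparisons are strict, values exactly at ±0.5 are labeled neutral. NaN inputs also fall through both comparisons and come out as 0, so missing months should be dropped before discretizing rather than silently counted as neutral.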

Teleconnection Classifier

# src/models/classifier.py
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

class TeleconnectionClassifier:
    def __init__(self, index_name: str = "NAO"):
        self.index_name = index_name
        self.model = RandomForestClassifier(
            n_estimators=100,
            max_depth=10,
            class_weight="balanced",  # Handle imbalance
            random_state=42
        )

    def fit(self, X: np.ndarray, y: np.ndarray) -> "TeleconnectionClassifier":
        """Report stratified 5-fold CV macro-F1, then fit on all of X and y."""
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        scores = cross_val_score(self.model, X, y, cv=cv, scoring="f1_macro")
        print(f"CV F1 scores: {scores.mean():.3f} +/- {scores.std():.3f}")

        self.model.fit(X, y)
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        return self.model.predict(X)

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        return self.model.predict_proba(X)

Validation Against NOAA

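# src/models/validation.py
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.metrics import classification_report, confusion_matrix

from src.models.classifier import discretize_index


def validate_against_noaa(
    predicted_pc: pd.Series,  # time-indexed so it can be aligned with the NOAA index
    noaa_index: pd.Series
) -> dict: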
    """Compare predicted PC to official NOAA index.

    Returns correlation, RMSE, and phase classification metrics.
    """
    # Align time indices
    common_idx = predicted_pc.index.intersection(noaa_index.index)
    pred = predicted_pc.loc[common_idx]
    actual = noaa_index.loc[common_idx]

    # Continuous metrics
    corr, pval = pearsonr(pred, actual)
    rmse = np.sqrt(np.mean((pred - actual) ** 2))

    # Phase classification
    pred_phase = discretize_index(pred.values)
    actual_phase = discretize_index(actual.values)

    return {
        "correlation": corr,
        "p_value": pval,
        "rmse": rmse,
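        "classification_report": classification_report(
            actual_phase, pred_phase,
            labels=[-1, 0, 1],  # keep all three phases even if one is absent
            target_names=["Negative", "Neutral", "Positive"],
            zero_division=0,
        ),
        "confusion_matrix": confusion_matrix(actual_phase, pred_phase, labels=[-1, 0, 1])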
    }
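One detail the function above does not handle: the sign of an EOF principal component is arbitrary, so a PC can be strongly anti-correlated with the official index even when the spatial pattern matches. A possible pre-processing step is sketched below, assuming both inputs are pandas Series indexed by time. It flips the PC when its correlation with the reference is negative and standardizes it so the 0.5 phase threshold and the RMSE are on a comparable scale. The helper name `align_pc_to_index` and the synthetic series are hypothetical.

```python
import numpy as np
import pandas as pd


def align_pc_to_index(pc: pd.Series, reference: pd.Series) -> pd.Series:
    """Standardize a PC and flip its sign to correlate positively with reference.

    Hypothetical helper; not part of the issue's proposed API.
    """
    common = pc.index.intersection(reference.index)
    pc_common = pc.loc[common]
    ref_common = reference.loc[common]
    corr = np.corrcoef(pc_common.to_numpy(), ref_common.to_numpy())[0, 1]
    sign = -1.0 if corr < 0 else 1.0
    standardized = (pc_common - pc_common.mean()) / pc_common.std(ddof=0)
    return sign * standardized


# Synthetic monthly series: the "PC" is a scaled, sign-flipped, noisy copy.
dates = pd.date_range("2000-01-01", periods=120, freq="MS")
rng = np.random.default_rng(1)
reference = pd.Series(rng.standard_normal(120), index=dates)
pc = pd.Series(-3.0 * reference.to_numpy() + 0.3 * rng.standard_normal(120), index=dates)

aligned = align_pc_to_index(pc, reference)
corr_after = np.corrcoef(aligned.to_numpy(), reference.to_numpy())[0, 1]
print(f"correlation after alignment: {corr_after:.3f}")
```

Whether to flip automatically or fix the sign by convention (for example, from the loading pattern) is a design choice; automatic flipping would hide a genuinely inverted relationship.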

Verification

# Train classifier
python -c "
from src.models.classifier import TeleconnectionClassifier
from src.features.eof import EOFAnalyzer
from src.data.loaders import NOAAIndexLoader

# Load features and labels
# ... (data loading code)

clf = TeleconnectionClassifier('NAO')
clf.fit(X_train, y_train)
print('Accuracy:', clf.model.score(X_test, y_test))
"

# Run validation
python -m src.models.validation --index NAO --output reports/nao_validation.json

# Run tests
pytest tests/test_models.py -v

Technical Challenges

| Challenge | Mitigation |
| --- | --- |
| Class imbalance | Use class_weight="balanced", stratified CV |
| Optimal cluster count | Elbow method + domain knowledge (4-8 regimes) |
| SOM training slow | Use minisom library, small grid (10x10) |
| Overfitting | Cross-validation, limit max_depth |

Definition of Done

  • K-Means identifies 4-8 distinct weather regimes
  • NAO classifier achieves >70% accuracy on held-out data
  • Correlation with official NAO index exceeds 0.7
  • Confusion matrix shows reasonable class separation
  • All tests pass with pytest tests/test_models.py

Metadata

Labels: enhancement (New feature or request)