Summary
Build ML classifiers to identify teleconnection phases (positive/neutral/negative) from atmospheric patterns, validated against official NOAA indices.
Parent Issue: #1
Depends On: Phase 2 (Feature Engineering)
Objectives
- Implement unsupervised clustering (K-Means, SOM) for pattern discovery
- Build supervised classifier for teleconnection phase prediction
- Validate against official NOAA indices
- Achieve >70% classification accuracy
System Context
src/models/
├── clustering.py # K-Means, SOM implementations
├── classifier.py # Supervised teleconnection classifier
└── validation.py # Cross-validation against NOAA
Files to Create/Modify
| File | Action | Description |
|------|--------|-------------|
| src/models/clustering.py | Create | K-Means and SOM for pattern discovery |
| src/models/classifier.py | Create | Random Forest classifier for phases |
| src/models/validation.py | Create | Index correlation and accuracy metrics |
| tests/test_models.py | Create | Model tests |
Implementation Checklist
Clustering
Supervised Classifier
Validation
Testing
Code Snippets
Phase Discretization
# src/models/classifier.py
import numpy as np


def discretize_index(values: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Convert a continuous index to phase labels.

    Args:
        values: Continuous index values (standardized)
        threshold: Magnitude the index must exceed to count as a positive/negative phase

    Returns:
        Labels: -1 (negative), 0 (neutral), +1 (positive)
    """
    labels = np.zeros_like(values, dtype=int)  # default to neutral
    labels[values > threshold] = 1
    labels[values < -threshold] = -1
    return labels
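As a quick sanity check of the thresholding rule, the sketch below applies it to a handful of standardized values. The function is repeated so the block runs on its own, and the sample values are made up for illustration, not taken from a NOAA index.

```python
import numpy as np


def discretize_index(values: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Same rule as the snippet: -1 below -threshold, +1 above threshold, else 0."""
    labels = np.zeros_like(values, dtype=int)
    labels[values > threshold] = 1
    labels[values < -threshold] = -1
    return labels


# Illustrative standardized index values (hypothetical, not real data)
sample = np.array([-1.2, -0.5, -0.1, 0.0, 0.5, 0.8])
phases = discretize_index(sample)
print(phases.tolist())  # [-1, 0, 0, 0, 0, 1]
```

Both comparisons are strict, so values exactly at +/-threshold are labelled neutral.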
Teleconnection Classifier
# src/models/classifier.py
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold


class TeleconnectionClassifier:
    def __init__(self, index_name: str = "NAO"):
        self.index_name = index_name
        self.model = RandomForestClassifier(
            n_estimators=100,
            max_depth=10,  # Limit depth to reduce overfitting
            class_weight="balanced",  # Handle imbalance
            random_state=42,
        )

    def fit(self, X: np.ndarray, y: np.ndarray) -> "TeleconnectionClassifier":
        """Report cross-validated macro F1, then fit on the full training set."""
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        scores = cross_val_score(self.model, X, y, cv=cv, scoring="f1_macro")
        print(f"CV F1 scores: {scores.mean():.3f} +/- {scores.std():.3f}")
        self.model.fit(X, y)
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Return phase labels (-1, 0, +1)."""
        return self.model.predict(X)

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Return per-class probabilities, columns ordered as self.model.classes_."""
        return self.model.predict_proba(X)
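To show what the cross-validation and probability outputs look like, here is a minimal sketch that runs the same scikit-learn configuration on synthetic data. The feature matrix and the rule that generates the labels are assumptions made purely for illustration; real inputs would come from the Phase 2 features and discretized NOAA indices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for Phase 2 features: 300 samples, 5 columns (hypothetical)
X = rng.normal(size=(300, 5))
# Labels derived from the first column with a +/-0.5 threshold (illustrative rule)
y = np.where(X[:, 0] > 0.5, 1, np.where(X[:, 0] < -0.5, -1, 0))

model = RandomForestClassifier(
    n_estimators=100, max_depth=10, class_weight="balanced", random_state=42
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
print(f"CV F1 (macro): {scores.mean():.3f} +/- {scores.std():.3f}")

model.fit(X, y)
proba = model.predict_proba(X[:3])
print(model.classes_)  # column order of predict_proba
print(proba.shape)     # one row per sample, one column per phase
```

One design caveat: shuffled folds can give optimistic scores on autocorrelated climate series, because neighbouring time steps end up in both train and test folds. If that turns out to matter here, scikit-learn's `TimeSeriesSplit` is one possible alternative to evaluate.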
Validation Against NOAA
# src/models/validation.py
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.metrics import classification_report, confusion_matrix

from src.models.classifier import discretize_index

PHASE_LABELS = [-1, 0, 1]
PHASE_NAMES = ["Negative", "Neutral", "Positive"]


def validate_against_noaa(
    predicted_pc: pd.Series,
    noaa_index: pd.Series,
) -> dict:
    """Compare a predicted PC time series to the official NOAA index.

    Both inputs must be pandas Series sharing a time index.
    Returns correlation, RMSE, and phase classification metrics.
    """
    # Align time indices
    common_idx = predicted_pc.index.intersection(noaa_index.index)
    pred = predicted_pc.loc[common_idx]
    actual = noaa_index.loc[common_idx]

    # Continuous metrics
    corr, pval = pearsonr(pred, actual)
    rmse = np.sqrt(np.mean((pred - actual) ** 2))

    # Phase classification
    pred_phase = discretize_index(pred.values)
    actual_phase = discretize_index(actual.values)

    # Pass labels explicitly so the report still works if a phase is absent
    return {
        "correlation": float(corr),
        "p_value": float(pval),
        "rmse": float(rmse),
        "classification_report": classification_report(
            actual_phase, pred_phase, labels=PHASE_LABELS, target_names=PHASE_NAMES
        ),
        "confusion_matrix": confusion_matrix(
            actual_phase, pred_phase, labels=PHASE_LABELS
        ),
    }
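The index alignment and phase comparison can be illustrated on a tiny hand-made example. This is a sketch under stated assumptions: both series and their dates are invented, and the helper `to_phase` is a hypothetical inline stand-in for `discretize_index` so the block runs on its own.

```python
import numpy as np
import pandas as pd


def to_phase(values: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Inline stand-in for discretize_index (hypothetical helper for this example)."""
    return np.where(values > threshold, 1, np.where(values < -threshold, -1, 0))


# Six months of an invented "official" index
dates = pd.date_range("2000-01-01", periods=6, freq="MS")
noaa = pd.Series([1.0, 0.2, -0.8, -0.3, 0.9, 0.1], index=dates)
# The invented prediction only covers the first five months
pred = pd.Series([0.7, 0.6, -0.9, -0.2, 0.4], index=dates[:5])

# Keep only timestamps present in both series
common = pred.index.intersection(noaa.index)
pred_phase = to_phase(pred.loc[common].to_numpy())
actual_phase = to_phase(noaa.loc[common].to_numpy())

accuracy = float((pred_phase == actual_phase).mean())
meets_target = accuracy > 0.70  # the >70% objective stated in this issue
print(len(common), accuracy, meets_target)  # 5 0.6 False
```

Three of the five overlapping months agree in phase, so this toy prediction would fall short of the 70% objective even though the two series look broadly similar.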
Verification
# Train classifier
python -c "
from src.models.classifier import TeleconnectionClassifier
from src.features.eof import EOFAnalyzer
from src.data.loaders import NOAAIndexLoader
# Load features and labels
# ... (data loading code)
clf = TeleconnectionClassifier('NAO')
clf.fit(X_train, y_train)
print('Accuracy:', clf.model.score(X_test, y_test))
"
# Run validation
python -m src.models.validation --index NAO --output reports/nao_validation.json
# Run tests
pytest tests/test_models.py -v
Technical Challenges
| Challenge | Mitigation |
|-----------|------------|
| Class imbalance | Use class_weight="balanced", stratified CV |
| Optimal cluster count | Elbow method + domain knowledge (4-8 regimes) |
| SOM training slow | Use minisom library, small grid (10x10) |
| Overfitting | Cross-validation, limit max_depth |
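The elbow method mentioned in the table can be sketched as follows. This is a minimal illustration on synthetic 2-D blobs, not the planned src/models/clustering.py; the data, the k range of 2 to 8, and the `n_init` setting are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in: four well-separated blobs of 50 points each (hypothetical data)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit K-Means for each candidate k and record the within-cluster sum of squares
inertias = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_

# The "elbow" is where adding another cluster stops reducing inertia by much
for k, value in inertias.items():
    print(k, round(value, 1))
```

On data like this the inertia drops sharply up to k=4 and flattens afterwards. Real atmospheric fields rarely show such a clean elbow, which is why the table pairs the method with domain knowledge about the expected number of regimes.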
Definition of Done
pytest tests/test_models.py