[Phase 3] Pattern Classification - ML Classifiers & Validation

## Summary

Build ML classifiers to identify teleconnection phases (positive/neutral/negative) from atmospheric patterns, validated against official NOAA indices.

**Parent Issue:** #1
**Depends On:** Phase 2 (Feature Engineering)

## Objectives

- Implement unsupervised clustering (K-Means, SOM) for pattern discovery
- Build supervised classifier for teleconnection phase prediction
- Validate against official NOAA indices
- Achieve >70% classification accuracy on held-out data

## System Context

```
src/models/
├── clustering.py      # K-Means, SOM implementations
├── classifier.py      # Supervised teleconnection classifier
└── validation.py      # Validation against NOAA indices
```

## Files to Create/Modify

| File | Action | Description |
|------|--------|-------------|
| `src/models/clustering.py` | Create | K-Means and SOM for pattern discovery |
| `src/models/classifier.py` | Create | Random Forest classifier for phases |
| `src/models/validation.py` | Create | Index correlation and accuracy metrics |
| `tests/test_models.py` | Create | Model tests |

## Implementation Checklist

### Clustering
- [ ] Implement K-Means with elbow method for optimal k
- [ ] Add silhouette score computation
- [ ] Implement Self-Organizing Map (using minisom)
- [ ] Visualize cluster centers as spatial patterns
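A minimal sketch of the cluster-count scan described above, assuming scikit-learn's `KMeans` and `silhouette_score`. The helper name `scan_cluster_counts` and the synthetic blobs are illustrative only; real inputs would be the EOF/PC features from Phase 2.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def scan_cluster_counts(X, k_values=range(2, 9), random_state=42):
    """Fit K-Means for each k and record inertia (elbow) and silhouette."""
    results = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = km.fit_predict(X)
        results[k] = {
            "inertia": km.inertia_,  # within-cluster sum of squares
            "silhouette": silhouette_score(X, labels),
        }
    return results


# Synthetic stand-in for PC features: three well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

scan = scan_cluster_counts(X_demo)
best_k = max(scan, key=lambda k: scan[k]["silhouette"])
print("k with highest silhouette:", best_k)
```

Plotting `inertia` against k gives the elbow curve; the silhouette score offers a second opinion when the elbow is ambiguous.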

### Supervised Classifier
- [ ] Discretize NOAA indices into phases (-1, 0, +1)
- [ ] Train Random Forest on EOF features
- [ ] Implement cross-validation with stratified splits
- [ ] Handle class imbalance with SMOTE or class weights
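To make the class-weight option concrete, the sketch below computes the weights that `class_weight="balanced"` implies, using scikit-learn's `compute_class_weight`. The label array is made up for illustration; each weight equals `n_samples / (n_classes * class_count)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative phase labels: neutral months dominate
y = np.array([-1, -1, 0, 0, 0, 0, 0, 0, 1, 1])
classes = np.array([-1, 0, 1])

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
weight_by_phase = dict(zip(classes.tolist(), weights.tolist()))

# Rare phases get weights above 1, the dominant neutral phase below 1
print(weight_by_phase)
```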

### Validation
- [ ] Compute correlation between predicted and actual indices
- [ ] Generate confusion matrix for phase classification
- [ ] Calculate precision, recall, F1 per class
- [ ] Compare PC time series to official indices

### Testing
- [ ] Test classifier accuracy on held-out data
- [ ] Test clustering produces cluster counts in the expected range (4-8 regimes)
- [ ] Test validation metrics computed correctly
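A possible shape for the held-out accuracy test in `tests/test_models.py`, written against synthetic data so it does not depend on the Phase 2 pipeline. The helper `make_synthetic_phases` is hypothetical, and the 0.7 bar mirrors the accuracy objective rather than a measured result.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def make_synthetic_phases(n_samples=600, seed=0):
    """Hypothetical fixture: phase labels driven mostly by feature 0."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, 5))
    index = X[:, 0] + 0.1 * rng.normal(size=n_samples)
    y = np.zeros(n_samples, dtype=int)
    y[index > 0.5] = 1
    y[index < -0.5] = -1
    return X, y


def test_classifier_accuracy_on_held_out_data():
    X, y = make_synthetic_phases()
    # Stratify so every phase appears in both train and test splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=42
    )
    model = RandomForestClassifier(
        n_estimators=100, max_depth=10, class_weight="balanced", random_state=42
    )
    model.fit(X_train, y_train)
    assert model.score(X_test, y_test) > 0.7
```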

## Code Snippets

### Phase Discretization

```python
# src/models/classifier.py
import numpy as np


def discretize_index(values: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Convert continuous index to phase labels.

    Args:
        values: Continuous index values (standardized)
        threshold: Threshold for positive/negative phases

    Returns:
        Labels: -1 (negative), 0 (neutral), +1 (positive)
    """
    labels = np.zeros_like(values, dtype=int)
    labels[values > threshold] = 1
    labels[values < -threshold] = -1
    return labels
```
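A short usage example, with made-up index values, shows how the strict inequalities treat values that sit exactly on the threshold. The function is repeated here so the snippet runs on its own.

```python
import numpy as np


def discretize_index(values, threshold=0.5):
    """Same logic as above: strict comparisons against +/- threshold."""
    labels = np.zeros_like(values, dtype=int)
    labels[values > threshold] = 1
    labels[values < -threshold] = -1
    return labels


nao = np.array([-1.2, -0.5, -0.1, 0.0, 0.5, 0.8])
labels = discretize_index(nao)
print(labels.tolist())  # [-1, 0, 0, 0, 0, 1]

# Class counts are a quick check on how imbalanced the phases are
phases, counts = np.unique(labels, return_counts=True)
print(dict(zip(phases.tolist(), counts.tolist())))  # {-1: 1, 0: 4, 1: 1}
```

Values equal to `threshold` or `-threshold` fall into the neutral class, so the choice of threshold directly controls how dominant that class becomes.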

### Teleconnection Classifier

```python
# src/models/classifier.py
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

class TeleconnectionClassifier:
    def __init__(self, index_name: str = "NAO"):
        self.index_name = index_name
        self.model = RandomForestClassifier(
            n_estimators=100,
            max_depth=10,
            class_weight="balanced",  # Handle imbalance
            random_state=42
        )

    def fit(self, X: np.ndarray, y: np.ndarray) -> "TeleconnectionClassifier":
        """Report stratified 5-fold CV scores, then fit on all of X, y."""
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        scores = cross_val_score(self.model, X, y, cv=cv, scoring="f1_macro")
        print(f"CV F1 scores: {scores.mean():.3f} +/- {scores.std():.3f}")

        self.model.fit(X, y)
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        return self.model.predict(X)

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        return self.model.predict_proba(X)
```
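One caveat with shuffled folds: if consecutive samples are autocorrelated in time, shuffling can place near-duplicate neighbours in both train and test folds and inflate the CV score. The sketch below shows a time-ordered alternative using scikit-learn's `TimeSeriesSplit`. It is an option to consider, not part of the design above, and the random features are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Placeholder features and phase labels, assumed to be in time order
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = np.where(X[:, 0] > 0.5, 1, np.where(X[:, 0] < -0.5, -1, 0))

# Each fold trains on the past and tests on the block that follows it
cv = TimeSeriesSplit(n_splits=5)
model = RandomForestClassifier(
    n_estimators=100, max_depth=10, class_weight="balanced", random_state=42
)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
print(f"Time-ordered CV F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Unlike `StratifiedKFold`, `TimeSeriesSplit` does not balance classes per fold, so a rare phase can be missing from an early training window.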

### Validation Against NOAA

```python
# src/models/validation.py
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.metrics import classification_report, confusion_matrix

from src.models.classifier import discretize_index

def validate_against_noaa(
    predicted_pc: pd.Series,
    noaa_index: pd.Series
) -> dict:
    """Compare predicted PC to official NOAA index.

    Returns correlation, RMSE, and phase classification metrics.
    """
    # Align time indices
    common_idx = predicted_pc.index.intersection(noaa_index.index)
    pred = predicted_pc.loc[common_idx]
    actual = noaa_index.loc[common_idx]

    # Continuous metrics
    corr, pval = pearsonr(pred, actual)
    rmse = np.sqrt(np.mean((pred - actual) ** 2))

    # Phase classification
    pred_phase = discretize_index(pred.values)
    actual_phase = discretize_index(actual.values)

    return {
        "correlation": corr,
        "p_value": pval,
        "rmse": rmse,
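        # Pass labels explicitly so the report and matrix keep all three
        # phases even when one is absent from the aligned sample
        "classification_report": classification_report(
            actual_phase, pred_phase, labels=[-1, 0, 1],
            target_names=["Negative", "Neutral", "Positive"], zero_division=0
        ),
        "confusion_matrix": confusion_matrix(actual_phase, pred_phase, labels=[-1, 0, 1])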
    }
```
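The alignment and phase-comparison steps can be traced on a small made-up example. All values below are illustrative, the two series deliberately cover different months, and the threshold of 0.5 matches the default in `discretize_index`.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.metrics import confusion_matrix

# Two monthly series with only partly overlapping dates (made-up values)
pc = pd.Series(
    [0.1, -0.3, 1.0, -0.9, 0.2, 0.7],
    index=pd.date_range("2000-01-01", periods=6, freq="MS"),
)
noaa = pd.Series(
    [0.8, -1.1, 0.6, 0.9, -0.2, 0.4],
    index=pd.date_range("2000-03-01", periods=6, freq="MS"),
)

# Keep only the months present in both series (March to June 2000)
common = pc.index.intersection(noaa.index)
pred, actual = pc.loc[common], noaa.loc[common]

corr, pval = pearsonr(pred, actual)


def to_phase(values, threshold=0.5):
    """Map standardized index values to -1 / 0 / +1 phase labels."""
    return np.where(values > threshold, 1, np.where(values < -threshold, -1, 0))


# Rows are actual phases, columns are predicted phases, ordered -1, 0, +1
cm = confusion_matrix(to_phase(actual.values), to_phase(pred.values), labels=[-1, 0, 1])
print(cm)
```

Here the neutral row is all zeros because no aligned NOAA value is neutral; fixing `labels` keeps the matrix 3x3 so results stay comparable across indices.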

## Verification

```bash
# Train classifier
python -c "
from src.models.classifier import TeleconnectionClassifier
from src.features.eof import EOFAnalyzer
from src.data.loaders import NOAAIndexLoader

# Load features and labels
# ... (data loading code)

clf = TeleconnectionClassifier('NAO')
clf.fit(X_train, y_train)
print('Accuracy:', clf.model.score(X_test, y_test))
"

# Run validation
python -m src.models.validation --index NAO --output reports/nao_validation.json

# Run tests
pytest tests/test_models.py -v
```

## Technical Challenges

| Challenge | Mitigation |
|-----------|------------|
| Class imbalance | Use class_weight="balanced", stratified CV |
| Optimal cluster count | Elbow method + domain knowledge (4-8 regimes) |
| SOM training slow | Use minisom library, small grid (10x10) |
| Overfitting | Cross-validation, limit max_depth |

## Definition of Done

- [ ] K-Means identifies 4-8 distinct weather regimes
- [ ] NAO classifier achieves >70% accuracy on held-out data
- [ ] Correlation between the predicted PC time series and the official NAO index exceeds 0.7
- [ ] Confusion matrix shows reasonable class separation
- [ ] All tests pass with `pytest tests/test_models.py`

