
Optimizing Deep Learning Models

A practical, implementation-first guide to making deep learning models train faster, generalize better, and stay stable during training.


🎯 Goals

  • Improve model performance and generalization
  • Prevent/diagnose overfitting and underfitting
  • Choose and tune impactful hyperparameters
  • Build robust, reproducible training pipelines

🔍 Overfitting vs. Underfitting

Overfitting: Low training loss, high validation loss → the model memorizes noise and patterns specific to the training data.
Underfitting: Both training and validation losses are high → the model is too simple or has not trained enough.

Quick diagnostics

  • Training ↓, Validation ↑ ➜ Overfitting
  • Training ≈ Validation but both high ➜ Underfitting
  • Accuracy oscillates / loss unstable ➜ Optimization or data issues
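
The rules of thumb above can be encoded as a quick check on recorded loss curves. A minimal sketch in plain Python — the function name and the thresholds are illustrative, not canonical; tune them to your task:

```python
def diagnose(train_losses, val_losses, gap_ratio=0.2):
    """Classify the end-of-training regime from loss histories.

    gap_ratio is an illustrative threshold: a validation loss more than
    20% above training loss is flagged as overfitting.
    """
    t, v = train_losses[-1], val_losses[-1]
    if v > t * (1 + gap_ratio):
        return "overfitting"
    # "high" is task-dependent; here we compare against the initial loss
    if t > 0.8 * train_losses[0] and v > 0.8 * val_losses[0]:
        return "underfitting"
    return "ok"

print(diagnose([2.3, 1.0, 0.3], [2.3, 1.1, 0.9]))  # → overfitting
```

Run this on your logged curves after each experiment rather than eyeballing plots alone.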

🧰 Core Techniques

1) Regularization

  • L2 weight decay: Penalizes large weights; smooths decision boundaries.
  • Dropout: Randomly disables neurons during training to reduce co-adaptation.
  • Data augmentation: Expands dataset with realistic transformations (images, audio, text).

PyTorch snippet (L2 + Dropout + Augmentation):

import torch
import torch.nn as nn
import torchvision.transforms as T

# Data augmentation (vision example)
train_tfms = T.Compose([
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.2, 0.2, 0.2, 0.1),
    T.ToTensor(),
])

# Model with dropout
class Net(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Dropout(p=0.25),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 56 * 56, 256), nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = Net()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)  # L2 via weight_decay

2) Optimization & Stability

  • Learning rate (LR) is the #1 hyperparameter. Use warmup + cosine/step decay.

  • Gradient clipping to control exploding gradients.

  • Batch normalization / Layer normalization stabilizes and speeds up training.

  • Mixed precision (fp16/bf16) for speed and memory savings when supported.

PyTorch snippet (LR scheduling + gradient clipping):

import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)
scheduler = CosineAnnealingLR(optimizer, T_max=100)  # 100 epochs

for epoch in range(100):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad(set_to_none=True)
        loss = loss_fn(model(xb), yb)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
    scheduler.step()
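
The warmup mentioned above is not shown in the snippet; one common recipe is linear warmup followed by cosine decay. A schedule-only sketch in plain Python so the shape is easy to inspect — in PyTorch the same curve can be expressed with `LambdaLR` (which expects a multiplier rather than an absolute LR); the step counts here are placeholders:

```python
import math

def warmup_cosine(step, warmup_steps=500, total_steps=10_000, base_lr=3e-4):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

# LR ramps up linearly, peaks at the end of warmup, then decays smoothly
print(warmup_cosine(500))  # → 0.0003 (the peak)
```

Warmup mainly protects the early steps, when Adam-style optimizers' moment estimates are still noisy.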

Mixed precision (if GPU supports it):


scaler = torch.cuda.amp.GradScaler()
for xb, yb in train_loader:
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(xb), yb)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

3) Early Stopping & Checkpointing

Stop before the model overfits; keep the best weights by validation metric.

best_val = float("inf")
patience, patience_ctr = 10, 0
best_state = None

for epoch in range(100):
    train_one_epoch(...)
    val_loss = evaluate(...)
    if val_loss + 1e-6 < best_val:
        best_val = val_loss
        best_state = {k: v.cpu().clone() for k, v in model.state_dict().items()}
        patience_ctr = 0
    else:
        patience_ctr += 1
        if patience_ctr >= patience:
            print("Early stopping.")
            break

model.load_state_dict(best_state)

4) Data Quality & Splits

  • Use train/validation/test splits (e.g., 80/10/10) or k-fold when data is limited.
  • Ensure no leakage: the same entity must not appear across splits.
  • Balance labels or use class weights / focal loss for imbalanced data.
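
A leakage-safe split assigns each whole entity (user, patient, document) to exactly one split. A minimal sketch in plain Python — the `group_split` helper and the 80/10/10 defaults are ours, and the entity ids stand in for whatever identifies an entity in your data:

```python
import random

def group_split(entity_ids, val_frac=0.1, test_frac=0.1, seed=42):
    """Split entity ids so each entity lands in exactly one split."""
    groups = sorted(set(entity_ids))
    random.Random(seed).shuffle(groups)
    n = len(groups)
    n_test, n_val = int(n * test_frac), int(n * val_frac)
    test = set(groups[:n_test])
    val = set(groups[n_test:n_test + n_val])
    train = set(groups[n_test + n_val:])
    return train, val, test

ids = [f"user{i}" for i in range(100) for _ in range(3)]  # 3 samples per user
train, val, test = group_split(ids)
assert train.isdisjoint(val) and train.isdisjoint(test) and val.isdisjoint(test)
```

The same idea is available off the shelf as `GroupShuffleSplit`/`GroupKFold` in scikit-learn if you already depend on it.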

Example: class weights for cross-entropy

from torch.nn.functional import cross_entropy
import torch

weights = torch.tensor([1.0, 3.0], device=device)  # minority class gets higher weight
loss = cross_entropy(logits, targets, weight=weights)
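
Focal loss, also mentioned above, is not built into `torch.nn` as a single layer. A minimal sketch — the helper name is ours, and the `alpha=0.25, gamma=2.0` defaults follow common practice from the original paper (Lin et al., 2017):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Cross-entropy down-weighted on easy examples.

    The (1 - p_t)**gamma factor shrinks the loss for well-classified
    samples, so hard (often minority-class) examples dominate gradients.
    """
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)  # model probability of the true class
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()
```

Prefer class weights for mild imbalance; reach for focal loss when the imbalance is severe or when easy negatives dominate (e.g. dense detection).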

5) Hyperparameter Tuning (Practical)

  • Prioritize: learning rate, batch size, weight decay, model width/depth, dropout rate.
  • Search methods: random search beats a small grid; use Bayesian optimization or ASHA if available.
  • Budget wisely: use short runs (a few epochs) to prune poor configs, then train the survivors longer.

import math, random

def sample_config():
    return {
        "lr": 10 ** random.uniform(-4.5, -2.5),
        "batch_size": random.choice([32, 64, 128]),
        "weight_decay": 10 ** random.uniform(-6, -2),
        "dropout": random.choice([0.1, 0.25, 0.5]),
    }

best, best_score = None, -math.inf
for _ in range(20):  # 20 trials
    cfg = sample_config()
    score = quick_train_eval(cfg, epochs=5)  # lightweight proxy
    if score > best_score:
        best, best_score = cfg, score
print("Best (proxy):", best)

6) Monitoring & Reproducibility

  • Log train/val loss, learning rate, gradient norms, throughput.
  • Save config + code version + random seeds in each run.
  • Visualize curves to catch drift and instability early.

Seeding for reproducibility:

import torch, random, numpy as np
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
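
For the logging bullet, even a plain CSV per run goes a long way before reaching for a full experiment tracker. A minimal sketch — the class name, file name, and field list are illustrative:

```python
import csv

class RunLogger:
    """Append one row of metrics per epoch to a CSV file."""

    def __init__(self, path, fields=("epoch", "train_loss", "val_loss", "lr", "grad_norm")):
        self.path, self.fields = path, fields
        with open(path, "w", newline="") as f:
            csv.writer(f).writerow(fields)

    def log(self, **metrics):
        # Missing fields are written as empty strings, not errors
        with open(self.path, "a", newline="") as f:
            csv.writer(f).writerow([metrics.get(k, "") for k in self.fields])

logger = RunLogger("run.csv")
logger.log(epoch=0, train_loss=2.31, val_loss=2.29, lr=3e-4, grad_norm=1.7)
```

Store the config and git commit hash alongside the CSV so every curve can be traced back to exact code and hyperparameters.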

🧪 Quick Troubleshooting

  • Val loss increases early: lower LR; add/raise weight decay; add dropout; more aug.
  • Training very slow to improve: raise LR; better init; add normalization; check data pipeline.
  • NaN/Inf loss: lower LR; enable gradient clipping; check for invalid labels/inputs.
  • Class imbalance: use weighted loss, focal loss, or resampling; report per-class metrics.

Minimal Experiment Template (PyTorch)

class Experiment:
    def __init__(self, model, optimizer, scheduler=None, scaler=None, device="cuda"):
        self.model, self.opt, self.sched, self.scaler = model.to(device), optimizer, scheduler, scaler
        self.device = device

    def train_epoch(self, loader, loss_fn):
        self.model.train()
        total = 0.0
        for xb, yb in loader:
            xb, yb = xb.to(self.device), yb.to(self.device)
            self.opt.zero_grad(set_to_none=True)
            if self.scaler:
                with torch.cuda.amp.autocast():
                    loss = loss_fn(self.model(xb), yb)
                self.scaler.scale(loss).backward()
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
                self.scaler.step(self.opt)
                self.scaler.update()
            else:
                loss = loss_fn(self.model(xb), yb)
                loss.backward()
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
                self.opt.step()
            total += loss.item() * xb.size(0)
        return total / len(loader.dataset)

    @torch.no_grad()
    def evaluate(self, loader, loss_fn, metric_fn=None):
        self.model.eval()
        total, correct, count = 0.0, 0, 0
        for xb, yb in loader:
            xb, yb = xb.to(self.device), yb.to(self.device)
            logits = self.model(xb)
            total += loss_fn(logits, yb).item() * xb.size(0)
            if metric_fn:
                correct += metric_fn(logits, yb)
            count += xb.size(0)
        return total / len(loader.dataset), (correct / count if metric_fn else None)

✅ Checklist (Practical Order of Operations)

  • Establish strong data splits + baseline model
  • Tune learning rate (LR range test) and batch size
  • Add weight decay and augmentation
  • Introduce dropout/normalization if needed
  • Add LR schedule, gradient clipping, mixed precision
  • Implement early stopping + checkpointing
  • Run random search on key hyperparameters
  • Log everything; fix seeds; export best config + weights
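
The checklist's "LR range test" sweeps the learning rate exponentially over a short run and picks a value just below where the loss diverges. A sketch of generating the sweep itself — plain Python; feeding each value into a few training steps is left to your pipeline, and the bounds here are placeholders:

```python
def lr_sweep(lr_min=1e-6, lr_max=1.0, steps=100):
    """Exponentially spaced learning rates from lr_min to lr_max."""
    ratio = lr_max / lr_min
    return [lr_min * ratio ** (i / (steps - 1)) for i in range(steps)]

lrs = lr_sweep()
# Record loss at each lr during the sweep; a common heuristic (not a hard
# rule) is to pick ~1/10 of the lr where loss starts to blow up.
```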

The Importance of Optimizing Deep Learning Models

Optimizing deep learning models is not just about making them run; it is about making them efficient, stable, and generalizable. A well-optimized model achieves high accuracy on both training and unseen data while avoiding pitfalls like overfitting or unstable training.


✅ Why Optimization Matters

1. Efficient Convergence

  • Faster training means reduced time and computational cost.
  • Proper optimization ensures the model reaches a good solution without wasting resources.

2. Training Stability

  • Prevents issues like exploding or vanishing gradients.
  • Maintains smooth and consistent parameter updates.

3. Performance

  • Minimizes the loss function effectively.
  • Improves accuracy on both training and validation sets.

4. Generalization

  • Balances fitting the training data with avoiding overfitting or underfitting.
  • Ensures the model performs well on unseen data.

🔍 Key Strategies for Optimization

Continuous Monitoring

  • Track training and validation metrics.
  • Detect overfitting or underfitting early and adjust accordingly.

Regularization

  • L1, L2, Elastic Net: Penalize overly complex models.
  • Dropout: Randomly disable neurons during training to encourage generalization.
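
The guide above showed L2 via `weight_decay`; L1 and Elastic Net can instead be added as explicit penalty terms on the loss. A minimal sketch of the penalty arithmetic, in plain Python over a flat weight list for clarity — the `alpha`/`l1_ratio` parametrization follows scikit-learn's ElasticNet convention:

```python
def l1_penalty(weights):
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    return sum(w * w for w in weights)

def elastic_net(weights, alpha=1e-4, l1_ratio=0.5):
    """Convex mix of L1 and L2 penalties (sklearn-style parametrization)."""
    return alpha * (l1_ratio * l1_penalty(weights)
                    + (1 - l1_ratio) * 0.5 * l2_penalty(weights))

w = [0.5, -0.25, 1.0]
# total loss = data_loss + elastic_net(w) during training
```

L1 pushes small weights to exactly zero (sparsity); L2 shrinks all weights smoothly; Elastic Net interpolates between the two.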

Hyperparameter Tuning

  • Critical parameters: learning rate, batch size, momentum.
  • Proper tuning can dramatically improve performance.

Optimizer Selection

  • SGD: Strong results when tuned carefully.
  • Adaptive methods (Adam, RMSProp): Adjust learning rates automatically for faster convergence.

Advanced Techniques

  • Early Stopping: Halt training when validation performance plateaus.
  • Gradient Clipping: Prevent excessively large updates that cause instability.

🧠 Summary

Model optimization is a multifaceted process that balances:

  • Convergence speed
  • Training stability
  • Performance
  • Generalization

By combining strategies like regularization, hyperparameter tuning, and advanced optimization techniques, we can build models that are efficient, accurate, and robust for real-world challenges.


✅ Next Steps

  • Implement monitoring and logging in your training pipeline.
  • Experiment with different regularization and optimizer settings.
  • Use early stopping and gradient clipping for stability.