
📘 Study Note: Common Loss Functions in Deep Learning

In deep learning, a loss function measures how far a model’s predictions deviate from the true target values. During training, optimization algorithms (like SGD or Adam) use the loss as feedback to adjust model parameters (weights and biases). Choosing the right loss function is essential because it directly influences how effectively a model learns for a given task.


🔍 What Is a Loss Function?

A loss function:

  • Quantifies the error between predictions and true labels.
  • Guides the optimizer during backpropagation.
  • Helps the model gradually improve by minimizing this error.

Different tasks require different loss functions. The most common categories are:

  • Regression (predicting continuous values)
  • Binary classification (two classes)
  • Multiclass classification (three or more classes)

1. 📈 Loss Functions for Regression

Regression tasks involve predicting continuous numeric values (e.g., house prices, temperatures).


1.1 Mean Squared Error (MSE)

One of the most widely used regression losses.

Formula

[ \text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 ]

Key Characteristics

  • Penalizes large errors more strongly due to squaring
  • Always non‑negative
  • Sensitive to outliers

Typical Use Cases

  • Stock price prediction
  • Forecasting
  • Low‑noise regression tasks
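
As a quick illustration (the values below are made up), MSE can be computed directly in NumPy:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # true targets
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # model predictions

# Mean of squared residuals; squaring weights the size-1.0 error the most
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.375
```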

1.2 Mean Absolute Error (MAE)

Formula

[ \text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| ]

Key Characteristics

  • More robust to outliers than MSE
  • Penalizes deviations linearly
  • May converge more slowly because its gradient has constant magnitude and is non‑smooth at zero

Typical Use Cases

  • Noisy datasets
  • When large deviations should not be heavily penalized
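
To see the robustness difference in practice, here is a small sketch (illustrative data) where a single corrupted prediction inflates MSE far more than MAE:

```python
import numpy as np

y_true  = np.array([1.0, 2.0, 3.0, 4.0])
clean   = np.array([1.1, 1.9, 3.2, 3.8])
outlier = np.array([1.1, 1.9, 3.2, 40.0])   # one wildly wrong prediction

mse = lambda a, b: np.mean((a - b) ** 2)
mae = lambda a, b: np.mean(np.abs(a - b))

print(mse(y_true, clean), mse(y_true, outlier))   # ~0.025 vs ~324.0
print(mae(y_true, clean), mae(y_true, outlier))   # ~0.15  vs ~9.1
```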

2. ⚖️ Loss Functions for Binary Classification

Binary classification predicts one of two possible classes, usually encoded as 0 or 1.


2.1 Binary Cross‑Entropy (BCE)

Also known as log loss.

Formula

[ \text{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\right] ]

Key Characteristics

  • Measures closeness of predicted probabilities to true labels
  • Encourages confident and correct predictions

Typical Applications

  • Spam detection
  • Fraud detection
  • Medical diagnosis
  • Any yes/no classification task
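
A minimal NumPy sketch of BCE (predictions are clipped so log(0) never occurs; the data is illustrative):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip probabilities away from 0 and 1 to keep log() finite
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true    = np.array([1, 0, 1, 1])
confident = np.array([0.9, 0.1, 0.8, 0.95])   # confident and correct
hesitant  = np.array([0.6, 0.4, 0.5, 0.55])   # barely better than chance

print(binary_cross_entropy(y_true, confident))  # lower loss
print(binary_cross_entropy(y_true, hesitant))   # higher loss
```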

3. 🎨 Loss Functions for Multiclass Classification

Multiclass classification predicts one class out of many possible categories.


3.1 Categorical Cross‑Entropy (CCE)

Used when labels are one‑hot encoded.

Formula

[ \text{CCE} = -\sum_{i=1}^{N} \sum_{j=1}^{K} y_{ij} \log(\hat{y}_{ij}) ]

Key Characteristics

  • Compares predicted probability distribution with the true one‑hot encoding
  • Penalizes misclassification proportionally to predicted probability

Typical Applications

  • Image classification (CIFAR‑10, MNIST)
  • Text classification
  • Audio classification

3.2 Sparse Categorical Cross‑Entropy

When to Use

  • Labels are integer encoded (e.g., 0–9 for digit classes)
  • Avoiding one‑hot encoding for efficiency

Same mathematical idea as CCE, but suitable for integer labels.
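
The equivalence can be sketched in NumPy: one‑hot CCE and its sparse form give the same value (here averaged over the batch, as frameworks typically do; the numbers are illustrative):

```python
import numpy as np

probs = np.array([[0.7, 0.2, 0.1],    # predicted distributions
                  [0.1, 0.8, 0.1]])   # (2 samples, 3 classes)
onehot = np.array([[1, 0, 0],
                   [0, 1, 0]])        # one-hot labels
labels = np.array([0, 1])             # same labels, integer-encoded

# CCE with one-hot targets
cce = -np.mean(np.sum(onehot * np.log(probs), axis=1))

# Sparse variant: pick the correct-class probability by index
sparse = -np.mean(np.log(probs[np.arange(len(labels)), labels]))

print(cce, sparse)  # identical values
```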


4. 🧩 Specialized Loss Functions in Advanced Deep Learning

Some tasks require domain‑specific loss functions tailored to unique data structures.


4.1 Intersection over Union (IoU) Loss

Used for:

  • Object detection
  • Semantic segmentation

Measures overlap between predicted and true regions.
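
A sketch for binary segmentation masks (the loss is commonly taken as 1 − IoU; the masks below are illustrative):

```python
import numpy as np

pred = np.array([[1, 1, 0],
                 [1, 0, 0],
                 [0, 0, 0]])   # predicted binary mask
true = np.array([[1, 1, 0],
                 [0, 0, 0],
                 [1, 0, 0]])   # ground-truth binary mask

inter = np.logical_and(pred, true).sum()   # pixels in both masks
union = np.logical_or(pred, true).sum()    # pixels in either mask
iou = inter / union
print(iou, 1 - iou)   # IoU = 0.5, IoU loss = 0.5
```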


4.2 Dice Loss

Used for:

  • Medical image segmentation
  • Imbalanced segmentation datasets

Optimizes overlap between predicted and actual masks.
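
A minimal soft Dice loss sketch (predictions may be probabilities rather than hard 0/1 masks; the values are illustrative):

```python
import numpy as np

def dice_loss(pred, true, eps=1e-8):
    # Dice coefficient = 2|A ∩ B| / (|A| + |B|); loss = 1 - coefficient
    inter = np.sum(pred * true)
    return 1 - (2 * inter + eps) / (np.sum(pred) + np.sum(true) + eps)

pred = np.array([0.9, 0.8, 0.1, 0.2])   # soft predicted mask (flattened)
true = np.array([1.0, 1.0, 0.0, 0.0])   # ground-truth mask
print(dice_loss(pred, true))            # ≈ 0.15
```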


4.3 Sequence Loss

Used for:

  • Machine translation
  • Text generation
  • Speech recognition

Handles variable‑length sequence outputs.
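
One common concrete form is per‑token cross‑entropy with padding masked out. The sketch below assumes the model has already produced the probability assigned to the correct token at each position, and uses a 0/1 mask to ignore padding (all numbers are illustrative):

```python
import numpy as np

# Probability of the correct token at each position (batch=2, max_len=4)
token_probs = np.array([[0.9, 0.8, 0.7, 0.6],
                        [0.9, 0.5, 0.0, 0.0]])
# 1 for real tokens, 0 for padding (the second sequence is shorter)
mask = np.array([[1, 1, 1, 1],
                 [1, 1, 0, 0]], dtype=float)

# Negative log-likelihood, averaged over real tokens only
nll = -np.log(np.where(mask > 0, token_probs, 1.0))  # padding gives log(1) = 0
loss = (nll * mask).sum() / mask.sum()
print(loss)
```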


🧠 Summary

Choosing the right loss function is crucial for model performance:

Task Type | Recommended Loss Function
--- | ---
Regression | MSE, MAE
Binary Classification | Binary Cross‑Entropy
Multiclass Classification | Categorical Cross‑Entropy / Sparse Categorical Cross‑Entropy
Object Detection | IoU Loss
Segmentation | Dice Loss
Sequence Modeling | Sequence Loss

The loss function is the core driver of training—guiding the optimizer to reduce error and improve the model’s predictive accuracy.

Batch Gradient Descent (BGD)

TL;DR: Batch Gradient Descent updates model parameters by computing the gradient of the loss over the entire training set at each step. It’s stable, deterministic, and simple, but can be slow, memory-heavy, and may get stuck in local minima.


🚀 What It Is

Batch Gradient Descent (BGD) minimizes a loss function ( \mathcal{L}(\theta) ) by updating parameters ( \theta ) using the gradient computed over the full dataset.

  • Think of it as planning the “best path downhill” using all terrain data before each step.
  • Deterministic: Same data + same initialization = same exact training path.
  • Stable: Gradients are smooth since they aggregate over all training examples.

🔢 Update Rule

[ \theta \leftarrow \theta - \eta \cdot \nabla_{\theta} \mathcal{L}(\theta; \mathcal{D}) ]

Where:

  • ( \eta ) = learning rate
  • ( \mathcal{D} ) = entire training dataset
  • ( \nabla_{\theta} \mathcal{L} ) = gradient of the loss w.r.t. parameters

✅ Strengths

  • Stable updates due to full-dataset gradients.
  • Reproducible because each update is deterministic.
  • Simple to implement, great for beginners or baseline models.

⚠️ Limitations

  • Slow because each iteration requires processing the entire dataset.
  • Memory-intensive — must load or aggregate over all samples.
  • Can get stuck in local minima or saddle points in non-convex loss landscapes.
  • Slower feedback loop: updates only happen once per full pass.

🧭 When to Use

  • Dataset is small/medium and fits in memory.
  • You need reproducibility (research, verification).
  • Training stability is more important than speed.
  • As a baseline when comparing optimizers.

🔁 Pseudocode

# Batch Gradient Descent (BGD) - Pseudocode

initialize theta  # model parameters
for epoch in range(num_epochs):
    grad = gradient_over_full_dataset(theta, X_train, y_train)
    theta = theta - lr * grad
    
    # Optional monitoring
    loss = loss_over_dataset(theta, X_train, y_train)
    log(epoch=epoch, loss=loss)

🧪 Minimal NumPy Example (Linear Regression)

import numpy as np

# y = 3x + 2 + noise
np.random.seed(42)
X = np.random.rand(200, 1)
y = 3 * X + 2 + 0.1 * np.random.randn(200, 1)

# Add bias
Xb = np.c_[np.ones((len(X), 1)), X]

theta = np.zeros((2, 1))
lr = 0.1
epochs = 2000

def mse(theta, Xb, y):
    return np.mean((Xb @ theta - y) ** 2)

for epoch in range(epochs):
    y_pred = Xb @ theta
    grad = (2 / len(Xb)) * (Xb.T @ (y_pred - y))
    theta -= lr * grad
    
    if epoch % 200 == 0:
        print(f"epoch={epoch:4d} loss={mse(theta, Xb, y):.6f}")

print("Learned parameters [bias, weight]:", theta.ravel())

⚖️ BGD vs Mini-batch vs SGD

Batch Gradient Descent (batch = full data)
 + Very stable updates
 - Very slow, high memory usage

Mini-batch Gradient Descent (batch = 32 to 1024)
 + Best balance of speed + stability
 - Slightly noisy gradients

Stochastic Gradient Descent (batch = 1)
 + Very fast, good at escaping local minima
 - Highly noisy, unstable updates

🛠️ Practical Tips

  • Normalize features → faster convergence.
  • Start with a learning rate in the range ( 10^{-3} ) to ( 10^{-1} ).
  • Track loss curves to monitor training behavior.
  • Use learning rate schedules (step, cosine, exponential).
  • If dataset is large: switch to mini-batch.
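
The schedules mentioned above take only a few lines; here is a simple step decay (the function name and constants are illustrative, not prescriptive):

```python
def step_decay(base_lr, epoch, drop=0.5, every=50):
    # Multiply the learning rate by `drop` every `every` epochs
    return base_lr * (drop ** (epoch // every))

for epoch in (0, 49, 50, 100, 150):
    print(epoch, step_decay(0.1, epoch))  # 0.1, 0.1, 0.05, 0.025, 0.0125
```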

🧩 Common Pitfalls

  • Training too slow → use mini-batch.
  • Runs out of memory → stream data in batches.
  • Model stuck at poor minima → try Momentum or Adam.
  • Loss plateaus early → reduce learning rate.

📌 Helpful Checklists

Before Training

  • Normalize/standardize data
  • Dataset fits memory
  • Learning rate chosen
  • Seeds fixed (if reproducibility needed)

During Training

  • Log loss
  • Watch for plateaus
  • Check gradient norms
  • Validate on test/val sets

🧠 Intuition Diagram

Loss Surface (2D slice)
┌─────────────────────────────────────┐
│             • (start)               │
│               ↘                     │
│                 ↘                   │
│                   ↘                 │
│                     • (minimum)     │
│                                     │
│ Each arrow = one BGD step           │
│ Uses entire dataset → smooth path   │
└─────────────────────────────────────┘

🔁 Related Optimizers

  • Momentum
  • Nesterov Momentum
  • Adam
  • Adagrad
  • RMSProp
  • L-BFGS

Stochastic Gradient Descent (SGD)

TL;DR: Stochastic Gradient Descent updates model parameters using the gradient from one sample at a time. It is fast, memory‑efficient, and can escape local minima, but introduces noise, zig‑zag convergence, and may require more iterations.


🚀 What It Is

Stochastic Gradient Descent (SGD) optimizes a model by updating parameters using the gradient computed from a single training example rather than the whole dataset.

  • Imagine running downhill while adjusting your path based only on your immediate step.
  • This introduces randomness, making SGD less stable but more flexible.
  • Great for streaming or online learning since it can update the model as new data arrives.

🔢 Update Rule

[ \theta \leftarrow \theta - \eta \cdot \nabla_\theta \mathcal{L}(\theta; x_i, y_i) ]

Where:

  • ( \eta ): learning rate
  • ( (x_i, y_i) ): a single training example
  • ( \nabla_\theta \mathcal{L} ): gradient from this example only

✅ Strengths

⚡ Fast Updates

  • Each update processes just one sample → extremely fast iteration.

💾 Memory Efficient

  • Only one example is needed at a time → good for massive datasets.

🌀 Can Escape Local Minima

  • Random noise helps SGD jump out of poor local minima and explore better solutions.

🔄 Online & Streaming Friendly

  • Suitable for real-time systems where data comes continuously.

⚠️ Limitations

📉 High Variance Updates

  • Each step may move in wildly different directions.
  • Convergence path looks noisy and unpredictable.

🐢 More Iterations Needed

  • The noisy path often requires more total updates to reach good convergence.

↔️ Zig‑Zag Behavior

  • Especially in ravine-shaped loss surfaces, SGD may oscillate, making convergence slower.

💻 Limited Parallelization

  • Since only one sample is processed at a time, hard to leverage multi-core CPUs or GPUs efficiently.

🧭 When to Use

  • For large datasets that do not fit into memory.
  • When speed of updates is important.
  • For online learning or streaming data.
  • When you want an optimizer that can escape local minima.

🔁 Pseudocode

# Stochastic Gradient Descent (SGD) - Pseudocode

initialize theta

for epoch in range(num_epochs):
    for (x_i, y_i) in training_data:
        grad = gradient_of_loss(theta, x_i, y_i)
        theta = theta - lr * grad

🧪 Minimal NumPy Example

import numpy as np

# y = 3x + 2 + noise
np.random.seed(42)
X = np.random.rand(200, 1)
y = 3 * X + 2 + 0.1 * np.random.randn(200, 1)

# Add bias
Xb = np.c_[np.ones((len(X), 1)), X]

theta = np.zeros((2, 1))
lr = 0.05
epochs = 10

for epoch in range(epochs):
    # Visit samples in a fresh random order each epoch
    for i in np.random.permutation(len(Xb)):
        xi = Xb[i:i+1]
        yi = y[i:i+1]

        grad = 2 * xi.T @ (xi @ theta - yi)
        theta -= lr * grad

print("Learned parameters [bias, weight]:", theta.ravel())

⚖️ SGD vs Mini-batch vs Batch GD

SGD (batch = 1)
 + Very fast updates
 + Escapes local minima
 - Noisy, unstable path
 - Harder to parallelize

Mini-batch GD (batch = 32–1024)
 + Best overall trade-off
 + Works well with GPUs
 - Slight noise, but manageable

Batch GD (batch = full dataset)
 + Stable, smooth convergence
 - Very slow
 - Memory heavy

🛠️ Practical Tips

  • Use learning rate decay to stabilize late-stage training.
  • Shuffle the dataset each epoch.
  • Use momentum or advanced methods like SGD + Momentum, Nesterov, or Adam.
  • Track moving average of loss to see true progress.

🧩 Common Pitfalls

  • Too noisy → reduce lr or increase batch size (switch to mini-batch).
  • Stuck oscillating → add momentum.
  • Training taking too long → adjust lr schedule.
  • Poor hardware utilization → consider mini-batch for GPU training.

🧠 Intuition Diagram

Loss Surface (SGD Path)
┌───────────────────────────────┐
│   • start                     │
│     ↘   ↗  ↘   ↗              │
│       ↘     ↗   ↘             │
│          ↘        ↗           │
│               ✦ minimum       │
│  Noisy zig-zag path           │
└───────────────────────────────┘

🔁 Related Optimizers

  • SGD + Momentum
  • Nesterov Accelerated Gradient
  • Adam
  • RMSProp
  • Adagrad

Mini-Batch Gradient Descent

TL;DR: Mini‑Batch Gradient Descent updates parameters using small batches (e.g., 32–1024 samples). It blends the stability of Batch GD with the speed + noise‑benefits of SGD. It is the most widely used optimization approach in deep learning.


🚀 What It Is

Mini‑Batch Gradient Descent computes gradients using a small subset of training examples (a mini‑batch), where:

  • Batch size > 1 (unlike SGD)
  • Batch size < total dataset (unlike Batch GD)
  • Acts like navigating downhill using information from a small group of nearby paths.

This approach provides a balance between computational efficiency and stable convergence.


🔢 Update Rule

Given a mini‑batch ( B ) of examples ( (x_i, y_i) ):

[ \theta \leftarrow \theta - \eta \cdot \frac{1}{|B|} \sum_{i \in B} \nabla_\theta \mathcal{L}(\theta; x_i, y_i) ]

Where:

  • ( \eta ) = learning rate
  • ( |B| ) = batch size
  • ( \nabla_\theta \mathcal{L} ) = gradient for each sample in the batch

✅ Strengths

⚡ Computationally Efficient

  • Uses vectorization and hardware acceleration (GPUs/TPUs).
  • Faster than Batch GD since gradients on small batches run efficiently.

🔁 More Stable Than SGD

  • Averages gradient over multiple samples → less noise.
  • Converges more smoothly than SGD but retains flexibility.

⚙️ Flexible Batch Size

  • Adjust batch size based on:
    • memory limits
    • hardware capabilities
    • dataset size
    • desired training dynamics

📉 Typically Faster Overall Convergence

  • Combines:
    • the speed of SGD
    • the stability of Batch GD

⚠️ Limitations

🎚️ Choosing the Best Batch Size Is Tricky

  • Too small → noisy updates (similar to SGD).
  • Too large → slow, memory‑heavy (similar to Batch GD).

💾 Larger Batches Need More Memory

  • GPU memory constraints can restrict feasible batch sizes.

🎯 Risk of Suboptimal Convergence

  • Poorly chosen batch sizes may not capture enough data diversity.
  • Gradients may not approximate true gradient well → suboptimal minima.

🧭 When to Use

Mini‑Batch Gradient Descent is ideal when:

  • Training deep learning models on GPUs/TPUs
  • Dataset is too large for Batch GD
  • You want a good trade‑off between:
    • speed
    • stability
    • convergence quality

It is the default optimization method in modern deep learning frameworks.


🔁 Pseudocode

# Mini-Batch Gradient Descent (Pseudocode)

initialize theta

for epoch in range(num_epochs):
    shuffle(training_data)

    for batch in mini_batches(training_data, batch_size):
        grad = gradient_over_batch(theta, batch)
        theta = theta - lr * grad

🧪 Minimal NumPy Example

import numpy as np

# y = 3x + 2 + noise
np.random.seed(42)
X = np.random.rand(200, 1)
y = 3 * X + 2 + 0.1 * np.random.randn(200, 1)

# Add bias
Xb = np.c_[np.ones((len(X), 1)), X]

theta = np.zeros((2, 1))
lr = 0.1
epochs = 20
batch_size = 16

def mse(theta, Xb, y):
    return np.mean((Xb @ theta - y) ** 2)

for epoch in range(epochs):
    idx = np.random.permutation(len(Xb))
    X_shuf, y_shuf = Xb[idx], y[idx]

    for i in range(0, len(Xb), batch_size):
        X_batch = X_shuf[i:i+batch_size]
        y_batch = y_shuf[i:i+batch_size]

        grad = (2 / len(X_batch)) * (X_batch.T @ (X_batch @ theta - y_batch))
        theta -= lr * grad

print("Learned parameters [bias, weight]:", theta.ravel())

⚖️ Comparison Summary

Batch Gradient Descent (whole dataset)
 + Very stable
 - Very slow, memory-heavy

SGD (1 sample)
 + Very fast, good for online learning
 - Very noisy, unstable

Mini-Batch GD (e.g., 32–1024 samples)
 + Fast AND stable
 + Best for GPU training
 + Default choice in deep learning
 - Needs tuning of batch size

🛠️ Practical Tips

  • Common batch sizes: 32, 64, 128, 256
  • If GPU memory allows, try increasing batch size to speed up training
  • Use learning rate decay for smoother convergence
  • Shuffle dataset each epoch
  • Use momentum, Adam, or RMSProp for even better performance

🧠 Intuition Diagram

Mini-Batch Path
┌────────────────────────────────────┐
│    • start                         │
│      ↘      ↘                      │
│        ↘        ↘                  │
│          ↘         ↘               │
│             • (minimum)            │
│ Less noisy than SGD, faster than BD│
└────────────────────────────────────┘

🔁 Related Optimizers

  • SGD + Momentum
  • Nesterov Momentum
  • Adam (most common)
  • RMSProp
  • Adagrad

AdaGrad (Adaptive Gradient Algorithm)

TL;DR: AdaGrad adapts the learning rate per parameter by scaling it inversely to the square root of all past squared gradients. It is excellent for sparse data and NLP, but suffers from continually shrinking learning rates that can stop learning early.


🚀 What It Is

AdaGrad (Adaptive Gradient Algorithm) is an optimization technique that automatically adjusts the learning rate for each parameter based on historical gradient information.

  • Parameters with frequent large gradients → smaller learning rate
  • Parameters with rare or small gradients → larger learning rate

This makes AdaGrad especially useful for:

  • Sparse datasets
  • Natural Language Processing
  • Models with features of uneven frequency

🤖 Why Adaptive Learning Rates Matter

The learning rate controls how big a step the model takes toward minimizing the loss.

  • Too high → diverges or overshoots
  • Too low → slow learning or stuck in flat regions
  • Fixed learning rates (SGD) often require manual tuning

AdaGrad solves this by dynamically adjusting the learning rate during training.


🔢 Update Rule

Let:

  • ( g_t ) = gradient at time step ( t )
  • ( G_t ) = sum of squares of all past gradients

[ G_t = G_{t-1} + g_t^2 ]

AdaGrad update:

[ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \cdot g_t ]

Where:

  • ( \eta ) = initial learning rate
  • ( \epsilon ) = small constant to avoid division by zero

✅ Strengths

🎛️ 1. Automatically Adjusts Learning Rates

No need to hand-tune per-parameter learning rates.
Each parameter scales based on its behavior over time.

🌐 2. Excellent for Sparse Data

Infrequently updated parameters get higher learning rates, making AdaGrad ideal for:

  • NLP
  • Recommender systems
  • Sparse linear models

🧩 3. Simple to Implement

Builds on standard gradient descent with one additional accumulator.

⚡ 4. Faster Convergence (Early Training)

Effective in the early phases thanks to adaptive scaling.


⚠️ Limitations

🪫 1. Learning Rates Shrink Too Much

Since squared gradients accumulate forever,
( \sqrt{G_t} ) becomes very large → learning rates become extremely small.

This causes:

  • Training slowdown
  • Premature stopping
  • Poor long-term performance

🔄 2. No Mechanism to Reset or Forget

Gradient accumulation grows monotonically → AdaGrad cannot “recover” once learning stagnates.

(Optimizers like RMSProp and Adam were created to fix this.)

💾 3. Higher Memory Usage

Must store squared gradient history for every parameter → expensive for large models.


🧭 When to Use AdaGrad

Use AdaGrad when:

  • Working with sparse features
  • Training NLP or text-based models
  • Parameters update infrequently
  • You want a simple, adaptive optimizer

Avoid AdaGrad for:

  • Long training runs
  • Very deep models
  • Dense, large-scale tasks (CV, speech, transformer models)

🔁 Pseudocode

# AdaGrad Pseudocode

initialize theta
initialize G = 0  # accumulator for squared gradients

for each iteration:
    g = gradient(theta)
    G = G + g * g
    theta = theta - (lr / (sqrt(G) + epsilon)) * g

🧪 Minimal NumPy Example

import numpy as np

# Example: optimizing a simple quadratic function f(x) = x^2

lr = 0.1
epsilon = 1e-8

theta = np.array([5.0])          # initial parameter
G = np.zeros_like(theta)          # accumulator

def grad(theta):
    return 2 * theta              # derivative of f(x)=x^2

for t in range(1, 101):
    g = grad(theta)
    G += g ** 2
    adjusted_lr = lr / (np.sqrt(G) + epsilon)
    theta -= adjusted_lr * g

print("Final theta:", theta)

⚖️ Comparison with Other Optimizers

SGD
 + Simple, low memory
 - Fixed learning rate

AdaGrad
 + Adaptive per-parameter learning rate
 + Great for sparse data
 - Learning rate decays too much

RMSProp
 + Fixes AdaGrad’s decaying rate issue
 - Adds exponential decay

Adam
 + Adaptive + momentum
 + Most widely used today

🧠 Intuition Diagram

AdaGrad Behavior
┌──────────────────────────────────────┐
│ Large gradients → G increases        │
│                 → learning rate ↓    │
│                                      │
│ Small gradients → G small            │
│                 → learning rate ↑    │
│                                      │
│ Eventually learning rate becomes too │
│ small → progress slows dramatically  │
└──────────────────────────────────────┘
RMSProp (Root Mean Square Propagation)

TL;DR: RMSProp fixes AdaGrad’s biggest issue—rapidly shrinking learning rates—by using an exponential moving average of squared gradients. This keeps learning rates adaptive without decaying too fast. It works well for non‑stationary, noisy, and sparse problems, including RNNs.


🚀 What It Is

RMSProp is an adaptive learning rate optimization algorithm designed to overcome the limitations of AdaGrad.

While AdaGrad accumulates all past squared gradients (making learning rates shrink too quickly), RMSProp uses a running exponential moving average, allowing the optimizer to:

  • Forget older gradients
  • Focus on recent gradient behavior
  • Maintain learning rates at useful scales

This makes RMSProp more reliable for long training runs and dynamic tasks.


🔢 Update Rule

RMSProp maintains an exponentially decaying average of squared gradients:

[ E[g^2]_t = \beta E[g^2]_{t-1} + (1 - \beta) g_t^2 ]

Parameters are updated using:

[ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} \cdot g_t ]

Where:

  • ( \eta ) = learning rate
  • ( \beta ) = decay rate (commonly 0.9)
  • ( \epsilon ) = smoothing constant (e.g., ( 10^{-8} ))
  • ( g_t ) = gradient at time ( t )

✅ Strengths

🎚️ 1. Prevents Learning Rate Collapse

Unlike AdaGrad, RMSProp does not let accumulated squared gradients grow indefinitely, so learning rates stay effective throughout training.

🎢 2. Great for Non‑stationary Objectives

Ideal when the data distribution changes over time, such as:

  • Reinforcement learning
  • Streaming data
  • Time‑series models

🧠 3. Handles Noisy and Sparse Gradients

RMSProp remains stable even with highly variable gradient signals.

🔁 4. Good Fit for RNNs

Less sensitive to exploding/vanishing gradients → better RNN training stability.

🧩 5. Simple to Implement

Just adds a decaying average term on top of AdaGrad.


⚠️ Limitations

⚙️ 1. Extra Hyperparameters

Requires careful tuning of:

  • Decay rate ( \beta )
  • Learning rate ( \eta )

Poor choices may cause divergence or slow convergence.

🔄 2. Convergence Issues

RMSProp may:

  • Converge to suboptimal solutions
  • Struggle on some loss surfaces
  • Require learning rate schedule adjustments

📘 3. Weaker Theoretical Guarantees

RMSProp has fewer formal convergence guarantees than some other optimizers, which can make its behavior harder to predict and debug.


🧭 When to Use RMSProp

RMSProp is a strong choice for:

  • Recurrent neural networks (LSTM, GRU)
  • Reinforcement learning agents
  • Noisy or sparse datasets
  • Non‑stationary problems

It is often used as a practical middle ground between AdaGrad and Adam.


🔁 Pseudocode

# RMSProp Pseudocode

initialize theta
initialize E = 0         # running average of squared gradients
beta = 0.9               # decay rate

for each iteration:
    g = gradient(theta)
    E = beta * E + (1 - beta) * (g * g)
    theta = theta - (lr / (sqrt(E) + epsilon)) * g

🧪 Minimal NumPy Example

import numpy as np

lr = 0.01
beta = 0.9
epsilon = 1e-8

theta = np.array([5.0])
E = np.zeros_like(theta)

def grad(theta):
    return 2 * theta   # derivative of f(x) = x^2

for t in range(1, 101):
    g = grad(theta)
    E = beta * E + (1 - beta) * (g ** 2)
    theta -= (lr / (np.sqrt(E) + epsilon)) * g

print("Final theta:", theta)

⚖️ Comparison with Other Optimizers

AdaGrad
 + Adaptive learning rates
 - Rates shrink too much → premature stopping

RMSProp
 + Fixes AdaGrad's decay issue with moving average
 + Great for RNNs and non-stationary tasks
 - Needs tuning of decay rate

Adam
 + RMSProp + Momentum
 + Most commonly used today

🧠 Intuition Diagram

  • RMSProp Moving Average Concept
  • Old gradients fade → new gradients matter more
  • E[g^2] = 0.9 * previous + 0.1 * current

Result:

  • Learning rate stays healthy
  • Training continues progressing

Adam (Adaptive Moment Estimation)

TL;DR: Adam combines Momentum and RMSProp, using exponentially decaying averages of past gradients (first moment) and squared gradients (second moment). It provides adaptive learning rates, momentum-driven updates, and generally converges fast with minimal tuning, making it a default choice in deep learning.


🚀 What It Is

Adam, short for Adaptive Moment Estimation, is an optimization algorithm that improves upon RMSProp by incorporating momentum in addition to adaptive learning rates.

It keeps track of:

  • First moment (m) → exponential moving average of gradients
  • Second moment (v) → exponential moving average of squared gradients

By combining both:

  • Adam adapts learning rates per parameter
  • Adam smooths out updates with momentum
  • Adam avoids AdaGrad’s learning rate decay issue

This results in fast, stable, efficient optimization.


🔢 Update Rule

Let:

  • ( g_t ) = gradient at step ( t )
  • ( m_t ) = first moment (mean)
  • ( v_t ) = second moment (variance)

Moment Updates

[ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t ] [ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 ]

Bias Correction

[ \hat{m}_t = \frac{m_t}{1 - \beta_1^t} ] [ \hat{v}_t = \frac{v_t}{1 - \beta_2^t} ]

Parameter Update

[ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t ]

Defaults (work well in practice):

  • beta1 = 0.9
  • beta2 = 0.999
  • epsilon = 1e-8


✅ Strengths

⚡ Fast Convergence

Combines momentum + adaptive rates → reaches good solutions quickly.

🎛️ Excellent Default Optimizer

Works well across many tasks with default hyperparameters.
Requires less tuning than RMSProp or AdaGrad.

🧠 Momentum Stabilizes Updates

Smooths noisy gradients, prevents oscillations, accelerates directionally consistent updates.

📐 Handles Large-Scale and High-Dimensional Models

Efficiently manages large parameter spaces such as:

  • Deep neural networks
  • CNNs
  • RNNs
  • Transformers

🧮 Computationally Efficient

  • Low overhead
  • Few additional operations
  • Works well on GPUs/TPUs

⚠️ Limitations

🎯 Risk of Overfitting

Because Adam converges quickly, it may overfit unless:

  • Early stopping
  • Weight decay
  • Dropout
  • Regularization techniques

are applied.

💾 Requires Extra Memory

Must store two moment vectors (m and v) → doubles memory use.

🎚️ Hyperparameter Sensitivity (in some tasks)

Although defaults work well generally, certain tasks (e.g., RL) may require careful tuning.

🧩 Sometimes Converges to Worse Minima

Some studies show that Adam may converge to:

  • Worse generalization performance
  • Solutions with higher test error

compared to SGD + Momentum.

(This led to the creation of AdamW, which decouples weight decay.)


🧭 When to Use Adam

Adam is ideal when:

  • Working with large datasets
  • Training very deep networks
  • Gradients are noisy or sparse
  • You need fast, reliable convergence
  • You want a strong default optimizer

Most deep learning frameworks choose Adam as the default optimizer.


🔁 Pseudocode

# Adam Optimizer (Pseudocode)

initialize theta
initialize m = 0
initialize v = 0
t = 0

while training:
    t += 1
    g = gradient(theta)

    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * (g * g)

    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)

    theta = theta - (lr / (sqrt(v_hat) + epsilon)) * m_hat

🧪 Minimal NumPy Example

import numpy as np

lr = 0.1
beta1 = 0.9
beta2 = 0.999
epsilon = 1e-8

theta = np.array([5.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)

def grad(theta):
    return 2 * theta  # derivative of f(x)=x^2

for t in range(1, 101):
    g = grad(theta)

    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * (g ** 2)

    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)

    theta -= (lr / (np.sqrt(v_hat) + epsilon)) * m_hat

print("Final theta:", theta)

⚖️ Comparison with Other Optimizers

SGD
 + Simple, strong generalization
 - Requires tuning, slower convergence

AdaGrad
 + Adaptive learning rates 
 - Learning rate decays too fast

RMSProp
 + Stabilizes AdaGrad
 - No momentum for fast direction tracking

Adam
 + RMSProp + Momentum
 + Fast, stable, adaptive
 + Best default choice

🧠 Intuition Diagram

Adam = RMSProp (adaptive learning rate)
     + Momentum (smooth, fast updates)

1st moment  -> tracks direction
2nd moment  -> scales updates
bias correction -> prevents early bias

📘 Study Notes: Parameters vs. Hyperparameters in Machine Learning

🔹 Overview

In machine learning and deep learning, parameters and hyperparameters are foundational concepts, each playing a distinct role in how models are built, trained, and optimized.


🔧 Parameters

Definition

Parameters are the internal model variables learned automatically from training data during model training.
They are updated by optimization algorithms (like gradient descent) to minimize the loss function.

Characteristics

  • Learned during training
  • Not set manually
  • Define how the model behaves for given inputs
  • Directly affect predictions

Examples

1. Linear Regression

For the model:

[ Y = WX + B ]

  • W = weight (slope)
  • B = bias (intercept)
    Both W and B are parameters learned during training.

2. Neural Networks

In neural networks, parameters include:

  • Weights between neurons
  • Biases for neurons

Example network:

  • Input layer: 784 neurons
  • Hidden layer 1: 512 neurons
  • Hidden layer 2: 128 neurons
  • Output layer: 10 neurons

Parameter counts:

  • Input → Hidden1: 784 × 512 weights
  • Hidden1 → Hidden2: 512 × 128 weights
  • Hidden2 → Output: 128 × 10 weights
  • Biases for each neuron in hidden + output layers

All these are optimized during training using algorithms such as Stochastic Gradient Descent (SGD).
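
The counts above can be verified with a short script (layer sizes taken from the example network):

```python
layers = [784, 512, 128, 10]  # input, hidden1, hidden2, output

total = 0
for n_in, n_out in zip(layers[:-1], layers[1:]):
    weights = n_in * n_out    # one weight per connection
    biases = n_out            # one bias per neuron in the receiving layer
    total += weights + biases
    print(f"{n_in} -> {n_out}: {weights} weights + {biases} biases")

print("Total parameters:", total)  # 468874 for this architecture
```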


⚙️ Hyperparameters

Definition

Hyperparameters are external configuration settings defined before training begins.
Unlike parameters, they are not learned from data. Instead, they control the training process or model structure.

Characteristics

  • Set manually by the practitioner
  • Often tuned via trial & error or search strategies
  • Control training behavior and model architecture
  • Not updated during training

Analogy: Building a House

  • Parameters → materials (bricks, cement, wood)
  • Hyperparameters → blueprint (number of rooms, layout)

Common Hyperparameters

  • Learning rate: step size during parameter updates
  • Batch size: number of samples per training step
  • Number of epochs: full passes over the dataset
  • Network architecture choices:
    • Number of layers
    • Number of neurons per layer
    • Activation functions

🔍 Key Differences at a Glance

Aspect | Parameters | Hyperparameters
--- | --- | ---
Learned from data? | ✔️ Yes | ❌ No
Set manually? | ❌ No | ✔️ Yes
Role | Define model behavior | Control training & architecture
Examples | Weights, biases | Learning rate, batch size, epochs

📝 Summary

  • Parameters are the internal values (weights, biases) learned by the model to make accurate predictions.
  • Hyperparameters define how the model is trained and structured, and must be chosen before training.
  • Proper hyperparameter tuning is essential for good model performance.