
📘 Study Note: Common Loss Functions in Deep Learning

In deep learning, a loss function measures how far a model’s predictions deviate from the true target values. During training, optimization algorithms (like SGD or Adam) use the loss as feedback to adjust model parameters (weights and biases). Choosing the right loss function is essential because it directly influences how effectively a model learns for a given task.


🔍 What Is a Loss Function?

A loss function:

  • Quantifies the error between predictions and true labels.
  • Guides the optimizer during backpropagation.
  • Helps the model gradually improve by minimizing this error.

Different tasks require different loss functions. The most common categories are:

  • Regression (predicting continuous values)
  • Binary classification (two classes)
  • Multiclass classification (three or more classes)

1. 📈 Loss Functions for Regression

Regression tasks involve predicting continuous numeric values (e.g., house prices, temperatures).


1.1 Mean Squared Error (MSE)

One of the most widely used regression losses.

Formula

[ \text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 ]

Key Characteristics

  • Penalizes large errors more strongly due to squaring
  • Always non‑negative
  • Sensitive to outliers

Typical Use Cases

  • Stock price prediction
  • Forecasting
  • Low‑noise regression tasks
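
As a quick illustration (the values below are made up), MSE can be computed directly in NumPy:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # true targets
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # model predictions

# Mean of squared residuals; squaring weights the size-1.0 error the most
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.375
```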

1.2 Mean Absolute Error (MAE)

Formula

[ \text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| ]

Key Characteristics

  • More robust to outliers than MSE
  • Penalizes deviations linearly
  • May converge more slowly because its gradient has constant magnitude and is non‑smooth at zero

Typical Use Cases

  • Noisy datasets
  • When large deviations should not be heavily penalized
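
To see the robustness difference in practice, here is a small sketch (illustrative data) where a single corrupted prediction inflates MSE far more than MAE:

```python
import numpy as np

y_true  = np.array([1.0, 2.0, 3.0, 4.0])
clean   = np.array([1.1, 1.9, 3.2, 3.8])
outlier = np.array([1.1, 1.9, 3.2, 40.0])   # one wildly wrong prediction

mse = lambda a, b: np.mean((a - b) ** 2)
mae = lambda a, b: np.mean(np.abs(a - b))

print(mse(y_true, clean), mse(y_true, outlier))   # ~0.025 vs ~324.0
print(mae(y_true, clean), mae(y_true, outlier))   # ~0.15  vs ~9.1
```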

2. ⚖️ Loss Functions for Binary Classification

Binary classification predicts one of two possible classes, usually encoded as 0 or 1.


2.1 Binary Cross‑Entropy (BCE)

Also known as log loss.

Formula

[ \text{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\right] ]

Key Characteristics

  • Measures closeness of predicted probabilities to true labels
  • Encourages confident and correct predictions

Typical Applications

  • Spam detection
  • Fraud detection
  • Medical diagnosis
  • Any yes/no classification task
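
A minimal NumPy sketch of BCE (predictions are clipped so log(0) never occurs; the data is illustrative):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip probabilities away from 0 and 1 to keep log() finite
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true    = np.array([1, 0, 1, 1])
confident = np.array([0.9, 0.1, 0.8, 0.95])   # confident and correct
hesitant  = np.array([0.6, 0.4, 0.5, 0.55])   # barely better than chance

print(binary_cross_entropy(y_true, confident))  # lower loss
print(binary_cross_entropy(y_true, hesitant))   # higher loss
```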

3. 🎨 Loss Functions for Multiclass Classification

Multiclass classification predicts one class out of many possible categories.


3.1 Categorical Cross‑Entropy (CCE)

Used when labels are one‑hot encoded.

Formula

[ \text{CCE} = -\sum_{i=1}^{N} \sum_{j=1}^{K} y_{ij} \log(\hat{y}_{ij}) ]

Key Characteristics

  • Compares predicted probability distribution with the true one‑hot encoding
  • Penalizes misclassification proportionally to predicted probability

Typical Applications

  • Image classification (CIFAR‑10, MNIST)
  • Text classification
  • Audio classification

3.2 Sparse Categorical Cross‑Entropy

When to Use

  • Labels are integer encoded (e.g., 0–9 for digit classes)
  • Avoiding one‑hot encoding for efficiency

Same mathematical idea as CCE, but suitable for integer labels.
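
The equivalence can be sketched in NumPy: one‑hot CCE and its sparse form give the same value (here averaged over the batch, as frameworks typically do; the numbers are illustrative):

```python
import numpy as np

probs = np.array([[0.7, 0.2, 0.1],    # predicted distributions
                  [0.1, 0.8, 0.1]])   # (2 samples, 3 classes)
onehot = np.array([[1, 0, 0],
                   [0, 1, 0]])        # one-hot labels
labels = np.array([0, 1])             # same labels, integer-encoded

# CCE with one-hot targets
cce = -np.mean(np.sum(onehot * np.log(probs), axis=1))

# Sparse variant: pick the correct-class probability by index
sparse = -np.mean(np.log(probs[np.arange(len(labels)), labels]))

print(cce, sparse)  # identical values
```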


4. 🧩 Specialized Loss Functions in Advanced Deep Learning

Some tasks require domain‑specific loss functions tailored to unique data structures.


4.1 Intersection over Union (IoU) Loss

Used for:

  • Object detection
  • Semantic segmentation

Measures overlap between predicted and true regions.
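
A sketch for binary segmentation masks (the loss is commonly taken as 1 − IoU; the masks below are illustrative):

```python
import numpy as np

pred = np.array([[1, 1, 0],
                 [1, 0, 0],
                 [0, 0, 0]])   # predicted binary mask
true = np.array([[1, 1, 0],
                 [0, 0, 0],
                 [1, 0, 0]])   # ground-truth binary mask

inter = np.logical_and(pred, true).sum()   # pixels in both masks
union = np.logical_or(pred, true).sum()    # pixels in either mask
iou = inter / union
print(iou, 1 - iou)   # IoU = 0.5, IoU loss = 0.5
```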


4.2 Dice Loss

Used for:

  • Medical image segmentation
  • Imbalanced segmentation datasets

Optimizes overlap between predicted and actual masks.
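
A minimal soft Dice loss sketch (predictions may be probabilities rather than hard 0/1 masks; the values are illustrative):

```python
import numpy as np

def dice_loss(pred, true, eps=1e-8):
    # Dice coefficient = 2|A ∩ B| / (|A| + |B|); loss = 1 - coefficient
    inter = np.sum(pred * true)
    return 1 - (2 * inter + eps) / (np.sum(pred) + np.sum(true) + eps)

pred = np.array([0.9, 0.8, 0.1, 0.2])   # soft predicted mask (flattened)
true = np.array([1.0, 1.0, 0.0, 0.0])   # ground-truth mask
print(dice_loss(pred, true))            # ≈ 0.15
```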


4.3 Sequence Loss

Used for:

  • Machine translation
  • Text generation
  • Speech recognition

Handles variable‑length sequence outputs.
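
One common concrete form is per‑token cross‑entropy with padding masked out. The sketch below assumes the model has already produced the probability assigned to the correct token at each position, and uses a 0/1 mask to ignore padding (all numbers are illustrative):

```python
import numpy as np

# Probability of the correct token at each position (batch=2, max_len=4)
token_probs = np.array([[0.9, 0.8, 0.7, 0.6],
                        [0.9, 0.5, 0.0, 0.0]])
# 1 for real tokens, 0 for padding (the second sequence is shorter)
mask = np.array([[1, 1, 1, 1],
                 [1, 1, 0, 0]], dtype=float)

# Negative log-likelihood, averaged over real tokens only
nll = -np.log(np.where(mask > 0, token_probs, 1.0))  # padding gives log(1) = 0
loss = (nll * mask).sum() / mask.sum()
print(loss)
```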


🧠 Summary

Choosing the right loss function is crucial for model performance:

Task Type | Recommended Loss Function
--- | ---
Regression | MSE, MAE
Binary Classification | Binary Cross‑Entropy
Multiclass Classification | Categorical Cross‑Entropy / Sparse Categorical Cross‑Entropy
Object Detection | IoU Loss
Segmentation | Dice Loss
Sequence Modeling | Sequence Loss

The loss function is the core driver of training—guiding the optimizer to reduce error and improve the model’s predictive accuracy.

Batch Gradient Descent (BGD)

TL;DR: Batch Gradient Descent updates model parameters by computing the gradient of the loss over the entire training set at each step. It’s stable, deterministic, and simple, but can be slow, memory-heavy, and may get stuck in local minima.


🚀 What It Is

Batch Gradient Descent (BGD) minimizes a loss function ( \mathcal{L}(\theta) ) by updating parameters ( \theta ) using the gradient computed over the full dataset.

  • Think of it as planning the “best path downhill” using all terrain data before each step.
  • Deterministic: Same data + same initialization = same exact training path.
  • Stable: Gradients are smooth since they aggregate over all training examples.

🔢 Update Rule

[ \theta \leftarrow \theta - \eta \cdot \nabla_{\theta} \mathcal{L}(\theta; \mathcal{D}) ]

Where:

  • ( \eta ) = learning rate
  • ( \mathcal{D} ) = entire training dataset
  • ( \nabla_{\theta} \mathcal{L} ) = gradient of the loss w.r.t. parameters

✅ Strengths

  • Stable updates due to full-dataset gradients.
  • Reproducible because each update is deterministic.
  • Simple to implement, great for beginners or baseline models.

⚠️ Limitations

  • Slow because each iteration requires processing the entire dataset.
  • Memory-intensive — must load or aggregate over all samples.
  • Can get stuck in local minima or saddle points in non-convex loss landscapes.
  • Slower feedback loop: updates only happen once per full pass.

🧭 When to Use

  • Dataset is small/medium and fits in memory.
  • You need reproducibility (research, verification).
  • Training stability is more important than speed.
  • As a baseline when comparing optimizers.

🔁 Pseudocode

# Batch Gradient Descent (BGD) - Pseudocode

initialize theta  # model parameters
for epoch in range(num_epochs):
    grad = gradient_over_full_dataset(theta, X_train, y_train)
    theta = theta - lr * grad
    
    # Optional monitoring
    loss = loss_over_dataset(theta, X_train, y_train)
    log(epoch=epoch, loss=loss)

🧪 Minimal NumPy Example (Linear Regression)

import numpy as np

# y = 3x + 2 + noise
np.random.seed(42)
X = np.random.rand(200, 1)
y = 3 * X + 2 + 0.1 * np.random.randn(200, 1)

# Add bias
Xb = np.c_[np.ones((len(X), 1)), X]

theta = np.zeros((2, 1))
lr = 0.1
epochs = 2000

def mse(theta, Xb, y):
    return np.mean((Xb @ theta - y) ** 2)

for epoch in range(epochs):
    y_pred = Xb @ theta
    grad = (2 / len(Xb)) * (Xb.T @ (y_pred - y))
    theta -= lr * grad
    
    if epoch % 200 == 0:
        print(f"epoch={epoch:4d} loss={mse(theta, Xb, y):.6f}")

print("Learned parameters [bias, weight]:", theta.ravel())

⚖️ BGD vs Mini-batch vs SGD

Batch Gradient Descent (batch = full data)
 + Very stable updates
 - Very slow, high memory usage

Mini-batch Gradient Descent (batch = 32 to 1024)
 + Best balance of speed + stability
 - Slightly noisy gradients

Stochastic Gradient Descent (batch = 1)
 + Very fast, good at escaping local minima
 - Highly noisy, unstable updates

🛠️ Practical Tips

  • Normalize features → faster convergence.
  • Start with a learning rate in the range ( 10^{-3} ) to ( 10^{-1} ).
  • Track loss curves to monitor training behavior.
  • Use learning rate schedules (step, cosine, exponential).
  • If dataset is large: switch to mini-batch.
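
The schedules mentioned above take only a few lines; here is a simple step decay (the function name and constants are illustrative, not prescriptive):

```python
def step_decay(base_lr, epoch, drop=0.5, every=50):
    # Multiply the learning rate by `drop` every `every` epochs
    return base_lr * (drop ** (epoch // every))

for epoch in (0, 49, 50, 100, 150):
    print(epoch, step_decay(0.1, epoch))  # 0.1, 0.1, 0.05, 0.025, 0.0125
```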

🧩 Common Pitfalls

  • Training too slow → use mini-batch.
  • Runs out of memory → stream data in batches.
  • Model stuck at poor minima → try Momentum or Adam.
  • Loss plateaus early → reduce learning rate.

📌 Helpful Checklists

Before Training

  • Normalize/standardize data
  • Dataset fits memory
  • Learning rate chosen
  • Seeds fixed (if reproducibility needed)

During Training

  • Log loss
  • Watch for plateaus
  • Check gradient norms
  • Validate on test/val sets

🧠 Intuition Diagram

Loss Surface (2D slice)
┌─────────────────────────────────────┐
│             • (start)               │
│               ↘                     │
│                 ↘                   │
│                   ↘                 │
│                     • (minimum)     │
│                                     │
│ Each arrow = one BGD step           │
│ Uses entire dataset → smooth path   │
└─────────────────────────────────────┘

🔁 Related Optimizers

  • Momentum
  • Nesterov Momentum
  • Adam
  • Adagrad
  • RMSProp
  • L-BFGS

Stochastic Gradient Descent (SGD)

TL;DR: Stochastic Gradient Descent updates model parameters using the gradient from one sample at a time. It is fast, memory‑efficient, and can escape local minima, but introduces noise, zig‑zag convergence, and may require more iterations.


🚀 What It Is

Stochastic Gradient Descent (SGD) optimizes a model by updating parameters using the gradient computed from a single training example rather than the whole dataset.

  • Imagine running downhill while adjusting your path based only on your immediate step.
  • This introduces randomness, making SGD less stable but more flexible.
  • Great for streaming or online learning since it can update the model as new data arrives.

🔢 Update Rule

[ \theta \leftarrow \theta - \eta \cdot \nabla_\theta \mathcal{L}(\theta; x_i, y_i) ]

Where:

  • ( \eta ): learning rate
  • ( (x_i, y_i) ): a single training example
  • ( \nabla_\theta \mathcal{L} ): gradient from this example only

✅ Strengths

⚡ Fast Updates

  • Each update processes just one sample → extremely fast iteration.

💾 Memory Efficient

  • Only one example is needed at a time → good for massive datasets.

🌀 Can Escape Local Minima

  • Random noise helps SGD jump out of poor local minima and explore better solutions.

🔄 Online & Streaming Friendly

  • Suitable for real-time systems where data comes continuously.

⚠️ Limitations

📉 High Variance Updates

  • Each step may move in wildly different directions.
  • Convergence path looks noisy and unpredictable.

🐢 More Iterations Needed

  • The noisy path often requires more total updates to reach good convergence.

↔️ Zig‑Zag Behavior

  • Especially in ravine-shaped loss surfaces, SGD may oscillate, making convergence slower.

💻 Limited Parallelization

  • Since only one sample is processed at a time, hard to leverage multi-core CPUs or GPUs efficiently.

🧭 When to Use

  • For large datasets that do not fit into memory.
  • When speed of updates is important.
  • For online learning or streaming data.
  • When you want an optimizer that can escape local minima.

🔁 Pseudocode

# Stochastic Gradient Descent (SGD) - Pseudocode

initialize theta

for epoch in range(num_epochs):
    for (x_i, y_i) in training_data:
        grad = gradient_of_loss(theta, x_i, y_i)
        theta = theta - lr * grad

🧪 Minimal NumPy Example

import numpy as np

# y = 3x + 2 + noise
np.random.seed(42)
X = np.random.rand(200, 1)
y = 3 * X + 2 + 0.1 * np.random.randn(200, 1)

# Add bias
Xb = np.c_[np.ones((len(X), 1)), X]

theta = np.zeros((2, 1))
lr = 0.05
epochs = 10

for epoch in range(epochs):
    # Visit samples in a fresh random order each epoch
    for i in np.random.permutation(len(Xb)):
        xi = Xb[i:i+1]
        yi = y[i:i+1]

        grad = 2 * xi.T @ (xi @ theta - yi)
        theta -= lr * grad

print("Learned parameters [bias, weight]:", theta.ravel())

⚖️ SGD vs Mini-batch vs Batch GD

SGD (batch = 1)
 + Very fast updates
 + Escapes local minima
 - Noisy, unstable path
 - Harder to parallelize

Mini-batch GD (batch = 32–1024)
 + Best overall trade-off
 + Works well with GPUs
 - Slight noise, but manageable

Batch GD (batch = full dataset)
 + Stable, smooth convergence
 - Very slow
 - Memory heavy

🛠️ Practical Tips

  • Use learning rate decay to stabilize late-stage training.
  • Shuffle the dataset each epoch.
  • Use momentum or advanced methods like SGD + Momentum, Nesterov, or Adam.
  • Track moving average of loss to see true progress.

🧩 Common Pitfalls

  • Too noisy → reduce lr or increase batch size (switch to mini-batch).
  • Stuck oscillating → add momentum.
  • Training taking too long → adjust lr schedule.
  • Poor hardware utilization → consider mini-batch for GPU training.

🧠 Intuition Diagram

Loss Surface (SGD Path)
┌───────────────────────────────┐
│   • start                     │
│     ↘   ↗  ↘   ↗              │
│       ↘     ↗   ↘             │
│          ↘        ↗           │
│               ✦ minimum       │
│  Noisy zig-zag path           │
└───────────────────────────────┘

🔁 Related Optimizers

  • SGD + Momentum
  • Nesterov Accelerated Gradient
  • Adam
  • RMSProp
  • Adagrad

Mini-Batch Gradient Descent

TL;DR: Mini‑Batch Gradient Descent updates parameters using small batches (e.g., 32–1024 samples). It blends the stability of Batch GD with the speed + noise‑benefits of SGD. It is the most widely used optimization approach in deep learning.


🚀 What It Is

Mini‑Batch Gradient Descent computes gradients using a small subset of training examples (a mini‑batch), where:

  • Batch size > 1 (unlike SGD)
  • Batch size < total dataset (unlike Batch GD)
  • Acts like navigating downhill using information from a small group of nearby paths.

This approach provides a balance between computational efficiency and stable convergence.


🔢 Update Rule

Given a mini‑batch ( B ) of examples ( (x_i, y_i) ):

[ \theta \leftarrow \theta - \eta \cdot \frac{1}{|B|} \sum_{i \in B} \nabla_\theta \mathcal{L}(\theta; x_i, y_i) ]

Where:

  • ( \eta ) = learning rate
  • ( |B| ) = batch size
  • ( \nabla_\theta \mathcal{L} ) = gradient for each sample in the batch

✅ Strengths

⚡ Computationally Efficient

  • Uses vectorization and hardware acceleration (GPUs/TPUs).
  • Faster than Batch GD since gradients on small batches run efficiently.

🔁 More Stable Than SGD

  • Averages gradient over multiple samples → less noise.
  • Converges more smoothly than SGD but retains flexibility.

⚙️ Flexible Batch Size

  • Adjust batch size based on:
    • memory limits
    • hardware capabilities
    • dataset size
    • desired training dynamics

📉 Typically Faster Overall Convergence

  • Combines:
    • the speed of SGD
    • the stability of Batch GD

⚠️ Limitations

🎚️ Choosing the Best Batch Size Is Tricky

  • Too small → noisy updates (similar to SGD).
  • Too large → slow, memory‑heavy (similar to Batch GD).

💾 Larger Batches Need More Memory

  • GPU memory constraints can restrict feasible batch sizes.

🎯 Risk of Suboptimal Convergence

  • Poorly chosen batch sizes may not capture enough data diversity.
  • Gradients may not approximate true gradient well → suboptimal minima.

🧭 When to Use

Mini‑Batch Gradient Descent is ideal when:

  • Training deep learning models on GPUs/TPUs
  • Dataset is too large for Batch GD
  • You want a good trade‑off between:
    • speed
    • stability
    • convergence quality

It is the default optimization method in modern deep learning frameworks.


🔁 Pseudocode

# Mini-Batch Gradient Descent (Pseudocode)

initialize theta

for epoch in range(num_epochs):
    shuffle(training_data)

    for batch in mini_batches(training_data, batch_size):
        grad = gradient_over_batch(theta, batch)
        theta = theta - lr * grad

🧪 Minimal NumPy Example

import numpy as np

# y = 3x + 2 + noise
np.random.seed(42)
X = np.random.rand(200, 1)
y = 3 * X + 2 + 0.1 * np.random.randn(200, 1)

# Add bias
Xb = np.c_[np.ones((len(X), 1)), X]

theta = np.zeros((2, 1))
lr = 0.1
epochs = 20
batch_size = 16

def mse(theta, Xb, y):
    return np.mean((Xb @ theta - y) ** 2)

for epoch in range(epochs):
    idx = np.random.permutation(len(Xb))
    X_shuf, y_shuf = Xb[idx], y[idx]

    for i in range(0, len(Xb), batch_size):
        X_batch = X_shuf[i:i+batch_size]
        y_batch = y_shuf[i:i+batch_size]

        grad = (2 / len(X_batch)) * (X_batch.T @ (X_batch @ theta - y_batch))
        theta -= lr * grad

print("Learned parameters [bias, weight]:", theta.ravel())

⚖️ Comparison Summary

Batch Gradient Descent (whole dataset)
 + Very stable
 - Very slow, memory-heavy

SGD (1 sample)
 + Very fast, good for online learning
 - Very noisy, unstable

Mini-Batch GD (e.g., 32–1024 samples)
 + Fast AND stable
 + Best for GPU training
 + Default choice in deep learning
 - Needs tuning of batch size

🛠️ Practical Tips

  • Common batch sizes: 32, 64, 128, 256
  • If GPU memory allows, try increasing batch size to speed up training
  • Use learning rate decay for smoother convergence
  • Shuffle dataset each epoch
  • Use momentum, Adam, or RMSProp for even better performance

🧠 Intuition Diagram

Mini-Batch Path
┌────────────────────────────────────┐
│    • start                         │
│      ↘      ↘                      │
│        ↘        ↘                  │
│          ↘         ↘               │
│             • (minimum)            │
│ Less noisy than SGD, faster than BD│
└────────────────────────────────────┘

🔁 Related Optimizers

  • SGD + Momentum
  • Nesterov Momentum
  • Adam (most common)
  • RMSProp
  • Adagrad

AdaGrad (Adaptive Gradient Algorithm)

TL;DR: AdaGrad adapts the learning rate per parameter by scaling it inversely to the square root of all past squared gradients. It is excellent for sparse data and NLP, but suffers from continually shrinking learning rates that can stop learning early.


🚀 What It Is

AdaGrad (Adaptive Gradient Algorithm) is an optimization technique that automatically adjusts the learning rate for each parameter based on historical gradient information.

  • Parameters with frequent large gradients → smaller learning rate
  • Parameters with rare or small gradients → larger learning rate

This makes AdaGrad especially useful for:

  • Sparse datasets
  • Natural Language Processing
  • Models with features of uneven frequency

🤖 Why Adaptive Learning Rates Matter

The learning rate controls how big a step the model takes toward minimizing the loss.

  • Too high → diverges or overshoots
  • Too low → slow learning or stuck in flat regions
  • Fixed learning rates (SGD) often require manual tuning

AdaGrad solves this by dynamically adjusting the learning rate during training.


🔢 Update Rule

Let:

  • ( g_t ) = gradient at time step ( t )
  • ( G_t ) = sum of squares of all past gradients

[ G_t = G_{t-1} + g_t^2 ]

AdaGrad update:

[ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \cdot g_t ]

Where:

  • ( \eta ) = initial learning rate
  • ( \epsilon ) = small constant to avoid division by zero

✅ Strengths

🎛️ 1. Automatically Adjusts Learning Rates

No need to hand-tune per-parameter learning rates.
Each parameter scales based on its behavior over time.

🌐 2. Excellent for Sparse Data

Infrequently updated parameters get higher learning rates, making AdaGrad ideal for:

  • NLP
  • Recommender systems
  • Sparse linear models

🧩 3. Simple to Implement

Builds on standard gradient descent with one additional accumulator.

⚡ 4. Faster Convergence (Early Training)

Effective in the early phases thanks to adaptive scaling.


⚠️ Limitations

🪫 1. Learning Rates Shrink Too Much

Since squared gradients accumulate forever,
( \sqrt{G_t} ) becomes very large → learning rates become extremely small.

This causes:

  • Training slowdown
  • Premature stopping
  • Poor long-term performance

🔄 2. No Mechanism to Reset or Forget

Gradient accumulation grows monotonically → AdaGrad cannot “recover” once learning stagnates.

(Optimizers like RMSProp and Adam were created to fix this.)

💾 3. Higher Memory Usage

Must store squared gradient history for every parameter → expensive for large models.


🧭 When to Use AdaGrad

Use AdaGrad when:

  • Working with sparse features
  • Training NLP or text-based models
  • Parameters update infrequently
  • You want a simple, adaptive optimizer

Avoid AdaGrad for:

  • Long training runs
  • Very deep models
  • Dense, large-scale tasks (CV, speech, transformer models)

🔁 Pseudocode

# AdaGrad Pseudocode

initialize theta
initialize G = 0  # accumulator for squared gradients

for each iteration:
    g = gradient(theta)
    G = G + g * g
    theta = theta - (lr / (sqrt(G) + epsilon)) * g

🧪 Minimal NumPy Example

import numpy as np

# Example: optimizing a simple quadratic function f(x) = x^2

lr = 0.1
epsilon = 1e-8

theta = np.array([5.0])          # initial parameter
G = np.zeros_like(theta)          # accumulator

def grad(theta):
    return 2 * theta              # derivative of f(x)=x^2

for t in range(1, 101):
    g = grad(theta)
    G += g ** 2
    adjusted_lr = lr / (np.sqrt(G) + epsilon)
    theta -= adjusted_lr * g

print("Final theta:", theta)

⚖️ Comparison with Other Optimizers

SGD
 + Simple, low memory
 - Fixed learning rate

AdaGrad
 + Adaptive per-parameter learning rate
 + Great for sparse data
 - Learning rate decays too much

RMSProp
 + Fixes AdaGrad’s decaying rate issue
 - Adds exponential decay

Adam
 + Adaptive + momentum
 + Most widely used today

🧠 Intuition Diagram

AdaGrad Behavior
┌──────────────────────────────────────┐
│ Large gradients → G increases        │
│                 → learning rate ↓    │
│                                      │
│ Small gradients → G small            │
│                 → learning rate ↑    │
│                                      │
│ Eventually learning rate becomes too │
│ small → progress slows dramatically  │
└──────────────────────────────────────┘
RMSProp (Root Mean Square Propagation)

TL;DR: RMSProp fixes AdaGrad’s biggest issue—rapidly shrinking learning rates—by using an exponential moving average of squared gradients. This keeps learning rates adaptive without decaying too fast. It works well for non‑stationary, noisy, and sparse problems, including RNNs.


🚀 What It Is

RMSProp is an adaptive learning rate optimization algorithm designed to overcome the limitations of AdaGrad.

While AdaGrad accumulates all past squared gradients (making learning rates shrink too quickly), RMSProp uses a running exponential moving average, allowing the optimizer to:

  • Forget older gradients
  • Focus on recent gradient behavior
  • Maintain learning rates at useful scales

This makes RMSProp more reliable for long training runs and dynamic tasks.


🔢 Update Rule

RMSProp maintains an exponentially decaying average of squared gradients:

[ E[g^2]_t = \beta E[g^2]_{t-1} + (1 - \beta) g_t^2 ]

Parameters are updated using:

[ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} \cdot g_t ]

Where:

  • ( \eta ) = learning rate
  • ( \beta ) = decay rate (commonly 0.9)
  • ( \epsilon ) = smoothing constant (e.g., ( 10^{-8} ))
  • ( g_t ) = gradient at time ( t )

✅ Strengths

🎚️ 1. Prevents Learning Rate Collapse

Unlike AdaGrad, RMSProp does not let accumulated squared gradients grow indefinitely, so learning rates stay effective throughout training.

🎢 2. Great for Non‑stationary Objectives

Ideal when the data distribution changes over time, such as:

  • Reinforcement learning
  • Streaming data
  • Time‑series models

🧠 3. Handles Noisy and Sparse Gradients

RMSProp remains stable even with highly variable gradient signals.

🔁 4. Good Fit for RNNs

Less sensitive to exploding/vanishing gradients → better RNN training stability.

🧩 5. Simple to Implement

Just adds a decaying average term on top of AdaGrad.


⚠️ Limitations

⚙️ 1. Extra Hyperparameters

Requires careful tuning of:

  • Decay rate ( \beta )
  • Learning rate ( \eta )

Poor choices may cause divergence or slow convergence.

🔄 2. Convergence Issues

RMSProp may:

  • Converge to suboptimal solutions
  • Struggle on some loss surfaces
  • Require learning rate schedule adjustments

📘 3. Weaker Theoretical Guarantees

RMSProp has fewer formal convergence guarantees than some other optimizers, which can make its behavior harder to predict and debug.


🧭 When to Use RMSProp

RMSProp is a strong choice for:

  • Recurrent neural networks (LSTM, GRU)
  • Reinforcement learning agents
  • Noisy or sparse datasets
  • Non‑stationary problems

It is often used as a practical middle ground between AdaGrad and Adam.


🔁 Pseudocode

# RMSProp Pseudocode

initialize theta
initialize E = 0         # running average of squared gradients
beta = 0.9               # decay rate

for each iteration:
    g = gradient(theta)
    E = beta * E + (1 - beta) * (g * g)
    theta = theta - (lr / (sqrt(E) + epsilon)) * g

🧪 Minimal NumPy Example

import numpy as np

lr = 0.01
beta = 0.9
epsilon = 1e-8

theta = np.array([5.0])
E = np.zeros_like(theta)

def grad(theta):
    return 2 * theta   # derivative of f(x) = x^2

for t in range(1, 101):
    g = grad(theta)
    E = beta * E + (1 - beta) * (g ** 2)
    theta -= (lr / (np.sqrt(E) + epsilon)) * g

print("Final theta:", theta)

⚖️ Comparison with Other Optimizers

AdaGrad
 + Adaptive learning rates
 - Rates shrink too much → premature stopping

RMSProp
 + Fixes AdaGrad's decay issue with moving average
 + Great for RNNs and non-stationary tasks
 - Needs tuning of decay rate

Adam
 + RMSProp + Momentum
 + Most commonly used today

🧠 Intuition Diagram

  • RMSProp Moving Average Concept
  • Old gradients fade → new gradients matter more
  • E[g^2] = 0.9 * previous + 0.1 * current

Result:

  • Learning rate stays healthy
  • Training continues progressing

Adam (Adaptive Moment Estimation)

TL;DR: Adam combines Momentum and RMSProp, using exponentially decaying averages of past gradients (first moment) and squared gradients (second moment). It provides adaptive learning rates, momentum-driven updates, and generally converges fast with minimal tuning, making it a default choice in deep learning.


🚀 What It Is

Adam, short for Adaptive Moment Estimation, is an optimization algorithm that improves upon RMSProp by incorporating momentum in addition to adaptive learning rates.

It keeps track of:

  • First moment (m) → exponential moving average of gradients
  • Second moment (v) → exponential moving average of squared gradients

By combining both:

  • Adam adapts learning rates per parameter
  • Adam smooths out updates with momentum
  • Adam avoids AdaGrad’s learning rate decay issue

This results in fast, stable, efficient optimization.


🔢 Update Rule

Let:

  • ( g_t ) = gradient at step ( t )
  • ( m_t ) = first moment (mean)
  • ( v_t ) = second moment (variance)

Moment Updates

[ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t ] [ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 ]

Bias Correction

[ \hat{m}_t = \frac{m_t}{1 - \beta_1^t} ] [ \hat{v}_t = \frac{v_t}{1 - \beta_2^t} ]

Parameter Update

[ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t ]

Defaults (work well in practice):

  • beta1 = 0.9
  • beta2 = 0.999
  • epsilon = 1e-8


✅ Strengths

⚡ Fast Convergence

Combines momentum + adaptive rates → reaches good solutions quickly.

🎛️ Excellent Default Optimizer

Works well across many tasks with default hyperparameters.
Requires less tuning than RMSProp or AdaGrad.

🧠 Momentum Stabilizes Updates

Smooths noisy gradients, prevents oscillations, accelerates directionally consistent updates.

📐 Handles Large-Scale and High-Dimensional Models

Efficiently manages large parameter spaces such as:

  • Deep neural networks
  • CNNs
  • RNNs
  • Transformers

🧮 Computationally Efficient

  • Low overhead
  • Few additional operations
  • Works well on GPUs/TPUs

⚠️ Limitations

🎯 Risk of Overfitting

Because Adam converges quickly, it may overfit unless:

  • Early stopping
  • Weight decay
  • Dropout
  • Regularization techniques

are applied.

💾 Requires Extra Memory

Must store two moment vectors (m and v) → doubles memory use.

🎚️ Hyperparameter Sensitivity (in some tasks)

Although defaults work well generally, certain tasks (e.g., RL) may require careful tuning.

🧩 Sometimes Converges to Worse Minima

Some studies show that Adam may converge to:

  • Worse generalization performance
  • Solutions with higher test error

compared to SGD + Momentum.

(This led to the creation of AdamW, which decouples weight decay.)


🧭 When to Use Adam

Adam is ideal when:

  • Working with large datasets
  • Training very deep networks
  • Gradients are noisy or sparse
  • You need fast, reliable convergence
  • You want a strong default optimizer

Most deep learning frameworks choose Adam as the default optimizer.


🔁 Pseudocode

# Adam Optimizer (Pseudocode)

initialize theta
initialize m = 0
initialize v = 0
t = 0

while training:
    t += 1
    g = gradient(theta)

    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * (g * g)

    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)

    theta = theta - (lr / (sqrt(v_hat) + epsilon)) * m_hat

🧪 Minimal NumPy Example

import numpy as np

lr = 0.1
beta1 = 0.9
beta2 = 0.999
epsilon = 1e-8

theta = np.array([5.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)

def grad(theta):
    return 2 * theta  # derivative of f(x)=x^2

for t in range(1, 101):
    g = grad(theta)

    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * (g ** 2)

    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)

    theta -= (lr / (np.sqrt(v_hat) + epsilon)) * m_hat

print("Final theta:", theta)

⚖️ Comparison with Other Optimizers

SGD
 + Simple, strong generalization
 - Requires tuning, slower convergence

AdaGrad
 + Adaptive learning rates 
 - Learning rate decays too fast

RMSProp
 + Stabilizes AdaGrad
 - No momentum for fast direction tracking

Adam
 + RMSProp + Momentum
 + Fast, stable, adaptive
 + Best default choice

🧠 Intuition Diagram

Adam = RMSProp (adaptive learning rate)
     + Momentum (smooth, fast updates)

1st moment  -> tracks direction
2nd moment  -> scales updates
bias correction -> prevents early bias

📘 Study Notes: Parameters vs. Hyperparameters in Machine Learning

🔹 Overview

In machine learning and deep learning, parameters and hyperparameters are foundational concepts, each playing a distinct role in how models are built, trained, and optimized.


🔧 Parameters

Definition

Parameters are the internal model variables learned automatically from training data during model training.
They are updated by optimization algorithms (like gradient descent) to minimize the loss function.

Characteristics

  • Learned during training
  • Not set manually
  • Define how the model behaves for given inputs
  • Directly affect predictions

Examples

1. Linear Regression

For the model:

[ Y = WX + B ]

  • W = weight (slope)
  • B = bias (intercept)
    Both W and B are parameters learned during training.

2. Neural Networks

In neural networks, parameters include:

  • Weights between neurons
  • Biases for neurons

Example network:

  • Input layer: 784 neurons
  • Hidden layer 1: 512 neurons
  • Hidden layer 2: 128 neurons
  • Output layer: 10 neurons

Parameter counts:

  • Input → Hidden1: 784 × 512 weights
  • Hidden1 → Hidden2: 512 × 128 weights
  • Hidden2 → Output: 128 × 10 weights
  • Biases for each neuron in hidden + output layers

All these are optimized during training using algorithms such as Stochastic Gradient Descent (SGD).
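
The counts above can be verified with a short script (layer sizes taken from the example network):

```python
layers = [784, 512, 128, 10]  # input, hidden1, hidden2, output

total = 0
for n_in, n_out in zip(layers[:-1], layers[1:]):
    weights = n_in * n_out    # one weight per connection
    biases = n_out            # one bias per neuron in the receiving layer
    total += weights + biases
    print(f"{n_in} -> {n_out}: {weights} weights + {biases} biases")

print("Total parameters:", total)  # 468874 for this architecture
```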


⚙️ Hyperparameters

Definition

Hyperparameters are external configuration settings defined before training begins.
Unlike parameters, they are not learned from data. Instead, they control the training process or model structure.

Characteristics

  • Set manually by the practitioner
  • Often tuned via trial & error or search strategies
  • Control training behavior and model architecture
  • Not updated during training

Analogy: Building a House

  • Parameters → materials (bricks, cement, wood)
  • Hyperparameters → blueprint (number of rooms, layout)

Common Hyperparameters

  • Learning rate: step size during parameter updates
  • Batch size: number of samples per training step
  • Number of epochs: full passes over the dataset
  • Network architecture choices:
    • Number of layers
    • Number of neurons per layer
    • Activation functions

🔍 Key Differences at a Glance

Aspect | Parameters | Hyperparameters
--- | --- | ---
Learned from data? | ✔️ Yes | ❌ No
Set manually? | ❌ No | ✔️ Yes
Role | Define model behavior | Control training & architecture
Examples | Weights, biases | Learning rate, batch size, epochs

📝 Summary

  • Parameters are the internal values (weights, biases) learned by the model to make accurate predictions.
  • Hyperparameters define how the model is trained and structured, and must be chosen before training.
  • Proper hyperparameter tuning is essential for good model performance.