In deep learning, a loss function measures how far a model’s predictions deviate from the true target values. During training, optimization algorithms (like SGD or Adam) use the loss as feedback to adjust model parameters (weights and biases). Choosing the right loss function is essential because it directly influences how effectively a model learns for a given task.
A loss function:
- Quantifies the error between predictions and true labels.
- Guides the optimizer during backpropagation.
- Helps the model gradually improve by minimizing this error.
Different tasks require different loss functions. The most common categories are:
- Regression (predicting continuous values)
- Binary classification (two classes)
- Multiclass classification (three or more classes)
Regression tasks involve predicting continuous numeric values (e.g., house prices, temperatures).
Mean Squared Error (MSE) is one of the most widely used regression losses:
\[
\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
\]
- Penalizes large errors more strongly due to squaring
- Always non‑negative
- Sensitive to outliers
Used for:
- Stock price prediction
- Forecasting
- Low‑noise regression tasks
Mean Absolute Error (MAE) measures the average absolute deviation:
\[
\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|
\]
- More robust to outliers than MSE
- Penalizes deviations linearly
- Can converge more slowly because the gradient has constant magnitude and is non‑smooth at zero
Used for:
- Noisy datasets
- When large deviations should not be heavily penalized
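To make the outlier behavior concrete, here is a small sketch (NumPy assumed; values are illustrative) computing both losses on the same predictions:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 100.0])   # last value is an outlier
y_pred = np.array([1.1, 1.9, 3.2, 4.0])     # model misses the outlier badly

mse = np.mean((y_true - y_pred) ** 2)       # squaring amplifies the outlier error
mae = np.mean(np.abs(y_true - y_pred))      # absolute error grows only linearly

print(f"MSE: {mse:.2f}")
print(f"MAE: {mae:.2f}")
```

A single bad prediction dominates the MSE, while the MAE stays on the scale of the typical error, which is exactly why MAE is preferred on noisy data.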
Binary classification predicts one of two possible classes, usually encoded as 0 or 1.
Binary Cross‑Entropy (BCE) is also known as log loss:
\[
\text{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
\]
- Measures closeness of predicted probabilities to true labels
- Encourages confident and correct predictions
Used for:
- Spam detection
- Fraud detection
- Medical diagnosis
- Any yes/no classification task
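A quick numeric check of the BCE formula (NumPy assumed; probabilities are illustrative):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.1, 0.8, 0.6])   # predicted probability of class 1

eps = 1e-12  # guard against log(0)
bce = -np.mean(y_true * np.log(y_pred + eps)
               + (1 - y_true) * np.log(1 - y_pred + eps))
print(f"BCE: {bce:.4f}")
```

Each term reduces to the negative log of the probability assigned to the true class, so confident correct predictions contribute near zero and confident wrong predictions are heavily penalized.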
Multiclass classification predicts one class out of many possible categories.
Categorical Cross‑Entropy (CCE) is used when labels are one‑hot encoded:
\[
\text{CCE} = -\sum_{i=1}^{N} \sum_{j=1}^{K} y_{ij} \log(\hat{y}_{ij})
\]
- Compares predicted probability distribution with the true one‑hot encoding
- Penalizes misclassification proportionally to predicted probability
Used for:
- Image classification (CIFAR‑10, MNIST)
- Text classification
- Audio classification
Sparse Categorical Cross‑Entropy is used when:
- Labels are integer encoded (e.g., 0–9)
- You want to avoid one‑hot encoding for efficiency
It applies the same mathematical idea as CCE, but works directly with integer labels.
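A short sketch (NumPy assumed) showing that the one‑hot and integer encodings yield the same loss value:

```python
import numpy as np

probs = np.array([0.1, 0.7, 0.2])   # model's predicted distribution over 3 classes

# Categorical CE: label supplied as a one-hot vector
one_hot = np.array([0, 1, 0])
cce = -np.sum(one_hot * np.log(probs))

# Sparse categorical CE: label supplied as an integer index
label = 1
scce = -np.log(probs[label])

print(cce, scce)  # identical: both reduce to -log(p of the true class)
```

The sparse form skips building the one‑hot vector entirely, which saves memory when the number of classes is large.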
Some tasks require domain‑specific loss functions tailored to unique data structures.
IoU (Intersection over Union) Loss is used for:
- Object detection
- Semantic segmentation
Measures overlap between predicted and true regions.
Dice Loss is used for:
- Medical image segmentation
- Imbalanced segmentation datasets
Optimizes overlap between predicted and actual masks.
Sequence Loss is used for:
- Machine translation
- Text generation
- Speech recognition
Handles variable‑length sequence outputs.
Choosing the right loss function is crucial for model performance:
| Task Type | Recommended Loss Function |
|---|---|
| Regression | MSE, MAE |
| Binary Classification | Binary Cross‑Entropy |
| Multiclass Classification | Categorical Cross‑Entropy / Sparse Categorical Cross‑Entropy |
| Object Detection | IoU Loss |
| Segmentation | Dice Loss |
| Sequence Modeling | Sequence Loss |
The loss function is the core driver of training—guiding the optimizer to reduce error and improve the model’s predictive accuracy.
TL;DR: Batch Gradient Descent updates model parameters by computing the gradient of the loss over the entire training set at each step. It’s stable, deterministic, and simple, but can be slow, memory-heavy, and may get stuck in local minima.
Batch Gradient Descent (BGD) minimizes a loss function \( \mathcal{L}(\theta) \) by updating the parameters \( \theta \) using the gradient computed over the full dataset.
- Think of it as planning the “best path downhill” using all terrain data before each step.
- Deterministic: Same data + same initialization = same exact training path.
- Stable: Gradients are smooth since they aggregate over all training examples.
\[
\theta \leftarrow \theta - \eta \cdot \nabla_{\theta} \mathcal{L}(\theta; \mathcal{D})
\]
Where:
- \( \eta \) = learning rate
- \( \mathcal{D} \) = entire training dataset
- \( \nabla_{\theta} \mathcal{L} \) = gradient of the loss w.r.t. the parameters
- Stable updates due to full-dataset gradients.
- Reproducible because each update is deterministic.
- Simple to implement, great for beginners or baseline models.
- Slow because each iteration requires processing the entire dataset.
- Memory-intensive — must load or aggregate over all samples.
- Can get stuck in local minima or saddle points in non-convex loss landscapes.
- Slower feedback loop: updates only happen once per full pass.
- Dataset is small/medium and fits in memory.
- You need reproducibility (research, verification).
- Training stability is more important than speed.
- As a baseline when comparing optimizers.
```python
# Batch Gradient Descent (BGD) - Pseudocode
initialize theta  # model parameters
for epoch in range(num_epochs):
    grad = gradient_over_full_dataset(theta, X_train, y_train)
    theta = theta - lr * grad
    # Optional monitoring
    loss = loss_over_dataset(theta, X_train, y_train)
    log(epoch=epoch, loss=loss)
```

```python
import numpy as np

# y = 3x + 2 + noise
np.random.seed(42)
X = np.random.rand(200, 1)
y = 3 * X + 2 + 0.1 * np.random.randn(200, 1)

# Add bias column
Xb = np.c_[np.ones((len(X), 1)), X]
theta = np.zeros((2, 1))
lr = 0.1
epochs = 2000

def mse(theta, Xb, y):
    return np.mean((Xb @ theta - y) ** 2)

for epoch in range(epochs):
    y_pred = Xb @ theta
    grad = (2 / len(Xb)) * (Xb.T @ (y_pred - y))
    theta -= lr * grad
    if epoch % 200 == 0:
        print(f"epoch={epoch:4d} loss={mse(theta, Xb, y):.6f}")

print("Learned parameters [bias, weight]:", theta.ravel())
```
Batch Gradient Descent (batch = full data)
+ Very stable updates
- Very slow, high memory usage
Mini-batch Gradient Descent (batch = 32 to 1024)
+ Best balance of speed + stability
- Slightly noisy gradients
Stochastic Gradient Descent (batch = 1)
+ Very fast, good at escaping local minima
- Highly noisy, unstable updates
- Normalize features → faster convergence.
- Start with a learning rate in the range \( 10^{-3} \) to \( 10^{-1} \).
- Track loss curves to monitor training behavior.
- Use learning rate schedules (step, cosine, exponential).
- If dataset is large: switch to mini-batch.
- Training too slow → use mini-batch.
- Runs out of memory → stream data in batches.
- Model stuck at poor minima → try Momentum or Adam.
- Loss plateaus early → reduce learning rate.
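The learning‑rate schedule tip above can be sketched as a simple step decay (the function name and defaults here are illustrative):

```python
def step_decay(lr0, epoch, drop=0.5, every=100):
    """Multiply the initial rate by `drop` once per `every` epochs."""
    return lr0 * (drop ** (epoch // every))

# 0.1 for epochs 0-99, then 0.05, then 0.025, ...
for epoch in (0, 100, 250):
    print(epoch, step_decay(0.1, epoch))
```

Cosine and exponential schedules follow the same pattern: a pure function of the epoch number that the training loop calls before each update.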
Before Training:
- Normalize/standardize data
- Dataset fits in memory
- Learning rate chosen
- Seeds fixed (if reproducibility is needed)
During Training:
- Log the training loss
- Watch for plateaus
- Check gradient norms
- Validate on held‑out validation/test sets
Loss Surface (2D slice)
┌─────────────────────────────────────┐
│ • (start) │
│ ↘ │
│ ↘ │
│ ↘ │
│ • (minimum) │
│ │
│ Each arrow = one BGD step │
│ Uses entire dataset → smooth path │
└─────────────────────────────────────┘
- Momentum
- Nesterov Momentum
- Adam
- Adagrad
- RMSProp
- L-BFGS
TL;DR: Stochastic Gradient Descent updates model parameters using the gradient from one sample at a time. It is fast, memory‑efficient, and can escape local minima, but introduces noise, zig‑zag convergence, and may require more iterations.
Stochastic Gradient Descent (SGD) optimizes a model by updating parameters using the gradient computed from a single training example rather than the whole dataset.
- Imagine running downhill while adjusting your path based only on your immediate step.
- This introduces randomness, making SGD less stable but more flexible.
- Great for streaming or online learning since it can update the model as new data arrives.
\[
\theta \leftarrow \theta - \eta \cdot \nabla_\theta \mathcal{L}(\theta; x_i, y_i)
\]
Where:
- \( \eta \): learning rate
- \( (x_i, y_i) \): a single training example
- \( \nabla_\theta \mathcal{L} \): gradient from this example only
- Each update processes just one sample → extremely fast iteration.
- Only one example is needed at a time → good for massive datasets.
- Random noise helps SGD jump out of poor local minima and explore better solutions.
- Suitable for real-time systems where data comes continuously.
- Each step may move in wildly different directions.
- Convergence path looks noisy and unpredictable.
- The noisy path often requires more total updates to reach good convergence.
- Especially in ravine-shaped loss surfaces, SGD may oscillate, making convergence slower.
- Since only one sample is processed at a time, hard to leverage multi-core CPUs or GPUs efficiently.
- For large datasets that do not fit into memory.
- When speed of updates is important.
- For online learning or streaming data.
- When you want an optimizer that can escape local minima.
```python
# Stochastic Gradient Descent (SGD) - Pseudocode
initialize theta
for epoch in range(num_epochs):
    for (x_i, y_i) in training_data:
        grad = gradient_of_loss(theta, x_i, y_i)
        theta = theta - lr * grad
```

```python
import numpy as np

# y = 3x + 2 + noise
np.random.seed(42)
X = np.random.rand(200, 1)
y = 3 * X + 2 + 0.1 * np.random.randn(200, 1)

# Add bias column
Xb = np.c_[np.ones((len(X), 1)), X]
theta = np.zeros((2, 1))
lr = 0.05
epochs = 10

for epoch in range(epochs):
    for i in range(len(Xb)):
        xi = Xb[i:i+1]
        yi = y[i:i+1]
        grad = 2 * xi.T @ (xi @ theta - yi)
        theta -= lr * grad

print("Learned parameters [bias, weight]:", theta.ravel())
```

SGD (batch = 1)
+ Very fast updates
+ Escapes local minima
- Noisy, unstable path
- Harder to parallelize
Mini-batch GD (batch = 32–1024)
+ Best overall trade-off
+ Works well with GPUs
- Slight noise, but manageable
Batch GD (batch = full dataset)
+ Stable, smooth convergence
- Very slow
- Memory heavy

- Use learning rate decay to stabilize late-stage training.
- Shuffle the dataset each epoch.
- Use momentum or advanced methods like SGD + Momentum, Nesterov, or Adam.
- Track moving average of loss to see true progress.
- Too noisy → reduce lr or increase batch size (switch to mini-batch).
- Stuck oscillating → add momentum.
- Training taking too long → adjust lr schedule.
- Poor hardware utilization → consider mini-batch for GPU training.
Loss Surface (SGD Path)
┌───────────────────────────────┐
│ • start │
│ ↘ ↗ ↘ ↗ │
│ ↘ ↗ ↘ │
│ ↘ ↗ │
│ ✦ minimum │
│ Noisy zig-zag path │
└───────────────────────────────┘

- SGD + Momentum
- Nesterov Accelerated Gradient
- Adam
- RMSProp
- Adagrad
TL;DR: Mini‑Batch Gradient Descent updates parameters using small batches (e.g., 32–1024 samples). It blends the stability of Batch GD with the speed + noise‑benefits of SGD. It is the most widely used optimization approach in deep learning.
Mini‑Batch Gradient Descent computes gradients using a small subset of training examples (a mini‑batch), where:
- Batch size > 1 (unlike SGD)
- Batch size < total dataset (unlike Batch GD)
- Acts like navigating downhill using information from a small group of nearby paths.
This approach provides a balance between computational efficiency and stable convergence.
Given a mini‑batch \( B \) of examples \( (x_i, y_i) \):
\[
\theta \leftarrow \theta - \eta \cdot \frac{1}{|B|} \sum_{i \in B} \nabla_\theta \mathcal{L}(\theta; x_i, y_i)
\]
Where:
- \( \eta \) = learning rate
- \( |B| \) = batch size
- \( \nabla_\theta \mathcal{L} \) = per‑sample gradient within the batch
- Uses vectorization and hardware acceleration (GPUs/TPUs).
- Faster than Batch GD since gradients on small batches run efficiently.
- Averages gradient over multiple samples → less noise.
- Converges more smoothly than SGD but retains flexibility.
- Adjust batch size based on:
  - memory limits
  - hardware capabilities
  - dataset size
  - desired training dynamics
- Combines:
  - the speed of SGD
  - the stability of Batch GD
- Too small → noisy updates (similar to SGD).
- Too large → slow, memory‑heavy (similar to Batch GD).
- GPU memory constraints can restrict feasible batch sizes.
- Poorly chosen batch sizes may not capture enough data diversity.
- Gradients may not approximate true gradient well → suboptimal minima.
Mini‑Batch Gradient Descent is ideal when:
- Training deep learning models on GPUs/TPUs
- Dataset is too large for Batch GD
- You want a good trade‑off between:
  - speed
  - stability
  - convergence quality
It is the default optimization method in modern deep learning frameworks.
```python
# Mini-Batch Gradient Descent (Pseudocode)
initialize theta
for epoch in range(num_epochs):
    shuffle(training_data)
    for batch in mini_batches(training_data, batch_size):
        grad = gradient_over_batch(theta, batch)
        theta = theta - lr * grad
```

```python
import numpy as np

# y = 3x + 2 + noise
np.random.seed(42)
X = np.random.rand(200, 1)
y = 3 * X + 2 + 0.1 * np.random.randn(200, 1)

# Add bias column
Xb = np.c_[np.ones((len(X), 1)), X]
theta = np.zeros((2, 1))
lr = 0.1
epochs = 20
batch_size = 16

def mse(theta, Xb, y):
    return np.mean((Xb @ theta - y) ** 2)

for epoch in range(epochs):
    idx = np.random.permutation(len(Xb))
    X_shuf, y_shuf = Xb[idx], y[idx]
    for i in range(0, len(Xb), batch_size):
        X_batch = X_shuf[i:i+batch_size]
        y_batch = y_shuf[i:i+batch_size]
        grad = (2 / len(X_batch)) * (X_batch.T @ (X_batch @ theta - y_batch))
        theta -= lr * grad

print("Learned parameters [bias, weight]:", theta.ravel())
```

Batch Gradient Descent (whole dataset)
+ Very stable
- Very slow, memory-heavy
SGD (1 sample)
+ Very fast, good for online learning
- Very noisy, unstable
Mini-Batch GD (e.g., 32–1024 samples)
+ Fast AND stable
+ Best for GPU training
+ Default choice in deep learning
- Needs tuning of batch size

- Common batch sizes: 32, 64, 128, 256
- If GPU memory allows, try increasing batch size to speed up training
- Use learning rate decay for smoother convergence
- Shuffle dataset each epoch
- Use momentum, Adam, or RMSProp for even better performance
Mini-Batch Path
┌────────────────────────────────────┐
│ • start │
│ ↘ ↘ │
│ ↘ ↘ │
│ ↘ ↘ │
│ • (minimum) │
│ Less noisy than SGD, faster than BD│
└────────────────────────────────────┘

- SGD + Momentum
- Nesterov Momentum
- Adam (most common)
- RMSProp
- Adagrad
TL;DR: AdaGrad adapts the learning rate per parameter by scaling it inversely to the square root of all past squared gradients. It is excellent for sparse data and NLP, but suffers from continually shrinking learning rates that can stop learning early.
AdaGrad (Adaptive Gradient Algorithm) is an optimization technique that automatically adjusts the learning rate for each parameter based on historical gradient information.
- Parameters with frequent large gradients → smaller learning rate
- Parameters with rare or small gradients → larger learning rate
This makes AdaGrad especially useful for:
- Sparse datasets
- Natural Language Processing
- Models with features of uneven frequency
The learning rate controls how big a step the model takes toward minimizing the loss.
- Too high → diverges or overshoots
- Too low → slow learning or stuck in flat regions
- Fixed learning rates (SGD) often require manual tuning
AdaGrad solves this by dynamically adjusting the learning rate during training.
Let:
- \( g_t \) = gradient at time step \( t \)
- \( G_t \) = sum of squares of all past gradients
\[
G_t = G_{t-1} + g_t^2
\]
AdaGrad update:
\[
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \cdot g_t
\]
Where:
- \( \eta \) = initial learning rate
- \( \epsilon \) = small constant to avoid division by zero
No need to hand-tune per-parameter learning rates.
Each parameter scales based on its behavior over time.
Infrequently updated parameters get higher learning rates, making AdaGrad ideal for:
- NLP
- Recommender systems
- Sparse linear models
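A small sketch (NumPy assumed; the gradient pattern is illustrative) of the per‑parameter effect: a parameter that receives gradients only occasionally keeps a larger effective learning rate:

```python
import numpy as np

lr, eps = 0.1, 1e-8
G = np.zeros(2)  # squared-gradient accumulator, one entry per parameter

for t in range(100):
    # parameter 0: gradient every step; parameter 1: only every 10th step
    g = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])
    G += g ** 2

effective_lr = lr / (np.sqrt(G) + eps)
print(effective_lr)  # the rarely-updated parameter keeps a larger step size
```

This is why AdaGrad suits sparse features: frequent features are damped while rare but informative features still get sizable updates.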
Builds on standard gradient descent with one additional accumulator.
Effective in the early phases thanks to adaptive scaling.
Since squared gradients accumulate forever, \( \sqrt{G_t} \) grows without bound and the effective learning rate shrinks toward zero.
This causes:
- Training slowdown
- Premature stopping
- Poor long-term performance
Gradient accumulation grows monotonically → AdaGrad cannot “recover” once learning stagnates.
(Optimizers like RMSProp and Adam were created to fix this.)
Must store squared gradient history for every parameter → expensive for large models.
Use AdaGrad when:
- Working with sparse features
- Training NLP or text-based models
- Parameters update infrequently
- You want a simple, adaptive optimizer
Avoid AdaGrad for:
- Long training runs
- Very deep models
- Dense, large-scale tasks (CV, speech, transformer models)
```python
# AdaGrad Pseudocode
initialize theta
initialize G = 0  # accumulator for squared gradients
for each iteration:
    g = gradient(theta)
    G = G + g * g
    theta = theta - (lr / (sqrt(G) + epsilon)) * g
```

```python
import numpy as np

lr = 0.1
epsilon = 1e-8
theta = np.array([5.0])   # initial parameter
G = np.zeros_like(theta)  # accumulator

def grad(theta):
    return 2 * theta      # derivative of f(x) = x^2

for t in range(1, 101):
    g = grad(theta)
    G += g ** 2
    adjusted_lr = lr / (np.sqrt(G) + epsilon)
    theta -= adjusted_lr * g

print("Final theta:", theta)
```

SGD
+ Simple, low memory
- Fixed learning rate
AdaGrad
+ Adaptive per-parameter learning rate
+ Great for sparse data
- Learning rate decays too much
RMSProp
+ Fixes AdaGrad’s decaying rate issue
- Adds exponential decay
Adam
+ Adaptive + momentum
+ Most widely used today
AdaGrad Behavior
┌──────────────────────────────────────┐
│ Large gradients → G increases │
│ → learning rate ↓ │
│ │
│ Small gradients → G small │
│ → learning rate ↑ │
│ │
│ Eventually learning rate becomes too │
│ small → progress slows dramatically │
└──────────────────────────────────────┘
# RMSProp (Root Mean Square Propagation)
> **TL;DR**: RMSProp fixes AdaGrad’s biggest issue—**rapidly shrinking learning rates**—by using an **exponential moving average** of squared gradients. This keeps learning rates adaptive without decaying too fast. It works well for **non‑stationary**, **noisy**, and **sparse** problems, including RNNs.
---
## 🚀 What It Is
**RMSProp** is an adaptive learning rate optimization algorithm designed to overcome the limitations of AdaGrad.
While AdaGrad accumulates *all* past squared gradients (making learning rates shrink too quickly), **RMSProp uses a running exponential moving average**, allowing the optimizer to:
- *Forget older gradients*
- *Focus on recent gradient behavior*
- *Maintain learning rates at useful scales*
This makes RMSProp more reliable for long training runs and dynamic tasks.
---
## 🔢 Update Rule
RMSProp maintains an exponentially decaying average of squared gradients:
\[
E[g^2]_t = \beta E[g^2]_{t-1} + (1 - \beta) g_t^2
\]
Parameters are updated using:
\[
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} \cdot g_t
\]
Where:
- \( \eta \): learning rate
- \( \beta \): decay rate (commonly 0.9)
- \( \epsilon \): smoothing constant (e.g. \(10^{-8}\))
- \( g_t \): gradient at time \( t \)
---
## ✅ Strengths
### 🎚️ 1. Prevents Learning Rate Collapse
Unlike AdaGrad, RMSProp does **not** let squared gradients grow indefinitely.
This keeps learning rates effective throughout training.
### 🎢 2. Great for Non‑stationary Objectives
Ideal when data distribution changes over time, such as:
- Reinforcement learning
- Streaming data
- Time‑series models
### 🧠 3. Handles Noisy and Sparse Gradients
RMSProp remains stable even with highly variable gradient signals.
### 🔁 4. Good Fit for RNNs
Less sensitive to exploding/vanishing gradients → better RNN training stability.
### 🧩 5. Simple to Implement
Just adds a decaying average term on top of AdaGrad.
---
## ⚠️ Limitations
### ⚙️ 1. Extra Hyperparameters
Requires careful tuning of:
- **Decay rate** \( \beta \)
- **Learning rate** \( \eta \)
Poor choices may cause divergence or slow convergence.
### 🔄 2. Convergence Issues
RMSProp may:
- Converge to suboptimal solutions
- Fail on some loss surfaces
- Require schedule adjustments
### 📘 3. Lacks Strong Theoretical Guarantees
Unlike some modern optimizers, RMSProp lacks a rigorous mathematical foundation.
This can make behavior harder to predict and debug.
---
## 🧭 When to Use RMSProp
RMSProp is a strong choice for:
- Recurrent neural networks (LSTM, GRU)
- Reinforcement learning agents
- Noisy or sparse datasets
- Non‑stationary problems
It is often used as a practical middle ground between AdaGrad and Adam.
---
## 🔁 Pseudocode
```python
# RMSProp Pseudocode
initialize theta
initialize E = 0   # running average of squared gradients
beta = 0.9         # decay rate
for each iteration:
    g = gradient(theta)
    E = beta * E + (1 - beta) * (g * g)
    theta = theta - (lr / (sqrt(E) + epsilon)) * g
```

```python
import numpy as np

lr = 0.01
beta = 0.9
epsilon = 1e-8
theta = np.array([5.0])
E = np.zeros_like(theta)

def grad(theta):
    return 2 * theta  # derivative of f(x) = x^2

for t in range(1, 101):
    g = grad(theta)
    E = beta * E + (1 - beta) * (g ** 2)
    theta -= (lr / (np.sqrt(E) + epsilon)) * g

print("Final theta:", theta)
```

AdaGrad
+ Adaptive learning rates
- Rates shrink too much → premature stopping
RMSProp
+ Fixes AdaGrad's decay issue with moving average
+ Great for RNNs and non-stationary tasks
- Needs tuning of decay rate
Adam
+ RMSProp + Momentum
+ Most commonly used today

RMSProp moving average concept:
- Old gradients fade → recent gradients matter more
- \( E[g^2]_t = 0.9 \cdot E[g^2]_{t-1} + 0.1 \cdot g_t^2 \)
- The learning rate stays at a useful scale
- Training keeps progressing
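The contrast with AdaGrad is easy to see on a constant gradient. In this sketch (NumPy assumed), AdaGrad's accumulator grows without bound while RMSProp's moving average levels off near the squared gradient:

```python
import numpy as np

g = 1.0          # constant gradient magnitude
beta = 0.9
G, E = 0.0, 0.0  # AdaGrad accumulator vs RMSProp moving average

for _ in range(1000):
    G += g ** 2                         # grows linearly with steps
    E = beta * E + (1 - beta) * g ** 2  # converges to g^2 = 1

lr = 0.01
print(lr / np.sqrt(G))  # AdaGrad: effective step keeps shrinking
print(lr / np.sqrt(E))  # RMSProp: effective step settles near lr
```

After 1000 steps AdaGrad's effective rate has shrunk by a factor of about 30, while RMSProp's remains essentially unchanged.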
TL;DR: Adam combines Momentum and RMSProp, using exponentially decaying averages of past gradients (first moment) and squared gradients (second moment). It provides adaptive learning rates, momentum-driven updates, and generally converges fast with minimal tuning, making it a default choice in deep learning.
Adam, short for Adaptive Moment Estimation, is an optimization algorithm that improves upon RMSProp by incorporating momentum in addition to adaptive learning rates.
It keeps track of:
- First moment (m) → exponential moving average of gradients
- Second moment (v) → exponential moving average of squared gradients
By combining both:
- Adam adapts learning rates per parameter
- Adam smooths out updates with momentum
- Adam avoids AdaGrad’s learning rate decay issue
This results in fast, stable, efficient optimization.
Let:
- \( g_t \) = gradient at step \( t \)
- \( m_t \) = first moment (mean of gradients)
- \( v_t \) = second moment (uncentered variance)

Moment updates:
\[
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
\]
\[
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
\]
Bias‑corrected estimates:
\[
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
\]
Parameter update:
\[
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t
\]
Defaults (work well in practice): `beta1 = 0.9`, `beta2 = 0.999`, `epsilon = 1e-8`.
Combines momentum + adaptive rates → reaches good solutions quickly.
Works well across many tasks with default hyperparameters.
Requires less tuning than RMSProp or AdaGrad.
Smooths noisy gradients, prevents oscillations, accelerates directionally consistent updates.
Efficiently manages large parameter spaces such as:
- Deep neural networks
- CNNs
- RNNs
- Transformers
- Low overhead
- Few additional operations
- Works well on GPUs/TPUs
Because Adam converges quickly, it may overfit unless regularization is applied, for example:
- Early stopping
- Weight decay
- Dropout
Must store two moment vectors (m and v) → doubles memory use.
Although defaults work well generally, certain tasks (e.g., RL) may require careful tuning.
Some studies show that Adam may converge to:
- Worse generalization performance
- Solutions with higher test error
compared to SGD + Momentum.
(This led to the creation of AdamW, which decouples weight decay.)
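To illustrate the decoupling, here is a minimal sketch (illustrative, not the full AdamW algorithm; `wd` is a hypothetical weight‑decay coefficient) that applies weight decay directly to the parameters instead of folding it into the gradient:

```python
import numpy as np

lr, wd = 0.1, 0.01
beta1, beta2, epsilon = 0.9, 0.999, 1e-8
theta = np.array([5.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)

def grad(theta):
    return 2 * theta  # derivative of f(x) = x^2

for t in range(1, 101):
    g = grad(theta)   # note: no weight-decay term added to the gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # weight decay applied directly to theta, decoupled from the adaptive step
    theta -= lr * (m_hat / (np.sqrt(v_hat) + epsilon) + wd * theta)

print("Final theta:", theta)
```

Because the decay term bypasses the adaptive scaling, all parameters shrink at the same relative rate regardless of their gradient history.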
Adam is ideal when:
- Working with large datasets
- Training very deep networks
- Gradients are noisy or sparse
- You need fast, reliable convergence
- You want a strong default optimizer
Most deep learning frameworks choose Adam as the default optimizer.
```python
# Adam Optimizer (Pseudocode)
initialize theta
initialize m = 0
initialize v = 0
t = 0
while training:
    t += 1
    g = gradient(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * (g * g)
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta - (lr / (sqrt(v_hat) + epsilon)) * m_hat
```

```python
import numpy as np

lr = 0.1
beta1 = 0.9
beta2 = 0.999
epsilon = 1e-8
theta = np.array([5.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)

def grad(theta):
    return 2 * theta  # derivative of f(x) = x^2

for t in range(1, 101):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * (g ** 2)
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta -= (lr / (np.sqrt(v_hat) + epsilon)) * m_hat

print("Final theta:", theta)
```
SGD
+ Simple, strong generalization
- Requires tuning, slower convergence
AdaGrad
+ Adaptive learning rates
- Learning rate decays too fast
RMSProp
+ Stabilizes AdaGrad
- No momentum for fast direction tracking
Adam
+ RMSProp + Momentum
+ Fast, stable, adaptive
+ Best default choice
Adam = RMSProp (adaptive learning rate)
+ Momentum (smooth, fast updates)
1st moment -> tracks direction
2nd moment -> scales updates
bias correction -> prevents early bias
In machine learning and deep learning, parameters and hyperparameters are foundational concepts, each playing a distinct role in how models are built, trained, and optimized.
Parameters are the internal model variables learned automatically from training data during model training.
They are updated by optimization algorithms (like gradient descent) to minimize the loss function.
- Learned during training
- Not set manually
- Define how the model behaves for given inputs
- Directly affect predictions
For the model:
\[
Y = WX + B
\]
- W = weight (slope)
- B = bias (intercept)
Both W and B are parameters learned during training.
In neural networks, parameters include:
- Weights between neurons
- Biases for neurons
Example network:
- Input layer: 784 neurons
- Hidden layer 1: 512 neurons
- Hidden layer 2: 128 neurons
- Output layer: 10 neurons
Parameter counts:
- Input → Hidden1: 784 × 512 weights
- Hidden1 → Hidden2: 512 × 128 weights
- Hidden2 → Output: 128 × 10 weights
- Biases for each neuron in hidden + output layers
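The counts above can be totalled directly; a quick sketch:

```python
layers = [784, 512, 128, 10]  # input, hidden1, hidden2, output

total = 0
for n_in, n_out in zip(layers, layers[1:]):
    total += n_in * n_out + n_out  # weights plus one bias per output neuron

print(total)  # 468,874 trainable parameters
```

Even this small fully connected network has nearly half a million parameters, almost all of them in the first weight matrix.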
All these are optimized during training using algorithms such as Stochastic Gradient Descent (SGD).
Hyperparameters are external configuration settings defined before training begins.
Unlike parameters, they are not learned from data. Instead, they control the training process or model structure.
- Set manually by the practitioner
- Often tuned via trial & error or search strategies
- Control training behavior and model architecture
- Not updated during training
Analogy (building a house):
- Parameters → materials (bricks, cement, wood)
- Hyperparameters → blueprint (number of rooms, layout)
- Learning rate: step size during parameter updates
- Batch size: number of samples per training step
- Number of epochs: full passes over the dataset
- Network architecture choices:
- Number of layers
- Number of neurons per layer
- Activation functions
| Aspect | Parameters | Hyperparameters |
|---|---|---|
| Learned from data? | ✔️ Yes | ❌ No |
| Set manually? | ❌ No | ✔️ Yes |
| Role | Define model behavior | Control training & architecture |
| Examples | Weights, biases | Learning rate, batch size, epochs |
- Parameters are the internal values (weights, biases) learned by the model to make accurate predictions.
- Hyperparameters define how the model is trained and structured, and must be chosen before training.
- Proper hyperparameter tuning is essential for good model performance.