# Applying L2 Regularization (Ridge) in Deep Learning
## Overview
**L2 regularization**, also known as **Ridge Regularization**, is a widely used technique in deep learning to combat **overfitting** by discouraging the model from learning excessively large weight values. Instead of forcing weights to zero (as in L1), L2 regularization keeps weights **small and evenly distributed**, which leads to more stable and generalizable models.
In this study note, you will learn:
- How to identify overfitting in a baseline model
- The intuition behind L2 regularization
- How to implement L2 regularization using Keras
- How to evaluate its impact on model performance
- Additional learning insights and best practices
---
## Identifying Overfitting
When training a baseline deep learning model, overfitting typically appears as:
- Training loss decreasing steadily
- Validation loss diverging or decreasing at a much slower rate
This divergence between training and validation loss curves suggests that the model is learning noise and dataset-specific patterns rather than generalizable features.
> 🎯 **Objective:** Reduce this gap and ensure both losses decrease at a similar pace.
---
## What Is L2 Regularization?
L2 regularization modifies the loss function by adding a penalty equal to the **sum of squared weight values**:
\[
\text{Loss} = \text{Original Loss} + \lambda \sum w_i^2
\]
Where:
- \( w_i \) are the model weights
- \( \lambda \) (regularization strength) controls how strongly large weights are penalized
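For concreteness, the penalty term can be computed directly. This is a framework-free NumPy sketch using an example weight vector and a hypothetical data-loss value:

```python
import numpy as np

weights = np.array([1.0, -2.0, 3.0])   # example weight vector (illustrative)
lam = 0.01                             # regularization strength (lambda)
data_loss = 0.35                       # hypothetical original loss value

penalty = lam * np.sum(weights ** 2)   # 0.01 * (1 + 4 + 9) = 0.14
total_loss = data_loss + penalty       # 0.35 + 0.14 = 0.49
```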
---
## Key Properties of L2 Regularization
- Penalizes large weights more strongly than small ones
- Encourages smooth, distributed learning across features
- Improves numerical stability during training
- Reduces variance without drastically increasing bias
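One way to see these properties is through the gradient: the derivative of \( \lambda \sum w_i^2 \) with respect to each weight is \( 2\lambda w_i \), so every gradient step shrinks the weights toward zero by a constant factor (often called weight decay). A minimal NumPy sketch, with an illustrative weight vector and the data-loss gradient set to zero to isolate the shrinkage effect:

```python
import numpy as np

w = np.array([0.8, -1.5, 2.0])    # example weights (illustrative)
lam, lr = 0.01, 0.1               # regularization strength, learning rate
grad_data = np.zeros_like(w)      # data-loss gradient zeroed to isolate L2

# gradient of lam * sum(w_i^2) with respect to w is 2 * lam * w
w_new = w - lr * (grad_data + 2 * lam * w)

# equivalent multiplicative shrinkage: w * (1 - 2 * lr * lam)
```

Because the penalty gradient is proportional to the weight itself, large weights are pulled back harder than small ones, which is exactly the first property above.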
---
## Applying L2 Regularization in Keras
### Step 1: Import the Required Modules
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l2
```
### Step 2: Define the Regularized Model
Apply L2 regularization to the kernel (weights) of each hidden layer.
```python
model = Sequential([
    Dense(128, activation='relu', kernel_regularizer=l2(0.001)),
    Dense(64, activation='relu', kernel_regularizer=l2(0.001)),
    Dense(1, activation='sigmoid')
])
```
#### Explanation
- `kernel_regularizer=l2(0.001)` penalizes large weight values
- `0.001` is a commonly used starting value
- Higher values increase regularization strength and may cause underfitting
### Step 3: Compile the Model
```python
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)
```
### Step 4: Train the Model
```python
history = model.fit(
    X_train,
    y_train,
    epochs=15,
    batch_size=128,
    validation_split=0.1
)
```
#### Training Configuration
- `epochs=15`: Sufficient iterations to observe training trends
- `batch_size=128`: Efficient gradient updates
- `validation_split=0.1`: Continuous monitoring of generalization
---
## Evaluating the Results
After training, plot both:
- Training loss
- Validation loss
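A minimal plotting sketch; in practice the two curves come from the `history.history` dictionary returned by `model.fit`, but placeholder values are used below so the snippet stands alone:

```python
import matplotlib
matplotlib.use("Agg")              # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt

# In practice: history.history["loss"] and history.history["val_loss"].
# The values below are illustrative placeholders.
losses = {"loss": [0.60, 0.45, 0.38, 0.34, 0.32],
          "val_loss": [0.62, 0.48, 0.41, 0.37, 0.36]}

plt.plot(losses["loss"], label="Training loss")
plt.plot(losses["val_loss"], label="Validation loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.savefig("l2_loss_curves.png")
```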
### Observations
- ✅ Training and validation loss decrease at a similar rate
- ✅ No large divergence between the curves
- ✅ The model generalizes better than the baseline

This behavior confirms that L2 regularization is effectively limiting overfitting.
---
## Why L2 Regularization Works
- Penalizes squared weight magnitude, making very large weights costly
- Distributes importance more evenly across features
- Prevents the model from becoming overly sensitive to individual inputs
- Enhances generalization on unseen data
---
## Additional Learning Points
### L2 vs No Regularization
- Without regularization: large weights → fragile models
- With L2 regularization: smoother decision boundaries
- Especially effective for deep networks with many parameters
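The contrast can be demonstrated with a tiny experiment: fit a linear model by gradient descent with and without an L2 penalty and compare the resulting weight norms. This is an illustrative sketch on synthetic data, not tied to the Keras example above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                   # synthetic inputs
true_w = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ true_w + rng.normal(scale=0.1, size=100)

def fit(lam, lr=0.1, steps=500):
    """Gradient descent on MSE plus an L2 penalty of strength lam."""
    w = np.zeros(5)
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y) + 2 * lam * w
        w -= lr * grad
    return w

w_plain = fit(lam=0.0)   # no regularization
w_ridge = fit(lam=0.5)   # L2-regularized

# the regularized solution has a smaller weight norm
```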
### Best Practices
- ✔️ Start with small values like `0.001` and tune gradually
- 🔁 Combine with Dropout for stronger regularization
- ⏱️ Use Early Stopping for additional overfitting control
- 📉 Monitor validation metrics closely during training
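These practices can be combined in a single model definition. The sketch below assumes a hypothetical 20-feature binary-classification input; the dropout rate and early-stopping patience are illustrative starting points, not tuned settings:

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import EarlyStopping

model = Sequential([
    Input(shape=(20,)),   # hypothetical input dimension
    Dense(128, activation='relu', kernel_regularizer=l2(0.001)),
    Dropout(0.3),         # illustrative dropout rate
    Dense(64, activation='relu', kernel_regularizer=l2(0.001)),
    Dropout(0.3),
    Dense(1, activation='sigmoid'),
])

# stop when validation loss stops improving, keeping the best weights
early_stop = EarlyStopping(monitor='val_loss', patience=3,
                           restore_best_weights=True)
# pass callbacks=[early_stop] to model.fit alongside validation_split
```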
### When to Use L2 Regularization
- ✅ Large or deep neural networks
- ✅ When all features may be useful
- ✅ When numerical stability is important
- ✅ When you want smoother, more robust models
- 🚫 Avoid overly strong L2 penalties with small datasets
---
## Key Takeaways
- L2 regularization discourages large weights using squared penalties
- Easy to apply via `kernel_regularizer=l2()` in Keras
- Stabilizes training and improves generalization
- Often preferred over L1 in deep learning architectures