L1 regularization, commonly known as Lasso Regularization, is a powerful technique used in deep learning to reduce overfitting and improve model generalization. It works by adding a penalty to the loss function that is proportional to the absolute value of the weights. This penalty encourages the model to learn sparse weight representations, meaning that many weights are driven to zero.
In this study note, we explore:
- Why L1 regularization is needed
- How it helps combat overfitting
- How to apply it in a Keras deep learning model
- How to interpret training results
- Additional learning insights and best practices
A common indicator of overfitting is a divergence between training and validation loss curves:
- Training loss continues to decrease
- Validation loss stagnates or increases
This pattern indicates that the model is memorizing training data instead of learning generalizable patterns.
✅ Goal: Reduce this divergence to improve generalization to unseen data.
L1 regularization modifies the loss function as follows:
$$
\text{Loss} = \text{Original Loss} + \lambda \sum_i |w_i|
$$
Where:
- $w_i$ are the model weights
- $\lambda$ (sometimes called alpha) is the regularization strength
- Encourages sparse models (many weights become exactly zero)
- Acts as an implicit feature selector
- Helps reduce variance and overfitting
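As a quick numeric sketch of the formula above (the weights, base loss, and $\lambda$ here are made-up values, purely for illustration):

```python
import numpy as np

# Hypothetical numbers, purely to illustrate the formula
weights = np.array([0.5, -0.25, 0.0, 1.0])   # w_i
base_loss = 0.40                              # original loss
lam = 0.001                                   # lambda, regularization strength

l1_penalty = lam * np.sum(np.abs(weights))    # lambda * sum(|w_i|) = 0.00175
total_loss = base_loss + l1_penalty           # 0.40 + 0.00175 = 0.40175
```

Note that every weight contributes its absolute value, so the penalty grows linearly with weight magnitude regardless of sign.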
Import the regularizer along with the model and layer classes:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l1
```

Apply the regularization to the kernel (weights) of each hidden layer:

```python
model = Sequential([
    Dense(128, activation='relu', kernel_regularizer=l1(0.001)),
    Dense(64, activation='relu', kernel_regularizer=l1(0.001)),
    Dense(1, activation='sigmoid')
])
```

- `kernel_regularizer=l1(0.001)` adds an L1 penalty to that layer's weights
- `0.001` controls the strength of regularization
- Larger values increase sparsity but may cause underfitting
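One way to see the effect of the penalty is to measure how many weights end up (near) zero. A minimal sketch of such a check, using a small hand-written matrix as a stand-in for a trained kernel (with a real model you would read the kernel via `model.layers[0].get_weights()[0]`; the `1e-3` threshold is an arbitrary choice for illustration):

```python
import numpy as np

# Stand-in for a trained kernel matrix; with a real Keras model you would
# read it via model.layers[0].get_weights()[0]
kernel = np.array([[0.80, 0.0, -0.0005],
                   [0.00, 0.0,  0.40]])

# Fraction of weights that L1 has driven to (near) zero
sparsity = np.mean(np.abs(kernel) < 1e-3)
print(sparsity)  # 4 of 6 entries are near zero -> ~0.667
```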
Compile and train the model:

```python
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

history = model.fit(
    X_train,
    y_train,
    epochs=15,
    batch_size=128,
    validation_split=0.1
)
```

- `epochs=15`: allows sufficient learning
- `batch_size=128`: balances speed and stability
- `validation_split=0.1`: monitors generalization on held-out data
After training, plot:
- Training loss
- Validation loss
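A minimal plotting sketch, assuming matplotlib is available (the loss values below are a made-up stand-in for the `history.history` dict returned by `model.fit()`):

```python
import matplotlib
matplotlib.use('Agg')  # render to a file; no display needed
import matplotlib.pyplot as plt

# Stand-in for history.history as returned by model.fit() above
history = {'loss':     [0.69, 0.55, 0.48, 0.45],
           'val_loss': [0.68, 0.58, 0.53, 0.52]}

plt.plot(history['loss'], label='Training loss')
plt.plot(history['val_loss'], label='Validation loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.savefig('loss_curves.png')
```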
What to look for:
- ✅ Both curves decrease at a similar rate
- ✅ Validation loss no longer diverges
- ✅ Model generalizes better

This indicates that L1 regularization is effective.

Why L1 Regularization Works
- Penalizes large weights
- Forces the network to rely on only the most important features
- Simplifies the model representation
- Reduces sensitivity to noise in the training data
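The reason L1 produces *exact* zeros can be seen in the classic Lasso update. In Keras the penalty is applied through plain gradient descent (so weights merely become very small), but the textbook soft-thresholding view below shows the underlying mechanism; the weight values and threshold are illustrative:

```python
import numpy as np

def soft_threshold(w, t):
    # Proximal step for the L1 penalty: shrink every weight toward zero
    # by t, and snap any weight with |w| < t to exactly 0
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w = np.array([0.5, -0.02, 0.003])
w_shrunk = soft_threshold(w, 0.01)
print(w_shrunk)  # [0.49, -0.01, 0.0] -- the smallest weight is zeroed out
```

Because the shrinkage amount is the same for every weight, small weights are eliminated entirely while large, important weights survive almost unchanged.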
L1 vs L2 Regularization
| Aspect | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Weight behavior | Many weights → 0 | Small but non-zero |
| Feature selection | ✅ Yes | ❌ No |
| Sparsity | High | Low |
| Stability | Less stable | More stable |
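The table's "weight behavior" row follows directly from the penalty gradients, which this short sketch computes for a few illustrative weights:

```python
import numpy as np

w = np.array([1.0, 0.1, 0.01])
lam = 0.01

# Gradient of each penalty with respect to the weights
l1_grad = lam * np.sign(w)   # constant pull toward 0, regardless of size
l2_grad = 2 * lam * w        # pull shrinks as the weight shrinks
```

L1's pull on a weight stays constant (`0.01` here) no matter how small the weight gets, so small weights are driven all the way to zero; L2's pull fades proportionally as a weight approaches zero, so weights end up small but non-zero.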
- ⚖️ Tune the regularization parameter (λ) carefully
- 🧪 Combine with cross-validation
- 🔀 Often used together with Dropout or Early Stopping
- 🧠 Useful when dataset has many irrelevant features
When to use L1:
- ✅ High-dimensional data
- ✅ Need interpretability
- ✅ Feature selection is important
- ✅ Strong signs of overfitting
- 🚫 Avoid if:
  - Dataset is very small
  - Important features should not be eliminated
- L1 regularization helps prevent overfitting by enforcing sparsity
- Easy to implement using `kernel_regularizer=l1()`
- Improves generalization and model robustness
- Useful as part of a broader regularization strategy