In machine learning and deep learning, parameters and hyperparameters play crucial yet distinct roles in how models are constructed, trained, and optimized.
- Parameters are the internal variables learned directly from training data.
- They are not set manually.
- The learning algorithm updates them to minimize the loss function, which measures the difference between predictions and actual values.
- Learned during training
- Adjusted automatically via optimization (e.g., gradient descent)
- Define how the model represents patterns in data
A simple house-price model:
Y = W · X + B
- W (weight): slope
- B (bias): y-intercept
- Parameters: W and B, which are adjusted during training to minimize prediction error.
A larger example: a neural network for classifying 28×28 digit images:
- Input Layer: 784 neurons (28×28 pixel image)
- Hidden Layer 1: 512 neurons
- Hidden Layer 2: 128 neurons
- Output Layer: 10 neurons (digit classes 0–9)
Weight matrices:
- Input → Hidden1: 784 × 512 weights
- Hidden1 → Hidden2: 512 × 128 weights
- Hidden2 → Output: 128 × 10 weights
- Hidden Layer 1: 512 biases
- Hidden Layer 2: 128 biases
- Output Layer: 10 biases
All weights and biases = parameters learned during training via optimization algorithms such as stochastic gradient descent.
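The parameter count for this network can be verified with a quick calculation (layer sizes taken from the example above; plain Python, no framework needed):

```python
# Layer sizes from the example network: 784 -> 512 -> 128 -> 10
layers = [784, 512, 128, 10]

# Weights: one matrix per pair of adjacent layers
weights = sum(n_in * n_out for n_in, n_out in zip(layers, layers[1:]))

# Biases: one per neuron in every layer except the input
biases = sum(layers[1:])

print(weights)           # 784*512 + 512*128 + 128*10 = 468224
print(biases)            # 512 + 128 + 10 = 650
print(weights + biases)  # 468874 learnable parameters in total
```

Every one of these 468,874 numbers is a parameter: the optimizer, not the practitioner, sets its value.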
Hyperparameters are external configuration choices set before training begins.
They:
- Govern the behavior of the training algorithm
- Define the model architecture
- Are not learned from data
- Require experimentation and tuning
An analogy:
- Parameters = construction materials (bricks, wood, cement)
- Hyperparameters = architectural blueprint (number of rooms, layout, design choices)
- Learning rate: controls the size of weight updates during training
  - Too high → unstable
  - Too low → slow learning
- Batch size: how many samples are used per gradient update
- Epochs: number of times the full dataset is passed through the model during training
| Concept | Parameters | Hyperparameters |
|---|---|---|
| Learned during training? | ✔️ Yes | ❌ No |
| Set manually? | ❌ No | ✔️ Yes |
| Examples | Weights, biases | Learning rate, batch size, epochs |
| Role | Define model behavior | Define training process & model design |
These notes provide a clear differentiation between parameters and hyperparameters, along with examples to deepen understanding.
Hyperparameters are external settings that define how a neural network is structured and trained.
Unlike model parameters (weights and biases), hyperparameters are set before training and have a major influence on learning efficiency, convergence, and generalization.
- They shape the learning process and model architecture.
- Good hyperparameter choices lead to:
- Efficient training
- Stable optimization
- Better generalization to unseen data
- Poorly chosen hyperparameters can cause:
- Overfitting
- Underfitting
- Slow or unstable convergence
The learning rate controls how big a step the model takes when updating weights during each iteration.
- Too high → overshooting, divergence, unstable training
- Too low → slow learning, risk of getting stuck in suboptimal minima
- A common baseline: 0.001 (especially with Adam).
- Use learning rate schedulers:
- Step decay
- Cosine annealing
- Warmup schedules
- Perform a quick scan of values between 0.0001 and 0.1 to find a good range.
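The effect of step size is easy to see on a toy quadratic loss. This is a minimal gradient-descent sketch (not tied to any framework) that illustrates the too-high/too-low behavior described above:

```python
def gradient_descent(lr, steps=50, w=5.0):
    """Minimize f(w) = w**2 (gradient 2*w) with a fixed learning rate."""
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(abs(gradient_descent(0.1)))     # reasonable lr: converges close to the minimum at 0
print(abs(gradient_descent(0.0001)))  # too low: barely moves after 50 steps
print(abs(gradient_descent(1.1)))     # too high: every step overshoots, |w| grows (divergence)
```

Real loss surfaces are far messier, but the same three regimes (converge, crawl, diverge) show up in practice.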
Batch size = number of samples processed before updating model parameters.
- Smaller batch sizes:
- Noisier gradients → can escape shallow minima
- Faster initial convergence
- Useful for limited GPU memory
- Larger batch sizes:
- More stable gradients
- Faster computation on modern GPUs
- Might require a higher learning rate
- Start between 32 and 256.
- Adjust based on:
- Hardware memory
- Model stability
- Learning rate behavior
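Batch size directly determines how many parameter updates happen per epoch. A small sketch (60,000 samples is an assumption, matching the size of the MNIST training set):

```python
import math

def updates_per_epoch(n_samples, batch_size):
    # One parameter update per batch; the last batch may be smaller
    return math.ceil(n_samples / batch_size)

n = 60000  # e.g., MNIST training set size
for bs in (32, 128, 512):
    print(bs, updates_per_epoch(n, bs))
# Smaller batches -> more, noisier updates per epoch
# Larger batches  -> fewer, smoother updates per epoch
```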
Epochs = number of full passes through the training dataset.
- Too few → underfitting
- Too many → overfitting
- Typical starting range: 10 to 50 epochs
- Use early stopping based on:
- Validation loss
- Validation accuracy
- Stop training when validation performance stops improving.
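The early-stopping rule above can be sketched in a few lines. This is a simplified stand-alone version (frameworks such as Keras provide an EarlyStopping callback); the loss values are hypothetical:

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the 0-based epoch at which training stops: when validation
    loss has not improved for `patience` consecutive epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses) - 1

# Hypothetical validation losses: improvement stalls after epoch 3
losses = [0.9, 0.7, 0.6, 0.55, 0.56, 0.58, 0.57, 0.59]
print(early_stop_epoch(losses))  # stops at epoch 6, three epochs after the best
```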
Architecture hyperparameters define the design of the neural network, such as:
- Number of layers
- Number of neurons per layer
- Start simple → shallow model
- If underfitting → increase depth or width
- For well-studied tasks, start with known architectures:
- ResNet (image classification)
- BERT variants (NLP)
As models grow:
- Add dropout
- Use weight decay (L2 regularization)
- Add batch normalization
Dropout randomly “turns off” a fraction of neurons during training.
Prevents over-reliance on specific nodes → improves generalization.
- 0.1 to 0.5, depending on:
- Layer type
- Model complexity
- If model underfits → reduce dropout
- If model overfits → increase dropout
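The mechanics of dropout can be illustrated with a tiny simulation. This is a sketch of "inverted dropout" (the variant most frameworks use at training time); it is not the framework implementation itself:

```python
import random

def dropout(activations, rate, training=True):
    """Inverted dropout: zero out roughly a fraction `rate` of units and
    rescale survivors by 1/(1-rate) so the expected activation is unchanged."""
    if not training or rate == 0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(0)
out = dropout([1.0] * 10, rate=0.5)
print(out)  # roughly half the units zeroed, survivors scaled to 2.0
```

At inference time (`training=False`) no units are dropped and no rescaling is needed, which is why the inverted form is convenient.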
Regularization terms such as weight decay (L2) help prevent overfitting by penalizing large weights.
- Start with small coefficients
- Increase only if signs of overfitting appear
- Decrease if underfitting
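The L2 penalty is simply an extra term added to the loss; a minimal sketch with made-up numbers shows how it pulls weights toward zero:

```python
# L2 regularization (weight decay): the penalty lam * sum(w**2) is added
# to the data loss, so each weight's gradient gains a 2*lam*w term.
def l2_penalty(weights, lam):
    return lam * sum(w * w for w in weights)

weights = [0.5, -1.2, 2.0]
data_loss = 0.30   # hypothetical loss from the data alone
lam = 0.01         # regularization strength (a hyperparameter)

total = data_loss + l2_penalty(weights, lam)
print(round(total, 4))  # 0.30 + 0.01 * (0.25 + 1.44 + 4.0) = 0.3569

# Extra gradient contribution, pushing every weight toward 0
grads = [2 * lam * w for w in weights]
print(grads)
```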
Initialization determines how weights start before training.
Good initialization:
- Keeps gradients stable
- Ensures signals propagate properly
Poor initialization:
- Causes vanishing or exploding gradients
- Xavier Initialization
- He Initialization
These are widely used for deep networks.
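Both schemes boil down to simple formulas over the layer's fan-in and fan-out. A sketch using the 784 → 512 layer from the earlier example:

```python
import math

fan_in, fan_out = 784, 512  # layer sizes from the earlier example

# Xavier/Glorot uniform: sample weights from U(-limit, +limit)
xavier_limit = math.sqrt(6 / (fan_in + fan_out))

# He (Kaiming) normal: sample weights from N(0, std**2); suited to ReLU,
# which zeroes half the activations, hence the factor of 2
he_std = math.sqrt(2 / fan_in)

print(round(xavier_limit, 4))  # ~0.068
print(round(he_std, 4))        # ~0.0505
```

Frameworks apply these formulas per layer automatically when you pick the corresponding initializer.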
The optimizer controls how gradients update weights during backpropagation.
- SGD (Stochastic Gradient Descent)
- Simple, effective
- Often improved with momentum
- Adam
- Adaptive learning rate per parameter
- Robust across many tasks
- Good default choice
- Start with proven defaults (Adam or SGD+momentum).
- Only adjust optimizer settings if:
- Training divergence occurs
- Convergence is unusually slow
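The momentum variant of SGD mentioned above can be sketched in a few lines on the same toy quadratic used earlier (a minimal illustration, not a production optimizer):

```python
def sgd_momentum(grad_fn, w=5.0, lr=0.1, beta=0.9, steps=200):
    """SGD with momentum: the velocity accumulates past gradients,
    smoothing updates and speeding travel along consistent directions."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad_fn(w)  # accumulate gradient history
        w -= lr * v                # step along the velocity
    return w

grad = lambda w: 2 * w             # gradient of f(w) = w**2
print(abs(sgd_momentum(grad)))     # ends very close to the minimum at 0
```

Adam extends this idea with a per-parameter adaptive step size, which is why it is a robust default.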
- Start with known, reliable baselines
- Adjust one hyperparameter at a time
- Monitor:
- Validation loss
- Validation accuracy
- Iterate gradually to develop intuition over time.
Hyperparameter tuning is a balancing act, but with systematic testing, patterns and best practices become clearer.
Hyperparameter tuning is the process of finding the best hyperparameter settings to optimize a deep learning model’s training performance, convergence, and generalization. Unlike model parameters (weights, biases), hyperparameters are set before training and are not learned automatically.
Because the hyperparameter search space can be large and complex, several tuning strategies are commonly used.
Grid search evaluates every possible combination of predefined hyperparameter values.
Example:
- Learning rates tested: 0.00001 → 0.1 (5 values)
- Batch sizes tested: 32 → 512 (5 values)
- Total combinations = 5 × 5 = 25 models
Pros:
- Exhaustive and thorough: guarantees testing every point in the grid
- Simple, intuitive, and widely supported
- Ensures no defined configuration is missed

Cons:
- Computationally expensive: cost grows exponentially with more hyperparameters
- Rigid: must evaluate every combination, even poor candidates
- Not ideal for large or continuous search spaces
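The 5 × 5 example above is just a Cartesian product; the candidate values below are illustrative:

```python
import itertools

learning_rates = [0.00001, 0.0001, 0.001, 0.01, 0.1]  # 5 values
batch_sizes = [32, 64, 128, 256, 512]                 # 5 values

# Every combination is trained and evaluated once
grid = list(itertools.product(learning_rates, batch_sizes))
print(len(grid))  # 5 * 5 = 25 configurations
```

Adding a third hyperparameter with 5 values would push this to 125 runs, which is exactly the exponential blow-up noted above.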
Instead of testing all combinations, random search samples random values from specified ranges.
Example:
- Learning rate randomly chosen from: 0.0001 → 0.1
- Batch size randomly chosen from: 32 → 512
Pros:
- More efficient than grid search
- Quickly finds good configurations
- Scales well: add more trials as needed
- Focuses on potentially useful regions of the search space

Cons:
- No guarantee of exploring all important regions
- Might miss high-performing configurations
- Less systematic coverage than grid search
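A sketch of the sampling step (the ranges match the example above; sampling the learning rate log-uniformly is a common choice because its useful values span orders of magnitude):

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

def sample_config():
    # Learning rate sampled log-uniformly: exponent uniform in [-4, -1]
    lr = 10 ** random.uniform(-4, -1)
    batch_size = random.choice([32, 64, 128, 256, 512])
    return lr, batch_size

trials = [sample_config() for _ in range(10)]  # add more trials as budget allows
for lr, bs in trials[:3]:
    print(f"lr={lr:.5f}, batch_size={bs}")
```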
Bayesian optimization builds a surrogate model to predict how hyperparameter settings affect performance.
After each evaluation, it updates its internal model to choose the next hyperparameters by balancing:
- Exploration (try new regions)
- Exploitation (refine promising regions)
Think of it like having a “smart assistant” that gets better at guessing where good hyperparameters are.
Pros:
- Intelligent and efficient search
- Requires fewer evaluations than grid or random search
- Learns from previous trials
- Great for expensive or complex models

Cons:
- More complex to implement
- Requires maintaining a sophisticated surrogate model
- Computational overhead for the optimization process itself
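The explore/exploit loop can be caricatured in a few lines. This is a deliberately simplified toy, not real Bayesian optimization (which fits a proper surrogate such as a Gaussian process; libraries like Optuna or scikit-optimize do this for you); the "validation accuracy" function is invented for illustration:

```python
import random

random.seed(0)

def score(lr_exp):
    # Toy stand-in for validation accuracy, peaked near lr = 10**-2.5
    return 1.0 - (lr_exp + 2.5) ** 2

best_x, best_y = None, float("-inf")
for trial in range(30):
    if best_x is None or random.random() < 0.3:
        x = random.uniform(-4, -1)          # exploration: try a new region
    else:
        x = best_x + random.gauss(0, 0.2)   # exploitation: refine near the best so far
    y = score(x)
    if y > best_y:
        best_x, best_y = x, y

print(round(best_x, 2))  # typically lands near the peak at -2.5
```

Real Bayesian optimization replaces the crude "jump near the best point" rule with a surrogate model and an acquisition function, but the exploration/exploitation trade-off is the same.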
Use grid search when:
- The search space is small
- Hyperparameters have only a few discrete options
- You need complete coverage

Use random search when:
- The search space is large
- Hyperparameters vary over continuous ranges
- You want faster, more flexible testing

Use Bayesian optimization when:
- Training models is expensive
- You need efficient exploration
- The search space is complex and multi-dimensional
- Start with baseline or default values
- Tune one hyperparameter at a time initially
- Use domain knowledge when narrowing ranges
- Use validation metrics to guide decisions
- Iterate gradually — hyperparameter tuning is a process of refinement
Hyperparameter tuning ensures that a model reaches its best possible performance without unnecessary computation. By carefully selecting tuning methods and combining them with practical experience, you can efficiently explore the search space and achieve optimal results.
In this lesson, we learn how to define a tunable deep learning model in Keras, preparing it for hyperparameter tuning using Keras Tuner. This includes creating a function that acts as the model blueprint, specifying which hyperparameters should be searched and how they influence model structure and training.
Before defining the tunable model:
- Import and preprocess data
- Build and train a baseline model
- Import necessary layers and optimizers
These steps should already be completed earlier in the notebook.
To support hyperparameter tuning, we import:
- Dropout layer
- Adam optimizer
Keras Tuner repeatedly calls this function, each time with different hyperparameter values.

- Function name: build_model(hp)
- hp is a hyperparameter object that defines which hyperparameters to search.
- The function returns a compiled model ready for evaluation.

a. Model Initialization
- model = keras.Sequential()

b. Input Layer
- Input shape: 784 (flattened 28×28 image)

c. Hidden Layer 1 (Tunable)
- Use hp.Int to tune the number of neurons:
hp.Int("hidden1", min_value=32, max_value=512, step=32)
- Dropout Layer 1 (Tunable)
- Use hp.Float to tune dropout rate:
hp.Float("dropout1", min_value=0.1, max_value=0.5, step=0.1)
- Hidden Layer 2 (Tunable)
- Number of neurons chosen with:
hp.Int("hidden2", min_value=16, max_value=128, step=16)
- Dropout Layer 2 (Tunable)
- Same dropout range as earlier:
hp.Float("dropout2", min_value=0.1, max_value=0.5, step=0.1)
The output layer has a fixed number of units:
- 10 units for digit classification (0–9)
- Dense(10, activation="softmax")
# 5️⃣ Learning Rate (Tunable with hp.Choice)
- We evaluate a set of discrete learning rates:
- hp.Choice("learning_rate", values=[0.0001, 0.001, 0.01])
- The optimizer uses the selected learning rate.
```python
model.compile(
    optimizer=Adam(learning_rate=hp_learning_rate),
    loss="categorical_crossentropy",
    metrics=["accuracy"]
)
```
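Putting the pieces above together, a complete build_model might look like the sketch below (it assumes the keras and keras_tuner packages are installed; the hyperparameter names and ranges follow the examples in these notes, and the relu activations are an assumption, since the notes do not state them):

```python
import keras
from keras import layers

def build_model(hp):
    """Model blueprint called repeatedly by Keras Tuner with different
    hyperparameter values drawn from the ranges declared below."""
    model = keras.Sequential()
    model.add(layers.Input(shape=(784,)))  # flattened 28x28 image

    # Tunable hidden layer sizes and dropout rates
    model.add(layers.Dense(
        hp.Int("hidden1", min_value=32, max_value=512, step=32),
        activation="relu"))
    model.add(layers.Dropout(
        hp.Float("dropout1", min_value=0.1, max_value=0.5, step=0.1)))
    model.add(layers.Dense(
        hp.Int("hidden2", min_value=16, max_value=128, step=16),
        activation="relu"))
    model.add(layers.Dropout(
        hp.Float("dropout2", min_value=0.1, max_value=0.5, step=0.1)))

    # Fixed output layer: 10 digit classes
    model.add(layers.Dense(10, activation="softmax"))

    # Tunable learning rate from a discrete set
    hp_learning_rate = hp.Choice("learning_rate", values=[0.0001, 0.001, 0.01])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=hp_learning_rate),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

A tuner (for example keras_tuner.RandomSearch) is then given this function and calls it once per trial with a fresh hp object.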
Purpose of the tunable function:
- Acts as the architectural blueprint
- Defines all hyperparameters to explore:
  - Hidden layer sizes
  - Dropout rates
  - Learning rate
- Is repeatedly invoked during the hyperparameter search

This allows Keras Tuner to systematically explore the hyperparameter space and find the configuration that maximizes model performance.

Please see the code: ./run_hparam_tuning.py