Image classification is a powerful tool for identifying small details or changes in existing data. Convolutional Neural Networks (CNNs) are stacks of different kinds of layers that learn image features and classify images according to a dataset. A combination of convolutional, max pooling, flatten, and dense layers scans through image data and connects learned features to predict the classification of a given image. For our classification model, our team decided to explore the differences between chihuahuas and muffins.
Our team imported an existing dataset from Kaggle that contains separate training and testing sets for muffins and chihuahuas. The dataset consists of 6,000 images of muffins and chihuahuas.
import kagglehub
import shutil  # used later when copying images into the train/test directories
import os      # used later when building local file paths

# Download the dataset; kagglehub returns the path to the local copy
path = kagglehub.dataset_download("samuelcortinhas/muffin-vs-chihuahua-image-classification")
Dataset source: https://www.kaggle.com/datasets/samuelcortinhas/muffin-vs-chihuahua-image-classification/data
The downloaded dataset was mapped to a local path, and its images were resized to a consistent height and width (256x256) before being placed into a separate, preprocessed dataset.
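A minimal sketch of this resizing step, assuming Pillow is installed (the function name and directory layout are illustrative, not our exact code):

```python
import os
from PIL import Image

IMG_SIZE = (256, 256)  # consistent height and width

def resize_dataset(src_dir, dst_dir, size=IMG_SIZE):
    """Resize every image under src_dir and save it under dst_dir,
    preserving the relative folder structure."""
    for root, _, files in os.walk(src_dir):
        for name in files:
            src_path = os.path.join(root, name)
            rel = os.path.relpath(src_path, src_dir)
            dst_path = os.path.join(dst_dir, rel)
            os.makedirs(os.path.dirname(dst_path), exist_ok=True)
            with Image.open(src_path) as img:
                img.convert("RGB").resize(size).save(dst_path)
```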
We then imported TensorFlow to assist in the identification process. Two directories were created for training and testing our data. We then defined our image data generator before loading the data, and went on to build our sequential model with several Rectified Linear Unit (ReLU) layers, encoding the two classes as binary labels.
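A sketch of that loading step, assuming each split directory contains one subfolder per class (the helper name `load_split` is illustrative):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rescale pixel values from [0, 255] to [0, 1]
datagen = ImageDataGenerator(rescale=1.0 / 255)

def load_split(directory):
    """Load one split (train or test) as a stream of (image, label) batches."""
    return datagen.flow_from_directory(
        directory,               # one subfolder per class: muffin/, chihuahua/
        target_size=(256, 256),  # matches the resizing step above
        class_mode="binary",     # labels become 0 or 1
        batch_size=32,
    )

# train_data = load_split("train")
# test_data = load_split("test")
```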
We created a simple CNN model starting with 3 Convolutional Layers, each of them followed by Max Pooling Layers. After 3 sets of convolutional and max pooling layers, a flatten layer was added. To help the model make decisions, we added a Dense layer with 512 neurons and ReLU activation, and finally a single output neuron with a sigmoid function — since we’re doing binary classification: muffin or chihuahua.
Conv2D Layers: These are fundamental building blocks of Convolutional Neural Networks (CNNs) used for feature extraction from images. Each layer applies multiple filters that slide across the image, performing element-wise multiplication and summation to create activation maps. These filters help the network learn different features like edges, textures, or shapes.
MaxPooling2D: This layer reduces the spatial dimensions (height and width) of the feature maps while retaining the most important information. It does this by selecting the maximum value in each region (usually 2x2), which helps reduce computation and control overfitting.
Flatten Layer: It reshapes the multi-dimensional output from the convolutional layers into a one-dimensional vector, preparing the data for the dense (fully connected) layers.
Dense Layers: These are traditional fully connected layers where each neuron is connected to all neurons in the previous layer. They help the model learn complex patterns and relationships in the data. The final dense layer has a sigmoid activation function, indicating binary classification (Muffin vs. Chihuahua).
Dropout: Dropout is used to prevent overfitting by randomly setting 50% of the input units to zero during training.
Activation Functions decide whether a neuron should be activated or not, introducing non-linearity into the model so it can learn complex patterns. For all Conv2D layers we used ReLU activation function and for the last Dense layer we used Sigmoid function.
- ReLU (Rectified Linear Unit) outputs the input directly if it’s positive; otherwise, it outputs zero. This helps the network learn faster and reduces the chance of vanishing gradients.
- Sigmoid squashes input values between 0 and 1, making it useful for binary classification tasks because it can be interpreted as a probability.
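Putting the layers above together, the initial architecture can be sketched roughly as follows (the 512-neuron dense layer and sigmoid output are as described above; the per-layer filter counts are assumptions for illustration):

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    # 3 blocks of Conv2D (ReLU) followed by MaxPooling2D
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(256, 256, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # Flatten the feature maps into a one-dimensional vector
    layers.Flatten(),
    # Dense layer with 512 neurons, then a single sigmoid output for binary classification
    layers.Dense(512, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
```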
The image below shows what our initial model looks like:
After training the model on a dataset of resized images, we tested it using images from the test set to see how well it performs on new, unseen data. As shown in the examples below, the model was able to accurately classify whether the input image was a muffin or a chihuahua, demonstrating that it successfully learned to recognize the visual patterns that separate the two.
Our initial experiment used 3 Convolutional Layers and trained the model for 8 epochs. An epoch simply means that the model has seen the entire training dataset once. So when we say 8 epochs, it means the model went through the training data 8 times, learning a bit more each time.
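Training for 8 epochs might look like the following sketch; the optimizer and loss follow standard Keras usage for a single sigmoid output, and the tiny stand-in model and synthetic data are only there so the snippet runs on its own (in our project, the model and data came from the steps above):

```python
import numpy as np
from tensorflow.keras import layers, models

# Tiny stand-in model and synthetic data so the sketch runs end to end
model = models.Sequential([
    layers.Flatten(input_shape=(8, 8, 3)),
    layers.Dense(1, activation="sigmoid"),
])
x = np.random.rand(16, 8, 8, 3).astype("float32")
y = np.random.randint(0, 2, size=(16,)).astype("float32")

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",  # suited to a single sigmoid output
    metrics=["accuracy"],
)

# One epoch = one full pass over the training data; we run 8 of them
history = model.fit(x, y, epochs=8, verbose=0)
```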
Below are visualizations of the training plots:
As the epochs go on, both accuracy lines trend upward, which means the model is learning from the data and improving over time. By the 7th epoch, both training and validation accuracy are close to 90%. Even though the validation loss fluctuates a bit (which is normal, though lowering the learning rate can reduce it), the general downward trend means the model is becoming more confident and accurate in its predictions.
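Plots like these can be produced from the history object that Keras returns from `model.fit`; a minimal matplotlib sketch (the helper name `plot_history` is illustrative):

```python
import matplotlib.pyplot as plt

def plot_history(history):
    """Plot training/validation accuracy and loss side by side."""
    fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))
    ax_acc.plot(history.history["accuracy"], label="train")
    ax_acc.plot(history.history["val_accuracy"], label="validation")
    ax_acc.set_title("Accuracy")
    ax_acc.set_xlabel("epoch")
    ax_acc.legend()
    ax_loss.plot(history.history["loss"], label="train")
    ax_loss.plot(history.history["val_loss"], label="validation")
    ax_loss.set_title("Loss")
    ax_loss.set_xlabel("epoch")
    ax_loss.legend()
    return fig
```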
Initially, we set the model with 3 Conv2D layers, 3 MaxPooling2D layers, and 2 Dense layers, and trained it for 8 epochs. This setup achieved an accuracy close to 90%, which was a great result for the first attempt.
In all the experiments we conducted to find the best-performing model, we changed the number of Conv2D layers, epochs, activation functions, and learning rate, observing how each variation affected the model's performance. Some changes didn’t lead to noticeable improvements in accuracy, while others caused the model’s performance to drop significantly.
After running all of these experiments, we identified our best and worst models based on their accuracy and the smoothness of their performance plots.
Below, you can find details about these models and the parameters that led to them.
- Best Model: Building on the initial setup, we added a Dropout layer to prevent overfitting while keeping all other parameters the same. With this small tweak, the model performed better, improving the test accuracy to 91%. It is also clear from the plots that, compared to the initial model, this tweak made the curves smoother and less jagged.
- Worst Model: The worst model was observed when we used 2 Conv2D layers, one with the 'relu' activation function and the other with 'sigmoid'. The number of epochs was set to 8, and the learning rate was left at its default value.
Below is a representation of the worst model's performance:
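For completeness, the best model's dropout tweak amounts to inserting a Dropout layer before the final classifier; a sketch (the dropout placement and filter counts are assumptions, since only the 50% rate and the overall layer list are stated above):

```python
from tensorflow.keras import layers, models

best_model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(256, 256, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(512, activation="relu"),
    # Randomly zero 50% of inputs during training to fight overfitting
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
```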






