NoahHayes17/EEG-Classification


EEG Motor Imagery Classification for BCIs

Simplified to binary classes (eyes open/closed) to understand and improve architectural choices.

The project and my objectives

This document describes the theory and reasoning behind a Convolutional Neural Network (CNN) that classifies EEG data (brain signals) recorded while a subject's eyes were open versus closed. I built upon this repository, using it as a guideline for what to learn and as an example of what a practical implementation of the theory looks like. I took this route because my overall objective was to gain a fundamental introduction to machine learning (ML) beyond basic regression, and to explore my interest in brain-computer interfaces.

I improved the baseline model by implementing 3×3 kernels to utilise receptive fields, stride-based downsampling, and pure window segmentation to prevent label bleeding, which increased test accuracy from 94% to 99% and reduced the training/testing gap by 3% on average.

Why CNN

The two most basic discriminative models beyond a multi-layer perceptron are RNNs and CNNs. I chose a CNN over a recurrent neural network (RNN) because of the nature of multi-channel EEG data: RNNs excel at identifying temporal patterns, whereas CNNs are better suited to learning spatio-temporal features. EEG data has both spatial structure (where each electrode sits on the head) and time-dependent patterns, so a CNN can model it better.

The methodology

Preprocessing

I use a cleaned dataset for subject one from the EEG Motor Movement/Imagery Dataset v1.0.0, which contains 11 classes of time-point data. I remove 9 of the 11 classes so the task reduces to classifying eyes open vs closed, letting me focus on the fundamentals before moving to multi-class classification.

I segment the data because patterns in one long recording may bleed together, making them difficult for the model to detect. Isolating patterns in windows fixes this and also yields a larger dataset to train on. I further prevent label bleeding during training by discarding any window that spans more than one label; this matters in EEG motor imagery because the periods between events would otherwise create ambiguous training examples.
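As a sketch, pure-window segmentation can be done in a few lines of NumPy (the function name, window size, and step here are illustrative, not the project's actual values):

```python
import numpy as np

def segment_pure_windows(data, labels, window_size=160, step=80):
    """Split a (channels, time) recording into overlapping windows,
    keeping only 'pure' windows whose time points all share one label."""
    windows, window_labels = [], []
    for start in range(0, data.shape[1] - window_size + 1, step):
        seg_labels = labels[start:start + window_size]
        if np.all(seg_labels == seg_labels[0]):  # drop mixed-label windows
            windows.append(data[:, start:start + window_size])
            window_labels.append(seg_labels[0])
    return np.stack(windows), np.array(window_labels)
```

Overlapping windows (step smaller than the window size) further enlarge the training set, at the cost of correlated examples.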

I split the data into training and test sets before normalisation, then perform channel-wise normalisation using statistics computed from the training data only; both choices avoid data leakage. Normalising each channel independently prevents channels with larger amplitudes from dominating the training process.
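A minimal sketch of leakage-free channel-wise normalisation, assuming windowed arrays of shape (n_windows, channels, time); the function name and epsilon are mine:

```python
import numpy as np

def channel_normalise(train, test, eps=1e-8):
    """Z-score each EEG channel using statistics from the training set only,
    so no information from the test set leaks into preprocessing."""
    mean = train.mean(axis=(0, 2), keepdims=True)  # per-channel mean
    std = train.std(axis=(0, 2), keepdims=True)    # per-channel std
    return (train - mean) / (std + eps), (test - mean) / (std + eps)
```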

I use mini-batching because it helps the optimiser escape local minima: computing gradients from a random subset of the data injects noise into each update.

Model architecture

Layer structure

  • Conv1: 1 input channel, 16 kernels, 5×5 kernel, stride 2, batch norm, ReLU
  • Conv2: 16 input channels, 32 kernels, 3×3 kernel, stride 2, batch norm, ReLU
  • Conv3: 32 input channels, 64 kernels, 3×3 kernel, stride 2, batch norm, ReLU
  • MLP: flatten, dense(128), dropout(0.35), dense(2)
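A minimal PyTorch sketch of the layer structure above. The class name is mine; the padding of 1 on the 3×3 layers and the use of `LazyLinear` to infer the flattened size are assumptions, not taken from the project's code:

```python
import torch
import torch.nn as nn

class EyesCNN(nn.Module):  # hypothetical name, mirrors the layer list above
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            # Conv1: 5x5 kernel, stride 2, padding 2 preserves edge information
            nn.Conv2d(1, 16, kernel_size=5, stride=2, padding=2),
            nn.BatchNorm2d(16), nn.ReLU(),
            # Conv2/Conv3: stacked 3x3 kernels, stride-2 downsampling
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(128),   # dense(128), input size inferred at first call
            nn.Dropout(0.35),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```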

Design Reasoning

I chose three convolutional layers after finding through experimentation that deeper architectures did not improve test accuracy but did increase overfitting. I use a pyramid structure (doubling the number of kernels in each layer) to compensate for the decreasing spatial resolution while learning increasingly abstract features.

I use a kernel size of 3×3 because stacking smaller 3×3 kernels is more parameter-efficient than a single large 5×5 kernel: two stacked 3×3 convolutions achieve the equivalent receptive field with fewer parameters and add an extra non-linear activation (ReLU) in between. I didn't go below 3×3 because a 2×2 kernel isn't symmetrical around a central pixel, which can distort feature representations, and a 1×1 kernel only mixes channels. The first layer uses 5×5 because there is no previous layer to build up its receptive field.
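The parameter saving is easy to verify: for a fixed channel width c, one 5×5 convolution needs 25·c² weights while two stacked 3×3 convolutions need 2·9·c² = 18·c², and stacking n stride-1 k×k convolutions gives a receptive field of 1 + n(k − 1). A quick sketch with an illustrative channel width:

```python
def conv_weights(k, c_in, c_out):
    """Weight count of one k x k conv layer (biases ignored)."""
    return k * k * c_in * c_out

def stacked_receptive_field(k, n_layers):
    """Receptive field of n stacked stride-1 k x k convolutions."""
    return 1 + n_layers * (k - 1)

c = 32  # illustrative channel width
one_5x5 = conv_weights(5, c, c)      # 25 * 32 * 32 = 25600 weights
two_3x3 = 2 * conv_weights(3, c, c)  # 18 * 32 * 32 = 18432 weights
```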

I downsample with a stride of 2 rather than with pooling because pooling summarises the data harshly and can discard relevant spatial patterns. I used padding = 2 to preserve edge information while downsampling with a stride of 2, ensuring edge elements still contribute to the feature maps.

I chose ReLU as my activation function because it mitigates vanishing gradients: its derivative is either 0 or 1. Although the zero gradient can cause dying neurons, I accepted the trade-off given my shallow three-layer architecture.

I apply dropout to prevent overfitting: randomly zeroing node activations during training reduces the model's sensitivity to specific training examples. I use batch normalisation between layers because, as data propagates through the network, it can become uncentred and different features can end up on very different scales.

I also use L2 regularisation, which adds a penalty to the loss function to keep weights small, preventing the large weights that encourage overfitting. I chose L2 over L1 because L1's tendency to zero out weights would compound ReLU's dying-neuron problem in my architecture.

I use cross-entropy loss because it heavily penalises misclassifications compared to regression-based loss functions such as mean squared error (MSE).
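Putting the training choices together, a minimal PyTorch epoch sketch might look like this. The hyperparameters in the comment are illustrative; `weight_decay` is the standard way PyTorch optimisers apply the L2 penalty described above:

```python
import torch
import torch.nn as nn

def train_epoch(model, loader, optimiser, criterion):
    """One epoch of mini-batch training; returns the mean training loss."""
    model.train()
    total = 0.0
    for x, y in loader:                # each batch is a random data subset
        optimiser.zero_grad()
        loss = criterion(model(x), y)  # cross-entropy on the logits
        loss.backward()
        optimiser.step()
        total += loss.item() * len(x)
    return total / len(loader.dataset)

# Illustrative setup: weight_decay implements the L2 regularisation penalty.
# optimiser = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
# criterion = nn.CrossEntropyLoss()
```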

Results

I achieved a peak accuracy of 99.18% on test data and an average of 98%, a ~5% improvement over the original model, and reduced the overfitting gap by 3% on average. Despite this, my model's training accuracy peaks at 100%, compared with the baseline's peak of 98%. This suggests the model still overfits the training data, which I plan to address with early stopping.

Improvements

This project is still in active development and so still has some issues, but showcasing my work is important, which is why I am publishing the code now. The following are my current focuses.

  • Reduce overfitting with early stopping.
  • Implement more sophisticated evaluation techniques such as AUC, cross-validation and a classification report.
  • Investigate parametric ReLU.
  • Implement a hyperparameter search instead of using standard values, as the best ones vary from dataset to dataset.

Suggestions from an AI PhD student (specialising in speech models for BCIs) whom I contacted about my model:

  • Subband filtering
  • Experiment with temporal frequency Power Spectral Density (PSD) and time frequency representation (TFR)

To gain a better understanding of the entire pipeline and EEG data, I am contacting other PhD students at my university in the hope of working with them to help collect such data.


A link to my more comprehensive and explanatory notes documenting everything I learnt as part of this project can be found here.
