Igbo-English Hate Speech Detection

A proof-of-concept machine learning system that classifies Igbo-English code-mixed text as hate speech or non-hate speech. Built as an HND final-year project, this is the first labeled dataset and detection baseline published for the Igbo-English language pair, contributing to NLP research for low-resource African languages.

What this project contributes

A labeled Igbo-English code-mixed dataset (268 examples) built by the author from scratch, the first of its kind publicly available.
A trained binary classifier (TF-IDF features into a dense neural network) with a working evaluation pipeline.
A desktop GUI application built with Kivy, so anyone can install and try the classifier without writing code.
All research artifacts published including the Kaggle notebook and Kaggle dataset, in addition to this repository.

Why it matters

Most NLP research targets high-resource languages like English. Igbo, spoken by over 40 million people in Nigeria, has very limited NLP resources. Igbo-English code-mixing, which is how most Igbo speakers actually communicate online, has almost no published work at all. This project is a small step toward filling that gap.

Dataset

Source: Built from scratch by the author using two methods:

An online Google Form distributed to native Igbo speakers to collect common Igbo hate speech terms and expressions
Igbo-English dictionary websites to source lexical context for non-hate examples

Size: 268 labeled examples (153 hate speech, 115 non-hate)

Design: The dataset uses a minimal-pair structure. The same Igbo words (for example, animal names like Nkịta, Ezi, Ọgazị) appear in both classes, requiring the model to learn contextual usage rather than relying on a blacklist of words.

Example non-hate: "Nkịta, the dog, barks loudly to alert its owner."

Example hate: "You act like a Nkịta, always causing trouble."

Available on Kaggle: Igbo Hate Speech Dataset

Model

A lightweight binary classifier:

Features: TF-IDF vectorization
Architecture: Sequential neural network with one hidden Dense layer (128 units, ReLU activation) and a Sigmoid output
Optimizer: Adam
Loss: Binary cross-entropy
Training: 10 epochs, batch size 32, 20% validation split

Full training notebook published on Kaggle: Igbo Hate Speech Detection notebook

Results

Evaluated on a held-out test set of 54 examples:

Overall test accuracy: 79.6%

Per-class metrics:

Class	Precision	Recall	F1
Non Hate Speech	1.00	0.56	0.72
Hate Speech	0.72	1.00	0.84

The classifier catches 100% of actual hate speech in the test set. It trades off against precision on the non-hate class, which is a reasonable bias for a content moderation baseline but worth flagging honestly.

Confusion Matrix

Classification Metrics by Class

Training and Validation Curves

Limitations and next steps

I want to be transparent about where this project sits as engineering:

Dataset size is small. 268 examples is enough for a proof of concept, not production. Expanding to several thousand examples would meaningfully improve generalization.
The model overfits. Training accuracy reaches ~99% while validation plateaus around 75%. Dropout, L2 regularization, or early stopping would help.
TF-IDF is a simple baseline. Word embeddings (fastText, or an Igbo-aware embedding) or a small transformer would likely outperform TF-IDF on code-mixed text.
Binary labels only. Real hate speech moderation needs more granular categories (offensive, hateful, threatening, and so on).
Class imbalance handling. The 153/115 split is mild, but a production system should use class weights or resampling.

A follow-up version of this project would expand the dataset to at least 2,000 examples and compare the TF-IDF baseline against an embedding-based approach.

Install and run

Prerequisites

Python 3.x
Dependencies in requirements.txt

Setup

git clone https://github.com/Psunday2000/hate_speech_igbo.git
cd hate_speech_igbo
pip install -r requirements.txt

Run the GUI

python main.py

The Kivy-based GUI launches. Enter a message and click Submit to classify it.

Run the notebook

Open hate_speech_model.ipynb in Jupyter to retrain, modify, or experiment with the model.

Repository contents

hate_speech_dataset.csv — the full labeled dataset
hate_speech_model.ipynb — training notebook with preprocessing, model, and evaluation
hspeechmodel.h5 — trained Keras model
vectorizer.pkl — fitted TF-IDF vectorizer
main.py — Kivy GUI application
test.py — standalone inference script
convert.py — utility script
Visualization images (confusion matrix, training curves, classification metrics, tokenization, vectorization)

License

Released under the MIT License. See LICENSE for details.

Citation

If you use this dataset or model in research, please cite:

Sunday, P. C. (2024). Igbo-English Hate Speech Detection: A Proof-of-Concept 
Binary Classifier for a Low-Resource Language Pair. HND Final Year Project, 
Akanu Ibiam Federal Polytechnic, Unwana.
Available at: https://github.com/Psunday2000/hate_speech_igbo

Author

Built by Praise Chiedozie Sunday.

Portfolio: praise-chiedozie-sunday.vercel.app
LinkedIn: praise-sunday
GitHub: @Psunday2000

If you find this project useful, a star on the repo is appreciated.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
Classifcation Metrics by Class.png		Classifcation Metrics by Class.png
Confusion Matrix.png		Confusion Matrix.png
Montserrat-Medium.ttf		Montserrat-Medium.ttf
README.md		README.md
Training and Validation Accuracy and Loss over Epochs.png		Training and Validation Accuracy and Loss over Epochs.png
cleaned_data.png		cleaned_data.png
convert.py		convert.py
expr.json		expr.json
expr.txt		expr.txt
hate_speech_dataset.csv		hate_speech_dataset.csv
hate_speech_model.ipynb		hate_speech_model.ipynb
hspeechmodel.h5		hspeechmodel.h5
main.py		main.py
negative_image.png		negative_image.png
positive_image.png		positive_image.png
requirements.txt		requirements.txt
test.py		test.py
tokenization.png		tokenization.png
vectorization.png		vectorization.png
vectorizer.pkl		vectorizer.pkl

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Igbo-English Hate Speech Detection

What this project contributes

Why it matters

Dataset

Model

Results

Confusion Matrix

Classification Metrics by Class

Training and Validation Curves

Limitations and next steps

Install and run

Prerequisites

Setup

Run the GUI

Run the notebook

Repository contents

License

Citation

Author

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Igbo-English Hate Speech Detection

What this project contributes

Why it matters

Dataset

Model

Results

Confusion Matrix

Classification Metrics by Class

Training and Validation Curves

Limitations and next steps

Install and run

Prerequisites

Setup

Run the GUI

Run the notebook

Repository contents

License

Citation

Author

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages