Phishing Email Detection - Machine Learning Component

This repository hosts the machine learning component of a phishing email detection project. The larger project is an Android application that uses federated machine learning for enhanced privacy and decentralized learning. This component focuses on preprocessing email data, training a machine learning model, and making predictions to identify potential phishing attempts. The project is part of my Bachelor's thesis on phishing email detection.

Project Overview

This project identifies phishing attempts in emails. A series of feature finders extracts features from each message, such as embedded URLs, HTML content, and attachments. These features are then fed into a TensorFlow-based model, which is trained to classify emails as phishing or legitimate. The ultimate goal is to integrate this model into an Android application that uses federated learning to improve model accuracy while maintaining user privacy.
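
To make the federated learning idea more concrete, here is a minimal sketch of the aggregation step in federated averaging (FedAvg). This README does not state which aggregation scheme the Android application uses, so FedAvg is an assumption chosen because it is a common one. The function name federated_average and the flat list-of-floats weight format are hypothetical and are not taken from this repository.

```python
from typing import List, Sequence


def federated_average(
    client_weights: Sequence[Sequence[float]],
    client_sizes: Sequence[int],
) -> List[float]:
    """Combine per-client model weights into one global weight vector.

    Each client trains locally and sends only its weights (never its
    emails). Clients with more local samples get proportionally more
    influence on the result.
    """
    if len(client_weights) != len(client_sizes):
        raise ValueError("need exactly one sample count per client")
    total = sum(client_sizes)
    if total == 0:
        raise ValueError("clients reported no training samples")

    averaged = [0.0] * len(client_weights[0])
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            # Weighted contribution of this client to parameter i.
            averaged[i] += w * size / total
    return averaged
```

In a real deployment the weights would be the tensors of the TensorFlow model rather than flat Python lists, but the weighting logic stays the same.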

Main Features

EML to MBOX Conversion: A script that converts .eml email messages from a sample folder into a single .mbox file, which is then used for feature extraction, model training, and prediction.
Data Preparation: Scripts that load, preprocess, and clean email datasets to prepare them for model training.
Model Training: Uses TensorFlow to build and train a model that distinguishes phishing emails from legitimate ones.
Prediction: Applies the trained model to new datasets to flag potential phishing attempts, and reports a set of evaluation metrics to gauge performance.
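
The EML to MBOX conversion step can be sketched with Python's standard library alone. This is an illustration under stated assumptions, not the code in email_converter_eml_to_mbox.py: the function name eml_folder_to_mbox and its parameters are hypothetical, and the sketch assumes a single process writes to the mailbox.

```python
import mailbox
from email import message_from_bytes
from pathlib import Path


def eml_folder_to_mbox(eml_dir: str, mbox_path: str) -> int:
    """Append every .eml file in eml_dir to an mbox file.

    Returns the number of messages written. The mbox file is created
    if it does not exist yet.
    """
    box = mailbox.mbox(mbox_path)
    count = 0
    try:
        for eml_file in sorted(Path(eml_dir).glob("*.eml")):
            # Parse from raw bytes so the original encoding is kept.
            message = message_from_bytes(eml_file.read_bytes())
            box.add(message)
            count += 1
        box.flush()
    finally:
        # Single-process sketch: call box.lock() before writing if
        # other programs might access the same mailbox.
        box.close()
    return count
```

Calling eml_folder_to_mbox("samples", "samples.mbox") would then produce one mailbox file that later steps can iterate over with mailbox.mbox; both paths here are placeholders.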

Feature Finders and Detection Strategy

Our phishing detection uses several feature finders, each responsible for extracting a specific element that phishing emails commonly contain:

HTMLFormFinder: Identifies HTML forms within emails, a common phishing vector to solicit user information.
IFrameFinder: Detects the use of IFrames, potentially embedding malicious content invisibly.
FlashFinder: Searches for Flash content links, which could execute harmful scripts.
AttachmentFinder: Counts email attachments, which may contain malicious payloads.
HTMLContentFinder: Looks for specific HTML content indicative of phishing.
URLsFinder: Extracts and evaluates URLs found within emails for malicious links.
ExternalResourcesFinder: Identifies external resources linked within emails that could be harmful.
JavascriptFinder: Detects JavaScript, which can be used in phishing for malicious activities.
CssFinder: Searches for custom CSS that might be used to disguise phishing attempts.
IPsInURLs: Checks for URLs that use a raw IP address instead of a domain name, a technique used to avoid showing a suspicious domain.
AtInURLs: Identifies '@' symbols in URLs, which can be a sign of deceptive links.
EncodingFinder: Analyzes the content encoding for signs of obfuscation or unusual patterns.
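
As an illustration of how finders like these might turn email text into numeric features, here is a minimal sketch covering the ideas behind URLsFinder, IPsInURLs, and AtInURLs. It is not the code in feature_finders.py: the function names, the URL regular expression, and the feature keys are all assumptions made for this example.

```python
import ipaddress
import re
from typing import Dict, List
from urllib.parse import urlsplit

# Simplified URL pattern (an assumption): http(s) links up to the next
# whitespace, quote, or angle bracket.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def extract_urls(text: str) -> List[str]:
    """Return every http(s) URL found in the text."""
    return URL_RE.findall(text)


def has_ip_host(url: str) -> bool:
    """True if the URL's host is a literal IP address, not a domain."""
    try:
        host = urlsplit(url).hostname or ""
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def has_at_sign(url: str) -> bool:
    """True if '@' appears in the authority part of the URL.

    Everything before the '@' is user info, so a link such as
    http://bank.example@evil.example/ really points at evil.example.
    """
    try:
        return "@" in urlsplit(url).netloc
    except ValueError:
        return False


def url_features(text: str) -> Dict[str, int]:
    """Count URLs and the two suspicious URL patterns in one email body."""
    urls = extract_urls(text)
    return {
        "url_count": len(urls),
        "ip_url_count": sum(has_ip_host(u) for u in urls),
        "at_url_count": sum(has_at_sign(u) for u in urls),
    }
```

Returning plain counts keeps each finder independent, so the results can be collected into one row per email before training.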

Project Context

This machine learning component is part of a larger system designed for phishing email detection on Android devices. For more information on the entire project, visit the main repository: Phishing Emails Detection Project.

Getting Started

Follow these instructions to set up the machine learning component of the phishing email detection project on your local machine for development, testing, and contribution purposes.

Prerequisites

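Ensure you have the Python packages listed in pip3_requirements.txt installed, for example by running pip3 install -r pip3_requirements.txt from the repository root.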

Usage

The usage of the scripts is described in more detail in the Main Notebook (main.ipynb).

Authors

martinszuc

Acknowledgments and References

This project builds upon and extends the work found at MachineLearningPhishing by Diego Ocampo.

Data Sources

The data used for training the phishing detection model were sourced from two main repositories, which provided a rich dataset of phishing emails:

Phishing Pot Dataset by rf-peixoto (converted .eml to mbox using scripts in this repo)
Phishing Dataset by jose at monkey.org (downloaded mbox files)

License

This project is licensed under the MIT License - see the LICENSE file for details.
