AnkurRay25/Malware-Detection-Using-Machine-Learning.

Malware Detection Using Machine Learning

Overview

Malware continues to be one of the most significant cybersecurity threats affecting individuals, organizations, and critical infrastructure. Traditional signature-based detection systems often struggle to identify newly emerging malware variants.

This project presents a Machine Learning-based Malware Detection System that classifies executable files as either legitimate or malicious using Portable Executable (PE) file features. Multiple machine learning algorithms are evaluated and compared to identify the most effective approach for malware detection.


Key Features

  • Malware vs Legitimate File Classification
  • PE File Feature Analysis
  • Data Preprocessing and Feature Engineering
  • Class Imbalance Handling using SMOTE
  • Multiple Machine Learning Models
  • Performance Comparison and Evaluation
  • Cybersecurity Analytics

Dataset

The project utilizes a malware analysis dataset containing Portable Executable (PE) file characteristics.

Dataset Features

  • PE Header Information
  • Linker Information
  • File Structure Attributes
  • Executable Metadata
  • Malware Labels

Target Variable

Value   Class
0       Malware
1       Legitimate

Data Preprocessing

The following preprocessing techniques were applied:

  • Missing Value Handling
  • Feature Selection
  • Label Encoding
  • Data Cleaning
  • Class Balancing using SMOTE
  • Feature Scaling using StandardScaler
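
As a rough illustration of the feature scaling step, the sketch below standardizes each column to zero mean and unit variance using only the standard library. It mirrors the z-score formula that StandardScaler applies, but it is not the notebook's code; the function name `standardize_columns` and the two example features are hypothetical.

```python
from statistics import fmean, pstdev


def standardize_columns(rows):
    """Z-score each column: (value - column mean) / column std."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Population standard deviation (ddof=0), the convention StandardScaler uses.
    # A zero spread is replaced by 1.0 so constant columns map to 0
    # instead of causing a division by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - mean) / std for value, mean, std in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical PE-style rows: [number_of_sections, size_of_code]
sample_rows = [[3, 4096.0], [5, 8192.0], [7, 12288.0]]
scaled = standardize_columns(sample_rows)
```

After scaling, both columns have the same spread even though their raw ranges differ by three orders of magnitude, which matters most for the distance-based and margin-based models (KNN, SVM).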

Class Balancing

Synthetic Minority Oversampling Technique (SMOTE) was used to address class imbalance. It generates synthetic samples of the minority class so that the classifiers are not biased toward the majority class.
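
The core idea behind SMOTE can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line segment between them. The code below is a simplified illustration of that interpolation, not the Imbalanced-Learn implementation; the function name `smote_like_oversample` and the toy data are assumptions made for the example.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding the point itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority-class samples with two features each (hypothetical values)
minority_class = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]]
new_points = smote_like_oversample(minority_class, n_new=4)
```

Because every synthetic point lies between two existing minority samples, the new data stays inside the region the minority class already occupies.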


Machine Learning Models

The following classification algorithms were implemented and evaluated:

Random Forest

Ensemble learning model based on multiple decision trees.

XGBoost

Gradient boosting framework optimized for high-performance classification.

Decision Tree

Tree-based classification algorithm for malware detection.

Logistic Regression

Linear classification model for binary prediction.

Support Vector Machine (SVM)

Margin-based classification model for malware classification.

K-Nearest Neighbors (KNN)

Distance-based classification algorithm.
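
To make "distance-based" concrete, here is a minimal KNN sketch in plain Python: a query is assigned the majority label among its k closest training points. It is only an illustration, not the Scikit-Learn classifier used in the project; `knn_predict` and the toy vectors are hypothetical, and the labels follow the encoding in the Target Variable section (0 = Malware, 1 = Legitimate).

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    nearest = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy, already-scaled feature vectors (hypothetical values)
train_x = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25],
           [0.9, 0.8], [0.8, 0.9], [0.85, 0.95]]
train_y = [0, 0, 0, 1, 1, 1]

label_near_malware = knn_predict(train_x, train_y, [0.12, 0.18])
label_near_legit = knn_predict(train_x, train_y, [0.88, 0.9])
```

Since the prediction depends entirely on distances, features with large raw ranges would dominate the vote unless the data is scaled first.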


Experimental Workflow

Dataset Collection
        │
        ▼
Data Cleaning
        │
        ▼
Feature Engineering
        │
        ▼
SMOTE Balancing
        │
        ▼
Feature Scaling
        │
        ▼
Model Training
        │
        ▼
Model Evaluation
        │
        ▼
Performance Comparison

Model Performance

Model                 Accuracy
Random Forest         99.21%
XGBoost               99.17%
Decision Tree         98.64%
KNN                   98.05%
SVM                   96.44%
Logistic Regression   92.37%

Best Performing Model

The Random Forest classifier achieved the highest accuracy, 99.21%, narrowly ahead of XGBoost at 99.17%.


Evaluation Metrics

The models were evaluated using:

  • Accuracy
  • Precision
  • Recall
  • F1-Score
  • Confusion Matrix
  • Classification Report
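
The metrics listed above all derive from the four confusion-matrix counts. The sketch below computes them by hand for a binary problem; it is an illustration rather than the project's evaluation code, and the function name `binary_metrics` and the sample predictions are made up. It assumes Malware (label 0 in this dataset) is treated as the positive class, which has to be chosen explicitly because many tools default to label 1.

```python
def binary_metrics(y_true, y_pred, positive=0):
    """Compute confusion-matrix counts and the metrics derived from them."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 0 = Malware, 1 = Legitimate
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred, positive=0)
```

In this toy example one malware sample is missed (a false negative) and one legitimate file is flagged (a false positive), so precision and recall for the malware class are both 0.75 while accuracy is 0.8.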

Technologies Used

Programming Language

  • Python

Machine Learning

  • Scikit-Learn
  • XGBoost

Data Processing

  • Pandas
  • NumPy

Visualization

  • Matplotlib
  • Seaborn

Imbalanced Learning

  • SMOTE (Imbalanced-Learn)

Results

The experimental results show that machine learning models can distinguish malicious software from legitimate applications using PE file characteristics: every model evaluated exceeded 92% accuracy.

The two ensemble methods, Random Forest (99.21%) and XGBoost (99.17%), ranked highest among the six classifiers compared.


Applications

Cybersecurity

  • Malware Detection Systems
  • Threat Intelligence Platforms
  • Endpoint Security Solutions
  • Security Operations Centers (SOC)

Research

  • Cybersecurity Analytics
  • Malware Classification
  • Machine Learning Security Applications
  • Threat Detection Research

Repository Structure

malware-detection-using-machine-learning/

├── Malware_Detection_Using_Machine_Learning.ipynb
├── README.md
├── requirements.txt
│
├── images/
│   ├── class_distribution.png
│   ├── confusion_matrix.png
│   ├── model_comparison.png
│   └── feature_analysis.png
│
└── dataset/

Requirements

  • Python 3.x
  • Pandas
  • NumPy
  • Scikit-Learn
  • XGBoost
  • Imbalanced-Learn
  • Matplotlib
  • Seaborn

Future Work

  • Deep Learning-based Malware Detection
  • Explainable AI (XAI) for Cybersecurity
  • Real-Time Malware Detection Systems
  • Feature Importance Analysis
  • Zero-Day Malware Detection
  • Cloud-Based Threat Detection

Author

Ankur Ray Chayan

Machine Learning Researcher | Embedded Systems Researcher

Research Interests

  • Artificial Intelligence
  • Machine Learning
  • Cybersecurity
  • Deep Learning
  • Explainable AI
  • Threat Detection

GitHub: https://github.com/AnkurRay25


License

This project is licensed under the MIT License.

