MalwareGuard – Malware Classification Dashboard

MalwareGuard is an XGBoost-based malware classifier with a modern Flask web dashboard. It uses the ClaMP Integrated malware dataset to train a binary classifier that labels samples as Malware or Benign and visualizes the predictions in a clean UI.


Features

  • XGBoost-based malware classification on PE features (ClaMP Integrated dataset)
  • Trainable model with saved artifact (malware_xgb.joblib)
  • Stylish Flask web dashboard:
    • CSV upload
    • Summary stats (total samples, malware vs benign)
    • Top 50 prediction results with probability
  • Encodes categorical fields (e.g. packer_type) consistently between training & inference
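
One way to keep categorical encoding consistent between training and inference is to build the category-to-integer map once at training time, save it with the model, and reuse it on uploaded data. The sketch below is illustrative only: the helper names, the `-1` code for unseen categories, and the sample packer names are assumptions, not taken from the project code.

```python
UNKNOWN_CODE = -1  # assumed sentinel for categories never seen during training


def fit_category_map(values):
    """Assign a stable integer code to each distinct category (sorted order)."""
    return {cat: code for code, cat in enumerate(sorted(set(values)))}


def apply_category_map(values, mapping):
    """Encode values with a saved mapping; unseen categories get UNKNOWN_CODE."""
    return [mapping.get(value, UNKNOWN_CODE) for value in values]


# Hypothetical packer_type values seen at training time.
train_packers = ["UPX", "NoPacker", "ASPack", "UPX"]
packer_map = fit_category_map(train_packers)  # would be saved next to the model

# Hypothetical values from an uploaded CSV, including one unseen packer.
upload_packers = ["UPX", "Themida", "NoPacker"]
encoded = apply_category_map(upload_packers, packer_map)

print(packer_map)  # {'ASPack': 0, 'NoPacker': 1, 'UPX': 2}
print(encoded)     # [2, -1, 1]
```

Reusing the saved map means a given packer name always gets the same integer, whatever order the categories appear in the uploaded file.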

Project Structure

malware-classification-dashboard/
│
├── app.py                     # Flask web app (upload + results pages)
├── train_malware_model.py     # trains XGBoost model and saves malware_xgb.joblib
├── make_test_csv.py           # helper to generate test CSVs from the dataset
├── malware_xgb.joblib         # trained model artifact (generated)
├── requirements.txt           # Python dependencies
├── README.md
│
├── data/
│   ├── ClaMP_Integrated-5184.csv   # main training dataset (from Kaggle)
│   ├── ClaMP_Raw-5184.csv          # optional raw features
│   ├── test_with_labels.csv        # mix of malware/benign with labels (generated)
│   └── test_for_app.csv            # same, but without labels (for UI upload)
│
├── templates/
│   ├── index.html              # upload page
│   └── results.html            # results dashboard
│
└── static/
    └── styles.css              # custom dark-theme styling

Dataset

  • This project uses the ClaMP malware dataset:
    • Kaggle: Classification of Malwares – ClaMP dataset

      You need to download ClaMP_Integrated-5184.csv and place it into the data/ directory.

  • In the code, the path is:
CSV_PATH = "data/ClaMP_Integrated-5184.csv"
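
A quick check that the dataset is in the expected place can save a confusing traceback later. This is a minimal sketch; `dataset_status` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

CSV_PATH = "data/ClaMP_Integrated-5184.csv"


def dataset_status(csv_path=CSV_PATH):
    """Return a short message saying whether the dataset file is in place."""
    path = Path(csv_path)
    if path.is_file():
        return f"found {path} ({path.stat().st_size} bytes)"
    return f"missing {path}: download it from Kaggle and place it in {path.parent}/"


print(dataset_status())
```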

Setup

  1. Clone the repo
git clone https://github.com/Deb-26/Malware-Classification-ML.git
cd Malware-Classification-ML
  2. Install dependencies
pip install -r requirements.txt
  3. Place the dataset

    Download ClaMP_Integrated-5184.csv from Kaggle and put it into:

data/ClaMP_Integrated-5184.csv

Training the model

Run:

python train_malware_model.py

This will:

  • Load data/ClaMP_Integrated-5184.csv
  • Encode categorical columns (e.g. packer_type)
  • Split into train / validation / test sets
  • Train an XGBoost classifier
  • Print metrics (accuracy, ROC-AUC)
  • Save the model + encoding maps to:
malware_xgb.joblib
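
The steps above could look roughly like the following. This is a sketch, not the contents of `train_malware_model.py`. It assumes the label column is named `class` and holds 0/1 values, and that `packer_type` is the only non-numeric column. The 70/15/15 split, the hyperparameters, and the layout of the saved dictionary are illustrative choices.

```python
CSV_PATH = "data/ClaMP_Integrated-5184.csv"
MODEL_PATH = "malware_xgb.joblib"
LABEL_COL = "class"                 # assumed label column name
CATEGORICAL_COLS = ["packer_type"]  # assumed to be the only non-numeric column


def val_fraction_of_remainder(val_frac, test_frac):
    """Validation share of what is left after the test set is split off."""
    return val_frac / (1.0 - test_frac)


def train_and_save(csv_path=CSV_PATH, model_path=MODEL_PATH):
    # Heavy dependencies are imported lazily so the helper above stays usable
    # even when the ML stack is not installed.
    import joblib
    import pandas as pd
    from sklearn.metrics import accuracy_score, roc_auc_score
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    df = pd.read_csv(csv_path)

    # Map each category to an integer and keep the mapping for inference.
    encoders = {}
    for col in CATEGORICAL_COLS:
        categories = sorted(df[col].astype(str).unique())
        encoders[col] = {cat: idx for idx, cat in enumerate(categories)}
        df[col] = df[col].astype(str).map(encoders[col])

    X = df.drop(columns=[LABEL_COL])
    y = df[LABEL_COL]

    # Hold out 15% for testing, then 15% of the total for validation.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.15, stratify=y, random_state=42
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest,
        test_size=val_fraction_of_remainder(0.15, 0.15),
        stratify=y_rest, random_state=42,
    )

    model = XGBClassifier(
        n_estimators=300, max_depth=6, learning_rate=0.1, eval_metric="logloss"
    )
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

    # Evaluate on the untouched test split.
    proba = model.predict_proba(X_test)[:, 1]
    print("accuracy:", accuracy_score(y_test, (proba >= 0.5).astype(int)))
    print("roc_auc:", roc_auc_score(y_test, proba))

    # Save the model together with what inference needs to reproduce the encoding.
    joblib.dump(
        {"model": model, "encoders": encoders, "features": list(X.columns)},
        model_path,
    )
```

Calling `train_and_save()` from the project root would run the whole pipeline. Storing the encoders and feature order in the same artifact lets the web app apply the same preprocessing to uploaded CSVs.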

Creating a test CSV

To generate a balanced test CSV (mixture of malware + benign):

python make_test_csv.py

This creates:

  • data/test_with_labels.csv – still has the class label (for evaluation)
  • data/test_for_app.csv – no label, good for uploading in the web UI
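
A balanced sample with and without labels can be produced with the standard library alone. The sketch below is not the code in `make_test_csv.py`: the `class` label column, the per-class sample size, and the helper names are assumptions.

```python
import csv
import random

LABEL_COL = "class"  # assumed label column name


def balanced_sample(rows, per_class, seed=42):
    """Pick up to per_class rows for each label value, then shuffle."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[LABEL_COL], []).append(row)
    sample = []
    for group in by_label.values():
        sample.extend(rng.sample(group, min(per_class, len(group))))
    rng.shuffle(sample)
    return sample


def strip_label(rows):
    """Return copies of the rows without the label column."""
    return [{k: v for k, v in row.items() if k != LABEL_COL} for row in rows]


def write_csv(path, rows):
    """Write a list of dict rows to path, using the first row's keys as header."""
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def make_test_csvs(src, labeled_out, unlabeled_out, per_class=50):
    """Read src, then write a labeled sample and a label-free copy of it."""
    with open(src, newline="") as fh:
        rows = list(csv.DictReader(fh))
    sample = balanced_sample(rows, per_class)
    write_csv(labeled_out, sample)
    write_csv(unlabeled_out, strip_label(sample))
```

For example, `make_test_csvs("data/ClaMP_Integrated-5184.csv", "data/test_with_labels.csv", "data/test_for_app.csv")` would write both files from the same sample, so predictions on the unlabeled file can be compared row by row against the labeled one.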

Running the web app

Make sure malware_xgb.joblib exists (after training), then start Flask:

python app.py

By default, the app runs at:

http://127.0.0.1:5000/

Flow

  • Open the URL in your browser.
  • On the upload page, select data/test_for_app.csv (or any CSV with the same feature columns).
  • Click “Run Malware Analysis”.
  • You’ll be redirected to the results page, which shows:
    • Total samples
    • Predicted malware count & percentage
    • Predicted benign count
    • Table of up to 50 rows with:
      • filesize
      • packer_type
      • E_file
      • fileinfo
      • malware_probability (%)
      • prediction_label (Malware / Benign)
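
The numbers on the results page can be derived from the model's malware probabilities alone. The sketch below shows one way to do that. The 0.5 decision threshold, taking the first 50 rows (rather than, say, the 50 most suspicious), and the function names are assumptions, not taken from `app.py`.

```python
from itertools import islice

THRESHOLD = 0.5  # assumed decision threshold
MAX_ROWS = 50    # the results table shows up to 50 rows
DISPLAY_COLS = ("filesize", "packer_type", "E_file", "fileinfo")


def summarize(probabilities, threshold=THRESHOLD):
    """Compute the summary-card values from malware probabilities (0..1)."""
    total = len(probabilities)
    malware = sum(p >= threshold for p in probabilities)
    return {
        "total": total,
        "malware": malware,
        "benign": total - malware,
        "malware_pct": round(100.0 * malware / total, 2) if total else 0.0,
    }


def build_table(rows, probabilities, threshold=THRESHOLD, max_rows=MAX_ROWS):
    """Build the rows of the prediction table from feature dicts and probabilities."""
    table = []
    for row, p in islice(zip(rows, probabilities), max_rows):
        entry = {col: row.get(col) for col in DISPLAY_COLS}
        entry["malware_probability"] = round(100.0 * p, 2)
        entry["prediction_label"] = "Malware" if p >= threshold else "Benign"
        table.append(entry)
    return table
```

For instance, `summarize([0.9, 0.2, 0.7, 0.1])` returns a total of 4 with 2 malware, 2 benign, and a malware percentage of 50.0.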

Screenshots

  • Dashboard: [image]
  • Output
    • test_for_app.csv: [image]
    • Results summary cards & Prediction table: [image]
    • test_with_labels.csv: [image]
    • Results summary cards & Prediction table: [image]
    • ClaMP_Integrated-5184.csv: [image]
    • Results summary cards & Prediction table: [image]

Possible Improvements

  • Add download button to export predictions as CSV
  • Color-coded risk levels based on probability
  • API endpoint (/api/predict) that accepts JSON
  • Model comparison (RandomForest vs XGBoost)
  • Dockerfile for containerized deployment
