orhis/malware

Malware Detector

This project is a malware detection framework that uses Portable Executable (PE) metadata and CAPA (Capabilities Analysis) rule matches to classify files as benign or malicious. It trains Random Forest classifiers on these static features to distinguish safe executables from malicious ones.

Features

  • PE Metadata Extraction: Extracts static features from PE headers using pefile.
  • CAPA Capabilities: Integrates flare-capa to identify capabilities (behaviors) within the binary.
  • Data Pipeline: Tools to download samples, extract features in parallel, and merge datasets.
  • Machine Learning: Trains Random Forest models with different feature sets (PE only vs. PE + CAPA).

Installation

  1. Clone the repository (if you haven't already).
  2. Create a virtual environment (recommended):
    python -m venv venv
    source venv/bin/activate  # On Linux/Mac
    # or
    .\venv\Scripts\activate   # On Windows
  3. Install dependencies:
    pip install -r requirements.txt

Project Structure

malware-detector/
├── data/               # Raw samples and generated CSV metadata
│   ├── raw/
│   │   ├── benign/     # Place benign PE files here
│   │   └── malware/    # Place malware PE files here
│   ├── pe_metadata.csv
│   └── capa_metadata.csv
├── models/             # Trained models are saved here
├── src/                # Source code
│   ├── download_samples.py
│   ├── run_pe.py       # Extract PE features
│   ├── run_capa.py     # Extract CAPA features
│   ├── build_dataset.py
│   └── train_supervised.py
├── config.yaml
└── requirements.txt

Usage Workflow

1. Prepare Data

You need a dataset of benign and malware PE files.

  • Benign Files: Place safe .exe or .dll files in data/raw/benign/.
  • Malware Files: Place malicious files in data/raw/malware/.

You can use the provided script to download malware samples. It takes the URL of a ZIP archive and expects the URL structure of a specific dataset source:

python src/download_samples.py <url_to_zip>
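For readers who want to see roughly what such a download step involves, here is a minimal standard-library sketch. It is not the code in download_samples.py: the function names `download_zip` and `extract_samples`, the optional `password` parameter, and the destination directory are all assumptions made for illustration.

```python
import io
import urllib.request
import zipfile
from pathlib import Path


def download_zip(url: str, timeout: float = 60.0) -> bytes:
    """Fetch the archive at `url` and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def extract_samples(zip_bytes: bytes, dest: Path, password: bytes = None) -> list:
    """Unpack an in-memory ZIP into `dest` and return the extracted file names."""
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        names = [info.filename for info in zf.infolist() if not info.is_dir()]
        # `password` is only needed for encrypted archives (an assumption here).
        zf.extractall(dest, pwd=password)
    return names
```

A call such as `extract_samples(download_zip(url), Path("data/raw/malware"))` would then place the samples where the later steps look for them. Handle downloaded malware only inside an isolated environment.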

2. Extract Features

The training process requires two CSV files: pe_metadata.csv and capa_metadata.csv.
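Because training consumes both files, their rows have to be matched per sample at some point. The following is a minimal sketch of such a join using only the standard library. It is not the logic of build_dataset.py: the key column name `sha256` and the function name `merge_on_key` are hypothetical, and the real CSV schemas may differ.

```python
import csv
from pathlib import Path


def merge_on_key(pe_csv: Path, capa_csv: Path, out_csv: Path, key: str = "sha256") -> int:
    """Inner-join two CSV files on `key`; return the number of rows written."""
    with pe_csv.open(newline="") as fh:
        pe_reader = csv.DictReader(fh)
        pe_cols = list(pe_reader.fieldnames or [])
        pe_rows = list(pe_reader)

    with capa_csv.open(newline="") as fh:
        capa_reader = csv.DictReader(fh)
        # Keep only CAPA columns that the PE file does not already provide.
        capa_cols = [c for c in (capa_reader.fieldnames or []) if c not in pe_cols]
        capa_by_key = {row[key]: row for row in capa_reader}

    written = 0
    with out_csv.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=pe_cols + capa_cols)
        writer.writeheader()
        for row in pe_rows:
            match = capa_by_key.get(row[key])
            if match is None:
                continue  # sample has no CAPA row, so it is skipped
            writer.writerow({**row, **{c: match[c] for c in capa_cols}})
            written += 1
    return written
```

An inner join drops samples that appear in only one file, which keeps the feature matrix free of missing values at the cost of discarding partially processed samples.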

Extract PE Metadata: This script analyzes headers of files in data/raw/benign and data/raw/malware.

python src/run_pe.py

Output: data/pe_metadata.csv
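To give a sense of the kind of header fields involved, here is a minimal sketch that reads the COFF file header of a PE image with the standard library's struct module. It is not the project's extractor, which uses pefile; the function name `read_coff_header` and the choice of fields are assumptions for illustration.

```python
import struct


def read_coff_header(data: bytes) -> dict:
    """Parse the COFF file header of a PE image held in memory."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")
    # The 4-byte field at offset 0x3C (e_lfanew) points at the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file: missing PE signature")
    # The 20-byte COFF header follows the signature directly.
    (machine, n_sections, timestamp, _symtab, _nsyms,
     opt_size, characteristics) = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    return {
        "machine": machine,
        "number_of_sections": n_sections,
        "time_date_stamp": timestamp,
        "size_of_optional_header": opt_size,
        "characteristics": characteristics,
    }
```

Calling `read_coff_header(Path("sample.exe").read_bytes())` on a real file returns a small dictionary that could become one row of a feature table; a library such as pefile exposes many more fields (sections, imports, optional header) than this sketch.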

Extract CAPA Capabilities: This script runs flare-capa against the samples. Because this step can be slow, the script supports parallel processing and can resume an interrupted run.

python src/run_capa.py

Output: data/capa_metadata.csv
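One common way to combine parallelism with resumability is to skip samples that already have a row in the output CSV and append new rows as results arrive. The sketch below illustrates that pattern and is not the code in run_capa.py: the column names `sample` and `capabilities`, the function names, and the `analyze` callable (standing in for whatever invokes capa) are all assumptions.

```python
import csv
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def load_done(csv_path: Path) -> set:
    """Return the sample names already recorded in the output CSV."""
    if not csv_path.exists():
        return set()
    with csv_path.open(newline="") as fh:
        return {row["sample"] for row in csv.DictReader(fh)}


def process_all(samples, analyze, csv_path: Path, workers: int = 4) -> int:
    """Analyze samples missing from csv_path and append one row per result."""
    done = load_done(csv_path)
    pending = [s for s in samples if Path(s).name not in done]
    needs_header = not csv_path.exists() or csv_path.stat().st_size == 0
    with csv_path.open("a", newline="") as fh, ThreadPoolExecutor(workers) as pool:
        writer = csv.writer(fh)
        if needs_header:
            writer.writerow(["sample", "capabilities"])
        # Results are written from the main thread, so no lock is needed.
        for sample, caps in zip(pending, pool.map(analyze, pending)):
            writer.writerow([Path(sample).name, ";".join(caps)])
            fh.flush()  # keep finished rows on disk if the run is interrupted
    return len(pending)
```

Threads are adequate when `analyze` spends its time waiting on an external process; for CPU-bound analysis inside the Python process, a ProcessPoolExecutor would be the more suitable choice.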

3. Train Models

Run the training script to build and evaluate the models. The script runs three scenarios:

  1. PE Only: Uses only header metadata.
  2. PE + All CAPA: Uses headers and all found CAPA rules.
  3. PE + Filtered CAPA: Uses headers and CAPA rules filtered by frequency.

python src/train_supervised.py
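The README does not say how the frequency filter in the third scenario works. One plausible reading, sketched below, is to drop rules that match almost no samples or almost all of them, since such columns carry little signal. The thresholds and the function names `filter_rules_by_frequency` and `to_binary_columns` are assumptions, not the project's actual settings.

```python
from collections import Counter


def filter_rules_by_frequency(samples: dict, min_fraction: float = 0.01,
                              max_fraction: float = 0.99) -> list:
    """Keep rules matched by a share of samples within the given bounds."""
    # Count each rule at most once per sample.
    counts = Counter(rule for rules in samples.values() for rule in set(rules))
    total = len(samples)
    return sorted(
        rule for rule, n in counts.items()
        if min_fraction <= n / total <= max_fraction
    )


def to_binary_columns(samples: dict, rules: list) -> dict:
    """Encode each sample as a 0/1 vector over the kept rules."""
    return {
        name: [1 if rule in set(found) else 0 for rule in rules]
        for name, found in samples.items()
    }
```

Here `samples` maps a sample name to the list of CAPA rule names found in it; the resulting vectors can be appended to the PE header features before training.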

Results (Classification Report, Confusion Matrix, ROC AUC) will be printed to the console, and models will be saved in the models/ directory:

  • models/pe/
  • models/capa_all/
  • models/capa_filtered/
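For readers unfamiliar with the reported metrics, the following sketch computes a binary confusion matrix and ROC AUC from scratch. It is illustrative only and assumes labels where 1 means malware and 0 means benign; the training script presumably relies on a machine learning library for these figures rather than on code like this.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 means malware."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def roc_auc(y_true, scores):
    """ROC AUC as the chance that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise formulation is quadratic in the number of samples, which is fine for a quick check but slower than the sorting-based computation that libraries use.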

About

Projekt max Malware
