Malware Detector

Malware Detector classifies Windows Portable Executable (PE) files as benign or malicious. It combines static PE header metadata with capabilities detected by CAPA (Capabilities Analysis) rules and trains Random Forest classifiers on those features.

Features

PE Metadata Extraction: Extracts static features from PE headers using pefile.
CAPA Capabilities: Integrates flare-capa to identify capabilities (behaviors) within the binary.
Data Pipeline: Provides tools to download samples, extract features in parallel, and merge datasets.
Machine Learning: Trains Random Forest models on different feature sets (PE only vs. PE + CAPA).

Installation

Clone the repository (if you haven't already).

Create a virtual environment (recommended):

python -m venv venv
source venv/bin/activate  # On Linux/Mac
# or
.\venv\Scripts\activate   # On Windows

Install dependencies:
```
pip install -r requirements.txt
```

Project Structure

malware-detector/
├── data/               # Raw samples and generated CSV metadata
│   ├── raw/
│   │   ├── benign/     # Place benign PE files here
│   │   └── malware/    # Place malware PE files here
│   ├── pe_metadata.csv
│   └── capa_metadata.csv
├── models/             # Trained models are saved here
├── src/                # Source code
│   ├── download_samples.py
│   ├── run_pe.py       # Extract PE features
│   ├── run_capa.py     # Extract CAPA features
│   ├── build_dataset.py
│   └── train_supervised.py
├── config.yaml
└── requirements.txt

Usage Workflow

1. Prepare Data

You need a dataset of benign and malware PE files.

Benign Files: Place safe .exe or .dll files in data/raw/benign/.
Malware Files: Place malicious files in data/raw/malware/.
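
To sanity-check the folder layout before extracting features, a minimal sketch like the one below can list every sample with a label derived from its parent directory. The function names and the 0/1 label encoding are assumptions for illustration and are not taken from the project's scripts.

```python
from pathlib import Path

# Directory names follow the layout described above; the 0/1 label
# encoding is an assumption made for this sketch.
LABELS = {"benign": 0, "malware": 1}


def list_samples(raw_dir):
    """Return (path, label) pairs for every file under raw_dir/<class>/."""
    samples = []
    for class_name, label in LABELS.items():
        class_dir = Path(raw_dir) / class_name
        if not class_dir.is_dir():
            continue  # tolerate a class folder that has not been created yet
        for path in sorted(class_dir.rglob("*")):
            if path.is_file():
                samples.append((path, label))
    return samples


def count_by_label(samples):
    """Count how many samples carry each label."""
    counts = {label: 0 for label in LABELS.values()}
    for _, label in samples:
        counts[label] += 1
    return counts
```

Calling `count_by_label(list_samples("data/raw"))` gives a quick view of class balance before any long-running extraction starts.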

Alternatively, use the provided script to download malware samples from a ZIP archive. The script expects a particular URL structure tied to a specific dataset source:

python src/download_samples.py <url_to_zip>
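
The internals of download_samples.py are not documented here. As a rough sketch of the general idea (fetch a ZIP archive, unpack it into a sample folder), something like the following would work with the standard library alone. The function name and the optional `password` parameter are hypothetical; many malware archives are password-protected, but whether this dataset's are is not stated.

```python
import io
import urllib.request
import zipfile
from pathlib import Path


def fetch_and_unpack(url, dest_dir, password=None):
    """Download a ZIP archive from url and extract it into dest_dir.

    password, if given, must be bytes (hypothetical parameter).
    Returns the names of the extracted files.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response:
        payload = response.read()  # whole archive in memory; fine for a sketch
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        names = [n for n in archive.namelist() if not n.endswith("/")]
        archive.extractall(dest, pwd=password)
    return names
```

Only unpack malware archives inside an isolated analysis environment.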

2. Extract Features

The training process requires two CSV files: pe_metadata.csv and capa_metadata.csv.

Extract PE Metadata: This script analyzes headers of files in data/raw/benign and data/raw/malware.

python src/run_pe.py

Output: data/pe_metadata.csv
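
To illustrate what "PE header metadata" means, the sketch below reads a few COFF file-header fields with the standard library's struct module. This is a stand-in for illustration, not the project's extraction code: run_pe.py uses pefile, which exposes the same values as attributes such as `pe.FILE_HEADER.NumberOfSections`. The function name and the chosen fields are assumptions.

```python
import struct


def read_coff_header(data):
    """Parse a few COFF file-header fields from the raw bytes of a PE file."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("missing DOS 'MZ' signature")
    # The 4-byte little-endian value at 0x3C (e_lfanew) is the offset
    # of the 'PE\0\0' signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # The 20-byte COFF file header follows the signature directly.
    (machine, n_sections, timestamp, _symtab, _nsyms,
     opt_size, characteristics) = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    return {
        "machine": machine,
        "number_of_sections": n_sections,
        "time_date_stamp": timestamp,
        "size_of_optional_header": opt_size,
        "characteristics": characteristics,
    }
```

Each returned dictionary corresponds to one row of numeric features; the actual columns in pe_metadata.csv may differ.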

Extract CAPA Capabilities: This script runs flare-capa against the samples. This process can be time-consuming, so it supports parallel processing and resuming.

python src/run_capa.py

Output: data/capa_metadata.csv
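
The parallel-and-resume behavior described above can be sketched as follows: read the samples already recorded in the output CSV, skip them, and append a row as each remaining sample finishes. This is a minimal sketch under assumptions, not the code in run_capa.py. The column names, the thread pool, and the `analyze` callable (a placeholder for whatever invokes flare-capa) are all hypothetical.

```python
import csv
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

FIELDS = ["sample", "capabilities"]  # hypothetical column names


def load_done(csv_path):
    """Return the set of sample names already present in the output CSV."""
    path = Path(csv_path)
    if not path.exists():
        return set()
    with path.open(newline="") as handle:
        return {row["sample"] for row in csv.DictReader(handle)}


def run_resumable(samples, analyze, csv_path, workers=4):
    """Analyze samples missing from csv_path and append one row per sample.

    analyze is a hypothetical callable: sample name -> result string.
    Returns the number of samples processed in this run.
    """
    done = load_done(csv_path)
    pending = [s for s in samples if s not in done]
    path = Path(csv_path)
    needs_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if needs_header:
            writer.writeheader()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = {pool.submit(analyze, s): s for s in pending}
            for future in as_completed(futures):
                writer.writerow({"sample": futures[future],
                                 "capabilities": future.result()})
                handle.flush()  # persist progress so an interrupted run can resume
    return len(pending)
```

Only the main thread writes to the file, so rows are never interleaved; threads are adequate here on the assumption that the heavy work happens in an external process.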

3. Train Models

Run the training script to build and evaluate the models. The script runs three scenarios:

PE Only: Uses only header metadata.
PE + All CAPA: Uses headers and all found CAPA rules.
PE + Filtered CAPA: Uses headers and CAPA rules filtered by frequency.

python src/train_supervised.py
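
The README does not state how the "Filtered CAPA" scenario selects rules. One plausible reading, sketched below, keeps rules that match at least a minimum number of samples (and optionally drops rules that match nearly all of them), then encodes each sample as a 0/1 vector. The function names, thresholds, and rule names in the usage note are illustrative assumptions, not values from train_supervised.py.

```python
from collections import Counter


def filter_rules_by_frequency(sample_rules, min_count=2, max_fraction=1.0):
    """Return the sorted rules matched by at least min_count samples and by
    at most max_fraction of all samples.

    sample_rules maps a sample name to the list of rule names it matched.
    """
    # Count each rule once per sample, even if it is listed twice.
    counts = Counter(rule for rules in sample_rules.values() for rule in set(rules))
    limit = max_fraction * len(sample_rules)
    return sorted(rule for rule, n in counts.items() if min_count <= n <= limit)


def to_binary_rows(sample_rules, kept_rules):
    """Encode each sample as a list of 0/1 flags, one per kept rule."""
    rows = {}
    for sample, rules in sample_rules.items():
        matched = set(rules)
        rows[sample] = [int(rule in matched) for rule in kept_rules]
    return rows
```

For example, given three samples where only "write file" appears in more than one, `filter_rules_by_frequency(..., min_count=2)` keeps that single rule; rare rules that would add mostly-zero columns are dropped.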

The classification report, confusion matrix, and ROC AUC are printed to the console, and the trained models are saved in the models/ directory:

models/pe/
models/capa_all/
models/capa_filtered/
