This project is a malware detection framework that utilizes Portable Executable (PE) metadata and CAPA (Capabilities Analysis) rules to classify files as benign or malware. It employs machine learning techniques, specifically Random Forest classifiers, to distinguish between safe and malicious executables.
- PE Metadata Extraction: Extracts static features from PE headers using pefile.
- CAPA Capabilities: Integrates flare-capa to identify capabilities (behaviors) within the binary.
- Data Pipeline: Tools to download samples, extract features in parallel, and merge datasets.
- Machine Learning: Trains Random Forest models with different feature sets (PE only vs. PE + CAPA).
- Clone the repository (if you haven't already).
- Create a virtual environment (recommended):
python -m venv venv
source venv/bin/activate    # On Linux/Mac
# or
.\venv\Scripts\activate     # On Windows
- Install dependencies:
pip install -r requirements.txt
malware-detector/
├── data/ # Raw samples and generated CSV metadata
│ ├── raw/
│ │ ├── benign/ # Place benign PE files here
│ │ └── malware/ # Place malware PE files here
│ ├── pe_metadata.csv
│ └── capa_metadata.csv
├── models/ # Trained models are saved here
├── src/ # Source code
│ ├── download_samples.py
│ ├── run_pe.py # Extract PE features
│ ├── run_capa.py # Extract CAPA features
│ ├── build_dataset.py
│ └── train_supervised.py
├── config.yaml
└── requirements.txt
You need a dataset of benign and malware PE files.
- Benign Files: Place safe .exe or .dll files in data/raw/benign/.
- Malware Files: Place malicious files in data/raw/malware/.
You can use the provided script to download malware samples (requires a specific URL structure, typically for a specific dataset source):
python src/download_samples.py <url_to_zip>

The training process requires two CSV files: pe_metadata.csv and capa_metadata.csv.
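Because training needs both CSV files, their rows have to be joined per sample at some point. The following is a minimal sketch of such a join using only the standard library. It assumes both files share a hypothetical `sample` column and have no other column names in common; the function name `merge_on_sample` is illustrative and is not taken from build_dataset.py.

```python
import csv


def merge_on_sample(pe_csv, capa_csv, out_csv, key="sample"):
    """Inner-join two CSV files on a shared key column (assumed name: 'sample')."""
    # Load the CAPA rows into memory, indexed by the key column.
    with open(capa_csv, newline="") as fh:
        reader = csv.DictReader(fh)
        capa_fields = [f for f in (reader.fieldnames or []) if f != key]
        capa_rows = {row[key]: row for row in reader}

    merged = 0
    with open(pe_csv, newline="") as src, open(out_csv, "w", newline="") as dst:
        reader = csv.DictReader(src)
        # Assumption: the two files have no overlapping columns besides the key.
        fields = list(reader.fieldnames or []) + capa_fields
        writer = csv.DictWriter(dst, fieldnames=fields)
        writer.writeheader()
        for row in reader:
            extra = capa_rows.get(row[key])
            if extra is None:
                continue  # inner join: skip samples without CAPA output
            row.update({f: extra[f] for f in capa_fields})
            writer.writerow(row)
            merged += 1
    return merged
```

An inner join is used here so that every merged row has both feature groups; a left join that keeps PE-only rows would be an equally plausible choice.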
Extract PE Metadata:
This script analyzes headers of files in data/raw/benign and data/raw/malware.
python src/run_pe.py

Output: data/pe_metadata.csv
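To give a sense of what "static features from PE headers" means, here is a small sketch that reads the COFF file header directly with the standard `struct` module. This is a stand-in for illustration only: the project itself uses pefile, and the choice of fields and the function name `read_coff_header` are assumptions, not the contents of run_pe.py.

```python
import struct

COFF_FORMAT = "<HHIIIHH"  # 20-byte COFF file header, little-endian


def read_coff_header(data: bytes) -> dict:
    """Return a few COFF header fields from the raw bytes of a PE file."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # The DOS header field e_lfanew (offset 0x3C) points to the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    if len(data) < pe_offset + 4 + struct.calcsize(COFF_FORMAT):
        raise ValueError("truncated COFF header")
    (machine, n_sections, timestamp, _symtab, _nsyms,
     opt_size, characteristics) = struct.unpack_from(COFF_FORMAT, data, pe_offset + 4)
    return {
        "machine": machine,
        "number_of_sections": n_sections,
        "time_date_stamp": timestamp,
        "size_of_optional_header": opt_size,
        "characteristics": characteristics,
    }
```

Calling `read_coff_header(open(path, "rb").read())` on each sample would yield one row of numeric features per file; pefile exposes the same fields along with the optional header and section table.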
Extract CAPA Capabilities:
This script runs flare-capa against the samples. This process can be time-consuming, so it supports parallel processing and resuming.
python src/run_capa.py

Output: data/capa_metadata.csv
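The combination of parallel processing and resuming can be sketched roughly as follows. This is not the code in run_capa.py: the thread pool, the `sample` and `capabilities` column names, and the `analyze` callable (which would wrap the actual flare-capa invocation) are all assumptions made for illustration.

```python
import csv
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def load_done(out_csv):
    """Return the set of samples already recorded in the output CSV."""
    path = Path(out_csv)
    if not path.exists():
        return set()
    with path.open(newline="") as fh:
        return {row["sample"] for row in csv.DictReader(fh)}


def run_resumable(samples, analyze, out_csv, workers=4):
    """Analyze only samples missing from out_csv; return how many were processed."""
    done = load_done(out_csv)
    todo = [s for s in samples if s not in done]
    path = Path(out_csv)
    write_header = not path.exists() or path.stat().st_size == 0
    with open(out_csv, "a", newline="") as fh, \
            ThreadPoolExecutor(max_workers=workers) as pool:
        writer = csv.writer(fh)
        if write_header:
            writer.writerow(["sample", "capabilities"])
        futures = {pool.submit(analyze, s): s for s in todo}
        for fut in as_completed(futures):
            # Rows are written from the main thread and flushed immediately,
            # so an interrupted run keeps everything finished so far.
            writer.writerow([futures[fut], ";".join(sorted(fut.result()))])
            fh.flush()
    return len(todo)
```

Appending one row per finished sample is what makes the run resumable: restarting simply skips every sample already present in the CSV. An exception raised by `analyze` stops the run in this sketch; a real pipeline would probably log the failure and continue.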
Run the training script to build and evaluate the models. The script runs three scenarios:
- PE Only: Uses only header metadata.
- PE + All CAPA: Uses headers and all found CAPA rules.
- PE + Filtered CAPA: Uses headers and CAPA rules filtered by frequency.
python src/train_supervised.py

Results (Classification Report, Confusion Matrix, ROC AUC) will be printed to the console, and models will be saved in the models/ directory:
- models/pe/
- models/capa_all/
- models/capa_filtered/
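For the "PE + Filtered CAPA" scenario, filtering rules by frequency could look something like the sketch below. The README does not state the actual criterion, so the minimum-fraction threshold, its default value, and the function names are assumptions.

```python
from collections import Counter


def frequent_rules(rule_sets, min_fraction=0.05):
    """Keep rules that match at least min_fraction of all samples."""
    # Count each rule at most once per sample.
    counts = Counter(rule for rules in rule_sets for rule in set(rules))
    cutoff = min_fraction * len(rule_sets)
    return sorted(rule for rule, n in counts.items() if n >= cutoff)


def to_binary_rows(rule_sets, kept):
    """Encode each sample as a 0/1 vector over the kept rules."""
    return [[1 if rule in rules else 0 for rule in kept] for rules in rule_sets]
```

Dropping very rare rules shrinks the feature matrix and removes columns that are almost always zero, which is one common motivation for this kind of filter. The resulting 0/1 columns can be appended to the PE header features before training.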