ivanCipriano/Malware_Detection_Project_Gr06

🛡️ Malware Detection Pipeline

A research-grade pipeline for PE malware detection combining classical ML and deep learning, with adversarial robustness evaluation via the GAMMA attack framework.


🔍 Overview

This project implements an end-to-end malware detection system for Windows PE files, developed as part of a graduate-level AI for Cybersecurity course. The pipeline supports two classification paradigms and evaluates their adversarial robustness under a realistic black-box attack.

Core capabilities:

  • Feature-based classification with XGBoost on EMBER v2 features (2381-dim feature vectors, reduced to 2358 after leakage removal)
  • Raw-bytes classification with MalConv2 (CNN-based, operates directly on PE byte sequences)
  • Adversarial robustness evaluation using GAMMA (Genetic Adversarial Machine learning Malware Attack) — a black-box section injection attack with optional parallel grid execution
  • SHAP-based interpretability for tree-based models, including adversarial SHAP analysis comparing evaded vs. non-evaded samples
  • Multi-model comparison via Pareto frontier analysis (F1 vs. evasion rate)
  • Security Evaluation Curves (SEC) across a full parameter grid (λ × query budget)
  • Batch experiment execution for automated sequential runs across multiple configurations

The pipeline is designed around clean software engineering principles: a Strategy pattern for model interoperability, a single-source-of-truth feature configuration, and a fully config-driven experiment loop.


🏗️ Architecture

ExperimentPipeline (Template Method)
│
├── ClassifierFactory              → Strategy selection (XGBoost | MalConv2)
│   ├── XGBoostStrategy            → EMBER features + early stopping
│   └── MalConv2Strategy           → Raw bytes + pretrained weights + layer freezing
│
├── EmberFeatureLoader             → .npy loading + leakage removal (SSOT)
├── PEFileListBuilder              → Raw PE path lists for MalConv2
│
├── MetricsCalculator              → Accuracy / Precision / Recall / F1
├── PlotGenerator                  → Confusion matrices + metric comparison
├── ShapExplainer                  → TreeExplainer + 6 SHAP plot types
│
├── GAMMA Pipeline
│   ├── SampleSelector             → Filters correctly-detected malware (≥100)
│   ├── GammaOrchestrator          → λ × query_budget grid (sequential or parallel)
│   │   ├── GammaSectionInjection  (ml-pentest library)
│   │   └── _gamma_worker          → Per-process model reconstruction for parallel runs
│   ├── XGBoostModelWrapper        → EMBER adapter for GAMMA interface
│   └── SECPlotter                 → Security Evaluation Curves + heatmaps
│
├── ModelComparator                → Multi-model Pareto analysis + ranking
│   ├── plot_pareto                → F1 vs. mean evasion rate scatter
│   ├── plot_comparison_heatmap    → Evasion rate per model × GAMMA config
│   ├── plot_sec_bands             → SEC with min/max bands, Pareto models highlighted
│   ├── plot_sec_detail            → Attack size SEC for selected model
│   └── plot_ml_vs_dl              → Side-by-side ML vs DL final comparison
│
├── AdversarialShapAnalyzer        → SHAP waterfall: original vs. adversarial pairs
│
└── BatchRunner                    → Sequential multi-config experiment execution

The two classification paradigms diverge only at data loading and prediction; the rest of the pipeline is fully shared via the BaseClassifier interface.
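
A minimal sketch of how such a shared interface might look. Only the names `BaseClassifier` and `ClassifierFactory` come from the diagram above; the method names (`fit`, `predict_proba`), the `ConstantStrategy` stand-in, and the registry layout are illustrative assumptions, not the repository's actual code.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Sequence


class BaseClassifier(ABC):
    """Abstract strategy: the pipeline only talks to this interface."""

    @abstractmethod
    def fit(self, train_data: Sequence, labels: Sequence[int]) -> None:
        """Train the underlying model."""

    @abstractmethod
    def predict_proba(self, data: Sequence) -> List[float]:
        """Return one malware probability per sample."""


class ConstantStrategy(BaseClassifier):
    """Hypothetical stand-in for XGBoostStrategy / MalConv2Strategy."""

    def __init__(self, score: float) -> None:
        self.score = score

    def fit(self, train_data: Sequence, labels: Sequence[int]) -> None:
        pass  # a real strategy would train a model here

    def predict_proba(self, data: Sequence) -> List[float]:
        return [self.score for _ in data]


class ClassifierFactory:
    """Maps the `model_type` config key to a concrete strategy."""

    _registry: Dict[str, Callable[[], BaseClassifier]] = {
        "xgboost": lambda: ConstantStrategy(0.9),   # placeholder model
        "malconv2": lambda: ConstantStrategy(0.1),  # placeholder model
    }

    @classmethod
    def create(cls, model_type: str) -> BaseClassifier:
        try:
            return cls._registry[model_type]()
        except KeyError:
            raise ValueError(f"Unknown model_type: {model_type!r}") from None


def evaluate(clf: BaseClassifier, samples: Sequence, threshold: float = 0.5) -> List[int]:
    """Shared pipeline step: identical code for every strategy."""
    return [int(p >= threshold) for p in clf.predict_proba(samples)]
```

Because `evaluate` depends only on the abstract interface, adding a third model family would only require a new strategy class and a registry entry.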


📁 Project Structure

malware-detection/
│
├── config/
│   ├── experiment_config.yaml         # Default single-run configuration
│   └── batch/
│       ├── config_esperimento_1.yaml  # Batch config — experiment 1
│       └── config_esperimento_2.yaml  # Batch config — experiment 2
│
├── src/
│   ├── config/
│   │   └── feature_config.py          # SSOT for EMBER feature groups and leakage indices
│   │
│   ├── data/
│   │   └── dataset_loader.py          # EmberFeatureLoader + PEFileListBuilder
│   │
│   ├── models/
│   │   ├── base_model.py              # Abstract Strategy interface
│   │   ├── classifier_factory.py      # Factory Method for model instantiation
│   │   ├── xgboost_strategy.py        # XGBoost concrete strategy
│   │   └── malconv2_strategy.py       # MalConv2 concrete strategy
│   │
│   ├── evaluation/
│   │   ├── metrics.py                 # Classification metrics
│   │   ├── plot_generator.py          # Confusion matrices + comparison charts
│   │   ├── shap_explainer.py          # SHAP plots for tree-based models
│   │   ├── adversarial_shap.py        # SHAP analysis on GAMMA-evaded samples
│   │   ├── sec_plotter.py             # Security Evaluation Curves
│   │   └── model_comparator.py        # Multi-model Pareto + SEC comparison
│   │
│   ├── adversarial/
│   │   ├── wrappers.py                # XGBoostModelWrapper (GAMMA adapter)
│   │   ├── sample_selector.py         # Malware selection for GAMMA
│   │   └── gamma_orchestrator.py      # Grid execution (sequential + parallel)
│   │
│   └── pipeline/
│       └── experiment_pipeline.py     # Template Method — full experiment orchestration
│
├── extract_features.py                # Standalone EMBER feature extraction script
├── batch_runner.py                    # Multi-config sequential experiment runner
└── main.py                            # Entry point

🤖 Models

XGBoost (Feature-based)

Operates on EMBER v2 features extracted from PE files — a rich 2381-dimensional representation covering byte histograms, entropy, string statistics, PE header fields, section info, imports, exports, and data directories.

Data leakage removal: Three feature groups known to introduce dataset-specific bias are removed before training:

  • TimeDateStamp (index 626) — compilation epoch correlation
  • COFF Machine type (indices 627–637) — architecture distribution skew between SoReL-20M malware (x86) and modern benign files (x64)
  • Optional Header Subsystem (indices 647–657) — same distribution artifact

This removal is enforced via a single source of truth in feature_config.py, applied consistently across training, inference, SHAP, and the GAMMA adapter.
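
The index arithmetic can be checked with a short sketch. The ranges below are the ones listed above; the constant and function names are hypothetical and do not claim to match `feature_config.py`.

```python
from typing import List

N_EMBER_FEATURES = 2381

# Inclusive index ranges taken from the list above.
LEAKAGE_RANGES = [
    (626, 626),  # TimeDateStamp
    (627, 637),  # COFF Machine type
    (647, 657),  # Optional Header Subsystem
]


def leakage_indices() -> List[int]:
    """Expand the inclusive ranges into a flat, sorted index list."""
    return sorted(i for lo, hi in LEAKAGE_RANGES for i in range(lo, hi + 1))


def kept_indices() -> List[int]:
    """Column indices that survive leakage removal."""
    dropped = set(leakage_indices())
    return [i for i in range(N_EMBER_FEATURES) if i not in dropped]


def remove_leakage(row: List[float]) -> List[float]:
    """Apply the column filter to a single feature vector."""
    if len(row) != N_EMBER_FEATURES:
        raise ValueError(f"expected {N_EMBER_FEATURES} features, got {len(row)}")
    return [row[i] for i in kept_indices()]
```

The three groups cover 1 + 11 + 11 = 23 columns, which is consistent with the 2381 → 2358 reduction stated in the overview. On a NumPy matrix `X`, the same filter could be applied as `X[:, kept_indices()]`.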

| Parameter | Value |
| --- | --- |
| n_estimators | 400 (+ early stopping) |
| max_depth | 9 |
| learning_rate | 0.2 |
| subsample | 0.8 |
| colsample_bytree | 0.8 |
| early_stopping_rounds | 20 |

MalConv2 (Raw bytes)

A gated CNN that operates directly on the raw byte sequence of a PE file (up to 1 MiB, i.e. 1,048,576 bytes). No feature engineering is required: the model learns its own byte-level representations.

Supports loading pre-trained weights from ml-pentest. When load_pretrained=True, embedding and convolutional layers are frozen and only the classifier head is fine-tuned, avoiding catastrophic forgetting on smaller datasets.

| Parameter | Value |
| --- | --- |
| channels | 128 |
| window_size / stride | 500 |
| embd_size | 8 |
| max_len | 1,048,576 bytes |
| learning_rate | 5e-4 |
| patience | 12 |

⚔️ Adversarial Evaluation (GAMMA)

GAMMA (Genetic Adversarial Machine learning Malware Attack) is a black-box attack that injects benign PE sections into malware files to evade detection, using a genetic algorithm to optimize the injection.

How it works

  1. Section population extraction: benign PE sections (.data, .rdata, .idata, .rodata) are extracted from VirusShare benign files; memoryview objects are sanitized for pickle compatibility before parallel dispatch
  2. Sample selection: at least 100 malware samples that the target model detects correctly are selected from the test set; if the selected_malware/ directory already exists and is non-empty, selection is skipped
  3. Grid execution: sequential (default) or parallel via ProcessPoolExecutor with the spawn context; each worker independently reconstructs the model from the saved checkpoint
  4. Results aggregation: evasion rates, confidence scores, injected bytes, elapsed time, and stagnation flags are collected per run and saved to gamma_summary.csv

Experimental grid

| Parameter | Values |
| --- | --- |
| lambda (penalty weight) | 1e-2, 1e-4, 1e-6, 1e-8 |
| query_budget | 20, 60, 120, 300 |
| Total runs | 16 (4 × 4) |

lambda controls the trade-off between evasion success and injection size. Lower values allow larger injections, typically yielding higher evasion rates at the cost of more bytes added to the file.
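
The trade-off can be illustrated with a penalized objective of the form "confidence + λ × injected bytes", which is how the GAMMA attack is usually described. The sketch below is an illustration under that assumption, not the ml-pentest implementation; the function names are hypothetical and the two candidates are toy values chosen only to show the effect.

```python
from typing import List, Tuple

# (classifier confidence after injection, injected bytes)
Candidate = Tuple[float, int]


def penalized_score(confidence: float, injected_bytes: int, lam: float) -> float:
    """Lower is better: malware confidence plus a size penalty weighted by lambda."""
    return confidence + lam * injected_bytes


def best_candidate(candidates: List[Candidate], lam: float) -> Candidate:
    """Pick the candidate the optimizer would prefer for a given lambda."""
    return min(candidates, key=lambda c: penalized_score(c[0], c[1], lam))


# Toy values for illustration only (not measured results).
toy_candidates: List[Candidate] = [
    (0.90, 1_000),    # small injection, still detected
    (0.40, 200_000),  # large injection, evades (confidence < 0.5)
]

large_penalty_choice = best_candidate(toy_candidates, lam=1e-2)
small_penalty_choice = best_candidate(toy_candidates, lam=1e-8)
```

With `lam=1e-2` the size penalty dominates and the small, still-detected candidate wins; with `lam=1e-8` the penalty is negligible and the large, evading candidate wins.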

Parallel execution

Set n_workers > 1 in the config to distribute grid runs across multiple processes. Each worker rebuilds the model from its checkpoint, avoiding pickle serialization of non-serializable model objects. MalConv2 users should keep n_workers ≤ 2 to avoid GPU OOM.
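
A minimal sketch of how the grid could be dispatched. The worker here is a placeholder that only echoes its configuration; `gamma_worker` and `run_grid` are hypothetical names, and a real worker would rebuild the model from the checkpoint path before running the attack.

```python
import itertools
import multiprocessing
from concurrent.futures import ProcessPoolExecutor
from typing import Dict, List, Tuple

LAMBDA_VALUES = [1e-2, 1e-4, 1e-6, 1e-8]
QUERY_BUDGETS = [20, 60, 120, 300]


def build_grid() -> List[Tuple[float, int]]:
    """Cartesian product of lambda values and query budgets (4 x 4 runs)."""
    return list(itertools.product(LAMBDA_VALUES, QUERY_BUDGETS))


def gamma_worker(job: Tuple[str, float, int]) -> Dict[str, object]:
    """Placeholder worker that receives only picklable arguments.

    A real worker would load the model from `checkpoint_path` here, so the
    model object itself never has to cross the process boundary.
    """
    checkpoint_path, lam, query_budget = job
    return {"checkpoint": checkpoint_path, "lambda": lam, "query_budget": query_budget}


def run_grid(checkpoint_path: str, n_workers: int = 1) -> List[Dict[str, object]]:
    """Run every grid cell, sequentially by default or in spawned processes."""
    jobs = [(checkpoint_path, lam, qb) for lam, qb in build_grid()]
    if n_workers <= 1:
        return [gamma_worker(job) for job in jobs]
    ctx = multiprocessing.get_context("spawn")
    with ProcessPoolExecutor(max_workers=n_workers, mp_context=ctx) as pool:
        return list(pool.map(gamma_worker, jobs))


if __name__ == "__main__":
    results = run_grid("models/xgboost_best.json", n_workers=1)
    print(f"{len(results)} grid runs completed")
```

With the spawn context, each child starts from a fresh interpreter, so the worker function must live at module level and the entry point must be guarded by `if __name__ == "__main__":`.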

Security Evaluation Curves

The SECPlotter module generates multiple views of adversarial robustness:

  • SEC by query budget — detection rate vs. number of oracle queries, one line per lambda
  • SEC by lambda — detection rate vs. penalty weight (log scale, inverted axis)
  • Evasion rate heatmap — λ × query budget grid visualization
  • SEC by attack size — detection rate vs. mean bytes injected (KB)
  • ML vs DL comparison — overlay and subplot comparisons between XGBoost and MalConv2

Adversarial SHAP Analysis

AdversarialShapAnalyzer generates four SHAP waterfall plots per GAMMA configuration:

  • Original evaded — the pre-attack file that will later be evaded
  • Original non-evaded — the pre-attack file that will resist the attack
  • Adversarial evaded — same file after GAMMA injection (confidence < 0.5)
  • Adversarial non-evaded — same file after GAMMA injection (confidence > 0.5)

This 2×2 comparison reveals which EMBER features are shifted by the section injection and why some malware samples resist evasion despite the attack.


📊 Multi-Model Comparison

ModelComparator scans an experiment directory, loads metrics.json and gamma_summary.csv from each sub-experiment, and produces a full comparison report:

Pareto frontier analysis — models are projected onto a 2D space (F1 score on test set × mean GAMMA evasion rate). The Pareto frontier identifies models that are non-dominated on both axes; the most robust Pareto model is auto-selected for detailed SEC analysis.
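
A small sketch of the non-dominance test on this 2D space, assuming that higher F1 and lower mean evasion rate are both better and that "most robust" means the frontier model with the lowest mean evasion rate. The function names are hypothetical and do not claim to match `ModelComparator`.

```python
from typing import Dict, List, Tuple

# (F1 on test set, mean GAMMA evasion rate)
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """a dominates b if it is no worse on both axes and strictly better on one."""
    f1_a, ev_a = a
    f1_b, ev_b = b
    no_worse = f1_a >= f1_b and ev_a <= ev_b
    strictly_better = f1_a > f1_b or ev_a < ev_b
    return no_worse and strictly_better


def pareto_front(models: Dict[str, Point]) -> List[str]:
    """Names of all models that no other model dominates."""
    return [
        name
        for name, point in models.items()
        if not any(dominates(other, point) for key, other in models.items() if key != name)
    ]


def most_robust(models: Dict[str, Point], front: List[str]) -> str:
    """Frontier model with the lowest mean evasion rate."""
    return min(front, key=lambda name: models[name][1])
```

For example, given two models with (F1, evasion) of (0.98, 0.30) and (0.95, 0.10), neither dominates the other, so both lie on the frontier and the second would be selected as the most robust.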

Plots generated by generate_comparison():

| Plot | Description |
| --- | --- |
| pareto_<family>.png | Scatter with Pareto frontier, top-3 highlighted, error bars for evasion range |
| heatmap_<family>.png | Evasion rate per model × GAMMA config (16 columns, grouped by λ) |
| sec_bands_<family>.png | SEC with median ± min/max bands, Pareto models colored, others in grey |
| sec_detail_<family>.png | Attack size SEC for the selected model |
| sec_ml_vs_dl.png | Side-by-side final comparison of best ML vs. best DL model |

Model label auto-generation — if config_used.yaml is found in the experiment folder, labels are extracted from hyperparameters (e.g. lr=0.2, d=9) rather than folder names.

# Single family comparison
python -m src.evaluation.model_comparator \
    --base-dir experiments/xgboost_models \
    --model-type xgboost \
    --model-family XGBoost \
    --output plots/comparison

# ML vs DL final comparison
python -m src.evaluation.model_comparator \
    --ml-base-dir experiments/xgboost_models \
    --dl-base-dir experiments/malconv2_models \
    --output plots/comparison

⚙️ Installation

# Clone the repository
git clone https://github.com/<your-username>/malware-detection.git
cd malware-detection

# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# .venv\Scripts\activate   # Windows

# Install dependencies
pip install -r requirements.txt

Dependencies include: xgboost, torch, shap, scikit-learn, numpy, pandas, matplotlib, seaborn, pyyaml, tqdm, lief, ml-pentest


🚀 Usage

Step 1 — Feature extraction (XGBoost only, one-time operation)

python extract_features.py
# or with a custom config:
python extract_features.py config/experiment_config.yaml

This extracts EMBER features from all PE files in data/raw/ and saves .npy matrices in data/features/. Only required for the XGBoost model.

Step 2 — Run a single experiment

python main.py
# or with a custom config:
python main.py config/experiment_config.yaml

The pipeline will execute in order: setup → train → evaluate (Test + VirusShare) → SHAP → GAMMA.

Step 3 — Run multiple experiments in batch

# Explicit file list
python batch_runner.py config/batch/config_esperimento_1.yaml config/batch/config_esperimento_2.yaml

# All configs in a directory (alphabetical order)
python batch_runner.py --config-dir config/batch/

# Dry-run to preview the queue without executing
python batch_runner.py --config-dir config/batch/ --dry-run

Each experiment runs sequentially; failures are logged and execution continues. A summary table is printed at the end with status, duration, and any error messages.
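
The "log and continue" behaviour can be sketched as follows. This is an illustration of the pattern only; `run_batch` and the `run_one` callback are hypothetical names, not the actual `batch_runner.py` code.

```python
import time
from typing import Callable, Dict, List


def run_batch(config_paths: List[str], run_one: Callable[[str], None]) -> List[Dict[str, object]]:
    """Run experiments one after another; record failures instead of aborting."""
    summary: List[Dict[str, object]] = []
    for path in config_paths:
        start = time.perf_counter()
        status, error = "ok", ""
        try:
            run_one(path)
        except Exception as exc:  # one bad config must not stop the queue
            status, error = "failed", f"{type(exc).__name__}: {exc}"
        summary.append({
            "config": path,
            "status": status,
            "duration_s": round(time.perf_counter() - start, 3),
            "error": error,
        })
    return summary


def print_summary(summary: List[Dict[str, object]]) -> None:
    """Print one line per experiment with status, duration, and error message."""
    for row in summary:
        print(f"{row['config']:<45} {row['status']:<7} {row['duration_s']:>8}s  {row['error']}")
```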

Skip training (load existing weights)

# config/experiment_config.yaml
skip_training: true
checkpoint_path: "models/xgboost_best.json"  # or .pt for MalConv2

Generate SEC plots from existing results

python -m src.evaluation.sec_plotter \
  --ml experiments/run_xgboost/gamma/xgboost/gamma_summary.csv \
  --dl experiments/run_malconv2/gamma/malconv2/gamma_summary.csv \
  --output plots/sec \
  --ml-name "XGBoost" \
  --dl-name "MalConv2"

Run adversarial SHAP analysis

python -m src.evaluation.adversarial_shap \
  --model-path experiments/xgboost_run/model_checkpoints/xgboost_best.json \
  --gamma-dir experiments/xgboost_run/gamma/xgboost \
  --test-malware-dir data/raw/test/malware \
  --lambda-value 1e-06 \
  --query-budget 120 \
  --output plots/adversarial_shap

🔧 Configuration

The entire pipeline is driven by a single YAML file: config/experiment_config.yaml.

model_type: "xgboost"          # "xgboost" | "malconv2"
skip_training: false
checkpoint_path: ""
experiment_name: "my_run"

paths:
  features_dir: "data/features"
  train_benign_dir: "data/raw/train/benign"
  train_malware_dir: "data/raw/train/malware"
  # ... validation, test, virusshare paths

ml_model:                       # XGBoost hyperparameters
  n_estimators: 400
  max_depth: 9
  learning_rate: 0.2
  # ...

dl_model:                       # MalConv2 hyperparameters
  channels: 128
  load_pretrained: true
  learning_rate: 0.0005
  # ...

gamma:
  enabled: true
  n_workers: 1                  # >1 enables parallel grid execution
  lambda_values: [1.0e-2, 1.0e-4, 1.0e-6, 1.0e-8]
  query_budgets: [20, 60, 120, 300]
  num_benign_sections: 50
  population_size: 20
  iterations: 100

📂 Outputs

All outputs are organized under experiments/<experiment_name>/:

experiments/my_run/
│
├── config_used.yaml                   # Exact config snapshot for reproducibility
├── metrics.json                       # All split metrics (used by ModelComparator)
├── model_checkpoints/
│   ├── xgboost_best.json              # (XGBoost) or
│   └── malconv2_best.pt               # (MalConv2)
│
├── plots/
│   ├── confusion_matrix_test_dataset_1.png
│   ├── confusion_matrix_virusshare_dataset_2.png
│   ├── metrics_comparison.png
│   ├── learning_curve.png
│   ├── shap_summary_plot.png          # XGBoost only
│   ├── shap_bar_plot.png
│   ├── shap_waterfall_plot.png
│   ├── shap_beeswarm_plot.png
│   ├── shap_force_plot.png
│   └── shap_scatter_plot.png
│
└── gamma/
    ├── selected_malware/              # ≥100 correctly detected malware samples for the attack
    └── xgboost/                       # (or malconv2/)
        ├── adv_<lambda>_<qb>/         # Adversarial PE files per run
        ├── results_<lambda>_<qb>.json # Per-sample attack results
        ├── gamma_summary.csv          # Aggregated grid results (input to ModelComparator)
        └── plots/
            ├── sec_query_budget_xgboost.png
            ├── sec_lambda_xgboost.png
            ├── evasion_heatmap_xgboost.png
            └── sec_attack_size_xgboost.png

🧠 Key Design Decisions

Strategy Pattern for model interoperability: XGBoostStrategy and MalConv2Strategy both implement BaseClassifier. The pipeline operates on the abstract interface, with bifurcation only at data loading (numpy arrays vs. file path lists) and GAMMA wrapper construction.

Single Source of Truth for features: feature_config.py is the only place where EMBER feature group boundaries and leakage indices are defined. Every downstream module (dataset loader, SHAP explainer, GAMMA adapter, adversarial SHAP analyzer) imports from it, which prevents index misalignment across modules.

Parallel GAMMA with per-worker model reconstruction: rather than sending model objects to worker processes via pickle (which can be fragile for objects backed by native state, such as XGBoost boosters or PyTorch modules holding device tensors), each worker in the parallel grid rebuilds the model from the saved checkpoint using _rebuild_wrapper(). This keeps the parallelism safe and avoids any dependency on the pickle compatibility of third-party objects.

memoryview sanitization for pickle — LIEF returns PE section data as memoryview objects, which are not pickle-serializable. The _sanitize_for_pickle() function recursively converts them to list[int], which is the format LIEF's C++ bindings accept on the receiving end anyway.
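
A minimal sketch of such a recursive conversion, assuming the section data is held in nested dicts, lists, and tuples. The function below is illustrative and not the repository's `_sanitize_for_pickle()`.

```python
import pickle
from typing import Any


def sanitize_for_pickle(obj: Any) -> Any:
    """Recursively replace memoryview objects with list[int]."""
    if isinstance(obj, memoryview):
        # tobytes() flattens the buffer; list() then yields one int per byte.
        return list(obj.tobytes())
    if isinstance(obj, dict):
        return {key: sanitize_for_pickle(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [sanitize_for_pickle(item) for item in obj]
    if isinstance(obj, tuple):
        return tuple(sanitize_for_pickle(item) for item in obj)
    return obj


# Hypothetical section record, shaped for illustration only.
section = {"name": ".data", "content": memoryview(b"\x00\x01\xff")}
clean = sanitize_for_pickle(section)
restored = pickle.loads(pickle.dumps(clean))
```

After sanitization the record round-trips through `pickle`, so it can be passed to worker processes.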

Conservative fallback in GAMMA adapter — if EMBER feature extraction fails on a GAMMA-modified PE (e.g., corrupted headers), XGBoostModelWrapper.classify_sample returns 1.0 (malware). A file that cannot be parsed does not count as a successful evasion.
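
The fallback can be sketched as a thin wrapper around feature extraction. The callables passed in are hypothetical stand-ins for the EMBER extractor and the model's scoring function; this is not the actual `XGBoostModelWrapper` code.

```python
from typing import Callable, Sequence

MALWARE_SCORE = 1.0


def classify_with_fallback(
    pe_bytes: bytes,
    extract_features: Callable[[bytes], Sequence[float]],
    predict_score: Callable[[Sequence[float]], float],
) -> float:
    """Return the model score, or 1.0 (malware) if the file cannot be parsed."""
    try:
        features = extract_features(pe_bytes)
    except Exception:
        # An unparsable adversarial file must not count as a successful evasion.
        return MALWARE_SCORE
    return float(predict_score(features))
```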

Checkpoint-aware pipeline: setting skip_training: true together with checkpoint_path resumes the evaluation and adversarial steps on a previously trained model without retraining. The pipeline also auto-resolves the checkpoint path for the parallel GAMMA orchestrator by looking inside the current experiment's model_checkpoints/ directory before falling back to the config value.

Idempotent GAMMA sample selection — if the selected_malware/ directory already exists and is non-empty, the SampleSelector step is skipped. This makes re-running the GAMMA grid after parameter changes (e.g., adding lambda values) safe without re-classifying the entire test set.
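
The skip condition amounts to an "exists and is non-empty" check. A minimal sketch, with hypothetical function names and a caller-supplied `select` callback standing in for the real selection step:

```python
from pathlib import Path
from typing import Callable


def selection_needed(selected_dir: Path) -> bool:
    """True when the directory is missing or contains no entries."""
    return not (selected_dir.is_dir() and any(selected_dir.iterdir()))


def ensure_selection(selected_dir: Path, select: Callable[[Path], None]) -> bool:
    """Run `select` only when needed; return whether it actually ran."""
    if not selection_needed(selected_dir):
        return False
    selected_dir.mkdir(parents=True, exist_ok=True)
    select(selected_dir)
    return True
```

Calling `ensure_selection` twice in a row runs the selection once; the second call returns `False` as long as the first one left files in the directory.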

metrics.json as inter-module contract — each experiment saves its classification metrics to metrics.json at the end of evaluation. ModelComparator uses this file to populate the Pareto scatter without needing to re-run inference, decoupling the comparison tool from the training pipeline.


👥 Authors

| Name | GitHub |
| --- | --- |
| Antonio Apicella | @apiantonio |
| Ivan Luigi Cipriano | @ivanCipriano |
| Simone Faraulo | @SimoneFaraulo |
| Antonio Graziosi | @tonygraziosi13 |
