A research-grade pipeline for PE malware detection combining classical ML and deep learning, with adversarial robustness evaluation via the GAMMA attack framework.
This project implements an end-to-end malware detection system for Windows PE files, developed as part of a graduate-level AI for Cybersecurity course. The pipeline supports two classification paradigms and evaluates their adversarial robustness under a realistic black-box attack.
Core capabilities:
- Feature-based classification with XGBoost on EMBER v2 features (2381-dim feature vectors, reduced to 2358 after leakage removal)
- Raw-bytes classification with MalConv2 (CNN-based, operates directly on PE byte sequences)
- Adversarial robustness evaluation using GAMMA (Genetic Adversarial Machine learning Malware Attack) — a black-box section injection attack with optional parallel grid execution
- SHAP-based interpretability for tree-based models, including adversarial SHAP analysis comparing evaded vs. non-evaded samples
- Multi-model comparison via Pareto frontier analysis (F1 vs. evasion rate)
- Security Evaluation Curves (SEC) across a full parameter grid (λ × query budget)
- Batch experiment execution for automated sequential runs across multiple configurations
The pipeline is designed around clean software engineering principles: a Strategy pattern for model interoperability, a single-source-of-truth feature configuration, and a fully config-driven experiment loop.
ExperimentPipeline (Template Method)
│
├── ClassifierFactory → Strategy selection (XGBoost | MalConv2)
│ ├── XGBoostStrategy → EMBER features + early stopping
│ └── MalConv2Strategy → Raw bytes + pretrained weights + layer freezing
│
├── EmberFeatureLoader → .npy loading + leakage removal (SSOT)
├── PEFileListBuilder → Raw PE path lists for MalConv2
│
├── MetricsCalculator → Accuracy / Precision / Recall / F1
├── PlotGenerator → Confusion matrices + metric comparison
├── ShapExplainer → TreeExplainer + 6 SHAP plot types
│
├── GAMMA Pipeline
│ ├── SampleSelector → Filters correctly-detected malware (≥100)
│ ├── GammaOrchestrator → λ × query_budget grid (sequential or parallel)
│ │ ├── GammaSectionInjection (ml-pentest library)
│ │ └── _gamma_worker → Per-process model reconstruction for parallel runs
│ ├── XGBoostModelWrapper → EMBER adapter for GAMMA interface
│ └── SECPlotter → Security Evaluation Curves + heatmaps
│
├── ModelComparator → Multi-model Pareto analysis + ranking
│ ├── plot_pareto → F1 vs. mean evasion rate scatter
│ ├── plot_comparison_heatmap → Evasion rate per model × GAMMA config
│ ├── plot_sec_bands → SEC with min/max bands, Pareto models highlighted
│ ├── plot_sec_detail → Attack size SEC for selected model
│ └── plot_ml_vs_dl → Side-by-side ML vs DL final comparison
│
├── AdversarialShapAnalyzer → SHAP waterfall: original vs. adversarial pairs
│
└── BatchRunner → Sequential multi-config experiment execution
The two classification paradigms diverge only at data loading and prediction; the rest of the pipeline is fully shared via the BaseClassifier interface.
malware-detection/
│
├── config/
│ ├── experiment_config.yaml # Default single-run configuration
│ └── batch/
│ ├── config_esperimento_1.yaml # Batch config — experiment 1
│ └── config_esperimento_2.yaml # Batch config — experiment 2
│
├── src/
│ ├── config/
│ │ └── feature_config.py # SSOT for EMBER feature groups and leakage indices
│ │
│ ├── data/
│ │ └── dataset_loader.py # EmberFeatureLoader + PEFileListBuilder
│ │
│ ├── models/
│ │ ├── base_model.py # Abstract Strategy interface
│ │ ├── classifier_factory.py # Factory Method for model instantiation
│ │ ├── xgboost_strategy.py # XGBoost concrete strategy
│ │ └── malconv2_strategy.py # MalConv2 concrete strategy
│ │
│ ├── evaluation/
│ │ ├── metrics.py # Classification metrics
│ │ ├── plot_generator.py # Confusion matrices + comparison charts
│ │ ├── shap_explainer.py # SHAP plots for tree-based models
│ │ ├── adversarial_shap.py # SHAP analysis on GAMMA-evaded samples
│ │ ├── sec_plotter.py # Security Evaluation Curves
│ │ └── model_comparator.py # Multi-model Pareto + SEC comparison
│ │
│ ├── adversarial/
│ │ ├── wrappers.py # XGBoostModelWrapper (GAMMA adapter)
│ │ ├── sample_selector.py # Malware selection for GAMMA
│ │ └── gamma_orchestrator.py # Grid execution (sequential + parallel)
│ │
│ └── pipeline/
│ └── experiment_pipeline.py # Template Method — full experiment orchestration
│
├── extract_features.py # Standalone EMBER feature extraction script
├── batch_runner.py # Multi-config sequential experiment runner
└── main.py # Entry point
Operates on EMBER v2 features extracted from PE files — a rich 2381-dimensional representation covering byte histograms, entropy, string statistics, PE header fields, section info, imports, exports, and data directories.
Data leakage removal: Three feature groups known to introduce dataset-specific bias are removed before training:
- TimeDateStamp (index 626): compilation epoch correlation
- COFF Machine type (indices 627–637): architecture distribution skew between SoReL-20M malware (x86) and modern benign files (x64)
- Optional Header Subsystem (indices 647–657): same distribution artifact
This removal is enforced via a single source of truth in feature_config.py, applied consistently across training, inference, SHAP, and the GAMMA adapter.
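The index arithmetic above can be checked with a short sketch. It is not the project's feature_config.py: the constant and function names are illustrative assumptions, and only the index ranges come from the list above.

```python
# Leakage index ranges as listed above (inclusive bounds).
# The names below are hypothetical, not taken from feature_config.py.
LEAKAGE_RANGES = {
    "TimeDateStamp": (626, 626),
    "Machine": (627, 637),
    "Subsystem": (647, 657),
}
EMBER_DIM = 2381

LEAKAGE_INDICES = sorted(
    i for lo, hi in LEAKAGE_RANGES.values() for i in range(lo, hi + 1)
)
_leak = set(LEAKAGE_INDICES)
# Column indices that survive the removal, in their original order.
KEEP_INDICES = [i for i in range(EMBER_DIM) if i not in _leak]


def remove_leakage(row):
    """Return a copy of one 2381-dim feature vector without the leakage columns."""
    if len(row) != EMBER_DIM:
        raise ValueError(f"expected {EMBER_DIM} features, got {len(row)}")
    return [row[i] for i in KEEP_INDICES]
```

With NumPy arrays the same list can be used for column selection, for example `X[:, KEEP_INDICES]`. Defining it once and importing it everywhere is what keeps training, inference, SHAP, and the GAMMA adapter aligned on the same 2358 columns.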
| Parameter | Value |
|---|---|
| n_estimators | 400 (+ early stopping) |
| max_depth | 9 |
| learning_rate | 0.2 |
| subsample | 0.8 |
| colsample_bytree | 0.8 |
| early_stopping_rounds | 20 |
A gated CNN that operates directly on the raw byte sequences of PE files (up to 1 MiB, i.e. 1,048,576 bytes). No feature engineering is required: the model learns its own byte-level representations.
Supports loading pre-trained weights from ml-pentest. When load_pretrained=True, embedding and convolutional layers are frozen and only the classifier head is fine-tuned, avoiding catastrophic forgetting on smaller datasets.
| Parameter | Value |
|---|---|
| channels | 128 |
| window_size / stride | 500 |
| embd_size | 8 |
| max_len | 1,048,576 bytes |
| learning_rate | 5e-4 |
| patience | 12 |
GAMMA (Genetic Adversarial Machine learning Malware Attack) is a black-box attack that injects benign PE sections into malware files to evade detection, using a genetic algorithm to optimize the injection.
- Section population extraction: benign PE sections (.data, .rdata, .idata, .rodata) are extracted from VirusShare benign files; memoryview objects are sanitized for pickle compatibility before parallel dispatch
- Sample selection: at least 100 malware samples correctly detected by the target model are selected from the test set; if the selected_malware/ directory already exists and is non-empty, selection is skipped
- Grid execution: sequential (default) or parallel via ProcessPoolExecutor with the spawn context; each worker independently reconstructs the model from the saved checkpoint
- Results aggregation: evasion rates, confidence scores, injected bytes, elapsed time, and stagnation flags are collected per run and saved to gamma_summary.csv
| Parameter | Values |
|---|---|
| lambda (penalty weight) | 1e-2, 1e-4, 1e-6, 1e-8 |
| query_budget | 20, 60, 120, 300 |
| Total runs | 16 (4 × 4) |
lambda controls the trade-off between evasion success and injection size. Lower values allow larger injections, typically yielding higher evasion rates at the cost of more bytes added to the file.
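The trade-off can be made concrete with a small sketch. It assumes the attacker minimizes the model's malware confidence plus lambda times the number of injected bytes, which is how the GAMMA paper describes its objective; the function name and the example candidates are illustrative and are not taken from this codebase or from ml-pentest.

```python
from itertools import product

LAMBDA_VALUES = [1e-2, 1e-4, 1e-6, 1e-8]
QUERY_BUDGETS = [20, 60, 120, 300]

# Every (lambda, query_budget) pair is one GAMMA run.
GRID = list(product(LAMBDA_VALUES, QUERY_BUDGETS))


def penalized_score(confidence: float, injected_bytes: int, lam: float) -> float:
    """Assumed attacker objective (lower is better): confidence plus size penalty."""
    return confidence + lam * injected_bytes


# Two hypothetical candidates for the same malware sample.
untouched = {"confidence": 0.9, "injected_bytes": 0}
large_injection = {"confidence": 0.1, "injected_bytes": 100_000}


def preferred(lam: float) -> str:
    """Name of the candidate with the lower penalized score for this lambda."""
    a = penalized_score(untouched["confidence"], untouched["injected_bytes"], lam)
    b = penalized_score(large_injection["confidence"], large_injection["injected_bytes"], lam)
    return "untouched" if a <= b else "large_injection"
```

Under these assumptions, `preferred(1e-2)` favors the untouched file because the 100,000-byte payload costs a penalty of about 1000, while `preferred(1e-8)` favors the large injection because the same payload costs only about 0.001.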
Set n_workers > 1 in the config to distribute grid runs across multiple processes. Each worker rebuilds the model from its checkpoint, avoiding pickle serialization of non-serializable model objects. MalConv2 users should keep n_workers ≤ 2 to avoid GPU OOM.
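A minimal sketch of this pattern follows. The worker receives only a checkpoint path and the grid cell, then rebuilds the model itself. `load_model` and `run_one` are hypothetical placeholders and do not reproduce `_gamma_worker` or any ml-pentest call.

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor
from itertools import product


def load_model(checkpoint_path: str) -> dict:
    """Placeholder for loading an XGBoost or MalConv2 checkpoint from disk."""
    return {"checkpoint": checkpoint_path}


def run_one(task: tuple) -> dict:
    """Worker body: rebuild the model locally, then handle one grid cell."""
    checkpoint_path, lam, query_budget = task
    model = load_model(checkpoint_path)  # built inside the worker, never pickled
    # A real worker would run the attack here; this sketch only echoes its inputs.
    return {"lambda": lam, "query_budget": query_budget,
            "checkpoint": model["checkpoint"]}


def run_grid(checkpoint_path, lambdas, budgets, n_workers=1):
    """Run every (lambda, query_budget) cell sequentially or in a process pool."""
    tasks = [(checkpoint_path, lam, qb) for lam, qb in product(lambdas, budgets)]
    if n_workers <= 1:
        return [run_one(t) for t in tasks]
    ctx = mp.get_context("spawn")  # fresh interpreter per worker
    with ProcessPoolExecutor(max_workers=n_workers, mp_context=ctx) as pool:
        return list(pool.map(run_one, tasks))
```

Only plain tuples of strings and numbers cross the process boundary, so nothing depends on the model object being picklable. With the spawn context, a call using `n_workers > 1` should sit behind an `if __name__ == "__main__":` guard in the entry script, since each worker re-imports the main module.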
The SECPlotter module generates multiple views of adversarial robustness:
- SEC by query budget — detection rate vs. number of oracle queries, one line per lambda
- SEC by lambda — detection rate vs. penalty weight (log scale, inverted axis)
- Evasion rate heatmap — λ × query budget grid visualization
- SEC by attack size — detection rate vs. mean bytes injected (KB)
- ML vs DL comparison — overlay and subplot comparisons between XGBoost and MalConv2
AdversarialShapAnalyzer generates four SHAP waterfall plots per GAMMA configuration:
- Original evaded — the pre-attack file that will later be evaded
- Original non-evaded — the pre-attack file that will resist the attack
- Adversarial evaded — same file after GAMMA injection (confidence < 0.5)
- Adversarial non-evaded — same file after GAMMA injection (confidence > 0.5)
This 2×2 comparison reveals which EMBER features are shifted by the section injection and why some malware samples resist evasion despite the attack.
ModelComparator scans an experiment directory, loads metrics.json and gamma_summary.csv from each sub-experiment, and produces a full comparison report:
Pareto frontier analysis — models are projected onto a 2D space (F1 score on test set × mean GAMMA evasion rate). The Pareto frontier identifies models that are non-dominated on both axes; the most robust Pareto model is auto-selected for detailed SEC analysis.
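The non-dominance test can be sketched as follows, assuming higher F1 and lower mean evasion rate are both better and that "most robust" means the lowest mean evasion rate on the frontier. The function names and example values are illustrative and are not taken from ModelComparator.

```python
def pareto_front(models):
    """Return the non-dominated models.

    models: list of (label, f1, evasion_rate) tuples.
    A model is dominated if another one is at least as good on both axes
    and strictly better on at least one.
    """
    front = []
    for label, f1, ev in models:
        dominated = any(
            (f1_o >= f1 and ev_o <= ev) and (f1_o > f1 or ev_o < ev)
            for _, f1_o, ev_o in models
        )
        if not dominated:
            front.append((label, f1, ev))
    return front


def most_robust(front):
    """Assumed selection rule: the frontier model with the lowest evasion rate."""
    return min(front, key=lambda m: m[2])


# Hypothetical values, for illustration only.
example = [
    ("a", 0.95, 0.40),
    ("b", 0.93, 0.20),
    ("c", 0.90, 0.30),  # dominated by "b": lower F1 and higher evasion
    ("d", 0.96, 0.60),
]
```

In this example, `pareto_front(example)` keeps a, b, and d, and `most_robust` picks b.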
Plots generated by generate_comparison():
| Plot | Description |
|---|---|
| pareto_<family>.png | Scatter with Pareto frontier, top-3 highlighted, error bars for evasion range |
| heatmap_<family>.png | Evasion rate per model × GAMMA config (16 columns, grouped by λ) |
| sec_bands_<family>.png | SEC with median ± min/max bands, Pareto models colored, others in grey |
| sec_detail_<family>.png | Attack size SEC for the selected model |
| sec_ml_vs_dl.png | Side-by-side final comparison of best ML vs. best DL model |
Model label auto-generation — if config_used.yaml is found in the experiment folder, labels are extracted from hyperparameters (e.g. lr=0.2, d=9) rather than folder names.
# Single family comparison
python -m src.evaluation.model_comparator \
--base-dir experiments/xgboost_models \
--model-type xgboost \
--model-family XGBoost \
--output plots/comparison
# ML vs DL final comparison
python -m src.evaluation.model_comparator \
--ml-base-dir experiments/xgboost_models \
--dl-base-dir experiments/malconv2_models \
--output plots/comparison

# Clone the repository
git clone https://github.com/<your-username>/malware-detection.git
cd malware-detection
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate # Linux/macOS
# .venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt

Dependencies include: xgboost, torch, shap, scikit-learn, numpy, pandas, matplotlib, seaborn, pyyaml, tqdm, lief, ml-pentest
python extract_features.py
# or with a custom config:
python extract_features.py config/experiment_config.yaml

This extracts EMBER features from all PE files in data/raw/ and saves .npy matrices in data/features/. Only required for the XGBoost model.
python main.py
# or with a custom config:
python main.py config/experiment_config.yaml

The pipeline will execute in order: setup → train → evaluate (Test + VirusShare) → SHAP → GAMMA.
# Explicit file list
python batch_runner.py config/batch/config_esperimento_1.yaml config/batch/config_esperimento_2.yaml
# All configs in a directory (alphabetical order)
python batch_runner.py --config-dir config/batch/
# Dry-run to preview the queue without executing
python batch_runner.py --config-dir config/batch/ --dry-run

Each experiment runs sequentially; failures are logged and execution continues. A summary table is printed at the end with status, duration, and any error messages.
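The continue-on-failure behavior can be sketched like this. It is not the code in batch_runner.py: `run_batch`, the `run_experiment` callback, and the summary field names are assumptions made for illustration.

```python
import time


def run_batch(config_paths, run_experiment, dry_run=False):
    """Run each config in order and collect one summary row per config.

    run_experiment is any callable that takes a config path and raises on failure.
    """
    summary = []
    for path in config_paths:
        if dry_run:
            # Preview the queue without executing anything.
            summary.append({"config": path, "status": "DRY-RUN",
                            "seconds": 0.0, "error": ""})
            continue
        start = time.perf_counter()
        try:
            run_experiment(path)
            status, error = "OK", ""
        except Exception as exc:
            # Record the failure and move on to the next config.
            status, error = "FAILED", f"{type(exc).__name__}: {exc}"
        summary.append({"config": path, "status": status,
                        "seconds": time.perf_counter() - start, "error": error})
    return summary
```

Catching the exception per experiment, rather than around the whole loop, is what lets one broken config fail without cancelling the rest of the queue.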
# config/experiment_config.yaml
skip_training: true
checkpoint_path: "models/xgboost_best.json" # or .pt for MalConv2

python -m src.evaluation.sec_plotter \
--ml experiments/run_xgboost/gamma/xgboost/gamma_summary.csv \
--dl experiments/run_malconv2/gamma/malconv2/gamma_summary.csv \
--output plots/sec \
--ml-name "XGBoost" \
--dl-name "MalConv2"

python -m src.evaluation.adversarial_shap \
--model-path experiments/xgboost_run/model_checkpoints/xgboost_best.json \
--gamma-dir experiments/xgboost_run/gamma/xgboost \
--test-malware-dir data/raw/test/malware \
--lambda-value 1e-06 \
--query-budget 120 \
--output plots/adversarial_shap

The entire pipeline is driven by a single YAML file: config/experiment_config.yaml.
model_type: "xgboost" # "xgboost" | "malconv2"
skip_training: false
checkpoint_path: ""
experiment_name: "my_run"
paths:
  features_dir: "data/features"
  train_benign_dir: "data/raw/train/benign"
  train_malware_dir: "data/raw/train/malware"
  # ... validation, test, virusshare paths
ml_model: # XGBoost hyperparameters
  n_estimators: 400
  max_depth: 9
  learning_rate: 0.2
  # ...
dl_model: # MalConv2 hyperparameters
  channels: 128
  load_pretrained: true
  learning_rate: 0.0005
  # ...
gamma:
  enabled: true
  n_workers: 1 # >1 enables parallel grid execution
  lambda_values: [1.0e-2, 1.0e-4, 1.0e-6, 1.0e-8]
  query_budgets: [20, 60, 120, 300]
  num_benign_sections: 50
  population_size: 20
  iterations: 100

All outputs are organized under experiments/<experiment_name>/:
experiments/my_run/
│
├── config_used.yaml # Exact config snapshot for reproducibility
├── metrics.json # All split metrics (used by ModelComparator)
├── model_checkpoints/
│ ├── xgboost_best.json # (XGBoost) or
│ └── malconv2_best.pt # (MalConv2)
│
├── plots/
│ ├── confusion_matrix_test_dataset_1.png
│ ├── confusion_matrix_virusshare_dataset_2.png
│ ├── metrics_comparison.png
│ ├── learning_curve.png
│ ├── shap_summary_plot.png # XGBoost only
│ ├── shap_bar_plot.png
│ ├── shap_waterfall_plot.png
│ ├── shap_beeswarm_plot.png
│ ├── shap_force_plot.png
│ └── shap_scatter_plot.png
│
└── gamma/
    ├── selected_malware/ # Correctly detected malware selected for the attack (at least 100)
    └── xgboost/ # (or malconv2/)
        ├── adv_<lambda>_<qb>/ # Adversarial PE files per run
        ├── results_<lambda>_<qb>.json # Per-sample attack results
        ├── gamma_summary.csv # Aggregated grid results (input to ModelComparator)
        └── plots/
            ├── sec_query_budget_xgboost.png
            ├── sec_lambda_xgboost.png
            ├── evasion_heatmap_xgboost.png
            └── sec_attack_size_xgboost.png
Strategy Pattern for model interoperability — XGBoostStrategy and MalConv2Strategy both implement BaseClassifier. The pipeline operates on the abstract interface, with bifurcation only at data loading (numpy arrays vs. file path lists) and GAMMA wrapper construction.
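The shape of this design can be sketched as below. The class names mirror the ones mentioned above, but the method names, bodies, and the `create_classifier` helper are assumptions made for illustration and do not reproduce the project's actual interface.

```python
from abc import ABC, abstractmethod


class BaseClassifier(ABC):
    """Abstract strategy; the method names are illustrative assumptions."""

    @abstractmethod
    def load_data(self, paths: dict) -> dict:
        """Return the model-specific input description."""

    @abstractmethod
    def predict(self, data) -> list:
        """Return one malware score per sample."""


class XGBoostStrategy(BaseClassifier):
    def load_data(self, paths):
        # Feature-based branch: precomputed EMBER matrices.
        return {"kind": "npy_features", "source": paths["features_dir"]}

    def predict(self, data):
        return []  # placeholder, no real model in this sketch


class MalConv2Strategy(BaseClassifier):
    def load_data(self, paths):
        # Raw-bytes branch: lists of PE file paths.
        return {"kind": "pe_path_list", "source": paths["train_malware_dir"]}

    def predict(self, data):
        return []  # placeholder, no real model in this sketch


_REGISTRY = {"xgboost": XGBoostStrategy, "malconv2": MalConv2Strategy}


def create_classifier(model_type: str) -> BaseClassifier:
    """Map the model_type config value to a concrete strategy instance."""
    try:
        return _REGISTRY[model_type]()
    except KeyError:
        raise ValueError(f"unknown model_type: {model_type!r}") from None
```

Because the caller only sees `BaseClassifier`, switching `model_type` in the config changes which branch loads the data without touching the rest of the experiment loop.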
Single Source of Truth for features — feature_config.py is the only place where EMBER feature group boundaries and leakage indices are defined. Every downstream module (dataset loader, SHAP explainer, GAMMA adapter, adversarial SHAP analyzer) imports from it, eliminating any risk of index misalignment across modules.
Parallel GAMMA with per-worker model reconstruction — rather than serializing model objects via pickle (which fails for XGBoost's internal C++ structures and PyTorch modules with non-trivial state), each worker in the parallel grid rebuilds the model from the saved checkpoint using _rebuild_wrapper(). This keeps the parallelism safe and avoids any dependency on pickle-compatibility of third-party objects.
memoryview sanitization for pickle — LIEF returns PE section data as memoryview objects, which are not pickle-serializable. The _sanitize_for_pickle() function recursively converts them to list[int], which is the format LIEF's C++ bindings accept on the receiving end anyway.
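A minimal sketch of such a recursive conversion, using only the standard library. It is a stand-in written for illustration, not the project's `_sanitize_for_pickle()`, and it handles only dicts, lists, and tuples as containers.

```python
import pickle


def sanitize_for_pickle(obj):
    """Recursively replace memoryview objects with lists of ints."""
    if isinstance(obj, memoryview):
        return list(obj.tobytes())
    if isinstance(obj, dict):
        return {key: sanitize_for_pickle(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [sanitize_for_pickle(item) for item in obj]
    if isinstance(obj, tuple):
        return tuple(sanitize_for_pickle(item) for item in obj)
    return obj


# Hypothetical section population: names mapped to raw section contents.
sections = {"data": [memoryview(b"\x00\x01"), (memoryview(b"MZ"),)]}
clean = sanitize_for_pickle(sections)
restored = pickle.loads(pickle.dumps(clean))
```

Calling `pickle.dumps` on the original `sections` dict raises `TypeError`, while the sanitized copy round-trips unchanged.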
Conservative fallback in GAMMA adapter — if EMBER feature extraction fails on a GAMMA-modified PE (e.g., corrupted headers), XGBoostModelWrapper.classify_sample returns 1.0 (malware). A file that cannot be parsed does not count as a successful evasion.
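The fallback logic amounts to a guarded feature extraction step. The sketch below uses hypothetical `extract_features` and `predict_score` callables in place of the real EMBER extractor and model, so it illustrates the idea rather than the actual `classify_sample` method.

```python
MALWARE_FALLBACK = 1.0  # unparseable files are treated as detected


def classify_with_fallback(pe_bytes, extract_features, predict_score):
    """Return a malware score, or 1.0 if feature extraction fails.

    extract_features and predict_score are caller-supplied callables;
    both are placeholders for this sketch.
    """
    try:
        features = extract_features(pe_bytes)
    except Exception:
        # A file that cannot be parsed must not count as an evasion.
        return MALWARE_FALLBACK
    return float(predict_score(features))
```

Returning the maximum score on failure biases the measured evasion rate downward, never upward, so a corrupted adversarial file cannot inflate the attack's apparent success.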
Checkpoint-aware pipeline — skip_training: true + checkpoint_path allows resuming the evaluation and adversarial steps on previously trained models without retraining. The pipeline also auto-resolves the checkpoint path for the parallel GAMMA orchestrator by looking inside the current experiment's model_checkpoints/ directory before falling back to the config value.
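The lookup order described here can be sketched with pathlib. The function name and signature are assumptions; only the model_checkpoints/ directory name and the fallback to the config value come from the description above.

```python
from pathlib import Path


def resolve_checkpoint(experiment_dir, filename, config_checkpoint_path=""):
    """Prefer the experiment's own checkpoint, then the path from the config."""
    local = Path(experiment_dir) / "model_checkpoints" / filename
    if local.is_file():
        return local
    if config_checkpoint_path and Path(config_checkpoint_path).is_file():
        return Path(config_checkpoint_path)
    raise FileNotFoundError(
        f"no checkpoint at {local} and no usable checkpoint_path in the config"
    )
```

Failing loudly when neither location exists avoids launching a GAMMA grid whose workers would each fail at model reconstruction.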
Idempotent GAMMA sample selection — if the selected_malware/ directory already exists and is non-empty, the SampleSelector step is skipped. This makes re-running the GAMMA grid after parameter changes (e.g., adding lambda values) safe without re-classifying the entire test set.
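The skip condition reduces to a directory check. The helper below is a hypothetical sketch of it and is not taken from SampleSelector.

```python
from pathlib import Path


def needs_selection(selected_dir) -> bool:
    """True unless the directory exists and already contains at least one entry."""
    directory = Path(selected_dir)
    return not (directory.is_dir() and any(directory.iterdir()))
```

An empty directory still triggers selection, which covers the case where a previous run created the folder but was interrupted before copying any samples.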
metrics.json as inter-module contract — each experiment saves its classification metrics to metrics.json at the end of evaluation. ModelComparator uses this file to populate the Pareto scatter without needing to re-run inference, decoupling the comparison tool from the training pipeline.
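Reading one experiment's point for the Pareto scatter could look like the sketch below. The file layout assumed here (a per-split dict with an `f1` key in metrics.json and an `evasion_rate` column in gamma_summary.csv) is a guess made for illustration; the real schema may differ.

```python
import csv
import json
from pathlib import Path


def load_pareto_point(experiment_dir, split="test", model_type="xgboost"):
    """Return (f1, mean_evasion_rate) for one experiment folder.

    Assumed schema: metrics.json maps split names to metric dicts, and
    gamma_summary.csv has one row per grid run with an 'evasion_rate' column.
    """
    exp = Path(experiment_dir)
    metrics = json.loads((exp / "metrics.json").read_text(encoding="utf-8"))
    f1 = float(metrics[split]["f1"])

    summary_path = exp / "gamma" / model_type / "gamma_summary.csv"
    with summary_path.open(newline="", encoding="utf-8") as handle:
        rates = [float(row["evasion_rate"]) for row in csv.DictReader(handle)]
    if not rates:
        raise ValueError(f"no GAMMA runs found in {summary_path}")
    return f1, sum(rates) / len(rates)
```

Only files on disk are read, so a comparison can be regenerated without loading any model or re-running inference.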
- Anderson, H. S. & Roth, P. (2018). EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models
- Raff, E. et al. (2018). Malware Detection by Eating a Whole EXE — MalConv
- Demetrio, L. et al. (2021). Functionality-Preserving Black-Box Optimization of Adversarial Windows Malware — GAMMA
- Lundberg, S. M. & Lee, S. I. (2017). A Unified Approach to Interpreting Model Predictions — SHAP
| Name | GitHub |
|---|---|
| Antonio Apicella | @apiantonio |
| Ivan Luigi Cipriano | @ivanCipriano |
| Simone Faraulo | @SimoneFaraulo |
| Antonio Graziosi | @tonygraziosi13 |