A high-performance, compact neural network for PE malware classification using the EMBER feature set. MalwareNet employs a Hierarchical Gated Architecture across five semantically distinct feature branches. It retains an AUC of at least 0.9910 under FGSM and PGD attacks at ε = 0.1, in a model small enough to run with sub-millisecond latency on commodity CPU hardware.
| Metric | Value |
|---|---|
| Parameters | 273,452 |
| AUC-ROC (test set) | 0.9911 |
| Expected Calibration Error (ECE) | 0.0079 |
| AUC under PGD attack (ε=0.1) | ≥ 0.9910 |
| AUC under FGSM attack (ε=0.1) | ≥ 0.9910 |
| Best LTH ticket (val AUC / sparsity) | 0.9904 @ 48.18% sparse |
| Inference latency (CPU, single file) | 0.041 ms |
| Throughput (CPU) | ~24,400 files/sec |
MalwareNet splits the 2,568-dimensional EMBER v3 feature vector into five semantically meaningful groups, each processed by an independent GatedFeatureBlock. Gated representations are fused via a FusionBlock that assigns learnable per-group importance weights before projecting to a binary logit. A Platt Scaling post-processor is baked into the exported ONNX graph so every inference call returns a calibrated probability.
flowchart TD
A[PE File: .exe / .dll] --> B[EMBER2024 Feature Extraction]
B --> C[2568-d EMBER Feature Vector]
C --> G[global features<br/>156-d]
C --> BY[byte histogram/features<br/>512-d]
C --> S[strings features<br/>177-d]
C --> SE[sections features<br/>224-d]
C --> I[imports features<br/>1411-d]
G --> G1[GatedFeatureBlock<br/>16-d]
BY --> B1[GatedFeatureBlock<br/>32-d]
S --> S1[GatedFeatureBlock<br/>32-d]
SE --> SE1[GatedFeatureBlock<br/>32-d]
I --> I1[GatedFeatureBlock<br/>64-d]
G1 --> F[FusionBlock<br/>learnable cross-group weighting]
B1 --> F
S1 --> F
SE1 --> F
I1 --> F
F --> L[Linear Projection<br/>96 → 1]
L --> P[PlattScaler + Sigmoid]
P --> O[Calibrated Malware Probability<br/>0.0 = benign / 1.0 = malware]
O --> R[Rust Desktop App<br/>0.041ms CPU inference<br/>no Python runtime]
See ARCHITECTURE.md for a full technical description.
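As a rough illustration of the gate-then-fuse idea described above, the following pure-Python sketch applies a temperature-scaled sigmoid gate to one group's hidden vector and then combines groups using softmax-normalized importance weights. The actual GatedFeatureBlock and FusionBlock live in model/model_utils.py; the function names, the gate formula, and the softmax weighting here are assumptions made for illustration and may not match the real layers.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_output(hidden, gate_logits, temperature=1.0):
    """Element-wise gate: hidden_i * sigmoid(gate_logit_i / temperature).

    Assumed form of a gate; a higher temperature flattens the gate
    toward 0.5, a lower one pushes it toward a hard 0/1 switch.
    """
    return [h * sigmoid(g / temperature) for h, g in zip(hidden, gate_logits)]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(group_vectors, importance_logits):
    """Scale each group's vector by its softmax weight, then concatenate."""
    weights = softmax(importance_logits)
    fused = []
    for w, vec in zip(weights, group_vectors):
        fused.extend(w * v for v in vec)
    return fused, weights


# Toy example with two small groups (hypothetical values and sizes).
g1 = gated_output([1.0, -2.0], [0.0, 0.0], temperature=2.079)
g2 = gated_output([0.5, 0.5, 0.5], [10.0, -10.0, 0.0], temperature=2.079)
fused, weights = fuse([g1, g2], importance_logits=[0.0, 0.0])
print(weights, fused)
```

With equal importance logits both groups receive weight 0.5; in a trained model the logits would be learned parameters, so the weights would differ per group.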
The exported ONNX model (model_artifacts/malware_model.onnx) already includes sigmoid activation and Platt scaling. It accepts a raw float32 feature vector of dimension 2,568 and returns a calibrated malware probability.
import numpy as np
import onnxruntime as ort
session = ort.InferenceSession(
"model_artifacts/malware_model.onnx",
providers=["CPUExecutionProvider"],
)
# features: np.ndarray of shape (1, 2568), dtype=float32
features = np.random.randn(1, 2568).astype(np.float32)
probability = session.run(
["probability"],
{"features": features},
)[0] # shape (1, 1)
print(f"Malware probability: {probability[0, 0]:.4f}")
A native egui desktop application lives in desktop_app/. It embeds the model binary at compile time, extracts EMBER features directly from a user-selected PE file, and displays a calibrated risk score with a visual progress bar. No Python or external runtime required at run time.
- Rust toolchain (stable): https://rustup.rs
- The trained ONNX model must be present at model_artifacts/malware_model.onnx before building; it is compiled into the binary via include_bytes!.
cd desktop_app
cargo build --release
The binary is written to desktop_app/target/release/desktop_app (Linux/macOS) or desktop_app\target\release\desktop_app.exe (Windows).
# Linux / macOS
./desktop_app/target/release/desktop_app
# Windows
desktop_app\target\release\desktop_app.exe
The app opens a file picker. Select any PE (.exe, .dll) and the model scores it instantly.
CI builds artifacts for all three platforms on every push to main. Download from the latest Actions run:
| Platform | Artifact |
|---|---|
| Linux x64 | desktop_app-linux-x64 |
| macOS x64 | desktop_app-macos-x64 |
| Windows x64 | desktop_app-windows-x64 |
Run the scripts in this order:
Data: Download the EMBER 2024 dataset from FutureComputing4AI/EMBER2024. Only the PE files (Win32, Win64, .NET) are required. After installing thrember, run:
import thrember
thrember.download_dataset("./ember2024data/", file_type="PE")
thrember.create_vectorized_features("./ember2024data/")
Pass the data directory via --data-dir to any script that requires it.
python hyperparameter_tune.py
python hyperparameter_tune.py --data-dir /path/to/data --n-trials 50
Uses Optuna to search over learning rate, focal loss parameters, gate temperature, dropout, and weight decay. Results are persisted to tuning.db so searches can be resumed across sessions. Best parameters are printed at the end and can be passed directly to train_export.py.
python train_export.py
python train_export.py --data-dir ./ember2024data/ --output-dir ./model_artifacts/
python train_export.py --lr 4e-3 --max-epochs 10
Trains the model, fits Platt scaling on the validation set, and exports a calibrated ONNX model to model_artifacts/. Key defaults (from the Optuna search):
| Hyperparameter | Value |
|---|---|
| Learning rate | 2.055e-3 |
| Focal loss γ | 3.761 |
| Focal loss α | 0.199 |
| Gate temperature | 2.079 |
| Dropout | 0.058 |
| Weight decay | 3.675e-5 |
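For readers unfamiliar with focal loss, the sketch below shows the standard binary formulation evaluated with the γ and α defaults from the table above. Whether the repository's FocalLoss (model/model_utils.py) applies α to the positive class in exactly this way is an assumption; the function name and the clamping constant are illustrative.

```python
import math


def binary_focal_loss(p, y, gamma=3.761, alpha=0.199, eps=1e-12):
    """Standard binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).

    p is the predicted probability of the positive (malware) class,
    y is the true label (0 or 1).
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    p_t = min(max(p_t, eps), 1.0)               # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct sample contributes far less than a hard one.
easy = binary_focal_loss(0.95, 1)
hard = binary_focal_loss(0.30, 1)
print(f"easy={easy:.6f} hard={hard:.6f}")
```

A large γ such as 3.761 strongly down-weights samples the model already classifies well, which concentrates the gradient on hard examples.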
The learning rate schedule uses a linear warmup followed by cosine annealing with η_min = 1e-5. Training checkpoints the best val_auc epoch.
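The shape of such a schedule can be sketched in a few lines. The version below assumes per-step updates, a warmup that ramps linearly up to the base learning rate, and hypothetical values for the number of warmup and total steps; none of these details are confirmed by the training script.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, base_lr=2.055e-3, eta_min=1e-5):
    """Linear warmup to base_lr, then cosine annealing down to eta_min."""
    if step < warmup_steps:
        # Ramp linearly; reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the annealing phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run: 1000 steps in total, of which the first 100 are warmup.
schedule = [lr_at_step(s, total_steps=1000, warmup_steps=100) for s in range(1001)]
print(schedule[0], schedule[99], schedule[-1])
```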
python eval.py
python eval.py --state-dict ./model_artifacts/malware_net_calibrated_state_dict.pt
python eval.py --no-plots
Loads the calibrated state dict, runs inference on the held-out test set, and prints ECE, AUC-ROC, TPR at fixed FPR thresholds, and a full classification report. Saves evaluation plots to --output-dir (default: model_artifacts/) unless --no-plots is passed.
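ECE and TPR at a fixed FPR are simple enough to compute by hand, which can help when sanity-checking the printed numbers. The sketch below uses equal-width probability bins for ECE and a threshold taken from the sorted benign scores for TPR. The bin count, function names, and tie handling are assumptions, and eval.py may compute these metrics differently.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between predicted probability and observed positive rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_p = sum(p for p, _ in b) / len(b)
        frac_pos = sum(y for _, y in b) / len(b)
        ece += len(b) / len(probs) * abs(mean_p - frac_pos)
    return ece


def tpr_at_fpr(probs, labels, target_fpr):
    """TPR when the threshold admits at most target_fpr of benign samples."""
    neg = sorted((p for p, y in zip(probs, labels) if y == 0), reverse=True)
    pos = [p for p, y in zip(probs, labels) if y == 1]
    allowed = int(target_fpr * len(neg))  # false positives we may tolerate
    threshold = neg[allowed] if allowed < len(neg) else float("-inf")
    # Flag a sample as malware only if it scores strictly above the threshold.
    return sum(p > threshold for p in pos) / len(pos)


# Tiny hand-made example: labels 0 = benign, 1 = malware.
probs = [0.1, 0.7, 0.6, 0.9]
labels = [0, 0, 1, 1]
ece = expected_calibration_error(probs, labels)
tpr_strict = tpr_at_fpr(probs, labels, target_fpr=0.0)
tpr_loose = tpr_at_fpr(probs, labels, target_fpr=0.5)
print(ece, tpr_strict, tpr_loose)
```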
python attack.py
python attack.py --state-dict ./model_artifacts/malware_net_calibrated_state_dict.pt \
--data-dir ./ember2024data/ --num-samples 10000
Evaluates against FGSM and PGD attacks at ε ∈ {0.001, 0.005, 0.01, 0.05, 0.1} using the Adversarial Robustness Toolbox. Results are written to model_artifacts/adversarial_robustness_report.txt. Under PGD (10 steps, ε=0.1), AUC degradation is < 0.02% relative to the clean baseline. See ARCHITECTURE.md for why the gating mechanism structurally limits adversarial transferability.
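To make the two attacks concrete, here is a minimal sketch of FGSM and PGD under an L∞ budget, applied to a toy logistic-regression scorer standing in for MalwareNet. This is not the Adversarial Robustness Toolbox implementation; the toy weights, the function names, and the PGD step size of ε/4 are all assumptions chosen for illustration.

```python
import math


def score(x, weights, bias):
    """Toy stand-in for the model: logistic regression probability."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def sign(g):
    return (g > 0) - (g < 0)


def fgsm(x, y, weights, bias, epsilon):
    """One step of size epsilon along the sign of the loss gradient."""
    p = score(x, weights, bias)
    # For binary cross-entropy on a logistic model, dLoss/dx_i = (p - y) * w_i.
    grad = [(p - y) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


def pgd(x, y, weights, bias, epsilon, steps=10, step_size=None):
    """Repeated small FGSM steps, projected back into the epsilon-ball around x."""
    step = epsilon / 4 if step_size is None else step_size  # assumed step size
    x_adv = list(x)
    for _ in range(steps):
        x_adv = fgsm(x_adv, y, weights, bias, step)
        x_adv = [min(max(a, xi - epsilon), xi + epsilon) for a, xi in zip(x_adv, x)]
    return x_adv


weights, bias = [2.0, -1.0, 0.5], 0.0   # hypothetical toy model
x, y = [1.0, 0.5, -0.2], 1              # a "malware" sample
x_fgsm = fgsm(x, y, weights, bias, epsilon=0.1)
x_pgd = pgd(x, y, weights, bias, epsilon=0.1)
print(score(x, weights, bias), score(x_fgsm, weights, bias), score(x_pgd, weights, bias))
```

Both attacks perturb features directly, so they measure feature-space robustness only; a perturbed vector need not correspond to any valid PE file.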
To compare the best dense model against the best Lottery Ticket sparse model with the same FGSM/PGD protocol:
python attack_best_models.py
python attack_best_models.py --regular-state-dict ./model_artifacts/malware_net_calibrated_state_dict.pt \
--lth-state-dict ./model_artifacts/lth_best_ticket_iter04_auc0.9904_sp0.48_calibrated_state_dict.pt
This script auto-discovers the best dense export and the highest-AUC LTH export if explicit paths are not provided. It writes:
- model_artifacts/adversarial_robustness_best_vs_lth_report.txt
- model_artifacts/adversarial_robustness_best_vs_lth_curves.png
- model_artifacts/adversarial_robustness_best_vs_lth_deltas.png
python benchmark.py
No arguments. Runs against model_artifacts/malware_model.onnx with 2,000 warmup iterations followed by 20,000 timed single-file inferences, then reports mean, P95, and P99 latency and throughput.
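The measurement pattern (warm up, time each call individually, then summarize) can be sketched as follows. The dummy workload, the nearest-rank percentile method, and deriving throughput from mean latency are assumptions; benchmark.py may differ in all three.

```python
import math
import time


def percentile(sorted_values, q):
    """Nearest-rank percentile of an ascending list, q in (0, 100]."""
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def benchmark(fn, warmup=2000, iterations=20000):
    for _ in range(warmup):          # let caches and allocators settle
        fn()
    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1e3)
    latencies_ms.sort()
    mean_ms = sum(latencies_ms) / len(latencies_ms)
    return {
        "mean_ms": mean_ms,
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "throughput_per_s": 1000.0 / mean_ms,  # sequential, single-threaded
    }


# Dummy workload in place of an ONNX session.run call; smaller counts for a quick demo.
stats = benchmark(lambda: sum(range(100)), warmup=200, iterations=2000)
print(stats)
```

Under this definition, the reported 0.041 ms mean latency corresponds to 1000 / 0.041 ≈ 24,390 files per second, which is consistent with the ~24,400 figure in the metrics table.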
python lth_malwarenet.py
Runs iterative magnitude pruning starting from the best dense checkpoint and rewinds surviving weights between rounds. The 2026-04-13 run used 10 pruning iterations with 5 fine-tuning epochs per iteration, pruning 20% of MLP/fusion weights and 10% of gate/head weights per round.
Best result from the latest run:
| Iteration | Val AUC | Sparsity |
|---|---|---|
| 4 | 0.9904 | 48.18% |
The best sparse ticket slightly exceeds the dense baseline (0.9899) while removing nearly half of all weights. Artifacts are written to lth_artifacts/, and the exported best checkpoint is model_artifacts/lth_best_ticket_iter04_auc0.9904_sp0.48_calibrated_state_dict.pt.
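The prune-and-rewind loop can be sketched on a flat list of weights. This is a simplified illustration under several assumptions: a single parameter group pruned at 20% per round, a fine_tune placeholder that does no training, and rewinding to the starting weights. The real script handles multiple groups with different rates and actual fine-tuning epochs.

```python
import random


def prune_by_magnitude(weights, mask, fraction):
    """Mask out the smallest-magnitude `fraction` of the still-surviving weights."""
    alive = [i for i, m in enumerate(mask) if m]
    n_prune = int(len(alive) * fraction)
    to_prune = sorted(alive, key=lambda i: abs(weights[i]))[:n_prune]
    new_mask = list(mask)
    for i in to_prune:
        new_mask[i] = 0
    return new_mask


def rewind(rewind_weights, mask):
    """Reset survivors to their rewind values; pruned weights stay at zero."""
    return [w * m for w, m in zip(rewind_weights, mask)]


def fine_tune(weights):
    """Placeholder for the fine-tuning epochs (hypothetical; returns input unchanged)."""
    return list(weights)


def sparsity(mask):
    return 1.0 - sum(mask) / len(mask)


random.seed(0)
initial = [random.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in dense checkpoint
mask = [1] * len(initial)
weights = list(initial)
for _ in range(4):
    trained = fine_tune(weights)
    mask = prune_by_magnitude(trained, mask, fraction=0.20)
    weights = rewind(initial, mask)
print(f"sparsity after 4 rounds: {sparsity(mask):.2%}")
```

Pruning a fixed fraction of survivors compounds: after k rounds at rate r, sparsity is roughly 1 − (1 − r)^k, which gives about 59% at r = 0.20 and about 34% at r = 0.10 for k = 4. If iteration 4 corresponds to four completed rounds, the reported 48.18% overall sparsity falls between these two values, as expected for a mix of both parameter groups.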
python pe_mutate_lief.py input.exe output.exe
This script is a small proof of concept for problem-space malware-ML experiments. It uses LIEF to mutate a PE directly, re-extracts EMBER features with thrember, scores each candidate with the exported ONNX model, and greedily keeps mutations that improve the requested objective. The current candidate families are intentionally simple:
- add a section
- rename a section
- add or edit CodeView/PDB debug metadata
By default it tries to decrease the malware probability. This is not a full feasible-attack framework. A real problem-space attack still needs stronger semantics-preserving mutation policies and a functionality oracle.
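The greedy search strategy itself is independent of LIEF and can be sketched with stand-ins. In the example below the "sample" is a plain dictionary, the mutations are toy functions, and toy_score replaces the mutate, re-extract, and ONNX-scoring pipeline; every name here is hypothetical and chosen only to show the control flow.

```python
def greedy_mutate(sample, mutations, score, max_rounds=5):
    """Keep any mutation that lowers the score; stop when a full round yields no gain."""
    best, best_score = sample, score(sample)
    for _ in range(max_rounds):
        improved = False
        for mutate in mutations:
            candidate = mutate(best)
            candidate_score = score(candidate)
            if candidate_score < best_score:  # objective: decrease malware probability
                best, best_score, improved = candidate, candidate_score, True
        if not improved:
            break
    return best, best_score


# Toy stand-ins for PE mutations (the real script edits the file with LIEF).
def add_section(s):
    return {**s, "sections": s["sections"] + 1}


def add_pdb(s):
    return {**s, "pdb": True}


def toy_score(s):
    """Hypothetical scorer; the real one runs feature extraction and the ONNX model."""
    return 0.9 - 0.05 * min(s["sections"], 5) - (0.2 if s["pdb"] else 0.0)


start = {"sections": 3, "pdb": False}
best, best_score = greedy_mutate(start, [add_section, add_pdb], toy_score)
print(best, best_score)
```

Greedy acceptance is cheap but can stall in local minima, and nothing in this loop checks that a mutated file still runs, which is why a functionality oracle is needed for a realistic attack.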
.
├── model/
│ ├── model.py # MalwareNet + MalwareNetLightning
│ ├── model_utils.py # GatedFeatureBlock, FusionBlock, FocalLoss
│ ├── dims.py # EMBER v3 feature dimensions and group mapping
│ ├── dataset.py # MemmapDataset, EmberMemmapDataModule
│ ├── calibration.py # Platt scaling (LBFGS)
│ ├── export.py # ONNX export + validation utilities
│ ├── attacks.py # ART-based FGSM/PGD evaluation
│ ├── evaluation.py # ROC, calibration, classification metrics
│ ├── train.py # Callback and Trainer factory functions
│ └── seed.py # RNG seeding utilities
├── model_artifacts/ # Saved checkpoints, ONNX model, eval outputs
├── desktop_app/ # Rust egui desktop application
├── train_export.py # Train → calibrate → export pipeline
├── eval.py # Test-set evaluation
├── attack.py # Adversarial robustness evaluation
├── attack_best_models.py # Dense vs LTH adversarial robustness comparison
├── benchmark.py # ONNX inference speed benchmark
├── lth_malwarenet.py # Lottery Ticket Hypothesis pruning experiment
├── lth_artifacts/ # Per-iteration sparse checkpoints and logs
├── hyperparameter_tune.py # Optuna hyperparameter search
└── ARCHITECTURE.md # Full technical architecture document
Core ML stack: torch, pytorch-lightning, torchmetrics, onnx, onnxruntime, optuna, adversarial-robustness-toolbox.
Feature extraction: thrember (EMBER v3 feature extractor — installed from FutureComputing4AI/EMBER2024).
Install Python dependencies:
pip install -r requirements.txt