Malware Analysis Automation Pipeline

A production-quality Python framework for automating malware analysis by integrating multiple external sandbox APIs, normalizing results, and generating comprehensive intelligence reports.

Features

Multi-Sandbox Integration: Parallel submission to Hybrid Analysis, VirusTotal, and Triage
Unified Schema: Normalizes responses from different APIs into a canonical data model
Report Merging: Combines results from multiple sources, deduplicating IOCs and voting on a consensus verdict
IOC Intelligence: Extracts and deduplicates Indicators of Compromise
MITRE ATT&CK Mapping: Correlates observed behaviors with MITRE techniques
Professional Reports: Generates Bootstrap-styled HTML and structured JSON reports
Resilient: Handles API failures, timeouts, and network errors gracefully
Extensible: Plugin-style architecture for adding new sandbox integrations
Proper Logging: Clean, structured logging for debugging and monitoring

Architecture

ENTRY LAYER
└── analyzer.py (CLI orchestrator)

API LAYER
├── clients/
│   ├── base.py (abstract interface)
│   ├── hybrid_analysis.py (HybridAnalysisClient)
│   ├── virustotal.py (VirusTotalClient)
│   └── triage.py (TriageClient - optional)

NORMALIZATION LAYER
├── normalizers/
│   ├── base.py (BaseNormalizer)
│   ├── hybrid_analysis.py
│   ├── virustotal.py
│   └── triage.py

CORE SCHEMA
└── schema.py (UnifiedReport, IOC, MergedReport, etc.)

MERGER & CORRELATION
├── merger.py (combines multiple reports)
└── correlator/
    ├── correlation_engine.py (enrichment)
    └── ioc_extractor.py (IOC extraction)

REPORTING
├── report_generator/
│   ├── html_generator.py (Bootstrap HTML reports)
│   ├── json_export.py (Structured JSON)
│   └── templates/ (HTML templates)

ARTIFACTS
└── artifacts/{SHA256}/ (raw API responses, final reports)
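
The artifacts/{SHA256}/ layout implies that each sample is hashed before any results are stored. A minimal sketch of that step follows; the function names `file_hashes` and `artifact_dir` are hypothetical and are not taken from the project's code.

```python
import hashlib
from pathlib import Path


def file_hashes(path, chunk_size=1024 * 1024):
    """Return MD5, SHA1 and SHA256 hex digests, reading the file in chunks."""
    digests = {
        "md5": hashlib.md5(),
        "sha1": hashlib.sha1(),
        "sha256": hashlib.sha256(),
    }
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            for digest in digests.values():
                digest.update(chunk)
    return {name: digest.hexdigest() for name, digest in digests.items()}


def artifact_dir(root, sha256):
    """Build the per-sample output path (hypothetical helper); does not create it."""
    return Path(root) / sha256
```

Reading in chunks keeps memory use flat for large samples, and keying the directory on SHA256 means repeated runs on the same file land in the same place.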

Installation

1. Clone/Set Up Project

cd SOAR

2. Install Dependencies

pip install -r requirements.txt

3. Get API Keys

Hybrid Analysis: https://hybrid-analysis.com/api
VirusTotal: https://www.virustotal.com/gui/my-apikey
Triage (optional): https://triage.com/

4. Configure Environment

cp .env.example .env
# Edit .env and add your API keys
nano .env

5. Verify Setup

python -c "from analyzer import MalwareAnalyzer; MalwareAnalyzer(); print('✓ Setup OK')"

Usage

Basic Analysis

python analyzer.py samples/malware.exe

Custom Output Directory

python analyzer.py samples/malware.exe --output results/

Verbose Logging

python analyzer.py samples/malware.exe --verbose

Advanced Example

python analyzer.py samples/malware.exe \
  --output artifacts/analysis_2024 \
  --verbose

Output

Analysis generates two reports in the output directory:

HTML Report (`report.html`)

Executive summary with risk score
File hashes (MD5, SHA1, SHA256)
Verdict from all sandbox sources
IOC table (IPs, domains, URLs, hashes)
Network activity timeline
Process execution details
MITRE ATT&CK technique mapping
AV engine detections
Bootstrap 5 styling

JSON Report (`report.json`)

Structured machine-readable data with:

File metadata
Verdict analysis by source
Risk assessment
Complete IOC list
Network activity details
Process tree
File operations
MITRE techniques
Consolidated metadata

Project Structure

SOAR/
├── analyzer.py                    # Main entry point
├── schema.py                      # Unified data models
├── merger.py                      # Report merging logic
│
├── clients/                       # Sandbox API clients
│   ├── base.py
│   ├── hybrid_analysis.py
│   ├── virustotal.py
│   └── triage.py
│
├── normalizers/                   # API response normalization
│   ├── base.py
│   ├── hybrid_analysis.py
│   ├── virustotal.py
│   └── triage.py
│
├── correlator/                    # Intelligence correlation
│   ├── correlation_engine.py      # Engine & MITRE mapping
│   └── ioc_extractor.py           # IOC extraction
│
├── report_generator/              # Report generation
│   ├── html_generator.py          # HTML reports
│   └── json_export.py             # JSON export
│
├── artifacts/                     # Analysis output
│
├── requirements.txt               # Dependencies
├── .env.example                   # Configuration template
└── README.md                      # This file

Key Classes & Components

MalwareAnalyzer (analyzer.py)

Main orchestrator that:

Initializes all sandbox clients
Submits files in parallel
Merges reports
Generates output

from analyzer import MalwareAnalyzer

analyzer = MalwareAnalyzer()
output_files = analyzer.analyze("samples/malware.exe", "results/")

UnifiedReport (schema.py)

Canonical report format containing:

File metadata (hashes, size, type)
Analysis results (verdict, detections)
Behavioral data (processes, network, files)
IOCs and MITRE techniques
Risk assessment

MergedReport (schema.py)

Consolidated report from multiple sources:

Deduplicates IOCs
Merges network activity
Consensus verdict voting
Weighted risk scoring

Normalizers

Convert sandbox-specific formats to UnifiedReport:

HybridAnalysisNormalizer: Converts Hybrid Analysis API responses
VirusTotalNormalizer: Converts VirusTotal API responses
TriageNormalizer: Converts Triage API responses

ReportMerger (merger.py)

Combines multiple reports using:

IOC deduplication (by type+value)
Process merging (by name+PID)
Network activity merging
Consensus verdict determination
Weighted risk score calculation
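
The merging rules listed above could be sketched roughly as follows. This is an illustration only, not the logic in merger.py: the `SketchIOC` class, the verdict labels, the tie-breaking rule, and the per-source weights are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SketchIOC:
    """Hypothetical stand-in for the IOC model in schema.py."""
    ioc_type: str
    value: str
    sources: list = field(default_factory=list)
    confidence: float = 0.0


def dedupe_iocs(iocs):
    """Merge IOCs sharing (type, value): union the sources, keep the highest confidence."""
    merged = {}
    for ioc in iocs:
        key = (ioc.ioc_type, ioc.value.strip())
        if key not in merged:
            merged[key] = SketchIOC(ioc.ioc_type, key[1], list(ioc.sources), ioc.confidence)
            continue
        kept = merged[key]
        for src in ioc.sources:
            if src not in kept.sources:
                kept.sources.append(src)
        kept.confidence = max(kept.confidence, ioc.confidence)
    return list(merged.values())


def consensus_verdict(verdicts):
    """Majority vote; ties go to the more severe verdict (an assumed policy)."""
    severity = {"clean": 0, "unknown": 1, "suspicious": 2, "malicious": 3}
    if not verdicts:
        return "unknown"
    counts = Counter(verdicts)
    return max(counts, key=lambda v: (counts[v], severity.get(v, 1)))


def weighted_risk(scores, weights):
    """Weighted mean of per-source risk scores; unknown sources get weight 1.0."""
    total = sum(weights.get(src, 1.0) for src in scores)
    if total == 0:
        return 0.0
    return sum(score * weights.get(src, 1.0) for src, score in scores.items()) / total
```

Resolving ties toward the more severe verdict is a conservative choice for triage; a deployment that prefers fewer false positives might break ties the other way.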

CorrelationEngine (correlator/correlation_engine.py)

Enriches merged reports:

Extracts additional IOCs from behavior
Maps processes to MITRE ATT&CK techniques
Performs threat intelligence enrichment
Builds attack narratives
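
As an illustration of process-to-technique mapping, the sketch below uses a small lookup table keyed on process name. The table and the function name `map_processes_to_techniques` are hypothetical; the real engine may use different or richer rules (command-line arguments, parent/child relationships, and so on).

```python
# Illustrative subset only; technique IDs follow MITRE ATT&CK for Enterprise.
MITRE_RULES = {
    "powershell.exe": ("T1059.001", "Command and Scripting Interpreter: PowerShell"),
    "cmd.exe": ("T1059.003", "Command and Scripting Interpreter: Windows Command Shell"),
    "rundll32.exe": ("T1218.011", "System Binary Proxy Execution: Rundll32"),
    "regsvr32.exe": ("T1218.010", "System Binary Proxy Execution: Regsvr32"),
    "mshta.exe": ("T1218.005", "System Binary Proxy Execution: Mshta"),
    "schtasks.exe": ("T1053.005", "Scheduled Task/Job: Scheduled Task"),
}


def map_processes_to_techniques(process_names):
    """Return sorted, de-duplicated (technique_id, name) pairs for known process names."""
    seen = {}
    for name in process_names:
        rule = MITRE_RULES.get(name.strip().lower())
        if rule and rule[0] not in seen:
            seen[rule[0]] = rule[1]
    return sorted(seen.items())
```

Name-only matching is cheap but easy to evade by renaming binaries, so a mapping like this is best treated as a hint rather than a confirmed detection.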

Data Models

IOC (Indicator of Compromise)

IOC(
    ioc_type=IOCType.DOMAIN,  # enum: IP, DOMAIN, URL, MD5, SHA1, SHA256, etc.
    value="malicious.com",
    source=[SandboxSource.HYBRID_ANALYSIS, SandboxSource.VIRUSTOTAL],
    confidence=0.95,           # 0.0 - 1.0
    metadata={}
)

NetworkActivity

NetworkActivity(
    protocol="https",          # tcp, udp, dns, http, https
    direction="outbound",
    destination_ip="192.0.2.1",
    destination_port=443,
    domain="c2.malicious.com",
    url="https://c2.malicious.com/beacon",
    user_agent="Mozilla/5.0..."
)

Process

Process(
    name="malware.exe",
    pid=1234,
    parent_pid=456,
    command_line="malware.exe /c evil.ps1",
    user="admin"
)

Configuration

Environment Variables (.env)

# API Keys (Triage is optional)
HYBRID_ANALYSIS_API_KEY=xxx
VIRUSTOTAL_API_KEY=yyy
TRIAGE_API_KEY=zzz

# Optional Settings
API_TIMEOUT=30              # seconds
MAX_WORKERS=3               # parallel submissions
LOG_LEVEL=INFO
OUTPUT_DIR=artifacts
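
A minimal sketch of how these variables might be read, assuming the .env file has already been loaded into the process environment. The `Settings` class and `load_settings` function are hypothetical names; the defaults mirror the values shown above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values in .env."""
    api_keys: dict
    api_timeout: int
    max_workers: int
    log_level: str
    output_dir: str


def load_settings(env=None):
    """Read settings from a mapping (defaults to os.environ), applying the documented defaults."""
    env = os.environ if env is None else env
    key_vars = {
        "hybrid_analysis": "HYBRID_ANALYSIS_API_KEY",
        "virustotal": "VIRUSTOTAL_API_KEY",
        "triage": "TRIAGE_API_KEY",
    }
    # Keep only the keys that are actually set and non-empty.
    api_keys = {
        name: env[var].strip()
        for name, var in key_vars.items()
        if env.get(var, "").strip()
    }
    if not api_keys:
        raise RuntimeError("No sandbox clients initialized. Check API keys in .env")
    return Settings(
        api_keys=api_keys,
        api_timeout=int(env.get("API_TIMEOUT", "30")),
        max_workers=int(env.get("MAX_WORKERS", "3")),
        log_level=env.get("LOG_LEVEL", "INFO"),
        output_dir=env.get("OUTPUT_DIR", "artifacts"),
    )
```

Passing the mapping in as an argument makes the loader easy to exercise in tests without touching the real environment.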

Error Handling

The pipeline is designed to be resilient:

Partial Failures: If one sandbox fails, analysis continues with others
Retry Logic: Automatic retries with exponential backoff
Rate Limiting: Handles API rate limits gracefully
Timeouts: Configurable timeouts with warnings
Graceful Degradation: Works with partial data

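from analyzer import MalwareAnalyzer

analyzer = MalwareAnalyzer()
try:
    analyzer.analyze("malware.exe")
except FileNotFoundError:
    # Raised when the sample path does not exist
    print("File not found")
except RuntimeError as e:
    # Raised when the analysis cannot proceed (e.g. no sandbox clients)
    print(f"Analysis failed: {e}")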
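
The retry behavior mentioned in the list above could look something like the sketch below. It is a generic exponential-backoff helper, not the project's implementation; `call_with_retries`, its defaults, and the choice of retryable exceptions are assumptions.

```python
import random
import time


def call_with_retries(func, max_attempts=4, base_delay=1.0, max_delay=30.0,
                      retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(); on a retryable error, wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter is injectable so the helper can be tested without real waiting. With an HTTP library such as Requests, `retry_on` would typically be widened to that library's own exception types.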

Extending the Pipeline

Add a New Sandbox

Create new client in clients/newsandbox.py:

from clients.base import BaseSandboxClient

class NewSandboxClient(BaseSandboxClient):
    def submit_file(self, file_path):
        # Implementation
        pass
    # ... implement other methods

Create normalizer in normalizers/newsandbox.py:

from normalizers.base import BaseNormalizer

class NewSandboxNormalizer(BaseNormalizer):
    def normalize(self, raw_response):
        # Convert to UnifiedReport
        pass

Register in analyzer.py:

from clients.newsandbox import NewSandboxClient
# In _initialize_clients():
if newsandbox_key:
    self.clients[SandboxSource.NEWSANDBOX] = NewSandboxClient(...)

Development

Running Tests

pytest tests/ -v --cov

Code Quality

# Format code
black .

# Lint
flake8 .

# Type checking
mypy .

# Security scan
bandit -r .

Performance

Parallel Submissions: Uses ThreadPoolExecutor for concurrent API calls
Typical Analysis Time: 2-5 minutes (depends on sandbox queue)
Memory Usage: ~50-100MB per analysis
Report Generation: <1 second
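
For illustration, concurrent submission with ThreadPoolExecutor might be structured as below. This is a sketch under assumptions: `submit_to_all` is a hypothetical helper, and each client is assumed to expose a `submit_file(file_path)` method as in the client example in this README.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def submit_to_all(clients, file_path, max_workers=3):
    """Run every client's submit_file concurrently; return (results, errors) keyed by name."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(client.submit_file, file_path): name
            for name, client in clients.items()
        }
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # One failing sandbox must not abort the others.
                errors[name] = exc
    return results, errors
```

Collecting errors per source instead of raising lets the caller continue with whatever sandboxes succeeded, which matches the partial-failure behavior described under Error Handling.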

Troubleshooting

No API Keys Found

RuntimeError: No sandbox clients initialized. Check API keys in .env

Solution: Ensure .env file exists and contains valid API keys.

Analysis Timeout

WARNING: {sandbox} analysis timeout

Solution: Increase API_TIMEOUT in .env or use --verbose to debug.

Network Errors

RequestException: Connection refused

Solution: Check internet connection and API endpoint availability.

Security Considerations

API Key Storage: Never commit .env with real keys
Input Validation: File paths are validated before submission
Output Sanitization: Reports are HTML-escaped to prevent XSS
Logging: Sensitive data is not logged
Rate Limiting: Respects API rate limits
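
To show why output escaping matters, here is a small sketch using the standard library's `html.escape`. The `render_ioc_row` helper is hypothetical; the project's templates may rely on template-engine autoescaping instead.

```python
from html import escape


def render_ioc_row(ioc_type, value):
    """Render one IOC table row, escaping both fields so sample-controlled text stays inert."""
    return f"<tr><td>{escape(ioc_type)}</td><td>{escape(value)}</td></tr>"
```

IOC values such as URLs and user agents come from the analyzed sample, so they should be treated as untrusted input whenever they are written into HTML.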

Limitations & Caveats

No actual sandbox environment (uses external APIs only)
Dependent on third-party sandbox availability
Analysis quality varies by sandbox sophistication
Large files may not be supported by all sandboxes
Real-time API failures affect analysis

Future Enhancements

Caching layer for duplicate file hashes
Machine learning-based risk scoring
Real-time alerting integration
Database backend for historical analysis
Web UI dashboard
CLI configuration file support
Batch analysis mode
Integration with SIEM platforms

License

This project is provided as-is for security research and analysis purposes.

Contact & Support

For issues or questions, refer to the architecture documentation and code comments within each module.

Built with: Python 3.10+, Requests, Jinja2, Bootstrap 5, JSON

Version: 1.0.0
Last Updated: 2024
