Evidence-Contract Autonomous IR Agent — the agent that structurally cannot lie
SANS FIND EVIL! Hackathon submission — an autonomous incident response agent powered by Claude Code and SIFT Workstation, where every finding is cryptographically bound to the evidence that produced it.
FinDevil is an autonomous incident response agent that reasons through compromised Windows endpoints without a human in the loop. It connects Claude Code (the planning and reasoning engine) to SIFT Workstation forensic tools running inside a network-isolated Docker container, bridging them via the Model Context Protocol (MCP).
The core differentiator is not speed or coverage — it is architectural hallucination prevention through evidence contracts. Every claim the agent makes must carry a citation to a raw artifact (EVTX record, registry key, memory page, filesystem entry). The MCP server's Contract Compiler rejects any finding that cannot be traced back to a hash-verified source. The agent is structurally incapable of asserting something it did not observe.
graph TD
CC["Claude Code\n(Planning + Reasoning)"]
MCP["Investigation Runtime MCP Server\n(Python / FastMCP)"]
EV["Evidence Vault\n(read-only · hash integrity)"]
TG["Tool Gateway\n(Docker SIFT exec + security)"]
COMP["Contract Compiler\n(reject uncited claims)"]
SCE["Self-Correction Engine\n(auto-downgrade)"]
AL["Audit Ledger\n(hash-chained JSONL)"]
SIFT["SIFT Docker Container\n(network_mode: none)"]
CC <-->|MCP Protocol| MCP
MCP --- EV
MCP --- TG
MCP --- COMP
MCP --- SCE
MCP --- AL
TG <-->|docker exec| SIFT
- Claude Code receives a case brief and calls MCP tools to form an investigation plan.
- The Tool Gateway translates each tool call into a docker exec command against the SIFT container (Hayabusa, Volatility, Sleuth Kit, Plaso, Zimmerman tools, YARA).
- Raw output is ingested into the Evidence Vault, which computes a SHA-256 hash and stores the artifact as read-only.
- Claude Code drafts a finding. The Contract Compiler checks every assertion for a vault citation; uncited claims are rejected before they leave the MCP boundary.
- The Self-Correction Engine compares new evidence against prior findings and issues corrections when confidence must be downgraded.
- Every action — tool call, evidence ingestion, finding, correction — is appended to the Audit Ledger as a hash-chained JSONL entry.
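The ingestion step can be pictured with a small content-addressed store. This is only a sketch under assumptions: the `EvidenceVault` class, its `ingest` and `verify` methods, and the one-file-per-digest layout are hypothetical names chosen for illustration, not the project's actual vault code.

```python
import hashlib
import stat
import tempfile
from pathlib import Path


class EvidenceVault:
    """Hypothetical content-addressed store: one read-only file per artifact."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def ingest(self, raw: bytes) -> str:
        """Store raw tool output under its SHA-256 digest and return the digest."""
        digest = hashlib.sha256(raw).hexdigest()
        path = self.root / digest
        if not path.exists():  # never overwrite an artifact that is already stored
            path.write_bytes(raw)
            # Drop all write bits so the stored copy is read-only (mode 0o444).
            path.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
        return digest

    def verify(self, digest: str) -> bool:
        """Re-hash the stored bytes and compare with the digest they are filed under."""
        path = self.root / digest
        if not path.exists():
            return False
        return hashlib.sha256(path.read_bytes()).hexdigest() == digest


# Assumed sample output line; real tool output would be much larger.
vault = EvidenceVault(Path(tempfile.mkdtemp()) / "vault")
artifact_id = vault.ingest(b"EventID=4624 LogonType=3")
print(artifact_id[:12], vault.verify(artifact_id))
```

Filing each artifact under its own digest means a citation and an integrity check can use the same identifier: a finding that cites a digest can be re-verified later by hashing the stored bytes again.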
The agent does not just append findings; it revisits them. Three correction classes are enforced automatically:
| Type | Trigger | Effect |
|---|---|---|
| DOWNGRADE | Contradicting evidence arrives | Confidence reduced; prior entry flagged |
| RETRACT | Finding cannot survive contract re-evaluation | Removed from live report; ledger entry preserved |
| AMEND | Scope of an artifact expands | Finding updated; diff recorded |
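The three correction classes could be modeled roughly as follows. This is an illustrative sketch, not the Self-Correction Engine itself: the `Finding` fields, the 0.0 to 1.0 confidence scale, and the `apply_correction` helper are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class CorrectionType(Enum):
    DOWNGRADE = "downgrade"
    RETRACT = "retract"
    AMEND = "amend"


@dataclass
class Finding:
    finding_id: str
    statement: str
    confidence: float          # assumed 0.0-1.0 scale
    live: bool = True          # False once removed from the live report
    flagged: bool = False      # True once contradicting evidence has arrived
    history: list[str] = field(default_factory=list)  # stands in for ledger entries


def apply_correction(
    finding: Finding,
    kind: CorrectionType,
    *,
    new_confidence: float | None = None,
    new_statement: str | None = None,
) -> Finding:
    """Apply one correction in place and record what changed."""
    if kind is CorrectionType.DOWNGRADE:
        if new_confidence is None or new_confidence >= finding.confidence:
            raise ValueError("DOWNGRADE must lower the confidence")
        finding.history.append(f"DOWNGRADE {finding.confidence} -> {new_confidence}")
        finding.confidence = new_confidence
        finding.flagged = True
    elif kind is CorrectionType.RETRACT:
        # The finding leaves the live report, but its history is kept.
        finding.history.append("RETRACT")
        finding.live = False
    elif kind is CorrectionType.AMEND:
        if new_statement is None:
            raise ValueError("AMEND needs the updated statement")
        finding.history.append(f"AMEND {finding.statement!r} -> {new_statement!r}")
        finding.statement = new_statement
    return finding


f = Finding("F-1", "PsExec service installed on HOST-A", confidence=0.9)
apply_correction(f, CorrectionType.DOWNGRADE, new_confidence=0.4)
apply_correction(f, CorrectionType.RETRACT)
print(f.live, f.flagged, f.confidence, f.history)
```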
Every Finding object is a Pydantic model that requires one or more EvidenceRef objects linking it to a vault artifact by hash and byte offset. The Contract Compiler enforces this at write time through model validation in Python code, not through a prompt instruction. A prompt-injection attack can change what the model says, but it cannot switch off the check: a finding without valid references fails validation regardless of the prompt.
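The shape of that rule can be sketched without any dependencies. Note the substitution: the project uses Pydantic, while this example uses standard-library dataclasses with `__post_init__` checks so it runs anywhere. The field names (`artifact_sha256`, `byte_offset`, `claim`, `evidence`) are assumptions, not the project's actual schema.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

_SHA256_HEX = re.compile(r"[0-9a-f]{64}")


@dataclass(frozen=True)
class EvidenceRef:
    artifact_sha256: str  # digest of the vault artifact being cited
    byte_offset: int      # where in that artifact the cited record starts

    def __post_init__(self) -> None:
        if not _SHA256_HEX.fullmatch(self.artifact_sha256):
            raise ValueError("artifact_sha256 must be 64 lowercase hex characters")
        if self.byte_offset < 0:
            raise ValueError("byte_offset must be non-negative")


@dataclass(frozen=True)
class Finding:
    claim: str
    evidence: tuple[EvidenceRef, ...]

    def __post_init__(self) -> None:
        # The object cannot be constructed at all without a citation.
        if not self.evidence:
            raise ValueError("a Finding needs at least one EvidenceRef")


ref = EvidenceRef(artifact_sha256="ab" * 32, byte_offset=4096)
ok = Finding(claim="Scheduled task created for persistence", evidence=(ref,))

try:
    Finding(claim="Attacker used Mimikatz", evidence=())
except ValueError as exc:
    rejected = str(exc)
    print("rejected:", rejected)
```

Because the check lives in the constructor, there is no code path that yields an uncited `Finding` instance. A check of this kind only proves that a citation is present and well-formed; confirming that the cited artifact exists is a separate lookup against the vault.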
FinDevil ships tool wrappers for the full SIFT toolchain against Windows artifacts:
| Artifact type | Tools |
|---|---|
| EVTX event logs | Hayabusa (Sigma rules), native EVTX parser |
| Registry hives | Zimmerman RECmd, Registry Explorer |
| Memory images | Volatility 3 (pslist, netscan, malfind, cmdline) |
| Filesystem / timeline | Sleuth Kit, Plaso / log2timeline |
| Lateral movement / YARA | Custom YARA rule set, Hayabusa lateral-movement profile |
Security is not enforced through system-prompt instructions. It is enforced through the runtime:
- The SIFT container runs with network_mode: none, leaving no exfiltration path.
- Evidence is mounted read-only; the agent can read artifacts but cannot modify them.
- The Tool Gateway maintains a registry of allowed binaries; arbitrary shell execution is not possible.
- The Evidence Vault is append-only; hash verification fails if any artifact is tampered with after ingestion.
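One way a binary allowlist like the Tool Gateway's could work is to map logical tool names to fixed paths and build the command as an argument list, never as a shell string. The sketch below is an assumption-laden illustration: the `ALLOWED_TOOLS` table, the binary paths, the container name `findevil-sift`, and `build_exec_argv` are hypothetical, and the command is only constructed here, not executed.

```python
from __future__ import annotations

# Hypothetical allowlist: logical tool name -> binary path inside the container.
ALLOWED_TOOLS: dict[str, str] = {
    "hayabusa": "/usr/local/bin/hayabusa",
    "volatility": "/usr/local/bin/vol",
    "yara": "/usr/bin/yara",
}
CONTAINER = "findevil-sift"  # hypothetical container name


class ToolNotAllowed(Exception):
    """Raised when a requested tool is not in the allowlist."""


def build_exec_argv(tool: str, args: list[str]) -> list[str]:
    """Return the argv for `docker exec`, refusing unregistered tools."""
    binary = ALLOWED_TOOLS.get(tool)
    if binary is None:
        raise ToolNotAllowed(tool)
    # Each argument stays a separate list element, so characters such as ';'
    # or '|' are passed to the tool as data and never interpreted by a shell.
    return ["docker", "exec", CONTAINER, binary, *args]


argv = build_exec_argv("volatility", ["-f", "/evidence/mem.raw", "windows.pslist"])
print(argv)

try:
    build_exec_argv("bash", ["-c", "id"])
except ToolNotAllowed as exc:
    print("refused:", exc)
```

An argv list like this would typically be handed to `subprocess.run(argv)` without `shell=True`, which is what keeps the arguments out of shell interpretation.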
The Audit Ledger appends every event as a JSONL record where each entry includes prev_hash — the SHA-256 of the previous entry. Replaying the chain from genesis verifies that no entry was silently inserted, deleted, or modified. The findevil trace CLI command validates and pretty-prints the full chain for a case.
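A hash chain of this kind can be sketched in a few lines. The entry layout (`event`, `detail`, `prev_hash`), the all-zero genesis value, and the canonical JSON encoding are assumptions for illustration; the actual ledger format may differ.

```python
from __future__ import annotations

import hashlib
import json

GENESIS = "0" * 64  # assumed prev_hash value for the first entry


def entry_hash(entry: dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, no extra whitespace)."""
    canonical = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append(chain: list[dict], event: str, detail: dict) -> None:
    """Append an entry whose prev_hash commits to the entry before it."""
    prev = entry_hash(chain[-1]) if chain else GENESIS
    chain.append({"event": event, "detail": detail, "prev_hash": prev})


def verify(chain: list[dict]) -> bool:
    """Replay from genesis and check every prev_hash link."""
    expected = GENESIS
    for entry in chain:
        if entry["prev_hash"] != expected:
            return False
        expected = entry_hash(entry)
    return True


ledger: list[dict] = []
append(ledger, "tool_call", {"tool": "hayabusa"})
append(ledger, "evidence_ingested", {"sha256": "ab" * 32})
append(ledger, "finding", {"id": "F-1"})
intact = verify(ledger)

ledger[0]["detail"]["tool"] = "volatility"  # tamper with the first entry
after_tamper = verify(ledger)
print(intact, after_tamper)
```

In this simple scheme a change to any entry breaks the link stored in its successor. The last entry has no successor, so edits to it, or truncation of the tail, are detectable only if the latest hash is also recorded somewhere outside the ledger file.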
findevil investigate <case_dir> # run autonomous investigation
findevil trace <case_id> # replay and validate audit chain
findevil bench <evtx_dir> # batch-scan EVTX files, emit detections
findevil doctor # check Docker/SIFT/MCP health
findevil report <case_id> # render HTML/Markdown report with citations
- Docker (Engine 24+)
- Claude Code with MCP support
- Python 3.12+
git clone https://github.com/findevil/findevil
cd findevil
docker compose up -d
The SIFT container builds from Dockerfile.sift and mounts ./cases/real_evidence read-only at /evidence inside the container. No network access is granted.
pip install -e .
Add the FinDevil MCP server to your Claude Code configuration:
{
  "mcpServers": {
    "findevil": {
      "command": "python",
      "args": ["-m", "findevil.server"],
      "env": {
        "FINDEVIL_CASE_DIR": "/absolute/path/to/cases"
      }
    }
  }
}
# Place Windows artifacts (EVTX, memory image, registry hives) in cases/real_evidence/
findevil investigate cases/real_evidence --case-id demo-001
Or ask Claude Code directly:
"Investigate the Windows endpoint artifacts in cases/real_evidence. Identify signs of lateral movement, persistence mechanisms, and data exfiltration. All findings must cite evidence."
The repository ships with a self-contained test corpus under cases/ and a bench harness. To reproduce the published test results:
# 1. Verify the environment is healthy
findevil doctor
# 2. Run the full test suite
pytest tests/ -v
# 3. Batch-scan the included EVTX corpus and count detections
findevil bench cases/evtx_samples/ --output bench_results.jsonl
# 4. Run the full demo investigation
findevil investigate cases/real_evidence/ --case-id judge-run-01
# 5. Validate the audit chain (proves no tampering)
findevil trace judge-run-01
# 6. Render the final report with evidence citations
findevil report judge-run-01 --format markdown
Expected output for step 3: 37,732 detections across 877 EVTX files.
To observe the anti-hallucination system in action, try calling the MCP tool submit_finding with a fabricated artifact hash — the Contract Compiler will reject it with a ContractViolation error before the finding is recorded.
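The rejection path might look something like the sketch below. `ContractViolation` and `submit_finding` are named in this README, but the signature, the `vault_index` set of known digests, and the error messages are assumptions made for the example rather than the MCP tool's real interface.

```python
from __future__ import annotations

import hashlib


class ContractViolation(Exception):
    """Raised when a finding cites evidence the vault does not hold."""


def submit_finding(claim: str, cited_hashes: list[str], vault_index: set[str]) -> dict:
    """Accept a finding only if every cited hash is a known vault artifact."""
    if not cited_hashes:
        raise ContractViolation("finding has no evidence citations")
    unknown = [h for h in cited_hashes if h not in vault_index]
    if unknown:
        raise ContractViolation(f"unknown artifact hash: {unknown[0]}")
    return {"claim": claim, "evidence": list(cited_hashes)}


# One genuine artifact in the (hypothetical) vault index.
real_hash = hashlib.sha256(b"EventID=7045 ServiceName=PSEXESVC").hexdigest()
vault_index = {real_hash}

accepted = submit_finding("Remote service installed", [real_hash], vault_index)

fabricated = "deadbeef" * 8  # well-formed digest that no artifact produced
try:
    submit_finding("Credentials were dumped", [fabricated], vault_index)
except ContractViolation as exc:
    violation = str(exc)
    print("ContractViolation:", violation)
```

The fabricated digest is syntactically valid, so format checks alone would not catch it. The lookup against the set of ingested digests is what rejects it.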
| Metric | Value |
|---|---|
| Test suite | 55 tests across 9 modules |
| EVTX files benchmarked | 877 |
| Total detections | 37,732 |
| Contract violations caught | 100% of injected fabrications |
| Audit chain integrity | Validated on all test cases |
Test modules: test_vault, test_compiler, test_correction, test_ledger, test_executor, test_registry, test_models, test_cli, test_e2e.
| Layer | Technology |
|---|---|
| Reasoning engine | Claude Code (claude-sonnet-4-x via Anthropic API) |
| Agent-tool bridge | Model Context Protocol (MCP) |
| MCP server framework | FastMCP (Python) |
| Data models / validation | Pydantic v2 |
| Forensic platform | SIFT Workstation (Docker) |
| EVTX / Sigma detection | Hayabusa |
| Registry analysis | Zimmerman tools (RECmd) |
| Memory forensics | Volatility 3 |
| Timeline / filesystem | Plaso / log2timeline, Sleuth Kit |
| Malware scanning | YARA |
| Container isolation | Docker (network_mode: none, read-only mounts) |
| CLI | Click |
| Report templating | Jinja2 |
findevil/
├── src/findevil/
│ ├── server.py # FastMCP server — MCP entry point
│ ├── cli.py # Click CLI (investigate / trace / bench / doctor / report)
│ ├── vault/ # Evidence Vault — read-only, SHA-256 integrity
│ ├── contracts/ # Contract Compiler + Finding/EvidenceRef models
│ ├── correction/ # Self-Correction Engine (DOWNGRADE / RETRACT / AMEND)
│ ├── audit/ # Hash-chained Audit Ledger (JSONL)
│ └── tools/ # Tool wrappers: hayabusa, volatility, sleuthkit, plaso,
│ # zimmerman, yara_scanner, executor, registry
├── tests/ # 55-test suite
├── cases/ # Evidence and EVTX sample corpus
├── docker-compose.yml # SIFT container definition
└── Dockerfile.sift # SIFT Workstation image
MIT — see LICENSE.
- SANS Institute — FIND EVIL! Hackathon and the SIFT Workstation platform
- DFIR.training / Eric Zimmerman — Zimmerman forensic tools
- Yamato Security — Hayabusa EVTX analysis engine
- EVTX-ATTACK-SAMPLES — Public EVTX corpus used in benchmarking
- Volatility Foundation — Volatility 3 memory forensics framework
- Anthropic — Claude Code and the Model Context Protocol