EvilTrace is an autonomous DFIR investigation agent for the SANS SIFT Workstation that verifies every finding against raw, SHA-256-sealed forensic tool output and tags each claim GROUNDED, INFERRED, or UNVERIFIED — so an autonomous AI investigation cannot quietly report something the evidence does not support.
Status: hackathon prototype. Built for the SANS Find Evil! AI Hackathon (2026). Developed against a single case (Operation SHIELDBASE, host base-rd-01), with a second public case (the Ali Hadi Web Server case) run as an independent check. This is a focused proof-of-concept that prioritizes a verifiable end-to-end audit chain over breadth. It is not production software and is not a substitute for a trained examiner's judgment.
| # | Requirement | Location |
|---|---|---|
| 1 | Code repository URL | https://github.com/1hackerway/eviltrace (this README) |
| 2 | Open-source license | LICENSE — MIT, repo root; also shown in the GitHub About sidebar once public |
| 3 | README with setup instructions | This file → Try It Out Locally |
| 4 | Step-by-step run instructions for judges | Try It Out Locally (no live deployment; local run on SIFT per rules) |
| 5 | Text description of features / functionality | This README + DEVPOST.md; samples/investigative_narrative.md and samples/alihadi_01/investigative_narrative.md — structured investigative narratives (analytical-reasoning requirement) |
| 6 | Demonstration video | Watch on YouTube |
| 7 | Architecture diagram | Architecture (Mermaid diagram in this README) |
| 8 | Evidence dataset documentation | datasets/README.md + per-case briefings cases/incident_01/CLAUDE.md, cases/alihadi_01/CLAUDE.md |
| 9 | Accuracy report | samples/accuracy_report.md, samples/accuracy_report_memory.md, samples/alihadi_01/accuracy_report.md, samples/alihadi_01/accuracy_report_memory.md, and samples/accuracy_self_assessment.md (honest self-assessment incl. evidence-integrity / spoliation section) |
| 10 | Agent execution logs | samples/execution_log.jsonl and samples/alihadi_01/execution_log.jsonl — per-tool timestamps + per-turn token usage |
Every finding in the reports can be traced to the specific tool execution that produced it via the artifact citation and the execution log's SHA-256 seals.
EvilTrace orchestrates SIFT forensic tools (Sleuth Kit, Zimmerman EZ Tools, Plaso, Volatility 3, YARA) through a controlled wrapper layer, runs an autonomous Anthropic-SDK agent loop over them, and then validates every finding it produces against the exact, pre-hashed tool output it claims to be based on.
Each reported finding carries:
- an exact artifact citation (`file:LINE`) pointing to the raw tool output line that supports it,
- a confidence tag: `GROUNDED` (the cited line literally contains the claimed evidence), `INFERRED` (a defensible deduction across grounded facts, but not a single literal line), or `UNVERIFIED` (no backing line; excluded from conclusions),
- an integrity verdict tying the citation back to a SHA-256 hash that was computed on the raw tool output before any LLM saw it.
The validator that assigns these tags is deterministic and never calls an LLM. It cannot be argued with, and it cannot hallucinate.
Autonomous DFIR agents are fast, but they can fabricate. Protocol SIFT's own documentation names this as its primary unsolved risk and prescribes the fix — "use tool outputs as the source of truth, never the AI summary text" — without implementing it. Dr. Brian Carrier (creator of The Sleuth Kit) makes the same point: it is critical for tools to identify what came from AI so you know what to verify.
EvilTrace is that fix, automated. The grounding validator is a direct, deterministic implementation of Carrier's "query for item existence" method: when a finding references an artifact, the validator re-fetches that exact line, re-hashes the source against the pre-LLM seal, and refuses to certify anything it cannot reproduce.
EvilTrace extends the Protocol SIFT baseline rather than replacing it. It adopts the patterns that are table stakes — a layered CLAUDE.md agent persona, on-demand skill context, an stderr → self-correct → re-run loop, an immutable audit log, and a strict "never ask questions mid-task" execution rule — and adds the inference-constraint layer that Protocol SIFT's documentation prescribes but does not build.
It also occupies a deliberately different point on the autonomy/oversight spectrum than human-in-the-loop platforms (for example, AppliedIR's Valhuntir, the hackathon's reference example):
| | Human-in-the-loop platform | EvilTrace |
|---|---|---|
| Trust anchor | A human examiner approves every finding (findings staged as DRAFT; the AI cannot approve its own work) | A deterministic grounding validator certifies every finding against hash-sealed raw output |
| Speed of trust | Bounded by human review | Machine speed |
| What it guarantees | A human signed off | Every reported finding cites real, pre-LLM-hashed evidence; ungrounded claims are downgraded automatically |
Honest boundary: EvilTrace's validator catches fabrication and over-certainty — it does not replace human judgment on interpretation. INFERRED findings are precisely where a human analyst still matters. EvilTrace is not "better than human review"; it is a different tradeoff — full autonomy with a deterministic floor under every claim, useful where human review at machine speed is not available.
The model reasons flexibly. Tool execution is restricted by code. Two interfaces — the SDK agent and the MCP server — both converge on the same read-only wrapper layer, so there is exactly one audit chain regardless of which interface drives the investigation.
Framework posture. EvilTrace's orchestrator is a raw Anthropic SDK tool-use loop — a comparable agentic architecture, expressly permitted by the official hackathon rules alongside Claude Code and OpenClaw. The custom MCP server (eviltrace_mcp.py) is the Claude Code integration path: the same wrappers, the same audit chain, runnable directly from a Claude Code session.
graph TB
PATTERN["ARCHITECTURAL PATTERN<br/>Custom MCP Server + comparable raw-SDK agent loop"]
subgraph prompt ["PROMPT-BASED GUARDRAILS — guidance only (CLAUDE.md hierarchy)"]
C1["~/.claude/CLAUDE.md<br/>global persona"]
C2["./CLAUDE.md<br/>project rules"]
C3["cases/incident_01/CLAUDE.md<br/>case objective + IOCs"]
end
TASK["Investigator task"] --> AGENT
subgraph agentbox ["agent.py — raw Anthropic SDK tool-use loop"]
AGENT["loop: sliding context · max-turns ceiling ·<br/>deterministic vs transient failure recovery"]
ALLOW["typed tool allowlist<br/>(the tool list IS the allowlist)"]
AGENT --> ALLOW
end
prompt -. concatenated into system prompt .-> AGENT
MCP["eviltrace_mcp.py<br/>Custom MCP Server<br/>(3 read-only tools)"]
subgraph arch ["ARCHITECTURAL GUARDRAILS — code-enforced"]
DISK["tools/disk_tools.py<br/>structured tool implementations"]
WRAP["bin/run_* wrappers + _common.sh<br/>read-only guard · line-numbered capture ·<br/>pre-LLM SHA-256 seal · JSONL audit log"]
DISK --> WRAP
end
ALLOW --> DISK
MCP -. shells out to .-> WRAP
WRAP -->|invokes| SIFTT["SIFT forensic tools<br/>TSK · Zimmerman EZ Tools · Plaso · Volatility 3 · YARA"]
SIFTT -->|read-only| EV[("evidence/ — rd01.E01<br/>READ-ONLY")]
WRAP -->|writes only| OUT[("analysis/ · reports/<br/>tool_runs/ + execution_log.jsonl")]
OUT --> VAL
subgraph valbox ["validator.py — deterministic, never calls an LLM"]
VAL["re-fetch cited line · re-hash vs pre-LLM seal ·<br/>tag GROUNDED / INFERRED / UNVERIFIED"]
end
VAL --> REP["case_report.md + accuracy_report.md<br/>(confidence-tagged)"]
HOOK["hooks/stop_hook.py<br/>completion-promise verification"] -. gates completion .-> AGENT
classDef patternLabel fill:none,stroke:none,font-weight:bold
classDef promptLayer stroke-dasharray:6 4
classDef archLayer stroke-width:3px
class PATTERN patternLabel
class prompt promptLayer
class arch,valbox archLayer
Prompt-based guardrails (the CLAUDE.md hierarchy) shape how the agent reasons — persona, objectives, the no-mid-task-questions rule. They are advisory.
Architectural guardrails (the wrapper layer + the typed tool allowlist) decide what can physically execute. The agent has no raw shell. It can only call the tools on its allowlist; every tool shells through a bin/run_* wrapper that refuses to write inside the evidence tree, captures line-numbered output, and SHA-256-seals the raw bytes before any LLM sees them. This is the distinction judges look for, and EvilTrace enforces the safety-critical part in code, not in the prompt.
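The "tool list is the allowlist" idea can be illustrated with a small dispatcher. This is a sketch under assumptions, not the code in `agent.py`: the registry `TOOLS`, the handler `run_mft_parse`, and the error wording are all hypothetical, and a real handler would shell out to a `bin/run_*` wrapper rather than return a string.

```python
from typing import Callable


def run_mft_parse(image: str) -> str:
    # Placeholder body: a real handler would invoke a read-only wrapper script.
    return f"parsed MFT from {image}"


# Hypothetical registry: the only callables the model can ever reach.
TOOLS: dict[str, Callable[..., str]] = {
    "run_mft_parse": run_mft_parse,
}


def dispatch(name: str, arguments: dict) -> str:
    """Execute a model-requested tool only if it is registered; otherwise refuse."""
    handler = TOOLS.get(name)
    if handler is None:
        # No fallback to a shell: an unknown name is reported back as an error.
        return f"ERROR: tool '{name}' is not on the allowlist"
    return handler(**arguments)
```

Because there is no generic "run command" entry in the registry, a request such as `dispatch("bash", {"cmd": "..."})` can only produce the refusal string.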
- Evidence is treated as read-only; wrappers refuse any output path inside the evidence, `/mnt`, or `/media` trees and write only under `analysis/`, `reports/`, and `exports/`.
- The LLM has no raw shell access. The exposed tool list is the allowlist.
- All forensic execution flows through wrapper scripts that hash raw output before it is summarized or validated.
- Findings require an exact artifact citation; the validator is deterministic and LLM-free.
- The completion Stop hook verifies that promised artifacts actually exist before a run is accepted.
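The completion check in the last bullet can be sketched as follows. This is an assumption-laden illustration, not the contents of `hooks/stop_hook.py`: the function names and the idea that promises are stored as paths relative to the case directory are hypothetical.

```python
from pathlib import Path


def missing_artifacts(case_dir: Path, promised: list[str]) -> list[str]:
    """Return the promised paths (relative to case_dir) that are absent or empty."""
    missing = []
    for rel in promised:
        path = case_dir / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


def may_stop(case_dir: Path, promised: list[str]) -> bool:
    """A run is accepted only when every promised artifact exists on disk."""
    return not missing_artifacts(case_dir, promised)
```

A hook built this way would block completion and hand the list of missing paths back to the agent instead of accepting its claim that the work is done.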
Bypass testing. The read-only guardrail (assert_safe_output in bin/_common.sh) is verified by tests/test_guardrail.sh, which probes it with five paths: a legitimate output path, a direct write into the evidence tree, a .. path-traversal escape, a protected-mount (/mnt) write, and an out-of-tree write. Only the legitimate path is allowed; every bypass attempt is refused. Because the guard canonicalizes with readlink -f before checking, the .. traversal collapses into the evidence tree and is caught. Captured run: tests/guardrail_selftest.txt.
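The canonicalize-then-check logic can be shown with a Python analogue. The real guard is the shell function `assert_safe_output` using `readlink -f`; the sketch below swaps in `os.path.realpath` and a hypothetical function name `is_safe_output`, and assumes the case layout described above (an `evidence/` tree plus `analysis/`, `reports/`, and `exports/` under one case root).

```python
import os


def _inside(path: str, tree: str) -> bool:
    """True if path is tree itself or lies beneath it."""
    return path == tree or path.startswith(tree + os.sep)


def is_safe_output(path: str, case_root: str) -> bool:
    """Allow an output path only under analysis/, reports/, or exports/."""
    real = os.path.realpath(path)  # collapses ".." and resolves symlinks
    root = os.path.realpath(case_root)
    protected = [
        os.path.join(root, "evidence"),
        os.path.realpath("/mnt"),
        os.path.realpath("/media"),
    ]
    if any(_inside(real, tree) for tree in protected):
        return False
    allowed = [os.path.join(root, d) for d in ("analysis", "reports", "exports")]
    return any(_inside(real, tree) for tree in allowed)
```

Checking the canonical path rather than the literal string is what defeats the `..` probe: `analysis/../evidence/x` resolves to a path inside `evidence/` before the comparison runs.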
EvilTrace reduces hallucination and spoliation risk through architectural guardrails and deterministic validation. It is not designed to defend against a deliberately malicious model, and it does not replace write-blockers, verified evidence handling, or examiner judgment.
EvilTrace runs locally on a SANS SIFT Workstation. There is no hosted deployment — it operates on local forensic evidence and local DFIR tooling.
Prerequisites
- SANS SIFT Workstation (provides Sleuth Kit, Zimmerman EZ Tools via .NET, Plaso, Volatility 3, YARA)
- Python 3.10+
- An Anthropic API key
- `pip install anthropic --break-system-packages` (add `fastmcp` only if you want to run the MCP server)
Run the disk investigation
git clone https://github.com/1hackerway/eviltrace.git
cd eviltrace
export ANTHROPIC_API_KEY="sk-ant-..."
# EvilTrace ships no evidence. Place (or symlink) the disk image:
mkdir -p cases/incident_01/evidence
# cases/incident_01/evidence/rd01.E01 <-- put the image here
# Autonomous investigation (extracts artifacts, parses them, records findings):
python3 agent.py incident_01
# Deterministic validation: tag every finding + emit the accuracy report:
python3 validator.py --write

Review the outputs
cases/incident_01/reports/case_report.md # findings with confidence tags
cases/incident_01/analysis/findings.json # structured findings + citations
cases/incident_01/reports/accuracy_report.md # grounded / inferred / unverified rates
cases/incident_01/reports/execution_log.jsonl # per-tool + per-turn audit trail
What to expect. The disk run extracts $MFT, Amcache.hve, and the SOFTWARE/SYSTEM hives, parses them, and builds a Plaso timeline — the timeline step alone can take 20–40 minutes. A representative run produces 7 disk findings (5 GROUNDED, 2 INFERRED, 0 UNVERIFIED). Because the agent is autonomous, exact findings can vary slightly run-to-run; the committed samples/ reflect a validated reference run.
Don't have the evidence, or want to skip the run? Pre-computed reference outputs are in samples/ — the disk leg (findings.json, 7 findings) and the memory leg (memory_findings.json, 6 findings) together make up the full incident_01 case: 13 findings — 11 GROUNDED, 2 INFERRED, 0 UNVERIFIED. A judge never needs the 3 GB memory image to verify the project.
Across both cases: 18 findings — 14 GROUNDED, 3 INFERRED, 1 UNVERIFIED. The honest-accuracy write-up — including why the single UNVERIFIED is a correct floor rather than a miss — is in samples/accuracy_self_assessment.md.
A second, independent host shows the pipeline generalizes beyond the SHIELDBASE case: the Ali Hadi public Web Server case (Windows Server 2008 / XAMPP, a partitioned disk with NTFS at sector 2048 plus a memory image), run with no answer key in the briefing. It produced 5 findings — 3 GROUNDED / 1 INFERRED / 1 UNVERIFIED. The lone UNVERIFIED is a true-but-not-mechanically-certifiable directory-creation finding: the event happened, but a low-entropy directory name is not a hard anchor the deterministic validator can ground against a single line, so it floors honestly rather than over-certifying (see samples/accuracy_self_assessment.md §3). Reference outputs are in samples/alihadi_01/.
Optional — the MCP server (Custom MCP Server pattern)
pip install fastmcp --break-system-packages
claude mcp add eviltrace-forensics python3 eviltrace_mcp.py
claude mcp list   # should show eviltrace-forensics ✓ Connected

The registration is written to your local Claude config, not to this repo, so it must be re-run on a fresh clone.
All confidence counts in the demo come from committed files in this repo. Clone it and run these three commands — no tools or evidence images required:
# Combined scoreboard (per-case + total)
sed -n '/## 2. Results summary/,/Combined/p' samples/accuracy_self_assessment.md
# Per-case scoreboards (independent backing for the sum)
grep -niE 'scoreboard' samples/case_report.md # incident_01 → 11 / 2 / 0
grep -niE 'scoreboard' samples/alihadi_01/case_report.md   # alihadi_01 → 3 / 1 / 1

Expected: incident_01 (13 = 11/2/0) + alihadi_01 (5 = 3/1/1) = 18 findings: 14 GROUNDED / 3 INFERRED / 1 UNVERIFIED.
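The combined scoreboard is a plain per-tag sum of the two per-case scoreboards. The short Python check below reproduces the arithmetic from the counts quoted in this README; it does not read the sample files, and the variable names are illustrative.

```python
TAGS = ("GROUNDED", "INFERRED", "UNVERIFIED")

# Per-case counts as stated above, in TAGS order.
cases = {
    "incident_01": (11, 2, 0),
    "alihadi_01": (3, 1, 1),
}

# Sum each tag column across cases, then total the columns.
combined = {tag: sum(counts[i] for counts in cases.values()) for i, tag in enumerate(TAGS)}
total = sum(combined.values())

print(combined, total)  # {'GROUNDED': 14, 'INFERRED': 3, 'UNVERIFIED': 1} 18
```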
Live runs write under cases/incident_01/. The analysis/ outputs (extracted artifacts, parsed CSVs) are gitignored; the reports/ outputs are tracked. For one-stop review, samples/ bundles a copy of every submission-facing artifact (disk + memory).
| Output | Live-run path | Tracked copy |
|---|---|---|
| Disk findings (structured) | cases/incident_01/analysis/findings.json | samples/findings.json |
| Memory findings (structured) | cases/incident_01/analysis/memory_findings.json | samples/memory_findings.json |
| Case report (disk) | cases/incident_01/reports/case_report.md | samples/case_report.md |
| Accuracy report (disk) | cases/incident_01/reports/accuracy_report.md | samples/accuracy_report.md |
| Accuracy report (memory) | cases/incident_01/reports/accuracy_report_memory.md | samples/accuracy_report_memory.md |
| Execution log | cases/incident_01/reports/execution_log.jsonl | samples/execution_log.jsonl |
| Self-correction excerpt | (extracted from the execution log) | samples/self_correction_excerpt.jsonl |
| Raw tool outputs | cases/incident_01/analysis/tool_runs/ | (excluded: large) |
Demo-video run provenance. The demo video's live-run footage is a separate cold run on a scratch copy of the incident_01 case. Its complete audit artifacts — execution log (including the on-camera self-corrections), findings, session promises, and rendered case report — are committed verbatim in samples/video_demo_run/, so every frame of terminal output in the video traces to a logged tool execution. Canonical submitted results remain samples/ (incident_01) and samples/alihadi_01/.
EvilTrace measures two distinct things and keeps them separate:
- Grounding / inference-constraint accuracy (self-contained — this is the differentiator). For every finding, does the cited line exist, does it match the pre-LLM hash, and does it actually support the claim? This needs no external answer key — it is verifiable from the repo alone.
- Investigative recall (needs an external answer key). Did the agent find everything a human would? This is not yet externally benchmarked, and the accuracy report says so plainly.
A real excerpt from validator.py (trimmed):
[F001] ✔ GROUNDED mft_output.csv:L219674
files matched : procdump.exe
sizes matched : 515776 bytes
path corrob. : tdungan, dashlane, procdump.exe
integrity : VERIFIED (input_seal=OK, csv_seal=SEALED_OK)
[F004] ~ INFERRED SOFTWARE_recmd.csv:L31
files matched : msascuil.exe
MISSING : file:vmtoolsd.exe -> not GROUNDED
integrity : VERIFIED (input_seal=OK, csv_seal=SEALED_OK)
F004 is the moat in one frame: the agent's claim referenced a file the cited line did not contain, so the validator downgraded it from GROUNDED to INFERRED automatically — no human cross-examination required.
The reference run also correctly rejected look-alikes rather than over-reporting — e.g. LogUploader.dll and Qt5*.dll (legitimate OneDrive/Dashlane components), csscan.exe (McAfee), remsh.exe (Windows rempl), and Office-installer binaries — and recorded zero hallucinated findings. Documented IOCs that live only in the memory image (e.g. STUN.exe, msedge.exe, 172.15.1.20) were correctly absent from the disk findings rather than fabricated.
The audit trail is not a narrative summary — it is structured evidence of what the agent and tools actually did. Any finding traces back through a single chain:
finding → artifact citation (file:LINE) → tool_runs/ raw output → pre-LLM SHA-256 seal in execution_log.jsonl
execution_log.jsonl carries three record types:
- tool runs: timestamp, tool, command, return code, raw-output path, pre-LLM SHA-256 (and a `csv_seal` for parsed CSVs),
- `self_correction`: when a deterministic tool error occurs, the decision is logged, the structured error is returned to the model, and the model reroutes (it is not a blind retry),
- `llm_turn`: per-turn timestamp, model, stop reason, and input/output token usage.
The validator's integrity verdicts describe exactly how far the chain can be proven:
| Verdict | Meaning |
|---|---|
| `VERIFIED` | Input artifact seal and parsed-CSV seal both match |
| `INPUT_VERIFIED` | Input artifact seal matches; the parser did not emit a CSV seal |
| `CSV_VERIFIED` | Parsed output is byte-stable since capture; producer chain not regex-traceable |
| `RAW_VERIFIED` | Raw tool-output seal matches (used for memory/Volatility findings) |
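One way to read the table is as a mapping from individual seal checks to a verdict. The sketch below is an interpretation, not the validator's actual decision logic: each argument is `True` (seal matched), `False` (seal mismatched), or `None` (no seal recorded), and the labels `MISMATCH` and `NO_SEAL` are hypothetical names for outcomes the table does not cover.

```python
from typing import Optional


def integrity_verdict(input_seal: Optional[bool],
                      csv_seal: Optional[bool],
                      raw_seal: Optional[bool] = None) -> str:
    """Map seal-check results onto the verdicts listed in the table."""
    if any(check is False for check in (input_seal, csv_seal, raw_seal)):
        return "MISMATCH"        # hypothetical label: a recorded seal no longer matches
    if input_seal and csv_seal:
        return "VERIFIED"        # both ends of the parse chain are sealed and match
    if input_seal:
        return "INPUT_VERIFIED"  # input matches, parser emitted no CSV seal
    if csv_seal:
        return "CSV_VERIFIED"    # parsed output is stable, producer chain not traced
    if raw_seal:
        return "RAW_VERIFIED"    # raw tool-output seal only
    return "NO_SEAL"             # hypothetical label: nothing to verify against
```

For example, `integrity_verdict(True, True)` yields `VERIFIED`, the verdict shown for F001 and F004 in the excerpt above.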
Full provenance is in datasets/README.md. In brief: Operation SHIELDBASE, host base-rd-01, disk image rd01.E01 (EWF, single-volume NTFS, image MD5 391be74b6830344eace7272f697cf1ae). Evidence files are not included in this repository.
| Requirement | Location |
|---|---|
| Public GitHub repository | this repo |
| Open-source license (MIT) | LICENSE |
| Demo video | (Devpost submission link) |
| Architecture diagram (prompt-based vs architectural guardrails) | Architecture above |
| Dataset documentation | datasets/README.md |
| Accuracy report | samples/accuracy_report.md (disk) + samples/accuracy_report_memory.md (memory) |
| Try-it-out instructions | Try It Out Locally above |
| Agent execution logs (timestamps + token usage) | samples/execution_log.jsonl |
- The primary investigation agent is a raw Anthropic SDK loop. `eviltrace_mcp.py` is a separate Custom MCP Server interface that demonstrates the MCP pattern over the same wrapper layer; it is not on the agent's runtime path.
- The pipeline is tuned to the reference image (EWF, single-volume NTFS). Other image formats may require adjustment.
- Grounding accuracy is self-contained and verifiable; investigative recall is not yet externally benchmarked against an independent answer key.
- Some findings are intentionally `INFERRED` when a single cited line does not prove the full claim; this is by design, not a defect.
- Local SIFT tool paths can vary between workstation builds.
- Evidence files are not included in the public repository.
EvilTrace is a research prototype. The AI is a tool to be used by trained incident-response professionals; responsibility for the accuracy and completeness of findings remains with the human examiner. Use only on systems and data you are authorized to analyze. This software is provided "as is", without warranty of any kind — see LICENSE.
SIFT Workstation is a product of the SANS Institute.
Built for the SANS Find Evil! AI Hackathon (2026) by Anand Kumar. EvilTrace extends the Protocol SIFT reference architecture (Rob T. Lee, SANS). Its hallucination-combat design is informed by Dr. Brian Carrier's work on AI verification in DFIR (cybertriage.com). Implementation was done with assistance from Claude Code (Anthropic). The evaluation dataset is the SANS Operation SHIELDBASE scenario.
MIT — see LICENSE.