dejisec/papertrail


PaperTrail

PaperTrail discovers public documents for a target domain, downloads them, extracts the embedded metadata, and turns the results into structured JSON/CSV reports. It also includes an optional CrewAI orchestration path so each pipeline phase can be delegated to autonomous agents.

Features

  • Deterministic Google dorks – builds Serper.dev queries per file type and optional keywords, with caching/resume support backed by SQLite.
  • Rich metadata extraction – merges pypdf, OOXML, legacy Office, and ExifTool signals, then derives usernames, email domains, and filesystem paths.
  • Actionable reporting – writes results.json, downloads.json, metadata.json, metadata.csv, findings.json, and a Markdown report.md summarizing trends across all documents.
  • Agent mode – when enabled, captures per-step artifacts plus CrewAI session output so analysts can audit or reuse what each agent produced.
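As an illustration of the "derives usernames, email domains, and filesystem paths" step, the sketch below scans raw metadata strings with regular expressions. It is not PaperTrail's actual extractor: the function name `derive_signals`, the patterns, and the sample values are all assumptions chosen for the example.

```python
import re
from collections import Counter

# Hypothetical patterns; the real tool may use different heuristics.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")
WIN_USER_RE = re.compile(r"[A-Za-z]:\\Users\\([^\\]+)\\", re.IGNORECASE)
NIX_USER_RE = re.compile(r"/(?:home|Users)/([^/]+)/")


def derive_signals(values):
    """Count usernames and email domains found in raw metadata strings."""
    usernames, domains = Counter(), Counter()
    for value in values:
        # Email addresses contribute their domain part.
        for match in EMAIL_RE.finditer(value):
            domains[match.group(1).lower()] += 1
        # Home-directory paths leak the local account name.
        for pattern in (WIN_USER_RE, NIX_USER_RE):
            for match in pattern.finditer(value):
                usernames[match.group(1)] += 1
    return {"usernames": dict(usernames), "email_domains": dict(domains)}


sample = [
    r"C:\Users\jdoe\Documents\budget.xlsx",
    "Jane Doe <jane.doe@example.com>",
    "/home/asmith/reports/q3.pdf",
]
print(derive_signals(sample))
```

Counting occurrences rather than collecting a set keeps the result close to the aggregate counts described for findings.json.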

Getting Started

Requirements

  • Python 3.12+
  • uv
  • A Serper.dev API key (used to access Google Search results)

Installation

git clone https://github.com/dejisec/papertrail.git
cd papertrail
uv sync

Environment

Create a .env file (or copy example.env to .env) and provide the required secrets:

SERPER_API_KEY=sk-serper-your-key-here
# Optional: enable CrewAI mode
OPENAI_API_KEY=sk-openai-...
PAPERTRAIL_USE_CREW=0  # set to 1 to run the CrewAI pipeline

Usage

Run the CLI via uv run papertrail:

uv run papertrail \
  --domain example.com \
  --types pdf,docx,xlsx \
  --keywords finance,"internal audit" \
  --max-results 10 \
  --max-size-mb 75 \
  --out ./output

Key options:

  • --domain/-d (required) – host to scope site: searches.
  • --types/-t (required) – comma-separated list of allowed extensions (pdf, docx, xlsx, doc, xls, ppt, pptx).
  • --keywords/-k – optional comma-separated keywords, quoted automatically if they include spaces.
  • --max-results – cap on normalized results (per domain, shared across types).
  • --max-size-mb – reject downloads larger than this threshold.
  • --out/-o – case directory storing all downloads, cache, and reports.
  • --db-path – custom SQLite path (defaults to <out>/papertrail.db).
  • --resume – reuse cached queries and downloads from previous runs.
  • --json – emit the parsed configuration as JSON (useful for automation) and store it as papertrail-config.json inside the output folder.
  • --verbose – enable structured DEBUG logs for troubleshooting.
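To make the interaction between --domain, --types, and --keywords concrete, here is a minimal sketch of how one search query per file type could be assembled. The helper names `split_csv` and `build_queries` are hypothetical, and combining all keywords into a single query is an assumption; PaperTrail's real query builder may differ.

```python
def split_csv(raw):
    """Split a comma-separated option value, dropping empty items."""
    return [item.strip() for item in raw.split(",") if item.strip()]


def build_queries(domain, types, keywords=""):
    """Build one site:/filetype: query string per requested extension."""
    terms = []
    for keyword in split_csv(keywords):
        # Multi-word keywords are quoted so they match as a phrase.
        terms.append(f'"{keyword}"' if " " in keyword else keyword)
    queries = []
    for ext in split_csv(types):
        parts = [f"site:{domain}", f"filetype:{ext.lower()}"] + terms
        queries.append(" ".join(parts))
    return queries


for query in build_queries("example.com", "pdf,docx", "finance,internal audit"):
    print(query)
```

Because the output depends only on the option values, repeated runs produce identical query strings, which is what lets a cache keyed on the query text be reused.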

Output Layout

Each run produces:

  • results.json – normalized Serper search hits plus run context.
  • downloads.json – outcome per URL (status, MIME, SHA-256, size, final URL).
  • files/ – downloaded documents organized by file type.
  • metadata.json / metadata.csv – structured metadata records.
  • findings.json – aggregate counts (authors, editors, email domains, etc.).
  • report.md – Markdown summary ready for briefings.
  • papertrail.db – SQLite cache that powers --resume.
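The role of papertrail.db can be pictured with a small query cache built on Python's sqlite3 module. This is only a sketch under assumptions: the table name `query_cache`, its columns, and the `cached_search` helper are invented for illustration and do not reflect PaperTrail's actual schema.

```python
import json
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a cache database with a single query table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS query_cache ("
        "query TEXT PRIMARY KEY, response TEXT NOT NULL)"
    )
    return conn


def cached_search(conn, query, fetch, resume=True):
    """Return a cached response when resuming, else fetch and store it."""
    if resume:
        row = conn.execute(
            "SELECT response FROM query_cache WHERE query = ?", (query,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])
    response = fetch(query)
    conn.execute(
        "INSERT OR REPLACE INTO query_cache (query, response) VALUES (?, ?)",
        (query, json.dumps(response)),
    )
    conn.commit()
    return response


calls = []


def fake_fetch(query):
    """Stand-in for a real search API call; records each invocation."""
    calls.append(query)
    return {"query": query, "hits": []}


conn = open_cache()
first = cached_search(conn, "site:example.com filetype:pdf", fake_fetch)
second = cached_search(conn, "site:example.com filetype:pdf", fake_fetch)
print(len(calls))  # the second lookup is served from the cache
```

Passing the fetch function in as a parameter keeps the cache logic testable without network access or an API key.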

When PAPERTRAIL_USE_CREW=1, PaperTrail also writes agent artifacts under output/agents/*.json (search payloads, fetch logs, metadata listings, CrewAI session transcripts, and more). report.md references those paths in the “Agent Outputs” section.

Agent Workflow (Optional)

If both PAPERTRAIL_USE_CREW=1 and OPENAI_API_KEY are set, the CLI routes to PapertrailCrewRunner, which:

  1. Runs the same deterministic search/download/metadata pipeline.
  2. Saves JSON artifacts per stage for auditing.
  3. Calls CrewAI’s sequential template (agents/tasks defined in papertrail/config/agents.yaml and config/tasks.yaml).
  4. Records the CrewAI session output so the findings include LLM reasoning.

This mode is completely optional; unless PAPERTRAIL_USE_CREW=1 and OPENAI_API_KEY are both set, the CLI falls back to the direct pipeline.
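The routing rule can be expressed as a small predicate over the environment. The function name `use_crew` is hypothetical and the check is a sketch of the documented behavior, not the CLI's actual code.

```python
import os


def use_crew(env=None):
    """Return True only when the crew flag is 1 and an OpenAI key is present."""
    env = os.environ if env is None else env
    flag_enabled = env.get("PAPERTRAIL_USE_CREW", "0") == "1"
    has_key = bool(env.get("OPENAI_API_KEY"))
    return flag_enabled and has_key


mode = "CrewAI pipeline" if use_crew() else "direct pipeline"
print(mode)
```

Accepting an optional mapping instead of reading os.environ directly makes the decision easy to exercise in tests.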

About

Python tool for OSINT-driven public document discovery and metadata extraction, with optional CrewAI-powered agent orchestration.
