PaperTrail discovers public documents for a target domain, downloads them, extracts the embedded metadata, and turns the results into structured JSON/CSV reports. It also includes an optional CrewAI orchestration path so each pipeline phase can be delegated to autonomous agents.
- Deterministic Google dorks – builds Serper.dev queries per file type and optional keywords, with caching/resume support backed by SQLite.
- Rich metadata extraction – merges pypdf, OOXML, legacy Office, and ExifTool signals, then derives usernames, email domains, and filesystem paths.
- Actionable reporting – writes `results.json`, `downloads.json`, `metadata.json`, `metadata.csv`, `findings.json`, and a Markdown `report.md` summarizing trends across all documents.
- Agent mode – when enabled, captures per-step artifacts plus CrewAI session output so analysts can audit or reuse what each agent produced.
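The deterministic dork construction can be sketched roughly as follows; the function name and exact query shape are illustrative assumptions, not PaperTrail's actual internals:

```python
def build_dorks(domain: str, types: list[str], keywords: list[str]) -> list[str]:
    """Build one site:-scoped Google dork per file type (illustrative sketch).

    Keywords containing spaces are wrapped in quotes, mirroring the CLI's
    automatic quoting behavior.
    """
    quoted = [f'"{k}"' if " " in k else k for k in keywords]
    keyword_part = " ".join(quoted)
    queries = []
    for ext in types:
        query = f"site:{domain} filetype:{ext}"
        if keyword_part:
            query += f" {keyword_part}"
        queries.append(query)
    return queries

print(build_dorks("example.com", ["pdf", "docx"], ["finance", "internal audit"]))
```

Because the queries are a pure function of domain, types, and keywords, identical runs produce identical queries, which is what makes SQLite-backed caching and `--resume` practical.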
- Python 3.12+
- uv
- A Serper.dev API key (used to access Google Search results)
```shell
git clone https://github.com/dejisec/papertrail.git
cd papertrail
uv sync
```

Create a `.env` (or copy `example.env`) and provide the required secrets:
```shell
SERPER_API_KEY=sk-serper-your-key-here

# Optional: enable CrewAI mode
OPENAI_API_KEY=sk-openai-...
PAPERTRAIL_USE_CREW=0  # set to 1 to run the CrewAI pipeline
```

Run the CLI via `uv run papertrail`:
```shell
papertrail \
  --domain example.com \
  --types pdf,docx,xlsx \
  --keywords finance,"internal audit" \
  --max-results 10 \
  --max-size-mb 75 \
  --out ./output
```

Key options:
- `--domain`/`-d` (required) – host to scope `site:` searches.
- `--types`/`-t` (required) – comma-separated list of allowed extensions (pdf, docx, xlsx, doc, xls, ppt, pptx).
- `--keywords`/`-k` – optional comma-separated keywords, quoted automatically if they include spaces.
- `--max-results` – cap on normalized results (per domain, shared across types).
- `--max-size-mb` – reject downloads larger than this threshold.
- `--out`/`-o` – case directory storing all downloads, cache, and reports.
- `--db-path` – custom SQLite path (defaults to `<out>/papertrail.db`).
- `--resume` – reuse cached queries and downloads from previous runs.
- `--json` – emit the parsed configuration as JSON (useful for automation) and store it as `papertrail-config.json` inside the output folder.
- `--verbose` – enable structured DEBUG logs for troubleshooting.
Each run produces:
- `results.json` – normalized Serper search hits plus run context.
- `downloads.json` – outcome per URL (status, MIME, SHA-256, size, final URL).
- `files/` – downloaded documents organized by file type.
- `metadata.json`/`metadata.csv` – structured metadata records.
- `findings.json` – aggregate counts (authors, editors, email domains, etc.).
- `report.md` – Markdown summary ready for briefings.
- `papertrail.db` – SQLite cache that powers `--resume`.
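The aggregation behind `findings.json` can be approximated like this; the record fields (`author`, `email_domain`) are illustrative assumptions about the metadata schema, not the tool's exact field names:

```python
from collections import Counter

def aggregate_findings(records: list[dict]) -> dict:
    """Tally recurring metadata values across extracted documents (sketch)."""
    authors = Counter(r["author"] for r in records if r.get("author"))
    domains = Counter(r["email_domain"] for r in records if r.get("email_domain"))
    return {
        "authors": dict(authors.most_common()),
        "email_domains": dict(domains.most_common()),
    }

sample = [
    {"author": "jdoe", "email_domain": "example.com"},
    {"author": "jdoe"},
    {"author": "asmith", "email_domain": "example.com"},
]
print(aggregate_findings(sample))
```

Counting repeated authors and email domains across many documents is what surfaces naming conventions and internal usernames for a target organization.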
When `PAPERTRAIL_USE_CREW=1`, PaperTrail also writes agent artifacts under `output/agents/*.json` (search payloads, fetch logs, metadata listings, CrewAI session transcripts, and more). `report.md` references those paths in the "Agent Outputs" section.
If both `PAPERTRAIL_USE_CREW=1` and `OPENAI_API_KEY` are set, the CLI routes to `PapertrailCrewRunner`, which:

- Runs the same deterministic search/download/metadata pipeline.
- Saves JSON artifacts per stage for auditing.
- Calls CrewAI's sequential template (agents/tasks defined in `papertrail/config/agents.yaml` and `config/tasks.yaml`).
- Records the CrewAI session output so the findings include LLM reasoning.
This mode is completely optional; when the environment variables are absent the CLI falls back to the direct pipeline.
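The fallback decision amounts to a simple environment check; this sketch uses a hypothetical helper name to illustrate the routing described above:

```python
import os

def use_crew_pipeline(env: dict[str, str]) -> bool:
    """Decide whether to route to the CrewAI runner (illustrative sketch).

    Both the feature flag and an OpenAI key must be present; otherwise
    the direct deterministic pipeline runs.
    """
    return env.get("PAPERTRAIL_USE_CREW") == "1" and bool(env.get("OPENAI_API_KEY"))

print(use_crew_pipeline({"PAPERTRAIL_USE_CREW": "1", "OPENAI_API_KEY": "sk-..."}))  # → True
print(use_crew_pipeline(os.environ.copy()))  # depends on the current shell
```

Passing the environment in as a mapping (rather than reading `os.environ` inside the function) keeps the decision easy to test.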