PaperTrail discovers public documents for a target domain, downloads them, extracts the embedded metadata, and turns the results into structured JSON/CSV reports. It also includes an optional CrewAI orchestration path so each pipeline phase can be delegated to autonomous agents.
- Deterministic Google dorks – builds Serper.dev queries per file type and optional keywords, with caching/resume support backed by SQLite.
- Rich metadata extraction – merges pypdf, OOXML, legacy Office, and ExifTool signals, then derives usernames, email domains, and filesystem paths.
- Actionable reporting – writes `results.json`, `downloads.json`, `metadata.json`, `metadata.csv`, `findings.json`, and a Markdown `report.md` summarizing trends across all documents.
- Agent mode – when enabled, captures per-step artifacts plus CrewAI session output so analysts can audit or reuse what each agent produced.
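The deterministic dork construction can be sketched roughly as follows; the function name and exact query shape are illustrative assumptions, not PaperTrail's actual internals:

```python
def build_dorks(domain: str, types: list[str], keywords: list[str]) -> list[str]:
    """Build one site:-scoped Google dork per file type (illustrative sketch).

    Keywords containing spaces are wrapped in quotes, mirroring the CLI's
    automatic quoting behavior.
    """
    quoted = [f'"{k}"' if " " in k else k for k in keywords]
    keyword_part = " ".join(quoted)
    queries = []
    for ext in types:
        query = f"site:{domain} filetype:{ext}"
        if keyword_part:
            query += f" {keyword_part}"
        queries.append(query)
    return queries

print(build_dorks("example.com", ["pdf", "docx"], ["finance", "internal audit"]))
```

Because the queries are a pure function of domain, types, and keywords, identical runs produce identical queries, which is what makes SQLite-backed caching and `--resume` practical.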
- Python 3.12+
- uv
- A Serper.dev API key (used to access Google Search results)
```shell
git clone https://github.com/dejisec/papertrail.git
cd papertrail
uv sync
```

Create a `.env` (or copy `example.env`) and provide the required secrets:
```shell
SERPER_API_KEY=sk-serper-your-key-here

# Optional: enable CrewAI mode
OPENAI_API_KEY=sk-openai-...
PAPERTRAIL_USE_CREW=0  # set to 1 to run the CrewAI pipeline
```

Run the CLI via `uv run papertrail`:
```shell
papertrail \
  --domain example.com \
  --types pdf,docx,xlsx \
  --keywords finance,"internal audit" \
  --max-results 10 \
  --max-size-mb 75 \
  --out ./output
```

Key options:
- `--domain`/`-d` (required) – host to scope `site:` searches.
- `--types`/`-t` (required) – comma-separated list of allowed extensions (pdf, docx, xlsx, doc, xls, ppt, pptx).
- `--keywords`/`-k` – optional comma-separated keywords, quoted automatically if they include spaces.
- `--max-results` – cap on normalized results (per domain, shared across types).
- `--max-size-mb` – reject downloads larger than this threshold.
- `--out`/`-o` – case directory storing all downloads, cache, and reports.
- `--db-path` – custom SQLite path (defaults to `<out>/papertrail.db`).
- `--resume` – reuse cached queries and downloads from previous runs.
- `--json` – emit the parsed configuration as JSON (useful for automation) and store it as `papertrail-config.json` inside the output folder.
- `--verbose` – enable structured DEBUG logs for troubleshooting.
Each run produces:
- `results.json` – normalized Serper search hits plus run context.
- `downloads.json` – outcome per URL (status, MIME, SHA-256, size, final URL).
- `files/` – downloaded documents organized by file type.
- `metadata.json`/`metadata.csv` – structured metadata records.
- `findings.json` – aggregate counts (authors, editors, email domains, etc.).
- `report.md` – Markdown summary ready for briefings.
- `papertrail.db` – SQLite cache that powers `--resume`.
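The aggregation behind `findings.json` can be approximated like this; the record fields (`author`, `email_domain`) are illustrative assumptions about the metadata schema, not the tool's exact field names:

```python
from collections import Counter

def aggregate_findings(records: list[dict]) -> dict:
    """Tally recurring metadata values across extracted documents (sketch)."""
    authors = Counter(r["author"] for r in records if r.get("author"))
    domains = Counter(r["email_domain"] for r in records if r.get("email_domain"))
    return {
        "authors": dict(authors.most_common()),
        "email_domains": dict(domains.most_common()),
    }

sample = [
    {"author": "jdoe", "email_domain": "example.com"},
    {"author": "jdoe"},
    {"author": "asmith", "email_domain": "example.com"},
]
print(aggregate_findings(sample))
```

Counting repeated authors and email domains across many documents is what surfaces naming conventions and internal usernames for a target organization.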
When `PAPERTRAIL_USE_CREW=1`, PaperTrail also writes agent artifacts under `output/agents/*.json` (search payloads, fetch logs, metadata listings, CrewAI session transcripts, and more). `report.md` references those paths in the "Agent Outputs" section.
If both `PAPERTRAIL_USE_CREW=1` and `OPENAI_API_KEY` are set, the CLI routes to `PapertrailCrewRunner`, which:

- Runs the same deterministic search/download/metadata pipeline.
- Saves JSON artifacts per stage for auditing.
- Calls CrewAI's sequential template (agents/tasks defined in `papertrail/config/agents.yaml` and `config/tasks.yaml`).
- Records the CrewAI session output so the findings include LLM reasoning.
This mode is completely optional; when the environment variables are absent the CLI falls back to the direct pipeline.
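The fallback decision amounts to a simple environment check; this sketch uses a hypothetical helper name to illustrate the routing described above:

```python
import os

def use_crew_pipeline(env: dict[str, str]) -> bool:
    """Decide whether to route to the CrewAI runner (illustrative sketch).

    Both the feature flag and an OpenAI key must be present; otherwise
    the direct deterministic pipeline runs.
    """
    return env.get("PAPERTRAIL_USE_CREW") == "1" and bool(env.get("OPENAI_API_KEY"))

print(use_crew_pipeline({"PAPERTRAIL_USE_CREW": "1", "OPENAI_API_KEY": "sk-..."}))  # → True
print(use_crew_pipeline(os.environ.copy()))  # depends on the current shell
```

Passing the environment in as a mapping (rather than reading `os.environ` inside the function) keeps the decision easy to test.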