Matches local PDF files to existing Zotero library records and uploads them as stored copies. Designed for researchers who have migrated references from EndNote (or similar) and need to attach their annotated PDFs to the corresponding Zotero records.
Zotero cannot natively scan local folders and match PDFs to existing records. This tool automates the matching using metadata extracted from the PDFs via GROBID, then uploads matched files via the Zotero API. The process is split into two phases with a manual review step in between, so nothing touches your Zotero library without your explicit approval.
- Scans configured PDF folders recursively
- Detects annotated/original pairs (files ending in `- annotated.pdf`)
- Extracts metadata from each PDF using GROBID (title, authors, DOI, journal, volume, issue, pages, year)
- Fetches all records from your Zotero library (cached locally in SQLite)
- Matches PDFs to Zotero records using a tiered strategy:
- Tier 1: DOI exact match (confidence 100)
- Tier 2: Title + Year fuzzy match (confidence ~90-98)
- Tier 3: Author + Year + Title fragment (confidence ~80-90)
- Tier 4: Author + Year only (confidence ~60-75)
- Tier 5: General fuzzy fallback (confidence varies)
- Deduplicates when the same paper exists in multiple folders (keeps annotated version)
- Outputs a CSV for manual review
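
The first two tiers of the strategy above can be sketched as follows. This is an illustration, not the tool's actual code: it uses the standard library's `difflib` in place of `rapidfuzz` so it runs without extra dependencies, and the `Record` class, the `match_pdf` function, the cutoff of 90, and the cap of 98 are all assumptions chosen to mirror the confidence ranges listed above.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Record:
    """Hypothetical minimal record; the real tool tracks more fields."""
    key: str
    title: str
    year: str
    doi: str = ""


def title_similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score for two titles, ignoring case."""
    return 100 * SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_pdf(pdf: Record, library: list[Record], title_cutoff: float = 90.0):
    """Return (zotero_key, tier, confidence), or None if neither tier matches."""
    # Tier 1: exact DOI match, compared case-insensitively.
    if pdf.doi:
        for rec in library:
            if rec.doi and rec.doi.lower() == pdf.doi.lower():
                return rec.key, 1, 100.0
    # Tier 2: same year, best fuzzy title score at or above the cutoff.
    best = None
    for rec in library:
        if rec.year != pdf.year:
            continue
        # Cap at 98 (an assumption) so only a DOI match reports 100.
        score = min(title_similarity(pdf.title, rec.title), 98.0)
        if score >= title_cutoff and (best is None or score > best[2]):
            best = (rec.key, 2, score)
    return best


library = [
    Record("AAA", "Deep learning for protein folding", "2020", "10.1/abc"),
    Record("BBB", "A survey of graph neural networks", "2021"),
]
print(match_pdf(Record("", "A Survey of Graph Neural Networks", "2021"), library))
```

Trying the cheapest and most reliable signal first means a fuzzy score is only computed for PDFs that lack a usable DOI.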
Open the CSV and edit the `action` column:
- `UPLOAD`: approve for upload
- `SKIP`: ignore this file
- Leave `REVIEW_*` as-is if unsure (will not be uploaded)
- Correct `zotero_key` if the match is wrong
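
To make the review rules concrete, here is a minimal sketch of how a reviewed CSV could be filtered so that only explicitly approved rows go forward. The `action` and `zotero_key` columns come from the description above; the `pdf_path` column, the `REVIEW_LOW` value, and the `approved_rows` helper are hypothetical and may not match the real file.

```python
import csv
import io


def approved_rows(csv_file):
    """Yield only rows whose action is exactly UPLOAD."""
    for row in csv.DictReader(csv_file):
        # SKIP, REVIEW_* and anything else are deliberately ignored.
        if row["action"].strip() == "UPLOAD":
            yield row


# In-memory stand-in for a reviewed CSV (column layout is an assumption).
sample = io.StringIO(
    "pdf_path,zotero_key,action\n"
    "a.pdf,ABCD1234,UPLOAD\n"
    "b.pdf,EFGH5678,REVIEW_LOW\n"
    "c.pdf,IJKL9012,SKIP\n"
)
uploads = list(approved_rows(sample))
print([row["zotero_key"] for row in uploads])
```

Matching on the exact string `UPLOAD` rather than excluding `SKIP` keeps the default safe: a row that was never reviewed is never uploaded.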
- Reads the reviewed CSV
- Checks which Zotero records already have PDF attachments (skips those)
- Uploads approved PDFs as stored copies
- Creates symlinks in `unmapped_pdfs/` for files that were not uploaded, making manual handling easier
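
The symlink step could look roughly like the sketch below. The `link_unmapped` helper is hypothetical, and the way it handles name collisions (keeping the first file and leaving later ones unlinked) is an assumption, not necessarily what `phase2_upload.py` does.

```python
import tempfile
from pathlib import Path


def link_unmapped(pdfs, target_dir):
    """Create one symlink per PDF inside target_dir and return the link paths."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for pdf in pdfs:
        link = target_dir / Path(pdf).name
        # Skip names that are already taken so reruns do not fail.
        if not link.is_symlink() and not link.exists():
            link.symlink_to(Path(pdf).resolve())
        links.append(link)
    return links


# Small self-contained demo in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "paper.pdf"
    src.write_bytes(b"%PDF-1.4")
    (link,) = link_unmapped([src], Path(tmp) / "unmapped_pdfs")
    demo_ok = link.is_symlink() and link.resolve() == src.resolve()
    print(demo_ok)
```

Symlinks leave the original files where they are, so the folder is only a convenient view for manual handling. On Windows, creating symlinks may require elevated privileges or developer mode.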
- Python 3.10+
- GROBID running at `localhost:8070` (via Docker)
- Zotero API key with read/write access to your personal library
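
Before starting a long run, it can help to confirm that GROBID is reachable. The sketch below calls GROBID's `/api/isalive` health endpoint using only the standard library; the `grobid_is_alive` helper and the 3-second timeout are assumptions and are not part of this tool.

```python
import urllib.error
import urllib.request


def grobid_is_alive(base_url: str = "http://localhost:8070", timeout: float = 3.0) -> bool:
    """Return True if the GROBID service answers its health check."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/isalive", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("GROBID reachable:", grobid_is_alive())
```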
- Clone this repo and install dependencies:

  ```bash
  pip install pyzotero rapidfuzz python-dotenv requests
  ```

- Start GROBID (add to your `docker-compose.yaml` or run directly):

  ```bash
  docker run -d --name grobid -p 8070:8070 lfoppiano/grobid:0.8.2
  ```

- Copy `.env.example` to `.env` and fill in your credentials:

  ```bash
  cp .env.example .env
  ```

- Edit `PDF_SUBFOLDERS` in `phase1_build_csv.py` to list the folders you want to scan (relative to `PDF_BASE_FOLDER` from `.env`).
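
As an illustration of how `PDF_SUBFOLDERS` and `PDF_BASE_FOLDER` fit together, the sketch below resolves each subfolder against the base folder and scans it recursively. The folder names and the `find_pdfs` helper are hypothetical; only the two setting names come from this README.

```python
import os
from pathlib import Path

# Hypothetical example values; replace with your own folder names.
PDF_SUBFOLDERS = ["EndNote/PDF", "Reading/2023"]


def find_pdfs(base_folder, subfolders):
    """Recursively collect PDF files under each subfolder of base_folder."""
    base = Path(base_folder)
    found = []
    for sub in subfolders:
        folder = base / sub
        if not folder.is_dir():
            continue  # Ignore configured folders that do not exist.
        # Compare the suffix case-insensitively so ".PDF" files are included.
        found.extend(
            p for p in folder.rglob("*") if p.is_file() and p.suffix.lower() == ".pdf"
        )
    return sorted(found)


if __name__ == "__main__":
    # Falls back to the current directory when PDF_BASE_FOLDER is not set.
    for pdf in find_pdfs(os.environ.get("PDF_BASE_FOLDER", "."), PDF_SUBFOLDERS):
        print(pdf)
```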
Set `TEST_LIMIT = 10` in `phase1_build_csv.py`, then:

```bash
python phase1_build_csv.py
# Review the generated CSV, set action=UPLOAD for approved rows
python phase2_upload.py zotero_mapping_XXXXXXXX.csv
```

Set `TEST_LIMIT = None`, then:

```bash
python phase1_build_csv.py
# Review CSV
python phase2_upload.py zotero_mapping_XXXXXXXX.csv
```

When running in batches, use `--check-uploaded` to mark previously uploaded records in the CSV so you can focus on what's new:

```bash
python phase1_build_csv.py --check-uploaded
```

Records that already have PDFs in Zotero will be marked `ALREADY_UPLOADED` and sorted to the bottom of the CSV.
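
The mark-and-sort behaviour can be pictured with the sketch below. The `mark_uploaded` helper and the row layout are assumptions for illustration; the real script may implement this differently.

```python
def mark_uploaded(rows, keys_with_pdf):
    """Flag rows whose Zotero record already has a PDF and move them last."""
    for row in rows:
        if row["zotero_key"] in keys_with_pdf:
            row["action"] = "ALREADY_UPLOADED"
    # sorted() is stable and False sorts before True, so the remaining
    # rows keep their original order and flagged rows sink to the bottom.
    return sorted(rows, key=lambda r: r["action"] == "ALREADY_UPLOADED")


rows = [
    {"zotero_key": "A", "action": "UPLOAD"},
    {"zotero_key": "B", "action": "REVIEW_LOW"},  # hypothetical REVIEW_* value
    {"zotero_key": "C", "action": "SKIP"},
]
result = mark_uploaded(rows, keys_with_pdf={"A"})
print([r["zotero_key"] for r in result])
```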
| File | Purpose |
|---|---|
| `phase1_build_csv.py` | Scan, extract metadata, match, produce CSV |
| `phase2_upload.py` | Upload approved PDFs, create symlinks for unmapped |
| `.env` | Your Zotero credentials and PDF base path (not committed) |
| `.env.example` | Template for `.env` |
| `zotero_cache.db` | SQLite cache of Zotero records (auto-generated) |
| `unmapped_pdfs/` | Symlinks to PDFs that were not uploaded (for manual review) |