fresva/zoteroPDFintegrator

Zotero PDF Integrator

Matches local PDF files to existing Zotero library records and uploads them as stored copies. Designed for researchers who have migrated references from EndNote (or similar) and need to attach their annotated PDFs to the corresponding Zotero records.

Why

Zotero cannot natively scan local folders and match PDFs to existing records. This tool automates the matching using metadata extracted from the PDFs via GROBID, then uploads matched files via the Zotero API. The process is split into two phases with a manual review step in between, so nothing touches your Zotero library without your explicit approval.

How it works

Phase 1: Build mapping CSV (phase1_build_csv.py)

  1. Scans configured PDF folders recursively
  2. Detects annotated/original pairs (the annotated copy is the file whose name ends in - annotated.pdf)
  3. Extracts metadata from each PDF using GROBID (title, authors, DOI, journal, volume, issue, pages, year)
  4. Fetches all records from your Zotero library (cached locally in SQLite)
  5. Matches PDFs to Zotero records using a tiered strategy:
    • Tier 1: DOI exact match (confidence 100)
    • Tier 2: Title + Year fuzzy match (confidence ~90-98)
    • Tier 3: Author + Year + Title fragment (confidence ~80-90)
    • Tier 4: Author + Year only (confidence ~60-75)
    • Tier 5: General fuzzy fallback (confidence varies)
  6. Deduplicates when the same paper exists in multiple folders (keeps the annotated version)
  7. Outputs a CSV for manual review
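
The first two matching tiers can be sketched roughly as follows. This is a minimal illustration, not the code in phase1_build_csv.py: it uses the standard-library difflib in place of rapidfuzz so it runs without extra dependencies, the dict keys (doi, title, year, key) and the 90-point threshold are assumptions, and the confidence it returns is the raw similarity score rather than the tool's own scaling.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so comparisons ignore case and spacing."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Return a 0-100 similarity score (difflib stand-in for a fuzzy matcher)."""
    return 100.0 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_pdf(pdf, records, title_threshold=90.0):
    """Return (zotero_key, tier, confidence) for the best match, or None.

    pdf and records are dicts with hypothetical keys: doi, title, year;
    each record also carries its Zotero item key under "key".
    """
    # Tier 1: exact DOI match, compared case-insensitively.
    doi = (pdf.get("doi") or "").strip().lower()
    if doi:
        for rec in records:
            if (rec.get("doi") or "").strip().lower() == doi:
                return rec["key"], 1, 100.0

    # Tier 2: fuzzy title match, restricted to records from the same year.
    best = None
    for rec in records:
        if pdf.get("year") and rec.get("year") == pdf.get("year"):
            score = similarity(pdf.get("title", ""), rec.get("title", ""))
            if score >= title_threshold and (best is None or score > best[2]):
                best = (rec["key"], 2, score)
    return best
```

Trying the cheapest and most reliable signal first (the DOI) means the fuzzy comparison only runs for PDFs where no exact identifier was found. The lower tiers would follow the same pattern with looser criteria.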

Manual review

Open the CSV and edit the action column:

  • UPLOAD — approve for upload
  • SKIP — ignore this file
  • Leave REVIEW_* as-is if unsure (these rows will not be uploaded)
  • Correct zotero_key if the match is wrong
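
As an illustration of how the action column gates the upload, the sketch below reads a reviewed CSV and keeps only rows marked UPLOAD. The action and zotero_key columns come from the description above; the pdf_path column and the REVIEW_LOW value are hypothetical stand-ins, and the real script may filter differently.

```python
import csv
import io


def approved_rows(csv_file):
    """Yield rows whose action column is exactly UPLOAD (after trimming)."""
    for row in csv.DictReader(csv_file):
        if (row.get("action") or "").strip() == "UPLOAD":
            yield row


# Example with an in-memory CSV; pdf_path and REVIEW_LOW are made-up names.
sample = io.StringIO(
    "pdf_path,zotero_key,action\n"
    "a.pdf,ABCD1234,UPLOAD\n"
    "b.pdf,EFGH5678,REVIEW_LOW\n"
    "c.pdf,,SKIP\n"
)
uploads = list(approved_rows(sample))
print([row["zotero_key"] for row in uploads])
```

Only an exact UPLOAD passes the filter, so SKIP rows and rows left as REVIEW_* are both ignored.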

Phase 2: Upload to Zotero (phase2_upload.py)

  1. Reads the reviewed CSV
  2. Checks which Zotero records already have PDF attachments (skips those)
  3. Uploads approved PDFs as stored copies
  4. Creates symlinks in unmapped_pdfs/ for files that were not uploaded, making manual handling easier
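
Step 4 might look something like the sketch below, which links each PDF that was not uploaded into a target folder. The function name and the choice to skip name collisions are assumptions made for illustration; they are not taken from phase2_upload.py.

```python
import tempfile
from pathlib import Path


def link_unmapped(pdf_paths, target_dir):
    """Create one symlink per PDF inside target_dir; return the links created."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    created = []
    for pdf in map(Path, pdf_paths):
        link = target / pdf.name
        # Skip names that already exist so repeated runs do not fail.
        if link.exists() or link.is_symlink():
            continue
        link.symlink_to(pdf.resolve())
        created.append(link)
    return created


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    pdf = Path(tmp) / "paper.pdf"
    pdf.write_bytes(b"%PDF-1.4")
    links = link_unmapped([pdf], Path(tmp) / "unmapped_pdfs")
    is_link = links[0].is_symlink()
    print(links[0].name, is_link)
```

Symlinks keep the originals in place, so the unmapped folder is a view of what is left to handle, not a second copy. Two PDFs with the same file name in different folders would collide in this sketch; only the first gets a link.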

Prerequisites

  • Python 3.10+
  • GROBID running at localhost:8070 (via Docker)
  • Zotero API key with read/write access to your personal library

Setup

  1. Clone this repo and install dependencies:

    pip install pyzotero rapidfuzz python-dotenv requests
    
  2. Start GROBID (add to your docker-compose.yaml or run directly):

    docker run -d --name grobid -p 8070:8070 lfoppiano/grobid:0.8.2
    
  3. Copy .env.example to .env and fill in your credentials:

    cp .env.example .env
    
  4. Edit PDF_SUBFOLDERS in phase1_build_csv.py to list the folders you want to scan (relative to PDF_BASE_FOLDER from .env).
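
Before the first run it can help to confirm that the GROBID container from step 2 is reachable. The sketch below is a standalone check, not part of this repo: it calls GROBID's /api/isalive endpoint using only the standard library, and the function name and timeout are assumptions.

```python
import urllib.request


def grobid_is_alive(base_url="http://localhost:8070", timeout=5):
    """Return True if GROBID's /api/isalive endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/isalive", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP errors
        # (urllib's URLError is a subclass of OSError).
        return False


if __name__ == "__main__":
    print("GROBID reachable:", grobid_is_alive())
```

If this prints False right after starting the container, GROBID may still be loading its models; waiting a short while and retrying is usually enough.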

Usage

First run (test with a small batch)

Set TEST_LIMIT = 10 in phase1_build_csv.py, then:

    python phase1_build_csv.py
    # Review the generated CSV, set action=UPLOAD for approved rows
    python phase2_upload.py zotero_mapping_XXXXXXXX.csv

Production run

Set TEST_LIMIT = None in phase1_build_csv.py, then:

    python phase1_build_csv.py
    # Review the generated CSV, set action=UPLOAD for approved rows
    python phase2_upload.py zotero_mapping_XXXXXXXX.csv

Batch re-runs

When running in batches, use --check-uploaded to mark previously uploaded records in the CSV so you can focus on what's new:

    python phase1_build_csv.py --check-uploaded

Records that already have PDFs in Zotero will be marked ALREADY_UPLOADED and sorted to the bottom of the CSV.
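
The "sorted to the bottom" behaviour can be pictured with a stable sort on the action column. This is only a sketch of the idea; the f field is a hypothetical stand-in for the other CSV columns, and the script's actual ordering logic may differ.

```python
def sort_uploaded_last(rows):
    """Return rows with ALREADY_UPLOADED entries moved to the end.

    sorted() is stable and False sorts before True, so the relative
    order within each group is preserved.
    """
    return sorted(rows, key=lambda row: row["action"] == "ALREADY_UPLOADED")


rows = [
    {"f": "a.pdf", "action": "ALREADY_UPLOADED"},
    {"f": "b.pdf", "action": "UPLOAD"},
    {"f": "c.pdf", "action": "ALREADY_UPLOADED"},
    {"f": "d.pdf", "action": "SKIP"},
]
ordered = sort_uploaded_last(rows)
print([row["f"] for row in ordered])
```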

Files

File                 Purpose
phase1_build_csv.py  Scan, extract metadata, match, produce CSV
phase2_upload.py     Upload approved PDFs, create symlinks for unmapped
.env                 Your Zotero credentials and PDF base path (not committed)
.env.example         Template for .env
zotero_cache.db      SQLite cache of Zotero records (auto-generated)
unmapped_pdfs/       Symlinks to PDFs that were not uploaded (for manual review)
