merge_pdfs_in_folder hangs indefinitely with PyPDF2 PdfMerger on thousands of PDFs #8

@adamklie

Description

Stage3_Interpretation/A_Plotting/src/utilities.py:129 uses PyPDF2's PdfMerger.append in a loop to combine all per-target PDFs at the end of cNMF_perturbed_gene_analysis.py (called via merge_pdfs_in_folder on line 200 of the Slurm script).

Run on a directory with ~1,950 per-target PDFs, the merge hung indefinitely: no progress for over an hour after the per-target loop completed, no errors, and no log output. The SLURM job stayed in RUNNING state until cancelled. The per-target PDFs were intact on disk; only the convenience-merged combined PDF was missing.

Switching to pdfunite (Poppler) merged the same ~1,950 PDFs into one ~880 MB file in 34 seconds:

pdfunite *.pdf merged_perturbed_gene_QC.pdf

Suggested fix

Replace the PyPDF2 path with a subprocess.run(["pdfunite", ...]) call when pdfunite is on PATH, falling back to PyPDF2 only when it isn't.
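
A minimal sketch of what that could look like, assuming the merge step receives an explicit list of input paths. The function name `merge_pdfs` and its return value are hypothetical and not taken from utilities.py; the only external calls used are `shutil.which`, `subprocess.run`, and the same `PdfMerger.append` path the code uses today.

```python
import shutil
import subprocess


def merge_pdfs(pdf_paths, output_path):
    """Merge PDFs with pdfunite when it is on PATH, else fall back to PyPDF2.

    Hypothetical helper: returns the name of the backend that was used.
    """
    inputs = [str(p) for p in pdf_paths]
    if not inputs:
        raise ValueError("no input PDFs to merge")

    pdfunite = shutil.which("pdfunite")
    if pdfunite is not None:
        # pdfunite takes the input files followed by the output file.
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run([pdfunite, *inputs, str(output_path)], check=True)
        return "pdfunite"

    # Fallback: the existing PyPDF2 path, imported lazily so that PyPDF2
    # is only required when pdfunite is unavailable.
    from PyPDF2 import PdfMerger

    merger = PdfMerger()
    try:
        for path in inputs:
            merger.append(path)
        merger.write(str(output_path))
    finally:
        merger.close()
    return "PyPDF2"
```

Passing the arguments as a list (rather than a shell string with `*.pdf`) avoids shell globbing and quoting issues, and lets the caller control the file order and exclude a previously written merged PDF.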

Related

It would also be worth filtering the file list to skip 0-byte PDFs before merging, since both PyPDF2 and pdfunite choke on those. Our run had a small number of 0-byte per-target PDFs: genes whose plot creation failed silently inside the parallel loop and produced an empty file.
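
One way the filtering could be done, sketched with `pathlib` only. The function name `nonempty_pdfs` and the `exclude` parameter are hypothetical; `exclude` is there so that a merged output file already sitting in the same folder is not fed back into the merge.

```python
from pathlib import Path


def nonempty_pdfs(folder, exclude=()):
    """Return (kept, skipped) lists of PDF paths in `folder`.

    `kept` holds PDFs with at least one byte, sorted by name.
    `skipped` holds 0-byte PDFs so the caller can log them.
    File names listed in `exclude` are ignored entirely.
    """
    excluded_names = set(exclude)
    kept, skipped = [], []
    for path in sorted(Path(folder).glob("*.pdf")):
        if path.name in excluded_names:
            continue
        if path.stat().st_size == 0:
            skipped.append(path)
        else:
            kept.append(path)
    return kept, skipped
```

Returning the skipped paths instead of dropping them silently makes it possible to report which targets failed to plot.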
