Stage3_Interpretation/A_Plotting/src/utilities.py:129 uses PyPDF2's PdfMerger.append in a loop to combine all per-target PDFs at the end of cNMF_perturbed_gene_analysis.py (called via merge_pdfs_in_folder on line 200 of the Slurm script).
Run on a directory with ~1,950 per-target PDFs, the merge appeared to hang indefinitely: for more than an hour after the per-target loop completed there was no progress, no error, and no log output, and the SLURM job stayed in the RUNNING state until it was cancelled. The per-target PDFs were intact on disk; only the convenience-merged combined PDF was missing.
Switching to pdfunite (Poppler) merged the same ~1,950 PDFs into one ~880 MB file in 34 seconds:
pdfunite *.pdf merged_perturbed_gene_QC.pdf
Suggested fix
Replace the PyPDF2 path with a subprocess.run(["pdfunite", ...]) call when pdfunite is on PATH, falling back to PyPDF2 only when it isn't.
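A minimal sketch of that fallback logic, not the repository's actual merge_pdfs_in_folder. The names merge_pdfs, build_pdfunite_cmd, output_name, and timeout are hypothetical. It assumes the inputs are every *.pdf in one folder and that sorted filename order is an acceptable page order. PyPDF2 is imported only when the fallback branch runs.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdfunite_cmd(inputs, output):
    """Return the argv list for pdfunite: input files first, output file last."""
    return ["pdfunite", *map(str, inputs), str(output)]


def merge_pdfs(folder, output_name="merged.pdf", timeout=None):
    """Merge every PDF in `folder`; return the name of the backend that was used."""
    folder = Path(folder)
    output = folder / output_name
    # Sort for a stable page order and skip the output file itself,
    # so a rerun does not merge the previous result back in.
    inputs = sorted(p for p in folder.glob("*.pdf") if p != output)
    if not inputs:
        raise FileNotFoundError(f"no PDFs found in {folder}")

    if shutil.which("pdfunite"):
        # check=True raises CalledProcessError on a non-zero exit;
        # timeout (seconds) raises TimeoutExpired instead of hanging silently.
        subprocess.run(build_pdfunite_cmd(inputs, output), check=True, timeout=timeout)
        return "pdfunite"

    # Fallback only when pdfunite is not on PATH.
    from PyPDF2 import PdfMerger

    merger = PdfMerger()
    try:
        for path in inputs:
            merger.append(str(path))
        with open(output, "wb") as fh:
            merger.write(fh)
    finally:
        merger.close()
    return "PyPDF2"
```

Passing an argument list rather than a shell string avoids relying on shell glob expansion. A timeout would also turn a silent hang like the one described above into an exception that shows up in the job log.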
Related
It is also worth filtering the file list to skip 0-byte PDFs before merging, since both PyPDF2 and pdfunite choke on them. Our run had a small number of 0-byte per-target PDFs: genes whose plot creation failed silently inside the parallel loop and left an empty file.
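One way the 0-byte filter could look, as a sketch. The helper name split_empty_pdfs is hypothetical. It returns the empty files as well as the usable ones, so the failed targets can be logged rather than silently dropped.

```python
from pathlib import Path


def split_empty_pdfs(folder):
    """Return (usable, empty): PDFs in `folder` with size > 0, and 0-byte PDFs."""
    usable, empty = [], []
    for path in sorted(Path(folder).glob("*.pdf")):
        # st_size == 0 marks a file that was created but never written to.
        (usable if path.stat().st_size > 0 else empty).append(path)
    return usable, empty
```

A size check only catches completely empty files. A truncated or otherwise corrupt PDF would still reach the merger, so this is a cheap first filter and not a validity check.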