Rewrite Script to check for corrupt or empty PDFs #22

@PascalEgn

Description

We should improve/rewrite the script to check for corrupt or empty PDFs to prepare it for the migration to Airflow.
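As a starting point for discussion, here is a minimal sketch of what a cheap "corrupt or empty" check could look like on the raw bytes of a downloaded object. It is not the existing script's logic. It assumes "empty" means zero bytes (or whitespace only) and "corrupt" means the `%PDF-` header or the `%%EOF` trailer is missing; the names `PdfStatus` and `classify_pdf` are hypothetical. A file with no pages or with damaged internals would still pass this check, so a real PDF parser may be needed on top.

```python
from enum import Enum


class PdfStatus(Enum):
    OK = "ok"
    EMPTY = "empty"
    CORRUPT = "corrupt"


def classify_pdf(data: bytes) -> PdfStatus:
    """Cheap structural check on raw bytes; this is not a full PDF parse."""
    # Assumption: "empty" means the object has no content at all.
    if not data.strip():
        return PdfStatus.EMPTY
    # A PDF file begins with a "%PDF-" header. Look in the first 1024 bytes
    # so that a little leading junk does not cause a false alarm.
    if b"%PDF-" not in data[:1024]:
        return PdfStatus.CORRUPT
    # The "%%EOF" marker is expected near the end; a truncated upload
    # usually lacks it.
    if b"%%EOF" not in data[-1024:]:
        return PdfStatus.CORRUPT
    return PdfStatus.OK
```

Working on bytes keeps the check independent of how the object is fetched, so it can be unit-tested without S3 access.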

This includes rethinking which function parameters would make sense. Some ideas:

input_url:

Reads the existing BOITE_O0XXX files in the shared CERNBox directory and checks whether the given file numbers contain corrupt or empty PDFs on S3. Here is an example URL with some files: https://cernbox.cern.ch/s/QslvWRIPsBcDAOK

List of Numbers:

A list of BOITE file numbers which should be checked on S3.

Range of Numbers:

Start and end of a number range to check on S3. For example, 205..300 would check every folder from 205 through 300 inclusive (205, 206, 207, ..., 300).

bucket_name:

Name of the S3 bucket to be checked.

base_prefix:

S3 path to the PDF files to be checked (e.g. raw/PDF/ or raw/CORRECTIONS_2/PDF_OCR/).

output_url:

CERNBox URL to which the generated log file should be uploaded. Here is an example folder where files can be uploaded: https://cernbox.cern.ch/s/OBzMIMo6fDb7gCc
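To make the number-related parameters above concrete, here is a rough sketch of how a range such as 205..300 could be expanded and combined with bucket_name and base_prefix into the S3 prefixes to scan. `parse_range`, `CheckConfig`, and `folder_template` are hypothetical names. The folder naming (`BOITE_O0` followed by a three-digit number directly under base_prefix) is an assumption read from the BOITE_O0XXX pattern and should be checked against the real bucket layout.

```python
from __future__ import annotations

from dataclasses import dataclass, field


def parse_range(spec: str) -> list[int]:
    """Expand "205..300" into [205, 206, ..., 300], both ends inclusive."""
    start_text, sep, end_text = spec.partition("..")
    if not sep:
        raise ValueError(f"expected START..END, got {spec!r}")
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError(f"range start {start} is greater than end {end}")
    return list(range(start, end + 1))


@dataclass
class CheckConfig:
    bucket_name: str
    base_prefix: str
    numbers: list[int] = field(default_factory=list)
    # Assumption: folders are named after the BOITE_O0XXX pattern with a
    # zero-padded three-digit number. Adjust if the bucket differs.
    folder_template: str = "BOITE_O0{number:03d}"

    def prefixes(self) -> list[str]:
        """Return one S3 key prefix per requested number, deduplicated and sorted."""
        base = self.base_prefix.rstrip("/") + "/"
        return [
            base + self.folder_template.format(number=n) + "/"
            for n in sorted(set(self.numbers))
        ]
```

For example, `CheckConfig("some-bucket", "raw/PDF/", numbers=parse_range("205..207")).prefixes()` yields the three prefixes for folders 205 to 207. The bucket name here is a placeholder.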

Work involved

  • Think about what parameters make sense
  • Establish connection to view and upload files from and to CERNBox
  • Implement useful parameters into the script
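One possible shape for the parameter handling is sketched below, treating the three ways of selecting files (input_url, a list of numbers, a range) as mutually exclusive. The flag names are hypothetical, and `argparse` is used only to keep the sketch dependency-free. After the Airflow migration these would more likely become DAG or task parameters.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Check S3 for corrupt or empty PDFs."
    )
    # Exactly one way of selecting the BOITE numbers must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--input-url", help="CERNBox share holding existing BOITE files")
    source.add_argument("--numbers", type=int, nargs="+", help="explicit BOITE file numbers")
    source.add_argument(
        "--range", dest="number_range", metavar="START..END",
        help="inclusive number range, e.g. 205..300",
    )
    parser.add_argument("--bucket-name", required=True, help="S3 bucket to check")
    parser.add_argument("--base-prefix", required=True, help="S3 path to the PDFs, e.g. raw/PDF/")
    # Optional: where to upload the generated log file.
    parser.add_argument("--output-url", help="CERNBox URL for the log upload")
    return parser
```

Passing both `--numbers` and `--range` makes `argparse` exit with a usage error, which avoids having to define a precedence between the selection modes.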

Acceptance criteria

Screenshots(Optional)

Metadata

Labels

File Import Project: This task is related to the file import project of digitization
