Improve file matching logic (Boite filename - S3 filename) #21

@PascalEgn

Description

Extend the basic matcher from #24 to handle more file types and multiple roots, and to provide better observability when it runs in Airflow.

Work involved

  • Add a parameter to configure the types that should be matched. Examples are: PDF, PDF_LATEX, TIFF, PDF_OCR, PDF_TRANSMIS
  • Add a parameter to configure multiple roots. The default root is raw/<TYPE>/..., but more roots can be passed, for example raw/CORRECTIONS_2/<TYPE>/.... The order of the list determines which root is preferred: for example, the function should search the basic raw/ root only if a file could not be found in the CORRECTIONS_2 root.
  • Add run summary metrics (total matched, total unmatched)
  • Add an optional report mode. This should produce detailed output in the written logs (basically the summary metrics and the matches for all files listed in the boite files)
  • Add an optional dry-run mode. This should stop script execution after the matching has happened, so no XML etc. gets created afterwards.
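
The root-precedence rule could be sketched roughly as follows. This is only an illustration: `find_key`, `match_files`, the `(name, file_type)` entry shape and the exact-basename comparison are all assumptions, not the existing matcher from #24.

```python
from typing import Dict, Iterable, Optional, Sequence, Tuple

# Example type names taken from the issue text; the real default list is an assumption.
DEFAULT_TYPES = ("PDF", "PDF_LATEX", "TIFF", "PDF_OCR", "PDF_TRANSMIS")


def find_key(
    name: str,
    file_type: str,
    roots: Sequence[str],
    s3_keys: Sequence[str],
) -> Optional[str]:
    """Return the first S3 key for `name` under `<root>/<file_type>/`.

    Roots are tried in list order, so earlier roots are preferred.
    Hypothetical rule: a key matches when its basename equals `name`.
    """
    for root in roots:
        prefix = f"{root.rstrip('/')}/{file_type}/"
        for key in s3_keys:
            if key.startswith(prefix) and key.rsplit("/", 1)[-1] == name:
                return key
    return None


def match_files(
    entries: Iterable[Tuple[str, str]],
    roots: Sequence[str],
    s3_keys: Sequence[str],
    file_types: Sequence[str] = DEFAULT_TYPES,
) -> Dict[Tuple[str, str], Optional[str]]:
    """Match (name, file_type) entries, skipping types that are not configured."""
    return {
        (name, file_type): find_key(name, file_type, roots, s3_keys)
        for name, file_type in entries
        if file_type in file_types
    }
```

With `roots = ["raw/CORRECTIONS_2", "raw"]`, a file present under both roots resolves to the CORRECTIONS_2 key, and a file present only under `raw/` falls back to that root.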

Acceptance criteria

  • Matcher supports all configured file types
  • Files are found whether they are under primary roots or corrections roots
  • Report mode with detailed logs exists
  • Dry-run mode is implemented
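
One possible shape for the summary metrics, report mode and dry-run switch is sketched below. The names `MatchSummary`, `summarize` and `finish_matching`, the logger name and the log line formats are hypothetical; the sketch assumes the matcher yields a mapping from boite filename to the matched S3 key, or `None` when nothing was found.

```python
import logging
from dataclasses import dataclass
from typing import Dict, Optional

logger = logging.getLogger("file_matcher")  # hypothetical logger name


@dataclass
class MatchSummary:
    total_matched: int
    total_unmatched: int


def summarize(matches: Dict[str, Optional[str]]) -> MatchSummary:
    """Count matched and unmatched entries."""
    matched = sum(1 for key in matches.values() if key is not None)
    return MatchSummary(matched, len(matches) - matched)


def finish_matching(
    matches: Dict[str, Optional[str]],
    report: bool = False,
    dry_run: bool = False,
) -> bool:
    """Log the results and return True if later steps (XML etc.) should run."""
    summary = summarize(matches)
    logger.info(
        "total matched: %d, total unmatched: %d",
        summary.total_matched,
        summary.total_unmatched,
    )
    if report:
        # Report mode: one log line per file listed in the boite files.
        for name, key in sorted(matches.items()):
            logger.info("%s -> %s", name, key if key is not None else "UNMATCHED")
    # Dry-run: signal the caller to stop after matching.
    return not dry_run
```

A caller would check the return value and skip XML creation when it is `False`.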

Metadata

    Labels

    File Import Project: This task is related to the file import project of digitization