Description
Extend the basic matcher from #24 to handle more complexity, more file types, multiple roots and better observability for Airflow operation.
Work involved
- Add a parameter to configure which file types get matched, for example PDF, PDF_LATEX, TIFF, PDF_OCR, PDF_TRANSMIS.
- Add a parameter to configure multiple roots. The default root is raw/<TYPE>/..., but additional roots can be passed, for example raw/CORRECTIONS_2/<TYPE>/.... The order of the list determines which root is preferred: a file is searched for in the basic raw/ root only if it could not be found in the CORRECTIONS_2 root.
- Add run summary metrics (total matched, total unmatched)
- Add an optional report mode that writes detailed output to the logs (basically the summary metrics plus the match for every file listed in the boite files).
- Add an optional dry-run mode that stops script execution once matching has finished, so no XML or other outputs are created afterwards.
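The items above could be sketched roughly as follows. This is only an illustration, not the existing matcher from #24: the names `find_file`, `match_files`, `run_matcher`, `MatchSummary` and the `build_outputs` callback are hypothetical, and the sketch assumes a file sits directly under `<root>/<TYPE>/`, which the issue does not specify.

```python
import logging
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, List, Optional, Sequence

logger = logging.getLogger("matcher")


@dataclass
class MatchSummary:
    """Run summary: which files were matched and which were not."""
    matched: Dict[str, Path] = field(default_factory=dict)
    unmatched: List[str] = field(default_factory=list)

    @property
    def total_matched(self) -> int:
        return len(self.matched)

    @property
    def total_unmatched(self) -> int:
        return len(self.unmatched)


def find_file(name: str, base: Path, roots: Sequence[str],
              file_types: Sequence[str]) -> Optional[Path]:
    """Return the first hit, trying roots in list order (earlier = preferred)."""
    for root in roots:
        for file_type in file_types:
            # Assumed layout: <base>/<root>/<TYPE>/<name>
            candidate = base / root / file_type / name
            if candidate.is_file():
                return candidate
    return None


def match_files(names: Sequence[str], base: Path, roots: Sequence[str],
                file_types: Sequence[str], report: bool = False) -> MatchSummary:
    """Match every listed file and collect summary metrics."""
    summary = MatchSummary()
    for name in names:
        hit = find_file(name, base, roots, file_types)
        if hit is None:
            summary.unmatched.append(name)
        else:
            summary.matched[name] = hit
        if report:
            # Report mode: one log line per listed file.
            logger.info("%s -> %s", name, hit if hit else "UNMATCHED")
    logger.info("total matched=%d, total unmatched=%d",
                summary.total_matched, summary.total_unmatched)
    return summary


def run_matcher(names: Sequence[str], base: Path, roots: Sequence[str],
                file_types: Sequence[str], report: bool = False,
                dry_run: bool = False,
                build_outputs: Optional[Callable[[MatchSummary], None]] = None
                ) -> MatchSummary:
    """Match files, then build outputs unless dry-run is enabled."""
    summary = match_files(names, base, roots, file_types, report=report)
    if dry_run:
        # Dry-run: stop after matching, before any XML etc. is created.
        return summary
    if build_outputs is not None:
        build_outputs(summary)
    return summary
```

With `roots=["raw/CORRECTIONS_2", "raw"]`, a file present in both roots resolves to the CORRECTIONS_2 copy, and the basic raw/ root is consulted only when the corrections root has no hit.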
Acceptance criteria
- Matcher supports all configured file types
- Files are found whether they are under primary roots or corrections roots
- Report mode with detailed logs exists
- Dry-run mode is implemented
Screenshots(Optional)