cern-sis · namollayo · Apr 22, 2026 · Apr 13, 2026 · Apr 14, 2026 · Apr 15, 2026
diff --git a/refactory/README.md b/refactory/README.md
@@ -1,15 +1,82 @@
 # refactory
 
-This directory contains scripts and helpers for validating PDF files in an S3 bucket using an inventory of Excel files hosted on CERNBox.
+This directory contains tools for validating PDF files and matching Boite Excel inventory records against S3 files.
 
 ## Structure
 
-- `main.py` - main script that validates PDFs using the CERNBox inventory.
+- `cli.py` - click CLI exposing the main workflows:
+  - `validate-files-integrity`
+  - `file-match`
 - `storage_connection.py` - storage provider abstraction:
   - `S3Provider` for S3.
   - `CernboxProvider` for public CERNBox access.
-- `validate_pdf.py` - validates PDFs locally with `is_pdf_valid(file_path)`.
-- `test_connections.py` - testing/connection experiment script.
+- `check_files/main.py` - validation pipeline used by `validate-files-integrity`.
+- `file_import/refactory_matcher.py` - Boite-to-S3 matcher implementation used by `file-match`.
+- `file_import/boite_matcher.py` - additional matcher implementation and helpers.
+
+## CLI usage
+
+Run the refactory CLI from the repository root:
+
+```bash
+poetry run digitization_v2 --help
+```
+
+The available commands are:
+
+- `validate-files-integrity` — validate PDF integrity and inventory alignment.
+- `file-match` — match Boite Excel records against S3 files and generate JSON outputs.
+
+## 1. Validate files integrity
+
+Use this command to run the PDF validation pipeline on the boxes selected from the Boite inventory.
+
+```bash
+poetry run digitization_v2 validate-files-integrity \
+  -d "[122,123]" \
+  -u \
+  -b digitization-dev
+```
+
+Options:
+
+- `-d, --data-source` — Boite inventory source. Supports a CERNBox hash, range (`1..10`), or list (`[1,2]`).
+- `-u, --upload-reports` — upload validation reports back to storage.
+- `-b, --bucket` — S3 bucket name (default: `digitization-dev`).
+
+This command runs the validation pipeline and generates logs such as `s3_pdf_issues.log`.
+
+## 2. Boite-to-S3 file matching
+
+Use this command to match Boite Excel filenames with S3 objects and write structured JSON output.
+
+```bash
+poetry run digitization_v2 file-match \
+  -d "https://cernbox.cern.ch/s/{hash}" \
+  -o ./match_results \
+  -f PDF,PDF_LATEX \
+  -b digitization-dev
+```
+
+Options:
+
+- `-d, --data-source` — local directory or CERNBox URL containing `.xlsx` Boite files.
+- `-o, --output-path` — output directory for JSON results (default: `./match_results`).
+- `-f, --file-types` — comma-separated list of file types to match (default: `PDF,PDF_LATEX`).
+- `-b, --bucket` — S3 bucket name (default: `digitization-dev`).
+
+### Matcher behavior
+
+The `file-match` flow:
+
+- downloads `.xlsx` Boite files from CERNBox if a URL is provided.
+- reads each Boite file and extracts the record ID and filename columns.
+- searches S3 under `raw/<TYPE>/<BOITE>/`.
+- matches filenames case-insensitively.
+- supports both flat and subfolder layouts:
+  - flat: `raw/PDF_LATEX/BOITE_O0125/ISR-LEP-RF-GG-ps.pdf`
+  - nested: `raw/PDF/BOITE_O0125/LEP-RF-SH-ps/LEP-RF-SH-ps.pdf`
+- writes unified mismatch logs in JSON format for missing Boite rows and extra S3 files.
 
 ## Dependencies
 
@@ -27,8 +94,6 @@ poetry install
 - `requests`
 - `pypdf`
 
-> If the project is managed with Poetry, `requirements.txt` is not required.
-
 ## AWS Authentication
 
 `S3Provider` uses `boto3`. Configure credentials using environment variables or the default AWS config files:
@@ -51,55 +116,22 @@ export SECRET_KEY="YOUR_SECRET_KEY"
 
 > `S3Provider` also supports the default endpoint `https://s3.cern.ch`, configured in `storage_connection.py`.
 
-## Usage with Poetry
-
-Run the refactored CLI via Poetry:
-
-```bash
-poetry run digitization_v2 --help
-```
-
-The current command for PDF validation is `validade-files-integrity`.
-
-### Example
-
-```bash
-poetry run digitization_v2 check-integrity -s "[122,123]" -u
-```
-
-Parameters:
-
-- `-i, --inventory-source`: Inventory source. Supports CERNBOX Hash, range (`1..10`), or list (`[1,2]`).
-- `-u, --upload-reports`: Flag to upload validation reports back to the storage provider.
-- `-b, --bucket`: S3 bucket name (default: `digitization-dev`).
-
-### Example without upload
-
-```bash
-poetry run digitization_v2 check-integrity -s "[122,123]"
-```
-
-## Expected output
-
-The CLI generates the same validation reports as the core pipeline:
+## CERNBox Authentication
 
-- a text log file such as `s3_pdf_issues.log`
-- a structured JSON report with valid, corrupted, and missing file details
+`CernboxProvider` reads optional credentials from environment variables:
 
-If `-u` is provided, the reports will be uploaded back to the configured storage provider.
+- `CERNBOX_USER`
+- `CERNBOX_PASSWORD`
 
-## Additional notes
-
-- `CernboxProvider` reads optional credentials from environment variables:
-  - `CERNBOX_USER`
-  - `CERNBOX_PASSWORD`
-
-### Example environment variables for Cernbox
+### Example environment variables for CERNBox
 
 ```bash
 export CERNBOX_USER="your_username"
 export CERNBOX_PASSWORD="your_password"
 ```
 
-- You may still pass `account` and `password` directly to `CernboxProvider` if preferred.
-- Use `test_connections.py` to verify connections before running the main pipeline.
+## Notes
+
+- `file_import/refactory_matcher.py` is the primary matcher used by `file-match`.
+- `test_connections.py` can be used to verify storage connectivity before running either workflow.
+- Use `poetry run digitization_v2 --help` to verify command names and options at runtime.
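The matcher behavior described in the README (case-insensitive filename matching across flat and nested S3 layouts, with missing and extra entries reported) can be illustrated with a minimal sketch. This is not the code in `file_import/refactory_matcher.py`, which this diff does not show. The function name `match_filenames` and its return shape are hypothetical, and it assumes the Boite filename column includes the file extension.

```python
from pathlib import PurePosixPath


def match_filenames(boite_filenames, s3_keys):
    """Match Boite filenames to S3 keys by basename, ignoring case.

    Hypothetical helper: comparing only the last path component lets flat
    keys (raw/<TYPE>/<BOITE>/<file>) and nested keys
    (raw/<TYPE>/<BOITE>/<folder>/<file>) be handled the same way.
    """
    # Index S3 keys by lowercased basename.
    keys_by_name = {}
    for key in s3_keys:
        basename = PurePosixPath(key).name.lower()
        keys_by_name.setdefault(basename, []).append(key)

    matched, missing, used = {}, [], set()
    for name in boite_filenames:
        lowered = name.lower()
        if lowered in keys_by_name:
            matched[name] = keys_by_name[lowered]
            used.add(lowered)
        else:
            missing.append(name)  # Boite row with no S3 object

    # S3 objects that no Boite row refers to.
    extra = sorted(
        key
        for basename, keys in keys_by_name.items()
        if basename not in used
        for key in keys
    )
    return matched, missing, extra


if __name__ == "__main__":
    keys = [
        "raw/PDF_LATEX/BOITE_O0125/ISR-LEP-RF-GG-ps.pdf",
        "raw/PDF/BOITE_O0125/LEP-RF-SH-ps/LEP-RF-SH-ps.pdf",
    ]
    print(match_filenames(["isr-lep-rf-gg-ps.pdf", "absent.pdf"], keys))
```

Indexing by basename keeps the lookup independent of folder depth; a real matcher may also need to disambiguate identical basenames that appear in different subfolders.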
diff --git a/refactory/__init__.py b/refactory/__init__.py
diff --git a/refactory/main.py → refactory/check_files/main.py b/refactory/main.py → refactory/check_files/main.py
@@ -3,24 +3,23 @@
 import os
 import sys
 import json
-from typing import Union
-from storage_connection import StorageProvider, S3Provider, CernboxProvider
-from validate_pdf import is_pdf_valid
+from refactory.storage_connection import StorageProvider, S3Provider, CernboxProvider
+from .utils import validate_pdf
 
 
 def run_validation_pipeline(
     provider: StorageProvider,
     base_path: str,
     log_file: str,
-    inventory_source: Union[str, list[int]],
+    data_source: str | list[int],
     upload_reports: bool = False,
 
 ):
     """Navigates directories, validates files, and logs files status."""
     target_box_numbers = set()
-    if isinstance(inventory_source, str):
-        inventory_provider = CernboxProvider(inventory_source)
-        excel_files = inventory_provider.list_excel("")
+    if isinstance(data_source, str):
+        data_source_provider = CernboxProvider(data_source)
+        excel_files = data_source_provider.list_files("", '.xlsx')
 
         for file_path in excel_files:
             filename = file_path.split(".")[0]
@@ -29,8 +28,8 @@ def run_validation_pipeline(
 
             if match:
                 target_box_numbers.add(int(match.group(1)))
-    elif isinstance(inventory_source, list):
-        target_box_numbers = set(inventory_source)
+    elif isinstance(data_source, list):
+        target_box_numbers = set(data_source)
 
     print(f"Excel files: {len(target_box_numbers)} boxes to check.")
 
@@ -57,7 +56,7 @@ def run_validation_pipeline(
             continue
         print(f"Processing target Box: {match.group(1) + (match.group(2) or '')}")
 
-        pdf_files = provider.list_pdfs(folder)
+        pdf_files = provider.list_files(folder, 'PDF')
 
         if not pdf_files:
             print(f"⚠️ EMPTY FOLDER: {folder}")
@@ -69,7 +68,7 @@ def run_validation_pipeline(
             with tempfile.NamedTemporaryFile(delete=True) as tmp:
                 provider.download_to_temp(pdf_path, tmp.name)
 
-                if is_pdf_valid(tmp.name):
+                if validate_pdf(tmp.name):
                     valid_files.append(pdf_path)
                     print(f"  ✅ {pdf_path}")
                 else:
@@ -136,6 +135,6 @@ def run_validation_pipeline(
         provider=s3_provider,  # cernbox_provider
         base_path="cern-archives/raw/PDF/",  # "teste/",
         log_file="s3_pdf_issues.log",
-        inventory_source=sys.argv[1],  # public_link_hash
+        data_source=sys.argv[1],  # public_link_hash
         upload_reports=int(sys.argv[2])
     )
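`run_validation_pipeline` accepts `data_source` as either a string (a CERNBox hash) or a list of box numbers, and the README documents three input forms for `-d`: a hash, a range (`1..10`), or a list (`[1,2]`). The conversion from the raw CLI string is not part of this diff, so the sketch below is only one way it might be done. The name `parse_data_source` is hypothetical, and treating the range as inclusive on both ends is an assumption.

```python
from __future__ import annotations

import json
import re


def parse_data_source(value: str) -> str | list[int]:
    """Turn a raw -d value into a CERNBox hash or a list of box numbers.

    Hypothetical helper, not taken from the repository.
    """
    value = value.strip()

    # Range form, e.g. "1..10" (assumed inclusive on both ends).
    range_match = re.fullmatch(r"(\d+)\.\.(\d+)", value)
    if range_match:
        start, end = (int(part) for part in range_match.groups())
        return list(range(start, end + 1))

    # List form, e.g. "[122,123]", parsed as a JSON array of integers.
    if value.startswith("["):
        return [int(number) for number in json.loads(value)]

    # Anything else is passed through as a CERNBox public-link hash.
    return value


if __name__ == "__main__":
    print(parse_data_source("[122,123]"))
    print(parse_data_source("1..3"))
```

Returning a list for the range and list forms and a string otherwise lines up with the `isinstance` checks at the top of `run_validation_pipeline`.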
diff --git a/refactory/validate_pdf.py → refactory/check_files/utils.py b/refactory/validate_pdf.py → refactory/check_files/utils.py
@@ -2,11 +2,11 @@
 from pypdf import PdfReader
 from pypdf.errors import PdfReadError
 
-def is_pdf_valid(file_path: str) -> bool:
+def validate_pdf(file_path: str) -> bool:
     """Checks if a local PDF is structurally valid and readable."""
     try:
         file_size = os.path.getsize(file_path)
-        if file_size < 100:  
+        if file_size < 100:
             return False
 
         with open(file_path, "rb") as f:
@@ -23,12 +23,12 @@ def is_pdf_valid(file_path: str) -> bool:
         if len(reader.pages) == 0:
             return False
 
-        _ = reader.pages[0] 
+        _ = reader.pages[0]
 
         return True
 
     except OSError as e:
         raise RuntimeError(f"System error when accessing file {file_path}: {e}") from e
 
     except (PdfReadError, Exception):
-        return False
+        return False