Merged
130 changes: 81 additions & 49 deletions refactory/README.md
@@ -1,15 +1,82 @@
# refactory

This directory contains scripts and helpers for validating PDF files in an S3 bucket using an inventory of Excel files hosted on CERNBox.
This directory contains tools for validating PDF files and matching Boite Excel inventory records against S3 files.

## Structure

- `main.py` - main script that validates PDFs using the CERNBox inventory.
- `cli.py` - click CLI exposing the main workflows:
- `validate-files-integrity`
- `file-match`
- `storage_connection.py` - storage provider abstraction:
- `S3Provider` for S3.
- `CernboxProvider` for public CERNBox access.
- `validate_pdf.py` - validates PDFs locally with `is_pdf_valid(file_path)`.
- `test_connections.py` - experimental script for testing storage connections.
- `check_files/main.py` - validation pipeline used by `validate-files-integrity`.
- `file_import/refactory_matcher.py` - Boite-to-S3 matcher implementation used by `file-match`.
- `file_import/boite_matcher.py` - additional matcher implementation and helpers.

## CLI usage

Run the refactory CLI from the repository root:

```bash
poetry run digitization_v2 --help
```

The available commands are:

- `validate-files-integrity` — validate PDF integrity and inventory alignment.
- `file-match` — match Boite Excel records against S3 files and generate JSON outputs.

## 1. Validate files integrity

Use this command to run the PDF validation pipeline on the boxes listed in the Boite inventory.

```bash
poetry run digitization_v2 validate-files-integrity \
-d "[122,123]" \
-u \
-b digitization-dev
```

Options:

- `-d, --data-source` — Boite inventory source. Supports a CERNBox hash, range (`1..10`), or list (`[1,2]`).
- `-u, --upload-reports` — upload validation reports back to storage.
- `-b, --bucket` — S3 bucket name (default: `digitization-dev`).

This command runs the validation pipeline and generates logs such as `s3_pdf_issues.log`.
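
The three accepted forms of `--data-source` can be told apart with a small parser. The sketch below is illustrative only: `parse_data_source` is a hypothetical helper, not a function from this repository, and treating a range as inclusive on both ends is an assumption the CLI help does not confirm.

```python
from __future__ import annotations

import json
import re


def parse_data_source(value: str) -> str | list[int]:
    """Return box numbers for a range or list, else the raw string.

    Hypothetical helper: the real CLI may parse this option differently.
    """
    value = value.strip()

    # Range form, e.g. "1..10" (assumed inclusive on both ends).
    range_match = re.fullmatch(r"(\d+)\.\.(\d+)", value)
    if range_match:
        start, end = int(range_match.group(1)), int(range_match.group(2))
        if start > end:
            raise ValueError(f"Invalid range: {value}")
        return list(range(start, end + 1))

    # List form, e.g. "[122,123]" (a JSON array of integers).
    if value.startswith("["):
        numbers = json.loads(value)
        if not all(isinstance(n, int) for n in numbers):
            raise ValueError(f"List must contain integers only: {value}")
        return numbers

    # Anything else is treated as a CERNBox public-link hash.
    return value
```

Returning either a string or a list mirrors the `data_source: str | list[int]` parameter of `run_validation_pipeline`, which branches on the type.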

## 2. Boite-to-S3 file matching

Use this command to match Boite Excel filenames with S3 objects and write structured JSON output.

```bash
poetry run digitization_v2 file-match \
-d "https://cernbox.cern.ch/s/{hash}" \
-o ./match_results \
-f PDF,PDF_LATEX \
-b digitization-dev
```

Options:

- `-d, --data-source` — local directory or CERNBox URL containing `.xlsx` Boite files.
- `-o, --output-path` — output directory for JSON results (default: `./match_results`).
- `-f, --file-types` — comma-separated list of file types to match (default: `PDF,PDF_LATEX`).
- `-b, --bucket` — S3 bucket name (default: `digitization-dev`).

### Matcher behavior

The `file-match` flow:

- downloads `.xlsx` Boite files from CERNBox if a URL is provided.
- reads each Boite file and extracts the record ID and filename columns.
- searches S3 under `raw/<TYPE>/<BOITE>/`.
- matches filenames case-insensitively.
- supports both flat and subfolder layouts:
- flat: `raw/PDF_LATEX/BOITE_O0125/ISR-LEP-RF-GG-ps.pdf`
- nested: `raw/PDF/BOITE_O0125/LEP-RF-SH-ps/LEP-RF-SH-ps.pdf`
- writes unified mismatch logs in JSON format for missing Boite rows and extra S3 files.
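
The matching rules above can be sketched as a comparison on lowercase basenames, which makes flat and nested layouts look the same. This is a minimal illustration, not the code in `file_import/refactory_matcher.py`: `match_filenames` and the output keys are hypothetical, and it assumes the Boite filename column already includes the file extension.

```python
from pathlib import PurePosixPath


def match_filenames(expected: list[str], s3_keys: list[str]) -> dict[str, list[str]]:
    """Compare Boite filenames with S3 keys, ignoring case and folder depth."""
    # Index S3 keys by lowercase basename so that
    # "BOITE/x.pdf" and "BOITE/x/x.pdf" are found the same way.
    # If two keys share a basename, the later one wins in this sketch.
    by_name: dict[str, str] = {}
    for key in s3_keys:
        by_name[PurePosixPath(key).name.lower()] = key

    wanted = {name.lower() for name in expected}
    return {
        "matched": sorted(by_name[n] for n in wanted if n in by_name),
        "missing_in_s3": sorted(n for n in wanted if n not in by_name),
        "extra_in_s3": sorted(k for n, k in by_name.items() if n not in wanted),
    }
```

The two mismatch lists correspond to the "missing Boite rows" and "extra S3 files" cases that the mismatch log records.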

## Dependencies

@@ -27,8 +94,6 @@ poetry install
- `requests`
- `pypdf`

> If the project is managed with Poetry, `requirements.txt` is not required.

## AWS Authentication

`S3Provider` uses `boto3`. Configure credentials using environment variables or the default AWS config files:
@@ -51,55 +116,22 @@ export SECRET_KEY="YOUR_SECRET_KEY"

> `S3Provider` also supports the default endpoint `https://s3.cern.ch`, configured in `storage_connection.py`.

## Usage with Poetry

Run the refactored CLI via Poetry:

```bash
poetry run digitization_v2 --help
```

The current command for PDF validation is `validade-files-integrity`.

### Example

```bash
poetry run digitization_v2 check-integrity -s "[122,123]" -u
```

Parameters:

- `-i, --inventory-source`: Inventory source. Supports CERNBOX Hash, range (`1..10`), or list (`[1,2]`).
- `-u, --upload-reports`: Flag to upload validation reports back to the storage provider.
- `-b, --bucket`: S3 bucket name (default: `digitization-dev`).

### Example without upload

```bash
poetry run digitization_v2 check-integrity -s "[122,123]"
```

## Expected output

The CLI generates the same validation reports as the core pipeline:
## CERNBox Authentication

- a text log file such as `s3_pdf_issues.log`
- a structured JSON report with valid, corrupted, and missing file details
`CernboxProvider` reads optional credentials from environment variables:

If `-u` is provided, the reports will be uploaded back to the configured storage provider.
- `CERNBOX_USER`
- `CERNBOX_PASSWORD`

## Additional notes

- `CernboxProvider` reads optional credentials from environment variables:
- `CERNBOX_USER`
- `CERNBOX_PASSWORD`

### Example environment variables for Cernbox
### Example environment variables for CERNBox

```bash
export CERNBOX_USER="your_username"
export CERNBOX_PASSWORD="your_password"
```
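
Because the credentials are optional, code that consumes them needs a fallback for the case where they are unset. The sketch below shows one way to read them; `cernbox_credentials` is a hypothetical helper, and how `CernboxProvider` actually reads and uses these variables is not shown in this README.

```python
import os


def cernbox_credentials(env=None):
    """Return (user, password) from the environment, or None if either is unset."""
    env = os.environ if env is None else env
    user = env.get("CERNBOX_USER")
    password = env.get("CERNBOX_PASSWORD")
    if user and password:
        return user, password
    # Assumption: without credentials, only public links are accessible.
    return None
```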

- You may still pass `account` and `password` directly to `CernboxProvider` if preferred.
- Use `test_connections.py` to verify connections before running the main pipeline.
## Notes

- `file_import/refactory_matcher.py` is the primary matcher used by `file-match`.
- `test_connections.py` can be used to verify storage connectivity before running either workflow.
- Use `poetry run digitization_v2 --help` to verify command names and options at runtime.
Empty file added refactory/__init__.py
Empty file.
23 changes: 11 additions & 12 deletions refactory/main.py → refactory/check_files/main.py
@@ -3,24 +3,23 @@
import os
import sys
import json
from typing import Union
from storage_connection import StorageProvider, S3Provider, CernboxProvider
from validate_pdf import is_pdf_valid
from refactory.storage_connection import StorageProvider, S3Provider, CernboxProvider
from .utils import validate_pdf


def run_validation_pipeline(
provider: StorageProvider,
base_path: str,
log_file: str,
inventory_source: Union[str, list[int]],
data_source: str | list[int],
upload_reports: bool = False,

):
"""Navigates directories, validates files, and logs each file's status."""
target_box_numbers = set()
if isinstance(inventory_source, str):
inventory_provider = CernboxProvider(inventory_source)
excel_files = inventory_provider.list_excel("")
if isinstance(data_source, str):
data_source_provider = CernboxProvider(data_source)
excel_files = data_source_provider.list_files("", '.xlsx')

for file_path in excel_files:
filename = file_path.split(".")[0]
@@ -29,8 +28,8 @@ def run_validation_pipeline(

if match:
target_box_numbers.add(int(match.group(1)))
elif isinstance(inventory_source, list):
target_box_numbers = set(inventory_source)
elif isinstance(data_source, list):
target_box_numbers = set(data_source)

print(f"Excel files: {len(target_box_numbers)} boxes to check.")

@@ -57,7 +56,7 @@ def run_validation_pipeline(
continue
print(f"Processing target Box: {match.group(1) + (match.group(2) or '')}")

pdf_files = provider.list_pdfs(folder)
pdf_files = provider.list_files(folder, 'PDF')

if not pdf_files:
print(f"⚠️ EMPTY FOLDER: {folder}")
@@ -69,7 +68,7 @@ def run_validation_pipeline(
with tempfile.NamedTemporaryFile(delete=True) as tmp:
provider.download_to_temp(pdf_path, tmp.name)

if is_pdf_valid(tmp.name):
if validate_pdf(tmp.name):
valid_files.append(pdf_path)
print(f" ✅ {pdf_path}")
else:
@@ -136,6 +135,6 @@ def run_validation_pipeline(
provider=s3_provider, # cernbox_provider
base_path="cern-archives/raw/PDF/", # "teste/",
log_file="s3_pdf_issues.log",
inventory_source=sys.argv[1], # public_link_hash
data_source=sys.argv[1], # public_link_hash
upload_reports=int(sys.argv[2])
)
8 changes: 4 additions & 4 deletions refactory/validate_pdf.py → refactory/check_files/utils.py
@@ -2,11 +2,11 @@
from pypdf import PdfReader
from pypdf.errors import PdfReadError

def is_pdf_valid(file_path: str) -> bool:
def validate_pdf(file_path: str) -> bool:
"""Checks if a local PDF is structurally valid and readable."""
try:
file_size = os.path.getsize(file_path)
if file_size < 100:
if file_size < 100:
return False

with open(file_path, "rb") as f:
@@ -23,12 +23,12 @@ def is_pdf_valid(file_path: str) -> bool:
if len(reader.pages) == 0:
return False

_ = reader.pages[0]
_ = reader.pages[0]

return True

except OSError as e:
raise RuntimeError(f"System error when accessing file {file_path}: {e}") from e

except (PdfReadError, Exception):
return False
return False
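
The cheap checks at the top of `validate_pdf` (a minimum size, then a look at the opened file) can be sketched without `pypdf`. The lines between the `open` call and the `len(reader.pages)` check are collapsed in this diff, so the `%PDF-` header test below is an assumption about what they do, and `looks_like_pdf` is a hypothetical name rather than part of this change.

```python
import os


def looks_like_pdf(file_path: str, min_size: int = 100) -> bool:
    """Cheap pre-check: at least min_size bytes and starts with the %PDF- marker."""
    # Same threshold as validate_pdf: very small files are rejected outright.
    if os.path.getsize(file_path) < min_size:
        return False
    # Assumed header check; the real function's version is not visible here.
    with open(file_path, "rb") as f:
        return f.read(5) == b"%PDF-"
```

A check like this only filters out obviously wrong files; parsing with `PdfReader` and reading the first page, as `validate_pdf` does, is still needed to detect structural corruption.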