79 changes: 52 additions & 27 deletions refactory/README.md
@@ -1,18 +1,18 @@
# refactory

This directory contains tools for validating PDF files and matching Boite Excel inventory records against S3 files.
This directory contains tools for validating PDF files, matching Boite Excel inventory records against S3 files, and optionally exporting the results to XML (FFT) for CDS upload.

## Structure

- `cli.py` - click CLI exposing the main workflows:
- `validate-files-integrity`
- `file-match`
- `match-and-export`
- `storage_connection.py` - storage provider abstraction:
- `S3Provider` for S3.
- `CernboxProvider` for public CERNBox access.
- `CernboxProvider` for public/authenticated CERNBox access.
- `check_files/main.py` - validation pipeline used by `validate-files-integrity`.
- `file_import/refactory_matcher.py` - Boite-to-S3 matcher implementation used by `file-match`.
- `file_import/boite_matcher.py` - additional matcher implementation and helpers.
- `file_import/boite_matcher.py` - Boite-to-S3 matcher implementation used by `match-and-export`.
- `file_import/xml_exporter.py` - XML generator (FFT) used for CDS batch uploads.

## CLI usage

@@ -25,7 +25,9 @@ poetry run digitization_v2 --help
The available commands are:

- `validate-files-integrity` — validate PDF integrity and inventory alignment.
- `file-match` — match Boite Excel records against S3 files and generate JSON outputs.
- `match-and-export` — match Boite Excel records against S3 files, generate JSON outputs, and optionally export/upload XMLs.

---

## 1. Validate files integrity

@@ -38,45 +40,62 @@ poetry run digitization_v2 validate-files-integrity \
-b digitization-dev
```

Options:
**Options:**

- `-d, --data-source` — Boite inventory source. Supports a CERNBox hash, range (`1..10`), or list (`[1,2]`).
- `-u, --upload-reports` — upload validation reports back to storage.
- `-b, --bucket` — S3 bucket name (default: `digitization-dev`).
- `-p, --base-path` — Base S3 path (default: `cern-archives/raw/PDF/`).

This command runs the validation pipeline and generates logs such as `s3_pdf_issues.log`.
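
The three accepted forms of `--data-source` can be told apart with a few lines of parsing. The sketch below is illustrative only: `parse_data_source` is a hypothetical helper, not a function from `cli.py`, and it assumes that a range such as `1..10` is inclusive at both ends and that anything else is treated as a CERNBox hash.

```python
import re


def parse_data_source(spec: str):
    """Classify a --data-source value (hypothetical helper, not the project's code).

    Returns ("boxes", [ints]) for a range or list, or ("cernbox_hash", str) otherwise.
    """
    spec = spec.strip()

    # Range form, e.g. "1..10". Assumption: both ends are inclusive.
    range_match = re.fullmatch(r"(\d+)\.\.(\d+)", spec)
    if range_match:
        start, end = int(range_match.group(1)), int(range_match.group(2))
        if start > end:
            raise ValueError(f"empty range: {spec}")
        return ("boxes", list(range(start, end + 1)))

    # List form, e.g. "[1,2]".
    if spec.startswith("[") and spec.endswith("]"):
        items = [part.strip() for part in spec[1:-1].split(",") if part.strip()]
        return ("boxes", [int(item) for item in items])

    # Anything else is assumed to be a CERNBox share hash.
    return ("cernbox_hash", spec)
```

The real CLI may validate these values differently; `poetry run digitization_v2 validate-files-integrity --help` remains the authoritative description.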

## 2. Boite-to-S3 file matching
---

## 2. Match and Export (Boite-to-S3)

Use this command to match Boite Excel filenames with S3 objects and write structured JSON output.
Use this command to match Boite Excel filenames with S3 objects, write structured JSON outputs, and optionally generate and upload XML files for CDS.

```bash
poetry run digitization_v2 file-match \
poetry run digitization_v2 match-and-export \
-d "https://cernbox.cern.ch/s/{hash}" \
-o ./match_results \
-f PDF,PDF_LATEX \
-b digitization-dev
-p "cern-archives/raw/CORRECTIONS_2,cern-archives/raw/" \
-o ./results \
-f PDF,PDF_LATEX \
-b digitization-dev \
-r \
-x \
-c
```

Options:
**Options:**

- `-d, --data-source` — local directory or CERNBox URL containing `.xlsx` Boite files.
- `-o, --output-path` — output directory for JSON results (default: `./match_results`).
- `-p, --base-paths` — Comma-separated base S3 paths. Order defines priority (e.g., `CORRECTIONS_2` overrides standard `raw` folders) (default: `cern-archives/raw/`).
- `-o, --output-path` — output directory for JSON/XML results (default: `./results`).
- `-f, --file-types` — comma-separated list of file types to match (default: `PDF,PDF_LATEX`).
- `-b, --bucket` — S3 bucket name (default: `digitization-dev`).
- `-r, --report` — Display a detailed run summary (total matched/unmatched) and list missing records in the console.
- `--dry-run` — Stop script execution after the matching phase. No XML generation or uploads will occur.
- `-x, --generate-xml` — Generate XML files (FFT) for CDS upload.
- `-c, --upload-cernbox` — Upload the generated XML files to CERNBox.
- `--cernbox-path` — Target folder inside CERNBox for XML uploads (default: `xml_exports`).
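
To illustrate what "order defines priority" means for `--base-paths`, here is a minimal sketch under stated assumptions. The helper names `split_base_paths` and `resolve_by_priority` are hypothetical and do not come from `boite_matcher.py`; the sketch assumes that two S3 keys count as the same file when their paths relative to their base path are equal, ignoring case.

```python
def split_base_paths(option_value: str) -> list[str]:
    """Split the comma-separated option and normalize each path to end with '/'."""
    return [
        part.strip().rstrip("/") + "/"
        for part in option_value.split(",")
        if part.strip()
    ]


def resolve_by_priority(base_paths: list[str], object_keys: list[str]) -> dict[str, str]:
    """Map each relative key (lower-cased) to the S3 key from the earliest base path."""
    chosen: dict[str, str] = {}
    claimed: set[str] = set()
    for base in base_paths:  # earlier paths win
        for key in object_keys:
            # Skip keys outside this base path or already taken by an earlier one.
            if key in claimed or not key.startswith(base):
                continue
            relative = key[len(base):].lower()
            if relative not in chosen:
                chosen[relative] = key
                claimed.add(key)
    return chosen
```

With `-p "cern-archives/raw/CORRECTIONS_2,cern-archives/raw/"`, a file present under both prefixes would resolve to the `CORRECTIONS_2` copy in this sketch, while files that exist only under `cern-archives/raw/` are still picked up.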

### Matcher & Export behavior

### Matcher behavior
The `match-and-export` flow:

The `file-match` flow:
1. **Downloads** `.xlsx` Boite files from CERNBox if a URL is provided.
2. **Reads** each Boite file and extracts the record ID and filename columns.
3. **Searches** S3 under `<BASE_PATH>/<TYPE>/<BOITE>/`. If multiple base paths are provided, earlier paths take priority, so a file found under an earlier path is not duplicated by a copy under a later one.
4. **Matches** filenames case-insensitively. Supports:
- *Flat layouts:* `raw/PDF_LATEX/BOITE_O0125/ISR-LEP-RF-GG-ps.pdf`
- *Nested subfolders:* `raw/PDF/BOITE_O0125/LEP-RF-SH-ps/LEP-RF-SH-ps.pdf`
- *Multi-page grouping:* Groups multiple files (e.g., sequential TIFFs suffixed `_001`, `_002`) under a single record ID.
5. **Generates** unified mismatch logs in JSON format for missing Boite rows and extra S3 files, and calculates matched/unmatched metrics per file.
6. **(Optional) Exports** matching records to XML files if the `-x` flag is used. Generates XML `<datafield>` nodes dynamically based on all resolved file types (PDFs, TIFFs, OCRs).
7. **(Optional) Uploads** the generated XMLs to a specified path in CERNBox if the `-c` flag is used.

- downloads `.xlsx` Boite files from CERNBox if a URL is provided.
- reads each Boite file and extracts the record ID and filename columns.
- searches S3 under `raw/<TYPE>/<BOITE>/`.
- matches filenames case-insensitively.
- supports both flat and subfolder layouts:
- flat: `raw/PDF_LATEX/BOITE_O0125/ISR-LEP-RF-GG-ps.pdf`
- nested: `raw/PDF/BOITE_O0125/LEP-RF-SH-ps/LEP-RF-SH-ps.pdf`
- writes unified mismatch logs in JSON format for missing Boite rows and extra S3 files.
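
The matching behavior described in this section (case-insensitive comparison, flat or nested layouts, and grouping of multi-page files) can be sketched as follows. This is an illustration under assumptions, not the code in `boite_matcher.py`: `normalized_stem` and `match_records` are hypothetical names, the `_<digits>` page suffix pattern is assumed, and the `raw/TIFF/` keys in the usage example are invented for demonstration.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Assumption: multi-page files end in an underscore plus digits, e.g. "_001".
PAGE_SUFFIX = re.compile(r"_\d+$")


def normalized_stem(name: str) -> str:
    """Lower-case the file stem and drop any page suffix, ignoring folders."""
    stem = PurePosixPath(name).stem.lower()
    return PAGE_SUFFIX.sub("", stem)


def match_records(records: dict[str, str], s3_keys: list[str]):
    """Return (matched, missing, extra) for Boite records against S3 keys.

    records maps a record ID to the filename listed in the Boite sheet.
    """
    keys_by_stem = defaultdict(list)
    for key in s3_keys:
        # Only the final path component matters, so flat and nested layouts
        # are handled the same way.
        keys_by_stem[normalized_stem(key)].append(key)

    matched, missing = {}, []
    for record_id, filename in records.items():
        stem = normalized_stem(filename)
        if stem in keys_by_stem:
            # All pages that share the stem are grouped under one record ID.
            matched[record_id] = sorted(keys_by_stem.pop(stem))
        else:
            missing.append(record_id)

    # Whatever was never claimed by a record is an extra S3 file.
    extra = sorted(key for keys in keys_by_stem.values() for key in keys)
    return matched, missing, extra
```

A short usage example:

```python
records = {"rec1": "ISR-LEP-RF-GG-ps.pdf", "rec2": "scan.tif", "rec3": "absent.pdf"}
s3_keys = [
    "raw/PDF_LATEX/BOITE_O0125/isr-lep-rf-gg-PS.pdf",
    "raw/PDF/BOITE_O0125/LEP-RF-SH-ps/LEP-RF-SH-ps.pdf",
    "raw/TIFF/BOITE_O0125/scan_001.tif",
    "raw/TIFF/BOITE_O0125/scan_002.tif",
]
matched, missing, extra = match_records(records, s3_keys)
```

Here `rec2` collects both TIFF pages, `rec3` is reported as missing, and the nested `LEP-RF-SH-ps.pdf` is reported as an extra S3 file. Stripping a trailing `_<digits>` is a simplification: a filename that legitimately ends that way would need a stricter rule.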
---

## Dependencies

@@ -93,6 +112,9 @@ poetry install
- `boto3`
- `requests`
- `pypdf`
- `click`

---

## AWS Authentication

@@ -116,6 +138,8 @@ export SECRET_KEY="YOUR_SECRET_KEY"

> `S3Provider` also supports the default endpoint `https://s3.cern.ch`, configured in `storage_connection.py`.

---

## CERNBox Authentication

`CernboxProvider` reads optional credentials from environment variables:
@@ -130,8 +154,9 @@ export CERNBOX_USER="your_username"
export CERNBOX_PASSWORD="your_password"
```
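
Because the credentials are optional, a provider has to decide between authenticated and public access at startup. The sketch below shows one way that decision could look. `cernbox_auth` is a hypothetical helper, not the actual `CernboxProvider` code, and it assumes that both variables must be set for authenticated access.

```python
import os


def cernbox_auth(environ=None):
    """Return a (user, password) tuple if both variables are set, else None.

    Hypothetical helper; None is taken here to mean "use public access".
    """
    env = os.environ if environ is None else environ
    user = env.get("CERNBOX_USER")
    password = env.get("CERNBOX_PASSWORD")
    if user and password:
        return (user, password)
    return None
```

A tuple of this shape is what `requests` accepts as its `auth` argument for basic authentication. Whether `CernboxProvider` passes credentials that way is an assumption; check `storage_connection.py` for the actual behavior.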

---

## Notes

- `file_import/refactory_matcher.py` is the primary matcher used by `file-match`.
- `test_connections.py` can be used to verify storage connectivity before running either workflow.
- `file_import/boite_matcher.py` is the primary matcher used by `match-and-export`.
- Use `poetry run digitization_v2 --help` to verify command names and options at runtime.