
scilons-packaging

Bundle GROBID outputs (TEI-XML, JSON, Markdown) into compressed Parquet shards suitable for distribution and downstream reading via HF datasets, Spark, DuckDB, or pyarrow.dataset.

Designed for very large trees (tens of millions of files): planning does no per-file stat(), shards are slices of a single directory, and the plan stage runs as a SLURM job, never on the login node.

What it does

Given one or more roots with a <root>/<dir_id>/<file> layout (the JSON and Markdown roots may point to the same physical tree — the per-format suffix filter handles the coexistence), this package:

  1. Discovers files per directory using os.scandir only: file counts, no per-file stat (see the sketch after this list).
  2. Plans ~256 MB (compressed) shards. Each shard is a (format, dir_id, slice_start, slice_end) range over the sorted filenames of one directory.
  3. Builds one Parquet file per shard, zstd level 9, fanning out across a SLURM array job.
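
A minimal sketch of the count-only discovery in step 1, assuming the <root>/<dir_id>/<file> layout above (the actual implementation in grobid_pkg/discover.py adds a thread pool):

import os

def count_files_per_dir(root: str) -> dict[str, int]:
    # One scandir per directory; DirEntry type checks read the dirent,
    # so no per-file stat() is issued on typical Linux filesystems.
    counts = {}
    for d in os.scandir(root):
        if d.is_dir(follow_symlinks=False):
            counts[d.name] = sum(
                1 for e in os.scandir(d.path)
                if e.is_file(follow_symlinks=False)
            )
    return counts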

Content is stored verbatim. TEI extraction lives in harvesting/scilons_harvesting/.

Schema (identical for all three datasets)

field       type          notes
doc_id      string        basename without extension
dir_id      string        first-level subdirectory name (e.g. 0042)
relpath     string        path relative to the format root
content     large_string  full file contents (UTF-8)
size_bytes  uint64        original on-disk size

Three datasets, joinable on (dir_id, doc_id). No tagged format column needed.
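
In PyArrow terms the schema looks roughly like this (a sketch; the canonical definition lives in grobid_pkg/schema.py):

import pyarrow as pa

SCHEMA = pa.schema([
    ("doc_id", pa.string()),
    ("dir_id", pa.string()),
    ("relpath", pa.string()),
    ("content", pa.large_string()),
    ("size_bytes", pa.uint64()),
])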

Install

cd scilons-packaging
pip install -e .

Usage

1. Plan — submit as a SLURM job

The plan stage walks every directory under the source roots. On a multi-million-file tree this would be killed by the login-node reaper, so it must run on a compute node.

The sbatch script activates the venv that holds grobid-pkg-* so the job has the package on PATH (SLURM compute nodes don't inherit a source activate from the login shell). The default is <submit-dir>/.venv — i.e. if you cd into the package root and keep your venv as .venv, no override is needed:

cd /netscratch/lfoppiano/scilons/packaging   # has .venv/ here
sbatch \
    --export=ALL,\
TEI_ROOT=/netscratch/.../grobid_tei,\
JSON_ROOT=/netscratch/.../grobid_json_md,\
MD_ROOT=/netscratch/.../grobid_json_md,\
OUTPUT_DIR=/netscratch/.../parquet,\
TARGET_MB=256 \
    scripts/submit_plan.sbatch

If your venv lives elsewhere, override with VENV_DIR=....

Any of TEI_ROOT, JSON_ROOT, MD_ROOT may be omitted; only the formats with a non-empty root are planned. After ~1–5 minutes (depending on tree size and FS speed), the job writes ${OUTPUT_DIR}/shard_plan.json and a human-readable shard_plan_summary.txt.

Optional environment overrides for submit_plan.sbatch:

env var      default                    meaning
VENV_DIR     ${SLURM_SUBMIT_DIR}/.venv  Path to the venv with grobid-pkg-plan installed. Defaults to the .venv next to where you ran sbatch. The script sources ${VENV_DIR}/bin/activate.
TARGET_MB    256                        Target compressed shard size (MB).
AVG_KB_TEI   45                         Avg compressed bytes per TEI file (KB).
AVG_KB_JSON  25                         Avg compressed bytes per JSON file (KB).
AVG_KB_MD    15                         Avg compressed bytes per Markdown file (KB).

The avg-compressed-kb knobs drive target_files_per_shard = target_bytes / avg_kb. Real compression ratios get logged at build time — adjust these for round 2 if shards are systematically over- or under-target.
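
For example, at the defaults a 256 MB target works out to target_files_per_shard = 256 * 1024 / 45 ≈ 5,825 TEI files per shard, ≈ 10,486 for JSON, and ≈ 17,476 for Markdown.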

2. Build shards on Pegasus

submit_all.sh resolves VENV_DIR to the package's .venv by default and forwards it (plus MAX_CONCURRENT) to each sbatch invocation:

cd /netscratch/lfoppiano/scilons/packaging
./scripts/submit_all.sh \
    /netscratch/.../parquet/shard_plan.json \
    /netscratch/.../grobid_tei \
    /netscratch/.../grobid_json_md \
    /netscratch/.../grobid_json_md \
    /netscratch/.../parquet

This submits one SLURM array job per format. By default each array caps in-flight tasks to 256 (MAX_CONCURRENT=256); override via env var if your partition allows more.

For a smoke test before the full fan-out, build one shard:

cd /netscratch/lfoppiano/scilons/packaging   # so .venv is found by default
sbatch --array=0-0 \
    --export=ALL,FORMAT=tei,\
SOURCE_ROOT=/netscratch/.../grobid_tei,\
PLAN_FILE=/netscratch/.../parquet/shard_plan.json,\
OUTPUT_DIR=/netscratch/.../parquet \
    scripts/submit_pegasus.sbatch

3. Read back

import pyarrow.dataset as ds

tei = ds.dataset("/netscratch/.../parquet/tei", format="parquet")
print(tei.schema)
print(tei.count_rows())

# Filter by directory, project columns, scan lazily.
table = tei.to_table(
    filter=ds.field("dir_id") == "0042",
    columns=["doc_id", "content"],
)

# Join with JSON on (dir_id, doc_id). Both tables carry content/relpath/
# size_bytes, so project columns (or pass join suffixes) to avoid clashes:
js = ds.dataset("/netscratch/.../parquet/json", format="parquet").to_table(
    columns=["dir_id", "doc_id", "content"]
)
joined = tei.to_table(columns=["dir_id", "doc_id", "content"]).join(
    js, keys=["dir_id", "doc_id"], left_suffix="_tei", right_suffix="_json"
)
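
The same shards load directly with HF datasets; a minimal sketch, with the glob adjusted to your output layout:

from datasets import load_dataset

tei_ds = load_dataset(
    "parquet",
    data_files="/netscratch/.../parquet/tei/*.parquet",
    split="train",
)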

Local development

The smoke test exercises every architectural property — slice-of-dir shards, JSON+MD shared root, byte-identical round-trip, slice non-overlap:

python -m venv .venv && .venv/bin/pip install -e .

# Generate fixture: 3 small dirs + one 64-file dir to force multi-slice
.venv/bin/python tests/make_sample.py /tmp/test/in --multi-slice-files 64

.venv/bin/grobid-pkg-plan \
    --tei-root /tmp/test/in/tei \
    --json-root /tmp/test/in/json_md \
    --md-root   /tmp/test/in/json_md \
    --output-dir /tmp/test/out \
    --target-shard-mb 1   # tiny to force slicing

# Build all shards
for fmt in tei json md; do
    n=$(.venv/bin/python -c "import json; \
print(len(json.load(open('/tmp/test/out/shard_plan.json'))['shards']['$fmt']))")
    src=/tmp/test/in/tei
    [[ "$fmt" != "tei" ]] && src=/tmp/test/in/json_md
    for sid in $(seq 0 $((n-1))); do
        .venv/bin/grobid-pkg-build \
            --plan /tmp/test/out/shard_plan.json \
            --format $fmt --shard-id $sid \
            --source-root "$src" \
            --output-dir /tmp/test/out/$fmt
    done
done
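
To assert the round-trip and non-overlap properties explicitly, a check along these lines works (a sketch; paths match the fixture above):

import pathlib
import pyarrow.dataset as ds

src = pathlib.Path("/tmp/test/in/tei")
tei = ds.dataset("/tmp/test/out/tei", format="parquet").to_table()

# Byte-identical round-trip: each row's content matches its source file.
for relpath, content in zip(tei["relpath"].to_pylist(),
                            tei["content"].to_pylist()):
    assert (src / relpath).read_text(encoding="utf-8") == content

# Slice non-overlap: no (dir_id, doc_id) key appears twice.
keys = list(zip(tei["dir_id"].to_pylist(), tei["doc_id"].to_pylist()))
assert len(keys) == len(set(keys))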

Layout

scilons-packaging/
├── pyproject.toml
├── README.md
├── grobid_pkg/
│   ├── __init__.py
│   ├── schema.py         # PyArrow schema + per-format suffix table
│   ├── discover.py       # scandir-based, count-only, threaded
│   ├── planner.py        # slice-of-dir shards, file-count packing
│   ├── shard_builder.py  # re-list dir, take slice, write parquet
│   └── cli.py            # grobid-pkg-plan / grobid-pkg-build
├── scripts/
│   ├── submit_plan.sbatch     # plan stage on a compute node
│   ├── submit_pegasus.sbatch  # one task per shard
│   └── submit_all.sh          # submits one array job per format
└── tests/
    └── make_sample.py

Tuning

  • --target-shard-mb — fewer/larger shards (better compression, fewer SLURM tasks) vs more/smaller shards (more parallelism, more metadata). 256 MB is a good default for HF datasets and Spark.
  • --avg-compressed-kb-{tei,json,md} — sets target_files_per_shard. Defaults are conservative (45/25/15 KB). Inspect actual compression ratios in the build logs and tune for round 2.
  • --workers — discovery thread pool size (default 16). Increase if your storage backend tolerates parallel metadata requests; lower if admins complain.
  • MAX_CONCURRENT env var for submit_all.sh — caps in-flight array tasks per format. Default 256.
  • compression_level in submit_pegasus.sbatch — zstd level (default 9). Drop to 3 for ~2× faster builds at ~10 % larger output.
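
For reference, the zstd knob maps onto the Parquet writer roughly like this (a sketch with a toy table, not the builder's exact code):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"doc_id": ["0001"], "content": ["<TEI/>"]})  # toy shard
pq.write_table(table, "tei-00000.parquet",
               compression="zstd", compression_level=9)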

Design notes

The plan stores file index ranges, not filenames. Build workers re-list the directory at run time, sort filenames ascending, and take their slice. This keeps the plan small (~hundreds of KB even for tens of millions of files), at the cost of one assumption: directory contents must be stable between plan and build. For an offline batch this is fine. If files are added/removed in between, the build worker will warn and truncate the slice.
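
A minimal sketch of the worker's slice resolution (the real logic, including the warn-and-truncate path, lives in grobid_pkg/shard_builder.py):

import os

def resolve_slice(source_root: str, dir_id: str,
                  slice_start: int, slice_end: int) -> list[str]:
    # Re-list the directory at run time, sort ascending exactly as the
    # planner did, then take the planned index range.
    names = sorted(
        e.name for e in os.scandir(os.path.join(source_root, dir_id))
        if e.is_file(follow_symlinks=False)
    )
    if slice_end > len(names):   # directory shrank since planning
        slice_end = len(names)   # truncate rather than fail
    return names[slice_start:slice_end]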
