Bundle GROBID outputs (TEI-XML, JSON, Markdown) into compressed Parquet
shards suitable for distribution and downstream reading via HF datasets,
Spark, DuckDB, or pyarrow.dataset.
Designed for very large trees (tens of millions of files): planning does
no per-file `stat()`, shards are slices of a single directory, and the
plan stage runs as a SLURM job, never on the login node.
Given one or more roots with a `<root>/<dir_id>/<file>` layout (the
JSON and Markdown roots may point to the same physical tree — the
per-format suffix filter handles the coexistence), this package:
- Discovers files per directory using `os.scandir` only — file counts, no per-file stat (see the sketch after this list).
- Plans ~256 MB (compressed) shards. Each shard is a `(format, dir_id, slice_start, slice_end)` range over the sorted filenames of one directory.
- Builds one Parquet file per shard, zstd level 9, fanning out across a SLURM array job.
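A minimal sketch of that count-only discovery, assuming a single-threaded version of what `discover.py` does (the real module is threaded and may differ in detail):

```python
import os

def count_files_per_dir(format_root: str) -> dict[str, int]:
    """Count files in each first-level subdirectory using os.scandir only.

    scandir yields dirents whose type is usually known from the directory
    entry itself (d_type on Linux), so no per-file stat() is issued.
    """
    counts: dict[str, int] = {}
    for sub in os.scandir(format_root):
        if not sub.is_dir():
            continue
        with os.scandir(sub.path) as entries:
            counts[sub.name] = sum(1 for e in entries if e.is_file())
    return counts
```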
Content is stored verbatim. TEI extraction lives in
`harvesting/scilons_harvesting/`.
Each format uses the same schema:

| field | type | notes |
|---|---|---|
| `doc_id` | string | basename without extension |
| `dir_id` | string | first-level subdirectory name (e.g. `0042`) |
| `relpath` | string | path relative to the format root |
| `content` | large_string | full file contents (UTF-8) |
| `size_bytes` | uint64 | original on-disk size |

Three datasets, joinable on `(dir_id, doc_id)`. No tagged `format` column is needed.
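For reference, the table above corresponds to a PyArrow schema along these lines (a sketch; the actual declaration lives in `grobid_pkg/schema.py` and may differ):

```python
import pyarrow as pa

# Sketch of the shared per-format schema, mirroring the field table above.
SCHEMA = pa.schema([
    pa.field("doc_id", pa.string()),         # basename without extension
    pa.field("dir_id", pa.string()),         # first-level subdirectory name
    pa.field("relpath", pa.string()),        # path relative to the format root
    pa.field("content", pa.large_string()),  # full file contents (UTF-8)
    pa.field("size_bytes", pa.uint64()),     # original on-disk size
])
```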
```bash
cd scilons-packaging
pip install -e .
```

The plan stage walks every directory under the source roots. On a multi-million-file tree this would be killed by the login-node reaper, so it must run on a compute node.
The sbatch script activates the venv that holds `grobid-pkg-*` so
the job has the package on `PATH` (SLURM compute nodes don't inherit
a `source activate` from the login shell). The default is
`<submit-dir>/.venv` — i.e. if you `cd` into the package root and
keep your venv as `.venv`, no override is needed:
```bash
cd /netscratch/lfoppiano/scilons/packaging   # has .venv/ here
sbatch \
--export=ALL,\
TEI_ROOT=/netscratch/.../grobid_tei,\
JSON_ROOT=/netscratch/.../grobid_json_md,\
MD_ROOT=/netscratch/.../grobid_json_md,\
OUTPUT_DIR=/netscratch/.../parquet,\
TARGET_MB=256 \
scripts/submit_plan.sbatch
```

If your venv lives elsewhere, override with `VENV_DIR=...`.
Any of `TEI_ROOT`, `JSON_ROOT`, `MD_ROOT` may be omitted; only the
formats with a non-empty root are planned. After ~1–5 minutes
(depending on tree size and filesystem speed), the job writes
`${OUTPUT_DIR}/shard_plan.json` and a human-readable
`shard_plan_summary.txt`.
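To sanity-check the plan before fanning out, something like this works (the per-format `shards` key layout is inferred from the smoke-test loop further down; everything else is illustrative):

```python
import json

with open("/netscratch/.../parquet/shard_plan.json") as f:
    plan = json.load(f)
for fmt, shards in plan["shards"].items():
    print(f"{fmt}: {len(shards)} shards")  # one SLURM array task per shard
```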
Optional environment overrides for `submit_plan.sbatch`:

| env var | default | meaning |
|---|---|---|
| `VENV_DIR` | `${SLURM_SUBMIT_DIR}/.venv` | Path to the venv with `grobid-pkg-plan` installed. Defaults to the `.venv` next to where you ran `sbatch`. The script sources `${VENV_DIR}/bin/activate`. |
| `TARGET_MB` | 256 | Target compressed shard size (MB). |
| `AVG_KB_TEI` | 45 | Average compressed size per TEI file, in KB. |
| `AVG_KB_JSON` | 25 | Average compressed size per JSON file, in KB. |
| `AVG_KB_MD` | 15 | Average compressed size per Markdown file, in KB. |
The avg-compressed-KB knobs drive `target_files_per_shard = target_bytes / avg_kb`. Real compression ratios get logged at build
time — adjust these for round 2 if shards are systematically over- or
under-target.
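As a worked example of that formula with the defaults:

```python
# target_files_per_shard = target_bytes / avg_kb, using the default knobs.
target_mb = 256
for fmt, avg_kb in {"tei": 45, "json": 25, "md": 15}.items():
    print(fmt, (target_mb * 1024) // avg_kb)
# tei 5825, json 10485, md 17476 files per shard
```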
`submit_all.sh` resolves `VENV_DIR` to the package's `.venv` by
default and forwards it (plus `MAX_CONCURRENT`) to each `sbatch`
invocation:
```bash
cd /netscratch/lfoppiano/scilons/packaging
./scripts/submit_all.sh \
/netscratch/.../parquet/shard_plan.json \
/netscratch/.../grobid_tei \
/netscratch/.../grobid_json_md \
/netscratch/.../grobid_json_md \
/netscratch/.../parquet
```

This submits one SLURM array job per format. By default each array
caps in-flight tasks at 256 (`MAX_CONCURRENT=256`); override via the env
var if your partition allows more.
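To preview what each array job will request before submitting, a hypothetical dry run over the plan (the `%` suffix is SLURM's native cap on in-flight array tasks, which is presumably what `MAX_CONCURRENT` sets):

```python
import json

plan = json.load(open("/netscratch/.../parquet/shard_plan.json"))
for fmt in ("tei", "json", "md"):
    n = len(plan["shards"].get(fmt, []))
    if n:
        print(f"{fmt}: sbatch --array=0-{n - 1}%256 scripts/submit_pegasus.sbatch")
```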
For a smoke test before the full fan-out, build one shard:
```bash
cd /netscratch/lfoppiano/scilons/packaging   # so .venv is found by default
sbatch --array=0-0 \
--export=ALL,FORMAT=tei,\
SOURCE_ROOT=/netscratch/.../grobid_tei,\
PLAN_FILE=/netscratch/.../parquet/shard_plan.json,\
OUTPUT_DIR=/netscratch/.../parquet \
scripts/submit_pegasus.sbatch
```

Once shards are built, read them with `pyarrow.dataset`:

```python
import pyarrow.dataset as ds
tei = ds.dataset("/netscratch/.../parquet/tei", format="parquet")
print(tei.schema)
print(tei.count_rows())
# Filter by directory, project columns, scan lazily.
table = tei.to_table(
filter=ds.field("dir_id") == "0042",
columns=["doc_id", "content"],
)
# Join with JSON on (dir_id, doc_id):
js = ds.dataset("/netscratch/.../parquet/json", format="parquet").to_table()
joined = tei.to_table().join(js, keys=["dir_id", "doc_id"])
```
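DuckDB, one of the consumers named at the top, reads the same directories without conversion; a sketch via its Python client, with paths as in the examples above:

```python
import duckdb

# Row counts per directory across all TEI shards; read_parquet takes a glob.
con = duckdb.connect()
rows = con.sql("""
    SELECT dir_id, count(*) AS docs
    FROM read_parquet('/netscratch/.../parquet/tei/*.parquet')
    GROUP BY dir_id
    ORDER BY dir_id
""").fetchall()
print(rows[:5])
```

HF `datasets` can point at the same files too, e.g. `load_dataset("parquet", data_files=...)`.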
The smoke test exercises every architectural property — slice-of-dir shards, JSON+MD shared root, byte-identical round-trip, slice non-overlap:

```bash
python -m venv .venv && .venv/bin/pip install -e .
# Generate fixture: 3 small dirs + one 64-file dir to force multi-slice
.venv/bin/python tests/make_sample.py /tmp/test/in --multi-slice-files 64
.venv/bin/grobid-pkg-plan \
--tei-root /tmp/test/in/tei \
--json-root /tmp/test/in/json_md \
--md-root /tmp/test/in/json_md \
--output-dir /tmp/test/out \
--target-shard-mb 1 # tiny to force slicing
# Build all shards
for fmt in tei json md; do
n=$(.venv/bin/python -c "import json; \
print(len(json.load(open('/tmp/test/out/shard_plan.json'))['shards']['$fmt']))")
src=/tmp/test/in/tei
[[ "$fmt" != "tei" ]] && src=/tmp/test/in/json_md
for sid in $(seq 0 $((n-1))); do
.venv/bin/grobid-pkg-build \
--plan /tmp/test/out/shard_plan.json \
--format $fmt --shard-id $sid \
--source-root "$src" \
--output-dir /tmp/test/out/$fmt
done
done
```
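To spot-check the byte-identical round-trip claim after the loop finishes, a quick comparison against the source tree (smoke-test paths assumed):

```python
import pyarrow.dataset as ds
from pathlib import Path

src_root = Path("/tmp/test/in/tei")
tbl = ds.dataset("/tmp/test/out/tei", format="parquet").to_table()
for relpath, content in zip(tbl["relpath"].to_pylist(), tbl["content"].to_pylist()):
    # Content is stored verbatim, so a string compare is the whole test.
    assert (src_root / relpath).read_text(encoding="utf-8") == content, relpath
print(f"{tbl.num_rows} TEI rows verified byte-identical")
```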
Repository layout:

```
scilons-packaging/
├── pyproject.toml
├── README.md
├── grobid_pkg/
│ ├── __init__.py
│ ├── schema.py # PyArrow schema + per-format suffix table
│ ├── discover.py # scandir-based, count-only, threaded
│ ├── planner.py # slice-of-dir shards, file-count packing
│ ├── shard_builder.py # re-list dir, take slice, write parquet
│ └── cli.py # grobid-pkg-plan / grobid-pkg-build
├── scripts/
│ ├── submit_plan.sbatch # plan stage on a compute node
│ ├── submit_pegasus.sbatch # one task per shard
│ └── submit_all.sh # submits one array job per format
└── tests/
    └── make_sample.py
```
Tuning knobs:

- `--target-shard-mb` — fewer/larger shards (better compression, fewer SLURM tasks) vs. more/smaller shards (more parallelism, more metadata). 256 MB is a good default for HF datasets and Spark.
- `--avg-compressed-kb-{tei,json,md}` — sets `target_files_per_shard`. Defaults are conservative (45/25/15 KB). Inspect actual compression ratios in the build logs and tune for round 2.
- `--workers` — discovery thread pool size (default 16). Increase if your storage backend tolerates parallel metadata requests; lower if admins complain.
- `MAX_CONCURRENT` env var for `submit_all.sh` — caps in-flight array tasks per format. Default 256.
- `compression_level` in `submit_pegasus.sbatch` — zstd level (default 9). Drop to 3 for ~2× faster builds at ~10 % larger output.
The plan stores file index ranges, not filenames. Build workers re-list the directory at run time, sort filenames ascending, and take their slice. This keeps the plan small (~hundreds of KB even for tens of millions of files), at the cost of one assumption: directory contents must be stable between plan and build. For an offline batch this is fine. If files are added/removed in between, the build worker will warn and truncate the slice.
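A sketch of that build-side behavior, with a hypothetical helper and illustrative warning text (the real `shard_builder.py` may differ in detail):

```python
import os

def resolve_slice(dir_path: str, start: int, end: int) -> list[str]:
    # Re-list and sort at build time; the plan only carries [start, end).
    names = sorted(e.name for e in os.scandir(dir_path) if e.is_file())
    if end > len(names):
        # Directory shrank since planning: warn and truncate, as described above.
        print(f"WARNING: {dir_path}: {len(names)} files on disk, "
              f"plan expected at least {end}; truncating slice")
        end = len(names)
    return names[start:end]
```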