Bundle GROBID outputs (TEI-XML, JSON, Markdown) into compressed Parquet
shards suitable for distribution and downstream reading via HF datasets,
Spark, DuckDB, or pyarrow.dataset.
Designed for very large trees (tens of millions of files): planning does
no per-file `stat()`, shards are slices of a single directory, and the
plan stage runs as a SLURM job, never on the login node.
Given one or more roots with a `<root>/<dir_id>/<file>` layout (the
JSON and Markdown roots may point to the same physical tree — the
per-format suffix filter handles the coexistence), this package:
- Discovers files per directory using `os.scandir` only — file counts, no per-file stat (see the sketch after this list).
- Plans ~256 MB (compressed) shards. Each shard is a `(format, dir_id, slice_start, slice_end)` range over the sorted filenames of one directory.
- Builds one Parquet file per shard, zstd level 9, fanning out across a SLURM array job.
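A minimal sketch of that count-only discovery, assuming a single-threaded version of what `discover.py` does (the real module is threaded and may differ in detail):

```python
import os

def count_files_per_dir(format_root: str) -> dict[str, int]:
    """Count files in each first-level subdirectory using os.scandir only.

    scandir yields dirents whose type is usually known from the directory
    entry itself (d_type on Linux), so no per-file stat() is issued.
    """
    counts: dict[str, int] = {}
    for sub in os.scandir(format_root):
        if not sub.is_dir():
            continue
        with os.scandir(sub.path) as entries:
            counts[sub.name] = sum(1 for e in entries if e.is_file())
    return counts
```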
Content is stored verbatim. TEI extraction lives in
`harvesting/scilons_harvesting/`.
Each format uses the same schema:

| field | type | notes |
|---|---|---|
| `doc_id` | string | basename without extension |
| `dir_id` | string | first-level subdirectory name (e.g. `0042`) |
| `relpath` | string | path relative to the format root |
| `content` | large_string | full file contents (UTF-8) |
| `size_bytes` | uint64 | original on-disk size |

Three datasets, joinable on `(dir_id, doc_id)`. No tagged `format` column is needed.
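For reference, the table above corresponds to a PyArrow schema along these lines (a sketch; the actual declaration lives in `grobid_pkg/schema.py` and may differ):

```python
import pyarrow as pa

# Sketch of the shared per-format schema, mirroring the field table above.
SCHEMA = pa.schema([
    pa.field("doc_id", pa.string()),         # basename without extension
    pa.field("dir_id", pa.string()),         # first-level subdirectory name
    pa.field("relpath", pa.string()),        # path relative to the format root
    pa.field("content", pa.large_string()),  # full file contents (UTF-8)
    pa.field("size_bytes", pa.uint64()),     # original on-disk size
])
```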
```bash
cd scilons-packaging
pip install -e .
```

The plan stage walks every directory under the source roots. On a multi-million-file tree this would be killed by the login-node reaper, so it must run on a compute node.
The sbatch script activates the venv that holds `grobid-pkg-*` so
the job has the package on `PATH` (SLURM compute nodes don't inherit
a `source activate` from the login shell). The default is
`<submit-dir>/.venv` — i.e. if you `cd` into the package root and
keep your venv as `.venv`, no override is needed:
```bash
cd /netscratch/lfoppiano/scilons/packaging   # has .venv/ here
sbatch \
--export=ALL,\
TEI_ROOT=/netscratch/.../grobid_tei,\
JSON_ROOT=/netscratch/.../grobid_json_md,\
MD_ROOT=/netscratch/.../grobid_json_md,\
OUTPUT_DIR=/netscratch/.../parquet,\
TARGET_MB=256 \
scripts/submit_plan.sbatch
```

If your venv lives elsewhere, override with `VENV_DIR=...`.
Any of `TEI_ROOT`, `JSON_ROOT`, `MD_ROOT` may be omitted; only the
formats with a non-empty root are planned. After ~1–5 minutes
(depending on tree size and filesystem speed), the job writes
`${OUTPUT_DIR}/shard_plan.json` and a human-readable
`shard_plan_summary.txt`.
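To sanity-check the plan before fanning out, something like this works (the per-format `shards` key layout is inferred from the smoke-test loop further down; everything else is illustrative):

```python
import json

with open("/netscratch/.../parquet/shard_plan.json") as f:
    plan = json.load(f)
for fmt, shards in plan["shards"].items():
    print(f"{fmt}: {len(shards)} shards")  # one SLURM array task per shard
```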
Optional environment overrides for `submit_plan.sbatch`:

| env var | default | meaning |
|---|---|---|
| `VENV_DIR` | `${SLURM_SUBMIT_DIR}/.venv` | Path to the venv with `grobid-pkg-plan` installed. Defaults to the `.venv` next to where you ran `sbatch`. The script sources `${VENV_DIR}/bin/activate`. |
| `TARGET_MB` | 256 | Target compressed shard size (MB). |
| `AVG_KB_TEI` | 45 | Average compressed size per TEI file, in KB. |
| `AVG_KB_JSON` | 25 | Average compressed size per JSON file, in KB. |
| `AVG_KB_MD` | 15 | Average compressed size per Markdown file, in KB. |
The avg-compressed-KB knobs drive `target_files_per_shard = target_bytes / avg_kb`. Real compression ratios get logged at build
time — adjust these for round 2 if shards are systematically over- or
under-target.
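As a worked example of that formula with the defaults:

```python
# target_files_per_shard = target_bytes / avg_kb, using the default knobs.
target_mb = 256
for fmt, avg_kb in {"tei": 45, "json": 25, "md": 15}.items():
    print(fmt, (target_mb * 1024) // avg_kb)
# tei 5825, json 10485, md 17476 files per shard
```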
`submit_all.sh` resolves `VENV_DIR` to the package's `.venv` by
default and forwards it (plus `MAX_CONCURRENT`) to each `sbatch`
invocation:
```bash
cd /netscratch/lfoppiano/scilons/packaging
./scripts/submit_all.sh \
/netscratch/.../parquet/shard_plan.json \
/netscratch/.../grobid_tei \
/netscratch/.../grobid_json_md \
/netscratch/.../grobid_json_md \
/netscratch/.../parquet
```

This submits one SLURM array job per format. By default each array
caps in-flight tasks at 256 (`MAX_CONCURRENT=256`); override via the env
var if your partition allows more.
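To preview what each array job will request before submitting, a hypothetical dry run over the plan (the `%` suffix is SLURM's native cap on in-flight array tasks, which is presumably what `MAX_CONCURRENT` sets):

```python
import json

plan = json.load(open("/netscratch/.../parquet/shard_plan.json"))
for fmt in ("tei", "json", "md"):
    n = len(plan["shards"].get(fmt, []))
    if n:
        print(f"{fmt}: sbatch --array=0-{n - 1}%256 scripts/submit_pegasus.sbatch")
```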
For a smoke test before the full fan-out, build one shard:
```bash
cd /netscratch/lfoppiano/scilons/packaging   # so .venv is found by default
sbatch --array=0-0 \
--export=ALL,FORMAT=tei,\
SOURCE_ROOT=/netscratch/.../grobid_tei,\
PLAN_FILE=/netscratch/.../parquet/shard_plan.json,\
OUTPUT_DIR=/netscratch/.../parquet \
scripts/submit_pegasus.sbatch
```

Once shards are built, read them with `pyarrow.dataset`:

```python
import pyarrow.dataset as ds
tei = ds.dataset("/netscratch/.../parquet/tei", format="parquet")
print(tei.schema)
print(tei.count_rows())
# Filter by directory, project columns, scan lazily.
table = tei.to_table(
filter=ds.field("dir_id") == "0042",
columns=["doc_id", "content"],
)
# Join with JSON on (dir_id, doc_id):
js = ds.dataset("/netscratch/.../parquet/json", format="parquet").to_table()
joined = tei.to_table().join(js, keys=["dir_id", "doc_id"])
```
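DuckDB, one of the consumers named at the top, reads the same directories without conversion; a sketch via its Python client, with paths as in the examples above:

```python
import duckdb

# Row counts per directory across all TEI shards; read_parquet takes a glob.
con = duckdb.connect()
rows = con.sql("""
    SELECT dir_id, count(*) AS docs
    FROM read_parquet('/netscratch/.../parquet/tei/*.parquet')
    GROUP BY dir_id
    ORDER BY dir_id
""").fetchall()
print(rows[:5])
```

HF `datasets` can point at the same files too, e.g. `load_dataset("parquet", data_files=...)`.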
The smoke test exercises every architectural property — slice-of-dir shards, JSON+MD shared root, byte-identical round-trip, slice non-overlap:

```bash
python -m venv .venv && .venv/bin/pip install -e .
# Generate fixture: 3 small dirs + one 64-file dir to force multi-slice
.venv/bin/python tests/make_sample.py /tmp/test/in --multi-slice-files 64
.venv/bin/grobid-pkg-plan \
--tei-root /tmp/test/in/tei \
--json-root /tmp/test/in/json_md \
--md-root /tmp/test/in/json_md \
--output-dir /tmp/test/out \
--target-shard-mb 1 # tiny to force slicing
# Build all shards
for fmt in tei json md; do
n=$(.venv/bin/python -c "import json; \
print(len(json.load(open('/tmp/test/out/shard_plan.json'))['shards']['$fmt']))")
src=/tmp/test/in/tei
[[ "$fmt" != "tei" ]] && src=/tmp/test/in/json_md
for sid in $(seq 0 $((n-1))); do
.venv/bin/grobid-pkg-build \
--plan /tmp/test/out/shard_plan.json \
--format $fmt --shard-id $sid \
--source-root "$src" \
--output-dir /tmp/test/out/$fmt
done
done
```
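To spot-check the byte-identical round-trip claim after the loop finishes, a quick comparison against the source tree (smoke-test paths assumed):

```python
import pyarrow.dataset as ds
from pathlib import Path

src_root = Path("/tmp/test/in/tei")
tbl = ds.dataset("/tmp/test/out/tei", format="parquet").to_table()
for relpath, content in zip(tbl["relpath"].to_pylist(), tbl["content"].to_pylist()):
    # Content is stored verbatim, so a string compare is the whole test.
    assert (src_root / relpath).read_text(encoding="utf-8") == content, relpath
print(f"{tbl.num_rows} TEI rows verified byte-identical")
```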
Repository layout:

```
scilons-packaging/
├── pyproject.toml
├── README.md
├── grobid_pkg/
│ ├── __init__.py
│ ├── schema.py # PyArrow schema + per-format suffix table
│ ├── discover.py # scandir-based, count-only, threaded
│ ├── planner.py # slice-of-dir shards, file-count packing
│ ├── shard_builder.py # re-list dir, take slice, write parquet
│ └── cli.py # grobid-pkg-plan / grobid-pkg-build
├── scripts/
│ ├── submit_plan.sbatch # plan stage on a compute node
│ ├── submit_pegasus.sbatch # one task per shard
│ └── submit_all.sh # submits one array job per format
└── tests/
    └── make_sample.py
```
Tuning knobs:

- `--target-shard-mb` — fewer/larger shards (better compression, fewer SLURM tasks) vs. more/smaller shards (more parallelism, more metadata). 256 MB is a good default for HF datasets and Spark.
- `--avg-compressed-kb-{tei,json,md}` — sets `target_files_per_shard`. Defaults are conservative (45/25/15 KB). Inspect actual compression ratios in the build logs and tune for round 2.
- `--workers` — discovery thread pool size (default 16). Increase if your storage backend tolerates parallel metadata requests; lower if admins complain.
- `MAX_CONCURRENT` env var for `submit_all.sh` — caps in-flight array tasks per format. Default 256.
- `compression_level` in `submit_pegasus.sbatch` — zstd level (default 9). Drop to 3 for ~2× faster builds at ~10 % larger output.
The plan stores file index ranges, not filenames. Build workers re-list the directory at run time, sort filenames ascending, and take their slice. This keeps the plan small (~hundreds of KB even for tens of millions of files), at the cost of one assumption: directory contents must be stable between plan and build. For an offline batch this is fine. If files are added/removed in between, the build worker will warn and truncate the slice.
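A sketch of that build-side behavior, with a hypothetical helper and illustrative warning text (the real `shard_builder.py` may differ in detail):

```python
import os

def resolve_slice(dir_path: str, start: int, end: int) -> list[str]:
    # Re-list and sort at build time; the plan only carries [start, end).
    names = sorted(e.name for e in os.scandir(dir_path) if e.is_file())
    if end > len(names):
        # Directory shrank since planning: warn and truncate, as described above.
        print(f"WARNING: {dir_path}: {len(names)} files on disk, "
              f"plan expected at least {end}; truncating slice")
        end = len(names)
    return names[start:end]
```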