Fragment Autocomplete is a two-year Pro*Niedersachsen research software project hosted at the Institute for Digital Humanities, University of Göttingen. The project aims to build an open-source and open-access AI-based toolchain that helps scholars generate and evaluate candidate page-level reconstruction hypotheses from surviving medieval manuscript fragments.
The system must preserve the distinction between observed evidence and inference. Reconstruction outputs are candidate scholarly hypotheses with uncertainty, provenance, and comparators; they are not claims that the original manuscript page, text, or decoration has been recovered.
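One way to keep observed evidence and inference apart is to give them separate record types, so that a hypothesis cannot silently mix the two. The following is a minimal sketch only; all class names, field names, and the 0.0 to 1.0 confidence scale are assumptions for illustration and are not taken from the project's code or database schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ObservedEvidence:
    """Something directly observed on a surviving fragment (hypothetical type)."""
    fragment_id: str
    description: str
    source: str  # provenance, e.g. an IIIF canvas URI or a local file path


@dataclass(frozen=True)
class Inference:
    """A derived claim; it always carries a confidence and the method behind it."""
    description: str
    confidence: float  # assumed scale: 0.0 (no support) to 1.0 (full support)
    method: str        # e.g. the name of a model or heuristic

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0.0, 1.0]")


@dataclass
class ReconstructionHypothesis:
    """A candidate hypothesis, never a claim of recovery (hypothetical type)."""
    hypothesis_id: str
    evidence: List[ObservedEvidence] = field(default_factory=list)
    inferences: List[Inference] = field(default_factory=list)
    comparators: List[str] = field(default_factory=list)  # ids of comparison pages

    def summary(self) -> str:
        return (
            f"{self.hypothesis_id}: candidate hypothesis with "
            f"{len(self.evidence)} observed, {len(self.inferences)} inferred, "
            f"{len(self.comparators)} comparators"
        )
```

Keeping the two lists separate means a viewer or export can always render inferred content differently from observed content.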
The project follows a layout-first MVP strategy. Early work should estimate page canvas, margins, columns, semantic layout zones, and plausible fragment placement before any experimental full-image generation is considered.
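The layout-first idea can be illustrated with plain geometry: estimate a page canvas, margins, and columns, then check whether a proposed fragment placement is even plausible on that canvas. The sketch below assumes millimetre units, axis-aligned boxes, and equal-width columns; the class and its fields are hypothetical and do not describe the project's actual layout model.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed mm


@dataclass
class PageLayoutEstimate:
    """Hypothetical page-level layout estimate."""
    width: float
    height: float
    margin_top: float
    margin_bottom: float
    margin_inner: float
    margin_outer: float
    columns: int = 1
    gutter: float = 0.0  # horizontal space between adjacent columns

    def text_block(self) -> Box:
        """Area left after subtracting the four margins from the canvas."""
        return (
            self.margin_inner,
            self.margin_top,
            self.width - self.margin_outer,
            self.height - self.margin_bottom,
        )

    def column_boxes(self) -> List[Box]:
        """Split the text block into equal-width columns separated by the gutter."""
        x0, y0, x1, y1 = self.text_block()
        col_w = (x1 - x0 - self.gutter * (self.columns - 1)) / self.columns
        boxes = []
        for i in range(self.columns):
            left = x0 + i * (col_w + self.gutter)
            boxes.append((left, y0, left + col_w, y1))
        return boxes

    def contains(self, box: Box) -> bool:
        """True if a fragment placement lies fully inside the page canvas."""
        bx0, by0, bx1, by1 = box
        return 0 <= bx0 <= bx1 <= self.width and 0 <= by0 <= by1 <= self.height
```

A placement that fails `contains` can be rejected before any more expensive step is attempted.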
eManuSkript / "Manuskripte digital lesen lernen" is the central technical backbone. Its manuscript layout-analysis model, with roughly 21 manuscript layout labels, should later support segmentation, pseudo-labeling, layout priors, retrieval keys, UI overlays, evaluation signals, and optional conditioning maps.
CoMMA is a text, transcription, and metadata resource. It can support text search, metadata enrichment, witness discovery, and IIIF manifest discovery, but it is not the core visual reconstruction dataset.
This repository currently contains project documentation, PostgreSQL/PostGIS migrations, IIIF ingestion proof-of-concept code, local dataset metadata registration, controlled segmentation runs, segmentation storage, and a minimal local segmentation viewer. It does not implement reconstruction, retrieval, artificial fragment generation, MSI workflows, CoMMA ingestion, production backend/frontend deployment, or a final scholarly interface.
.
|-- AGENTS.md
|-- CHANGELOG.md
|-- NEXT_ACTIONS.md
|-- PROJECT_BOARD.md
|-- README.md
|-- ROADMAP_STATUS.md
|-- data/
|-- docs/
|   |-- 01_architecture_overview.md
|   |-- architecture_build_notes.md
|   `-- figures/architecture/
|-- models/
|-- outputs/
|-- scripts/
|-- src/
`-- tests/
Key directories:
docs/: project planning, architecture, evaluation, data-source, and decision documents.
docs/figures/architecture/: Mermaid figure sources and rendered SVG architecture diagrams.
src/: future application code boundaries for backend, frontend, ingestion, ML, evaluation, and shared utilities.
data/: raw, processed, and metadata asset structure. Large data is excluded from git.
models/: model registry metadata. Model binaries are excluded from git.
outputs/: generated reports and exports.
scripts/: reproducible validation and document build scripts.
Render the architecture figures:
bash scripts/render_architecture_figures.sh
Build the architecture PDF, or HTML fallback if PDF dependencies are unavailable:
bash scripts/build_architecture_pdf.sh
Validate the workspace:
bash scripts/check_workspace.sh
The expected output is:
outputs/Fragment_Autocomplete_Architecture_Draft.pdf
If PDF generation is unavailable, the fallback output is:
outputs/Fragment_Autocomplete_Architecture_Draft.html
Install the minimal Python dependencies:
python3 -m pip install -r requirements.txt
Start the local PostgreSQL/PostGIS database:
bash scripts/db_start.sh
Apply migrations and seed values:
bash scripts/db_migrate.sh
Ingest a local IIIF fixture:
python3 scripts/ingest_iiif_manifest.py \
--file tests/ingestion/fixtures/iiif_v3_minimal_manifest.json \
--repository "Fixture Repository"
Dry-run a manifest without database writes:
python3 scripts/ingest_iiif_manifest.py \
--file tests/ingestion/fixtures/iiif_v3_minimal_manifest.json \
--repository "Fixture Repository" \
--dry-run
Validate the IIIF ingestion proof of concept:
bash scripts/validate_iiif_ingestion.sh
The ingestion proof of concept registers manifest, repository, manuscript, canvas, and image asset metadata. It does not download full-resolution image sets by default.
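To illustrate what registering canvas and image asset metadata without downloading images can look like, the sketch below walks an IIIF Presentation 3 manifest (Manifest, then Canvas, then AnnotationPage, then Annotation) and yields one record per painted image. The function name and the output keys are assumptions for illustration; `scripts/ingest_iiif_manifest.py` may work differently, and this sketch skips bodies that are lists or `Choice` objects.

```python
from typing import Any, Dict, Iterator


def iter_canvas_images(manifest: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield metadata for each painting image in an IIIF v3 manifest.

    Only metadata is read; no image bytes are fetched.
    """
    for canvas in manifest.get("items", []):
        if canvas.get("type") != "Canvas":
            continue
        for page in canvas.get("items", []):        # AnnotationPage objects
            for anno in page.get("items", []):      # Annotation objects
                body = anno.get("body")
                if anno.get("motivation") != "painting":
                    continue
                if not isinstance(body, dict) or body.get("type") != "Image":
                    continue  # lists and Choice bodies are out of scope here
                yield {
                    "manifest_id": manifest.get("id"),
                    "canvas_id": canvas.get("id"),
                    "canvas_label": canvas.get("label"),
                    "image_id": body.get("id"),
                    "width": body.get("width"),
                    "height": body.get("height"),
                }
```

Usage, assuming the fixture path from the commands above exists locally:

```python
import json

with open("tests/ingestion/fixtures/iiif_v3_minimal_manifest.json", encoding="utf-8") as fh:
    for record in iter_canvas_images(json.load(fh)):
        print(record["canvas_id"], record["image_id"])
```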
Extract the HSP/German controlled vocabularies and metadata mapping from the local source files:
python3 scripts/extract_hsp_metadata_standards.py --verbose
Apply the database migration, then import the controlled vocabularies:
bash scripts/db_migrate.sh
python3 scripts/import_hsp_normdata.py --verbose
Validate the metadata alignment layer:
bash scripts/validate_hsp_metadata_alignment.sh
The alignment layer adds controlled vocabularies for SCRP, FORM, CODC, BNDG, and simplified HSP values. It also adds selected HSP-aligned metadata fields while keeping source metadata in raw_metadata for provenance.
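The pattern of adding aligned fields while keeping the untouched source record in `raw_metadata` can be sketched as follows. The function, the field map, and the example keys are hypothetical; they do not reproduce the actual HSP mapping or the project's column names.

```python
import copy
from typing import Any, Dict


def align_record(source: Dict[str, Any], field_map: Dict[str, str]) -> Dict[str, Any]:
    """Copy mapped source fields to aligned names and keep the original record.

    `field_map` maps source keys to aligned target keys (both hypothetical here).
    Source keys missing from the record are skipped rather than invented.
    """
    aligned = {
        target: source[src] for src, target in field_map.items() if src in source
    }
    # A deep copy keeps provenance intact even if the caller later mutates `source`.
    aligned["raw_metadata"] = copy.deepcopy(source)
    return aligned
```

Because the original record travels with the aligned one, any mapping decision can be re-checked later against what the source actually said.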
The local asset inventory expects these folders when available:
autocomplete-test-dataset/
model weights/
Run the inventory script:
python3 scripts/inventory_local_assets.py
The script writes:
data/metadata/local_assets_manifest.yaml
docs/05_local_assets_inventory.md
models/model_registry.yaml
Do not commit local dataset images or model weights. These folders and common large model/data extensions are ignored by .gitignore; only inventory metadata should be committed.
Register the local pilot sample dataset:
python3 scripts/register_initial_sample_dataset.py --verbose
Dry-run registration without database writes:
python3 scripts/register_initial_sample_dataset.py --dry-run --verbose
Validate registered records:
bash scripts/validate_initial_sample_dataset.sh
This registers local metadata and database records for the pilot full pages and fragments. It does not run eManuSkript, load model weights, run segmentation, or download large image sets.
Inspect the local .pt checkpoints without running inference:
python3 scripts/inspect_model_weights.py --verbose
The script writes:
data/metadata/model_weights_compatibility.yaml
docs/06_model_weights_compatibility_report.md
models/model_registry.yaml
This inspection does not run inference or segmentation. It performs static archive inspection and safe CPU-side checkpoint checks only. Do not commit the model weights themselves.
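Static archive inspection is possible because checkpoints written by recent PyTorch versions are ZIP archives, so their entries can be listed with the standard library alone, without unpickling anything and without importing `torch`. The function below is a minimal sketch under that assumption; its name and output keys are hypothetical, and `scripts/inspect_model_weights.py` may perform different or additional checks.

```python
import zipfile
from pathlib import Path
from typing import Any, Dict, Union


def inspect_checkpoint_archive(path: Union[str, Path]) -> Dict[str, Any]:
    """List the entries of a checkpoint file without loading its contents.

    No pickle data is deserialized and no model code is executed.
    """
    path = Path(path)
    info: Dict[str, Any] = {
        "path": str(path),
        "size_bytes": path.stat().st_size,
        "is_zip": zipfile.is_zipfile(path),
        "entries": [],
    }
    if info["is_zip"]:
        with zipfile.ZipFile(path) as archive:
            info["entries"] = [
                {"name": item.filename, "size_bytes": item.file_size}
                for item in archive.infolist()
            ]
    return info
```

A file that is not a ZIP archive is reported with `is_zip` set to `False` and an empty entry list, which is itself useful compatibility information.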
Prepare the two controlled inputs for the first future smoke test:
python3 scripts/prepare_segmentation_test_inputs.py --verbose
Validate the prepared inputs:
bash scripts/validate_segmentation_test_inputs.sh
This step does not run segmentation, inference, or model loading. It only verifies local file paths, database links, and the selected model path.
Run the controlled smoke test on the two prepared inputs:
python3 scripts/run_segmentation_smoke_test.py --device cpu --verbose
Validate the emitted results:
bash scripts/validate_segmentation_smoke_test.sh
Outputs are written under outputs/segmentation_smoke_test/, while the machine-readable summary and report are written to:
data/metadata/segmentation_smoke_test_results.yaml
docs/08_segmentation_smoke_test_report.md
Store the existing smoke-test outputs in PostgreSQL/PostGIS:
python3 scripts/store_segmentation_outputs.py --verbose
Validate the stored segmentation_run and layout_region records:
bash scripts/validate_segmentation_storage.sh
This step stores metadata and region geometry for the two smoke-test samples only. It does not rerun inference, and it does not build the local UI viewer.
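Region geometry can be handed to PostGIS as well-known text (WKT), which functions such as `ST_GeomFromText` accept. The helper below is a sketch that converts a pixel bounding box into a closed WKT polygon; the function name and the bounding-box input format are assumptions, and the real storage script may instead store polygons or masks produced by the model.

```python
def bbox_to_wkt(x_min: float, y_min: float, x_max: float, y_max: float) -> str:
    """Convert an axis-aligned bounding box into a closed WKT polygon string."""
    if x_max <= x_min or y_max <= y_min:
        raise ValueError("bounding box must have positive width and height")
    # A WKT polygon ring must end on the point where it started.
    ring = [
        (x_min, y_min),
        (x_max, y_min),
        (x_max, y_max),
        (x_min, y_max),
        (x_min, y_min),
    ]
    coords = ", ".join(f"{float(x)!r} {float(y)!r}" for x, y in ring)
    return f"POLYGON(({coords}))"
```

For example, `bbox_to_wkt(10, 20, 110, 220)` returns `POLYGON((10.0 20.0, 110.0 20.0, 110.0 220.0, 10.0 220.0, 10.0 20.0))`.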
Start the local PostgreSQL/PostGIS database:
bash scripts/db_start.sh
Validate that the stored viewer data and local image paths are available:
python3 scripts/validate_segmentation_viewer_data.py
Run the local read-only viewer:
bash scripts/run_segmentation_viewer.sh
You can also run the combined viewer validation:
bash scripts/validate_segmentation_viewer.sh
This viewer is local-only and read-only. It displays already stored segmentation outputs from PostgreSQL/PostGIS and local files; it does not run inference, store new results, or build the final project UI.
Prepare the 10 pilot inputs from the registered initial sample dataset:
python3 scripts/prepare_segmentation_pilot_inputs.py --verbose
Validate the prepared pilot inputs:
bash scripts/validate_segmentation_pilot_inputs.sh
Run the full pilot segmentation on CPU:
python3 scripts/run_segmentation_pilot.py --device cpu --conf 0.25 --imgsz 320 --verbose
Store the pilot outputs in PostgreSQL/PostGIS:
python3 scripts/store_segmentation_pilot_outputs.py --verbose
Validate the pilot run and storage:
bash scripts/validate_segmentation_pilot_run.sh
bash scripts/validate_segmentation_pilot_storage.sh
python3 scripts/validate_segmentation_viewer_data.py
Run the local viewer manually:
bash scripts/run_segmentation_viewer.sh
Pilot outputs are stored in PostgreSQL/PostGIS and can be inspected in the local read-only viewer. Artificial fragment generation, reconstruction, retrieval, MSI workflows, and CoMMA ingestion remain separate later steps.
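The `--conf 0.25` option in the pilot command above presumably acts as a confidence threshold below which detected regions are discarded; that reading is an assumption, not something the source states. Under that assumption, the effect on a list of detections can be sketched as follows. The dictionary keys and the label names in the usage example are hypothetical and are not the eManuSkript label set.

```python
from collections import Counter
from typing import Any, Dict, List


def filter_by_confidence(
    detections: List[Dict[str, Any]], threshold: float = 0.25
) -> List[Dict[str, Any]]:
    """Keep detections whose confidence is at or above the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0.0, 1.0]")
    return [d for d in detections if d["confidence"] >= threshold]


def count_by_label(detections: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count the remaining detections per layout label."""
    return dict(Counter(d["label"] for d in detections))
```

Usage with made-up detections:

```python
detections = [
    {"label": "main_text", "confidence": 0.91},
    {"label": "initial", "confidence": 0.12},
    {"label": "main_text", "confidence": 0.25},
]
kept = filter_by_confidence(detections, threshold=0.25)
print(count_by_label(kept))
```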
Build the artificial fragment generator for complete-page samples.