A preprocessing pipeline for Peirce’s manuscripts developed for the Peirce Interprets Peirce project, enabling structured data extraction and diagram discovery using IIIF manifests.

friendlynihilist/pip-processor

PIP-Manuscripts-Processor

PIP-Manuscripts-Processor is a modular preprocessing and analysis pipeline developed within the Peirce Interprets Peirce project.
It enables structured access to the digitised manuscripts of Charles S. Peirce and supports downstream tasks such as visual classification, diagram recognition, and semantic annotation.

Features

  • Extracts structured metadata from Harvard’s Houghton Library IIIF manifests
  • Downloads and organises manuscript pages by Robin’s classification system
  • Classifies manuscript pages into three classes: text, diagram_mixed, and cover
  • Computes CLIP embeddings for all pages to support downstream ML tasks
  • Generates derivative datasets (e.g. diagram-rich pages only) for layout detection
  • Provides UMAP visualisation for interpretability and quality control
  • Prepares outputs for semantic reinjection into IIIF using oa:Annotation
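
As a rough illustration of the first and last items in the list above, the sketch below reads page records out of a IIIF manifest and wraps a page class in an oa:Annotation. It is not the project's own code. It assumes the manifests follow the IIIF Presentation API 2.x layout (sequences, canvases, images, resource), and the function names extract_pages and page_annotation, the record keys, and the oa:tagging body shape are all hypothetical choices made for this example.

```python
def extract_pages(manifest: dict) -> list[dict]:
    """Return one record per canvas (page) of a IIIF Presentation 2.x manifest.

    Assumption: the manifest uses the 2.x layout, where each canvas lists its
    images and the image URL is stored at images[].resource["@id"].
    """
    pages = []
    for sequence in manifest.get("sequences", []):
        # Page numbering restarts for each sequence; most manifests have one.
        for index, canvas in enumerate(sequence.get("canvases", []), start=1):
            images = canvas.get("images", [])
            image_url = images[0]["resource"]["@id"] if images else None
            pages.append({
                "manifest_id": manifest.get("@id"),
                "canvas_id": canvas.get("@id"),
                "page_index": index,
                "label": canvas.get("label"),
                "image_url": image_url,
            })
    return pages


def page_annotation(canvas_id: str, page_class: str) -> dict:
    """Build a minimal oa:Annotation that tags a canvas with a page class.

    Assumption: a plain tag body is enough; the project may use a richer body.
    """
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@type": "oa:Annotation",
        "motivation": "oa:tagging",
        "resource": {"@type": "oa:Tag", "chars": page_class},
        "on": canvas_id,
    }


# Usage with a tiny in-memory manifest (placeholder URLs, not real Harvard data).
example_manifest = {
    "@id": "https://example.org/manifest",
    "sequences": [{
        "canvases": [{
            "@id": "https://example.org/canvas/1",
            "label": "seq. 1",
            "images": [{"resource": {"@id": "https://example.org/image/1.jpg"}}],
        }],
    }],
}
example_pages = extract_pages(example_manifest)
example_annotation = page_annotation(example_pages[0]["canvas_id"], "diagram_mixed")
```

Keeping the canvas ID in every page record is what makes the later annotation step possible, since the "on" field of the annotation has to point back at the canvas the prediction was made for.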

Installation

git clone https://github.com/friendlynihilist/PIP-Manuscripts-Processor.git
cd PIP-Manuscripts-Processor
pip install -r requirements.txt

Usage

Extract CLIP embeddings for the full corpus:

python src/features/generate_clip_embeddings_full.py

Run the classification pipeline on training/test sets:

python src/classification/train_logistic_clip.py

Generate UMAP plots from CLIP vectors and Robin categories:

python src/visualisation/umap_diagram_by_category.py

Folder Structure

  • data/raw/Manuscripts/: original image files, organised by category and item ID
  • data/processed/: metadata files, CSVs, embeddings, classification results
  • data/derived/: generated subsets, e.g. layout-ready diagram pages
  • src/: all scripts grouped by function (features, classification, visualisation, layout)
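
To show how the folders above might relate to one another, here is a minimal sketch that copies pages predicted as diagram_mixed from the raw tree into a derived subset, preserving the category and item ID subfolders. It is an assumption-laden illustration rather than the repository's actual script: the function name build_diagram_subset and the CSV column names relative_path and predicted_class are hypothetical, and the real files in data/processed/ may be laid out differently.

```python
import csv
import shutil
from pathlib import Path


def build_diagram_subset(predictions_csv, raw_root, derived_root,
                         wanted="diagram_mixed"):
    """Copy pages predicted as `wanted` from raw_root into derived_root.

    Assumption: the CSV has a `relative_path` column (path below raw_root,
    e.g. category/item_id/page.jpg) and a `predicted_class` column.
    """
    raw_root, derived_root = Path(raw_root), Path(derived_root)
    copied = []
    with open(predictions_csv, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            if row["predicted_class"] != wanted:
                continue
            source = raw_root / row["relative_path"]
            if not source.is_file():
                continue  # skip rows whose image is missing on disk
            # Mirror the category/item layout of the raw tree.
            target = derived_root / row["relative_path"]
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source, target)
            copied.append(target)
    return copied
```

A call such as build_diagram_subset("predictions.csv", "data/raw/Manuscripts", "data/derived/diagrams") would then leave the raw images untouched and produce a parallel tree containing only the selected pages; the file and folder names in that call are placeholders.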

License

MIT License
