Add entry point for extracting datasets from TEI #2

Merged
lfoppiano merged 4 commits into dev from add-tei-processing-dataset on Apr 13, 2026

@lfoppiano (Collaborator) commented Apr 12, 2026

Summary

  • Adds TEI (Text Encoding Initiative) XML processing for dataset extraction
  • Fixes URL extraction for supplementary materials generated by pub2tei
  • Improves reference parsing error handling
  • Includes merge of model location updates from feature/update-models-locations

lfoppiano and others added 4 commits January 2, 2025 19:50
…al generated by pub2tei)

(cherry picked from commit 39c0e43)
- Fix .toList() Java 16 incompatibility (use Collectors.toList() for Java 11)
- Fix wrong customisation file name (software -> dataset) in DatasetDisambiguator
- Add title to selectedSequences in processTEIDocument (was silently discarded)
- Fix NPE in XMLUtilities.segment() when sentence detection fails
- Fix biblioRefMap key mismatch (use consistent refKey integer keys)
- Add bounds check for classifier results to prevent IndexOutOfBoundsException
- Fix Content-Type mismatch: use APPLICATION_JSON instead of TEXT_PLAIN for JSON endpoints
- Fix off-by-one in getLastDirectChild (loop now checks index 0)
- Fix DatastetAnalyzer.getInstance() race condition (restore synchronized block)
- Fix getTextNoRefMarkersAndMarkerPositions duplicating content for multi-child refs
- Fix DatasetParser.getInstance() broken double-checked locking
- Add null checks for originFile in finally blocks
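
Two of the fixes listed above follow well-known Java patterns. The sketch below illustrates them in isolation, and it is not the project's actual code. `LazySingletonSketch` and `nonEmpty` are hypothetical names chosen for illustration. It shows double-checked locking with a `volatile` field, which is the usual way to repair a broken `getInstance()`, and `Collectors.toList()` as the Java 11-compatible replacement for `Stream.toList()`, which only exists from Java 16.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for a lazily created parser or analyzer singleton.
final class LazySingletonSketch {
    // Without volatile, another thread could observe a partially constructed instance.
    private static volatile LazySingletonSketch instance;

    private LazySingletonSketch() {}

    static LazySingletonSketch getInstance() {
        LazySingletonSketch local = instance;          // first check, without locking
        if (local == null) {
            synchronized (LazySingletonSketch.class) {
                local = instance;                      // second check, under the lock
                if (local == null) {
                    local = new LazySingletonSketch();
                    instance = local;
                }
            }
        }
        return local;
    }

    // Java 11-compatible collection of a stream into a list.
    static List<String> nonEmpty(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }
}
```

Repeated calls to `LazySingletonSketch.getInstance()` return the same object, and only the first call pays for synchronization. An initialization-on-demand holder class would be a simpler alternative when construction takes no arguments.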

https://claude.ai/code/session_018EBZhK2RtGtsvN4E1rp2tF
@lfoppiano closed this Apr 13, 2026
@lfoppiano reopened this Apr 13, 2026
@lfoppiano merged commit 68632f6 into dev Apr 13, 2026
1 of 2 checks passed
@lfoppiano deleted the add-tei-processing-dataset branch April 13, 2026 07:19

2 participants