Add HDT read/write serialization support #55

Merged
magbak merged 3 commits into DataTreehouse:main from sihingkk:hdt on Jun 14, 2026

@sihingkk sihingkk commented Jun 12, 2026

This adds HDT (Header Dictionary Triples) as a supported format for Model.read() and Model.write(), using the hdt crate.

What's included

  • Reading: m.read("file.hdt") (format inferred from the .hdt extension, or pass format="hdt"). Implemented by iterating the HDT dictionary and mapping into oxrdf terms; literals are parsed via oxrdf's N-Triples-style lexical forms, matching the hdt crate's conventions.
  • Writing: m.write("file.hdt", format="hdt"). The HDT four-section dictionary and triple bitmaps are built directly in memory and written straight to the output — no temporary file or intermediate N-Triples serialization. Literals with special characters are stored N-Triples-escaped, following the hdt crate's conventions.
  • reads()/writes() reject "hdt" with a clear error since it is a binary format.
  • Python tests in py_maplib/tests/test_hdt.py covering round-trips of IRIs, blank nodes, plain/typed/language-tagged literals, special characters, format inference, multi-graph behavior, and the error paths.
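
To make the write path easier to picture, here is a minimal pure-Python sketch of a four-section dictionary (shared, subject-only, predicates, object-only) and bitmap-encoded triples. It is not the Rust code in this PR and it does not produce the HDT binary layout. The function names, the dict keys, plain code-point string sorting, and the convention that a 1 bit marks the last entry of each adjacency list are all assumptions made for illustration.

```python
def build_hdt_sketch(triples):
    """Encode (s, p, o) string triples into dictionary sections plus
    bitmap-style triple arrays. Hypothetical helper, illustration only."""
    subjects = {s for s, _, _ in triples}
    predicates = {p for _, p, _ in triples}
    objects = {o for _, _, o in triples}

    # Four sorted sections: terms used as both subject and object are shared.
    shared = sorted(subjects & objects)
    subj_only = sorted(subjects - objects)
    preds = sorted(predicates)
    obj_only = sorted(objects - subjects)

    # 1-based IDs; shared terms take the low IDs in both the S and O spaces.
    s_id = {t: i for i, t in enumerate(shared + subj_only, start=1)}
    p_id = {t: i for i, t in enumerate(preds, start=1)}
    o_id = {t: i for i, t in enumerate(shared + obj_only, start=1)}

    # Deduplicate and sort the ID triples in SPO order.
    ids = sorted({(s_id[s], p_id[p], o_id[o]) for s, p, o in triples})

    # seq_p/bits_p: predicates per subject; seq_o/bits_o: objects per (s, p).
    # Assumed convention: a 1 bit marks the last entry of each list.
    seq_p, bits_p, seq_o, bits_o = [], [], [], []
    prev = None
    for s, p, o in ids:
        if prev is None or prev[:2] != (s, p):
            if prev is not None:
                bits_o[-1] = 1          # object list of previous (s, p) ends
                if prev[0] != s:
                    bits_p[-1] = 1      # predicate list of previous subject ends
            seq_p.append(p)
            bits_p.append(0)
        seq_o.append(o)
        bits_o.append(0)
        prev = (s, p, o)
    if prev is not None:
        bits_p[-1] = bits_o[-1] = 1     # close the final lists

    return {"shared": shared, "subjects": subj_only, "predicates": preds,
            "objects": obj_only, "seq_p": seq_p, "bits_p": bits_p,
            "seq_o": seq_o, "bits_o": bits_o}


def decode_hdt_sketch(enc):
    """Walk the arrays back into string triples (subjects are implicit)."""
    subj = enc["shared"] + enc["subjects"]
    obj = enc["shared"] + enc["objects"]
    out, s, pair = [], 1, 0
    for o, last_o in zip(enc["seq_o"], enc["bits_o"]):
        p = enc["seq_p"][pair]
        out.append((subj[s - 1], enc["predicates"][p - 1], obj[o - 1]))
        if last_o:
            if enc["bits_p"][pair]:
                s += 1                  # move on to the next subject
            pair += 1                   # move on to the next (s, p) pair
    return out


triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:knows", "ex:carol"),
    ("ex:alice", "ex:name", '"Alice"'),
    ("ex:bob", "ex:knows", "ex:alice"),
]
enc = build_hdt_sketch(triples)
assert set(decode_hdt_sketch(enc)) == set(triples)
```

Because subjects are implied by position and each predicate is stored once per subject, the encoding needs only the ID arrays and two bit sequences, which is why the structure can be assembled in memory and emitted without an intermediate N-Triples pass.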

Notes and limitations

  • The hdt crate loads the entire HDT file into memory on read; there is no streaming.
  • HDT is read-only by design, so write always produces a fresh file.
  • Cargo.lock grows by the hdt crate's dependency tree.

All py_maplib tests pass (384 passed, 2 skipped).

@thenonameguy

Contributor

Writing: m.write("file.hdt", format="hdt"). The hdt crate builds its dictionary from N-Triples input, so writing goes through a temporary N-Triples serialization; literals with special characters are stored N-Triples-escaped, again following the hdt crate.

Please do this without a temporary file or an in-memory intermediate serialization, so we don't need 2x the memory/disk space for |triples|.

curiouspresence and others added 2 commits June 12, 2026 15:14
Read uses the hdt crate's dictionary iteration mapped into oxrdf quads,
with literals parsed via oxrdf Literal::from_str.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@thenonameguy

Contributor

fixes #52

@magbak magbak left a comment

Member

Happy with this, especially the tests! Great work :-)

@magbak magbak merged commit 2b58e44 into DataTreehouse:main Jun 14, 2026

4 participants