
Implement RDFC-1.0#245

Open
mielvds wants to merge 28 commits into master from 220-RDFC10

Conversation

@mielvds
Collaborator

@mielvds mielvds commented Feb 25, 2026

This PR is a rather big one. It fully implements URDNA2015, URGNA2012 and RDFC10 and has complete test-suite coverage of these algorithms (with the exception of the test on poisoned datasets, but more about that later). It 'fixes' #220

Some general remarks before I dive into the details:

  • I went back and forth on how to approach this: move the canon stuff to a new repo first, or fix the problems first. I sided with the latter, as the test harness is already in place and the code is too tangled not to break everything.
  • This introduces rdflib into PyLD, but limited to the normalization/canonicalization code in canon.py. This made the code much, much cleaner, but ironically didn't fix what I introduced it for: nquads serialization. The relevant methods are copied over from rdflib until I can turn them into PRs for rdflib.
  • I added some functions to transform the legacy RDF.JS dataset structure into an rdflib.Dataset and back. This should ensure backwards compatibility of some methods such as jsonld.normalize() and URDNA2015.main().
  • While switching to rdflib introduces significant changes, I tried to change as little as possible otherwise. Optimizing the algorithm implementations can be done later.
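To make the legacy bridge concrete, here is a minimal sketch of the kind of round-trip those helpers perform. The function names mirror the PR's from_legacy_dataset()/to_legacy_dataset(), but the quad representation (plain tuples instead of rdflib terms) and the exact dict shape are assumptions for illustration, not the actual PR code.

```python
def to_legacy_dataset(quads):
    """Group (subject, predicate, object, graph) tuples into a dict keyed
    by graph name, roughly like PyLD's legacy RDF.JS-like structure."""
    legacy = {}
    for s, p, o, g in quads:
        graph_name = g if g is not None else '@default'
        legacy.setdefault(graph_name, []).append(
            {'subject': s, 'predicate': p, 'object': o})
    return legacy

def from_legacy_dataset(legacy):
    """Flatten the dict-of-graphs structure back into quad tuples."""
    quads = []
    for graph_name, triples in legacy.items():
        g = None if graph_name == '@default' else graph_name
        for t in triples:
            quads.append((t['subject'], t['predicate'], t['object'], g))
    return quads

quads = [
    ('_:b0', 'http://example.org/p', '"Foo"', None),
    ('_:b1', 'http://example.org/p', '_:b0', 'http://example.org/g'),
]
legacy = to_legacy_dataset(quads)
assert set(from_legacy_dataset(legacy)) == set(quads)
```

A lossless round-trip like this is what lets the old entry points keep their dict-based signatures while the internals move to rdflib.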

Overview of the changes:

  • added rdflib dependency in all relevant config and code files.
  • the switch to rdflib mostly affected
    • RDF term type checking (e.g., checking whether something is a bnode)
    • looping over triples/quads
    • serialization
    • constructing RDF terms (custom deepcopy is no longer needed)
  • util.py
    • added the functions from_legacy_dataset() and to_legacy_dataset() to easily go from an rdflib.Dataset to an RDFJS-like dict (and back) wherever needed.
    • Also added unit tests in tests/test_util.py for these functions. We might want to move more general-purpose methods to this file.
  • canon.py:
    • the main logic was moved to URDNA2015._canonicalize(self, dataset: Dataset) while handling input and output remained in URDNA2015.main(). The latter now does the nquads parsing instead of JsonLdProcessor.normalize(), so all parsing and serialization is handled by the same class.
    • the method URDNA2015._canonicalize(self, dataset: Dataset) accepts an rdflib.Dataset and returns a tuple with
      • the canonicalized result as a nquads str and
      • the blank node identifier map as dict.
    • The method URDNA2015.main(self, dataset: str | dict | Dataset, options) now accepts an rdflib.Dataset object in addition to an nquads str or the original RDFJS-like dict. It returns
      • a str: the serialized nquads result, or
      • a dict: the result as RDFJS-like dataset or the blank node identifier map when the new parameter outputMap is True.
    • the hashing algorithm is now a class attribute URDNA2015.hash_algorithm, so it is easily configurable (required for RDFC-1.0)
    • the permutations() function now uses itertools.permutations instead of a custom implementation.
    • added replacements for rdflib's _nq_row and _quoteLiteral, which should eventually move into a fix for rdflib's nquads serializer.
  • tests/runtests.py
    • (re-)enabled all skipped URDNA2015 and URGNA2012 tests
    • Added the RDFC10 tests
    • added support for testing blank-node identifier maps. If the result of a test is a dict and the expected value is a string, the expected value is now parsed as JSON.
    • added support for testing with different hashing algorithms
    • !! I did not include test 74c on dataset poisoning because I'm not sure how to handle that yet. We need to discuss possible guardrails first before continuing on this.
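On the permutations() change mentioned above: the canonicalization algorithm has to try every ordering of related blank nodes when hashes tie, and the stdlib already provides that enumeration. A tiny illustration (the bnode labels below are made up):

```python
from itertools import permutations

# itertools.permutations yields every ordering of the input as tuples,
# which is all the old custom permutations() implementation provided.
bnodes = ['_:b0', '_:b1', '_:b2']
perms = list(permutations(bnodes))

assert len(perms) == 6  # 3! orderings
assert ('_:b1', '_:b0', '_:b2') in perms
```

Beyond deleting code, the stdlib version is lazy (it yields one ordering at a time), which matters when the list of related bnodes grows.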

It's possible that we move this code out of PyLD before ever merging, but at least it can be tested easily here. That said, if this functionality is later imported as an external library, rdflib comes along anyway. It therefore makes sense to continue replacing the RDFJS-like data structures elsewhere in the code, essentially making PyLD fully rdflib-based and rdflib-dependent.

Since this involves RDFLib: @nicholascar, could you or anyone else maybe have a look at this? Also @davidlehn @BigBlueHat @anatoly-scherbakov


@WhiteGobo WhiteGobo left a comment


I just looked over the parts where the behaviour of rdflib plays a role, because I don't fully understand what the code itself does. I hope my comments are a little helpful.
I have another, more general comment: since BNode is a subclass of str, if you e.g. check types with mypy you might get no error when you by mistake compare a str to a BNode, but this would always fail. This might play a role when you call hash_related_blank_node. But as I said, I don't have a full grasp of the program.

Comment thread on lib/pyld/canon.py (outdated)
Comment thread on lib/pyld/canon.py
encoded = self._quote_encode(l_)

if l_.language:
    if l_.datatype:

@WhiteGobo WhiteGobo Apr 25, 2026


I would remove this test (if l_.datatype:). In rdflib, if a language is given, the datatype should always be forced, so this datatype test should always return the same result. As far as I remember, the correct datatype for language-tagged literals is rdf:langString, so in the future this might change to return rdf:langString instead of None.
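The point above can be sketched with a minimal literal serializer. This is not the PR's actual _quoteLiteral; the Literal stand-in below is a plain dataclass (not rdflib's class), used only to show why the language branch never needs a nested datatype check in N-Quads output:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Literal:
    """Plain stand-in for rdflib.term.Literal, for illustration only."""
    value: str
    language: Optional[str] = None
    datatype: Optional[str] = None

def quote_literal(l_):
    encoded = '"%s"' % l_.value  # the real code also escapes quotes, newlines, etc.
    if l_.language:
        # A language-tagged literal serializes as "..."@lang; its datatype
        # is implicitly rdf:langString and is never written out, so a
        # nested `if l_.datatype:` check adds nothing on this branch.
        return '%s@%s' % (encoded, l_.language)
    if l_.datatype:
        return '%s^^<%s>' % (encoded, l_.datatype)
    return encoded

assert quote_literal(Literal('Foo', language='en')) == '"Foo"@en'
assert quote_literal(
    Literal('1', datatype='http://www.w3.org/2001/XMLSchema#integer')
) == '"1"^^<http://www.w3.org/2001/XMLSchema#integer>'
```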

Collaborator Author


Agreed! But this is code copied (and cleaned up) from rdflib's nquads serializer, so I'll fold it into a PR there.

@mielvds
Collaborator Author

mielvds commented May 5, 2026

since BNode is a subclass of str, if you e.g. check types with mypy you might get no error when you by mistake compare a str to a BNode, but this would always fail. This might play a role when you call hash_related_blank_node. But as I said, I don't have a full grasp of the program.

@WhiteGobo do you mean isinstance(component, BNode)? What would be a better way to check whether something is a blank node?

@WhiteGobo

WhiteGobo commented May 5, 2026

do you mean isinstance(component, BNode)?

No, I was thinking more of finding type errors with tools like mypy, and of the other way around, so something like isinstance(accidently_some_BNode, str).

This was merely a warning that, when using str instead of directly using a BNode, mypy wouldn't find inconsistent use between those two.
Just to illustrate, take for example canon.py line 55ff. I will change some lines so that they produce an error:

        bnode_map: "dict[str, str]"  # annotated so that mypy can detect wrong inserts
        normalized, bnode_map = self._canonicalize(rdflib_dataset)
        # mapping old bnode IDs to their new canonical IDs
        for k, v in parser._bnode_ids.items():
            bnode_id = str(v)
            # imagine you were distracted and by mistake used the original
            # BNode instead of the extracted str:
            if v in bnode_map:  # this line is erroneous, so mypy should warn here
                bnode_map[k] = bnode_map[bnode_id]
                del bnode_map[bnode_id]

For this kind of scenario I use mypy to warn me that I did something wrong, but mypy would register the BNode as a valid value, because it's just a subclass of str. And I don't mean that extracting the id from a BNode is wrong; I just wanted to point this out in case you use mypy to find such errors. As you extract bnode_map from somewhere else, I don't even think you should prefer BNode over str.
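The pitfall being described can be reproduced without rdflib at all. The class below is a minimal stand-in for rdflib.term.BNode (not the real class); the one property that matters, and that it shares with the real one, is that it subclasses str:

```python
class BNode(str):
    """Minimal stand-in for rdflib.term.BNode; like the real class it
    subclasses str, which is exactly the pitfall under discussion."""

label_map = {'e0': 'c14n0'}  # keyed by bare str labels

# mypy treats a BNode as a str, so neither lookup below is flagged, even
# though mixing the two representations only works while the BNode's
# text happens to equal the dict key exactly.
node = BNode('e0')
assert node in label_map
assert label_map[node] == 'c14n0'

# With an N-Quads-style label the mismatch surfaces only at runtime,
# and mypy still has nothing to say about it.
serialized = BNode('_:e0')
assert serialized not in label_map
```

A `dict[str, str]` annotation on the map, as suggested above, would at least catch writes of whole BNode values, though membership tests like `v in bnode_map` remain type-correct either way.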

@mielvds
Collaborator Author

mielvds commented May 5, 2026

ah ok, type checking, gotcha. Adding type hints everywhere would generate too many changes (and be too much work). I think this is for some other PR.

@gjhiggins

gjhiggins commented May 9, 2026

I know it's not my horse, but FYI: in the course of adapting your implementation as an RDFLib plugin, I encountered use case 9 in the W3C documentation of the spec and I don't seem to be able to replicate the results using the code in this PR. If my understanding of your example is correct and the following snippet is valid, the other unexpected result is that the resulting id_map flips arbitrarily between two canonicalizations ...

from pyld.canon import RDFC10

data = (
    '_:e0 <http://purl.org/base#p1> _:e1 .\n'
    '_:e1 <http://purl.org/base#p2> "Foo" .\n'
    '_:e2 <http://purl.org/base#p1> _:e3 .\n'
    '_:e3 <http://purl.org/base#p2> "Foo" .\n'
)

expected = {'e0': 'c14n0', 'e1': 'c14n1', 'e2': 'c14n2', 'e3': 'c14n3'}

actual = [
    {'e1': 'c14n0', 'e0': 'c14n1', 'e3': 'c14n2', 'e2': 'c14n3'},
    {'e3': 'c14n0', 'e2': 'c14n1', 'e1': 'c14n2', 'e0': 'c14n3'},
]

canonicalized = RDFC10().main(data, dict(
    inputFormat='application/n-quads', outputMap=True
))

assert canonicalized in actual

I'd be tempted to ascribe it to a documentation infelicity were it not for the exhaustive execution log that accompanies the use case description; and of course, I could just be holding it wrong.

Cheers
Graham

@mielvds
Collaborator Author

mielvds commented May 11, 2026

resulting id_map flips arbitrarily between two canonicalizations

Good question. It's strange that this is not part of the test suite. Could you perhaps repeat your question there, because I'm interested in what the expected behavior is here.

In order to pass some tests, the blank nodes are mapped according to the order of the objects. This might have something to do with rdflib's parser assigning new IDs to every blank node, which might change that order. I added your case as a unit test and will investigate a little further.

Collaborator

@anatoly-scherbakov anatoly-scherbakov left a comment


The PR adds specifications/rdf-canon to .gitmodules and wires it into default spec dirs at tests/runtests.py, but the PR diff does not include a gitlink for the submodule. Thus, specifications/rdf-canon is absent, so default pytest silently skips those tests instead of running the newly added coverage.

@mielvds
Collaborator Author

mielvds commented May 11, 2026

oh yes, I noticed the CI was not running the tests at all. I added the command, but the submodule is indeed not being fetched. Can you fix this, @anatoly-scherbakov? I don't have much experience with this.

@anatoly-scherbakov
Collaborator

@mielvds committed. In the olden days, I remembered how to do this: one has to run the git submodule add command, or something like that. Editing .gitmodules alone won't suffice, and I have stumbled upon this more than once.

Now, I must confess, I do entrust this kind of tedium to agents.
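For reference, the submodule registration sketched below would cover the missing gitlink. The repository URL is an assumption based on the PR description (the W3C rdf-canon test suite) and may differ from what the repo actually uses:

```shell
# Editing .gitmodules by hand is not enough: `git submodule add` both
# writes .gitmodules and stages the gitlink entry that was missing here.
git submodule add https://github.com/w3c/rdf-canon specifications/rdf-canon
git commit -m "Add rdf-canon test suite as a submodule"

# On a fresh clone (and in CI), submodules must be fetched explicitly:
git submodule update --init --recursive
```

The last command is what a CI workflow needs before the test run, otherwise pytest finds an empty specifications/rdf-canon directory and silently skips the suite.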

@mielvds
Collaborator Author

mielvds commented May 11, 2026

Well, I don't need an agent to know you must use git submodule add, so I must have done something weird. Anyway, thanks for the fix!

@gjhiggins

Good question. It's strange that this is not part of the test suite. Could you perhaps repeat your question there, because I'm interested in what the expected behavior is here.

Given that the rdf-canon test suite includes explicit RDFC10MapTest id_map tests, I'd assume the described behaviour is precisely what is expected. (Up until today, I hadn't paid any attention to the map tests and was unaware that my adaptation of the RDFLib W3C test suite harness was omitting to run those tests. I adjusted my adaptation to recognise and actually run the RDFC10MapTest tests and I'm seeing one failure: the SHA384 test: eval passes, the map doesn't).

In order to pass some tests, the blank nodes are mapped according to the order of the objects. This might have something to do with rdflib's parser assigning new IDs to every blank node, which might change that order. I added your case as a unit test and will investigate a little further.

I became aware of the ID reassignment issue after reading @WhiteGobo's observation in the skolemization discussion but I'm doubtful that it's the source of the issue in the duplicate path RDFC10MapTest failure. After extending **kwargs processing to ingest the bnode_context dict and making a couple of adjustments to my code, I'm now able to recover the pre-vs-post id_map and check it directly - and it looks just fine and it's the approach that I use when running the RDFC10MapTests, so from that perspective, the ID reassignment doesn't appear relevant (/me ensures hat is edible).

@mielvds
Collaborator Author

mielvds commented May 11, 2026

That's exactly what this implementation does as well: you can pass "hashAlgorithm": "SHA384" to main().

@gjhiggins

I'm seeing one failure: the SHA384 test: eval passes, the map doesn't.

Meh, operator error: I had forgotten to copy the hash_algorithm kwarg and binding into the serialize call params in my newly-authored map-test runner. After remediating this blunder, all the tests in the W3C rdf-canon suite now pass.
