Implement RDFC-1.0#245
Conversation
There was a problem hiding this comment.
I just looked over the parts, where the behaviour of rdflib plays a role, because i dont fully understand what the code itself does. I hope my comments are a little helpful.
I have another more generell comment, as BNode is a subclass to str if you eg. check types with mypy you might get no error, when you by mistake compare a str to a BNode. But this would always fail. this might play a role, when you call hash_related_blank_node. But as i said, i have not a full grasp on the program.
| encoded = self._quote_encode(l_) | ||
|
|
||
| if l_.language: | ||
| if l_.datatype: |
There was a problem hiding this comment.
I would remove this test if l_.datatype: in rdflib, if language is given, the datatype should always be forced, so this test for datatype should always return the same. As far as i remember the correct datatype for lang tagged literals is rdf:langstring, so there might be in future a change to return rdfs:langstring instead of None.
There was a problem hiding this comment.
Agreed! But this is code copied (and cleaned) from rdflib's nquads serializer, so I'll mix it in a PR there.
@WhiteGobo do you mean |
No i did think more of finding type errors with tools like mypy. And i think of the over way aroung so something like This was merely a warning, when using bnode_map: "dict[str, str]" # this would be used, so that mypy can detect wrong inserts.
normalized, bnode_map = self._canonicalize(rdflib_dataset)
# mapping old bnode IDs to their new canonical IDs
for k, v in parser._bnode_ids.items():
bnode_id = str(v)
# Imagine, you were unconcentrated and you used by mistake the original bnode instead of the extracted str.
if v in bnode_map: # This line is errornous and therefore mypy should throw a warning
bnode_map[k] = bnode_map[bnode_id]
del bnode_map[bnode_id]For these kind of scenario, i use mypy to warn me, that i did something wrong, but mypy would register BNode as valid value, because, its just a subclass to |
|
ah ok, type checking, gotya. Adding type hints everywhere would generate too many changes (and be too much work). I think this for some other PR |
|
I know it's not my horse but FYI ... in the course of adapting your implementation as an RDFLib plugin, I encountered use case 9 in the w3c documentation of the spec and I don't seem to be able to replicate the results using the code in this PR. If my understanding of your example is correct and the following snippet is valid, the other unexpected result is that the resulting id_map flips arbitrarily between two canonicalizations ... from pyld.canon import RDFC10
data = '_:e0 <http://purl.org/base#p1> _:e1 .\n' + \
'_:e1 <http://purl.org/base#p2> "Foo" .\n' + \
'_:e2 <http://purl.org/base#p1> _:e3 .\n' + \
'_:e3 <http://purl.org/base#p2> "Foo" .\n'
expected = {'e0': 'c14n0,', 'e1': 'c14n1', 'e2': 'c14n2', 'e3': 'c14n3'}
actual = [
{'e1': 'c14n0', 'e0': 'c14n1', 'e3': 'c14n2', 'e2': 'c14n3'},
{'e3': 'c14n0', 'e2': 'c14n1', 'e1': 'c14n2', 'e0': 'c14n3'},
]
canonicalized = RDFC10().main(data, dict(
inputFormat='application/n-quads', outputMap=True)
)
assert canonicalized in actualI'd be temped to ascribe it to a documentation infelicity were it not for the exhaustive execution log that accompanies the use case description --- and of course, I could just be holding it wrong. Cheers |
Good question. It strange that this is not part of the test-suite. Could you perhaps repeat your question there, because I'm interested in what the expected behavior is here. In order to pass some tests, the blank nodes are mapped according to the order of the objects. This might have something to do with rdflib's parser assigning new IDs to every blanknode, which might change that order. I added your case as a unittest and will investigate a little further. |
anatoly-scherbakov
left a comment
There was a problem hiding this comment.
The PR adds specifications/rdf-canon to .gitmodules and wires it into default spec dirs at tests/runtests.py, but the PR diff does not include a gitlink for the submodule. Thus, specifications/rdf-canon is absent, so default pytest silently skips those tests instead of running the newly added coverage.
|
oh yes, I noticed the CI was not running the tests at all. I added the command, but the submodule is indeed not being fetched. Can you fix this @anatoly-scherbakov ? I don't have much experience with this. |
|
@mielvds committed. In the olden days, I remembered how to do this. One has to do Now, I must confess, I do entrust this kind of tedium to agents. |
|
Well, I don't need an agent to know you must use |
Given that the rdf-canon test suite includes explicit RDFC10MapTest
I became aware of the ID reassignment issue after reading @WhiteGobo's observation in the skolemization discussion but I'm doubtful that it's the source of the issue in the duplicate path RDFC10MapTest failure. After extending |
|
That's exactly what this implementation does as well. You can pass "hashAlgorithm": "SHA384" to |
Meh, operator error - had I forgotten to copy the |
This PR is a rather big one. It fully implements URDNA2015, URDNA2012 and RDFC10 and has complete test-suite coverage on these algorithms (with the exception of the one on poisonous datasets, but more about that later). It 'fixes' #220
Some general remarks before I dive into the details:
canon.py. This made the code much much cleaner, but ironically didn't fix what I introduced it for: nquad serialization. The relevant methods are copied over from rdflib until I can turn them into PRs for rdflib.rdflib.Datasetand back. This should ensure backwards-compatibility on some methods such asjsonld.normalization()andURDNA2015.main().Overview of the changes:
rdflibdependency in all relevant config and code files.rdflibmostly changedutil.pyfrom_legacy_dataset()andto_legacy_dataset()to easily go from anrdflib.Datasetto an RDFJS-likedictwherever needed.tests/test_util.pyfor these functions. We might want to move more general-purpose methods to this file.canon.py:URDNA2015._canonicalize(self, dataset: Dataset)while handling input and output remained inURDNA2015.main(). The latter now does the nquads parsing instead ofJsonLdProcessor.normalize(), so all parsing and serialization is handled by the same class.URDNA2015._canonicalize(self, dataset: Dataset)accepts anrdflib.Datasetand returns a tuple withstranddict.URDNA2015 .main(self, dataset: str | dict | Dataset, options)now accepts ardflib.Datasetobject in addition to an nquadsstror the original RDFJS-likedict. It returnsstr: the serialized nquads result, ordict: the result as RDFJS-like dataset or the blank node identifier map when the new parameteroutputMapisTrue.URDNA2015.hash_algorithmso it easily configurable (required for RDFC1.0)permutations()function now usesitertools.permutationsinstead of a custom implementation._nq_rowand_quoteLiteral, which should eventually move to a fix for rdflib's nquads serializer.tests/runtests.pyIt's possible that we move this code out before ever merging, but at least it can be easily tested. That said, when importing this functionality as an external library, we include rdflib anyway. It therefore makes sense to continue replacing the RDFJS-like data structures elsewhere in the code, essentially making it fully rdflib compliant and dependent.
Since this involves RDFLib: @nicholascar, anyone who can maybe have a look at this? Also @davidlehn @BigBlueHat @anatoly-scherbakov