While working on format ID workflows, we ran into the problem of multi-part objects: identifying them is difficult, and atomising them on ingest is lossy/dangerous.
Noting [Identification of Multi-Part Digital Objects (PHAIDRA - o:1424890)](https://phaidra.univie.ac.at/detail/o:1424890)
Shapefiles
Ref [file extension](https://desktop.arcgis.com/en/arcmap/latest/manage-data/shapefiles/shapefile-file-extensions.htm)
See http://fileformats.archiveteam.org/wiki/Shapefile
Experimental syntax
Local file convention: gpo.us.gov/fmt/1 gpo:fmt/1
- {id}/ (R)
- {id}/{id}-mets.xml (R)
- {id}/{id}-marc.xml (R)
- {id}/([0-9]+).(iso|img) (R+)
Using {id} as a short-hand for a named group (?P<id>.*)
i.e. the tree means - if there's this, then there should be this. Tree does not mean 'descend the hierarchy', but that could be added as an option, perhaps.
(R) means required, which should be the default.
Have to explicitly add (O) for optional
(+) means more than one allowed
Shapefile: digipres.org/id/fmt/1 dpo:fmt/1
- {id}.shp -> nationalarchives.gov.uk/PRONOM/x-fmt/235 nap:x-fmt/235
- {id}.shx
- {id}.dbf
- {id}.xxx (O)
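A minimal sketch of how the experimental syntax above might be checked against a flat list of file names. Everything here is an assumption made for illustration: the `(template, flags)` encoding, the names `SHAPEFILE`, `template_to_regex` and `check`, and the choice to treat the first entry as the trigger for "if there's this, then there should be this". It handles only the `{id}` short-hand and the R/O flags, not `(+)` or embedded regexes.

```python
import re

# Hypothetical encoding of the Shapefile pattern: (template, flags).
# "R" = required (the default), "O" = optional.
SHAPEFILE = [
    ("{id}.shp", "R"),
    ("{id}.shx", "R"),
    ("{id}.dbf", "R"),
    ("{id}.xxx", "O"),
]


def template_to_regex(template):
    """Turn '{id}.shp' into a regex with {id} as the named group (?P<id>.*)."""
    escaped = re.escape(template).replace(r"\{id\}", "(?P<id>.*)")
    return re.compile(f"^{escaped}$")


def check(names, pattern):
    """Return {id: [missing required names]} for each id found via the first template."""
    names = set(names)
    trigger = template_to_regex(pattern[0][0])
    results = {}
    for name in sorted(names):
        m = trigger.match(name)
        if not m:
            continue
        ident = m.group("id")
        # Every non-optional template must exist for this id.
        results[ident] = [
            template.replace("{id}", ident)
            for template, flags in pattern
            if "O" not in flags and template.replace("{id}", ident) not in names
        ]
    return results
```

For example, `check(["roads.shp", "roads.shx", "roads.dbf", "rivers.shp", "rivers.dbf"], SHAPEFILE)` reports `rivers.shx` as missing for `rivers` and nothing missing for `roads`.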
2026-04-17 notes on Checking Folder Structures
Start with a structure compatible with the tree command, in JSON?
e.g.
$ tree -s -J --noreport ailink
[
{"type":"directory","name":"ailink","size":4096,"contents":[
{"type":"file","name":"ailink.chm","size":22116},
{"type":"file","name":"ailink.exe","size":360960},
{"type":"file","name":"ailink.txt","size":3326},
{"type":"file","name":"cpm.bat","size":292},
{"type":"file","name":"cpm.doc","size":8914},
{"type":"file","name":"cpm.exe","size":55456},
{"type":"file","name":"cpm.img","size":512},
{"type":"file","name":"cpm.pif","size":967},
{"type":"file","name":"cpmwin.bat","size":232},
{"type":"file","name":"instal.bat","size":235},
{"type":"file","name":"setup.cfg","size":134},
{"type":"file","name":"setup.exe","size":274944}
]}
]
Set up Python bindings for this.
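One possible starting point, sketched in plain Python rather than by calling `tree` itself: build the same `type`/`name`/`size`/`contents` shape with the standard library. The function name `tree` is an assumption, and symlinks are treated as plain files here, which is a simplification compared with what the real command reports.

```python
import os


def tree(path):
    """Return a dict shaped like one entry of the `tree -s -J` output."""
    name = os.path.basename(os.path.normpath(path))
    size = os.lstat(path).st_size
    if os.path.isdir(path) and not os.path.islink(path):
        return {
            "type": "directory",
            "name": name,
            "size": size,
            # Sort for a stable listing order.
            "contents": [
                tree(os.path.join(path, child))
                for child in sorted(os.listdir(path))
            ],
        }
    # Assumption: anything that is not a real directory is reported as a file.
    return {"type": "file", "name": name, "size": size}
```

Wrapping the result in a list, e.g. `json.dumps([tree("ailink")])`, gives the same top-level shape as the example output.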
Then set up a relatively simple [Schematron](https://schematron.com/)-inspired language for making assertions about file system layouts, using JSONPath instead of XPath. For example, in the GPO data set, every item should either be a directory with a known layout, or a known/expected manifest file.
{
  "pattern": [
    {
      "rule": {
        "context": "$[*].name == 'part_2.md5'",
        "assert": [
          ".size > 0"
        ]
      }
    },
    {
      "rule": {
        "context": ".type == 'directory'",
        "assert": [
          ".name =~ /^\\d+$/",
          ".contents[?(@.name == .name.append('-mets.xml'))].size > 0"
        ]
      }
    }
  ]
}
i.e. context runs a JSONMatch which we then run the assertions on. YAML might actually be easier going for writing these matchers with comments etc.
[Python JSONPath](https://jg-rp.github.io/python-jsonpath/)
https://github.com/json-path/JsonPath?tab=readme-ov-file#path-examples
[RFC 9535: JSONPath: Query Expressions for JSON](https://www.rfc-editor.org/rfc/rfc9535.html#section-2.4.9)
Asserts can also be test + message.
i.e. if it's a directory, check:
- its name is numerical,
- it has a file called {foldername}-mets.xml with size > 0
- etc...
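The directory checks listed above can be sketched in plain Python over the `tree`-style structure, without any JSONPath library. The function names `check_directory` and `check_tree` and the message wording are assumptions; this is only meant to pin down what the two assertions would have to do.

```python
import re


def check_directory(entry):
    """Return a list of failure messages for one directory entry."""
    failures = []
    # Assertion 1: the directory name is numerical.
    if not re.fullmatch(r"\d+", entry["name"]):
        failures.append(f"{entry['name']}: directory name is not numerical")
    # Assertion 2: there is a {foldername}-mets.xml file with size > 0.
    mets_name = f"{entry['name']}-mets.xml"
    mets = [
        child for child in entry.get("contents", [])
        if child["type"] == "file" and child["name"] == mets_name
    ]
    if not mets or mets[0]["size"] <= 0:
        failures.append(f"{entry['name']}: missing or empty {mets_name}")
    return failures


def check_tree(tree):
    """Apply the directory checks to every top-level directory."""
    failures = []
    for entry in tree:
        if entry["type"] == "directory":
            failures.extend(check_directory(entry))
    return failures
```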
There is definitely something here about making the rules clearer, but some things are not clear. How do we catch files that are not matched?
So, make it so context should match one or more files/directories only?
Make patterns at the entity level? So part_2.md5 and .type = dir & numeric are pattern contexts, and then everything in contents has to match a rule?
These are just all very naturally nested rules! It needs to be file-focussed anyway.
# No zero-length files anywhere
for $..[?(@.type == 'file')]:
    assert _.size > 0
for $..[?(@.type == 'file') && (@.size == 0)]:
    report "Zero-size file detected!"
# For top-level directories with names made of numbers only:
in $[?(@.type == 'directory') && (@.name =~ /^\d+$/)]:
    # There should be a matching <PARENT_FOLDER_NAME>-mets.xml file:
    match $.contents[?(@.name == $.name.append('-mets.xml'))]:
        assert ($.contentType == 'application/xml')
    # There should be one or more iso/img files:
    for $.contents[?(@.name =~ /^\d+\.(iso|img)$/)]:
        # Explicitly match each file:
        match _
        # For each of those (_), there should be an .idx file
        match $.contents[?(@.name == _.name.append('.idx'))]
    # for match syntax that automatically records each match:
    for match $.contents[?(@.name =~ /^\d+\.(iso|img)$/)]:
        # For each of those (_), there should be an .idx file
        match $.contents[?(@.name == _.name.append('.idx'))]
patterns:
  - context: $..  # i.e. anywhere rather than the default $.
    for: "@.type == 'file' && @.size == 0"
    report: "Zero-size file detected!"
  - for: "@.type == 'directory' && match(@.name, '[0-9]+')"
    contains:
      - "@.name == $.name.append('-mets.xml') && @.type == 'file'"
      - "@.name == $.name.append('-marc.xml') && @.type == 'file'"
    with:
      - match: '@.name =~ /^\d+\.(iso|img)$/'
        contains:
          - "@.name == _.name.append('.idx')"
While processing, the system should track which files have been matched by any rule. Any files not matched by any rule should cause some noise!
Noting that direct use of element names would go wrong due to support for dots etc.
Idea: annotate the tree with all visited elements so we can tell which files were not matched by any rule.
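A rough sketch of that annotation idea, assuming rules are plain Python callables taking `(path, entry)` and returning a truthy value on a match. The `_matched` key and the function names are hypothetical; in a real implementation the marking would happen wherever the path-based matching is evaluated.

```python
def walk(entries, prefix=""):
    """Yield (path, entry) for every entry in a tree-style list, depth first."""
    for entry in entries:
        path = f"{prefix}/{entry['name']}" if prefix else entry["name"]
        yield path, entry
        yield from walk(entry.get("contents", []), path)


def unmatched_files(tree, rules):
    """Mark each entry matched by any rule, then list the files nothing matched."""
    for path, entry in walk(tree):
        # Annotate the tree in place so later passes can see what was visited.
        entry["_matched"] = any(rule(path, entry) for rule in rules)
    return [
        path for path, entry in walk(tree)
        if entry["type"] == "file" and not entry["_matched"]
    ]
```

Anything returned by `unmatched_files` is a candidate for the "noise" mentioned above.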
The in keyword makes a new root; for adds a context, takes the array of matches, and passes them to a series of match calls as filter context. See [Advanced Usage - Python JSONPath](https://jg-rp.github.io/python-jsonpath/advanced/#filter-variables)
I think assertions should use the same logic as matches?
$[?(@.type == 'directory')].contents[?match(@.name, "[a-z]+\\.exe")]
$[?(@.type == 'directory')].contents[?match(@.name, "[a-z]+\\.(exe|txt)")]
[JSONPath Online Evaluator](https://jsonpath.com/)
This is all rather cumbersome! Could some kind of regex-replace work?
directory:
  - name: '/([0-9]+)/'
    files:
      - '\1-marc.xml'
      - '\1-mets.xml'
      - for: '/(.+)\.(iso|img)/'
        files:
          - '\0'
          - '\0.idx'
Maybe?! Kinda hacky. Similar to [Overview - dirschema](https://materials-data-science-and-informatics.github.io/dirschema/v0.1.0/) e.g [dirschema/schemas/toy_dataset.dirschema.yaml at main · Materials-Data-Science-and-Informatics/dirschema](https://github.com/Materials-Data-Science-and-Informatics/dirschema/blob/main/schemas/toy_dataset.dirschema.yaml)
The Schematron let support is cleaner/more explicit.
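For comparison, the regex-replace idea can be sketched with Python's `re` module, using `Match.expand` to fill in backreference templates. The function names are assumptions, and note that Python spells "the whole match" as `\g<0>` rather than `\0`.

```python
import re


def expected_files(dir_name, dir_regex, templates):
    """Expand templates like r'\\1-mets.xml' against a directory name.

    Returns None when the directory name does not match at all.
    """
    m = re.fullmatch(dir_regex, dir_name)
    if m is None:
        return None
    return [m.expand(template) for template in templates]


def missing_companions(names, file_regex, templates):
    """For each file matching file_regex, list expanded templates that are absent."""
    names = set(names)
    missing = []
    for name in sorted(names):
        m = re.fullmatch(file_regex, name)
        if m:
            expanded = [m.expand(template) for template in templates]
            missing.extend(e for e in expanded if e not in names)
    return missing
```

With these, `expected_files("1234", r"([0-9]+)", [r"\1-marc.xml", r"\1-mets.xml"])` gives the two metadata file names, and `missing_companions(names, r"(.+)\.(iso|img)", [r"\g<0>.idx"])` lists any disk image without its `.idx` file.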
Note that tree . -s -X also works, making XML:
<?xml version="1.0" encoding="UTF-8"?>
<tree>
<directory name="ailink" size="4096">
<file name="ailink.chm" size="22116"></file>
<file name="ailink.exe" size="360960"></file>
<file name="ailink.txt" size="3326"></file>
<file name="cpm.bat" size="292"></file>
<file name="cpm.doc" size="8914"></file>
<file name="cpm.exe" size="55456"></file>
<file name="cpm.img" size="512"></file>
<file name="cpm.pif" size="967"></file>
<file name="cpmwin.bat" size="232"></file>
<file name="instal.bat" size="235"></file>
<file name="setup.cfg" size="134"></file>
<file name="setup.exe" size="274944"></file>
</directory>
</tree>
Noting also multipart format use case: [Multipart Formats - Google Sheets](https://docs.google.com/spreadsheets/d/1bNcBRMJrjUzz1NF0qvl2kyzcS8AX0K0qhJ34eDBBEtE/edit?gid=0#gid=0) e.g. http://fileformats.archiveteam.org/wiki/Shapefile
<rule context="">
<assert test="@r and @g and @b" >One of R G B missing</assert>
<assert test="any other than @r,@g,@b,@a" >Invalid attribute </assert>
</rule>
e.g. matching a well-formed Shapefile directory:
.//directory[file[matches(@name, '.*\.shp$')] and file[matches(@name, '.*\.shp2$')]]
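The `matches()` function is XPath 2.0, which Python's standard `xml.etree.ElementTree` does not support, so a rough equivalent has to do the name tests in Python. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the default required extensions are the `.shp`, `.shx` and `.dbf` parts from the Shapefile pattern earlier in these notes rather than the ones in the expression above. It also does not check that the parts share the same base name.

```python
import xml.etree.ElementTree as ET


def shapefile_directories(xml_text, required=(".shp", ".shx", ".dbf")):
    """Return names of directories holding at least one file per required extension."""
    root = ET.fromstring(xml_text)
    found = []
    for directory in root.iter("directory"):
        # Only look at files directly inside this directory.
        names = [f.get("name", "") for f in directory.findall("file")]
        if all(any(name.endswith(ext) for name in names) for ext in required):
            found.append(directory.get("name"))
    return found
```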
See also:
2024-10-16 Extension sets
Looking at:
It occurs that syntaxes could be:
- codecs: e.g. application/octet-stream; extensions="xyz, chm"
- extensions="tif/tiff" i.e. using slash to mean OR (or possibly "tif|tiff")
- extensions="shp:sot" i.e. using : to mean AND, for compounds.
- application/zip; extensions="shp:sot" doesn't work as there's only one binary
- application/zip; extensions="/shp+sot" i.e. using + for AND and / to mean 'inside the thing'.
- application/octet-stream; globs="*-gz"
- inode/directory; globs="/*.shp+/*.sot"
- ? application/x-directory which is what file says, and inode is less standard and more device-specific.
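To make the first two candidate syntaxes concrete, here is a small sketch that reads an `extensions` value using `/` for OR and `:` for AND. The function names are hypothetical, and the precedence (AND binds tighter than OR) is an assumption, since the notes do not settle it. The later variant, where `/` means 'inside the thing' and `+` means AND, is deliberately not covered.

```python
def parse_extensions(value):
    """Parse 'tif/tiff' or 'shp:sot' into a list of alternatives.

    Each alternative is a set of extensions that must all be present.
    """
    return [set(part.split(":")) for part in value.split("/")]


def satisfies(extensions_present, value):
    """True if any alternative is fully covered by the extensions present."""
    present = set(extensions_present)
    return any(required <= present for required in parse_extensions(value))
```

So `"tif/tiff"` parses to two single-extension alternatives, while `"shp:sot"` parses to one alternative that needs both parts.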