Explore ways to describe multi-part formats #3

@anjackson

While working on format ID workflows, we came up against the problem of multi-part objects: identifying them is difficult, and atomising them on ingest is lossy/dangerous.

Noting [Identification of Multi-Part Digital Objects (PHAIDRA - o:1424890)](https://phaidra.univie.ac.at/detail/o:1424890)

Shapefiles

Ref [file extension](https://desktop.arcgis.com/en/arcmap/latest/manage-data/shapefiles/shapefile-file-extensions.htm)

+shp+shx+dbf

See http://fileformats.archiveteam.org/wiki/Shapefile

Experimental syntax

Local file convention: gpo.us.gov/fmt/1 gpo:fmt/1

  • {id}/ (R)
    • {id}/{id}-mets.xml (R)
    • {id}/{id}-marc.xml (R)
    • {id}/([0-9]+).(iso|img) (R+)
      • {id}/{\1}.{\2} (R)

Using {id} as a short-hand for a named group (?P<id>.*)

i.e. the tree means - if there's this, then there should be this. Tree does not mean 'descend the hierarchy', but that could be added as an option, perhaps.
(R) means required, which should be the default.
Have to explicitly add (O) for optional
(+) means more than one allowed

Shapefile: digipres.org/id/fmt/1 dpo:fmt/1

  • {id}.shp -> nationalarchives.gov.uk/PRONOM/x-fmt/235 nap:x-fmt/235
  • {id}.shx
  • {id}.dbf
  • {id}.xxx (O)
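A minimal sketch of how the Shapefile template above might be checked against a flat list of file names, assuming {id} simply means "the file name without its extension". The names REQUIRED and find_multipart are made up for illustration, and optional {id}.xxx parts are just carried along rather than validated.

```python
from collections import defaultdict
from pathlib import PurePath

# Hypothetical encoding of the dpo:fmt/1 template: the required parts only.
REQUIRED = {".shp", ".shx", ".dbf"}


def find_multipart(names, required=REQUIRED, anchor=".shp"):
    """Group names by stem ({id}) and split them into complete/incomplete sets."""
    by_id = defaultdict(set)
    for name in names:
        p = PurePath(name)
        by_id[p.stem].add(p.suffix.lower())

    complete, incomplete = {}, {}
    for stem, exts in by_id.items():
        if anchor not in exts:
            continue  # only stems with the anchoring part are candidates
        missing = required - exts
        if missing:
            incomplete[stem] = sorted(missing)  # which required parts are absent
        else:
            complete[stem] = sorted(exts)  # all parts found, optional ones included
    return complete, incomplete
```

Keeping the incomplete sets rather than dropping them seems useful here, since a .shp without its .shx or .dbf is exactly the lossy case mentioned at the top.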

2026-04-17 notes on Checking Folder Structures

Start with a structure compatible with the tree command, in JSON?

$ tree -s -J --noreport

e.g.

$  tree -s -J --noreport ailink
[
  {"type":"directory","name":"ailink","size":4096,"contents":[
    {"type":"file","name":"ailink.chm","size":22116},
    {"type":"file","name":"ailink.exe","size":360960},
    {"type":"file","name":"ailink.txt","size":3326},
    {"type":"file","name":"cpm.bat","size":292},
    {"type":"file","name":"cpm.doc","size":8914},
    {"type":"file","name":"cpm.exe","size":55456},
    {"type":"file","name":"cpm.img","size":512},
    {"type":"file","name":"cpm.pif","size":967},
    {"type":"file","name":"cpmwin.bat","size":232},
    {"type":"file","name":"instal.bat","size":235},
    {"type":"file","name":"setup.cfg","size":134},
    {"type":"file","name":"setup.exe","size":274944}
  ]}

]

Set up Python bindings for this.
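A rough sketch of what those bindings could look like, assuming the tree binary is on the PATH. The function names tree_json and scan are hypothetical. scan is a pure-Python fallback that produces the same shape of structure (type, name, size, contents) without calling tree, although the sizes it reports for directories depend on the platform.

```python
import json
import subprocess
from pathlib import Path


def tree_json(path):
    """Run `tree -s -J --noreport` on path and parse the JSON it prints."""
    result = subprocess.run(
        ["tree", "-s", "-J", "--noreport", str(path)],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def scan(path):
    """Fallback: build one tree-style node for path using only pathlib."""
    path = Path(path)
    node = {"name": path.name, "size": path.stat().st_size}
    if path.is_dir():
        node["type"] = "directory"
        # Sort children so the listing is stable across runs.
        node["contents"] = [scan(child) for child in sorted(path.iterdir())]
    else:
        node["type"] = "file"
    return node
```

Note that tree_json returns a list of nodes (as in the output above), while scan returns a single node, so wrap it as [scan(path)] if the two need to be interchangeable.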

Then set up a relatively simple [Schematron](https://schematron.com/)-inspired language for making assertions about file system layouts, using JSONPath instead of XPath. For example, in the GPO data set, every item should either be a directory with a known layout, or a known/expected manifest file.

{ 
	pattern: {
		[
			rule: {
				context: "$[*].name == 'part_2.md5'",
				assert: [
					".size > 0"
				]
			},
			rule: { 
				context: ".type == 'directory'",
				assert: [
					".name =~ /^\d+$/",
					".contents[?(@.name == .name.append('-mets.xml'))].size > 0"
				]
			}
		]
	}
}

i.e. context runs a JSONMatch which we then run the assertions on. YAML might actually be easier going for writing these matchers with comments etc.

[Python JSONPath](https://jg-rp.github.io/python-jsonpath/)
https://github.com/json-path/JsonPath?tab=readme-ov-file#path-examples
[RFC 9535: JSONPath: Query Expressions for JSON](https://www.rfc-editor.org/rfc/rfc9535.html#section-2.4.9)

Asserts can also be test + message.

i.e. if it's a directory, check:

  • its name is numerical,
  • it has a file called {foldername}-mets.xml with size > 0
  • etc...
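To make the context/assert idea concrete, here is a minimal sketch that uses plain Python callables where the real thing would use JSONPath expressions. That substitution is an assumption, made only to keep the example standard-library-only. Rule, check, has_mets and RULES are hypothetical names, and asserts are written as test + message pairs.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Rule:
    context: Callable[[dict], bool]  # which nodes the rule applies to
    asserts: list = field(default_factory=list)  # (test, message) pairs


def check(items, rules):
    """Run every applicable rule on each top-level item; return failure messages."""
    failures = []
    for node in items:
        for rule in rules:
            if not rule.context(node):
                continue
            for test, message in rule.asserts:
                if not test(node):
                    failures.append(f"{node['name']}: {message}")
    return failures


def has_mets(directory):
    """True if the directory holds a non-empty {foldername}-mets.xml."""
    wanted = directory["name"] + "-mets.xml"
    return any(
        c["name"] == wanted and c.get("size", 0) > 0
        for c in directory.get("contents", [])
    )


RULES = [
    Rule(lambda n: n["name"] == "part_2.md5",
         [(lambda n: n["size"] > 0, "manifest is empty")]),
    Rule(lambda n: n["type"] == "directory",
         [(lambda n: re.fullmatch(r"\d+", n["name"]) is not None, "name is not numerical"),
          (has_mets, "missing {foldername}-mets.xml")]),
]
```

With the tree output, items would be the contents array of the data set's root directory. This only looks at the top level and says nothing about entries that no rule's context selects.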

There is definitely something here about making the rules clearer, but some things are not clear. How do we catch files that are not matched?

So, make it so context should match one or more files/directories only?

Make patterns at the entity level? So part_2.md5 and .type = dir & numeric are pattern contexts, and then everything in content has to match a rule?

These are just all very naturally nested rules! It needs to be file focussed anyway.

# No zero-length files anywhere
for $..[?(@.type == 'file')]:
	assert _.size > 0
for $..[?(@.type == 'file') && (@.size == 0)]:
	report "Zero-size file detected!"
# For top-level directories with names made of numbers only:
in $.[?(@.type == 'directory') && (@.name =~ /^\d+$/)]:
	# There should be a matching <PARENT_FOLDER_NAME>-mets.xml file:
	match $.contents[?(@.name == $.name.append('-mets.xml'))]:
		assert ($.contentType == 'application/xml')
	# There should be one or more iso/img files:
	for $.contents[?(@.name =~ /^\d+\.(iso|img)$/)]:
		# Explicitly match each file:
		match _
		# For each of those (_), there should be an .idx file
		match $.contents[?(@.name == _.name.append('.idx'))]
	# for match syntax that automatically records each match:
	for match $.contents[?(@.name =~ /^\d+\.(iso|img)$/)]:
		# For each of those (_), there should be an .idx file
		match $.contents[?(@.name == _.name.append('.idx'))]

Roughly the same checks sketched as YAML:

patterns:
- context: $.. # i.e. anywhere rather than the default $.
  for: "@.type == 'file' && @.size == 0"
  report: "Zero-size file detected!"
- for: "@.type == 'directory' && match(@.name, '[0-9]+')"
  contains:
  - "@.name == $.name.append('-mets.xml') && @.type == 'file'"
  - "@.name == $.name.append('-marc.xml') && @.type == 'file'"
  with:
  - match: "@.name =~ /^\d+\.(iso|img)$/"
    contains:
    - "@.name == _.name.append('.idx')"

While processing, the system should track which files have been matched by any rule. Any files not matched by any rule should cause some noise!

Noting that direct use of element names would go wrong due to support for dots etc.

Idea to annotate the tree with all visited elements so we can tell which files were not matched by anyone.
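A small sketch of that annotation idea, assuming matched nodes get a private "_matched" key (a made-up convention) and that rules are reduced to file name regexes for brevity. apply_rule and unmatched are hypothetical names.

```python
import re


def apply_rule(nodes, pattern):
    """Mark every file whose name fully matches pattern as visited."""
    regex = re.compile(pattern)
    for node in nodes:
        if node["type"] == "file" and regex.fullmatch(node["name"]):
            node["_matched"] = True  # annotation lives on the tree itself
        apply_rule(node.get("contents", []), pattern)


def unmatched(nodes, prefix=""):
    """Yield the path of each file that no rule has marked."""
    for node in nodes:
        path = prefix + node["name"]
        if node["type"] == "file" and not node.get("_matched"):
            yield path
        yield from unmatched(node.get("contents", []), path + "/")
```

Running unmatched after all the rules have been applied gives the list of files to make noise about. Because paths are built from the tree rather than parsed from names, dots in file names do not matter.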

The in keyword makes a new root; for adds a context, takes the array of matches, and passes them to a series of match calls as filter context, see [Advanced Usage - Python JSONPath](https://jg-rp.github.io/python-jsonpath/advanced/#filter-variables)

I think assertions should use the same logic as matches?

$[?(@.type == 'directory')].contents[?match(@.name, "[a-z]+\\.exe")]
$[?(@.type == 'directory')].contents[?match(@.name, "[a-z]+\\.(exe|txt)")]

[JSONPath Online Evaluator](https://jsonpath.com/)

This is all rather cumbersome! Could some kind of regex-replace work?

directory:
- name: '/([0-9]+)/'
  files:
  - '\1-marc.xml'
  - '\1-mets.xml'
  - for: '/(.+)\.(iso|img)/'
    files:
    - '\0'
    - '\0.idx'

Maybe?! Kinda hacky. Similar to [Overview - dirschema](https://materials-data-science-and-informatics.github.io/dirschema/v0.1.0/) e.g [dirschema/schemas/toy_dataset.dirschema.yaml at main · Materials-Data-Science-and-Informatics/dirschema](https://github.com/Materials-Data-Science-and-Informatics/dirschema/blob/main/schemas/toy_dataset.dirschema.yaml)
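For what it's worth, the regex-replace idea is easy to prototype. This is a sketch only, assuming the spec has already been loaded into a dict with the /.../ delimiters stripped. SPEC, expand and missing_files are hypothetical names. Python's Match.expand does not treat \0 as "the whole match", so it is rewritten to \g<0> first.

```python
import re

# Hypothetical in-memory form of the YAML spec above.
SPEC = {
    "name": r"([0-9]+)",
    "files": [
        r"\1-marc.xml",
        r"\1-mets.xml",
        {"for": r"(.+)\.(iso|img)", "files": [r"\0", r"\0.idx"]},
    ],
}


def expand(template, match):
    r"""Expand \1-style backreferences, treating \0 as the whole match."""
    return match.expand(template.replace(r"\0", r"\g<0>"))


def missing_files(name, files, spec=SPEC):
    """Return expected-but-absent file names, or None if the spec does not apply."""
    m = re.fullmatch(spec["name"], name)
    if m is None:
        return None
    expected = []
    for entry in spec["files"]:
        if isinstance(entry, str):
            expected.append(expand(entry, m))  # relative to the directory name
        else:
            for f in files:
                fm = re.fullmatch(entry["for"], f)
                if fm:  # relative to each file selected by "for"
                    expected += [expand(t, fm) for t in entry["files"]]
    present = set(files)
    return [e for e in expected if e not in present]
```

This does not enforce "one or more" for the iso/img files, and it does not flag unexpected extra files, so it covers less than the rule-based versions.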

Schematron's <let> support is cleaner/more explicit.

Note that tree . -s -X also works, making XML:

<?xml version="1.0" encoding="UTF-8"?>
<tree>
  <directory name="ailink" size="4096">
    <file name="ailink.chm" size="22116"></file>
    <file name="ailink.exe" size="360960"></file>
    <file name="ailink.txt" size="3326"></file>
    <file name="cpm.bat" size="292"></file>
    <file name="cpm.doc" size="8914"></file>
    <file name="cpm.exe" size="55456"></file>
    <file name="cpm.img" size="512"></file>
    <file name="cpm.pif" size="967"></file>
    <file name="cpmwin.bat" size="232"></file>
    <file name="instal.bat" size="235"></file>
    <file name="setup.cfg" size="134"></file>
    <file name="setup.exe" size="274944"></file>
  </directory>
</tree>

Noting also multipart format use case: [Multipart Formats - Google Sheets](https://docs.google.com/spreadsheets/d/1bNcBRMJrjUzz1NF0qvl2kyzcS8AX0K0qhJ34eDBBEtE/edit?gid=0#gid=0) e.g. http://fileformats.archiveteam.org/wiki/Shapefile

    <rule context="">
      <assert test="@r and @g and @b"           >One of R G B missing</assert>
      <assert test="any other than @r,@g,@b,@a" >Invalid attribute   </assert>
    </rule>

e.g. matching a well-formed Shapefile directory:

.//directory[file[matches(@name, '.*\.shp$')] and file[matches(@name, '.*\.shx$')] and file[matches(@name, '.*\.dbf$')]]
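The XPath above needs matches(), which is XPath 2.0 and not available in Python's standard-library ElementTree. A minimal sketch of the same check over the tree -X output, done by plain iteration instead, and additionally requiring the parts to share a stem (an assumption that goes slightly beyond the expression above). shapefile_dirs is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePath


def shapefile_dirs(xml_text, required=(".shp", ".shx", ".dbf")):
    """Yield (directory name, stem) for each complete Shapefile set found."""
    root = ET.fromstring(xml_text)
    for directory in root.iter("directory"):
        # Only direct children count: parts must sit in the same folder.
        names = {f.get("name") for f in directory.findall("file")}
        for name in sorted(names):
            p = PurePath(name)
            if p.suffix == ".shp" and all(p.stem + ext in names for ext in required):
                yield directory.get("name"), p.stem
```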

See also:

2024-10-16 Extension sets

Looking at:

It occurs that syntaxes could be:
