Explore ways to describe multi-part formats

While working on format identification workflows, we ran into the problem of multi-part objects: identifying them is difficult, and atomising them into individual files on ingest is lossy and potentially dangerous.



Noting [Identification of Multi-Part Digital Objects (PHAIDRA - o:1424890)](https://phaidra.univie.ac.at/detail/o:1424890)

## Shapefiles

Ref [file extension](https://desktop.arcgis.com/en/arcmap/latest/manage-data/shapefiles/shapefile-file-extensions.htm)

```
+shp+shx+dbf
```

See http://fileformats.archiveteam.org/wiki/Shapefile

## Experimental syntax

Local file convention: gpo.us.gov/fmt/1 gpo:fmt/1 
- {id}/ (R)
	- {id}/{id}-mets.xml (R)
	- {id}/{id}-marc.xml (R)
	- {id}/([0-9]+).(iso|img) (R+)
		- {id}/{\1}.{\2} (R)

`{id}` is shorthand for a named group, `(?P<id>.*)`.

The tree means "if this is present, then these should be present too". It does not mean "descend the hierarchy", though that could perhaps be added as an option.

- (R) means required, which should be the default.
- (O) means optional, and has to be added explicitly.
- (+) means more than one is allowed.

Shapefile: digipres.org/id/fmt/1 dpo:fmt/1
- {id}.shp -> nationalarchives.gov.uk/PRONOM/x-fmt/235 nap:x-fmt/235
- {id}.shx
- {id}.dbf
- {id}.xxx (O)
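
A minimal Python sketch of how the Shapefile rule above might be evaluated, assuming `{id}` expands to the named group `(?P<id>.*)` as described. The names `SHAPEFILE_PARTS`, `template_to_regex` and `find_sets` are hypothetical, and `.prj` merely stands in for the optional `.xxx` part.

```python
import re
from collections import defaultdict

# Hypothetical encoding of the rule: extension -> is it required (R)?
# "prj" is an assumed stand-in for the optional `.xxx` entry.
SHAPEFILE_PARTS = {"shp": True, "shx": True, "dbf": True, "prj": False}


def template_to_regex(template: str) -> re.Pattern:
    """Expand the first `{id}` into a named group; escape the rest literally."""
    head, sep, tail = template.partition("{id}")
    if not sep:
        return re.compile(re.escape(template))
    return re.compile(re.escape(head) + r"(?P<id>.*)" + re.escape(tail))


def find_sets(names, parts=SHAPEFILE_PARTS):
    """Group file names by {id}; map each id to its missing required parts."""
    found = defaultdict(set)
    for name in names:
        for ext in parts:
            m = template_to_regex("{id}." + ext).fullmatch(name)
            if m:
                found[m.group("id")].add(ext)
    required = {ext for ext, is_required in parts.items() if is_required}
    return {stem: sorted(required - exts) for stem, exts in found.items()}


names = ["roads.shp", "roads.shx", "roads.dbf", "rivers.shp", "notes.txt"]
print(find_sets(names))
```

An empty list means the set is complete; files that match no part (here `notes.txt`) are simply ignored by this sketch.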

## 2026-04-17 notes on Checking Folder Structures

Start with a structure compatible with the `tree` command, in JSON?

```
$ tree -s -J --noreport
```

e.g. 

```
$  tree -s -J --noreport ailink
[
  {"type":"directory","name":"ailink","size":4096,"contents":[
    {"type":"file","name":"ailink.chm","size":22116},
    {"type":"file","name":"ailink.exe","size":360960},
    {"type":"file","name":"ailink.txt","size":3326},
    {"type":"file","name":"cpm.bat","size":292},
    {"type":"file","name":"cpm.doc","size":8914},
    {"type":"file","name":"cpm.exe","size":55456},
    {"type":"file","name":"cpm.img","size":512},
    {"type":"file","name":"cpm.pif","size":967},
    {"type":"file","name":"cpmwin.bat","size":232},
    {"type":"file","name":"instal.bat","size":235},
    {"type":"file","name":"setup.cfg","size":134},
    {"type":"file","name":"setup.exe","size":274944}
  ]}

]
```

Set up Python bindings for this.
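
As a starting point, such bindings could be as thin as the sketch below. It assumes the `tree` binary is on the `PATH`; `run_tree` and `walk` are hypothetical helper names, and the sample is a cut-down copy of the output above so that the walker can be tried without calling `tree`.

```python
import json
import subprocess


def run_tree(path: str) -> list:
    """Run `tree -s -J --noreport` on `path` and return the parsed JSON."""
    result = subprocess.run(
        ["tree", "-s", "-J", "--noreport", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def walk(nodes, parent=""):
    """Yield (relative_path, node) for every entry, depth first."""
    for node in nodes:
        path = f"{parent}/{node['name']}" if parent else node["name"]
        yield path, node
        yield from walk(node.get("contents", []), path)


# Cut-down sample of the output shown above.
SAMPLE = """[
  {"type":"directory","name":"ailink","size":4096,"contents":[
    {"type":"file","name":"ailink.exe","size":360960},
    {"type":"file","name":"setup.cfg","size":134}
  ]}
]"""

for path, node in walk(json.loads(SAMPLE)):
    print(path, node["type"], node["size"])
```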

Then set up a relatively simple [Schematron](https://schematron.com/)-inspired language for making assertions about file system layouts, using JSONPath instead of XPath. For example, in the GPO data set, every item should either be a directory with a known layout, or a known/expected manifest file.

```
{
	"pattern": {
		"rules": [
			{
				"context": "$[*].name == 'part_2.md5'",
				"assert": [
					".size > 0"
				]
			},
			{
				"context": ".type == 'directory'",
				"assert": [
					".name =~ /^\\d+$/",
					".contents[?(@.name == .name.append('-mets.xml'))].size > 0"
				]
			}
		]
	}
}
```

i.e. `context` runs a `JSONMatch`, and the assertions are then run against each match. YAML might actually be easier for writing these matchers, since it allows comments etc.

- [Python JSONPath](https://jg-rp.github.io/python-jsonpath/)
- https://github.com/json-path/JsonPath?tab=readme-ov-file#path-examples
- [RFC 9535: JSONPath: Query Expressions for JSON](https://www.rfc-editor.org/rfc/rfc9535.html#section-2.4.9)

Asserts can also be a test plus a message.

i.e. if it's a directory, check that:
- its name is numerical,
- it has a file called {foldername}-mets.xml with size > 0,
- etc.
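
The test-plus-message idea can be sketched in plain Python. This is only an illustration under assumptions: ordinary predicates stand in for the JSONPath expressions, the input is a list of nodes in the shape `tree -J` produces, and `DIRECTORY_ASSERTS` and `check` are hypothetical names.

```python
import re

# Each assertion is a (test, message) pair, in the spirit of Schematron.
DIRECTORY_ASSERTS = [
    (lambda d: re.fullmatch(r"\d+", d["name"]) is not None,
     "name is not numerical"),
    (lambda d: any(c["name"] == d["name"] + "-mets.xml" and c.get("size", 0) > 0
                   for c in d.get("contents", [])),
     "no non-empty <foldername>-mets.xml"),
]


def check(nodes):
    """Apply the directory assertions to each directory in `nodes`."""
    failures = []
    for node in nodes:
        if node["type"] != "directory":  # plays the role of the rule context
            continue
        for test, message in DIRECTORY_ASSERTS:
            if not test(node):
                failures.append(f"{node['name']}: {message}")
    return failures


nodes = [
    {"type": "directory", "name": "123", "contents": [
        {"type": "file", "name": "123-mets.xml", "size": 10}]},
    {"type": "directory", "name": "abc", "contents": []},
]
for failure in check(nodes):
    print(failure)
```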

There is definitely something here about making the rules clearer, but some things are still not clear. How do we catch files that are not matched?

So, make it so `context` should match one or more files/directories only?

Make `patterns` at the entity level? So `part_2.md5` and `.type == 'directory'` with a numeric name are pattern contexts, and then everything in `contents` has to match a rule?

These are just all very naturally nested rules! It needs to be file focussed anyway.

```
# No zero-length files anywhere
for $..[?(@.type == 'file')]:
	assert _.size > 0
for $..[?(@.type == 'file') && (@.size == 0)]:
	report "Zero-size file detected!"
# For top-level directories with names made of digits only:
in $[?(@.type == 'directory') && (@.name =~ /^\d+$/)]:
	# There should be a matching <PARENT_FOLDER_NAME>-mets.xml file:
	match $.contents[?(@.name == $.name.append('-mets.xml'))]:
		assert ($.contentType == 'application/xml')
	# There should be one or more iso/img files:
	for $.contents[?(@.name =~ /^\d+\.(iso|img)$/)]:
		# Explicitly match each file:
		match _
		# For each of those (_), there should be an .idx file
		match $.contents[?(@.name == _.name.append('.idx'))]
	# `for match` syntax that automatically records each match:
	for match $.contents[?(@.name =~ /^\d+\.(iso|img)$/)]:
		# For each of those (_), there should be an .idx file
		match $.contents[?(@.name == _.name.append('.idx'))]
```

```
patterns:
- context: $.. # i.e. anywhere rather than the default $.
  for: "@.type == 'file' && @.size == 0"
  report: "Zero-size file detected!"
- for: "@.type == 'directory' && match(@.name, '[0-9]+')"
  contains:
  - "@.name == $.name.append('-mets.xml') && @.type == 'file'"
  - "@.name == $.name.append('-marc.xml') && @.type == 'file'"
  with:
  - match: '@.name =~ /^\d+\.(iso|img)$/'
    contains:
    - "@.name == _.name.append('.idx')"
```

While processing, the system should track which files have been matched by any rule. Any files not matched by any rule should cause some noise!

Note that using file names directly as element names would go wrong, since names can contain dots and other special characters.

Idea: annotate the tree with all visited elements, so we can tell which files were not matched by any rule.
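
A rough sketch of that annotation idea, assuming rules are plain Python predicates over `tree -J` style nodes. The `matched_by` key and the `annotate` and `unmatched` helpers are hypothetical, not part of any existing tool.

```python
def annotate(nodes, rules):
    """Record on every node the names of the rules that matched it."""
    for node in nodes:
        node["matched_by"] = [name for name, test in rules.items() if test(node)]
        annotate(node.get("contents", []), rules)


def unmatched(nodes, parent=""):
    """Yield the paths of files that no rule matched (after annotate())."""
    for node in nodes:
        path = f"{parent}/{node['name']}" if parent else node["name"]
        if node["type"] == "file" and not node["matched_by"]:
            yield path
        yield from unmatched(node.get("contents", []), path)


tree = [{"type": "directory", "name": "ailink", "contents": [
    {"type": "file", "name": "ailink.exe", "size": 360960},
    {"type": "file", "name": "setup.cfg", "size": 134},
]}]
rules = {"executables": lambda n: n["name"].endswith(".exe")}

annotate(tree, rules)
for path in unmatched(tree):
    print("Not matched by any rule:", path)
```

Keeping the annotations on the tree itself means a later report can also show which rule claimed each file, not just which files were missed.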

The `in` keyword makes a new root. `for` adds a context: it takes the array of matches and passes them to a series of match calls as filter context; see [Advanced Usage - Python JSONPath](https://jg-rp.github.io/python-jsonpath/advanced/#filter-variables).

I think assertions should use the same logic as matches?

```
$[?(@.type == 'directory')].contents[?match(@.name, "[a-z]+\\.exe")]
$[?(@.type == 'directory')].contents[?match(@.name, "[a-z]+\\.(exe|txt)")]
```

[JSONPath Online Evaluator](https://jsonpath.com/)
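
One detail worth keeping in mind with these expressions: RFC 9535 defines `match()` as matching the whole string, with `search()` as the substring variant. The sketch below emulates the two queries above with the standard library only, using `re.fullmatch` as an approximation of `match()` (Python's `re` is not identical to the I-Regexp dialect the RFC refers to). `select` is a hypothetical helper, not a JSONPath implementation.

```python
import re


def select(tree, pattern):
    """Emulate $[?(@.type == 'directory')].contents[?match(@.name, pattern)]."""
    return [
        child["name"]
        for node in tree
        if node.get("type") == "directory"
        for child in node.get("contents", [])
        if re.fullmatch(pattern, child["name"])  # whole-string match
    ]


tree = [{"type": "directory", "name": "ailink", "contents": [
    {"type": "file", "name": "ailink.chm", "size": 22116},
    {"type": "file", "name": "ailink.exe", "size": 360960},
    {"type": "file", "name": "ailink.txt", "size": 3326},
    {"type": "file", "name": "cpm.exe", "size": 55456},
]}]

print(select(tree, r"[a-z]+\.exe"))
print(select(tree, r"[a-z]+\.(exe|txt)"))
```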

This is all rather cumbersome! Could some kind of regex-replace work?

```
directory:
- name: '/([0-9]+)/'
  files:
  - '\1-marc.xml'
  - '\1-mets.xml'
  - for: '/(.+)\.(iso|img)/'
    files:
    - '\0'
    - '\0.idx'
```

Maybe?! Kinda hacky. Similar to [Overview - dirschema](https://materials-data-science-and-informatics.github.io/dirschema/v0.1.0/), e.g. [dirschema/schemas/toy_dataset.dirschema.yaml](https://github.com/Materials-Data-Science-and-Informatics/dirschema/blob/main/schemas/toy_dataset.dirschema.yaml).
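
The regex-replace idea is easy to prototype with Python's `re`, as in the sketch below. It hard-codes the patterns from the example rather than reading them from YAML, and `missing_files` is a hypothetical name. Note that Python spells the whole match as `\g<0>` rather than `\0`, so a real implementation would need to translate that notation.

```python
import re


def missing_files(directory):
    """Return expected-but-absent file names, or None if the rule does not apply."""
    dir_match = re.fullmatch(r"([0-9]+)", directory["name"])
    if dir_match is None:
        return None
    names = {child["name"] for child in directory.get("contents", [])}
    # Expand the backreference templates against the directory name.
    expected = [dir_match.expand(t) for t in (r"\1-marc.xml", r"\1-mets.xml")]
    # Every iso/img file should come with a matching .idx file.
    for name in sorted(names):
        file_match = re.fullmatch(r"(.+)\.(iso|img)", name)
        if file_match:
            expected.append(file_match.expand(r"\g<0>.idx"))
    return [name for name in expected if name not in names]


directory = {"type": "directory", "name": "123", "contents": [
    {"type": "file", "name": "123-mets.xml"},
    {"type": "file", "name": "1.iso"},
    {"type": "file", "name": "1.iso.idx"},
    {"type": "file", "name": "2.img"},
]}
print(missing_files(directory))
```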

The Schematron `let` support is cleaner/more explicit.

Note that `tree . -s -X` also works, making XML:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<tree>
  <directory name="ailink" size="4096">
    <file name="ailink.chm" size="22116"></file>
    <file name="ailink.exe" size="360960"></file>
    <file name="ailink.txt" size="3326"></file>
    <file name="cpm.bat" size="292"></file>
    <file name="cpm.doc" size="8914"></file>
    <file name="cpm.exe" size="55456"></file>
    <file name="cpm.img" size="512"></file>
    <file name="cpm.pif" size="967"></file>
    <file name="cpmwin.bat" size="232"></file>
    <file name="instal.bat" size="235"></file>
    <file name="setup.cfg" size="134"></file>
    <file name="setup.exe" size="274944"></file>
  </directory>
</tree>
```

Noting also the multipart format use case: [Multipart Formats - Google Sheets](https://docs.google.com/spreadsheets/d/1bNcBRMJrjUzz1NF0qvl2kyzcS8AX0K0qhJ34eDBBEtE/edit?gid=0#gid=0), e.g. http://fileformats.archiveteam.org/wiki/Shapefile

```xml
    <rule context="">
      <assert test="@r and @g and @b"           >One of R G B missing</assert>
      <assert test="any other than @r,@g,@b,@a" >Invalid attribute   </assert>
    </rule>
```

e.g. matching a well-formed Shapefile directory:

```xpath
.//directory[file[matches(@name, '.*\.shp$')] and file[matches(@name, '.*\.shx$')] and file[matches(@name, '.*\.dbf$')]]
```

- [XPath online real-time tester, evaluator and generator for XML & HTML](https://xpather.com/)
- [pydantic-xml](https://pydantic-xml.readthedocs.io/en/latest/)
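
Note that `matches()` is an XPath 2.0 function, and Python's standard `xml.etree.ElementTree` only supports a small XPath subset without it. A minimal sketch of the same check against the `tree -X` output, doing the name test in Python instead, is below. It looks for the `.shp`, `.shx` and `.dbf` parts listed in the Shapefiles section and, like the XPath, does not check that the parts share a base name. `shapefile_directories` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

REQUIRED = (".shp", ".shx", ".dbf")


def shapefile_directories(xml_text):
    """Return names of directories holding at least one file per required part."""
    root = ET.fromstring(xml_text)
    hits = []
    for directory in root.iter("directory"):
        names = [f.get("name", "") for f in directory.findall("file")]
        if all(any(n.endswith(ext) for n in names) for ext in REQUIRED):
            hits.append(directory.get("name"))
    return hits


SAMPLE_XML = """<tree>
  <directory name="maps" size="4096">
    <file name="roads.shp" size="100"></file>
    <file name="roads.shx" size="100"></file>
    <file name="roads.dbf" size="100"></file>
  </directory>
  <directory name="ailink" size="4096">
    <file name="ailink.exe" size="360960"></file>
  </directory>
</tree>"""

print(shapefile_directories(SAMPLE_XML))
```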

See also:

- [nationalarchives/pronom-signatures](https://github.com/nationalarchives/pronom-signatures)
- [pronom-signatures/format_schema.json at develop · nationalarchives/pronom-signatures](https://github.com/nationalarchives/pronom-signatures/blob/develop/format_schema.json)
- [datamodel-code-generator | Pydantic Docs](https://pydantic.dev/docs/validation/latest/integrations/dev-tools/datamodel_code_generator/)
- [formatscaper · PyPI](https://pypi.org/project/formatscaper/) 2024 tool from a 'Max Moser'?
- Ross's Pronom Tools in Python
	- [src.pronom_tools.pronom_tools API documentation](https://ffdev-info.github.io/pronom-release-tools/pronom_tools/pronom_tools.html)
	- [ffdev-info/pronom-release-tools: Tools, and API for working with PRONOM releases](https://github.com/ffdev-info/pronom-release-tools/)
	- [pronom-tools · PyPI](https://pypi.org/project/pronom-tools/)
- [droid/Signature syntax.md at main · digital-preservation/droid](https://github.com/digital-preservation/droid/blob/main/Signature%20syntax.md)
- [pronom-file-signature-research.pdf](https://cdn.nationalarchives.gov.uk/documents/information-management/pronom-file-signature-research.pdf)
- [A closer look at Pronom signatures - Open Preservation Foundation](https://openpreservation.org/blogs/closer-look-pronom-signatures/) from Adam F (2010)
- Ross's Signature Development Utility
	- [PRONOM Research Week: Signature Development Utility 2.0 - ffdev.info - Open Preservation Foundation](https://openpreservation.org/blogs/pronom-research-week-signature-development-utility-2-0-ffdev-info/)
	- [PRONOM Signature Development Utility - COPTR](https://coptr.digipres.org/index.php/PRONOM_Signature_Development_Utility)

## 2024-10-16 Extension sets

Looking at:
- https://developer.mozilla.org/en-US/docs/Web/Media/Formats/codecs_parameter#basic_syntax
- https://en.wikipedia.org/wiki/Filename#Comparison_of_filename_limitations

It occurs that syntaxes could be:

- Like `codecs`: e.g. `application/octet-stream; extensions="xyz, chm"`
- Or e.g. `extensions="tif/tiff"` i.e. using slash to mean _OR_ (or possibly `"tif|tiff"`)
- Or e.g. `extensions="shp:sot"` i.e. using `:` to mean _AND_, for compounds.
	- But `application/zip; extensions="shp:sot"` doesn't work as there's only one binary
	- Perhaps `application/zip; extensions="/shp+sot"` i.e. using `+` for _AND_ and `/` to mean 'inside the thing'.
- Possibly globs... 
	- e.g. `application/octet-stream; globs="*-gz"`
	- e.g. `inode/directory; globs="/*.shp+/*.sot"` ?
	- Or perhaps better `application/x-directory` which is what `file` says, and `inode` is less standard and more device-specific.
- [Signature development utility 2.0](https://ffdev.info/)
- [Using siegfried tooling for signature development for #PRONOM2019 - Open Preservation Foundation](https://openpreservation.org/blogs/using-siegfried-tooling-for-signature-development-for-pronom2019/) - slightly out of date
- [DROID Container Signature Files: What they are and how to create them: A template and an example, or few… - Open Preservation Foundation](https://openpreservation.org/blogs/droid-container-signature-files-what-they-are-and-how-to-create-them-a-template-and-an-example-or-few/)
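
To get a feel for the candidate `extensions` syntaxes above, here is a small parsing sketch. It picks just one combination as an assumption, `|` for _OR_ and `+` for _AND_, and ignores the leading `/` ("inside the thing") and glob variants. `parse_media_type` and `matches` are hypothetical names.

```python
import re


def parse_media_type(text):
    """Split e.g. 'type/subtype; extensions="a+b|c"' into (type, alternatives).

    Each alternative is a frozenset of extensions that must all be present.
    """
    base, _, rest = text.partition(";")
    found = re.search(r'extensions="([^"]*)"', rest)
    alternatives = []
    if found:
        for alt in found.group(1).split("|"):       # `|` separates OR branches
            parts = [e.strip() for e in alt.split("+")]  # `+` joins AND parts
            alternatives.append(frozenset(p for p in parts if p))
    return base.strip(), alternatives


def matches(names, alternatives):
    """True if the file names cover every extension of at least one alternative."""
    present = {n.rsplit(".", 1)[-1].lower() for n in names if "." in n}
    return any(alt <= present for alt in alternatives)


media_type, alts = parse_media_type('application/x-directory; extensions="shp+shx+dbf"')
print(media_type, matches(["a.shp", "a.shx", "a.dbf"], alts), matches(["a.shp"], alts))
```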
