initial plugin metadata extraction#159

Draft
JFRudzinski wants to merge 2 commits into develop from metadata-extractor

Collaborator

@JFRudzinski JFRudzinski commented Mar 19, 2026

Summary

This PR adds the first metadata extraction for nomad-simulation-parsers using the nomad-plugin-metadata pipeline.

What to review

Please review nomad_plugin_metadata.yaml.
This is the canonical, merged file intended for querying/registry usage.

How files work

  • .metadata/nomad_plugin_metadata.auto.yaml
    Machine-generated metadata from package/plugin introspection.
  • .metadata/nomad_plugin_metadata.manual.yaml
    Maintainer-owned manual curation/overrides (not machine-overwritten).
  • nomad_plugin_metadata.yaml
    Final merged output (auto + manual; manual non-empty values take precedence).
  • .metadata/plugin-metadata.override-report.yaml
    Report of conflicts where manual overrides auto.
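
The merge rule described above (manual non-empty values win, and each override is reported) could be sketched roughly as follows. This is only an illustration under assumptions: `is_empty`, `merge_metadata`, and the example values are hypothetical and not taken from the actual pipeline, which may define "empty" or handle nested values differently.

```python
from typing import Any


def is_empty(value: Any) -> bool:
    """Assumption: None, empty strings, and empty containers count as 'not provided'."""
    return value is None or value == "" or value == [] or value == {}


def merge_metadata(auto: dict, manual: dict, path: str = "") -> tuple[dict, list[dict]]:
    """Merge manual curation over auto-generated metadata.

    Returns the merged mapping plus a list of conflicts, i.e. fields where a
    non-empty manual value replaced a different non-empty auto value.
    """
    merged = dict(auto)
    conflicts: list[dict] = []
    for key, manual_value in manual.items():
        field = f"{path}.{key}" if path else key
        auto_value = auto.get(key)
        if is_empty(manual_value):
            continue  # empty manual values never override auto values
        if isinstance(manual_value, dict) and isinstance(auto_value, dict):
            # Recurse so nested sections are merged field by field.
            merged[key], nested = merge_metadata(auto_value, manual_value, field)
            conflicts.extend(nested)
            continue
        if not is_empty(auto_value) and auto_value != manual_value:
            conflicts.append({"field": field, "auto": auto_value, "manual": manual_value})
        merged[key] = manual_value
    return merged, conflicts


# Made-up example values, for illustration only.
auto = {"name": "example-plugin", "license": "Apache-2.0", "maturity": ""}
manual = {"name": "", "license": "MIT", "maturity": "beta"}
merged, conflicts = merge_metadata(auto, manual)
```

In this sketch only `license` would land in the override report, because `maturity` had no auto value to conflict with and the empty manual `name` is ignored.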

Goal of this PR

Initial extraction pass for developer feedback before broader rollout:

  • verify extracted parser/file-format metadata
  • identify missing/incorrect fields
  • align expected manual curation scope

References

@JFRudzinski JFRudzinski requested a review from ndaelman-hu March 19, 2026 14:27
@JFRudzinski
Collaborator Author

@ndaelman-hu If you want more context, see https://github.com/FAIRmat-NFDI/nomad-plugins-metadata. It is inspired by datatractor and has export compatibility with it. For now, I am mostly concerned with the schema itself and the accuracy/usefulness of the extracted metadata.

@JFRudzinski JFRudzinski marked this pull request as draft March 19, 2026 14:36
@ndaelman-hu
Collaborator

I see. So this is a design for our own in-house plugin metadata, with future compatibility with datatractor in mind?
Even if the latter isn't the current focus, I would start from datatractor, as it gives a template for such schemas. I'll then evaluate all your additions on top.

@JFRudzinski
Collaborator Author

Yes, exactly. It was inspired by and built directly on datatractor, with tooling to generate datatractor-compliant metadata files, but with extensions for NOMAD-specific usage.

Collaborator

@ndaelman-hu ndaelman-hu left a comment

Schema within NOMAD Plugins

I'm trying to understand the primary objective of this schema. I noticed that it extracts the following into a single YAML file:

  • endpoint metadata (names, etc.)
  • project metadata (authors, etc.)
  • GitHub telemetry

So, I surmise at this point that the schema aims to provide an intermediate (and centralized) artifact to generate a documentation page?

Schema vs DataTractor

Given that it seems partially inspired by DataTractor, I would like to contrast their goals.
DataTractor aims to provide an automated setup:

  • provides project and author metadata, which may incentivize community building.
  • captures all the information needed to install a file-format-specific parser.
  • includes the instructions on how to call the parser once installed.

This yields a single interface that can deploy a whole parsing library environment, while avoiding excessive installation.

Final Objective

This objective is less relevant in our NOMAD universe, where plugins are already installed and integrated via entry points. If we do want to register NOMAD parsers in DataTractor in the future, though, we will also have to provide an installation and calling template.

Some additional questions to help focus the objective:

  • What are the intended consumers of this metadata (documentation generator, plugin registry, CI/CD)?
  • How will this be kept in sync with code changes (automated CI/CD or manual regeneration)?

Collaborator

Comparison of metadata fields between this schema and DataTractor:

| NOMAD Field | Datatractor Field | Conversion | Quality |
| --- | --- | --- | --- |
| id | id | Direct | ✅ Perfect |
| name | name | Direct | ✅ Perfect |
| description | description | Direct | ✅ Perfect |
| subject | subject | Direct | ✅ Perfect |
| upstream_repository | source_repository | Direct | ✅ Perfect |
| documentation | documentation | Direct | ✅ Perfect |
| license | license.spdx | String → object | ✅ Good |
| supported_filetypes | supported_filetypes[].id | Direct | ✅ Perfect |
| file_format_support | FileType entries | Nested → standalone | ⚠️ Manual |
| schema_dependencies | installation | Declarative → executable | ❌ Lossy |
| parser_details | usage | Regex → templates | ❌ Cannot automate |
| entry_points | N/A | No mapping | ❌ NOMAD-specific |
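
To make the automatable part of this comparison concrete, here is a rough sketch of an export step that handles only the rows marked as direct or good and flags the rest for manual work. The target layout is inferred solely from the field names in the comparison, not checked against the Datatractor schema, and `to_datatractor` and the example values are hypothetical.

```python
# Rows marked "Direct": NOMAD field name -> Datatractor field name.
DIRECT_FIELDS = {
    "id": "id",
    "name": "name",
    "description": "description",
    "subject": "subject",
    "upstream_repository": "source_repository",
    "documentation": "documentation",
}

# Rows marked manual, lossy, not automatable, or NOMAD-specific.
NON_AUTOMATABLE = ("file_format_support", "schema_dependencies", "parser_details", "entry_points")


def to_datatractor(nomad: dict) -> tuple[dict, list[str]]:
    """Convert the automatable fields and list the ones left for manual handling."""
    out = {dst: nomad[src] for src, dst in DIRECT_FIELDS.items() if src in nomad}
    if "license" in nomad:
        # "String → object": assumed here to mean wrapping the SPDX string.
        out["license"] = {"spdx": nomad["license"]}
    if "supported_filetypes" in nomad:
        # Assumed shape: a list of objects that each carry a FileType id.
        out["supported_filetypes"] = [{"id": ft} for ft in nomad["supported_filetypes"]]
    needs_manual = [field for field in NON_AUTOMATABLE if field in nomad]
    return out, needs_manual


# Made-up record, for illustration only.
record = {
    "id": "example-plugin",
    "upstream_repository": "https://example.org/repo",
    "license": "MIT",
    "supported_filetypes": ["example-filetype"],
    "entry_points": ["example_entry_point"],
}
exported, needs_manual = to_datatractor(record)
```

Returning the skipped fields separately keeps the lossy rows visible instead of silently dropping them.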

Collaborator

On coverage of current schema:

Taken from entry point metadata

| Field | In `__init__.py` | In Metadata Schema |
| --- | --- | --- |
| Parser name | ✅ name='parsers/vasp' | ✅ parser_name |
| Filename pattern | ✅ mainfile_name_re | ✅ mainfile_name_re |
| Content pattern | ✅ mainfile_contents_re | ✅ mainfile_contents_re |
| MIME pattern | ✅ mainfile_mime_re | ✅ mainfile_mime_re |
| Binary header | ✅ mainfile_binary_header | ✅ mainfile_binary_header |
| Aliases | ✅ aliases=['parsers/vasp'] | ✅ parser_aliases |
| Compression | ✅ supported_compressions | ✅ compression_support |
| Level | ✅ level=0 | ✅ parser_level |

Gap: code_name and code_homepage exist in some __init__.py files but aren't extracted into the metadata schema.
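
The attribute-to-field renaming listed above, and one way the gap could be surfaced automatically, might look like the sketch below. It assumes the entry point exposes its settings as plain attributes; `extract_parser_details` is hypothetical, and `SimpleNamespace` is only a stand-in for a real NOMAD entry point object.

```python
from types import SimpleNamespace

# Entry point attribute -> metadata schema field, as listed in the coverage table.
FIELD_MAP = {
    "name": "parser_name",
    "mainfile_name_re": "mainfile_name_re",
    "mainfile_contents_re": "mainfile_contents_re",
    "mainfile_mime_re": "mainfile_mime_re",
    "mainfile_binary_header": "mainfile_binary_header",
    "aliases": "parser_aliases",
    "supported_compressions": "compression_support",
    "level": "parser_level",
}


def extract_parser_details(entry_point) -> tuple[dict, list[str]]:
    """Copy mapped attributes into schema fields and report attributes with no mapping."""
    attrs = vars(entry_point)
    details = {dst: attrs[src] for src, dst in FIELD_MAP.items() if src in attrs}
    unmapped = sorted(key for key in attrs if key not in FIELD_MAP)
    return details, unmapped


# Stand-in entry point; the code_name value is a placeholder, not real data.
example_entry_point = SimpleNamespace(
    name="parsers/vasp",
    aliases=["parsers/vasp"],
    level=0,
    code_name="<code name>",
)
details, unmapped = extract_parser_details(example_entry_point)
```

Here `unmapped` would contain `code_name`, so attributes like the ones named in the gap show up as a list to review rather than being lost.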

Taken from Git project metadata

| Category | Examples |
| --- | --- |
| Package info | plugin_version, license, upstream_repository |
| People | authors[], maintainers[] with emails/affiliations |
| GitHub telemetry | stars, created, last_updated, archived |
| Deployment | on_central, on_pypi, pypi_package |
| Discovery | suggested_usages, subject tags, maturity |
| Datatractor | supported_filetypes (FileType IDs) |
| Provenance | Where/when/how metadata was generated |
