Faster definitive data DOIs #11
SimonFlower started this conversation in Ideas
Options for faster DOI publication
Summary
The current format of archive files used to access Intermagnet DOI data (magYYYY_defYYYY.zip) will not allow the frequency of updates that we want. A new way of grouping data is needed. A proposal is made for keeping data in single observatory-year archive files. This would significantly simplify the publication process, but would require software to be created for users to access the data.
Requirements for Intermagnet definitive data publication
Current arrangements
The current data formats mean that one year of Intermagnet definitive data occupies a large number of files (e.g. over 2,000 in 2019). The current data structure for definitive data publication is described here: https://tech-man.intermagnet.org/stable/appendices/archivedataformats.html#intermagnet-physical-media-directory-structure. Currently, data for DOI publication is stored in annual zip archives. This leads to a manageable number of data files (under 100 for all DOI publications across all years).
The imcdview software currently requires data in the structure described in the Intermagnet manual, but only the folders (and their observatory data sub-folders) are necessary - other folders (such as , <obsy_inf> and <ctry_inf>) are not required.
For each year, data is requested from data providers until no further data can be expected before the entire annual data set is published.
Our current system can be divided into publication of two different types of data:
Discussion
In order to make management of the data possible for both administrators and users, data needs to be grouped into a small number of archive files. Publishing the data without grouping it into archive files would simplify adding new data to a DOI, but is unrealistic for maintenance of the DOI data archive and would make it complex for users to access the data. It is not acceptable to add data to a ZIP archive file - this constitutes a change in terms of the DOI.
Data is currently grouped into archive files by year. It could alternatively be grouped by observatory. However, there is no obvious advantage to grouping by observatory: doing so would not make it easier to add data to a DOI in a form that gives users easy access to the data that they want.
How frequently do we anticipate adding data to the DOIs? Answering this question will help define an appropriate way to structure the data. For previously unpublished data, we are agreed that we should stop waiting for all data from a year to be available before publishing. However, we cannot expect data to be published as soon as it is checked, since this could lead to weekly or more frequent updates. I suggest that publishing either monthly or four times a year would be realistic. Either of these frequencies would make data available to users much more rapidly than our current publication system.
Corrections to previously published data are relatively infrequent, but once it is known that corrected data is available it is important to publish the corrections as quickly as possible. Corrected data is more complex to publish under a DOI as the previous (incorrect) data must be maintained alongside the new (correct) data.
Adding new definitive data as soon as it is made available by data checkers increases the complexity of publication. At present we collect data from all observatories for a given year into a single archive, waiting until the data set is as complete as possible before publication. Data is published in archive files with names of the form magYYYY_defYYYY.zip. We cannot continue to use this form of archive file if we plan to publish data more frequently than once a year. At the least would need to become . Moreover, for each , we will generate (and need to keep) many publications. For example, if the entire definitive data set for the year 2025 takes 4 years to collect and is published four times a year, there would be a 16-fold increase in the number of archive files we need to store.
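The 16-fold figure above follows from simple multiplication. As a rough sketch (the function name is hypothetical, and it assumes one complete archive is kept for every publication of a data year), the same arithmetic for other publication frequencies looks like this:

```python
def archives_per_data_year(years_to_collect: int, publications_per_year: int) -> int:
    """Number of archive files kept for one data year, assuming every
    publication produces a new archive that must be retained."""
    return years_to_collect * publications_per_year


# The 4-year collection period is the example figure used in the text.
for label, per_year in [("annual", 1), ("quarterly", 4), ("monthly", 12)]:
    print(label, archives_per_data_year(4, per_year))
# annual 4
# quarterly 16
# monthly 48
```

Monthly publication under the same assumptions would therefore mean 48 archives per data year rather than 16.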
This approach may be possible for minute data, where the size of data files is relatively small, but will it be acceptable for 1-second data?
Alternative proposal
Hold data in archives that contain one year of data from just one observatory. To allow the data repository to be searched, the archive file names must include:
EG: esk_2020_2023-03.zip
Data can be easily added to this archive as it becomes available. There will never be the need to "re-publish" anything; data will only ever be added. Corrections to data are applied by creating a new archive file with an updated year and month of publication. This will also reduce the size of data files when correcting data. In the present system, correction of one observatory's data means that the entire year with all observatories is re-published, whereas this system will only add one "observatory-year" of data, a reduction of around 99%.
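To illustrate the naming scheme, here is a minimal Python sketch that parses and rebuilds names like the example esk_2020_2023-03.zip. The field meanings (observatory code, data year, then year and month of publication) are inferred from the example and from the description of corrections; the three-character lowercase code pattern, the class name and the function name are assumptions, not part of the proposal.

```python
import re
from dataclasses import dataclass

# Assumed pattern: <obs>_<data year>_<publication year>-<publication month>.zip
NAME_RE = re.compile(
    r"^(?P<obs>[a-z0-9]{3})_(?P<year>\d{4})_(?P<pub_year>\d{4})-(?P<pub_month>\d{2})\.zip$"
)


@dataclass(frozen=True)
class ArchiveName:
    observatory: str   # observatory code, e.g. "esk"
    data_year: int     # year the data covers
    pub_year: int      # year of publication
    pub_month: int     # month of publication (1-12)

    def filename(self) -> str:
        return f"{self.observatory}_{self.data_year:04d}_{self.pub_year:04d}-{self.pub_month:02d}.zip"


def parse_archive_name(name: str) -> ArchiveName:
    """Split an observatory-year archive name into its fields."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"not an observatory-year archive name: {name!r}")
    month = int(m.group("pub_month"))
    if not 1 <= month <= 12:
        raise ValueError(f"invalid publication month in {name!r}")
    return ArchiveName(m.group("obs"), int(m.group("year")), int(m.group("pub_year")), month)


original = parse_archive_name("esk_2020_2023-03.zip")
# A correction is a new file with a later publication date; the original stays in place.
# The 2024-01 date is a made-up example.
corrected = ArchiveName(original.observatory, original.data_year, 2024, 1)
print(corrected.filename())  # esk_2020_2024-01.zip
```

Because the publication date is part of the name, the original and corrected files can sit side by side in the repository without any renaming.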
If we adopt this approach, a single DOI can be used for all ongoing Intermagnet definitive data publication.
The current use of symbolic links to show users the most recently published data is no longer required (further simplifying publication).
The number of archive files will make it too complex for users to download data directly. Instead an application (e.g. a web form) will be needed that allows users to enter:
In response the application will provide the user with the list of observatory-year archive files that were available at the publication date that the user chose. This system allows easy and exact repeatability of searches, meaning that authors can cite the Intermagnet DOI along with a publication date in their papers, and colleagues are able to exactly replicate the same data set. Similar search functionality can be built into the imcdview application (and MagPy?) to allow software to directly download a precise data set using the DOI.
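The lookup described above could work roughly as follows. This is a sketch only: the function names, the optional observatory and year filters, and the "YYYY-MM" form of the publication date are assumptions, and the file names in the usage example are made up apart from the one given earlier. For each observatory-year it keeps the most recent archive published on or before the chosen date.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple


def split_name(name: str) -> Tuple[str, int, str]:
    """Split '<obs>_<year>_<YYYY-MM>.zip' into (obs, data year, publication date)."""
    if not name.endswith(".zip"):
        raise ValueError(f"unexpected archive name: {name!r}")
    obs, year, published = name[: -len(".zip")].split("_")
    return obs, int(year), published


def archives_as_of(
    names: Iterable[str],
    as_of: str,                              # publication date as "YYYY-MM"
    observatories: Optional[Set[str]] = None,
    years: Optional[Set[int]] = None,
) -> List[str]:
    """Return the archive list a user would have seen at the chosen publication date."""
    latest: Dict[Tuple[str, int], Tuple[str, str]] = {}
    for name in names:
        obs, year, published = split_name(name)
        # Zero-padded "YYYY-MM" strings sort chronologically, so string comparison is enough.
        if published > as_of:
            continue
        if observatories is not None and obs not in observatories:
            continue
        if years is not None and year not in years:
            continue
        key = (obs, year)
        if key not in latest or published > latest[key][0]:
            latest[key] = (published, name)
    return sorted(name for _, name in latest.values())


repository = [
    "esk_2020_2021-06.zip",   # first publication (hypothetical)
    "esk_2020_2023-03.zip",   # later correction
    "ler_2020_2021-09.zip",   # hypothetical
    "esk_2021_2022-06.zip",   # hypothetical
]
print(archives_as_of(repository, "2022-12"))
# ['esk_2020_2021-06.zip', 'esk_2021_2022-06.zip', 'ler_2020_2021-09.zip']
print(archives_as_of(repository, "2023-03"))
# ['esk_2020_2023-03.zip', 'esk_2021_2022-06.zip', 'ler_2020_2021-09.zip']
```

Since archives are only ever added, the same query with the same publication date returns the same list however much data is published later, which is what makes a cited data set repeatable.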