- Version: 2.2.0 (not released to Zenodo)
- Released: 2026/02/09
- Author(s): Bryan Gee (UT Libraries, University of Texas at Austin; bryan.gee@austin.utexas.edu; ORCID: 0000-0003-4517-3290)
- Contributor(s): None
- License: MIT
- README last updated: 2026/02/09
This repository contains Python code that is designed to gather and organize metadata from a number of individual research data repository/platform APIs in order to analyze and summarize research dataset publications that are affiliated with at least one researcher from a particular institution. This code is being developed in the specific context of retrieving data for the University of Texas at Austin but can be readily and easily adapted for use at other institutions.
- dataset-records-retrieval.py: This is the primary Python script for conducting large-scale records retrieval through the DataCite API. It also includes functionality for using a set of different APIs to try and identify deposits on Figshare that lack affiliation metadata but that can be connected to an article with at least one author from a focal institution.
- config-template.json: This is the config file that contains most parameters and that stores API keys. This file should be populated with personal information as necessary, with affiliation permutations modified if applying this to a different institution, and renamed as config.json in order for scripts to work. For some fields, the UT Austin specific information is left in as a model of how information should be formatted.
- journal-list.json: This file contains the official journal names and ISSNs to be queried as part of one of the possible Figshare workflows (construction of a hypothetical SI DOI and testing its existence). This file contains all PLOS titles as an example but could be expanded to any other journal that uses the model of appending '.s00x' to the article DOI for mediated Figshare deposits.
- data-dictionary.csv: This file describes the columns that are contained in each output and accessory-output file. Note that exporting certain files is coded out in the present script, and columns for those files are not defined here since those files are not considered essential (e.g., checkpoint files).
- accessory-scripts/dataset-records-retrieval-visualization.py: This file contains the code used to generate visuals for the 2025 RDAP Summit, the CNI 2025 Spring Meeting, the associated preprint, and the associated manuscript.
- accessory-scripts/plos-osi-search.py: This file retrieves the latest version of the PLOS Open Science Indicators (OSI) Dataset, identifies articles that list data as having been shared in part or in whole through Supplemental Information (mediated Figshare deposit for all PLOS titles), retrieves a list of PLOS articles with at least one author from a focal institution, and searches for matches to identify PLOS articles co-authored by a university researcher where 'data' were deposited on Figshare through the mediated process. It does the same for affiliated articles that link to NCBI deposits (note that this could be reuse rather than novel generation).
- accessory-scripts/plos-webscraping-supp-info.py: This file retrieves a list of affiliated PLOS articles (across all PLOS titles) and then uses beautifulsoup to scrape the webpages for each article, targeting the consistently-formatted Supporting information section that contains certain metadata on the files that are hosted on Figshare through a mediated process. As of this release, this is permissible under the PLOS text scraping policy, but re-users should make sure to double-check that these conditions have not changed before attempting a run themselves.
- accessory-scripts/datacite-ror-query.py: This file contains a trimmed version of the main workflow and uses the ROR identifier (specified in the config.json file) to search for affiliated datasets instead of a single or multiple affiliation name strings. The only purpose for this script is to quantify the degree to which a ROR-based query will result in an incomplete retrieval due to lack of widespread adoption of ROR / re-curation of deposits published prior to ROR integration.
- accessory-scripts/datacite-Figshare-partner-query.py: This file is an adaptation of one of the secondary Figshare workflows that identifies journal-mediated, DataCite-minted deposits without affiliation metadata and attempts to connect them to articles that were (co)authored by a researcher at a focal institution. In the workflow in the main codebase, the query is restricted to a publisher (e.g., 'Taylor & Francis') and specified resource types, with the script set to loop through a list of publishers. In this accessory script, only a single publisher is queried, but the query is broadened to capture any resource type (e.g., 'audiovisual', 'image'); note that this has to be time-capped for the past few years to keep the scale of the retrieval manageable, which could be on the order of hundreds of thousands of records for just one publisher. The script runs the same cross-matching of DataCite-minted deposits against a list of affiliated articles retrieved through OpenAlex. The purpose of this workflow is to explore whether some objects not labeled as 'dataset' might contain data and whether some objects labeled as 'dataset' might not be data. The use of file formats to attempt to predict the 'data' nature of an object is still very preliminary. It is intended to lay the foundation for a process in the main workflow to more rigorously assess the contents of objects labeled as 'dataset.'
- accessory-scripts/datacite-Figshare-partner-query_metadata-only.py: This file is used only to retrieve the automated metadata summaries that are included in DataCite API responses (e.g., number of DOIs published per year). This can potentially be useful for rapid summaries when gathering the data would otherwise be an intensive API query process due to the number of records, but it should be used carefully, especially for repositories that may mint DOIs at a very granular scale (e.g., Dataverse installations, Figshare, Zenodo).
- accessory-scripts/crossref-query.py: This file conducts a general institution-based query to the Crossref REST API. It is separated from the primary workflow based on the results for UT Austin (hundreds of thousands of results, most of which have nothing to do with UT Austin), which indicate that this does not need to be run as frequently as a DataCite query and could instead have a recently generated output file pulled in to concatenate with the primary workflow's output. This script is configured to export a CSV file with the same fields as the primary workflow output.
- accessory-scripts/Figshare-deposits-linked-articles.py: This file identifies Figshare deposits and then queries each one in the DataCite API to look for any object where the dataset is listed as being 'IsSupplementTo'. It then feeds those related article DOIs into the Crossref API to identify which journals/publishers are associated with Figshare deposits. This process applies to both datasets listed with 'Figshare' as the publisher in the DataCite metadata and datasets listed with a publisher partner like 'Taylor & Francis,' which are adjusted in the main workflow.
- accessory-scripts/Figshare-deposits-additional-metadata.py: This file identifies Figshare deposits and queries each one in the Figshare API to obtain additional metadata that may not be crosswalked, like file formats.
- accessory-data/20250310-mediated-Figshare-metadata-summary.csv: This file contains a manually compiled summary of select metadata for Figshare deposits mediated through publisher partners (filter on 'Publishers'); it is intended to provide insight into possible filter parameters that may permit their programmatic retrieval. This is a static file created on 2025/03/10, and partners/metadata may change in the future (e.g., SciELO journals was listed the last time I examined this in October 2024). Briefly, I accessed each publisher's Figshare collection through the web interface and selected 10 random deposits, with preference given to recent deposits. A few listed publishers are not recorded in the CSV file: JACC and SAGE redirect to the publishers' homepage, not a Figshare collection; Human Genome Variation is a database; and IEEE Standards, Medical Affairs Professional Society, Optica Open, and Physiome appeared to contain out-of-scope topics (e.g., only preprints in Optica Open). I recorded which indexer (DataCite vs. Crossref) was used to mint the DOI; what the listed publisher name is (listed_publisher); the client-id and provider-id if minted through DataCite, as these represent queryable fields that indicate a Figshare connection; up to 10 DOIs that were examined; whether the DOIs contain the string 'Figshare' (doi_figshare); and how the DOIs were constructed (doi_construction). There are only a few DOI construction models: a default Figshare DOI that is randomly created and assigned with the prefix '10.6084/m9.figshare'; appending .t00x or .s00x to the end of the article DOI, where 'x' is a sequential integer; or a randomly created DOI using the associated journal/publisher's DOI prefix.
- accessory-scripts/preprint-dryad-date-comparison.py: This file is narrowly focused, being designed only to look at discrepancies in timestamps in Dryad datasets as part of a case study in the forthcoming preprint. It retrieves all records for Dryad from both the DataCite and Dryad APIs and all timestamps that are available in each.
- accessory-scripts/preprint-reads-dataset-reanalysis.py: This file is narrowly focused, being designed only to reanalyze select parts of the RADS dataset (Johnston et al., 2024; PLOS ONE). It retrieves additional metadata for deposits linked to certain repositories and then either (1) retrieves information on linked articles or (2) applies the same deduplication steps of the primary workflow to the RADS data; these are not mutually exclusive but are not directly related functions.
Several scripts will create additional subdirectories as part of their workflow (e.g., 'outputs'); these subdirectories are not provided here.
The core workflow contained in dataset-records-retrieval.py makes use of four REST APIs in order to conduct a large-scale initial sweep for university-affiliated datasets based on a set of permutations for the institutional name: DataCite; Dataverse; Dryad; and Zenodo. Note that the Dataverse code is configured specifically for the Texas Data Repository (TDR)'s instance. Other APIs have been incorporated into secondary workflows or accessory scripts: Crossref; Figshare; and OpenAlex. Finally, a few APIs have been explored for dataset retrieval but are not currently incorporated here: Mendeley Data; and Open Science Framework (OSF).
The code is designed to maximize the potential retrieval scope of a given API query, specifically as it relates to the fields in which an affiliation may be found (which is not always the 'affiliation' field) and the various permutations for UT Austin specifically (e.g., 'University of Texas at Austin' versus 'University of Texas, Austin'). Even though the three repositories that are integrated into the primary workflow (Dataverse, Dryad, Zenodo) all mint their DOIs through DataCite and should thus be discoverable collectively through the DataCite API, the individual repository APIs were queried as both a cross-validation process and an exploration of whether there might be some important variability in metadata cross-walks; an initial inability to perfectly cross-validate all three repositories' records in the early stages of this code's development facilitated refinement of the workflow and identified edge-case scenarios. An additional benefit of exploring repository-specific APIs is the potential to identify additional metadata that are not cross-walked to DataCite (possibly because they are not supported in the present schema), such as certain controlled vocabularies.
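As a rough illustration of searching several fields for several name permutations, the sketch below builds a DataCite query URL without sending it. The field list, the function names, and the paging values are assumptions made for this example and are not taken from dataset-records-retrieval.py; the real parameters live in config.json.

```python
from urllib.parse import urlencode

DATACITE_DOIS_URL = "https://api.datacite.org/dois"

# Assumed list of fields in which an affiliation string may appear.
AFFILIATION_FIELDS = [
    "creators.affiliation.name",
    "contributors.affiliation.name",
    "creators.name",
    "contributors.name",
]


def build_datacite_query(permutations, fields=AFFILIATION_FIELDS):
    """Join every field/permutation pair into one OR-delimited query string."""
    clauses = [f'{field}:"{name}"' for field in fields for name in permutations]
    return " OR ".join(clauses)


def build_datacite_url(permutations, resource_type="dataset", page_size=1000):
    """Return a URL for the first page of a cursor-paginated request."""
    params = {
        "query": build_datacite_query(permutations),
        "resource-type-id": resource_type,
        "affiliation": "true",
        "page[size]": page_size,
        "page[cursor]": 1,
    }
    return f"{DATACITE_DOIS_URL}?{urlencode(params)}"


url = build_datacite_url(
    ["University of Texas at Austin", "University of Texas, Austin"]
)
```

Adapting this to another institution would mostly mean swapping the list of permutations.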
The primary script consists of four major components:
- API query construction and calls;
- Filtering of the JSON response and conversion to a pandas dataframe;
- Cross-validation checks of the responses from individual repositories' API against their equivalent output as retrieved from the DataCite API; and
- Concatenation and de-duplication, with the 'original' (specific repository API) source preferred when a dataset was returned by both the repository API and the DataCite API.
The cross-validation step is optional and can be enabled/disabled with a single Boolean variable in the config file; if disabled, DataCite will be the exclusive source of retrieved information. De-duplication is necessary regardless of whether cross-validation is implemented or not, primarily due to variable granularity of DOI assignment between repositories. The current process also handles 'double-minting' of DOIs for one deposit, a practice found in some repositories (e.g., Zenodo), and includes a toggle to de-duplicate Dataverse deposits that have the same list of authors and affiliations, the same publication date, and the same rights/licensing. This accounts for the tendency of some users to oversplit materials for one manuscript into multiple DOI-backed datasets, all nested under a non-DOI-backed dataverse (see ticket). This is a highly conservative approach that should avoid accidental removal of linked deposits that were separated intentionally (e.g., data in one deposit, software in another; different authors for different datasets for one study).
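The conservative Dataverse de-duplication described above can be sketched in plain Python as follows. The primary script works on pandas dataframes; this example uses a list of dictionaries instead, and the key names ('authors', 'affiliations', 'publication_date', 'rights') are hypothetical stand-ins for the actual column names.

```python
def dedupe_dataverse(records):
    """Keep the first deposit for each combination of authors, affiliations,
    publication date, and rights; later deposits with the same combination
    are treated as oversplit parts of the same project."""
    seen = set()
    kept = []
    for rec in records:
        key = (
            tuple(rec["authors"]),
            tuple(rec["affiliations"]),
            rec["publication_date"],
            rec["rights"],
        )
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


example = [
    {"doi": "a", "authors": ["Doe, J."], "affiliations": ["UT Austin"],
     "publication_date": "2024-01-01", "rights": "CC0"},
    {"doi": "b", "authors": ["Doe, J."], "affiliations": ["UT Austin"],
     "publication_date": "2024-01-01", "rights": "CC0"},
    {"doi": "c", "authors": ["Roe, R."], "affiliations": ["UT Austin"],
     "publication_date": "2024-01-01", "rights": "CC0"},
]
deduped = dedupe_dataverse(example)
```

Because any difference in the key (e.g., a different author list) keeps a deposit, intentionally separated deposits are unlikely to be collapsed.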
Based on the results of early testing with the primary workflow, I am now developing targeted secondary workflows that attempt to fill known gaps (e.g., paucity of Figshare deposits). There are several secondary workflows aimed at finding Figshare deposits, which largely lack affiliation metadata but which can be discovered through a number of different means, especially when they have been automatically created through a partner journal's manuscript submission process/portal. The dataset-records-retrieval.py script contains two secondary workflows that can be toggled on or off with Boolean variables.
- The first workflow (figshareWorkflow1) takes advantage of the fact that for many partner journals, mediated Figshare deposits are listed with the publisher in the 'publisher' metadata field, rather than Figshare. This workflow thus retrieves all datasets with a publisher listing like 'Taylor & Francis' through DataCite, retrieves university-affiliated articles published by that same publisher from OpenAlex (which uses ROR to standardize affiliation metadata), and looks for matches. It should be noted that there is widespread variation in how mediated Figshare deposits are classified in the DataCite resource type schema, so not all objects labeled as 'dataset' are perhaps datasets, and not all objects containing 'data proper' will be labeled as 'dataset' (other options include 'text', 'component', and 'collection').
- The second workflow (figshareWorkflow2) takes advantage of a different configuration in certain journals in which mediated Figshare deposits are minted through Crossref with a DOI that appends '.s00x' (or sometimes '.t00x') to the end of the associated article DOI where 'x' is a sequential number. This workflow retrieves all university-affiliated articles from a publisher that is known to do this (e.g., PLOS) with journal-list.json, constructs a hypothetical mediated Figshare DOI by adding '.s001' to the article DOI, and then tests whether that link exists. This is a more time-intensive process and will only identify that there is some sort of Figshare deposit - this may not be classified as a 'dataset' in the metadata or in a conceptual sense.
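The DOI construction in the second workflow can be sketched as below. The function names and the zero-padded three-digit suffix are assumptions for illustration, and the existence check is passed in as a callable (`doi_exists`, hypothetical) so that the example makes no network request; in practice that callable would wrap whatever HTTP check the script performs. The article DOI shown is a placeholder, not a real record.

```python
def hypothetical_si_doi(article_doi, index=1, kind="s"):
    """Append '.s00x' (or '.t00x') to an article DOI."""
    if kind not in ("s", "t"):
        raise ValueError("kind must be 's' or 't'")
    return f"{article_doi}.{kind}{index:03d}"


def flag_articles(article_dois, doi_exists):
    """Return one row per article with a 'Valid' flag for the '.s001' DOI."""
    rows = []
    for doi in article_dois:
        candidate = hypothetical_si_doi(doi)
        rows.append(
            {
                "article_doi": doi,
                "hypothetical_doi": candidate,
                "Valid": bool(doi_exists(candidate)),
            }
        )
    return rows


# Stand-in resolver for demonstration: pretends only one DOI resolves.
known = {"10.1371/journal.pone.0000000.s001"}
rows = flag_articles(
    ["10.1371/journal.pone.0000000", "10.1371/journal.pone.0000001"],
    doi_exists=lambda d: d in known,
)
```

A 'Valid' value of True only indicates that some object exists at the constructed DOI, not that the object is a dataset.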
There is also an accessory script called Figshare-deposits-linked-articles.py, which takes Figshare deposits that were recovered with affiliation metadata and 'Figshare' listed as the publisher and searches for article metadata in order to identify patterns (e.g., for UT Austin, all such deposits are linked to articles in Springer Nature journals). This was primarily used to explore how Figshare deposits' metadata is created (or not) and does not directly increase the number of identified datasets (it proved useful for better understanding of how to increase retrieval through the above methods).
A second accessory script called Figshare-deposits-additional-metadata.py takes Figshare deposits that use the standard Figshare DOI construction and loops them through the Figshare API (the numerical string at the end of the DOI is the 'ID' that is accepted by the API), returning information on the files within each deposit. Some basic rules are then implemented to flag deposits that contain software (e.g., R scripts), deposits that only comprise software, and deposits that only contain formats that are less likely to be data (e.g., MS Word). The current logic is rather basic, in part because there are not many unique file formats recognized in UT-affiliated deposits, but more robust criteria could be established (e.g., using the filenames). It's likely that for another institution, a reuser would encounter other file formats, as well as formats where the mimetype assignment is incorrect (e.g., one accounted for here is the labeling of CSV files as 'text/plain' rather than 'text/csv'.)
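A minimal sketch of the ID extraction and the format-based flags might look like the following. The extension sets, the function names, and the handling of a trailing version suffix (e.g., '.v2') are assumptions for this example rather than the rules used in Figshare-deposits-additional-metadata.py, and the DOI in the usage lines is a placeholder.

```python
import os
import re

# Illustrative extension sets only; a real run would need a fuller list.
SOFTWARE_EXTENSIONS = {".r", ".py", ".m"}
LOW_DATA_EXTENSIONS = {".doc", ".docx", ".pdf"}


def figshare_id_from_doi(doi):
    """Return the numeric ID at the end of a standard Figshare DOI, or None."""
    match = re.search(r"figshare\.(\d+)(?:\.v\d+)?$", doi, flags=re.IGNORECASE)
    return match.group(1) if match else None


def normalize_mimetype(filename, mimetype):
    """Correct CSV files that were labeled 'text/plain'."""
    if filename.lower().endswith(".csv") and mimetype == "text/plain":
        return "text/csv"
    return mimetype


def flag_deposit(filenames):
    """Flag a deposit based on the extensions of the files it contains."""
    exts = [os.path.splitext(name)[1].lower() for name in filenames]
    return {
        "contains_software": any(e in SOFTWARE_EXTENSIONS for e in exts),
        "only_software": bool(exts) and all(e in SOFTWARE_EXTENSIONS for e in exts),
        "only_low_data_formats": bool(exts)
        and all(e in LOW_DATA_EXTENSIONS for e in exts),
    }


deposit_id = figshare_id_from_doi("10.6084/m9.figshare.1234567.v2")
flags = flag_deposit(["analysis.R", "measurements.csv"])
```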
There is also a highly specific script, plos-osi-search.py, which retrieves the PLOS Open Science Indicators (OSI) dataset through the Figshare API. This dataset encompasses all PLOS articles (through a certain timeframe) and has identified locations of data sharing. This dataset contains information on data sharing in many forms, not just through Figshare, but because PLOS uses the mediated Figshare process, any article listed as having shared data as Supplemental Information has created a mediated Figshare deposit. This workflow (which runs in about 30 seconds) will identify all such articles, retrieve a list of university-affiliated PLOS articles through OpenAlex, and find matches. This script was mostly used as an initial test of concept for developing the above workarounds for Figshare deposits that can be applied across more publishers, but it can be useful for quickly getting an estimate of what proportion of articles have generated a Figshare deposit. Refer to the PLOS OSI methodology documentation for more details on how their dataset was generated; the same caveats of whether a deposit is a 'dataset proper' remain, although they attempted to infer whether SI was data or not.
There are also two secondary workflows that retrieve deposits from other sources. The first is retrieval of NCBI Bioprojects, which is contained within the primary script. NCBI does not use digital PIDs, instead issuing collection/accession/project IDs that while persistent, do not have a persistent-resolving URL like an ARK or handle, let alone a DOI. There is also no API specifically designed for metadata records retrieval for institutions, but the Entrez system can be queried through various modules for a general search for a string matching an affiliation. This workflow provides an option to use either this method or to use a Selenium library developed for Python. Selenium is normally used for developers to test functionality, but the ability to program a series of interactions with a webpage (e.g., click this dropdown, then select this radio button) allows for the automation of retrieving an XML of query results and integration into the larger codebase. This workflow requires a separate installation of a browser-specific WebDriver proxy. The Selenium-based workflow was developed for Mozilla Firefox and thus uses GeckoDriver, but other browsers have their own proxies (e.g., ChromeDriver). You may need to do some basic path mapping for the installed proxy (e.g., for GeckoDriver). If you use a browser other than Firefox, some modifications will be required for the code - an AI chatbot can probably convert the relevant chunks for you with minimal prompting since a chatbot facilitated the development of this workflow - but you don't need to have Firefox configured in any particular way or to use it on any regular basis, so as long as it's installed on your computer, you may not need or want to rework the script. The code for other browsers may be added in the future with a toggle to indicate which codeblock to use.
The second secondary workflow is externalized in a separate script, crossref-query.py. It makes a general query to the Crossref REST API in the single affiliation field for a combination of terms (university+of+texas+austin). For UT Austin, this retrieved an unwieldy number of results (>600,000), most of which were not actually related to UT Austin and even fewer of which were actually datasets (i.e., some UT Austin-affiliated objects were mislabeled as 'dataset' in the Crossref schema). This query is separated out since it is a high-cost, low-reward process that will not be run as routinely as the DataCite query (although runtime is only about 15-20 minutes when parameters are constrained by resource type).
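For reference, an affiliation query of this kind could be assembled as in the sketch below, which only builds the URL and does not send a request. The function name and the paging values are assumptions; `query.affiliation`, `filter`, `rows`, and `cursor` are parameters of the Crossref REST API, but the exact combination used in crossref-query.py may differ.

```python
from urllib.parse import urlencode

CROSSREF_WORKS_URL = "https://api.crossref.org/works"


def build_crossref_url(terms, resource_type="dataset", rows=1000):
    """Build a works query on the affiliation field, limited to one type."""
    params = {
        "query.affiliation": " ".join(terms),  # spaces are encoded as '+'
        "filter": f"type:{resource_type}",
        "rows": rows,
        "cursor": "*",  # start of deep paging
    }
    return f"{CROSSREF_WORKS_URL}?{urlencode(params)}"


crossref_url = build_crossref_url(["university", "of", "texas", "austin"])
```

Because the terms are matched loosely, results from a query like this still need to be screened for records that actually belong to the focal institution.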
The DataCite Citation Corpus was explored in early 2025 but is not presently incorporated because it returned very few results for UT Austin (fewer than 150).
Five files will always be saved regardless of whether cross-validation or Figshare workflows are used. If multiple resource types are queried, they will be listed in a hyphen-delimited string in that 'field' of the filename.
- date_resource type_datacite-initial-output.csv: This file is exported immediately after subsetting the API response for select fields and has had no additional processing (e.g., deduplication) performed; as a result, it can be quite large if your query retrieves a large number of file-level DOIs or other forms of overgranularized datasets.
- date_resource type_datacite-output-for-affiliation-source.csv: This file is exported relatively early in the process with fields retrieved from DataCite and is used exclusively to summarize which field an institutional abbreviation was detected in (affiliation_source) and what permutation of the institutional name was detected (affiliation_permutation). This categorization is hierarchical; the script first looks in the creator.affiliation.name field, and if it finds a focal affiliation, the entry is recorded as 'creator.affiliation.' If no affiliation is found, it then checks the contributor.affiliation.name field, and if it finds a focal affiliation, the entry is recorded as 'contributor.affiliation.' For remaining entries, it will check creator.name and contributor.name. Whenever a match is found, the entry is removed from further consideration, so a dataset with the affiliation listed in both the creator.affiliation.name and contributor.affiliation.name fields will be listed here as 'creator.affiliation.' There are also additional fields recorded here for metadata assessment that are similarly not carried through to the final output. Note: as of 2025/11/23, this file is now essentially fully redundant with the fourth file and may be deprecated in the future.
- date_resource type_datacite-output-for-metadata-assessment.csv: A nearly identical file to date_datacite-output-for-metadata-assessment.csv; currently, the information unique to this file is a column that converts mimeTypes to standardized, 'friendly' file formats, columns for whether a dataset contains any code format and whether it contains only code formats, and columns where the calculated number of descriptive and non-descriptive words in the dataset title are recorded. Note: as of 2025/11/23, this file is now essentially fully redundant with the fourth file and may be deprecated in the future.
- date_resource type_full-concatenated-dataframe.csv: This file is the final output of the main workflow. It has applied all filtering and de-duplication steps, trimmed the listed variables to a select few for readability, and has added some categorical variables (e.g., whether a repository is part of GREI).
- date_resource type_unique-affiliated-researchers.csv: This file returns a dataframe of unique researchers that were listed as being affiliated with the focal institution (across any of the queried metadata fields, so this will include the institution and likely some subsidiary units being listed as a researcher itself). The current script is set to combine entries using RapidFuzz (a fuzzy matching module for text) to circumvent minor variation in name construction, but it is intrinsically imperfect because of commonality of some names and will fail to identify some instances of multiple entries for one person. Requiring an exact match to combine author entries can be easily done, however, and the output will show which names were combined in the fuzzy matching.
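The hierarchical categorization behind affiliation_source and affiliation_permutation (described for the affiliation-source file above) can be sketched as follows. The record keys, the function names, and the 'creator.name'/'contributor.name' labels are assumptions for this example; the check order follows the description, and the first match wins.

```python
def find_permutation(values, permutations):
    """Return the first permutation found in any of the values, else None."""
    for value in values:
        for perm in permutations:
            if perm.lower() in value.lower():
                return perm
    return None


def classify_affiliation_source(record, permutations):
    """Check fields in a fixed order and stop at the first match."""
    checks = [
        ("creator.affiliation", record.get("creator_affiliations", [])),
        ("contributor.affiliation", record.get("contributor_affiliations", [])),
        ("creator.name", record.get("creator_names", [])),
        ("contributor.name", record.get("contributor_names", [])),
    ]
    for label, values in checks:
        perm = find_permutation(values, permutations)
        if perm is not None:
            return label, perm
    return None, None


perms = ["University of Texas at Austin", "UT Austin"]
both = {
    "creator_affiliations": ["University of Texas at Austin"],
    "contributor_affiliations": ["University of Texas at Austin"],
}
source, permutation = classify_affiliation_source(both, perms)
```

A record with the affiliation in both the creator and contributor fields is therefore counted only under 'creator.affiliation'.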
If you run the cross-validation process, the script will return a list of datasets from a specific repository that were retrieved from DataCite but not from that repository's API and vice versa (e.g., date_DataCite-into-Dryad_joint-unmatched-dataframe.csv). It will also combine all of the affiliated datasets that were only retrieved from the repository-specific API into one file (date_datacite-additional-cross-validation.csv).
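The comparison behind these unmatched files amounts to a two-way set difference on DOIs. The sketch below uses plain Python sets rather than the pandas dataframes used in the script, and the function name and dictionary keys are hypothetical. DOIs are lowercased before comparison because DOIs are case-insensitive.

```python
def cross_validate(datacite_dois, repository_dois):
    """Split DOIs into those found by only one source and those found by both."""
    datacite = {doi.strip().lower() for doi in datacite_dois}
    repository = {doi.strip().lower() for doi in repository_dois}
    return {
        "datacite_only": sorted(datacite - repository),
        "repository_only": sorted(repository - datacite),
        "matched": sorted(datacite & repository),
    }


result = cross_validate(
    ["10.5061/DRYAD.AAA", "10.5061/dryad.bbb"],
    ["10.5061/dryad.aaa", "10.5061/dryad.ccc"],
)
```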
If you run Figshare-deposits-linked-articles.py, it will output three files.
- The first file is date_institution-affiliated-Figshare-datasets-expanded-metadata.csv, which contains enriched metadata from DataCite for those Figshare deposits.
- The second file is date_institution-affiliated-Figshare-associated-articles.csv and will contain information on the linked articles as retrieved from the Crossref API.
- The third file is date_institution-affiliated-Figshare-associated-articles-merged.csv and combines these two files.
If you run figshareWorkflow1, it will output three different files.
- The first file is the output of every mediated Figshare deposit as identified by the presence of a scholarly publisher in the publisher field - most of these will not be affiliated with your institution. The second is the result of merging together the list of datasets from DataCite with the list of articles from OpenAlex, date_resource type_figshare-discovery-affiliated.csv.
- The second file contains a mix of fields from DataCite (e.g., dataset DOI) and OpenAlex (e.g., DOI and title of the article) and is restricted to only those Figshare deposits that are linked to an affiliated article, date_resource type_figshare-discovery-deduplicated.csv.
- The third file is date_full-concatenated-dataframe-plus-Figshare.csv, which appends the newly discovered Figshare records that do not record affiliation metadata but are linked to university-affiliated articles to the date_full-concatenated-dataframe.csv file. The source for the new Figshare deposits is listed as 'Datacite+' to account for the nuanced workflow.
If you run figshareWorkflow2, it will output one of two files depending on whether you use OpenAlex or Crossref; the file will look like this: date_indexer-articles-with-hypothetical-deposits.csv. This will return a list of all articles from the queried publisher(s) with a column ('Valid') indicating whether or not a hypothetical DOI was found for an article. There is no assurance that this object is a 'dataset' (in metadata classification or general classification), and there may be other objects of similar or different form that are related to the same article (e.g., .s001 is a supplemental figure and .s002 is a character matrix). For this reason, these records are not appended to the date_full-concatenated-dataframe.csv file as they will require more steps to assess.
If you run the Figshare validator, it will output one file, date_figshare-discovery-all-metadata_combined, which merges the basic dataframe returned from DataCite for these deposits with the filtered dataframe returned from the Figshare API. This is not reconcatenated with the larger dataset of all affiliated datasets.
If you run the NCBI workflow, it will always output the XML download from NCBI (bioproject_results.xml) and is set to overwrite any previous version. The converted dataframe is edited to be as similar as possible to the main dataframe to ensure alignment (e.g., NCBI identifiers are listed in the DOI column). The filtered dataframe that has the same columns as the main dataframe is exported as date_NCBI-select-output-aligned.csv. If you run the second Figshare workflow or pull in a previously generated output that used the second Figshare workflow, the NCBI dataframe will be concatenated and output a date_full-concatenated-dataframe-plus-Figshare-ncbi.csv file. If you don't run the second Figshare workflow, the NCBI dataframe will be concatenated with only the main output and return a date_full-concatenated-dataframe-plus-ncbi.csv file.
If you run the accessory-scripts/plos-osi-search.py script, it will return two files. The first is date_PLOS-articles-with-data-in-SI.csv and the second is date_PLOS-articles-with-data-in-NCBI.csv. Both represent the subset of the PLOS OSI dataset that could be linked to affiliated articles of the focal institution and for which data location was indicated to be either in SI or NCBI.
If you run the accessory-scripts/datacite-ror-query.py script, it will return four files. The first is date_datacite-ror-retrieval.csv, which returns all of the ROR-affiliated deposits in DataCite. The second is a filtered version of this file that handles de-duplication in the same fashion as the main workflow: date_datacite-ror-retrieval-filtered.csv. The third and fourth files are the same, but use a single-permutation-affiliation query parameter instead of the ROR identifier.
If you run the accessory-scripts/datacite-Figshare-partner-query.py script, it will return four files. The first is date_publisher_figshare-discovery-all.csv, which is the result of the initial DataCite query being cross-referenced with the OpenAlex output and thus returns a list of all Figshare deposits that can be linked to an affiliated university. This file includes all deposits when there are multiple files associated with one article that were each split into a separate deposit (e.g., Supplemental Table 1 gets a DOI, Supplemental Table 2 gets a different DOI) and is the one that is looped through the Figshare API. If you want a de-duplicated version that retains only one deposit per article (i.e. how many unique articles have mediated Figshare deposits), that is generated as date_publisher_figshare-discovery-deduplicated.csv. The third file is the first output from the Figshare API and contains a row for each file from the linked deposits with a few metadata fields: date_publisher_figshare-discovery-all_metadata.csv. The final file is date_publisher_figshare-discovery-all_metadata_combined.csv. This merges the original DataCite/OpenAlex dataframe with the extra metadata retrieved from the Figshare API, with several columns added to identify whether a deposit contains any software, only contains software, is labeled 'dataset' but contains only formats with a low probability of being data, and is not labeled 'dataset' but could be a dataset.
If you run the accessory-scripts/crossref-query.py script, it will return two files. date_crossref-institution-objects.csv contains all of the DOIs that are positively identified as being linked to the focal institution. date_crossref-institution-true-datasets.csv is a subset of that file and removes entries for any 'repository' that is not a data repository (e.g., Authorea Preprints).
The primary workflow and certain secondary workflows only collect items that are labeled as a 'Dataset' in the DataCite metadata schema (for some repositories, this is the only allowable object type), although many resource types can be queried at once. The same is true of the accessory Crossref script; Crossref explicitly recommends using 'dataset' for non-datasets if another label is not available or more appropriate. It is a given that most of these Crossref objects do not meet the criteria for 'data' proper, in part or in whole, and may include other materials like appendices or software; the present workflow does not attempt to make inferences on the precise nature of content (although this is planned). Conversely, some deposits that do constitute 'data' proper are labeled as another object type (e.g., 'Component,' 'Text'), and these are not presently detected. Retrieving objects through the DataCite API requires downstream processing, as some objects that are labeled as 'datasets' are either individual files within a DOI-backed deposit (common to Dataverse installations) or are versions of the same deposit (Zenodo, which mints a parent DOI and then a separate DOI for each version). The primary script omits individual files that are part of a larger project and restricts the Zenodo deposits to a single record per 'lineage' of deposits.
There are additional considerations to keep in mind related to how researchers organize materials within a single project. In some instances, distinct deposits with separate DOIs may in fact be part of the same project (e.g., associated with a single manuscript), and some calculations might wish to further consolidate these to attempt to capture the number of 'unique projects' with at least one dataset. For example, Dataverse has the relatively unique 'dataverse' object, a non-DOI-backed structure in which other dataverses and DOI-backed datasets can be nested. For this reason, some researchers will separate the materials for a single manuscript along some logical delineation (e.g., by data format; data vs. software) into multiple DOI-backed deposits that are housed within a single dataverse (example in the Texas Data Repository), whereas if those materials had been deposited in a different repository without an equivalent higher-level structure, they might have been deposited together in one PID-backed deposit.
Consolidation along these lines can be done by deduplicating with a stricter combination of attributes on the assumption that related deposits likely have nearly identical metadata (e.g., publication date, author list); it may also be possible to use relations to other objects, if provided (this is more likely to be exclusively recorded in a repository-specific API). The theoretical concept of consolidation that is given above for Dataverse could be accomplished with the Dataverse API since the dataverse in which a dataset is housed can be retrieved, but this would not be possible through the DataCite API since information about dataverses does not cross-walk (likely because dataverses do not receive DOIs and an equivalent structure is otherwise rare in other repositories). The present version of the primary workflow attempts to consolidate Dataverse deposits using a combination of variables.
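The idea of consolidating on a stricter combination of attributes can be sketched as follows. The exact variables used by the primary workflow are not listed here, so the choice of publisher, publication date, and author list is an assumption, as are the field names and the sample records.

```python
def consolidation_key(record):
    """Build a key from attributes that related deposits are assumed to share."""
    authors = tuple(sorted(name.strip().lower() for name in record["creators"]))
    return (record["publisher"].strip().lower(), record["publication_date"], authors)


def consolidate(records):
    """Group DOIs that share the same key; each group approximates one project."""
    groups = {}
    for record in records:
        groups.setdefault(consolidation_key(record), []).append(record["doi"])
    return groups


# Made-up records: the first two differ only in DOI and author order.
records = [
    {"doi": "10.1234/aaa", "publisher": "Texas Data Repository",
     "publication_date": "2024-05-01", "creators": ["Doe, Jane", "Roe, Sam"]},
    {"doi": "10.1234/bbb", "publisher": "Texas Data Repository",
     "publication_date": "2024-05-01", "creators": ["Roe, Sam", "Doe, Jane"]},
    {"doi": "10.1234/ccc", "publisher": "Texas Data Repository",
     "publication_date": "2023-11-15", "creators": ["Doe, Jane"]},
]
projects = consolidate(records)
```

A stricter key reduces false merges but will miss related deposits whose metadata differ slightly (e.g., a deposit published a day later), so the result is best read as an estimate of 'unique projects'.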
The Zenodo process of minting two DOIs for an initial release of a deposit was previously noted. Other repositories also do the same (e.g., Figshare, ICPSR, Mendeley Data) and need to be de-duplicated in the same fashion. Whether all versions of a single dataset should be counted as separate datasets may vary between institutions, but this workflow usually treats a 'lineage' of many versions as a single dataset. A final consideration with Figshare deposits (with or without affiliation metadata) is that there is variation in whether a journal-mediated process will create one deposit for all files associated with one manuscript or one deposit for each file associated with that manuscript. The latter is considered to be overly granular since those objects probably would not be deposited as separate deposits (e.g., two supplemental tables) in a human-controlled process. The workflow also accounts for this and uses the relatedIdentifier field to consolidate entries.
How affiliation metadata are cross-walked and exposed affects the scale of what can and cannot be retrieved with this workflow. Specifically for Dataverse, the Search API, which casts a wide net to retrieve many records (efficient, not impacted by rate limiting), does not return affiliation metadata for the 'creator' (author) field, only for the 'contact.' It is possible to get the 'creator' affiliation metadata through the Native API, but this requires passing a list of DOIs of interest to the Native API, which then makes a request for each DOI (less efficient, can be impacted by rate limiting). The Dataverse component of the cross-validation process thus returns the listed 'contact' affiliation; it is inferred that in the overwhelming majority of cases, the point(s) of contact are probably also listed as authors or, at the very least, are from the same institution (i.e. that 'contact' affiliation is a good proxy for 'creator' affiliation).
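The difference in request volume between the two Dataverse APIs can be made concrete with a small sketch that only builds request URLs (no requests are sent). The base URL is a placeholder, the function names are hypothetical, and the query string is an example; the endpoint paths follow the public Dataverse API guide, but any given installation may differ.

```python
from urllib.parse import urlencode

BASE_URL = "https://dataverse.example.edu"  # placeholder installation, not a real host


def search_url(query, per_page=1000, start=0):
    """One Search API request can return up to per_page dataset records."""
    params = {"q": query, "type": "dataset", "per_page": per_page, "start": start}
    return f"{BASE_URL}/api/search?{urlencode(params)}"


def native_urls(dois):
    """The Native API needs one request per DOI to reach the 'creator' metadata."""
    return [
        f"{BASE_URL}/api/datasets/:persistentId/?{urlencode({'persistentId': 'doi:' + doi})}"
        for doi in dois
    ]


dois = ["10.1234/AAA", "10.1234/BBB", "10.1234/CCC"]  # made-up DOIs
one_request = search_url("example query")
many_requests = native_urls(dois)  # len(many_requests) grows with the DOI list
```

For a few thousand deposits, the Search API needs a handful of paged requests while the Native API needs one request per deposit, which is why the 'contact' affiliation is used as a proxy.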
This script can be freely re-used, re-distributed, and modified in line with the associated MIT license. If a re-user is only seeking to replicate a UT-Austin-specific output or to retrieve an equivalent output for a different institution, the script will require very little modification - essentially only the defining of affiliation parameters will be necessary. For other Dataverse-based platforms that have significantly altered the metadata framework, it is possible that additional edits to the API call and subsetting of the response will be necessary if you run cross-validation. If additional fields or processing of the output are desired, the script will require more substantive modification and knowledge of the specific structure of a given API response.
This workflow is, and likely always will be, under development. Because of the marked heterogeneity in how datasets are shared (e.g., lack of persistent identifiers; use of identifiers other than DOIs; variation in affiliation metadata), it is practically assured that not all datasets will be captured by this workflow or any other and that substantial gaps may exist for certain platforms/avenues for data sharing. Reusers should be cognizant of these limitations in determining how data gained from this workflow may inform decision-making. The creator(s) and contributor(s) of this repository and any entities with which they are affiliated are not responsible for any decisions, policies, or other actions that are made on the basis of obtained data.
API keys and numerical API query parameters (e.g., records per page, page limit) are defined in a config.json file. The file included in this repository called config-template.json should be populated with API keys (see below) and renamed.
Users will need to create accounts for Dataverse and Zenodo in order to obtain personalized API keys, add those to the config-template.json file, and rename it as config.json. Note that if you wish to query multiple Dataverse installations (e.g., a non-Harvard institutional dataverse and Harvard Dataverse), you will need to create an account and get a separate API key for each installation. Crossref, DataCite, Dryad, Figshare, and OpenAlex do not require API keys for standard access.
If users need to modify the existing API queries, they should refer to the previously linked API documentation for specific APIs. For targeting a different institution (or set of institutions), users will need to identify a list of possible permutations of the institutional name; the use of ROR identifiers in either the DataCite API or most repositories' specific API will fail to retrieve most related deposits because most repositories have not implemented ROR into their platforms given its relatively recent added support in the DataCite schema (Dryad is a notable exception as an early adopter of ROR). It may also not be feasible for platforms to retroactively add ROR identifiers for all previously published deposits in an efficient programmatic fashion without potentially introducing errors. The optional cross-validation step can facilitate identification of some permutations if querying an API that does not require an exact string match for retrieval based on affiliation. Another approach is to compile a list of known affiliated deposits within and across different repositories and then to examine their metadata in the DataCite API; testing this on some of my own datasets led to the discovery of a lack of recording of affiliation in Figshare metadata, for example. A third approach would be to survey affiliated scholarly articles, books, and preprints (e.g., through the Crossref API, which does not require an exact string match for affiliation).
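To show why a list of name permutations matters, the following sketch checks affiliation strings against a short list after light normalization. The permutation list is an illustrative subset, not the list stored in the config file, and the function names and matching approach are assumptions for the example.

```python
import re

# Illustrative subset of permutations; the real list lives in the config file.
PERMUTATIONS = [
    "University of Texas at Austin",
    "The University of Texas at Austin",
    "University of Texas, Austin",
    "UT Austin",
]


def normalize(name):
    """Lowercase and collapse punctuation/whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


NORMALIZED = [normalize(p) for p in PERMUTATIONS]


def is_affiliated(affiliations):
    """True if any affiliation string contains a known permutation as whole words."""
    for affiliation in affiliations:
        padded = f" {normalize(affiliation)} "
        if any(f" {perm} " in padded for perm in NORMALIZED):
            return True
    return False


match = is_affiliated(["Dept. of Geology, The University of Texas at Austin"])
no_match = is_affiliated(["University of Texas at Arlington"])
```

Matching on whole words after normalization tolerates punctuation differences and embedded department names, but it cannot recover permutations that are absent from the list, which is what the discovery approaches above are for.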
If the Dataverse cross-validation step is enabled for a different installation than the Texas Data Repository, the DOI prefix should be modified accordingly. The 'subtree' parameter can probably be removed as well since most Dataverse installations are not multi-institutional (each Texas-based partner has a separate subtree within TDR).
A Boolean variable called test, located in the config file, can be used to create a 'test environment.' If this setting is set to TRUE, the script is set to only retrieve 5 pages from the DataCite API (a full run searching only for 'dataset' objects currently requires more than 77 pages with a page size of 1,000 records for UT Austin). Currently, the number of records retrieved from the three other APIs utilized here (Dataverse, Dryad, Zenodo) is significantly smaller, so different page limits for a test run are not defined for these (but could be added).
Similar to the test environment, a Boolean value called crossValidate (located in the config file) can be used to toggle the cross-validation component on and off (TRUE will retrieve records from other APIs and cross-validate against DataCite). Another variable in the config file called dataverse will enable or disable the cross-validation for a Dataverse-based repository specifically (would require an API key and is not relevant for all potential users).
- For UT Austin users: there should be no reason why you need to run the cross-validation process since I have used it to refine the workflow into the present state (e.g., to account for different permutations of the institutional name).
- For non-UT Austin users: if you are at another institution and want to adapt this workflow, running the cross-validation is recommended to identify alternative/unexpected permutations of the name, especially if you are at an institution that similarly is part of a broader system or that has its name listed in a wide variety of ways. However, you may be able to intuit these on your own or have the fortune of being at an institution with few options (e.g., Stanford University).
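
The way the test, crossValidate, and dataverse toggles interact can be sketched as below. This is a simplified illustration: it assumes a flat config layout with JSON Booleans and a hypothetical page_limit key, and the function name is made up; the actual structure of config.json may differ.

```python
import json


def resolve_run_settings(config_text):
    """Translate config toggles into the settings a run would use."""
    config = json.loads(config_text)
    test = bool(config.get("test", False))
    cross_validate = bool(config.get("crossValidate", False))
    # The Dataverse step only matters when cross-validation itself is enabled.
    dataverse = cross_validate and bool(config.get("dataverse", False))
    # A test run caps DataCite retrieval at 5 pages; 'page_limit' is a hypothetical key.
    page_limit = 5 if test else config["page_limit"]
    return {
        "test": test,
        "cross_validate": cross_validate,
        "dataverse": dataverse,
        "page_limit": page_limit,
    }


example = '{"test": true, "crossValidate": true, "dataverse": false, "page_limit": 100}'
settings = resolve_run_settings(example)
```

In this example, a run would retrieve 5 DataCite pages and cross-validate against the non-Dataverse repositories only.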
In the present configuration, rate limiting is unlikely to affect the workflows or require modification because queries do not target specific DOIs (i.e. a few requests return many records). However, potential/planned expansion may necessitate the use of targeted single-object retrieval, and users should be aware that many public APIs impose some kind of rate limiting (e.g., Dryad; OpenAlex; Zenodo). Note that Zenodo also restricts the total number of records that can be retrieved with one query to 10,000. Dataverse installations may or may not have rate limits; for users attempting to retrieve data from the Texas Data Repository, there are currently no rate limits, although a to-be-determined limit is planned in the near future.
Exact runtime will vary with local internet speed and external server traffic. Typically, a run of the primary workflow without cross-validation (only retrieving from DataCite) or any of the secondary Figshare workflows should complete in under 20 minutes for UT Austin or an institution of predicted similar research output. If cross-validation is employed, a run should complete in under 25 minutes for UT Austin or an institution of predicted similar research output. The test environment without cross-validation should complete in about 1 minute; if cross-validation is employed, it should complete in about 8-9 minutes.
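If targeted single-object retrieval is added later, one common way to cope with rate limits is to retry with exponential backoff when an API answers with HTTP status 429 (Too Many Requests). The sketch below is not part of the current scripts; the fetch callable is a hypothetical stand-in for whatever function performs one request and returns a (status, payload) pair.

```python
import time


def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a status other than 429, doubling the wait each time."""
    status, payload = None, None
    for attempt in range(max_attempts):
        status, payload = fetch()
        if status != 429:
            return status, payload
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    # All attempts were rate limited; return the last response for the caller to handle.
    return status, payload


# Simulated API that is rate limited twice before succeeding (no network involved).
responses = iter([(429, None), (429, None), (200, {"doi": "10.1234/example"})])
waits = []
result = fetch_with_backoff(lambda: next(responses), sleep=waits.append)
```

Passing the sleep function as a parameter keeps the sketch testable without real delays; in this simulated run, the recorded waits are 1.0 and 2.0 seconds before the successful third attempt.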
Adding one of the Figshare workflows can significantly increase the runtime, especially if many publishers are queried; significant variation between institutions in publishing volume with a certain publisher is expected. Typical runtime of the combined main workflow, first Figshare workflow (DataCite + OpenAlex), and NCBI step is around 40-55 minutes at present. The NCBI workflow on its own should complete in about 30-40 seconds.
For UT Austin, the full search process, with cross-validation, Figshare workflow #1, and the NCBI workflow takes about 40-45 minutes to complete at present. If a script finishes significantly faster than you expect or have experienced in previous runs, check the number of returned records; sometimes server instability or high traffic leads to incomplete retrieval.
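A check on the number of returned records can be automated along these lines. The sketch assumes the total reported by the API is available (the DataCite API reports a total in the meta object of list responses) and that the count from a previous run has been saved somewhere; the function name and the 5% threshold are arbitrary choices for illustration.

```python
def check_retrieval(records_retrieved, total_reported, previous_run_count=None, max_drop=0.05):
    """Return a list of warnings if a run looks incomplete; an empty list means no concerns."""
    warnings = []
    if records_retrieved < total_reported:
        warnings.append(
            f"Retrieved {records_retrieved} of {total_reported} records reported by the API."
        )
    if previous_run_count and records_retrieved < previous_run_count * (1 - max_drop):
        warnings.append(
            f"Record count dropped from {previous_run_count} to {records_retrieved} since the last run."
        )
    return warnings


ok = check_retrieval(77000, 77000, previous_run_count=76500)
suspicious = check_retrieval(40000, 77000, previous_run_count=76000)
```

A run cut short by server instability would trigger both warnings, whereas a complete run that is simply faster than usual would trigger neither.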
This workflow is intended to be continually developed by members of the Research Data Services team at the UT Libraries in order to continually refine the process and expand the capture potential. Product development ideas/plans are listed as 'Issues'. The projected timelines are listed in a linked Project.
The use of OAI-PMH protocols and large "data dumps" (like the 200 GB Crossref public data file) is under consideration for future incorporation, but a secondary objective of this workflow is to employ code and data sources that are both accessible and computationally tractable for a wide range of potential users who may not have access to above-average storage or computing capacities or even much exposure to code.
One area intended for immediate exploration is expanded flexible search functionality in the DataCite API (see December 2025 release notes here).
If you have any questions or comments that don't warrant creating an issue, feel free to email me - I'm happy to help with any trouble-shooting for re-users. As a note, I plan on writing up at least a preprint describing the conceptual basis for the workflow and all of its nuances, so hopefully that will help to provide more robust documentation beyond this README and the in-script annotations.
A previous version added a CONTRIBUTING.md file that has been temporarily removed after internal discussion about university policies on distributing open source software. The gist is that there are still some policy details to be sorted out (not just for this project) with our technology transfer office, so we're unable to accept external pull requests at the moment (this does not affect others' ability to fork and modify the repository). We hope this will get clarity in the near future so that it can be opened up. In the interim, if you have ideas of how to improve something (excluding aesthetics), feel free to reach out by email or make an issue.
The current version scheme follows a MAJOR.MINOR.PATCH format, with a 'major' change involving added functionality or significant revisions to the workflow; a 'minor' change involving addition of accessory files or minor revisions to the workflow (e.g., refactoring); and a 'patch' being a bug fix.
Version 2.2.0 makes minor, mostly non-functional syntax changes for code standardization across nearly all scripts. This primarily includes more consistent casing of variables and the externalization of common functions into a utils.py file. Certain functions that are unique to a script are retained internally. All scripts have been tested with at least a 'test' run to ensure that the call to this utilities file is working correctly. Some additional cleaning steps specific to entries retrieved for UT Austin have been added to the primary script (standardizing repository listings). Minor edits to more functional code have been applied to handle minor changes to fields in the Figshare API and to the structure of the PLOS OSI dataset in the most recent version. The corresponding Jupyter notebook for the primary workflow has also been updated. The README now links to the published JeSLIB article.
Version 2.1.0 makes minor, mostly non-functional syntax changes for code standardization in the core script and the data visualization script. Some minor bugs (not all found in the previous version) were fixed. Some metadata fields were added or modified in the output files, and additional cleaning steps have been added for UT-specific records. Logging capacity has now been introduced.
Version 2.0.0 is identical to version 1.2.1 and is only labeled as such for alignment with Zenodo releases (i.e. it is considered a major change from the initial release that was ported to Zenodo).
Version 1.2.1 involves refactoring of some code and some new functionality. The dataset-records-retrieval.py has been modified to handle when the cross-validation process is enabled but the retrieval from one specific repository's API fails (this seems to happen not infrequently with Zenodo's) to allow the script to continue. A bug fix has also been applied related to querying of the OpenAlex API as part of the secondary Figshare workflow (failed retrieval of Taylor & Francis records, failure to retrieve publication year, de-duplication of initial records). New functionality has also been added here to generate a dataframe of all unique author names, with select columns combined in order to get, for example, a total count of datasets. Note that this relies on exact matching of author names, so common names are more likely to represent multiple individuals, and the same individual may be represented by multiple entries (e.g., Last, First format vs. First Last; inclusion of middle initial). Additional functionality has also been added to begin logging script runs in two forms, a text file unique to each run and a composite CSV file that appends all logs. Both record the same information (e.g., timestamp, runtime, certain parameters from the config file). Additional work on logging is planned.
Version 1.1.1 makes a few minor modifications to accessory scripts and some more sizable modifications to the primary dataset-records-retrieval.py script. The crossref-query.py script was updated to add additional fields for concatenation with the DataCite output (most are 'filler' uniform values as no equivalent field exists); the script was also modified to have a toggle for enabling or disabling a filter for a particular resource type (see issue #82). The datacite-ror-query.py script was updated to fix a bug in which some Figshare deposits were inadvertently being removed from the dataframe after consolidation. The dataset-records-retrieval-visualization.py script was updated in response to feedback from peer review of the manuscript; several plots have been slightly altered or rearranged. The config-template.json file has been updated to reflect the externalization of toggles for the primary dataset-records-retrieval.py script and the addition of a list of compressed file formats in order to identify datasets with such files (if file information is provided). The changes to the primary dataset-records-retrieval.py script include:
- externalizing toggles to the config.json file
- fixing a bug in which certain metadata assessments were contingent on running certain (optional) steps
- increasing robustness in handling author information retrieved from the DataCite API
- setting all CSVs to be exported in UTF-8 format specifically
- increasing metadata subsetting from the API response and exporting in the most common dfs to improve efficiency of internal reporting for UT Austin (this will likely lead some files to be deprecated in the future)
- harmonizing metadata fields between different dfs for concatenation
Version 1.0.1 adds some minor functionality to more clearly indicate what resource type(s) were queried in a DataCite call to output filenames and adds a generalist/specialist categorization based on the repository a deposit is in. It also fixes some issues with toggles to enable/disable certain parts of the workflow.
Version 1.0.0 is the first release that is also synced to Zenodo. It makes a few minor adjustments for refactoring the primary workflow and some of the accessory workflows. The most substantive changes are made to the dataset-records-retrieval-visualization.py and the accessory-scripts/preprint-rads-dataset-reanalysis.py files in preparation for the preprint. The accessory-scripts/plos-webscraping-supp-info.py file is newly added; see file list for description. Some very minor edits are made to a few files: (1) the license is updated to indicate the copyright holder is the UT Board of Regents (https://www.utsystem.edu/board-of-regents/rules/90101-intellectual-property); (2) the accessory-scripts/crossref-query.py file is updated to account for changes to Authorea deposit metadata; (3) some annotations are updated in the accessory-scripts/preprint-dryad-date-comparison.py file; and (4) some additional file export lines are added to the dataset-records-retrieval.py (and Jupyter) files. The versioning is reset to synchronize with planned periodic release on Zenodo.
---previous version history notes are available in older README files with the older version system before it was reset---