This README provides an overview of the formulation and calculation of our S-index.
Despite growing adoption of data sharing, there is still no standardized, transparent, and equitable way to measure and incentivize it. Our team is one of the finalists of the NIH S-index Challenge, which was established to address this problem. As part of the Challenge, we propose the S-index, a metric that evaluates the data sharing impact of a researcher based on dataset-level signals of FAIRness, citations, and alternative mentions. Addressing limitations of existing metrics, our S-index is simple to interpret, incentivizes data sharing, is field-sensitive and responsive to evolving practices, and is feasible to implement sustainably using existing infrastructure.
We provide below an overview of the formulation and calculation of our S-index. We refer to the S-index Hub for more information about our S-index and related resources. Details about the NIH S-index Challenge are available here and here.
We introduce a framework in which a dataset receives a Dataset Index that combines its FAIR score with field-normalized counts of citations and alternative mentions; a researcher's S-index then aggregates the Dataset Indices of the datasets they have authored.
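As a purely illustrative sketch of this aggregation: the weighting and normalization below are our own simplifications for demonstration, not the official S-index formulation.

```python
# ILLUSTRATIVE ONLY: the combination rule and normalization below are
# assumptions for demonstration, not the official S-index formulas.

def dataset_index(fair_score, citations, mentions, field_norm):
    """Combine a FAIR score with field-normalized reuse signals (illustrative).
    field_norm stands in for the expected citation/mention count in the field."""
    reuse = (citations + mentions) / field_norm
    return fair_score * (1 + reuse)

def s_index(dataset_indices):
    """Aggregate a researcher's Dataset Indices (illustrative: simple sum)."""
    return sum(dataset_indices)

# Example: three datasets by one researcher, same field normalization factor
datasets = [
    {"fair": 0.8, "cites": 10, "mentions": 4, "norm": 7.0},
    {"fair": 0.6, "cites": 2, "mentions": 0, "norm": 7.0},
    {"fair": 0.9, "cites": 25, "mentions": 12, "norm": 7.0},
]
indices = [dataset_index(d["fair"], d["cites"], d["mentions"], d["norm"])
           for d in datasets]
print(round(s_index(indices), 2))
```

Because reuse counts are divided by a field-level normalization factor, a dataset in a small field with few expected citations can score as well as a heavily cited dataset in a large field.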
An example of S-index calculation is provided in Fig. 1. Calculation only requires dataset metadata from existing infrastructure (Table 1), enabling large-scale calculation regardless of dataset size, format, reuse license, and access conditions. Because no existing infrastructure aggregates these signals, we are building automated pipelines that find datasets, compute FAIR scores, identify citations and alternative mentions, and periodically update Dataset Indices and S-indices (Fig. 2). As part of our testing and validation for Phase 2 of the NIH S-index Challenge, we calculated the Dataset Index of 49M+ datasets and 1M+ S-indices. They are accessible at beta.scholardata.io, the Google Scholar-like platform we are building for researchers to find datasets and track their S-index the same way they track their manuscripts and h-index.
Fig. 1. Example of the calculation of the Dataset Index of a dataset and the S-index of its author. Steps 1-5 need to be repeated for each of the researcher's datasets to calculate their S-index. Steps 3B to 6 need to be repeated periodically (e.g., monthly) to account for new citations and mentions (the FAIR score will change if there are major updates to the data repository's metadata practices, and normalization factors will change if we identify citations or mentions we may have missed in prior years).
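The periodic refresh described in the Fig. 1 caption can be sketched as follows; all function and field names here are hypothetical placeholders, not our actual pipeline API.

```python
# Hypothetical sketch of the monthly refresh of steps 3B-5 in Fig. 1.
# Names and record shapes are illustrative placeholders only.

def refresh_datasets(datasets, new_citations, new_mentions):
    """Fold citations/mentions found since the last run into each dataset's
    running counts. FAIR scores are left untouched here, since they only
    change on major repository metadata updates."""
    for d in datasets:
        d["citations"] += new_citations.get(d["doi"], 0)
        d["mentions"] += new_mentions.get(d["doi"], 0)
    return datasets

# Example run: two datasets and this month's newly found signals
datasets = [
    {"doi": "10.1234/a", "citations": 10, "mentions": 4},
    {"doi": "10.1234/b", "citations": 2, "mentions": 0},
]
updated = refresh_datasets(datasets, {"10.1234/a": 3}, {"10.1234/b": 1})
print([(d["citations"], d["mentions"]) for d in updated])
```

After the refresh, the Dataset Indices and the researcher's S-index would be recomputed from the updated counts (steps 5-6 in Fig. 1).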
Fig. 2. Overview of the pipelines for collecting dataset-level data to compute the S-index of researchers
Table 1. Major resources needed to calculate the proposed Dataset Index and S-index. All are free, and the majority are open source. All are existing infrastructure, except for the custom domain classification model we have fine-tuned to fill a gap for datasets without a domain assigned in OpenAlex. We have already used most of them in our large-scale testing and validation as part of our submission for Phase 2 of the NIH S-index Challenge.
| Purpose | Resources | Free? | Used in testing? |
|---|---|---|---|
| Discover new datasets with/without DOIs and register their metadata in the Dataset Registry | DataCite API (datasets with DOIs) | ✅ | ✅ |
| | re3data API (finding non-DOI repositories) | ✅ | ❌ |
| | Repository-specific APIs | Likely | ❌ |
| | Web scraping tools like Scrapy (where allowed) | ✅ | ❌ |
| Compute FAIR scores | F-UJI | ✅ | ✅ |
| Find citations | MDC Data Citation Corpus | ✅ | ✅ |
| | OpenAlex snapshot | ✅ | ✅ |
| | DataCite API | ✅ | ✅ |
| | ScholeXplorer | ✅ | ❌ |
| Find mentions in code (reuse in computational models, AI/ML, teaching courses) | Software Heritage Graph Dataset | ✅ | ✅ |
| | Hugging Face API (rate-limited) | ✅ | ✅ |
| | Kaggle API (rate-limited) | ✅ | ❌ |
| Find mentions in patents (commercial reuse) | United States Patent and Trademark Office bulk download datasets | ✅ | ✅ |
| | Lens.org for international patents (free tier) | ✅ | ❌ |
| Find mentions in policies (websites such as WHO, UN, NIH, CDC) | Web scraping tools like Scrapy (where allowed) | ✅ | ❌ |
| | PDF extractors like pdfminer.six | ✅ | ❌ |
| Assign research domain | OpenAlex taxonomy of research domains | ✅ | ✅ |
| | OpenAlex snapshot | ✅ | ✅ |
| | Custom domain classification model (self-hosted, for datasets not in OpenAlex) | ✅ | ✅ |
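For instance, dataset discovery via the DataCite REST API (the first row of Table 1) amounts to querying its `/dois` endpoint and flattening the JSON:API response into registry records. The sketch below parses an abbreviated sample payload offline; the fields shown are a subset, and exact attribute shapes may vary by API version.

```python
# Sketch of turning a DataCite REST API (/dois) response into registry
# records. The sample payload is abbreviated; real responses carry many
# more attributes (creators, dates, resource type, etc.).
import json

def registry_records(datacite_response):
    """Flatten a DataCite JSON:API response into simple registry records."""
    records = []
    for item in datacite_response.get("data", []):
        attrs = item.get("attributes", {})
        records.append({
            "doi": item.get("id"),
            "title": (attrs.get("titles") or [{}])[0].get("title"),
            "publisher": attrs.get("publisher"),
        })
    return records

# Abbreviated sample of what a query such as
# https://api.datacite.org/dois?resource-type-id=dataset might return
sample = json.loads("""{
  "data": [{
    "id": "10.1234/example",
    "attributes": {
      "titles": [{"title": "Example dataset"}],
      "publisher": "Example Repository"
    }
  }]
}""")
print(registry_records(sample))
```

In the live pipelines, records like these would be paged through with the API's cursor pagination and written to the Dataset Registry before FAIR scoring.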