22 fetch metadata from apis by ClaireHzl · Pull Request #24 · dataforgoodfr/14_EUFactForce

ClaireHzl · 2026-03-16T13:09:33Z

Script that retrieves metadata for a specific article using various APIs based on its DOI and downloads its PDF if it is open access.

cgoudet

Thanks for this work!

Everything should go into "prod" rather than exploration.

cgoudet · 2026-03-23T08:10:33Z

eu_fact_force/ingestion/data_collection/parsers/arxiv.py

Since we are going to use them in prod, these parsers should live in the prod section of the project, not the exploration one.

cgoudet · 2026-03-23T08:12:05Z

eu_fact_force/ingestion/data_collection/parsers/arxiv.py

+        return [article.pdf_url] if article else []
+
+
+if __name__ == "__main__":


This would be better as an integration test, with a skip so that it never runs in CI.
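A minimal sketch of what such a skipped integration test could look like. It uses the standard-library `unittest` skip decorator (which pytest also honours). The environment variable `RUN_LIVE_API_TESTS`, the placeholder `fetch_arxiv_pdf_urls`, and the sample DOI are assumptions made for illustration, not names taken from the PR.

```python
import os
import unittest

# Hypothetical opt-in flag: CI never sets it, so the test is always skipped there.
RUN_LIVE = os.environ.get("RUN_LIVE_API_TESTS") == "1"


def fetch_arxiv_pdf_urls(doi: str) -> list[str]:
    """Placeholder for the real parser call (name and signature are assumptions)."""
    raise NotImplementedError("wire this to ArxivMetadataParser in the project")


class ArxivIntegrationTest(unittest.TestCase):
    @unittest.skipUnless(RUN_LIVE, "hits the live arXiv API; set RUN_LIVE_API_TESTS=1")
    def test_pdf_url_is_returned(self):
        # Sample arXiv-style DOI, used here only as an example input.
        urls = fetch_arxiv_pdf_urls("10.48550/arXiv.1706.03762")
        self.assertTrue(urls)
```

With this shape, the `if __name__ == "__main__":` block can be dropped from the parser module, and a developer can still exercise the live API locally by setting the flag.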

cgoudet · 2026-03-23T08:13:17Z

eu_fact_force/ingestion/data_collection/parsers/arxiv.py

+ARXIV_DOI_PREFIX = "10.48550/arXiv."
+
+
+class ArxivMetadataParser(MetadataParser):


I don't remember the discussions. Is it not possible to download PDFs from arXiv?

Downloading is possible from arXiv, but not from PubMed at first glance.

So here arXiv uses the default download from MetadataParser?

cgoudet · 2026-03-23T08:14:59Z

eu_fact_force/exploration/data_collection/parsers/base.py

+            return False
+        try:
+            for pdf_url in pdf_urls:
+                response = requests.get(pdf_url, timeout=30)


For a bit more clarity, maybe create a dedicated function to download a single file.

When you say a dedicated function, do you mean a sub-function of this function that handles only the download itself (so that the function is less complex), or a specific function for each child class?
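One possible reading of the suggestion is the first option: a small helper that downloads exactly one URL and reports success. The sketch below is hedged accordingly. `download_pdf` and the injectable `fetch` parameter are hypothetical names, and the default fetcher simply mirrors the `requests.get(pdf_url, timeout=30)` call shown in the diff.

```python
from pathlib import Path
from typing import Callable


def _http_get(url: str, timeout: int) -> bytes:
    """Default fetcher, mirroring the requests.get call in the diff."""
    import requests  # imported lazily so the helper can be tested offline

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.content


def download_pdf(
    pdf_url: str,
    output_path: Path,
    timeout: int = 30,
    fetch: Callable[[str, int], bytes] = _http_get,
) -> bool:
    """Download one URL; return True only if a real PDF was written."""
    try:
        content = fetch(pdf_url, timeout)
    except Exception as exc:  # network or HTTP error: report and give up on this URL
        print(f"Download failed for {pdf_url}: {exc}")
        return False
    # Same check as in the diff: paywall pages are HTML, not PDF.
    if not content.startswith(b"%PDF"):
        print(f"Content at {pdf_url} is not a valid PDF (possibly a paywall page).")
        return False
    Path(output_path).write_bytes(content)
    return True
```

The loop over `pdf_urls` in the base class would then reduce to calling this helper per URL, and child classes would not need their own download function unless an API requires special handling.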

cgoudet · 2026-03-23T08:15:38Z

eu_fact_force/exploration/data_collection/parsers/base.py

+                if not response.content.startswith(b"%PDF"):
+                    print(f"Content at {pdf_url} is not a valid PDF (possibly a paywall page).")
+                    continue
+                with open(output_path, "wb") as f:


If you have several files, they will all overwrite one another and only the last one will be available.

In this function, we only download the PDF from the first URL that is not a paywall page. Files can still overwrite each other across the different APIs, but the idea is that we only want one PDF per DOI, right?
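The "first valid URL wins, one PDF per DOI" behaviour described in the reply could be made explicit along these lines. This is only a sketch: `pdf_path_for_doi`, `ensure_single_pdf`, and the `download_one` callback are hypothetical names, and skipping the download when the file already exists is an assumption about the desired cross-API behaviour, not something the PR confirms.

```python
import re
from pathlib import Path
from typing import Callable, Iterable


def pdf_path_for_doi(doi: str, directory: Path) -> Path:
    """Map a DOI to one stable file name, so every API targets the same file."""
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", doi)
    return directory / f"{safe}.pdf"


def ensure_single_pdf(
    pdf_urls: Iterable[str],
    output_path: Path,
    download_one: Callable[[str, Path], bool],
) -> bool:
    """Return True once output_path holds a PDF; never write it twice."""
    if output_path.exists():
        return True  # an earlier API already produced the PDF for this DOI
    for pdf_url in pdf_urls:
        if download_one(pdf_url, output_path):
            return True  # stop here: later URLs cannot overwrite the file
    return False
```

Here `download_one` stands for whatever function performs a single download and returns `True` on success, so the overwrite concern is handled by the early returns rather than by the download code itself.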

eu_fact_force/exploration/data_collection/parsers/base.py

cgoudet · 2026-03-23T08:17:28Z

eu_fact_force/ingestion/data_collection/parsers/crossref.py

+            return []
+
+
+if __name__ == "__main__":


Same here, put this in a unit test.

eu_fact_force/ingestion/data_collection/main.py

cgoudet · 2026-03-23T08:21:12Z

eu_fact_force/ingestion/data_collection/README.md

+## Usage
+
+```bash
+python3 main.py --doi 10.1128/mbio.01735-25


This is a method for exploration, not for prod, which must use the API.

pyproject.toml

cgoudet

The unit tests are failing. This needs to be fixed before we can move to prod.

eu_fact_force/ingestion/data_collection/parsers/arxiv.py

cgoudet · 2026-03-26T11:21:18Z

eu_fact_force/ingestion/data_collection/parsers/arxiv.py

+ARXIV_DOI_PREFIX = "10.48550/arXiv."
+
+
+class ArxivMetadataParser(MetadataParser):


So here arXiv uses the default download from MetadataParser?

cgoudet · 2026-03-26T11:22:19Z

eu_fact_force/ingestion/data_collection/parsers/arxiv.py

+
+from .base import MetadataParser
+
+ARXIV_DOI_PREFIX = "10.48550/arXiv."


I am not sure I understand the role of this prefix.

Yes, we retrieve the PDF URL, and then it uses the parent class's default download.
As for the prefix: the API uses arXiv's internal ID. The DOI of an arXiv article is the prefix + the ID. So if the DOI has that form, we use the ID directly. But sometimes a journal publishes via arXiv and the DOI does not follow the convention, so we fall back to a search on the DOI field (which works less well).
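The prefix logic explained above can be illustrated with a small pure function. This is a sketch under assumptions: `arxiv_lookup` and its `("id", ...)` / `("doi", ...)` return convention are hypothetical, and the case-insensitive comparison is an addition based on DOIs being case-insensitive. Only `ARXIV_DOI_PREFIX` comes from the diff.

```python
ARXIV_DOI_PREFIX = "10.48550/arXiv."


def arxiv_lookup(doi: str) -> tuple[str, str]:
    """Decide how to query arXiv for a DOI.

    Returns ("id", arxiv_id) when the DOI follows the arXiv convention
    (prefix + internal ID), otherwise ("doi", doi) to signal that a less
    reliable search on the DOI field is needed.
    """
    if doi.lower().startswith(ARXIV_DOI_PREFIX.lower()):
        return "id", doi[len(ARXIV_DOI_PREFIX):]
    return "doi", doi
```

For example, `arxiv_lookup("10.48550/arXiv.1706.03762")` yields the internal ID `1706.03762`, while the DOI from the README (`10.1128/mbio.01735-25`) is passed through unchanged for a DOI-field search.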

ClaireHzl and others added 5 commits March 10, 2026 19:29

List the metadata for each API.

1307b04

Add first version of fetching from api to metadata dictionnary.

34777f8

Add of metadata.

995a30a

Fix new metadata error.

3eff7d3

Add the pdf downloading.

20e9f66

ClaireHzl linked an issue Mar 16, 2026 that may be closed by this pull request

Fetch metadata from apis. #22

Closed

ClaireHzl marked this pull request as draft March 16, 2026 13:10

ClaireHzl and others added 10 commits March 17, 2026 19:38

Add HAL class.

d674a33

Add pubmed api calls.

3af0303

Simplify parser.

0df0900

Merge branch '22-fetch-metadata-from-apis' of github.com:dataforgoodf…

f8bf7f4

…r/14_EUFactForce into 22-fetch-metadata-from-apis

Add pubmed and openalex parsers.

89d749c

Add document type and doi arg for pubmed parser.

2ac41f6

Add download of pdf.

6fd7fe6

Add main and group parsers.

8173e8c

Fix typo issues.

4545893

Merge branch 'main' into 22-fetch-metadata-from-apis

c17e38a

cgoudet requested changes Mar 23, 2026

View reviewed changes

ClaireHzl and others added 7 commits March 23, 2026 17:50

Move into prod.

cd2b0c4

Add api name attribute.

5f0a3c9

Add arxiv lib in dependencies.

7a00bae

Update testing doi for openalex, crossref and pubmed.

b94beed

Integration in services.

d30cb56

Update Readme.

e5ec68f

Merge branch 'main' into 22-fetch-metadata-from-apis

5c67634

cgoudet reviewed Mar 26, 2026

View reviewed changes

ClaireHzl and others added 3 commits March 26, 2026 14:24

Change for absolute path for import.

efbed4a

Delete api name in pdf.

308251e

Merge branch 'main' into 22-fetch-metadata-from-apis

54bdc99

cgoudet marked this pull request as ready for review March 29, 2026 09:31

cgoudet merged commit 5f496f3 into main Mar 29, 2026
1 check failed

cgoudet deleted the 22-fetch-metadata-from-apis branch March 29, 2026 09:51
