Small program to scrape model repository data from the Hugging Face Hub, including the full history of README.md files.
Kudos to @Fresh-P, who is responsible for the bulk of the initial implementation.
For an exploratory analysis of the data, see my website.
To run the program, consider the following information:

- Download statistics refer to downloads over the past 30 days.
- The program leverages the `requests` package to obtain the HTML pages. Consider providing cookies so that requests are made with your login information; that way, it is possible to scrape repositories that require access permission (which can be requested beforehand), because the cookies identify you as a user with granted permission rights. The cookies file should be stored in the main folder and named `cookies` (see the cookie-loading sketch after this list).
- If the field `commit_history` in the meta-file is empty, the repository likely requires permission rights.
- If the field `commit_history` in the meta-file contains a `4xx` status code, it is likely the result of a `requests` error.
- The first time you run `main.py`, it collects a list of all available model repositories. It will keep that exact same list unless you delete the `links.txt` file (see the list-collection sketch below).
- Every time you run `main.py`, it checks which repositories from `links.txt` have already been scraped (by cross-checking with the meta-file(s)) and only retains the repository links that have not yet been scraped. In addition, it retries scraping all links whose `commit_history` field in the meta-file contains an error code or is empty; that way, you can request permission to access certain repositories and retry scraping them later (see the filtering sketch below).
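
A minimal sketch of how the `cookies` file could be loaded and attached to `requests`. This assumes the file is a browser export in the Netscape/Mozilla format; the exact format the program expects is not specified here, and the gated-model URL is purely illustrative:

```python
from http.cookiejar import MozillaCookieJar

import requests

# Load cookies exported from the browser (assumed Netscape/Mozilla format,
# stored in the main folder under the name "cookies").
jar = MozillaCookieJar("cookies")
jar.load(ignore_discard=True, ignore_expires=True)

# Attach the cookies to a session so every request carries the login information.
session = requests.Session()
session.cookies.update(jar)

# Hypothetical gated repository; with valid cookies and granted access,
# this should return 200 instead of a 4xx status.
response = session.get("https://huggingface.co/some-org/some-gated-model")
print(response.status_code)
```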
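The README does not spell out how the repository list is built. One plausible way, shown here only as a sketch, is to page through the Hub's public `/api/models` endpoint (pagination is exposed via the HTTP `Link` header, which `requests` parses into `response.links`) and write the links to `links.txt`:

```python
import requests

# Collect all model ids from the Hugging Face Hub API, following
# the "next" links until the listing is exhausted.
url = "https://huggingface.co/api/models?limit=1000"
model_ids = []
while url:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    model_ids.extend(model["id"] for model in response.json())
    url = response.links.get("next", {}).get("url")

# Persist one repository link per line, mirroring the links.txt described above.
with open("links.txt", "w") as f:
    f.writelines(f"https://huggingface.co/{model_id}\n" for model_id in model_ids)
```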
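A sketch of the cross-checking and retry logic described in the last item, assuming the meta-file is a JSON-lines file named `meta.jsonl` with `link` and `commit_history` fields; both the file name and the field layout are assumptions for illustration:

```python
import json

# Read the full set of candidate links.
with open("links.txt") as f:
    all_links = {line.strip() for line in f if line.strip()}

# A link counts as done only if its commit_history is non-empty and
# does not hold a 4xx error code; empty or erroneous entries are retried.
done = set()
with open("meta.jsonl") as f:  # hypothetical meta-file name
    for line in f:
        record = json.loads(line)
        history = record.get("commit_history")
        if history and not str(history).startswith("4"):
            done.add(record["link"])

to_scrape = sorted(all_links - done)
print(f"{len(to_scrape)} repositories left to scrape")
```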