This code is currently under development. The goal is to bring together APIs and scrapers for working with publication websites and related material.
When I have some functional examples of this in action, I'll add them!
Roadmap:
- expose Pubmed (started)
- CrossRef
- Elsevier API
- other publishers?
- Mendeley Client
Currently, Pypub supports information retrieval from ScienceDirect, Springer, Wiley Online, and Nature Reviews Genetics. Taylor & Francis recently moved to a new article page format, so the corresponding scraper needs to be updated before it works again.
The easiest way to use this repo is the top-level function `get_paper_info`. It takes two optional keyword arguments, `doi` and `url`, and can be called as `paper_info = pypub.get_paper_info(doi='enter_doi_here', url='or_enter_url_here')`.
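As a rough illustration of how a dispatcher like this could resolve those two keyword arguments, here is a minimal self-contained sketch. The DOI-to-URL resolution via the dx.doi.org redirect service is an assumption for illustration, not necessarily how pypub actually implements it:

```python
def get_paper_info(url=None, doi=None):
    # Hypothetical sketch of the dispatch logic; pypub's real function
    # scrapes the resolved page and returns a PaperInfo object.
    if url is None and doi is None:
        raise ValueError("Either url or doi must be provided")
    if url is None:
        # DOIs can be resolved through the dx.doi.org redirect service
        url = 'https://dx.doi.org/' + doi
    return url  # the real function would go on to scrape this URL

print(get_paper_info(doi='10.1000/example'))
# the real call would instead be: pypub.get_paper_info(doi='10.1000/example')
```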
The `get_paper_info` function returns a `PaperInfo` object, the details of which can be found in paper_info.py. A `PaperInfo` object has three main attributes of interest: `entry`, `references`, and `pdf_link`. `entry` contains all descriptive information about the paper, such as title, authors, journal, year, etc. `references` is a list of the paper's cited references, and `pdf_link` is a string giving the direct link to the paper PDF, if it could be retrieved.
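The shape described above can be sketched roughly like this; the field types here are assumptions for illustration (the real class lives in paper_info.py and may differ):

```python
from dataclasses import dataclass, field

@dataclass
class PaperInfo:
    # Hypothetical sketch of the PaperInfo attributes described above.
    entry: dict = field(default_factory=dict)   # title, authors, journal, year, ...
    references: list = field(default_factory=list)
    pdf_link: str = None  # direct PDF URL, or None if not retrievable

info = PaperInfo(entry={'title': 'An Example Paper', 'year': '2015'})
print(info.entry['title'], info.pdf_link)
```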
Within the scrapers/base_objects.py file, there are several base classes that each publisher inherits from to return information. The `entry` attribute will be a `[Publisher]Entry` class, which inherits from `BaseEntry`. Similarly, the `references` attribute is a list of `[Publisher]Ref` class instances that inherit from `BaseRef`.
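The inheritance pattern might look something like the following sketch. The field names and the `WileyEntry` subclass are illustrative assumptions; see scrapers/base_objects.py for the real classes:

```python
class BaseEntry:
    # Fields shared by every publisher's entry (hypothetical sketch).
    def __init__(self):
        self.title = None
        self.authors = []

class WileyEntry(BaseEntry):
    # A publisher-specific subclass fills in the shared fields by
    # parsing that publisher's particular page layout.
    def __init__(self):
        super().__init__()
        self.publisher = 'Wiley'
```

This keeps per-publisher parsing quirks isolated in the subclass while downstream code only relies on the shared `BaseEntry` interface.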
Documentation standards: I'm trying to follow the NumPy docstring guide: https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt
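For reference, a function documented in that NumPy style looks like this (the function itself is a made-up example, not part of the codebase):

```python
def resolve_doi(doi):
    """
    Resolve a DOI to its redirect URL.

    Parameters
    ----------
    doi : str
        Digital Object Identifier, e.g. '10.1000/example'.

    Returns
    -------
    str
        URL that the dx.doi.org resolver would redirect to.
    """
    return 'https://dx.doi.org/' + doi
```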
Encapsulation! Encapsulation! Encapsulation! Ideally each module should have a well-defined purpose and should not operate on data that is not its own.
Within the tests folder, there are separate test modules for each of the scrapers. They are written for nosetests.
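Nose collects plain functions whose names start with `test_`, so a scraper test module can be as simple as the following sketch (`fake_scrape` is a stand-in for a real scraper call; an actual test would parse a saved or live article page):

```python
def fake_scrape(url):
    # Stand-in for a publisher scraper; returns entry-like data.
    return {'title': 'An Example Paper'}

def test_entry_has_title():
    # nosetests discovers and runs functions named test_*
    entry = fake_scrape('http://example.com/article')
    assert entry['title'] == 'An Example Paper'
```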