PyPI - paperscraper - Versions diffs - 0.2.16__tar.gz → 0.3.0__tar.gz - Mend


This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (66)

{paperscraper-0.2.16 → paperscraper-0.3.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
-Metadata-Version: 2.2
+Metadata-Version: 2.4
 Name: paperscraper
-Version: 0.2.16
+Version: 0.3.0
 Summary: paperscraper: Package to scrape papers.
 Home-page: https://github.com/jannisborn/paperscraper
 Author: Jannis Born, Matteo Manica
@@ -34,6 +34,7 @@ Requires-Dist: thefuzz
 Requires-Dist: pytest
 Requires-Dist: tldextract
 Requires-Dist: semanticscholar
+Requires-Dist: pydantic
 Dynamic: author
 Dynamic: author-email
 Dynamic: classifier
@@ -42,6 +43,7 @@ Dynamic: description-content-type
 Dynamic: home-page
 Dynamic: keywords
 Dynamic: license
+Dynamic: license-file
 Dynamic: requires-dist
 Dynamic: summary
@@ -56,12 +58,27 @@ MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.or
 [![codecov](https://codecov.io/github/jannisborn/paperscraper/branch/main/graph/badge.svg?token=Clwi0pu61a)](https://codecov.io/github/jannisborn/paperscraper)
 # paperscraper
-`paperscraper` is a `python` package for scraping publication metadata or full PDF files from
+`paperscraper` is a `python` package for scraping publication metadata or full text files (PDF or XML) from
 **PubMed** or preprint servers such as **arXiv**, **medRxiv**, **bioRxiv** and **chemRxiv**.
 It provides a streamlined interface to scrape metadata, allows to retrieve citation counts
 from Google Scholar, impact factors from journals and comes with simple postprocessing functions
 and plotting routines for meta-analysis.
+## Table of Contents
+1. [Getting Started](#getting-started)
+   - [Download X-rxiv Dumps](#download-x-rxiv-dumps)
+   - [Arxiv Local Dump](#arxiv-local-dump)
+2. [Examples](#examples)
+   - [Publication Keyword Search](#publication-keyword-search)
+   - [Full-Text Retrieval (PDFs & XMLs)](#full-text-retrieval-pdfs--xmls)
+   - [Citation Search](#citation-search)
+   - [Journal Impact Factor](#journal-impact-factor)
+3. [Plotting](#plotting)
+   - [Barplots](#barplots)
+   - [Venn Diagrams](#venn-diagrams)
+4. [Citation](#citation)
+5. [Contributions](#contributions)
 ## Getting started
@@ -90,6 +107,21 @@ medrxiv(start_date="2023-04-01", end_date="2023-04-08")
 ```
 But watch out. The resulting `.jsonl` file will be labelled according to the current date and all your subsequent searches will be based on this file **only**. If you use this option you might want to keep an eye on the source files (`paperscraper/server_dumps/*jsonl`) to ensure they contain the paper metadata for all papers you're interested in.
+#### Arxiv local dump
+If you prefer local search rather than using the arxiv API:
+```py
+from paperscraper.get_dumps import arxiv
+arxiv(start_date='2024-01-01', end_date=None) # scrapes all metadata from 2024 until today.
+```
+Afterwards you can search the local arxiv dump just like the other x-rxiv dumps.
+The direct endpoint is `paperscraper.arxiv.get_arxiv_papers_local`. You can also specify the
+backend directly in the `get_and_dump_arxiv_papers` function:
+```py
+from paperscraper.arxiv import get_and_dump_arxiv_papers
+get_and_dump_arxiv_papers(..., backend='local')
+```
 ## Examples
@@ -158,10 +190,15 @@ from paperscraper.scholar import get_and_dump_scholar_papers
 topic = 'Machine Learning'
 get_and_dump_scholar_papers(topic)
 ```
+*NOTE*: The scholar endpoint does not require authentication but since it regularly prompts with captchas, it's difficult to apply large scale.
+### Full-Text Retrieval (PDFs & XMLs)
-### Scrape PDFs
+`paperscraper` allows you to download full text of publications using DOIs. The basic functionality works reliably for preprint servers (arXiv, bioRxiv, medRxiv, chemRxiv), but retrieving papers from PubMed dumps is more challenging due to publisher restrictions and paywalls.
-`paperscraper` also allows you to download the PDF files.
+#### Standard Usage
+The main download functions work for all paper types with automatic fallbacks:
 ```py
 from paperscraper.pdf import save_pdf
@@ -169,31 +206,71 @@ paper_data = {'doi': "10.48550/arXiv.2207.03928"}
 save_pdf(paper_data, filepath='gt4sd_paper.pdf')
 ```
-If you want to batch download all PDFs for your previous metadata search, use the wrapper.
-Here we scrape the PDFs for the metadata obtained in the previous example.
+To batch download full texts from your metadata search results:
 ```py
 from paperscraper.pdf import save_pdf_from_dump
-# Save PDFs in current folder and name the files by their DOI
+# Save PDFs/XMLs in current folder and name the files by their DOI
 save_pdf_from_dump('medrxiv_covid_ai_imaging.jsonl', pdf_path='.', key_to_save='doi')
 ```
-*NOTE*: This works robustly for preprint servers, but if you use it on a PubMed dump, dont expect to obtain all PDFs.
-Many publishers detect and block scraping and many publications are simply behind paywalls.
+#### Automatic Fallback Mechanisms
+When the standard text retrieval fails, `paperscraper` automatically tries these fallbacks:
+- **BioC-PMC**: For biomedical papers in [PubMed Central](https://pmc.ncbi.nlm.nih.gov/) (open-access repository), it retrieves open-access full-text XML from the [BioC-PMC API](https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PMC/).
+- **eLife Papers**: For [eLife](https://elifesciences.org/) journal papers, it fetches XML files from eLife's open [GitHub repository](https://github.com/elifesciences/elife-article-xml).
+These fallbacks are tried automatically without requiring any additional configuration.
+#### Enhanced Retrieval with Publisher APIs
+For more comprehensive access to papers from major publishers, you can provide API keys for:
+- **Wiley TDM API**: Enables access to [Wiley](https://onlinelibrary.wiley.com/library-info/resources/text-and-datamining) publications (2,000+ journals).
+- **Elsevier TDM API**: Enables access to [Elsevier](https://www.elsevier.com/about/policies-and-standards/text-and-data-mining) publications (The Lancet, Cell, ...).
+To use publisher APIs:
+1. Create a file with your API keys:
+```
+WILEY_TDM_API_TOKEN=your_wiley_token_here
+ELSEVIER_TDM_API_KEY=your_elsevier_key_here
+```
+2. Pass the file path when calling retrieval functions:
+```py
+from paperscraper.pdf import save_pdf_from_dump
+save_pdf_from_dump(
+    'pubmed_query_results.jsonl',
+    pdf_path='./papers',
+    key_to_save='doi',
+    api_keys='path/to/your/api_keys.txt'
+)
+```
+For obtaining API keys:
+- Wiley TDM API: Visit [Wiley Text and Data Mining](https://onlinelibrary.wiley.com/library-info/resources/text-and-datamining) (free for academic users with institutional subscription)
+- Elsevier TDM API: Visit [Elsevier's Text and Data Mining](https://www.elsevier.com/about/policies-and-standards/text-and-data-mining) (free for academic users with institutional subscription)
+*NOTE*: While these fallback mechanisms improve retrieval success rates, they cannot guarantee access to all papers due to various access restrictions.
 ### Citation search
-A plus of the Scholar endpoint is that the number of citations of a paper can be fetched:
+You can fetch the number of citations of a paper from its title or DOI
 ```py
-from paperscraper.scholar import get_citations_from_title
+from paperscraper.citations import get_citations_from_title, get_citations_by_doi
 title = 'Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I.'
-get_citations_from_title(title)
-```
+print(get_citations_from_title(title))
-*NOTE*: The scholar endpoint does not require authentication but since it regularly
-prompts with captchas, it's difficult to apply large scale.
+doi = '10.1021/acs.jcim.3c00132'
+get_citations_by_doi(doi)
+```
 ### Journal impact factor
@@ -231,28 +308,13 @@ i.search("quantum information", threshold=90, return_all=True)
 # ]
 ```
-## Arxiv local dump
-If you prefer local search rather than using the arxiv API:
-```py
-from paperscraper.get_dumps import arxiv
-arxiv(start_date='2024-01-01', end_date=None) # scrapes all metadata from 2024 until today.
-```
-Afterwards you can search the local arxiv dump just like the other x-rxiv dumps.
-The direct endpoint is `paperscraper.arxiv.get_arxiv_papers_local`. You can also specify the
-backend directly in the `get_and_dump_arxiv_papers` function:
-```py
-from paperscraper.arxiv import get_and_dump_arxiv_papers
-get_and_dump_arxiv_papers(..., backend='local')
-```
-### Plotting
+## Plotting
 When multiple query searches are performed, two types of plots can be generated
 automatically: Venn diagrams and bar plots.
-#### Barplots
+### Barplots
 Compare the temporal evolution of different queries across different servers.
@@ -310,7 +372,7 @@ plot_comparison(
 ![molreps](https://github.com/jannisborn/paperscraper/blob/main/assets/molreps.png?raw=true "MolReps")
-#### Venn Diagrams
+### Venn Diagrams
 ```py
 from paperscraper.plotting import (
@@ -369,6 +431,7 @@ If you use `paperscraper`, please cite a paper that motivated our development of
 ## Contributions
 Thanks to the following contributors:
+- [@mathinic](https://github.com/mathinic): Since `v0.3.0` improved PubMed full text retrieval with additional fallback mechanisms (BioC-PMC, eLife and optional Wiley/Elsevier APIs).
 - [@memray](https://github.com/memray): Since `v0.2.12` there are automatic retries when downloading the {med/bio/chem}rxiv dumps.
 - [@achouhan93](https://github.com/achouhan93): Since `v0.2.5` {med/bio/chem}rxiv can be scraped for specific dates!
 - [@daenuprobst](https://github.com/daenuprobst): Since  `v0.2.4` PDF files can be scraped directly (`paperscraper.pdf.save_pdf`)
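
The fallback behaviour described in the README text above (standard retrieval first, then BioC-PMC, then eLife) can be pictured as an ordered chain of fetchers. The sketch below only illustrates that pattern and is not taken from the package: `fetch_with_fallbacks` and the three stub fetchers are hypothetical names, the order of the chain is an assumption, and no network access is performed.

```python
from typing import Callable, Optional, Sequence, Tuple

# A fetcher takes a DOI and returns the document bytes, or None if it cannot help.
Fetcher = Callable[[str], Optional[bytes]]


def fetch_with_fallbacks(
    doi: str, fetchers: Sequence[Tuple[str, Fetcher]]
) -> Tuple[Optional[str], Optional[bytes]]:
    """Try each (name, fetcher) pair in order; return the first non-empty result."""
    for name, fetcher in fetchers:
        try:
            content = fetcher(doi)
        except Exception:
            # A failing source should not abort the chain; move on to the next one.
            continue
        if content:
            return name, content
    return None, None


# Hypothetical stand-ins for real retrieval backends (no network calls).
def standard_download(doi: str) -> Optional[bytes]:
    return None  # pretend the publisher blocked the request


def bioc_pmc_stub(doi: str) -> Optional[bytes]:
    raise ConnectionError("pretend the API is unreachable")


def elife_stub(doi: str) -> Optional[bytes]:
    return b"<article/>" if doi.startswith("10.7554/") else None


CHAIN = [
    ("standard", standard_download),
    ("bioc-pmc", bioc_pmc_stub),
    ("elife", elife_stub),
]

source, payload = fetch_with_fallbacks("10.7554/eLife.00001", CHAIN)
print(source, payload)
```

Returning the name of the source that succeeded alongside the payload makes it easy to log which fallback was used for each paper.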

{paperscraper-0.2.16 → paperscraper-0.3.0}/README.md RENAMED

@@ -9,12 +9,27 @@ MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.or
 [![codecov](https://codecov.io/github/jannisborn/paperscraper/branch/main/graph/badge.svg?token=Clwi0pu61a)](https://codecov.io/github/jannisborn/paperscraper)
 # paperscraper
-`paperscraper` is a `python` package for scraping publication metadata or full PDF files from
+`paperscraper` is a `python` package for scraping publication metadata or full text files (PDF or XML) from
 **PubMed** or preprint servers such as **arXiv**, **medRxiv**, **bioRxiv** and **chemRxiv**.
 It provides a streamlined interface to scrape metadata, allows to retrieve citation counts
 from Google Scholar, impact factors from journals and comes with simple postprocessing functions
 and plotting routines for meta-analysis.
+## Table of Contents
+1. [Getting Started](#getting-started)
+   - [Download X-rxiv Dumps](#download-x-rxiv-dumps)
+   - [Arxiv Local Dump](#arxiv-local-dump)
+2. [Examples](#examples)
+   - [Publication Keyword Search](#publication-keyword-search)
+   - [Full-Text Retrieval (PDFs & XMLs)](#full-text-retrieval-pdfs--xmls)
+   - [Citation Search](#citation-search)
+   - [Journal Impact Factor](#journal-impact-factor)
+3. [Plotting](#plotting)
+   - [Barplots](#barplots)
+   - [Venn Diagrams](#venn-diagrams)
+4. [Citation](#citation)
+5. [Contributions](#contributions)
 ## Getting started
@@ -43,6 +58,21 @@ medrxiv(start_date="2023-04-01", end_date="2023-04-08")
 ```
 But watch out. The resulting `.jsonl` file will be labelled according to the current date and all your subsequent searches will be based on this file **only**. If you use this option you might want to keep an eye on the source files (`paperscraper/server_dumps/*jsonl`) to ensure they contain the paper metadata for all papers you're interested in.
+#### Arxiv local dump
+If you prefer local search rather than using the arxiv API:
+```py
+from paperscraper.get_dumps import arxiv
+arxiv(start_date='2024-01-01', end_date=None) # scrapes all metadata from 2024 until today.
+```
+Afterwards you can search the local arxiv dump just like the other x-rxiv dumps.
+The direct endpoint is `paperscraper.arxiv.get_arxiv_papers_local`. You can also specify the
+backend directly in the `get_and_dump_arxiv_papers` function:
+```py
+from paperscraper.arxiv import get_and_dump_arxiv_papers
+get_and_dump_arxiv_papers(..., backend='local')
+```
 ## Examples
@@ -111,10 +141,15 @@ from paperscraper.scholar import get_and_dump_scholar_papers
 topic = 'Machine Learning'
 get_and_dump_scholar_papers(topic)
 ```
+*NOTE*: The scholar endpoint does not require authentication but since it regularly prompts with captchas, it's difficult to apply large scale.
+### Full-Text Retrieval (PDFs & XMLs)
-### Scrape PDFs
+`paperscraper` allows you to download full text of publications using DOIs. The basic functionality works reliably for preprint servers (arXiv, bioRxiv, medRxiv, chemRxiv), but retrieving papers from PubMed dumps is more challenging due to publisher restrictions and paywalls.
-`paperscraper` also allows you to download the PDF files.
+#### Standard Usage
+The main download functions work for all paper types with automatic fallbacks:
 ```py
 from paperscraper.pdf import save_pdf
@@ -122,31 +157,71 @@ paper_data = {'doi': "10.48550/arXiv.2207.03928"}
 save_pdf(paper_data, filepath='gt4sd_paper.pdf')
 ```
-If you want to batch download all PDFs for your previous metadata search, use the wrapper.
-Here we scrape the PDFs for the metadata obtained in the previous example.
+To batch download full texts from your metadata search results:
 ```py
 from paperscraper.pdf import save_pdf_from_dump
-# Save PDFs in current folder and name the files by their DOI
+# Save PDFs/XMLs in current folder and name the files by their DOI
 save_pdf_from_dump('medrxiv_covid_ai_imaging.jsonl', pdf_path='.', key_to_save='doi')
 ```
-*NOTE*: This works robustly for preprint servers, but if you use it on a PubMed dump, dont expect to obtain all PDFs.
-Many publishers detect and block scraping and many publications are simply behind paywalls.
+#### Automatic Fallback Mechanisms
+When the standard text retrieval fails, `paperscraper` automatically tries these fallbacks:
+- **BioC-PMC**: For biomedical papers in [PubMed Central](https://pmc.ncbi.nlm.nih.gov/) (open-access repository), it retrieves open-access full-text XML from the [BioC-PMC API](https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PMC/).
+- **eLife Papers**: For [eLife](https://elifesciences.org/) journal papers, it fetches XML files from eLife's open [GitHub repository](https://github.com/elifesciences/elife-article-xml).
+These fallbacks are tried automatically without requiring any additional configuration.
+#### Enhanced Retrieval with Publisher APIs
+For more comprehensive access to papers from major publishers, you can provide API keys for:
+- **Wiley TDM API**: Enables access to [Wiley](https://onlinelibrary.wiley.com/library-info/resources/text-and-datamining) publications (2,000+ journals).
+- **Elsevier TDM API**: Enables access to [Elsevier](https://www.elsevier.com/about/policies-and-standards/text-and-data-mining) publications (The Lancet, Cell, ...).
+To use publisher APIs:
+1. Create a file with your API keys:
+```
+WILEY_TDM_API_TOKEN=your_wiley_token_here
+ELSEVIER_TDM_API_KEY=your_elsevier_key_here
+```
+2. Pass the file path when calling retrieval functions:
+```py
+from paperscraper.pdf import save_pdf_from_dump
+save_pdf_from_dump(
+    'pubmed_query_results.jsonl',
+    pdf_path='./papers',
+    key_to_save='doi',
+    api_keys='path/to/your/api_keys.txt'
+)
+```
+For obtaining API keys:
+- Wiley TDM API: Visit [Wiley Text and Data Mining](https://onlinelibrary.wiley.com/library-info/resources/text-and-datamining) (free for academic users with institutional subscription)
+- Elsevier TDM API: Visit [Elsevier's Text and Data Mining](https://www.elsevier.com/about/policies-and-standards/text-and-data-mining) (free for academic users with institutional subscription)
+*NOTE*: While these fallback mechanisms improve retrieval success rates, they cannot guarantee access to all papers due to various access restrictions.
 ### Citation search
-A plus of the Scholar endpoint is that the number of citations of a paper can be fetched:
+You can fetch the number of citations of a paper from its title or DOI
 ```py
-from paperscraper.scholar import get_citations_from_title
+from paperscraper.citations import get_citations_from_title, get_citations_by_doi
 title = 'Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I.'
-get_citations_from_title(title)
-```
+print(get_citations_from_title(title))
-*NOTE*: The scholar endpoint does not require authentication but since it regularly
-prompts with captchas, it's difficult to apply large scale.
+doi = '10.1021/acs.jcim.3c00132'
+get_citations_by_doi(doi)
+```
 ### Journal impact factor
@@ -184,28 +259,13 @@ i.search("quantum information", threshold=90, return_all=True)
 # ]
 ```
-## Arxiv local dump
-If you prefer local search rather than using the arxiv API:
-```py
-from paperscraper.get_dumps import arxiv
-arxiv(start_date='2024-01-01', end_date=None) # scrapes all metadata from 2024 until today.
-```
-Afterwards you can search the local arxiv dump just like the other x-rxiv dumps.
-The direct endpoint is `paperscraper.arxiv.get_arxiv_papers_local`. You can also specify the
-backend directly in the `get_and_dump_arxiv_papers` function:
-```py
-from paperscraper.arxiv import get_and_dump_arxiv_papers
-get_and_dump_arxiv_papers(..., backend='local')
-```
-### Plotting
+## Plotting
 When multiple query searches are performed, two types of plots can be generated
 automatically: Venn diagrams and bar plots.
-#### Barplots
+### Barplots
 Compare the temporal evolution of different queries across different servers.
@@ -263,7 +323,7 @@ plot_comparison(
 ![molreps](https://github.com/jannisborn/paperscraper/blob/main/assets/molreps.png?raw=true "MolReps")
-#### Venn Diagrams
+### Venn Diagrams
 ```py
 from paperscraper.plotting import (
@@ -322,6 +382,7 @@ If you use `paperscraper`, please cite a paper that motivated our development of
 ## Contributions
 Thanks to the following contributors:
+- [@mathinic](https://github.com/mathinic): Since `v0.3.0` improved PubMed full text retrieval with additional fallback mechanisms (BioC-PMC, eLife and optional Wiley/Elsevier APIs).
 - [@memray](https://github.com/memray): Since `v0.2.12` there are automatic retries when downloading the {med/bio/chem}rxiv dumps.
 - [@achouhan93](https://github.com/achouhan93): Since `v0.2.5` {med/bio/chem}rxiv can be scraped for specific dates!
 - [@daenuprobst](https://github.com/daenuprobst): Since  `v0.2.4` PDF files can be scraped directly (`paperscraper.pdf.save_pdf`)
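
The key file shown in the README hunk above uses one `NAME=value` pair per line. How `paperscraper` reads that file is not visible in this diff, so the following is only a sketch of one plausible parser: `load_api_keys` is a hypothetical helper, and the token values are the placeholders from the README, not real credentials.

```python
import os
import tempfile
from typing import Dict


def load_api_keys(path: str) -> Dict[str, str]:
    """Parse NAME=value lines; skip blanks and '#' comments (hypothetical helper)."""
    keys: Dict[str, str] = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            # Split on the first '=' only, so values may themselves contain '='.
            name, value = line.split("=", 1)
            keys[name.strip()] = value.strip()
    return keys


# Write a throwaway key file with placeholder values, then read it back.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("WILEY_TDM_API_TOKEN=your_wiley_token_here\n")
    tmp.write("ELSEVIER_TDM_API_KEY=your_elsevier_key_here\n")
    key_file = tmp.name

api_keys = load_api_keys(key_file)
os.remove(key_file)
print(sorted(api_keys))
```

Keeping the keys in a separate file that is passed by path, as the README suggests, keeps credentials out of scripts and version control.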

{paperscraper-0.2.16 → paperscraper-0.3.0}/paperscraper/__init__.py RENAMED

@@ -1,7 +1,7 @@
 """Initialize the module."""
 __name__ = "paperscraper"
-__version__ = "0.2.16"
+__version__ = "0.3.0"
 import logging
 import os

paperscraper-0.3.0/paperscraper/citations/__init__.py ADDED

@@ -0,0 +1,3 @@
+from .citations import get_citations_by_doi, get_citations_from_title
+from .core import SelfLinkClient
+from .self_references import self_references, self_references_paper

paperscraper-0.3.0/paperscraper/citations/citations.py ADDED

@@ -0,0 +1,63 @@
+import logging
+import sys
+from time import sleep
+from scholarly import scholarly
+from semanticscholar import SemanticScholar, SemanticScholarException
+logging.basicConfig(stream=sys.stdout, level=logging.INFO)
+logger = logging.getLogger(__name__)
+sch = SemanticScholar()
+def get_citations_by_doi(doi: str) -> int:
+    """
+    Get the number of citations of a paper according to semantic scholar.
+    Args:
+        doi: the DOI of the paper.
+    Returns:
+        The number of citations
+    """
+    try:
+        paper = sch.get_paper(doi)
+        citations = len(paper["citations"])
+    except SemanticScholarException.ObjectNotFoundException:
+        logger.warning(f"Could not find paper {doi}, assuming 0 citation.")
+        citations = 0
+    except ConnectionRefusedError as e:
+        logger.warning(f"Waiting for 10 sec since {doi} gave: {e}")
+        sleep(10)
+        citations = len(sch.get_paper(doi)["citations"])
+    finally:
+        return citations
+def get_citations_from_title(title: str) -> int:
+    """
+    Args:
+        title (str): Title of paper to be searched on Scholar.
+    Raises:
+        TypeError: If sth else than str is passed.
+    Returns:
+        int: Number of citations of paper.
+    """
+    if not isinstance(title, str):
+        raise TypeError(f"Pass str not {type(title)}")
+    # Search for exact match
+    title = '"' + title.strip() + '"'
+    matches = scholarly.search_pubs(title)
+    counts = list(map(lambda p: int(p["num_citations"]), matches))
+    if len(counts) == 0:
+        logger.warning(f"Found no match for {title}.")
+        return 0
+    if len(counts) > 1:
+        logger.warning(f"Found {len(counts)} matches for {title}, returning first one.")
+    return counts[0]
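
Review note on `get_citations_by_doi` in the file above: the `return` sits in a `finally` block. In Python, a `return` inside `finally` discards any exception that is still in flight, and if an exception type without a matching `except` clause is raised before `citations` is assigned, the `return` itself fails with `UnboundLocalError`. The sketch below reproduces only that control flow with a stub; `lookup`, `count_with_finally` and `count_explicit` are hypothetical names and nothing here calls Semantic Scholar.

```python
def lookup(doi: str) -> dict:
    """Stub standing in for a remote metadata lookup (hypothetical)."""
    if doi == "missing":
        raise KeyError(doi)
    if doi == "timeout":
        raise TimeoutError("no response")
    return {"citations": ["a", "b", "c"]}


def count_with_finally(doi: str) -> int:
    """Mirrors the control flow in the diff: the return sits in `finally`."""
    try:
        citations = len(lookup(doi)["citations"])
    except KeyError:
        citations = 0
    finally:
        # Runs on every path, including when an unhandled exception is in flight.
        return citations


def count_explicit(doi: str) -> int:
    """Alternative: return per branch so unexpected errors propagate unchanged."""
    try:
        return len(lookup(doi)["citations"])
    except KeyError:
        return 0


print(count_with_finally("ok"), count_with_finally("missing"))

try:
    count_with_finally("timeout")
except UnboundLocalError as err:
    # The TimeoutError was replaced because `citations` was never assigned.
    print("finally variant raised:", type(err).__name__)

try:
    count_explicit("timeout")
except TimeoutError as err:
    print("explicit variant raised:", type(err).__name__)
```

With per-branch returns, a caller sees the original error (here `TimeoutError`) instead of a misleading `UnboundLocalError`.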

paperscraper-0.3.0/paperscraper/citations/tests/test_citations.py ADDED

@@ -0,0 +1,19 @@
+import logging
+from paperscraper.citations import get_citations_by_doi
+logging.disable(logging.INFO)
+class TestCitations:
+    def test_citations(self):
+        num = get_citations_by_doi("10.1038/s42256-023-00639-z")
+        assert isinstance(num, int) and num > 50
+        # Try invalid DOI
+        num = get_citations_by_doi("10.1035348/s42256-023-00639-z")
+        assert isinstance(num, int) and num == 0
+num = get_citations_by_doi("10.1035348/s42256-023-00639-z")
+assert isinstance(num, int) and num == 0
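
The test above queries a live service, so its outcome depends on network access and on the citation count staying above 50. One way to exercise the same two cases offline is to patch the client with `unittest.mock`. This is a sketch under stated assumptions, not part of the package's test suite: `FakeClient`, `PaperNotFound` and `count_citations` are hypothetical stand-ins for the real client, its not-found exception and the function under test.

```python
from unittest import mock


class PaperNotFound(Exception):
    """Hypothetical error type for an unknown DOI."""


class FakeClient:
    """Hypothetical client whose get_paper would normally hit the network."""

    def get_paper(self, doi: str) -> dict:
        raise RuntimeError("network access is not available in this sketch")


client = FakeClient()


def count_citations(doi: str) -> int:
    """Return the citation count, or 0 when the paper is unknown."""
    try:
        return len(client.get_paper(doi)["citations"])
    except PaperNotFound:
        return 0


def test_known_doi() -> None:
    # Replace the network call with a canned response of 60 citations.
    fake_paper = {"citations": [object()] * 60}
    with mock.patch.object(client, "get_paper", return_value=fake_paper):
        assert count_citations("10.1038/s42256-023-00639-z") == 60


def test_invalid_doi() -> None:
    # Simulate the not-found error that an invalid DOI would trigger.
    with mock.patch.object(client, "get_paper", side_effect=PaperNotFound):
        assert count_citations("10.1035348/s42256-023-00639-z") == 0


test_known_doi()
test_invalid_doi()
print("both offline checks passed")
```

Patching at the client boundary keeps the assertions deterministic while still covering the found and not-found branches.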
