PyPI - paperscraper - Versions diffs - 0.3.1__tar.gz → 0.3.3__tar.gz - Mend

paperscraper-0.3.1.tar.gz → paperscraper-0.3.3.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (72)

{paperscraper-0.3.1 → paperscraper-0.3.3}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: paperscraper
-Version: 0.3.1
+Version: 0.3.3
 Summary: paperscraper: Package to scrape papers.
 Home-page: https://github.com/jannisborn/paperscraper
 Author: Jannis Born, Matteo Manica
@@ -52,12 +52,11 @@ Dynamic: summary
 [![build](https://github.com/jannisborn/paperscraper/actions/workflows/test_tip.yml/badge.svg?branch=main)](https://github.com/jannisborn/paperscraper/actions/workflows/test_tip.yml?query=branch%3Amain)
 [![build](https://github.com/jannisborn/paperscraper/actions/workflows/test_pypi.yml/badge.svg?branch=main)](https://github.com/jannisborn/paperscraper/actions/workflows/test_pypi.yml?query=branch%3Amain)
+[![build](https://github.com/jannisborn/paperscraper/actions/workflows/docs.yml/badge.svg?branch=main)](https://jannisborn.github.io/paperscraper/)
 [![License:
 MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 [![PyPI version](https://badge.fury.io/py/paperscraper.svg)](https://badge.fury.io/py/paperscraper)
 [![Downloads](https://static.pepy.tech/badge/paperscraper)](https://pepy.tech/project/paperscraper)
-[![Downloads](https://static.pepy.tech/badge/paperscraper/month)](https://pepy.tech/project/paperscraper)
-[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 [![codecov](https://codecov.io/github/jannisborn/paperscraper/branch/main/graph/badge.svg?token=Clwi0pu61a)](https://codecov.io/github/jannisborn/paperscraper)
 # paperscraper
@@ -67,6 +66,7 @@ It provides a streamlined interface to scrape metadata, allows to retrieve citat
 from Google Scholar, impact factors from journals and comes with simple postprocessing functions
 and plotting routines for meta-analysis.
 ## Table of Contents
 1. [Getting Started](#getting-started)
@@ -93,16 +93,16 @@ This is enough to query PubMed, arXiv or Google Scholar.
 #### Download X-rxiv Dumps
-However, to scrape publication data from the preprint servers [biorxiv](https://www.biorxiv.org), [medrxiv](https://www.medrxiv.org) and [chemrxiv](https://www.chemrxiv.org), the setup is different. The entire dump is downloaded and stored in the `server_dumps` folder in a `.jsonl` format (one paper per line).
+However, to scrape publication data from the preprint servers [biorxiv](https://www.biorxiv.org), [medrxiv](https://www.medrxiv.org) and [chemrxiv](https://www.chemrxiv.org), the setup is different. The entire history of papers is downloaded and stored in the `server_dumps` folder in a `.jsonl` format (one paper per line). This takes a while, as of November 2025:
 ```py
 from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
-medrxiv()  #  Takes ~30min and should result in ~35 MB file
-biorxiv()  # Takes ~1h and should result in ~350 MB file
-chemrxiv()  #  Takes ~45min and should result in ~20 MB file
+chemrxiv()  #  Takes 30min -> +30K papers (~50 MB file)
+medrxiv()  #  Takes <1h -> +90K papers (~200 MB file)
+biorxiv()  # Up to 6h -> +400K papers (~800 MB file)
 ```
 *NOTE*: Once the dumps are stored, please make sure to restart the python interpreter so that the changes take effect.
-*NOTE*: If you experience API connection issues (`ConnectionError`), since v0.2.12 there are automatic retries which you can even control and raise from the default of 10, as in `biorxiv(max_retries=20)`.
+*NOTE*: If you experience API connection issues, since v0.2.12 there are automatic retries which you can even control and raise from the default of 10, as in `biorxiv(max_retries=20)`.
 Since v0.2.5 `paperscraper` also allows to scrape {med/bio/chem}rxiv for specific dates.
 ```py
@@ -424,7 +424,7 @@ plot_multiple_venn(
 ## Citation
 If you use `paperscraper`, please cite a paper that motivated our development of this tool.
-```bib
+```bibtex
 @article{born2021trends,
   title={Trends in Deep Learning for Property-driven Drug Design},
   author={Born, Jannis and Manica, Matteo},
@@ -440,9 +440,15 @@ If you use `paperscraper`, please cite a paper that motivated our development of
 ## Contributions
 Thanks to the following contributors:
 - [@mathinic](https://github.com/mathinic): Since `v0.3.0` improved PubMed full text retrieval with additional fallback mechanisms (BioC-PMC, eLife and optional Wiley/Elsevier APIs).
 - [@memray](https://github.com/memray): Since `v0.2.12` there are automatic retries when downloading the {med/bio/chem}rxiv dumps.
 - [@achouhan93](https://github.com/achouhan93): Since `v0.2.5` {med/bio/chem}rxiv can be scraped for specific dates!
 - [@daenuprobst](https://github.com/daenuprobst): Since  `v0.2.4` PDF files can be scraped directly (`paperscraper.pdf.save_pdf`)
 - [@oppih](https://github.com/oppih): Since `v0.2.3` chemRxiv API also provides DOI and URL if available
-- [@lukasschwab](https://github.com/lukasschwab): Bumped `arxiv` dependency to >`1.4.2` in paperscraper `v0.1.0`.
+- [@lukasschwab](https://github.com/lukasschwab): Enabled support for `arxiv` >`1.4.2` in paperscraper `v0.1.0`.
 - [@juliusbierk](https://github.com/juliusbierk): Bugfixes

{paperscraper-0.3.1 → paperscraper-0.3.3}/README.md RENAMED

@@ -1,11 +1,10 @@
 [![build](https://github.com/jannisborn/paperscraper/actions/workflows/test_tip.yml/badge.svg?branch=main)](https://github.com/jannisborn/paperscraper/actions/workflows/test_tip.yml?query=branch%3Amain)
 [![build](https://github.com/jannisborn/paperscraper/actions/workflows/test_pypi.yml/badge.svg?branch=main)](https://github.com/jannisborn/paperscraper/actions/workflows/test_pypi.yml?query=branch%3Amain)
+[![build](https://github.com/jannisborn/paperscraper/actions/workflows/docs.yml/badge.svg?branch=main)](https://jannisborn.github.io/paperscraper/)
 [![License:
 MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 [![PyPI version](https://badge.fury.io/py/paperscraper.svg)](https://badge.fury.io/py/paperscraper)
 [![Downloads](https://static.pepy.tech/badge/paperscraper)](https://pepy.tech/project/paperscraper)
-[![Downloads](https://static.pepy.tech/badge/paperscraper/month)](https://pepy.tech/project/paperscraper)
-[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 [![codecov](https://codecov.io/github/jannisborn/paperscraper/branch/main/graph/badge.svg?token=Clwi0pu61a)](https://codecov.io/github/jannisborn/paperscraper)
 # paperscraper
@@ -15,6 +14,7 @@ It provides a streamlined interface to scrape metadata, allows to retrieve citat
 from Google Scholar, impact factors from journals and comes with simple postprocessing functions
 and plotting routines for meta-analysis.
 ## Table of Contents
 1. [Getting Started](#getting-started)
@@ -41,16 +41,16 @@ This is enough to query PubMed, arXiv or Google Scholar.
 #### Download X-rxiv Dumps
-However, to scrape publication data from the preprint servers [biorxiv](https://www.biorxiv.org), [medrxiv](https://www.medrxiv.org) and [chemrxiv](https://www.chemrxiv.org), the setup is different. The entire dump is downloaded and stored in the `server_dumps` folder in a `.jsonl` format (one paper per line).
+However, to scrape publication data from the preprint servers [biorxiv](https://www.biorxiv.org), [medrxiv](https://www.medrxiv.org) and [chemrxiv](https://www.chemrxiv.org), the setup is different. The entire history of papers is downloaded and stored in the `server_dumps` folder in a `.jsonl` format (one paper per line). This takes a while, as of November 2025:
 ```py
 from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
-medrxiv()  #  Takes ~30min and should result in ~35 MB file
-biorxiv()  # Takes ~1h and should result in ~350 MB file
-chemrxiv()  #  Takes ~45min and should result in ~20 MB file
+chemrxiv()  #  Takes 30min -> +30K papers (~50 MB file)
+medrxiv()  #  Takes <1h -> +90K papers (~200 MB file)
+biorxiv()  # Up to 6h -> +400K papers (~800 MB file)
 ```
 *NOTE*: Once the dumps are stored, please make sure to restart the python interpreter so that the changes take effect.
-*NOTE*: If you experience API connection issues (`ConnectionError`), since v0.2.12 there are automatic retries which you can even control and raise from the default of 10, as in `biorxiv(max_retries=20)`.
+*NOTE*: If you experience API connection issues, since v0.2.12 there are automatic retries which you can even control and raise from the default of 10, as in `biorxiv(max_retries=20)`.
 Since v0.2.5 `paperscraper` also allows to scrape {med/bio/chem}rxiv for specific dates.
 ```py
@@ -372,7 +372,7 @@ plot_multiple_venn(
 ## Citation
 If you use `paperscraper`, please cite a paper that motivated our development of this tool.
-```bib
+```bibtex
 @article{born2021trends,
   title={Trends in Deep Learning for Property-driven Drug Design},
   author={Born, Jannis and Manica, Matteo},
@@ -388,9 +388,15 @@ If you use `paperscraper`, please cite a paper that motivated our development of
 ## Contributions
 Thanks to the following contributors:
 - [@mathinic](https://github.com/mathinic): Since `v0.3.0` improved PubMed full text retrieval with additional fallback mechanisms (BioC-PMC, eLife and optional Wiley/Elsevier APIs).
 - [@memray](https://github.com/memray): Since `v0.2.12` there are automatic retries when downloading the {med/bio/chem}rxiv dumps.
 - [@achouhan93](https://github.com/achouhan93): Since `v0.2.5` {med/bio/chem}rxiv can be scraped for specific dates!
 - [@daenuprobst](https://github.com/daenuprobst): Since  `v0.2.4` PDF files can be scraped directly (`paperscraper.pdf.save_pdf`)
 - [@oppih](https://github.com/oppih): Since `v0.2.3` chemRxiv API also provides DOI and URL if available
-- [@lukasschwab](https://github.com/lukasschwab): Bumped `arxiv` dependency to >`1.4.2` in paperscraper `v0.1.0`.
+- [@lukasschwab](https://github.com/lukasschwab): Enabled support for `arxiv` >`1.4.2` in paperscraper `v0.1.0`.
 - [@juliusbierk](https://github.com/juliusbierk): Bugfixes
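
Because each dump is stored in `.jsonl` format with one paper per line, a dump can be inspected without paperscraper itself. The following is a minimal sketch, assuming every non-empty line is a standalone JSON object; the helper name `read_jsonl` is hypothetical and not part of the package.

```python
import json
from pathlib import Path
from typing import Dict, Iterator


def read_jsonl(path: str) -> Iterator[Dict]:
    """Yield one record per non-empty line of a .jsonl file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline
                yield json.loads(line)
```

Reading lazily keeps memory use flat, which may matter for the larger dumps mentioned in the README; `sum(1 for _ in read_jsonl(path))` would count the papers in a dump at a path of your choosing.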

{paperscraper-0.3.1 → paperscraper-0.3.3}/paperscraper/__init__.py RENAMED

@@ -1,7 +1,7 @@
 """Initialize the module."""
 __name__ = "paperscraper"
-__version__ = "0.3.1"
+__version__ = "0.3.3"
 import logging
 import os

{paperscraper-0.3.1 → paperscraper-0.3.3}/paperscraper/arxiv/arxiv.py RENAMED

@@ -94,7 +94,7 @@ def get_arxiv_papers_api(
     fields as desired.
     Args:
-        query Query to arxiv API. Needs to match the arxiv API notation.
+        query: Query to arxiv API. Needs to match the arxiv API notation.
         fields: List of strings with fields to keep in output.
         max_results: Maximal number of results, defaults to 99999.
         client_options: Optional arguments for `arxiv.Client`. E.g.:
@@ -144,7 +144,7 @@ def get_and_dump_arxiv_papers(
         keywords: List of keywords for arxiv search.
             The outer list level will be considered as AND separated keys, the
             inner level as OR separated.
-        filepath: Path where the dump will be saved.
+        output_filepath: Path where the dump will be saved.
         fields: List of strings with fields to keep in output.
             Defaults to ['title', 'authors', 'date', 'abstract',
             'journal', 'doi'].

{paperscraper-0.3.1 → paperscraper-0.3.3}/paperscraper/citations/tests/test_self_citations.py RENAMED

@@ -62,8 +62,7 @@ class TestSelfCitations:
             f"Synchronous execution time (independent calls): {sync_duration:.2f} seconds"
         )
-        # Assert that async execution (batch) is faster or at least not slower
-        assert 0.9 * async_duration <= sync_duration, (
+        assert 0.1 * async_duration <= sync_duration, (
             f"Async execution ({async_duration:.2f}s) is slower than sync execution "
             f"({sync_duration:.2f}s)"
         )
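
The relaxed assertion passes whenever `0.1 * async_duration <= sync_duration`, that is, whenever the batch run takes at most ten times as long as the independent calls. A small sketch of that comparison with made-up timings; the helper `within_tolerance` is hypothetical and not part of the test suite.

```python
def within_tolerance(
    async_duration: float, sync_duration: float, factor: float = 0.1
) -> bool:
    """Mirror the comparison used in the assertion above."""
    return factor * async_duration <= sync_duration


# A batch run of 12 s against 2 s of independent calls:
print(within_tolerance(12.0, 2.0))              # new factor 0.1
print(within_tolerance(12.0, 2.0, factor=0.9))  # old factor 0.9
```

With these illustrative numbers the new factor accepts the run while the old one would have rejected it.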

{paperscraper-0.3.1 → paperscraper-0.3.3}/paperscraper/citations/tests/test_self_references.py RENAMED

@@ -14,7 +14,6 @@ class TestSelfReferences:
     @pytest.fixture
     def dois(self):
         return [
-            "10.1038/s43586-024-00334-2",
             "10.1038/s41586-023-06600-9",
             "10.1016/j.neunet.2014.09.003",
         ]

paperscraper-0.3.3/paperscraper/get_dumps/utils/chemrxiv/chemrxiv_api.py ADDED

@@ -0,0 +1,216 @@
+import logging
+import os
+import sys
+from datetime import datetime
+from time import sleep
+from typing import Dict, Optional
+from urllib.parse import urljoin
+import requests
+from requests.exceptions import (
+    ChunkedEncodingError,
+    ConnectionError,
+    ContentDecodingError,
+    JSONDecodeError,
+    ReadTimeout,
+)
+from urllib3.exceptions import DecodeError
+logging.basicConfig(stream=sys.stdout, level=logging.INFO)
+logger = logging.getLogger(__name__)
+now_datetime = datetime.now()
+launch_dates = {"chemrxiv": "2017-01-01"}
+class ChemrxivAPI:
+    """Handle OpenEngage API requests, using access.
+    Adapted from https://github.com/fxcoudert/tools/blob/master/chemRxiv/chemRxiv.py.
+    """
+    base = "https://chemrxiv.org/engage/chemrxiv/public-api/v1/"
+    def __init__(
+        self,
+        start_date: Optional[str] = None,
+        end_date: Optional[str] = None,
+        page_size: Optional[int] = None,
+        max_retries: int = 10,
+    ):
+        """
+        Initialize API class.
+        Args:
+            start_date (Optional[str], optional): begin date expressed as YYYY-MM-DD.
+                Defaults to None.
+            end_date (Optional[str], optional): end date expressed as YYYY-MM-DD.
+                Defaults to None.
+            page_size (int, optional): The batch size used to fetch the records from chemrxiv.
+            max_retries (int): Number of retries in case of error
+        """
+        self.page_size = page_size or 50
+        self.max_retries = max_retries
+        # Begin Date and End Date of the search
+        launch_date = launch_dates["chemrxiv"]
+        launch_datetime = datetime.fromisoformat(launch_date)
+        if start_date:
+            start_datetime = datetime.fromisoformat(start_date)
+            if start_datetime < launch_datetime:
+                self.start_date = launch_date
+                logger.warning(
+                    f"Begin date {start_date} is before chemrxiv launch date. Will use {launch_date} instead."
+                )
+            else:
+                self.start_date = start_date
+        else:
+            self.start_date = launch_date
+        if end_date:
+            end_datetime = datetime.fromisoformat(end_date)
+            if end_datetime > now_datetime:
+                logger.warning(
+                    f"End date {end_date} is in the future. Will use {now_datetime} instead."
+                )
+                self.end_date = now_datetime.strftime("%Y-%m-%d")
+            else:
+                self.end_date = end_date
+        else:
+            self.end_date = now_datetime.strftime("%Y-%m-%d")
+    def request(self, url, method, params=None, parse_json: bool = False):
+        """Send an API request to open Engage."""
+        headers = {"Accept-Encoding": "identity", "Accept": "application/json"}
+        retryable = (
+            ChunkedEncodingError,
+            ContentDecodingError,
+            DecodeError,
+            ReadTimeout,
+            ConnectionError,
+        )
+        transient_status = {429, 500, 502, 503, 504}
+        backoff = 0.1
+        for attempt in range(self.max_retries):
+            try:
+                if method.casefold() == "get":
+                    response = requests.get(
+                        url, params=params, headers=headers, timeout=(5, 30)
+                    )
+                elif method.casefold() == "post":
+                    response = requests.post(
+                        url, json=params, headers=headers, timeout=(5, 30)
+                    )
+                else:
+                    raise ConnectionError(f"Unknown method for query: {method}")
+                if response.status_code in transient_status:
+                    logger.warning(
+                        f"{response.status_code} for {url} (attempt {attempt + 1}/{self.max_retries}); retrying in {backoff:.1f}s"
+                    )
+                    if attempt + 1 == self.max_retries:
+                        response.raise_for_status()
+                    sleep(backoff)
+                    backoff = min(60.0, backoff * 2)
+                    continue
+                elif 400 <= response.status_code < 500:
+                    response.raise_for_status()
+                if not parse_json:
+                    return response
+                try:
+                    return response.json()
+                except JSONDecodeError:
+                    logger.warning(
+                        f"JSONDecodeError for {response.url} "
+                        f"(attempt {attempt + 1}/{self.max_retries}); retrying in {backoff:.1f}s"
+                    )
+                    if attempt + 1 == self.max_retries:
+                        raise
+                    sleep(backoff)
+                    backoff = min(60.0, backoff * 2)
+                    continue
+            except retryable as e:
+                logger.warning(
+                    f"{e.__class__.__name__} for {url} (attempt {attempt + 1}/{self.max_retries}); "
+                    f"retrying in {backoff:.1f}s"
+                )
+                if attempt + 1 == self.max_retries:
+                    raise
+                sleep(backoff)
+                backoff = min(60.0, backoff * 2)
+    def query(self, query, method="get", params=None):
+        """Perform a direct query."""
+        return self.request(
+            urljoin(self.base, query), method, params=params, parse_json=True
+        )
+    def query_generator(
+        self, query, method: str = "get", params: Optional[Dict] = None
+    ):
+        """Query for a list of items, with paging. Returns a generator."""
+        start_datetime = datetime.fromisoformat(self.start_date)
+        end_datetime = datetime.fromisoformat(self.end_date)
+        def year_windows():
+            year = start_datetime.year
+            while year <= end_datetime.year:
+                year_start = datetime(year, 1, 1)
+                year_end = datetime(year, 12, 31)
+                win_start = max(start_datetime, year_start)
+                win_end = min(end_datetime, year_end)
+                yield win_start.strftime("%Y-%m-%d"), win_end.strftime("%Y-%m-%d")
+                year += 1
+        params = (params or {}).copy()
+        for year_from, year_to in year_windows():
+            logger.info(f"Starting to scrape data from {year_from} to {year_to}")
+            page = 0
+            while True:
+                params.update(
+                    {
+                        "limit": self.page_size,
+                        "skip": page * self.page_size,
+                        "searchDateFrom": year_from,
+                        "searchDateTo": year_to,
+                    }
+                )
+                try:
+                    data = self.request(
+                        urljoin(self.base, query),
+                        method,
+                        params=params,
+                        parse_json=True,
+                    )
+                except requests.HTTPError as e:
+                    status = getattr(e.response, "status_code", None)
+                    logger.warning(
+                        f"Stopping year window {year_from}..{year_to} at skip={page * self.page_size} "
+                        f"due to HTTPError {status}"
+                    )
+                    break
+                items = data.get("itemHits", [])
+                if not items:
+                    break
+                for item in items:
+                    yield item
+                page += 1
+    def all_preprints(self):
+        """Return a generator to all the chemRxiv articles."""
+        return self.query_generator("items")
+    def preprint(self, article_id):
+        """Information on a given preprint.
+        .. seealso:: https://docs.figshare.com/#public_article
+        """
+        return self.query(os.path.join("items", article_id))
+    def number_of_preprints(self):
+        return self.query("items")["totalCount"]
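
The retry loop in `ChemrxivAPI.request` starts with a delay of 0.1 s, doubles it after every failed attempt and caps it at 60 s; the final attempt re-raises instead of sleeping. The sketch below isolates only that schedule so it can be reasoned about without network access. The function name `backoff_schedule` and its parameters are hypothetical, chosen here for illustration.

```python
from typing import List


def backoff_schedule(
    max_retries: int, initial: float = 0.1, cap: float = 60.0
) -> List[float]:
    """Return the sleep durations between attempts.

    The last attempt raises instead of sleeping, so there are
    max_retries - 1 delays.
    """
    delays = []
    backoff = initial
    for _ in range(max(0, max_retries - 1)):
        delays.append(backoff)
        backoff = min(cap, backoff * 2)  # exponential growth, capped
    return delays
```

With the default `max_retries=10` this yields nine delays from 0.1 s up to 25.6 s, roughly 51 s of waiting in total, so the 60 s cap only takes effect for larger retry counts.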

{paperscraper-0.3.1 → paperscraper-0.3.3}/paperscraper/get_dumps/utils/chemrxiv/utils.py RENAMED

@@ -7,9 +7,15 @@ import sys
 from datetime import datetime
 from typing import Dict, List, Optional
-from requests.exceptions import SSLError
+from requests.exceptions import (
+    ChunkedEncodingError,
+    ContentDecodingError,
+    JSONDecodeError,
+    SSLError,
+)
 from requests.models import HTTPError
 from tqdm import tqdm
+from urllib3.exceptions import DecodeError
 from .chemrxiv_api import ChemrxivAPI
@@ -49,7 +55,7 @@ def get_date(datestring: str) -> str:
     """Get the date of a chemrxiv dump enry.
     Args:
-        date (str): String in the format: 2021-10-15T05:12:32.356Z
+        datestring: String in the format: 2021-10-15T05:12:32.356Z
     Returns:
         str: Date in the format: YYYY-MM-DD.
@@ -84,7 +90,7 @@ def parse_dump(source_path: str, target_path: str) -> None:
     NOTE: This is a lazy parser trying to store all data in memory.
     Args:
-        path (str): Path to the source dump
+        source_path: Path to the source dump
     """
     dump = []
@@ -127,20 +133,21 @@ def parse_dump(source_path: str, target_path: str) -> None:
 def download_full(save_dir: str, api: Optional[ChemrxivAPI] = None) -> None:
     if api is None:
         api = ChemrxivAPI()
     os.makedirs(save_dir, exist_ok=True)
     for preprint in tqdm(api.all_preprints()):
-        path = os.path.join(save_dir, f"{preprint['item']['id']}.json")
+        item = preprint["item"]
+        path = os.path.join(save_dir, f"{item['id']}.json")
         if os.path.exists(path):
             continue
-        preprint = preprint["item"]
-        preprint_id = preprint["id"]
-        try:
-            preprint = api.preprint(preprint_id)
-        except HTTPError:
-            logger.warning(f"HTTP API Client error for ID: {preprint_id}")
-        except SSLError:
-            logger.warning(f"SSLError for ID: {preprint_id}")
+        if not item.get("title") or "authors" not in item:
+            try:
+                item = api.preprint(item["id"])
+            except Exception as e:
+                logger.warning(
+                    f"Enrich failed for {item['id']}: {e}; writing listing payload"
+                )
         with open(path, "w") as file:
-            json.dump(preprint, file, indent=2)
+            json.dump(item, file, indent=2)
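
The reworked `download_full` only calls the per-item endpoint when the listing payload lacks a title or an `authors` key, and it falls back to the listing payload if that call fails. A minimal sketch of this decision, decoupled from the API class; `resolve_item` and `fetch_detail` are hypothetical names, with `fetch_detail` standing in for `api.preprint`.

```python
from typing import Callable, Dict


def resolve_item(item: Dict, fetch_detail: Callable[[str], Dict]) -> Dict:
    """Return the listing item, enriched via fetch_detail only if incomplete."""
    if not item.get("title") or "authors" not in item:
        try:
            return fetch_detail(item["id"])
        except Exception:
            # Enrichment failed: keep the (incomplete) listing payload
            return item
    return item
```

Compared with 0.3.1, which requested every preprint individually, this design appears intended to save one request per complete listing entry.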

{paperscraper-0.3.1 → paperscraper-0.3.3}/paperscraper/pdf/fallbacks.py RENAMED

@@ -14,6 +14,7 @@ from typing import Any, Callable, Dict, Union
 import boto3
 import requests
+from botocore.client import BaseClient
 from lxml import etree
 from tqdm import tqdm
@@ -323,7 +324,7 @@ def month_folder(doi: str) -> str:
     return date.strftime("%B_%Y")
-def list_meca_keys(s3_client, bucket: str, prefix: str) -> list:
+def list_meca_keys(s3_client: BaseClient, bucket: str, prefix: str) -> list:
     """
     List all .meca object keys under a given prefix in a requester-pays bucket.
@@ -346,7 +347,9 @@ def list_meca_keys(s3_client, bucket: str, prefix: str) -> list:
     return keys
-def find_meca_for_doi(s3_client, bucket: str, key: str, doi_token: str) -> bool:
+def find_meca_for_doi(
+    s3_client: BaseClient, bucket: str, key: str, doi_token: str
+) -> bool:
     """
     Efficiently inspect manifest.xml within a .meca zip by fetching only necessary bytes.
     Parse via ZipFile to read manifest.xml and match DOI token.
@@ -375,7 +378,7 @@ def find_meca_for_doi(s3_client, bucket: str, key: str, doi_token: str) -> bool:
         manifest = z.read("manifest.xml")
     # Extract the last part of the DOI (newer DOIs that contain date fail otherwise)
-    doi_token = doi_token.split('.')[-1]
+    doi_token = doi_token.split(".")[-1]
     return doi_token.encode("utf-8") in manifest.lower()
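
The `split(".")[-1]` step keeps only the text after the last dot, which is what lets date-style tokens match the manifest. A short sketch with made-up token values, assuming the token is the part of the DOI after the registrant prefix; `doi_suffix` is a hypothetical helper name.

```python
def doi_suffix(doi_token: str) -> str:
    """Keep only the part after the last dot."""
    return doi_token.split(".")[-1]


print(doi_suffix("2024.01.15.575000"))  # → 575000
print(doi_suffix("123456"))             # → 123456 (no dot, unchanged)
```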

{paperscraper-0.3.1 → paperscraper-0.3.3}/paperscraper/pdf/pdf.py RENAMED

@@ -135,7 +135,7 @@ def save_pdf(
             logger.info(
                 "DOI contains eLife, attempting fallback to eLife XML repository on GitHub."
             )
-            if not FALLBACKS["elive"](doi, output_path):
+            if not FALLBACKS["elife"](doi, output_path):
                 logger.warning(
                     f"eLife XML fallback failed for {paper_metadata['doi']}."
                 )

{paperscraper-0.3.1 → paperscraper-0.3.3}/paperscraper/plotting.py RENAMED

@@ -1,7 +1,6 @@
 import logging
 import math
-import os
-from typing import Iterable, List
+from typing import Iterable, List, Optional
 import matplotlib.pyplot as plt
 import numpy as np
@@ -19,13 +18,13 @@ def plot_comparison(
     x_ticks: List[str] = ["2015", "2016", "2017", "2018", "2019", "2020"],
     show_preprint: bool = False,
     title_text: str = "",
-    keyword_text=None,
+    keyword_text: Optional[List[str]] = None,
     figpath: str = "comparison_plot.pdf",
 ) -> None:
     """Plot temporal evolution of number of papers per keyword
     Args:
-        data_dict (dict): A dictionary with keywords as keys. Each value should be a
+        data_dict: A dictionary with keywords as keys. Each value should be a
             dictionary itself, with keys for the different APIs. For example
             data_dict = {
                 'covid_19.jsonl': {
@@ -39,18 +38,15 @@ def plot_comparison(
                     ...
                 }
             }
-        keys (List[str]): List of keys which should be plotted. This has to be a
-           subset of data_dict.keys().
-        x_ticks (List[str]): List of strings to be used for the x-ticks. Should have
-            same length as data_dict[key][database]. Defaults to ['2015', '2016',
-            '2017', '2018', '2019', '2020'], meaning that papers are aggregated per
-            year.
-        show_preprint (bool, optional): Whether preprint servers are aggregated or not.
+        keys: List of keys which should be plotted. This has to be a subset of data_dict.keys().
+        x_ticks: List of strings to be used for the x-ticks. Should have same length as
+            data_dict[key][database]. Defaults to ['2015', '2016', '2017', '2018', '2019', '2020'],
+            meaning that papers are aggregated per year.
+        show_preprint: Whether preprint servers are aggregated or not.
             Defaults to False.
-        title_text (str, optional): Title for the produced figure. Defaults to ''.
-        keyword_text ([type], optional): Figure caption per keyword. Defaults to None,
-            i.e. empty strings will be used.
-        figpath (str, optional): Name under which figure is saved. Relative or absolute
+        title_text: Title for the produced figure. Defaults to ''.
+        keyword_text: Figure caption per keyword. Defaults to None, i.e. empty strings will be used.
+        figpath: Name under which figure is saved. Relative or absolute
             paths can be given. Defaults to 'comparison_plot.pdf'.
     Raises:
@@ -184,12 +180,12 @@ def plot_single(
     show_preprint: bool = False,
     title_text: str = "",
     figpath: str = "comparison_plot.pdf",
-    logscale=False,
+    logscale: bool = False,
 ) -> None:
     """Plot temporal evolution of number of papers per keyword
     Args:
-        data_dict (dict): A dictionary with keywords as keys. Each value should be a
+        data_dict: A dictionary with keywords as keys. Each value should be a
             dictionary itself, with keys for the different APIs. For example
             data_dict = {
                 'covid_19.jsonl': {
@@ -203,19 +199,17 @@ def plot_single(
                     ...
                 }
             }
-        keys (str): A key which should be plotted. This has to be a
-           subset of data_dict.keys().
+        keys: A key which should be plotted. This has to be a subset of data_dict.keys().
         x_ticks (List[str]): List of strings to be used for the x-ticks. Should have
             same length as data_dict[key][database]. Defaults to ['2015', '2016',
             '2017', '2018', '2019', '2020'], meaning that papers are aggregated per
             year.
-        show_preprint (bool, optional): Whether preprint servers are aggregated or not.
+        show_preprint: Whether preprint servers are aggregated or not.
             Defaults to False.
-        title_text (str, optional): Title for the produced figure. Defaults to ''.
+        title_text: Title for the produced figure. Defaults to ''.
         figpath (str, optional): Name under which figure is saved. Relative or absolute
             paths can be given. Defaults to 'comparison_plot.pdf'.
-        logscale (bool, optional): Whether y-axis is plotted on logscale. Defaults
-            to False.
+        logscale: Whether y-axis is plotted on logscale. Defaults to False.
     Raises:
         KeyError: If a database is missing in data_dict.

{paperscraper-0.3.1 → paperscraper-0.3.3}/paperscraper/postprocessing.py RENAMED

@@ -37,7 +37,7 @@ def aggregate_paper(
             title or abstract. Only applies if filtering is True.
         return_filtered (bool, optional): Whether the filtered matches are also
             returned. Only applies if filtering is True. Defaults to False.
-        filer_abstract (bool, optional): Whether the keyword is searched in the abstract
+        filter_abstract (bool, optional): Whether the keyword is searched in the abstract
             or not. Defaults to True.
         last_year (int, optional): Most recent year for the aggregation. Defaults
             to current year. All newer entries are discarded.
@@ -112,8 +112,7 @@ def aggregate_paper(
         if len(date.split("-")) < 2:
             logger.warning(
-                f"Paper without month {date}, randomly assigned month."
-                f"{paper['title']}"
+                f"Paper without month {date}, randomly assigned month.{paper['title']}"
             )
             month = np.random.choice(12)
         else:

{paperscraper-0.3.1 → paperscraper-0.3.3}/paperscraper/pubmed/pubmed.py RENAMED

@@ -42,15 +42,15 @@ def get_pubmed_papers(
     fields as desired.
     Args:
-        query (str): Query to PubMed API. Needs to match PubMed API notation.
-        fields (list[str]): List of strings with fields to keep in output.
+        query: Query to PubMed API. Needs to match PubMed API notation.
+        fields: List of strings with fields to keep in output.
             NOTE: If 'emails' is passed, an attempt is made to extract author mail
             addresses.
-        max_results (int): Maximal number of results retrieved from DB. Defaults
+        max_results: Maximal number of results retrieved from DB. Defaults
             to 9998, higher values likely raise problems due to PubMedAPI, see:
             https://stackoverflow.com/questions/75353091/biopython-entrez-article-limit
-        NOTE: *args, **kwargs are additional arguments for pubmed.query
+        args: additional arguments for pubmed.query
+        kwargs: additional arguments for pubmed.query
     Returns:
         pd.DataFrame. One paper per row.
@@ -100,19 +100,19 @@ def get_and_dump_pubmed_papers(
     Combines get_pubmed_papers and dump_papers.
     Args:
-        keywords (List[Union[str, List[str]]]): List of keywords to request
-            pubmed API. The outer list level will be considered as AND
-            separated keys, the inner level as OR separated.
-        filepath (str): Path where the dump will be saved.
-        fields (List, optional): List of strings with fields to keep in output.
+        keywords: List of keywords to request pubmed API.
+            The outer list level will be considered as AND separated keys.
+            The inner level as OR separated.
+        output_filepath: Path where the dump will be saved.
+        fields: List of strings with fields to keep in output.
             Defaults to ['title', 'authors', 'date', 'abstract',
             'journal', 'doi'].
             NOTE: If 'emails' is passed, an attempt is made to extract author mail
             addresses.
-        start_date (str): Start date for the search. Needs to be in format:
+        start_date: Start date for the search. Needs to be in format:
             YYYY/MM/DD, e.g. '2020/07/20'. Defaults to 'None', i.e. no specific
             dates are used.
-        end_date (str): End date for the search. Same notation as start_date.
+        end_date: End date for the search. Same notation as start_date.
     """
     # Translate keywords into query.
     query = get_query_from_keywords_and_date(
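The `keywords` docstring in this hunk describes a nested list whose outer level is AND-separated and whose inner level is OR-separated. A minimal sketch of that convention follows. `build_boolean_query` is a hypothetical helper written for illustration, not paperscraper's `get_query_from_keywords_and_date`, and the exact query string the library emits may differ.

```python
from typing import List, Union


def build_boolean_query(keywords: List[Union[str, List[str]]]) -> str:
    """Join the outer list with AND and each inner list with OR (illustrative only)."""
    groups = []
    for entry in keywords:
        # A bare string is treated as a group with a single term.
        terms = [entry] if isinstance(entry, str) else list(entry)
        groups.append("(" + " OR ".join(terms) + ")")
    return " AND ".join(groups)


covid = ["COVID-19", "SARS-CoV-2"]
ai = ["Artificial intelligence", "Deep learning"]
example_query = build_boolean_query([covid, ai, "imaging"])
print(example_query)
```

Accepting either a bare string or a list at the outer level mirrors the `List[Union[str, List[str]]]` type that the 0.3.1 docstring stated for `keywords`.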

{paperscraper-0.3.1 → paperscraper-0.3.3}/paperscraper/scholar/scholar.py RENAMED

@@ -28,12 +28,12 @@ def get_scholar_papers(
     **kwargs,
 ) -> pd.DataFrame:
     """
-    Performs Google Scholar API request of a given query and returns list of papers with
+    Performs Google Scholar API request of a given title and returns list of papers with
     fields as desired.
     Args:
-        query (str): Query to arxiv API. Needs to match the arxiv API notation.
-        fields (list[str]): List of strings with fields to keep in output.
+        title: Query to arxiv API. Needs to match the arxiv API notation.
+        fields: List of strings with fields to keep in output.
     Returns:
         pd.DataFrame. One paper per row.
@@ -74,13 +74,9 @@ def get_and_dump_scholar_papers(
     Combines get_scholar_papers and dump_papers.
     Args:
-        keywords (List[str, List[str]]): List of keywords to request arxiv API.
-            The outer list level will be considered as AND separated keys, the
-            inner level as OR separated.
-        filepath (str): Path where the dump will be saved.
-        fields (List, optional): List of strings with fields to keep in output.
-            Defaults to ['title', 'authors', 'date', 'abstract',
-            'journal', 'doi'].
+        title: Paper to search for on Google Scholar.
+        output_filepath: Path where the dump will be saved.
+        fields: List of strings with fields to keep in output.
     """
     papers = get_scholar_papers(title, fields)
     dump_papers(papers, output_filepath)

paperscraper-0.3.3/paperscraper/server_dumps/__init__.py ADDED

@@ -0,0 +1,4 @@
+"""
+Folder for the metadata dumps from biorxiv, medrxiv and chemrxiv API.
+No code here but will be populated with your local `.jsonl` files.
+"""

{paperscraper-0.3.1 → paperscraper-0.3.3}/paperscraper/tests/test_pdf.py RENAMED

@@ -41,14 +41,7 @@ class TestPDF:
         if os.path.exists("taskload.pdf"):
             os.remove("taskload.pdf")
         paper_data = {"doi": "10.1101/798496"}
-        os.environ.pop("AWS_ACCESS_KEY_ID", None)
-        os.environ.pop("AWS_SECRET_ACCESS_KEY", None)
-        save_pdf(paper_data, filepath="taskload.pdf", save_metadata=True)
-        # NOTE: Locally this fails but surprisingly the CI does not need to fight with Cloudflare for the moment
-        assert os.path.exists("taskload.pdf")
-        assert os.path.exists("taskload.json")
-        os.remove("taskload.pdf")
-        os.remove("taskload.json")
+        # NOTE: biorxiv is cloudflare controlled so standard scraping fails
         # Now try with S3 routine
         keys = load_api_keys("api_keys.txt")
@@ -71,13 +64,13 @@ class TestPDF:
         assert os.path.exists("taskload.pdf")
         os.remove("taskload.pdf")
-        # medrxiv
-        paper_data = {"doi": "10.1101/2020.09.02.20187096"}
-        save_pdf(paper_data, filepath="covid_review.pdf", save_metadata=True)
-        assert os.path.exists("covid_review.pdf")
-        assert os.path.exists("covid_review.json")
-        os.remove("covid_review.pdf")
-        os.remove("covid_review.json")
+        # medrxiv now also seems cloudflare-controlled. skipping test
+        # paper_data = {"doi": "10.1101/2020.09.02.20187096"}
+        # save_pdf(paper_data, filepath="covid_review.pdf", save_metadata=True)
+        # assert os.path.exists("covid_review.pdf")
+        # assert os.path.exists("covid_review.json")
+        # os.remove("covid_review.pdf")
+        # os.remove("covid_review.json")
         # journal with OA paper
         paper_data = {"doi": "10.1038/s42256-023-00639-z"}

{paperscraper-0.3.1 → paperscraper-0.3.3}/paperscraper.egg-info/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: paperscraper
-Version: 0.3.1
+Version: 0.3.3
 Summary: paperscraper: Package to scrape papers.
 Home-page: https://github.com/jannisborn/paperscraper
 Author: Jannis Born, Matteo Manica
@@ -52,12 +52,11 @@ Dynamic: summary
 [![build](https://github.com/jannisborn/paperscraper/actions/workflows/test_tip.yml/badge.svg?branch=main)](https://github.com/jannisborn/paperscraper/actions/workflows/test_tip.yml?query=branch%3Amain)
 [![build](https://github.com/jannisborn/paperscraper/actions/workflows/test_pypi.yml/badge.svg?branch=main)](https://github.com/jannisborn/paperscraper/actions/workflows/test_pypi.yml?query=branch%3Amain)
+[![build](https://github.com/jannisborn/paperscraper/actions/workflows/docs.yml/badge.svg?branch=main)](https://jannisborn.github.io/paperscraper/)
 [![License:
 MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 [![PyPI version](https://badge.fury.io/py/paperscraper.svg)](https://badge.fury.io/py/paperscraper)
 [![Downloads](https://static.pepy.tech/badge/paperscraper)](https://pepy.tech/project/paperscraper)
-[![Downloads](https://static.pepy.tech/badge/paperscraper/month)](https://pepy.tech/project/paperscraper)
-[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 [![codecov](https://codecov.io/github/jannisborn/paperscraper/branch/main/graph/badge.svg?token=Clwi0pu61a)](https://codecov.io/github/jannisborn/paperscraper)
 # paperscraper
@@ -67,6 +66,7 @@ It provides a streamlined interface to scrape metadata, allows to retrieve citat
 from Google Scholar, impact factors from journals and comes with simple postprocessing functions
 and plotting routines for meta-analysis.
 ## Table of Contents
 1. [Getting Started](#getting-started)
@@ -93,16 +93,16 @@ This is enough to query PubMed, arXiv or Google Scholar.
 #### Download X-rxiv Dumps
-However, to scrape publication data from the preprint servers [biorxiv](https://www.biorxiv.org), [medrxiv](https://www.medrxiv.org) and [chemrxiv](https://www.chemrxiv.org), the setup is different. The entire dump is downloaded and stored in the `server_dumps` folder in a `.jsonl` format (one paper per line).
+However, to scrape publication data from the preprint servers [biorxiv](https://www.biorxiv.org), [medrxiv](https://www.medrxiv.org) and [chemrxiv](https://www.chemrxiv.org), the setup is different. The entire history of papers is downloaded and stored in the `server_dumps` folder in a `.jsonl` format (one paper per line). This takes a while, as of November 2025:
 ```py
 from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
-medrxiv()  #  Takes ~30min and should result in ~35 MB file
-biorxiv()  # Takes ~1h and should result in ~350 MB file
-chemrxiv()  #  Takes ~45min and should result in ~20 MB file
+chemrxiv()  #  Takes 30min -> +30K papers (~50 MB file)
+medrxiv()  #  Takes <1h -> +90K papers (~200 MB file)
+biorxiv()  # Up to 6h -> +400K papers (~800 MB file)
 ```
 *NOTE*: Once the dumps are stored, please make sure to restart the python interpreter so that the changes take effect.
-*NOTE*: If you experience API connection issues (`ConnectionError`), since v0.2.12 there are automatic retries which you can even control and raise from the default of 10, as in `biorxiv(max_retries=20)`.
+*NOTE*: If you experience API connection issues, since v0.2.12 there are automatic retries which you can even control and raise from the default of 10, as in `biorxiv(max_retries=20)`.
 Since v0.2.5 `paperscraper` also allows to scrape {med/bio/chem}rxiv for specific dates.
 ```py
@@ -424,7 +424,7 @@ plot_multiple_venn(
 ## Citation
 If you use `paperscraper`, please cite a paper that motivated our development of this tool.
-```bib
+```bibtex
 @article{born2021trends,
   title={Trends in Deep Learning for Property-driven Drug Design},
   author={Born, Jannis and Manica, Matteo},
@@ -440,9 +440,15 @@ If you use `paperscraper`, please cite a paper that motivated our development of
 ## Contributions
 Thanks to the following contributors:
 - [@mathinic](https://github.com/mathinic): Since `v0.3.0` improved PubMed full text retrieval with additional fallback mechanisms (BioC-PMC, eLife and optional Wiley/Elsevier APIs).
 - [@memray](https://github.com/memray): Since `v0.2.12` there are automatic retries when downloading the {med/bio/chem}rxiv dumps.
 - [@achouhan93](https://github.com/achouhan93): Since `v0.2.5` {med/bio/chem}rxiv can be scraped for specific dates!
 - [@daenuprobst](https://github.com/daenuprobst): Since  `v0.2.4` PDF files can be scraped directly (`paperscraper.pdf.save_pdf`)
 - [@oppih](https://github.com/oppih): Since `v0.2.3` chemRxiv API also provides DOI and URL if available
-- [@lukasschwab](https://github.com/lukasschwab): Bumped `arxiv` dependency to >`1.4.2` in paperscraper `v0.1.0`.
+- [@lukasschwab](https://github.com/lukasschwab): Enabled support for `arxiv` >`1.4.2` in paperscraper `v0.1.0`.
 - [@juliusbierk](https://github.com/juliusbierk): Bugfixes

paperscraper-0.3.1/paperscraper/get_dumps/utils/chemrxiv/chemrxiv_api.py DELETED

@@ -1,137 +0,0 @@
-import logging
-import os
-import sys
-from datetime import datetime
-from time import time
-from typing import Dict, Optional
-from urllib.parse import urljoin
-import requests
-from requests.exceptions import ChunkedEncodingError
-logging.basicConfig(stream=sys.stdout, level=logging.INFO)
-logger = logging.getLogger(__name__)
-now_datetime = datetime.now()
-launch_dates = {"chemrxiv": "2017-01-01"}
-class ChemrxivAPI:
-    """Handle OpenEngage API requests, using access.
-    Adapted from https://github.com/fxcoudert/tools/blob/master/chemRxiv/chemRxiv.py.
-    """
-    base = "https://chemrxiv.org/engage/chemrxiv/public-api/v1/"
-    def __init__(
-        self,
-        start_date: Optional[str] = None,
-        end_date: Optional[str] = None,
-        page_size: Optional[int] = None,
-        max_retries: int = 10,
-    ):
-        """
-        Initialize API class.
-        Args:
-            start_date (Optional[str], optional): begin date expressed as YYYY-MM-DD.
-                Defaults to None.
-            end_date (Optional[str], optional): end date expressed as YYYY-MM-DD.
-                Defaults to None.
-            page_size (int, optional): The batch size used to fetch the records from chemrxiv.
-            max_retries (int): Number of retries in case of error
-        """
-        self.page_size = page_size or 50
-        self.max_retries = max_retries
-        # Begin Date and End Date of the search
-        launch_date = launch_dates["chemrxiv"]
-        launch_datetime = datetime.fromisoformat(launch_date)
-        if start_date:
-            start_datetime = datetime.fromisoformat(start_date)
-            if start_datetime < launch_datetime:
-                self.start_date = launch_date
-                logger.warning(
-                    f"Begin date {start_date} is before chemrxiv launch date. Will use {launch_date} instead."
-                )
-            else:
-                self.start_date = start_date
-        else:
-            self.start_date = launch_date
-        if end_date:
-            end_datetime = datetime.fromisoformat(end_date)
-            if end_datetime > now_datetime:
-                logger.warning(
-                    f"End date {end_date} is in the future. Will use {now_datetime} instead."
-                )
-                self.end_date = now_datetime.strftime("%Y-%m-%d")
-            else:
-                self.end_date = end_date
-        else:
-            self.end_date = now_datetime.strftime("%Y-%m-%d")
-    def request(self, url, method, params=None):
-        """Send an API request to open Engage."""
-        for attempt in range(self.max_retries):
-            try:
-                if method.casefold() == "get":
-                    return requests.get(url, params=params, timeout=10)
-                elif method.casefold() == "post":
-                    return requests.post(url, json=params, timeout=10)
-                else:
-                    raise ConnectionError(f"Unknown method for query: {method}")
-            except ChunkedEncodingError as e:
-                logger.warning(f"ChunkedEncodingError occurred for {url}: {e}")
-                if attempt + 1 == self.max_retries:
-                    raise e
-                time.sleep(3)
-    def query(self, query, method="get", params=None):
-        """Perform a direct query."""
-        r = self.request(urljoin(self.base, query), method, params=params)
-        r.raise_for_status()
-        return r.json()
-    def query_generator(self, query, method: str = "get", params: Dict = {}):
-        """Query for a list of items, with paging. Returns a generator."""
-        page = 0
-        while True:
-            params.update(
-                {
-                    "limit": self.page_size,
-                    "skip": page * self.page_size,
-                    "searchDateFrom": self.start_date,
-                    "searchDateTo": self.end_date,
-                }
-            )
-            r = self.request(urljoin(self.base, query), method, params=params)
-            if r.status_code == 400:
-                raise ValueError(r.json()["message"])
-            r.raise_for_status()
-            r = r.json()
-            r = r["itemHits"]
-            # If we have no more results, bail out
-            if len(r) == 0:
-                return
-            yield from r
-            page += 1
-    def all_preprints(self):
-        """Return a generator to all the chemRxiv articles."""
-        return self.query_generator("items")
-    def preprint(self, article_id):
-        """Information on a given preprint.
-        .. seealso:: https://docs.figshare.com/#public_article
-        """
-        return self.query(os.path.join("items", article_id))
-    def number_of_preprints(self):
-        return self.query("items")["totalCount"]
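In the deleted `ChemrxivAPI.request` method, `time` is bound by `from time import time`, so the `time.sleep(3)` call in the retry branch would raise `AttributeError` instead of pausing. A minimal sketch of the same retry pattern with a working sleep follows. `call_with_retries` and its parameters are hypothetical names chosen for illustration, and the helper takes any zero-argument callable so that it runs without network access.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    max_retries: int = 10,
    delay: float = 3.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
) -> T:
    """Call func until it succeeds or max_retries attempts are used up."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return func()
        except retry_on:
            # Re-raise on the last attempt, otherwise wait and try again.
            if attempt + 1 == max_retries:
                raise
            time.sleep(delay)
    raise AssertionError("unreachable")


# Stand-in for a flaky network call: fails twice, then succeeds.
attempts = {"count": 0}


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = call_with_retries(flaky, max_retries=5, delay=0.0)
print(result, attempts["count"])
```

Importing the module (`import time`) rather than the function keeps `time.sleep` available. The exception types worth retrying depend on the HTTP client in use, so `retry_on` is left configurable here.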