PyPI - paperscraper - Versions diffs - 0.3.2__tar.gz → 0.3.4__tar.gz - Mend

paperscraper 0.3.2tar.gz → 0.3.4tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (78)

{paperscraper-0.3.2 → paperscraper-0.3.4}/PKG-INFO RENAMED

@@ -1,62 +1,55 @@
 Metadata-Version: 2.4
 Name: paperscraper
-Version: 0.3.2
+Version: 0.3.4
 Summary: paperscraper: Package to scrape papers.
-Home-page: https://github.com/jannisborn/paperscraper
-Author: Jannis Born, Matteo Manica
-Author-email: jannis.born@gmx.de, drugilsberg@gmail.com
+Author-email: Jannis Born <jannis.born@gmx.de>, Matteo Manica <drugilsberg@gmail.com>
 License: MIT
+Project-URL: Homepage, https://github.com/jannisborn/paperscraper
+Project-URL: Documentation, https://jannisborn.github.io/paperscraper/
+Project-URL: Repository, https://github.com/jannisborn/paperscraper
 Keywords: Academics,Science,Publication,Search,PubMed,Arxiv,Medrxiv,Biorxiv,Chemrxiv,Google Scholar
 Classifier: Development Status :: 3 - Alpha
 Classifier: Intended Audience :: Developers
 Classifier: Intended Audience :: Science/Research
 Classifier: License :: OSI Approved :: MIT License
 Classifier: Programming Language :: Python :: 3
-Classifier: Programming Language :: Python :: 3.8
 Classifier: Programming Language :: Python :: 3.9
 Classifier: Programming Language :: Python :: 3.10
 Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
 Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Requires-Python: >=3.9
 Description-Content-Type: text/markdown
 License-File: LICENSE
-Requires-Dist: arxiv>=1.4.2
+Requires-Dist: arxiv>=1.4.7
 Requires-Dist: pymed-paperscraper>=1.0.4
-Requires-Dist: pandas
-Requires-Dist: requests
-Requires-Dist: tqdm
+Requires-Dist: pandas>=1.0.4
+Requires-Dist: requests==2.32.0
+Requires-Dist: tqdm>=4.51.0
 Requires-Dist: scholarly>=1.0.0
-Requires-Dist: seaborn
-Requires-Dist: matplotlib
-Requires-Dist: matplotlib_venn
-Requires-Dist: bs4
-Requires-Dist: impact-factor>=1.1.1
-Requires-Dist: thefuzz
+Requires-Dist: seaborn>=0.11.0
+Requires-Dist: matplotlib>=3.3.2
+Requires-Dist: matplotlib-venn>=0.11.5
+Requires-Dist: bs4>=0.0.1
+Requires-Dist: impact-factor>=1.1.3
+Requires-Dist: thefuzz>=0.20.0
 Requires-Dist: pytest
 Requires-Dist: tldextract
-Requires-Dist: semanticscholar
+Requires-Dist: semanticscholar>=0.8.4
 Requires-Dist: pydantic
 Requires-Dist: unidecode
 Requires-Dist: dotenv
 Requires-Dist: boto3
-Dynamic: author
-Dynamic: author-email
-Dynamic: classifier
-Dynamic: description
-Dynamic: description-content-type
-Dynamic: home-page
-Dynamic: keywords
-Dynamic: license
 Dynamic: license-file
-Dynamic: requires-dist
-Dynamic: summary
 [![build](https://github.com/jannisborn/paperscraper/actions/workflows/test_tip.yml/badge.svg?branch=main)](https://github.com/jannisborn/paperscraper/actions/workflows/test_tip.yml?query=branch%3Amain)
 [![build](https://github.com/jannisborn/paperscraper/actions/workflows/test_pypi.yml/badge.svg?branch=main)](https://github.com/jannisborn/paperscraper/actions/workflows/test_pypi.yml?query=branch%3Amain)
+[![build](https://github.com/jannisborn/paperscraper/actions/workflows/docs.yml/badge.svg?branch=main)](https://jannisborn.github.io/paperscraper/)
 [![License:
 MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 [![PyPI version](https://badge.fury.io/py/paperscraper.svg)](https://badge.fury.io/py/paperscraper)
 [![Downloads](https://static.pepy.tech/badge/paperscraper)](https://pepy.tech/project/paperscraper)
-[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 [![codecov](https://codecov.io/github/jannisborn/paperscraper/branch/main/graph/badge.svg?token=Clwi0pu61a)](https://codecov.io/github/jannisborn/paperscraper)
 # paperscraper
@@ -66,6 +59,7 @@ It provides a streamlined interface to scrape metadata, allows to retrieve citat
 from Google Scholar, impact factors from journals and comes with simple postprocessing functions
 and plotting routines for meta-analysis.
 ## Table of Contents
 1. [Getting Started](#getting-started)
@@ -90,18 +84,30 @@ pip install paperscraper
 This is enough to query PubMed, arXiv or Google Scholar.
+### Local development
+```console
+uv sync
+```
+This installs the project and dev tooling into `.venv`. Use `uv run` to execute commands, for example:
+```console
+uv run python -c "import paperscraper"
+```
 #### Download X-rxiv Dumps
-However, to scrape publication data from the preprint servers [biorxiv](https://www.biorxiv.org), [medrxiv](https://www.medrxiv.org) and [chemrxiv](https://www.chemrxiv.org), the setup is different. The entire dump is downloaded and stored in the `server_dumps` folder in a `.jsonl` format (one paper per line).
+However, to scrape publication data from the preprint servers [biorxiv](https://www.biorxiv.org), [medrxiv](https://www.medrxiv.org) and [chemrxiv](https://www.chemrxiv.org), the setup is different. The entire history of papers is downloaded and stored in the `server_dumps` folder in a `.jsonl` format (one paper per line). This takes a while, as of November 2025:
 ```py
 from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
-medrxiv()  #  Takes ~30min and should result in ~35 MB file
-biorxiv()  # Takes ~1h and should result in ~350 MB file
-chemrxiv()  #  Takes ~45min and should result in ~20 MB file
+chemrxiv()  #  Takes 30min -> +30K papers (~50 MB file)
+medrxiv()  #  Takes <1h -> +90K papers (~200 MB file)
+biorxiv()  # Up to 6h -> +400K papers (~800 MB file)
 ```
 *NOTE*: Once the dumps are stored, please make sure to restart the python interpreter so that the changes take effect.
-*NOTE*: If you experience API connection issues (`ConnectionError`), since v0.2.12 there are automatic retries which you can even control and raise from the default of 10, as in `biorxiv(max_retries=20)`.
+*NOTE*: If you experience API connection issues, since v0.2.12 there are automatic retries which you can even control and raise from the default of 10, as in `biorxiv(max_retries=20)`.
 Since v0.2.5 `paperscraper` also allows to scrape {med/bio/chem}rxiv for specific dates.
 ```py
@@ -279,6 +285,13 @@ doi = '10.1021/acs.jcim.3c00132'
 get_citations_by_doi(doi)
 ```
+NOTE: This uses the [Semantic Scholar API](https://www.semanticscholar.org/product/api/tutorial) which is bandwidth-limited. If you have an API Key set it via:
+```sh
+export SS_API_KEY=YOUR_API_KEY
+```
+This will increase your throughput for using `paperscraper.citations` based on the rate limits of your key.
 ### Journal impact factor
 You can also retrieve the impact factor for all journals:
@@ -423,7 +436,7 @@ plot_multiple_venn(
 ## Citation
 If you use `paperscraper`, please cite a paper that motivated our development of this tool.
-```bib
+```bibtex
 @article{born2021trends,
   title={Trends in Deep Learning for Property-driven Drug Design},
   author={Born, Jannis and Manica, Matteo},
@@ -439,9 +452,15 @@ If you use `paperscraper`, please cite a paper that motivated our development of
 ## Contributions
 Thanks to the following contributors:
 - [@mathinic](https://github.com/mathinic): Since `v0.3.0` improved PubMed full text retrieval with additional fallback mechanisms (BioC-PMC, eLife and optional Wiley/Elsevier APIs).
 - [@memray](https://github.com/memray): Since `v0.2.12` there are automatic retries when downloading the {med/bio/chem}rxiv dumps.
 - [@achouhan93](https://github.com/achouhan93): Since `v0.2.5` {med/bio/chem}rxiv can be scraped for specific dates!
 - [@daenuprobst](https://github.com/daenuprobst): Since  `v0.2.4` PDF files can be scraped directly (`paperscraper.pdf.save_pdf`)
 - [@oppih](https://github.com/oppih): Since `v0.2.3` chemRxiv API also provides DOI and URL if available
-- [@lukasschwab](https://github.com/lukasschwab): Bumped `arxiv` dependency to >`1.4.2` in paperscraper `v0.1.0`.
+- [@lukasschwab](https://github.com/lukasschwab): Enabled support for `arxiv` >`1.4.2` in paperscraper `v0.1.0`.
 - [@juliusbierk](https://github.com/juliusbierk): Bugfixes

{paperscraper-0.3.2 → paperscraper-0.3.4}/README.md RENAMED

@@ -1,10 +1,10 @@
 [![build](https://github.com/jannisborn/paperscraper/actions/workflows/test_tip.yml/badge.svg?branch=main)](https://github.com/jannisborn/paperscraper/actions/workflows/test_tip.yml?query=branch%3Amain)
 [![build](https://github.com/jannisborn/paperscraper/actions/workflows/test_pypi.yml/badge.svg?branch=main)](https://github.com/jannisborn/paperscraper/actions/workflows/test_pypi.yml?query=branch%3Amain)
+[![build](https://github.com/jannisborn/paperscraper/actions/workflows/docs.yml/badge.svg?branch=main)](https://jannisborn.github.io/paperscraper/)
 [![License:
 MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 [![PyPI version](https://badge.fury.io/py/paperscraper.svg)](https://badge.fury.io/py/paperscraper)
 [![Downloads](https://static.pepy.tech/badge/paperscraper)](https://pepy.tech/project/paperscraper)
-[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 [![codecov](https://codecov.io/github/jannisborn/paperscraper/branch/main/graph/badge.svg?token=Clwi0pu61a)](https://codecov.io/github/jannisborn/paperscraper)
 # paperscraper
@@ -14,6 +14,7 @@ It provides a streamlined interface to scrape metadata, allows to retrieve citat
 from Google Scholar, impact factors from journals and comes with simple postprocessing functions
 and plotting routines for meta-analysis.
 ## Table of Contents
 1. [Getting Started](#getting-started)
@@ -38,18 +39,30 @@ pip install paperscraper
 This is enough to query PubMed, arXiv or Google Scholar.
+### Local development
+```console
+uv sync
+```
+This installs the project and dev tooling into `.venv`. Use `uv run` to execute commands, for example:
+```console
+uv run python -c "import paperscraper"
+```
 #### Download X-rxiv Dumps
-However, to scrape publication data from the preprint servers [biorxiv](https://www.biorxiv.org), [medrxiv](https://www.medrxiv.org) and [chemrxiv](https://www.chemrxiv.org), the setup is different. The entire dump is downloaded and stored in the `server_dumps` folder in a `.jsonl` format (one paper per line).
+However, to scrape publication data from the preprint servers [biorxiv](https://www.biorxiv.org), [medrxiv](https://www.medrxiv.org) and [chemrxiv](https://www.chemrxiv.org), the setup is different. The entire history of papers is downloaded and stored in the `server_dumps` folder in a `.jsonl` format (one paper per line). This takes a while, as of November 2025:
 ```py
 from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
-medrxiv()  #  Takes ~30min and should result in ~35 MB file
-biorxiv()  # Takes ~1h and should result in ~350 MB file
-chemrxiv()  #  Takes ~45min and should result in ~20 MB file
+chemrxiv()  #  Takes 30min -> +30K papers (~50 MB file)
+medrxiv()  #  Takes <1h -> +90K papers (~200 MB file)
+biorxiv()  # Up to 6h -> +400K papers (~800 MB file)
 ```
 *NOTE*: Once the dumps are stored, please make sure to restart the python interpreter so that the changes take effect.
-*NOTE*: If you experience API connection issues (`ConnectionError`), since v0.2.12 there are automatic retries which you can even control and raise from the default of 10, as in `biorxiv(max_retries=20)`.
+*NOTE*: If you experience API connection issues, since v0.2.12 there are automatic retries which you can even control and raise from the default of 10, as in `biorxiv(max_retries=20)`.
 Since v0.2.5 `paperscraper` also allows to scrape {med/bio/chem}rxiv for specific dates.
 ```py
@@ -227,6 +240,13 @@ doi = '10.1021/acs.jcim.3c00132'
 get_citations_by_doi(doi)
 ```
+NOTE: This uses the [Semantic Scholar API](https://www.semanticscholar.org/product/api/tutorial) which is bandwidth-limited. If you have an API Key set it via:
+```sh
+export SS_API_KEY=YOUR_API_KEY
+```
+This will increase your throughput for using `paperscraper.citations` based on the rate limits of your key.
 ### Journal impact factor
 You can also retrieve the impact factor for all journals:
@@ -371,7 +391,7 @@ plot_multiple_venn(
 ## Citation
 If you use `paperscraper`, please cite a paper that motivated our development of this tool.
-```bib
+```bibtex
 @article{born2021trends,
   title={Trends in Deep Learning for Property-driven Drug Design},
   author={Born, Jannis and Manica, Matteo},
@@ -387,9 +407,15 @@ If you use `paperscraper`, please cite a paper that motivated our development of
 ## Contributions
 Thanks to the following contributors:
 - [@mathinic](https://github.com/mathinic): Since `v0.3.0` improved PubMed full text retrieval with additional fallback mechanisms (BioC-PMC, eLife and optional Wiley/Elsevier APIs).
 - [@memray](https://github.com/memray): Since `v0.2.12` there are automatic retries when downloading the {med/bio/chem}rxiv dumps.
 - [@achouhan93](https://github.com/achouhan93): Since `v0.2.5` {med/bio/chem}rxiv can be scraped for specific dates!
 - [@daenuprobst](https://github.com/daenuprobst): Since  `v0.2.4` PDF files can be scraped directly (`paperscraper.pdf.save_pdf`)
 - [@oppih](https://github.com/oppih): Since `v0.2.3` chemRxiv API also provides DOI and URL if available
-- [@lukasschwab](https://github.com/lukasschwab): Bumped `arxiv` dependency to >`1.4.2` in paperscraper `v0.1.0`.
+- [@lukasschwab](https://github.com/lukasschwab): Enabled support for `arxiv` >`1.4.2` in paperscraper `v0.1.0`.
 - [@juliusbierk](https://github.com/juliusbierk): Bugfixes

{paperscraper-0.3.2 → paperscraper-0.3.4}/paperscraper/__init__.py RENAMED

@@ -1,7 +1,7 @@
 """Initialize the module."""
 __name__ = "paperscraper"
-__version__ = "0.3.2"
+__version__ = "0.3.4"
 import logging
 import os

{paperscraper-0.3.2 → paperscraper-0.3.4}/paperscraper/arxiv/arxiv.py RENAMED

@@ -6,17 +6,16 @@ from typing import Dict, List, Literal, Union
 import arxiv
 import pandas as pd
-import pkg_resources
 from tqdm import tqdm
-from ..utils import dump_papers
+from ..utils import dump_papers, get_server_dumps_dir
 from ..xrxiv.xrxiv_query import XRXivQuery
 from .utils import get_query_from_keywords, infer_backend
 logging.basicConfig(stream=sys.stdout, level=logging.INFO)
 logger = logging.getLogger(__name__)
-dump_root = pkg_resources.resource_filename("paperscraper", "server_dumps")
+dump_root = get_server_dumps_dir()
 global ARXIV_QUERIER
 ARXIV_QUERIER = None
@@ -94,7 +93,7 @@ def get_arxiv_papers_api(
     fields as desired.
     Args:
-        query Query to arxiv API. Needs to match the arxiv API notation.
+        query: Query to arxiv API. Needs to match the arxiv API notation.
         fields: List of strings with fields to keep in output.
         max_results: Maximal number of results, defaults to 99999.
         client_options: Optional arguments for `arxiv.Client`. E.g.:
@@ -144,7 +143,7 @@ def get_and_dump_arxiv_papers(
         keywords: List of keywords for arxiv search.
             The outer list level will be considered as AND separated keys, the
             inner level as OR separated.
-        filepath: Path where the dump will be saved.
+        output_filepath: Path where the dump will be saved.
         fields: List of strings with fields to keep in output.
             Defaults to ['title', 'authors', 'date', 'abstract',
             'journal', 'doi'].

{paperscraper-0.3.2 → paperscraper-0.3.4}/paperscraper/arxiv/utils.py RENAMED

@@ -3,7 +3,7 @@ import os
 from datetime import datetime
 from typing import List, Union
-import pkg_resources
+from ..utils import get_server_dumps_dir
 finalize_disjunction = lambda x: "(" + x[:-4] + ") AND "
 finalize_conjunction = lambda x: x[:-5]
@@ -59,6 +59,6 @@ def get_query_from_keywords(
 def infer_backend():
-    dump_root = pkg_resources.resource_filename("paperscraper", "server_dumps")
+    dump_root = get_server_dumps_dir()
     dump_paths = glob.glob(os.path.join(dump_root, "arxiv" + "*"))
     return "api" if not dump_paths else "local"
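
The diff does not show how `get_server_dumps_dir` is implemented in `paperscraper/utils.py`. Since `pkg_resources` is deprecated, one plausible replacement for `pkg_resources.resource_filename` is `importlib.resources.files`. The sketch below is an assumption about the approach, not the package's code; `package_subdir` is a hypothetical name.

```python
from importlib.resources import files


def package_subdir(package: str, subdir: str) -> str:
    """Hypothetical helper: path of `subdir` inside an installed package.

    Roughly equivalent to
    pkg_resources.resource_filename(package, subdir) for packages
    that live on the filesystem.
    """
    # files() returns a Traversable rooted at the package directory.
    return str(files(package).joinpath(subdir))


# Demonstrated on a standard-library package so the snippet runs anywhere;
# for paperscraper the call would be package_subdir("paperscraper", "server_dumps").
mime_dir = package_subdir("email", "mime")
```

A helper like this keeps the lookup in one place, so modules such as `arxiv.py` and `arxiv/utils.py` no longer need their own `pkg_resources` import.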

{paperscraper-0.3.2 → paperscraper-0.3.4}/paperscraper/async_utils.py RENAMED

@@ -49,14 +49,20 @@ def optional_async(
 def retry_with_exponential_backoff(
-    *, max_retries: int = 5, base_delay: float = 1.0
+    *,
+    max_retries: int = 5,
+    base_delay: float = 1.0,
+    factor: float = 1.3,
+    constant_delay: float = 0.2,
 ) -> Callable[[F], F]:
     """
     Decorator factory that retries an `async def` on HTTP 429, with exponential backoff.
     Args:
         max_retries: how many times to retry before giving up.
-        base_delay: initial delay in seconds; next delays will be duplication of previous.
+        base_delay: initial delay in seconds; next delays will be multiplied by `factor`.
+        factor: multiplier for delay after each retry.
+        constant_delay: fixed delay before each attempt.
     Usage:
@@ -70,18 +76,39 @@ def retry_with_exponential_backoff(
         @wraps(func)
         async def wrapper(*args, **kwargs) -> Any:
             delay = base_delay
-            for attempt in range(max_retries):
+            last_exception: BaseException | None = None
+            for attempt in range(1, max_retries + 1):
+                await asyncio.sleep(constant_delay)
                 try:
                     return await func(*args, **kwargs)
                 except httpx.HTTPStatusError as e:
-                    # only retry on 429
                     status = e.response.status_code if e.response is not None else None
-                    if status != 429 or attempt == max_retries - 1:
+                    if status != 429:
                         raise
-                # backoff
-                await asyncio.sleep(delay)
-                delay *= 2
-            # in theory we never reach here
+                    last_exception = e
+                    sleep_for = delay
+                    if e.response is not None:
+                        ra = e.response.headers.get("Retry-After")
+                        if ra is not None:
+                            try:
+                                sleep_for = float(ra)
+                            except ValueError:
+                                pass
+                    delay *= factor
+                except httpx.ReadError as e:
+                    last_exception = e
+                    sleep_for = delay
+                    delay *= factor
+                if attempt == max_retries:
+                    msg = (
+                        f"{func.__name__} failed after {attempt} attempts with "
+                        f"last delay {sleep_for:.2f}s"
+                    )
+                    raise RuntimeError(msg) from last_exception
+                await asyncio.sleep(sleep_for)
         return wrapper
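
To make the new retry arithmetic easier to follow, the sleep durations between failed attempts can be computed in isolation. This is a minimal sketch of the schedule only: `backoff_schedule` is a hypothetical helper, it ignores the fixed `constant_delay` that precedes every attempt, and it models `Retry-After` as an optional list of header strings (the `httpx.ReadError` path in the hunk behaves like a missing header).

```python
from typing import List, Optional, Sequence


def backoff_schedule(
    max_retries: int = 5,
    base_delay: float = 1.0,
    factor: float = 1.3,
    retry_after: Optional[Sequence[Optional[str]]] = None,
) -> List[float]:
    """Return the sleeps between failed attempts (hypothetical helper)."""
    delay = base_delay
    sleeps: List[float] = []
    for attempt in range(1, max_retries + 1):
        sleep_for = delay
        header = None
        if retry_after is not None and attempt - 1 < len(retry_after):
            header = retry_after[attempt - 1]
        if header is not None:
            try:
                # A numeric Retry-After overrides the backoff delay.
                sleep_for = float(header)
            except ValueError:
                pass  # non-numeric header: keep the backoff delay
        delay *= factor
        if attempt == max_retries:
            break  # the decorator raises RuntimeError here instead of sleeping
        sleeps.append(sleep_for)
    return sleeps


default_sleeps = backoff_schedule()
```

With the defaults from the hunk, five attempts yield four sleeps that grow by a factor of 1.3 each time, which is considerably gentler than the previous doubling.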

{paperscraper-0.3.2 → paperscraper-0.3.4}/paperscraper/citations/citations.py RENAMED

@@ -1,4 +1,5 @@
 import logging
+import os
 import sys
 from time import sleep
@@ -7,7 +8,7 @@ from semanticscholar import SemanticScholar, SemanticScholarException
 logging.basicConfig(stream=sys.stdout, level=logging.INFO)
 logger = logging.getLogger(__name__)
-sch = SemanticScholar()
+sch = SemanticScholar(api_key=os.getenv("SS_API_KEY"))
 def get_citations_by_doi(doi: str) -> int:
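
In this hunk the key is read once, at module import, via `os.getenv("SS_API_KEY")`, which returns `None` when the variable is unset. The variable therefore has to be exported before `paperscraper.citations` is first imported. The sketch below shows a slightly stricter variant that also treats a blank value as unset; `load_api_key` is a hypothetical helper and not part of the package.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "SS_API_KEY"  # variable name taken from the diff


def load_api_key(environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Hypothetical helper: return the key, or None when unset or blank.

    Unlike a bare os.getenv, this maps an empty or whitespace-only
    value to None so that no empty key is handed to the client.
    """
    key = environ.get(ENV_VAR, "").strip()
    return key or None


# Usage with explicit mappings, so the real environment is left untouched.
missing = load_api_key({})
present = load_api_key({"SS_API_KEY": " YOUR_API_KEY "})
```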

{paperscraper-0.3.2 → paperscraper-0.3.4}/paperscraper/citations/entity/core.py RENAMED

@@ -5,14 +5,15 @@ from pydantic import BaseModel
 class EntityResult(BaseModel):
-    num_citations: int
-    num_references: int
-    # keys are authors or papers and values are absolute self links
-    self_citations: Dict[str, int] = {}
-    self_references: Dict[str, int] = {}
     # aggregated results
     self_citation_ratio: float = 0
     self_reference_ratio: float = 0
+    # total number of author citations/references
+    num_citations: int
+    num_references: int
+    # keys are papers and values are percentage of self citations/references
+    self_citations: Dict[str, float] = {}
+    self_references: Dict[str, float] = {}
 class Entity:

{paperscraper-0.3.2 → paperscraper-0.3.4}/paperscraper/citations/entity/paper.py RENAMED

@@ -68,14 +68,14 @@ class Paper(Entity):
         Extracts the self references of a paper, for each author.
         """
         if isinstance(self.doi, str):
-            self.ref_result: ReferenceResult = self_references_paper(self.doi)
+            self.self_ref: ReferenceResult = self_references_paper(self.doi)
     def self_citations(self):
         """
         Extracts the self citations of a paper, for each author.
         """
         if isinstance(self.doi, str):
-            self.citation_result: CitationResult = self_citations_paper(self.doi)
+            self.self_cite: CitationResult = self_citations_paper(self.doi)
     def get_result(self) -> Optional[PaperResult]:
         """
@@ -83,18 +83,20 @@ class Paper(Entity):
         Returns: PaperResult if available.
         """
-        if not hasattr(self, "ref_result"):
-            logger.warning(
-                f"Can't get result since no referencing result for {self.input} exists. Run `.self_references` first."
-            )
-            return
-        elif not hasattr(self, "citation_result"):
-            logger.warning(
-                f"Can't get result since no citation result for {self.input} exists. Run `.self_citations` first."
-            )
-            return
-        ref_result = self.ref_result.model_dump()
-        ref_result.pop("ssid", None)
+        if not hasattr(self, "self_ref"):
+            self.self_references()
+        if not hasattr(self, "self_cite"):
+            self.self_citations()
         return PaperResult(
-            title=self.title, **ref_result, **self.citation_result.model_dump()
+            title=self.title,
+            **{
+                k: v
+                for k, v in self.self_ref.model_dump().items()
+                if k not in ["ssid", "title"]
+            },
+            **{
+                k: v
+                for k, v in self.self_cite.model_dump().items()
+                if k not in ["title"]
+            },
         )
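
The key filtering in the new `get_result` matters because Python rejects a call that receives the same keyword twice, whether from an explicit argument plus `**` unpacking or from two unpacked dicts. The sketch below reproduces the pattern with plain dicts. `without_keys` and `build_result` are hypothetical stand-ins for the comprehension and for `PaperResult`, and the field values are made up for illustration.

```python
from typing import Any, Dict, Iterable


def without_keys(data: Dict[str, Any], drop: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of `data` without the keys listed in `drop`."""
    drop_set = set(drop)
    return {k: v for k, v in data.items() if k not in drop_set}


def build_result(title: str, **fields: Any) -> Dict[str, Any]:
    """Stand-in for a model constructor that takes `title` explicitly."""
    return {"title": title, **fields}


# Illustrative dumps; both carry a `title` key, one also carries `ssid`.
ref_dump = {"ssid": "abc", "title": "T", "num_references": 10}
cite_dump = {"title": "T", "num_citations": 4}

result = build_result(
    title="T",
    **without_keys(ref_dump, ["ssid", "title"]),
    **without_keys(cite_dump, ["title"]),
)

# Without filtering, the duplicate `title` keyword raises TypeError.
try:
    build_result(title="T", **cite_dump)
    duplicate_rejected = False
except TypeError:
    duplicate_rejected = True
```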
