PyPI - paperscraper - Versions diffs - 0.3.3__tar.gz → 0.3.5__tar.gz - Mend

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (77)

{paperscraper-0.3.3 → paperscraper-0.3.5}/PKG-INFO RENAMED

@@ -1,54 +1,47 @@
 Metadata-Version: 2.4
 Name: paperscraper
-Version: 0.3.3
+Version: 0.3.5
 Summary: paperscraper: Package to scrape papers.
-Home-page: https://github.com/jannisborn/paperscraper
-Author: Jannis Born, Matteo Manica
-Author-email: jannis.born@gmx.de, drugilsberg@gmail.com
+Author-email: Jannis Born <jannis.born@gmx.de>, Matteo Manica <drugilsberg@gmail.com>
 License: MIT
+Project-URL: Homepage, https://github.com/jannisborn/paperscraper
+Project-URL: Documentation, https://jannisborn.github.io/paperscraper/
+Project-URL: Repository, https://github.com/jannisborn/paperscraper
 Keywords: Academics,Science,Publication,Search,PubMed,Arxiv,Medrxiv,Biorxiv,Chemrxiv,Google Scholar
 Classifier: Development Status :: 3 - Alpha
 Classifier: Intended Audience :: Developers
 Classifier: Intended Audience :: Science/Research
 Classifier: License :: OSI Approved :: MIT License
 Classifier: Programming Language :: Python :: 3
-Classifier: Programming Language :: Python :: 3.8
 Classifier: Programming Language :: Python :: 3.9
 Classifier: Programming Language :: Python :: 3.10
 Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
 Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Requires-Python: >=3.9
 Description-Content-Type: text/markdown
 License-File: LICENSE
-Requires-Dist: arxiv>=1.4.2
+Requires-Dist: arxiv>=1.4.7
 Requires-Dist: pymed-paperscraper>=1.0.4
-Requires-Dist: pandas
-Requires-Dist: requests
-Requires-Dist: tqdm
+Requires-Dist: pandas>=1.0.4
+Requires-Dist: requests>=2.32.2
+Requires-Dist: tqdm>=4.51.0
 Requires-Dist: scholarly>=1.0.0
-Requires-Dist: seaborn
-Requires-Dist: matplotlib
-Requires-Dist: matplotlib_venn
-Requires-Dist: bs4
-Requires-Dist: impact-factor>=1.1.1
-Requires-Dist: thefuzz
+Requires-Dist: seaborn>=0.11.0
+Requires-Dist: matplotlib>=3.3.2
+Requires-Dist: matplotlib-venn>=0.11.5
+Requires-Dist: bs4>=0.0.1
+Requires-Dist: impact-factor>=1.1.3
+Requires-Dist: thefuzz>=0.20.0
 Requires-Dist: pytest
 Requires-Dist: tldextract
-Requires-Dist: semanticscholar
+Requires-Dist: semanticscholar>=0.8.4
 Requires-Dist: pydantic
 Requires-Dist: unidecode
 Requires-Dist: dotenv
 Requires-Dist: boto3
-Dynamic: author
-Dynamic: author-email
-Dynamic: classifier
-Dynamic: description
-Dynamic: description-content-type
-Dynamic: home-page
-Dynamic: keywords
-Dynamic: license
 Dynamic: license-file
-Dynamic: requires-dist
-Dynamic: summary
 [![build](https://github.com/jannisborn/paperscraper/actions/workflows/test_tip.yml/badge.svg?branch=main)](https://github.com/jannisborn/paperscraper/actions/workflows/test_tip.yml?query=branch%3Amain)
 [![build](https://github.com/jannisborn/paperscraper/actions/workflows/test_pypi.yml/badge.svg?branch=main)](https://github.com/jannisborn/paperscraper/actions/workflows/test_pypi.yml?query=branch%3Amain)
@@ -91,6 +84,18 @@ pip install paperscraper
 This is enough to query PubMed, arXiv or Google Scholar.
+### Local development
+```console
+uv sync
+```
+This installs the project and dev tooling into `.venv`. Use `uv run` to execute commands, for example:
+```console
+uv run python -c "import paperscraper"
+```
 #### Download X-rxiv Dumps
 However, to scrape publication data from the preprint servers [biorxiv](https://www.biorxiv.org), [medrxiv](https://www.medrxiv.org) and [chemrxiv](https://www.chemrxiv.org), the setup is different. The entire history of papers is downloaded and stored in the `server_dumps` folder in a `.jsonl` format (one paper per line). This takes a while, as of November 2025:
@@ -280,6 +285,13 @@ doi = '10.1021/acs.jcim.3c00132'
 get_citations_by_doi(doi)
 ```
+NOTE: This uses the [Semantic Scholar API](https://www.semanticscholar.org/product/api/tutorial) which is bandwidth-limited. If you have an API Key set it via:
+```sh
+export SS_API_KEY=YOUR_API_KEY
+```
+This will increase your throughput for using `paperscraper.citations` based on the rate limits of your key.
 ### Journal impact factor
 You can also retrieve the impact factor for all journals:

{paperscraper-0.3.3 → paperscraper-0.3.5}/README.md RENAMED

@@ -39,6 +39,18 @@ pip install paperscraper
 This is enough to query PubMed, arXiv or Google Scholar.
+### Local development
+```console
+uv sync
+```
+This installs the project and dev tooling into `.venv`. Use `uv run` to execute commands, for example:
+```console
+uv run python -c "import paperscraper"
+```
 #### Download X-rxiv Dumps
 However, to scrape publication data from the preprint servers [biorxiv](https://www.biorxiv.org), [medrxiv](https://www.medrxiv.org) and [chemrxiv](https://www.chemrxiv.org), the setup is different. The entire history of papers is downloaded and stored in the `server_dumps` folder in a `.jsonl` format (one paper per line). This takes a while, as of November 2025:
@@ -228,6 +240,13 @@ doi = '10.1021/acs.jcim.3c00132'
 get_citations_by_doi(doi)
 ```
+NOTE: This uses the [Semantic Scholar API](https://www.semanticscholar.org/product/api/tutorial) which is bandwidth-limited. If you have an API Key set it via:
+```sh
+export SS_API_KEY=YOUR_API_KEY
+```
+This will increase your throughput for using `paperscraper.citations` based on the rate limits of your key.
 ### Journal impact factor
 You can also retrieve the impact factor for all journals:

{paperscraper-0.3.3 → paperscraper-0.3.5}/paperscraper/__init__.py RENAMED

@@ -1,7 +1,7 @@
 """Initialize the module."""
 __name__ = "paperscraper"
-__version__ = "0.3.3"
+__version__ = "0.3.5"
 import logging
 import os

{paperscraper-0.3.3 → paperscraper-0.3.5}/paperscraper/arxiv/arxiv.py RENAMED

@@ -6,17 +6,16 @@ from typing import Dict, List, Literal, Union
 import arxiv
 import pandas as pd
-import pkg_resources
 from tqdm import tqdm
-from ..utils import dump_papers
+from ..utils import dump_papers, get_server_dumps_dir
 from ..xrxiv.xrxiv_query import XRXivQuery
 from .utils import get_query_from_keywords, infer_backend
 logging.basicConfig(stream=sys.stdout, level=logging.INFO)
 logger = logging.getLogger(__name__)
-dump_root = pkg_resources.resource_filename("paperscraper", "server_dumps")
+dump_root = get_server_dumps_dir()
 global ARXIV_QUERIER
 ARXIV_QUERIER = None
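
Note on the hunk above: the body of `get_server_dumps_dir` is not part of this diff, so its behavior is unknown here. Purely as an illustration of how a package-relative directory can be resolved without `pkg_resources`, the following minimal sketch uses the standard library's `importlib.resources.files` (Python 3.9+). The helper name `package_subdir` is hypothetical, and the stdlib `json` package stands in for `paperscraper` so the snippet runs without the package installed.

```python
from importlib.resources import files


def package_subdir(package: str, subdir: str) -> str:
    """Return the path of `subdir` inside an importable package.

    Hypothetical helper for illustration; not paperscraper's implementation.
    """
    # files() returns a Traversable rooted at the package directory.
    return str(files(package) / subdir)


# `json` is only a stand-in package; the directory itself need not exist.
print(package_subdir("json", "server_dumps"))
```

The returned string plays the same role as the old `dump_root`: a base directory that later code joins with file patterns.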

{paperscraper-0.3.3 → paperscraper-0.3.5}/paperscraper/arxiv/utils.py RENAMED

@@ -3,7 +3,7 @@ import os
 from datetime import datetime
 from typing import List, Union
-import pkg_resources
+from ..utils import get_server_dumps_dir
 finalize_disjunction = lambda x: "(" + x[:-4] + ") AND "
 finalize_conjunction = lambda x: x[:-5]
@@ -59,6 +59,6 @@ def get_query_from_keywords(
 def infer_backend():
-    dump_root = pkg_resources.resource_filename("paperscraper", "server_dumps")
+    dump_root = get_server_dumps_dir()
     dump_paths = glob.glob(os.path.join(dump_root, "arxiv" + "*"))
     return "api" if not dump_paths else "local"

{paperscraper-0.3.3 → paperscraper-0.3.5}/paperscraper/async_utils.py RENAMED

@@ -49,14 +49,20 @@ def optional_async(
 def retry_with_exponential_backoff(
-    *, max_retries: int = 5, base_delay: float = 1.0
+    *,
+    max_retries: int = 5,
+    base_delay: float = 1.0,
+    factor: float = 1.3,
+    constant_delay: float = 0.2,
 ) -> Callable[[F], F]:
     """
     Decorator factory that retries an `async def` on HTTP 429, with exponential backoff.
     Args:
         max_retries: how many times to retry before giving up.
-        base_delay: initial delay in seconds; next delays will be duplication of previous.
+        base_delay: initial delay in seconds; next delays will be multiplied by `factor`.
+        factor: multiplier for delay after each retry.
+        constant_delay: fixed delay before each attempt.
     Usage:
@@ -70,18 +76,39 @@ def retry_with_exponential_backoff(
         @wraps(func)
         async def wrapper(*args, **kwargs) -> Any:
             delay = base_delay
-            for attempt in range(max_retries):
+            last_exception: BaseException | None = None
+            for attempt in range(1, max_retries + 1):
+                await asyncio.sleep(constant_delay)
                 try:
                     return await func(*args, **kwargs)
                 except httpx.HTTPStatusError as e:
-                    # only retry on 429
                     status = e.response.status_code if e.response is not None else None
-                    if status != 429 or attempt == max_retries - 1:
+                    if status != 429:
                         raise
-                # backoff
-                await asyncio.sleep(delay)
-                delay *= 2
-            # in theory we never reach here
+                    last_exception = e
+                    sleep_for = delay
+                    if e.response is not None:
+                        ra = e.response.headers.get("Retry-After")
+                        if ra is not None:
+                            try:
+                                sleep_for = float(ra)
+                            except ValueError:
+                                pass
+                    delay *= factor
+                except httpx.ReadError as e:
+                    last_exception = e
+                    sleep_for = delay
+                    delay *= factor
+                if attempt == max_retries:
+                    msg = (
+                        f"{func.__name__} failed after {attempt} attempts with "
+                        f"last delay {sleep_for:.2f}s"
+                    )
+                    raise RuntimeError(msg) from last_exception
+                await asyncio.sleep(sleep_for)
         return wrapper
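
Note on the hunk above: the new loop sleeps `constant_delay` before every attempt, multiplies the backoff delay by `factor` after each failure, prefers a numeric `Retry-After` value when one is present, and raises `RuntimeError` once the last attempt fails. The sketch below reproduces that control flow with the standard library only. `RateLimited`, `retry_sketch`, and the injectable `sleep` parameter are assumptions made for illustration; the real decorator catches `httpx.HTTPStatusError` and `httpx.ReadError` instead.

```python
import asyncio
from functools import wraps


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 error."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after


def retry_sketch(*, max_retries=5, base_delay=1.0, factor=1.3,
                 constant_delay=0.2, sleep=asyncio.sleep):
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_retries + 1):
                # Fixed pause before every attempt, including the first.
                await sleep(constant_delay)
                try:
                    return await func(*args, **kwargs)
                except RateLimited as e:
                    last = e
                    # A server-provided hint overrides the computed backoff.
                    sleep_for = e.retry_after if e.retry_after is not None else delay
                    delay *= factor
                if attempt == max_retries:
                    raise RuntimeError(
                        f"{func.__name__} failed after {attempt} attempts"
                    ) from last
                await sleep(sleep_for)
        return wrapper
    return decorator
```

Passing a recording coroutine as `sleep` makes the schedule observable without waiting: with `base_delay=1.0` and `factor=2.0`, the backoff sleeps between failed attempts are 1.0 s, then 2.0 s, each preceded by the constant delay.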

{paperscraper-0.3.3 → paperscraper-0.3.5}/paperscraper/citations/citations.py RENAMED

@@ -1,4 +1,5 @@
 import logging
+import os
 import sys
 from time import sleep
@@ -7,7 +8,7 @@ from semanticscholar import SemanticScholar, SemanticScholarException
 logging.basicConfig(stream=sys.stdout, level=logging.INFO)
 logger = logging.getLogger(__name__)
-sch = SemanticScholar()
+sch = SemanticScholar(api_key=os.getenv("SS_API_KEY"))
 def get_citations_by_doi(doi: str) -> int:
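
Note on the hunk above: the client is created at module level, so `SS_API_KEY` is read once, when the module is imported, and `os.getenv` yields `None` if the variable is unset. A minimal sketch of that lookup follows. `resolve_api_key` is a hypothetical helper, and treating an empty string as "no key" is an added assumption that the diff itself does not make.

```python
import os


def resolve_api_key(env_var: str = "SS_API_KEY"):
    """Return the key from the environment, or None when unset or empty.

    Hypothetical helper; the diff passes os.getenv("SS_API_KEY") directly.
    """
    value = os.getenv(env_var)
    return value if value else None


# Set the variable before importing any module that reads it at import time.
os.environ.setdefault("SS_API_KEY", "YOUR_API_KEY")  # placeholder value
print(resolve_api_key() is not None)
```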

{paperscraper-0.3.3 → paperscraper-0.3.5}/paperscraper/citations/entity/core.py RENAMED

@@ -5,14 +5,15 @@ from pydantic import BaseModel
 class EntityResult(BaseModel):
-    num_citations: int
-    num_references: int
-    # keys are authors or papers and values are absolute self links
-    self_citations: Dict[str, int] = {}
-    self_references: Dict[str, int] = {}
     # aggregated results
     self_citation_ratio: float = 0
     self_reference_ratio: float = 0
+    # total number of author citations/references
+    num_citations: int
+    num_references: int
+    # keys are papers and values are percentage of self citations/references
+    self_citations: Dict[str, float] = {}
+    self_references: Dict[str, float] = {}
 class Entity:
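
Note on the hunk above: `self_citations` and `self_references` now map paper titles to percentages (`float`) rather than absolute counts (`int`). One plausible way to collapse such per-paper values into the aggregated `self_citation_ratio` is an unweighted mean guarded against an empty mapping. The function name `aggregate_ratio` is hypothetical, and the paper titles are made-up sample data.

```python
from typing import Dict


def aggregate_ratio(per_paper: Dict[str, float]) -> float:
    """Unweighted mean of per-paper percentages, rounded to 3 decimals.

    max(1, ...) avoids a ZeroDivisionError when no papers are present.
    """
    return round(sum(per_paper.values()) / max(1, len(per_paper)), 3)


sample = {"Paper A": 10.0, "Paper B": 5.0}  # illustrative values only
print(aggregate_ratio(sample))
```

Because the mean is unweighted, a paper with two citations influences the ratio as much as one with two hundred; weighting by citation count would be a different design.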

{paperscraper-0.3.3 → paperscraper-0.3.5}/paperscraper/citations/entity/paper.py RENAMED

@@ -68,14 +68,14 @@ class Paper(Entity):
         Extracts the self references of a paper, for each author.
         """
         if isinstance(self.doi, str):
-            self.ref_result: ReferenceResult = self_references_paper(self.doi)
+            self.self_ref: ReferenceResult = self_references_paper(self.doi)
     def self_citations(self):
         """
         Extracts the self citations of a paper, for each author.
         """
         if isinstance(self.doi, str):
-            self.citation_result: CitationResult = self_citations_paper(self.doi)
+            self.self_cite: CitationResult = self_citations_paper(self.doi)
     def get_result(self) -> Optional[PaperResult]:
         """
@@ -83,18 +83,20 @@ class Paper(Entity):
         Returns: PaperResult if available.
         """
-        if not hasattr(self, "ref_result"):
-            logger.warning(
-                f"Can't get result since no referencing result for {self.input} exists. Run `.self_references` first."
-            )
-            return
-        elif not hasattr(self, "citation_result"):
-            logger.warning(
-                f"Can't get result since no citation result for {self.input} exists. Run `.self_citations` first."
-            )
-            return
-        ref_result = self.ref_result.model_dump()
-        ref_result.pop("ssid", None)
+        if not hasattr(self, "self_ref"):
+            self.self_references()
+        if not hasattr(self, "self_cite"):
+            self.self_citations()
         return PaperResult(
-            title=self.title, **ref_result, **self.citation_result.model_dump()
+            title=self.title,
+            **{
+                k: v
+                for k, v in self.self_ref.model_dump().items()
+                if k not in ["ssid", "title"]
+            },
+            **{
+                k: v
+                for k, v in self.self_cite.model_dump().items()
+                if k not in ["title"]
+            },
         )
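
Note on the hunk above: both result models now carry a `title` field, so unpacking the two dumps next to an explicit `title=` argument would pass the same keyword more than once. The rewritten `get_result` filters the overlapping keys first. The sketch below shows the same pattern with plain dictionaries; `without_keys` and `build_result` are hypothetical names, and the dictionaries are made-up sample data.

```python
from typing import Any, Dict, Iterable


def without_keys(data: Dict[str, Any], drop: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of `data` without the keys listed in `drop`."""
    dropped = set(drop)
    return {k: v for k, v in data.items() if k not in dropped}


def build_result(title: str, ref_dump: Dict[str, Any],
                 cite_dump: Dict[str, Any]) -> Dict[str, Any]:
    # Duplicate keyword arguments raise TypeError, hence the filtering.
    return dict(
        title=title,
        **without_keys(ref_dump, ["ssid", "title"]),
        **without_keys(cite_dump, ["title"]),
    )


ref = {"ssid": "x", "title": "T", "num_references": 3}
cite = {"ssid": "x", "title": "T", "num_citations": 5}
print(build_result("T", ref, cite))
```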

paperscraper-0.3.5/paperscraper/citations/entity/researcher.py ADDED

@@ -0,0 +1,221 @@
+import asyncio
+import os
+from typing import Any, List, Literal, Optional, Tuple
+from semanticscholar import SemanticScholar
+from ..orcid import orcid_to_author_name
+from ..self_citations import CitationResult, self_citations_paper
+from ..self_references import ReferenceResult, self_references_paper
+from ..utils import author_name_to_ssaid, get_papers_for_author
+from .core import Entity, EntityResult
+class ResearcherResult(EntityResult):
+    name: str
+    ssaid: int
+    orcid: Optional[str] = None
+    def _ordered_items(self) -> List[Tuple[str, Any]]:
+        # enforce specific ordering
+        return [
+            ("name", self.name),
+            ("self_reference_ratio", self.self_reference_ratio),
+            ("self_citation_ratio", self.self_citation_ratio),
+            ("num_references", self.num_references),
+            ("num_citations", self.num_citations),
+            ("self_references", self.self_references),
+            ("self_citations", self.self_citations),
+            ("ssaid", self.ssaid),
+            ("orcid", self.orcid),
+        ]
+    def __repr__(self) -> str:
+        inner = ", ".join(f"{k}={v!r}" for k, v in self._ordered_items())
+        return f"{self.__class__.__name__}({inner})"
+    def __str__(self) -> str:
+        return " ".join(f"{k}={v!r}" for k, v in self._ordered_items())
+ModeType = Literal[tuple(MODES := ("name", "orcid", "ssaid", "infer"))]
+sch = SemanticScholar(api_key=os.getenv("SS_API_KEY"))
+class Researcher(Entity):
+    name: str
+    ssaid: int
+    orcid: Optional[str] = None
+    ssids: List[int] = []
+    def __init__(self, input: str, mode: ModeType = "infer"):
+        """
+        Construct researcher object for self citation/reference analysis.
+        Args:
+            input: A researcher to search for, identified by name, ORCID iD, or Semantic Scholar Author ID.
+            mode: This can be a `name` `orcid` (ORCID iD) or `ssaid` (Semantic Scholar Author ID).
+                Defaults to "infer".
+        Raises:
+            ValueError: Unknown mode
+        """
+        if mode not in MODES:
+            raise ValueError(f"Unknown mode {mode} chose from {MODES}.")
+        input = input.strip()
+        if mode == "infer":
+            if input.isdigit():
+                mode = "ssaid"
+            elif (
+                input.count("-") == 3
+                and len(input) == 19
+                and all([x.isdigit() for x in input.split("-")])
+            ):
+                mode = "orcid"
+            else:
+                mode = "name"
+        if mode == "ssaid":
+            self.name = sch.get_author(input)._name
+            self.ssaid = input
+        elif mode == "orcid":
+            orcid_name = orcid_to_author_name(input)
+            self.orcid = input
+            self.ssaid, self.name = author_name_to_ssaid(orcid_name)
+        elif mode == "name":
+            self.name = input
+            self.ssaid, self.name = author_name_to_ssaid(input)
+        self.result = ResearcherResult(
+            name=self.name,
+            ssaid=int(self.ssaid),
+            orcid=self.orcid,
+            num_citations=-1,
+            num_references=-1,
+        )
+    async def _self_references_async(
+        self, verbose: bool = False
+    ) -> List[ReferenceResult]:
+        """Async version of self_references."""
+        if self.ssaid == "-1":
+            return []
+        if self.ssids == []:
+            self.ssids = await get_papers_for_author(self.ssaid)
+        results: List[ReferenceResult] = await self_references_paper(
+            self.ssids, verbose=verbose
+        )
+        # Remove papers with zero references or that are erratum/corrigendum
+        results = [
+            r
+            for r in results
+            if r.num_references > 0
+            and "erratum" not in r.title.lower()
+            and "corrigendum" not in r.title.lower()
+        ]
+        return results
+    def self_references(self, verbose: bool = False) -> ResearcherResult:
+        """
+        Sifts through all papers of a researcher and extracts the self references.
+        Args:
+            verbose: If True, logs detailed information for each paper.
+        Returns:
+            A ResearcherResult containing aggregated self-reference data.
+        """
+        reference_results = asyncio.run(self._self_references_async(verbose=verbose))
+        individual_self_references = {
+            getattr(result, "title"): getattr(result, "self_references").get(
+                self.name, 0.0
+            )
+            for result in reference_results
+        }
+        reference_ratio = sum(individual_self_references.values()) / max(
+            1, len(individual_self_references)
+        )
+        self.result = self.result.model_copy(
+            update={
+                "num_references": sum(r.num_references for r in reference_results),
+                "self_references": dict(
+                    sorted(
+                        individual_self_references.items(),
+                        key=lambda x: x[1],
+                        reverse=True,
+                    )
+                ),
+                "self_reference_ratio": round(reference_ratio, 3),
+            }
+        )
+        return self.result
+    async def _self_citations_async(
+        self, verbose: bool = False
+    ) -> List[CitationResult]:
+        """Async version of self_citations."""
+        if self.ssaid == "-1":
+            return []
+        if self.ssids == []:
+            self.ssids = await get_papers_for_author(self.ssaid)
+        results: List[CitationResult] = await self_citations_paper(
+            self.ssids, verbose=verbose
+        )
+        # Remove papers with zero references or that are erratum/corrigendum
+        results = [
+            r
+            for r in results
+            if r.num_citations > 0
+            and "erratum" not in r.title.lower()
+            and "corrigendum" not in r.title.lower()
+        ]
+        return results
+    def self_citations(self, verbose: bool = False) -> ResearcherResult:
+        """
+        Sifts through all papers of a researcher and finds how often they are self-cited.
+        """
+        citation_results = asyncio.run(self._self_citations_async(verbose=verbose))
+        individual_self_citations = {
+            getattr(result, "title"): getattr(result, "self_citations").get(
+                self.name, 0.0
+            )
+            for result in citation_results
+        }
+        citation_ratio = sum(individual_self_citations.values()) / max(
+            1, len(individual_self_citations)
+        )
+        self.result = self.result.model_copy(
+            update={
+                "num_citations": sum(r.num_citations for r in citation_results),
+                "self_citations": dict(
+                    sorted(
+                        individual_self_citations.items(),
+                        key=lambda x: x[1],
+                        reverse=True,
+                    )
+                ),
+                "self_citation_ratio": round(citation_ratio, 3),
+            }
+        )
+        return self.result
+    def get_result(self) -> ResearcherResult:
+        """
+        Provides the result of the analysis.
+        """
+        if not hasattr(self, "self_ref"):
+            self.self_references()
+        if not hasattr(self, "self_cite"):
+            self.self_citations()
+        return self.result
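
Note on the file above: with `mode="infer"`, the constructor classifies the input as a Semantic Scholar Author ID when it is all digits, as an ORCID iD when it has 19 characters in four hyphen-separated digit groups, and as a name otherwise. The standalone sketch below mirrors that check. `infer_mode` is a hypothetical function name, and the sample identifiers are illustrative.

```python
def infer_mode(identifier: str) -> str:
    """Classify an input string as "ssaid", "orcid", or "name"."""
    identifier = identifier.strip()
    if identifier.isdigit():
        return "ssaid"
    parts = identifier.split("-")
    # Three hyphens yield four groups; every group must be numeric.
    if len(parts) == 4 and len(identifier) == 19 and all(p.isdigit() for p in parts):
        return "orcid"
    return "name"


for sample in ["12345", "0000-0002-1825-0097", "Jannis Born"]:
    print(sample, "->", infer_mode(sample))
```

ORCID iDs may end in the checksum character `X`. Under the digit-only check shown here, such an iD would fall through to `name`, so passing `mode="orcid"` explicitly looks like the safer choice for those inputs.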

{paperscraper-0.3.3 → paperscraper-0.3.5}/paperscraper/citations/self_citations.py RENAMED

@@ -18,11 +18,13 @@ logging.getLogger("httpx").setLevel(logging.WARNING)
 class CitationResult(BaseModel):
     ssid: str  # semantic scholar paper id
+    title: str
     num_citations: int
     self_citations: Dict[str, float] = {}
     citation_score: float
+@retry_with_exponential_backoff(max_retries=14, base_delay=1.0)
 async def _fetch_citation_data(
     client: httpx.AsyncClient, suffix: str
 ) -> Dict[str, Any]:
@@ -87,6 +89,7 @@ async def _process_single(client: httpx.AsyncClient, identifier: str) -> Citatio
     return CitationResult(
         ssid=identifier,
+        title=paper.get("title", ""),
         num_citations=total_cites,
         self_citations=ratios,
         citation_score=avg_score,
@@ -94,7 +97,7 @@ async def _process_single(client: httpx.AsyncClient, identifier: str) -> Citatio
 @optional_async
-@retry_with_exponential_backoff(max_retries=4, base_delay=1.0)
+@retry_with_exponential_backoff(max_retries=10, base_delay=1.0)
 async def self_citations_paper(
     inputs: Union[str, List[str]], verbose: bool = False
 ) -> Union[CitationResult, List[CitationResult]]:
@@ -118,7 +121,7 @@ async def self_citations_paper(
     if verbose:
         for res in results:
             logger.info(
-                f'Self-citations in "{res.ssid}": N={res.num_citations}, Score={res.citation_score}%'
+                f'Self-citations in "{res.title}": N={res.num_citations}, Score={res.citation_score}%'
             )
             for author, pct in res.self_citations.items():
                 logger.info(f"  {author}: {pct}%")
