PyPI - pubmed-research-classifier - Versions diffs - 0.1.0__tar.gz - Mend

pubmed-research-classifier 0.1.0__tar.gz


Files changed (15)

pubmed_research_classifier-0.1.0/.gitignore ADDED

@@ -0,0 +1,30 @@
+# Byte-compiled / cache
+__pycache__/
+*.py[cod]
+*$py.class
+.Python
+.mypy_cache/
+.pytest_cache/
+.ruff_cache/
+# Virtual environments
+.venv/
+venv/
+env/
+# IDE
+.idea/
+.vscode/
+*.swp
+*.swo
+.DS_Store
+# Env / secrets
+.env
+.env.local
+# Large data files (JSONL candidate downloads — can be hundreds of MB)
+data/*.jsonl
+# Keep the curated labels file (small, valuable)
+!dataset/labels.csv

pubmed_research_classifier-0.1.0/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 EMBO
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

pubmed_research_classifier-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,224 @@
+Metadata-Version: 2.4
+Name: pubmed-research-classifier
+Version: 0.1.0
+Summary: Classify PubMed articles as research or non-research using a trained MLP + ModernBERT.
+Project-URL: Repository, https://github.com/embo-press/pubmed-research-classifier
+Author-email: Jorge Abreu <jorge.abreu@embo.org>
+License: MIT License
+        Copyright (c) 2026 EMBO
+        Permission is hereby granted, free of charge, to any person obtaining a copy
+        of this software and associated documentation files (the "Software"), to deal
+        in the Software without restriction, including without limitation the rights
+        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+        copies of the Software, and to permit persons to whom the Software is
+        furnished to do so, subject to the following conditions:
+        The above copyright notice and this permission notice shall be included in all
+        copies or substantial portions of the Software.
+        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+        SOFTWARE.
+License-File: LICENSE
+Keywords: classification,nlp,pubmed,scientometrics
+Classifier: Intended Audience :: Science/Research
+Classifier: Programming Language :: Python :: 3
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Requires-Python: >=3.9
+Requires-Dist: joblib>=1.3
+Requires-Dist: numpy>=1.24
+Requires-Dist: scikit-learn>=1.3
+Requires-Dist: torch>=2.0
+Provides-Extra: embed
+Requires-Dist: sentence-transformers>=3.0; extra == 'embed'
+Description-Content-Type: text/markdown
+# pubmed-research-classifier
+Classify PubMed articles as **research** or **non-research** using a trained
+MLP on top of [EMBO/ModernBERT-neg-sampling-PubMed](https://huggingface.co/EMBO/ModernBERT-neg-sampling-PubMed) embeddings.
+Model weights, StandardScaler, and publication-type vocabulary are bundled in
+the package — no external model downloads are needed for the embedding-mode API.
+## Installation
+```bash
+# Embedding mode only (no sentence-transformers required)
+pip install pubmed-research-classifier
+# Text mode (package embeds internally)
+pip install "pubmed-research-classifier[embed]"
+```
+## Quick start
+### Text mode
+```python
+from pubmed_research_classifier import classify
+result = classify({
+    "title": "Structural basis of CRISPR-Cas9 activity",
+    "abstract": "We report crystal structures of Cas9 ...",
+    "pub_types": ["Journal Article"],
+    "n_authors": 8,
+    "n_refs": 42,
+})
+# {"label": "research", "p_nr": 0.018}
+```
+### Embedding mode
+Pre-compute embeddings with `EMBO/ModernBERT-neg-sampling-PubMed`, passing
+`normalize_embeddings=True` to the sentence-transformers `encode` call so the
+vectors are L2-normalised, then pass them directly:
+```python
+from pubmed_research_classifier import classify
+import numpy as np
+result = classify({
+    "title_emb":     title_embedding,    # np.ndarray, shape (768,)
+    "abstract_emb":  abstract_embedding, # np.ndarray, shape (768,); zeros if absent
+    "has_abstract":  True,
+    "length_title":  52,
+    "length_abstract": 1240,
+    "pub_types":     ["Journal Article"],
+    "n_authors":     8,
+    "n_refs":        42,
+})
+```
+### Batch — millions of records
+```python
+results = classify(records, batch_size=128)
+# Returns a list in the same order as the input.
+```
+## Input fields
+| Field | Type | Mode | Notes |
+|---|---|---|---|
+| `title` | str | text | |
+| `abstract` | str or None | text | empty/None → treated as absent |
+| `title_emb` | array (768,) | embed | L2-normalised |
+| `abstract_emb` | array (768,) | embed | L2-normalised; zeros if absent |
+| `has_abstract` | bool | embed | |
+| `length_title` | int | embed | auto-derived from `title` in text mode |
+| `length_abstract` | int | embed | auto-derived from `abstract` in text mode |
+| `pub_types` | list[str] or str | both | PubMed PT tags; comma-sep string accepted |
+| `n_authors` | int | both | |
+| `n_refs` | int | both | |
+| `has_funding` | bool | both | optional; inferred from "Research Support" PTs if omitted |
+## Output
+```python
+{"label": "research",     "p_nr": 0.018}
+{"label": "non-research", "p_nr": 0.921}
+```
+`p_nr` is P(non-research) from the model.
+Default threshold: 0.75 (configurable via `classify(..., threshold=0.75)`).
+## Obtaining `has_funding` from PubMed XML
+`has_funding` is `True` when the article's PubMed XML record contains at least
+one `<Grant>` element inside a `<GrantList>`.  It is **not** the same as the
+"Research Support, …" publication type tags (those are a separate, coarser
+signal also used by the model via `pub_types`).
+If you fetch articles via the NCBI E-utilities API (efetch, XML format), you
+can extract it like this:
+```python
+import xml.etree.ElementTree as ET
+def has_funding_from_xml(article_xml: str) -> bool:
+    """Return True if the PubMed XML contains at least one <Grant> entry."""
+    root = ET.fromstring(article_xml)
+    return len(root.findall(".//Grant")) > 0
+```
+Or, if you are working with a parsed `xml.etree.ElementTree.Element` object
+(e.g. the `<PubmedArticle>` node returned by your ETL pipeline):
+```python
+has_funding = len(article_element.findall(".//Grant")) > 0
+```
+If you do not have access to the raw XML and only have the metadata fields,
+omit `has_funding` entirely — the package will fall back to checking whether
+any of the `pub_types` start with `"Research Support"`, which is a reasonable
+proxy and is already captured separately in the model's publication-type
+features.
+## Publishing a new version to PyPI
+The built artifacts live in `pubmed-research-classifier/dist/`.
+### Workflow for every new release
+1. **Update the model weights** — copy new `mlp_best.pt`, `scaler.joblib`,
+   and/or `mlp_config.json` into
+   `src/pubmed_research_classifier/_data/` and overwrite the old files.
+2. **Bump the version** in two places:
+   ```toml
+   # pyproject.toml
+   version = "0.2.0"
+   ```
+   ```python
+   # src/pubmed_research_classifier/__init__.py
+   __version__ = "0.2.0"
+   ```
+3. **Rebuild the wheel and sdist:**
+   ```bash
+   cd pubmed-research-classifier
+   pip install build          # first time only
+   python -m build
+   # produces dist/pubmed_research_classifier-0.2.0-py3-none-any.whl
+   #          and dist/pubmed_research_classifier-0.2.0.tar.gz
+   ```
+4. **Upload to PyPI:**
+   ```bash
+   pip install twine           # first time only
+   twine upload dist/pubmed_research_classifier-0.2.0*
+   # Username: __token__
+   # Password: <your PyPI API token>
+   ```
+   PyPI API tokens are managed at <https://pypi.org/manage/account/token/>.
+   Use a project-scoped token (not account-wide) for safety.
+5. **Verify the release:**
+   ```bash
+   pip install "pubmed-research-classifier==0.2.0" --force-reinstall
+   python -c "from pubmed_research_classifier import classify; print('ok')"
+   ```
+### First-time PyPI setup
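+If the package does not yet exist on PyPI, the first upload creates it
+automatically.  You will need a PyPI account and an account-scoped API token
+for that first upload, because a project-scoped token can only be created once
+the project exists; switch to a project-scoped token afterwards.  Test releases
+can go to <https://test.pypi.org> first: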
+```bash
+twine upload --repository testpypi dist/pubmed_research_classifier-0.1.0*
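+# TestPyPI may not host the dependencies, so fall back to PyPI for them.
+pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ pubmed-research-classifier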
+```
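
The package description above states that when `has_funding` is omitted, the package falls back to checking whether any `pub_types` entry starts with "Research Support", and that `pub_types` may be a list or a comma-separated string. The sketch below illustrates that rule in isolation and is not part of the released files. The helper names `normalize_pub_types` and `infer_has_funding` are hypothetical, and the real logic in `_inference` may differ.

```python
from typing import Iterable, List, Union

PubTypes = Union[str, Iterable[str], None]


def normalize_pub_types(pub_types: PubTypes) -> List[str]:
    """Return publication types as a clean list of strings."""
    if pub_types is None:
        return []
    if isinstance(pub_types, str):
        # Naive split: tags such as "Research Support, N.I.H., Extramural"
        # contain commas themselves and are broken into pieces here.
        parts = pub_types.split(",")
    else:
        parts = list(pub_types)
    return [p.strip() for p in parts if p and p.strip()]


def infer_has_funding(record: dict) -> bool:
    """Use an explicit has_funding if present, else the pub_types proxy."""
    explicit = record.get("has_funding")
    if explicit is not None:
        return bool(explicit)
    tags = normalize_pub_types(record.get("pub_types"))
    # Assumed proxy rule: any tag beginning with "Research Support".
    return any(tag.startswith("Research Support") for tag in tags)
```

Because the proxy only looks at the start of each tag, it still fires when a comma-separated string splits a "Research Support, ..." tag into pieces. An explicit `has_funding` value always wins over the proxy in this sketch.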

pubmed_research_classifier-0.1.0/README.md ADDED

@@ -0,0 +1,183 @@
+# pubmed-research-classifier
+Classify PubMed articles as **research** or **non-research** using a trained
+MLP on top of [EMBO/ModernBERT-neg-sampling-PubMed](https://huggingface.co/EMBO/ModernBERT-neg-sampling-PubMed) embeddings.
+Model weights, StandardScaler, and publication-type vocabulary are bundled in
+the package — no external model downloads are needed for the embedding-mode API.
+## Installation
+```bash
+# Embedding mode only (no sentence-transformers required)
+pip install pubmed-research-classifier
+# Text mode (package embeds internally)
+pip install "pubmed-research-classifier[embed]"
+```
+## Quick start
+### Text mode
+```python
+from pubmed_research_classifier import classify
+result = classify({
+    "title": "Structural basis of CRISPR-Cas9 activity",
+    "abstract": "We report crystal structures of Cas9 ...",
+    "pub_types": ["Journal Article"],
+    "n_authors": 8,
+    "n_refs": 42,
+})
+# {"label": "research", "p_nr": 0.018}
+```
+### Embedding mode
+Pre-compute embeddings with `EMBO/ModernBERT-neg-sampling-PubMed`, passing
+`normalize_embeddings=True` to the sentence-transformers `encode` call so the
+vectors are L2-normalised, then pass them directly:
+```python
+from pubmed_research_classifier import classify
+import numpy as np
+result = classify({
+    "title_emb":     title_embedding,    # np.ndarray, shape (768,)
+    "abstract_emb":  abstract_embedding, # np.ndarray, shape (768,); zeros if absent
+    "has_abstract":  True,
+    "length_title":  52,
+    "length_abstract": 1240,
+    "pub_types":     ["Journal Article"],
+    "n_authors":     8,
+    "n_refs":        42,
+})
+```
+### Batch — millions of records
+```python
+results = classify(records, batch_size=128)
+# Returns a list in the same order as the input.
+```
+## Input fields
+| Field | Type | Mode | Notes |
+|---|---|---|---|
+| `title` | str | text | |
+| `abstract` | str or None | text | empty/None → treated as absent |
+| `title_emb` | array (768,) | embed | L2-normalised |
+| `abstract_emb` | array (768,) | embed | L2-normalised; zeros if absent |
+| `has_abstract` | bool | embed | |
+| `length_title` | int | embed | auto-derived from `title` in text mode |
+| `length_abstract` | int | embed | auto-derived from `abstract` in text mode |
+| `pub_types` | list[str] or str | both | PubMed PT tags; comma-sep string accepted |
+| `n_authors` | int | both | |
+| `n_refs` | int | both | |
+| `has_funding` | bool | both | optional; inferred from "Research Support" PTs if omitted |
+## Output
+```python
+{"label": "research",     "p_nr": 0.018}
+{"label": "non-research", "p_nr": 0.921}
+```
+`p_nr` is P(non-research) from the model.
+Default threshold: 0.75 (configurable via `classify(..., threshold=0.75)`).
+## Obtaining `has_funding` from PubMed XML
+`has_funding` is `True` when the article's PubMed XML record contains at least
+one `<Grant>` element inside a `<GrantList>`.  It is **not** the same as the
+"Research Support, …" publication type tags (those are a separate, coarser
+signal also used by the model via `pub_types`).
+If you fetch articles via the NCBI E-utilities API (efetch, XML format), you
+can extract it like this:
+```python
+import xml.etree.ElementTree as ET
+def has_funding_from_xml(article_xml: str) -> bool:
+    """Return True if the PubMed XML contains at least one <Grant> entry."""
+    root = ET.fromstring(article_xml)
+    return len(root.findall(".//Grant")) > 0
+```
+Or, if you are working with a parsed `xml.etree.ElementTree.Element` object
+(e.g. the `<PubmedArticle>` node returned by your ETL pipeline):
+```python
+has_funding = len(article_element.findall(".//Grant")) > 0
+```
+If you do not have access to the raw XML and only have the metadata fields,
+omit `has_funding` entirely — the package will fall back to checking whether
+any of the `pub_types` start with `"Research Support"`, which is a reasonable
+proxy and is already captured separately in the model's publication-type
+features.
+## Publishing a new version to PyPI
+The built artifacts live in `pubmed-research-classifier/dist/`.
+### Workflow for every new release
+1. **Update the model weights** — copy new `mlp_best.pt`, `scaler.joblib`,
+   and/or `mlp_config.json` into
+   `src/pubmed_research_classifier/_data/` and overwrite the old files.
+2. **Bump the version** in two places:
+   ```toml
+   # pyproject.toml
+   version = "0.2.0"
+   ```
+   ```python
+   # src/pubmed_research_classifier/__init__.py
+   __version__ = "0.2.0"
+   ```
+3. **Rebuild the wheel and sdist:**
+   ```bash
+   cd pubmed-research-classifier
+   pip install build          # first time only
+   python -m build
+   # produces dist/pubmed_research_classifier-0.2.0-py3-none-any.whl
+   #          and dist/pubmed_research_classifier-0.2.0.tar.gz
+   ```
+4. **Upload to PyPI:**
+   ```bash
+   pip install twine           # first time only
+   twine upload dist/pubmed_research_classifier-0.2.0*
+   # Username: __token__
+   # Password: <your PyPI API token>
+   ```
+   PyPI API tokens are managed at <https://pypi.org/manage/account/token/>.
+   Use a project-scoped token (not account-wide) for safety.
+5. **Verify the release:**
+   ```bash
+   pip install "pubmed-research-classifier==0.2.0" --force-reinstall
+   python -c "from pubmed_research_classifier import classify; print('ok')"
+   ```
+### First-time PyPI setup
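+If the package does not yet exist on PyPI, the first upload creates it
+automatically.  You will need a PyPI account and an account-scoped API token
+for that first upload, because a project-scoped token can only be created once
+the project exists; switch to a project-scoped token afterwards.  Test releases
+can go to <https://test.pypi.org> first: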
+```bash
+twine upload --repository testpypi dist/pubmed_research_classifier-0.1.0*
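+# TestPyPI may not host the dependencies, so fall back to PyPI for them.
+pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ pubmed-research-classifier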
+```
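
The README's `has_funding` helper handles one article at a time, while an efetch response usually carries many `<PubmedArticle>` nodes. The sketch below, which is not part of the released files, maps each PMID to its funding flag in one pass. The function name `funding_flags_by_pmid` is hypothetical; the element paths follow the standard PubMed XML layout (`MedlineCitation/PMID`, `GrantList/Grant`).

```python
import xml.etree.ElementTree as ET
from typing import Dict


def funding_flags_by_pmid(efetch_xml: str) -> Dict[str, bool]:
    """Map each PMID in an efetch XML response to its has_funding flag."""
    root = ET.fromstring(efetch_xml)
    # A single-article document may have <PubmedArticle> as its root.
    if root.tag == "PubmedArticle":
        articles = [root]
    else:
        articles = root.findall(".//PubmedArticle")

    flags: Dict[str, bool] = {}
    for article in articles:
        pmid = article.findtext("MedlineCitation/PMID")
        if pmid is None:
            continue  # skip malformed records without a PMID
        # True when at least one <Grant> sits inside a <GrantList>.
        flags[pmid.strip()] = article.find(".//GrantList/Grant") is not None
    return flags
```

The resulting dictionary can be joined onto the records by PMID before calling `classify`, so that `has_funding` is set explicitly rather than inferred from `pub_types`.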

pubmed_research_classifier-0.1.0/pyproject.toml ADDED

@@ -0,0 +1,42 @@
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+[project]
+name = "pubmed-research-classifier"
+version = "0.1.0"
+description = "Classify PubMed articles as research or non-research using a trained MLP + ModernBERT."
+readme = "README.md"
+license = { file = "LICENSE" }
+requires-python = ">=3.9"
+authors = [{ name = "Jorge Abreu", email = "jorge.abreu@embo.org" }]
+keywords = ["pubmed", "nlp", "scientometrics", "classification"]
+classifiers = [
+    "Programming Language :: Python :: 3",
+    "Intended Audience :: Science/Research",
+    "Topic :: Scientific/Engineering :: Artificial Intelligence",
+]
+dependencies = [
+    "torch>=2.0",
+    "numpy>=1.24",
+    "scikit-learn>=1.3",
+    "joblib>=1.3",
+]
+[project.optional-dependencies]
+embed = ["sentence-transformers>=3.0"]
+[project.urls]
+Repository = "https://github.com/embo-press/pubmed-research-classifier"
+[tool.pytest.ini_options]
+testpaths = ["tests"]
+[tool.hatch.build.targets.wheel]
+packages = ["src/pubmed_research_classifier"]
+# Ensure binary and JSON data files inside the package are included.
+artifacts = [
+    "*.pt",
+    "*.joblib",
+    "*.json",
+]

pubmed_research_classifier-0.1.0/src/pubmed_research_classifier/__init__.py ADDED

@@ -0,0 +1,52 @@
+"""pubmed-research-classifier — classify PubMed articles as research or non-research.
+Quick start
+-----------
+Text mode (package embeds internally — requires the ``embed`` extra)::
+    pip install pubmed-research-classifier[embed]
+    from pubmed_research_classifier import classify
+    result = classify({
+        "title": "Structural basis of CRISPR-Cas9 activity",
+        "abstract": "We report crystal structures ...",
+        "pub_types": ["Journal Article"],
+        "n_authors": 8,
+        "n_refs": 42,
+    })
+    # {"label": "research", "p_nr": 0.018}
+Embedding mode (pre-compute with EMBO/ModernBERT-neg-sampling-PubMed)::
+    pip install pubmed-research-classifier
+    from pubmed_research_classifier import classify
+    import numpy as np
+    result = classify({
+        "title_emb": np.zeros(768),     # placeholder; real vectors have shape (768,) and are L2-normalised
+        "abstract_emb": np.zeros(768),
+        "has_abstract": True,
+        "length_title": 52,
+        "length_abstract": 1240,
+        "pub_types": ["Journal Article"],
+        "n_authors": 8,
+        "n_refs": 42,
+    })
+Batch (millions of records)::
+    results = classify(list_of_records, batch_size=128)
+Model
+-----
+Trained MLP + ModernBERT embeddings (EMBO/ModernBERT-neg-sampling-PubMed).
+Weights, scaler, and vocabulary are bundled in the package — no external
+downloads required (except for the embedding model in text mode).
+"""
+from ._inference import classify_records as classify
+__all__ = ["classify"]
+__version__ = "0.1.0"
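
The docstring above shows results of the form `{"label": ..., "p_nr": ...}`, and the README in this release gives a default threshold of 0.75 on `p_nr`. Since `p_nr` is returned alongside the label, stored results can be re-labelled at a different threshold without re-running the model. The sketch below is not part of the released files. The function names are hypothetical, and treating `p_nr == threshold` as non-research is an assumption, since the diff does not show how the package handles the boundary.

```python
from typing import Dict, Iterable, List, Union

Result = Dict[str, Union[str, float]]


def label_from_p_nr(p_nr: float, threshold: float = 0.75) -> str:
    """Turn P(non-research) into a label at the given threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    # Assumption: the boundary case counts as non-research.
    return "non-research" if p_nr >= threshold else "research"


def relabel(results: Iterable[Result], threshold: float) -> List[Result]:
    """Re-label stored results without calling the classifier again."""
    return [
        {"label": label_from_p_nr(float(r["p_nr"]), threshold), "p_nr": r["p_nr"]}
        for r in results
    ]
```

A lower threshold labels more records as non-research, and a higher one labels fewer; which direction is preferable depends on the cost of each kind of error in the downstream use.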

pubmed_research_classifier-0.1.0/src/pubmed_research_classifier/_data/__init__.py ADDED

File without changes

pubmed_research_classifier-0.1.0/src/pubmed_research_classifier/_data/mlp_best.pt ADDED

Binary file

pubmed_research_classifier-0.1.0/src/pubmed_research_classifier/_data/mlp_config.json ADDED

@@ -0,0 +1,14 @@
+{
+  "n_scalars": 200,
+  "use_funding": true,
+  "use_pub_types": true,
+  "proj_dim": 64,
+  "hidden_dims": [
+    128,
+    32,
+    16
+  ],
+  "dropout": 0.2,
+  "use_title_emb": true,
+  "use_abstract_emb": true
+}
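
Because the release workflow allows `mlp_config.json` to be overwritten with each new model, a typed loader can catch a malformed file before it ships. The sketch below is not part of the released files. It mirrors only the keys visible in this diff; the class name `MLPConfig` and the sanity checks are hypothetical, and nothing here describes how the package itself reads the file.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MLPConfig:
    n_scalars: int
    use_funding: bool
    use_pub_types: bool
    proj_dim: int
    hidden_dims: List[int]
    dropout: float
    use_title_emb: bool
    use_abstract_emb: bool

    @classmethod
    def from_json(cls, text: str) -> "MLPConfig":
        """Parse the JSON text and apply basic sanity checks."""
        raw = json.loads(text)
        # Raises TypeError if a key is missing or an unknown key is present.
        config = cls(**raw)
        if not 0.0 <= config.dropout < 1.0:
            raise ValueError("dropout must lie in [0, 1)")
        sizes = [config.n_scalars, config.proj_dim, *config.hidden_dims]
        if any(size <= 0 for size in sizes):
            raise ValueError("layer and feature sizes must be positive")
        return config
```

Since `_data` ships with an `__init__.py`, the bundled file could plausibly be read with `importlib.resources.files("pubmed_research_classifier._data").joinpath("mlp_config.json").read_text()` and passed to `MLPConfig.from_json`.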