pubmatrixpython 0.2.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- pubmatrixpython-0.2.0/.github/workflows/publish.yml +42 -0
- pubmatrixpython-0.2.0/.gitignore +26 -0
- pubmatrixpython-0.2.0/.python-version +1 -0
- pubmatrixpython-0.2.0/CHANGELOG.md +28 -0
- pubmatrixpython-0.2.0/LICENSE +2 -0
- pubmatrixpython-0.2.0/LICENSE.md +21 -0
- pubmatrixpython-0.2.0/PKG-INFO +300 -0
- pubmatrixpython-0.2.0/README.md +267 -0
- pubmatrixpython-0.2.0/docs/performance.md +30 -0
- pubmatrixpython-0.2.0/docs/troubleshooting.md +27 -0
- pubmatrixpython-0.2.0/notebooks/01_pubmatrix.ipynb +660 -0
- pubmatrixpython-0.2.0/notebooks/02_example_wnt.ipynb +348 -0
- pubmatrixpython-0.2.0/notebooks/heatmap_output.png +0 -0
- pubmatrixpython-0.2.0/notebooks/output.csv +4 -0
- pubmatrixpython-0.2.0/notebooks/wnt_obesity_matrix.csv +8 -0
- pubmatrixpython-0.2.0/pubmatrix/__init__.py +9 -0
- pubmatrixpython-0.2.0/pubmatrix/core.py +411 -0
- pubmatrixpython-0.2.0/pubmatrix/heatmap.py +213 -0
- pubmatrixpython-0.2.0/pyproject.toml +56 -0
- pubmatrixpython-0.2.0/tests/__init__.py +0 -0
- pubmatrixpython-0.2.0/tests/conftest.py +52 -0
- pubmatrixpython-0.2.0/tests/test_core.py +234 -0
- pubmatrixpython-0.2.0/tests/test_heatmap.py +195 -0
@@ -0,0 +1,42 @@
+name: Publish to PyPI
+
+on:
+  push:
+    tags:
+      - "v*"
+
+jobs:
+  build:
+    name: Build distribution
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v6
+        with:
+          persist-credentials: false
+      - name: Install uv
+        uses: astral-sh/setup-uv@v3
+      - name: Build sdist and wheel
+        run: uv build
+      - name: Store distribution packages
+        uses: actions/upload-artifact@v5
+        with:
+          name: python-package-distributions
+          path: dist/
+
+  publish-to-pypi:
+    name: Publish to PyPI
+    needs: build
+    runs-on: ubuntu-latest
+    environment:
+      name: pypi
+      url: https://pypi.org/p/pubmatrixpython
+    permissions:
+      id-token: write
+    steps:
+      - name: Download distributions
+        uses: actions/download-artifact@v6
+        with:
+          name: python-package-distributions
+          path: dist/
+      - name: Publish to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1
@@ -0,0 +1,26 @@
+# Python-generated files
+__pycache__/
+*.py[oc]
+build/
+dist/
+wheels/
+*.egg-info
+
+# Virtual environments
+.venv
+
+# Jupyter
+.ipynb_checkpoints/
+
+# Environment
+.env
+
+# uv lock file
+uv.lock
+
+# dev / local-only notebooks
+dev/
+
+# Status bank files
+CLAUDE*.md
+.claude/
@@ -0,0 +1 @@
+3.13
@@ -0,0 +1,28 @@
+# Changelog
+
+## [0.2.0] — 2026-06-03
+
+- `n_workers` parameter for concurrent NCBI queries (respects rate limits automatically)
+- `cache_dir` parameter to cache query results to disk and skip redundant requests
+- `timeout` parameter exposed on `pubmatrix()` (was hardcoded at 30 s)
+- `plot_pubmatrix_heatmap()` now returns `(fig, ax)` tuple instead of `ax` alone
+- Added `show` parameter to `plot_pubmatrix_heatmap()`; `plt.show()` no longer called automatically
+- Replaced `print()` with `logging` throughout — output can now be suppressed or redirected
+- Narrowed exception handling in `_fetch_count()` to `requests.RequestException`
+- Fixed float-unsafe clustering check (`np.allclose` instead of `==`)
+- `n_tries` and `n_workers` now validated with clear error messages
+- `odfpy` moved to optional extra: `pip install pubmatrixpython[ods]`
+- Requires Python ≥ 3.10 (relaxed from 3.13)
+- Full test suite: 60 tests covering core queries, XML parsing, caching, retry logic, and heatmap rendering
+
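Reviewer note on the `logging` entry in 0.2.0: the snippet below is a minimal sketch of how a caller might suppress or capture that output. It assumes the package logs under a logger named `pubmatrix` (the usual result of `logging.getLogger(__name__)`); the diff does not show the real logger name, and `capture_logs` is a hypothetical helper that emits stand-in messages rather than calling the library.

```python
import io
import logging

# Assumption: the library's logger is called "pubmatrix"; the real name may differ.
logger = logging.getLogger("pubmatrix")


def capture_logs(level: int) -> str:
    """Emit one INFO and one WARNING record and return whatever passed the level filter."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setFormatter(logging.Formatter("%(levelname)s:%(message)s"))
    logger.addHandler(handler)
    logger.setLevel(level)
    logger.propagate = False  # keep the demo output out of the root logger
    try:
        logger.info("querying 9 term pairs")        # stand-in for progress output
        logger.warning("request failed, retrying")  # stand-in for a retry notice
    finally:
        logger.removeHandler(handler)
    return buffer.getvalue()


verbose = capture_logs(logging.INFO)     # both records get through
quiet = capture_logs(logging.WARNING)    # the INFO record is filtered out
```

Raising the level on the package logger is usually enough to silence routine messages; attaching a handler, as the sketch does, redirects them to a file or buffer instead.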
+## [0.1.0] — 2026-03-28
+
+Initial release. Python port of [PubMatrixR](https://github.com/ToledoEM/PubMatrixR-v2).
+
+- `pubmatrix()` — pairwise PubMed/PMC co-occurrence queries with progress bar
+- `pubmatrix_from_file()` — load term lists from a plain-text file
+- `plot_pubmatrix_heatmap()` — heatmap with optional clustering, custom colours, PNG export
+- `pubmatrix_heatmap()` — quick wrapper with defaults
+- Date range filtering via `daterange`
+- CSV and ODS export with PubMed hyperlinks
+- NCBI API key support for higher rate limits
@@ -0,0 +1,21 @@
+# MIT License
+
+Copyright (c) 2026 Enrique Toledo
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -0,0 +1,300 @@
+Metadata-Version: 2.4
+Name: pubmatrixpython
+Version: 0.2.0
+Summary: Python port of PubMatrixR — systematic literature co-occurrence analysis via NCBI PubMed
+Project-URL: Homepage, https://toledoem.github.io/pubmatrixp/
+Project-URL: Repository, https://github.com/ToledoEM/PubMatrixPython
+Project-URL: Changelog, https://github.com/ToledoEM/PubMatrixPython/blob/main/CHANGELOG.md
+Author-email: Enrique Toledo <enriquetoledo@gmail.com>
+License-Expression: MIT
+License-File: LICENSE
+License-File: LICENSE.md
+Keywords: bioinformatics,co-occurrence,literature-mining,ncbi,pubmed
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
+Classifier: Topic :: Scientific/Engineering :: Information Analysis
+Requires-Python: >=3.10
+Requires-Dist: matplotlib<4,>=3.10
+Requires-Dist: pandas<4,>=2.0
+Requires-Dist: requests<3,>=2.33
+Requires-Dist: scipy<2,>=1.10
+Requires-Dist: seaborn<1,>=0.13
+Requires-Dist: tqdm<5,>=4.60
+Provides-Extra: ods
+Requires-Dist: odfpy>=1.4.1; extra == 'ods'
+Description-Content-Type: text/markdown
+
+# PubMatrixPython v0.2
+
+<img src="https://toledoem.github.io/img/LogoPubmatrixP.png" align="right" width="150"/>
+
+
+
+
+
+Python port of the [PubMatrixR](https://github.com/ToledoEM/PubMatrixR-v2) R package.
+
+For every pair of search terms `(A, B)`, it counts how many PubMed or PMC publications mention both. Good for mapping relationships between genes, diseases, and pathways across the literature.
+
+Based on: Becker et al. (2003) *PubMatrix: a tool for multiplex literature mining*. BMC Bioinformatics 4:61. https://doi.org/10.1186/1471-2105-4-61
+
+---
+
+## Key features
+
+- **Pairwise literature search** — automatically searches every combination of terms from two lists
+- **PubMed or PMC** — query MEDLINE abstracts or PMC full text via NCBI E-utilities
+- **Heatmap visualisation** — overlap-percentage heatmaps with optional hierarchical clustering
+- **Export to CSV or ODS** — results include clickable hyperlinks to the matching PubMed search
+- **Date filtering** — restrict searches to a publication year range
+- **Flexible input** — pass term lists directly, or load them from a text file
+- **Concurrency** — `n_workers` for parallel queries, respecting NCBI rate limits
+- **Disk caching** — `cache_dir` persists query results between runs
+- **Progress tracking** — built-in progress bar for long searches
+
+## Use cases
+
+- **Gene–disease association studies** — explore literature connections between genes and diseases
+- **Pathway analysis** — investigate co-occurrence of genes within or across biological pathways
+- **Drug–target research** — analyse relationships between compounds and potential targets
+- **Systematic literature reviews** — quantify research coverage across multiple topics
+- **Knowledge gap identification** — find under-researched combinations of terms
+- **Bibliometric analysis** — measure research activity in a domain over time
+
+---
+
+## Setup
+
+Requires [uv](https://docs.astral.sh/uv/). Install it with:
+
+```bash
+curl -LsSf https://astral.sh/uv/install.sh | sh
+```
+
+Clone and install dependencies:
+
+```bash
+git clone <repo-url>
+cd PubMatrixPython
+uv sync --all-groups
+```
+
+---
+
+## Running the notebooks
+
+All `uv` commands must be run from the **project root** (`PubMatrixPython/`), where `pyproject.toml` lives.
+
+```bash
+cd /path/to/PubMatrixPython
+uv run jupyter lab
+```
+
+Then open any notebook from the `notebooks/` folder in the browser.
+
+| Notebook | What it covers |
+|----------|---------------|
+| `01_pubmatrix.ipynb` | Basic queries, date filtering, PMC database, file input, CSV export, heatmap visualisation |
+| `02_example_wnt.ipynb` | Full worked example: WNT genes × obesity genes |
+
+---
+
+## Quick start (script or REPL)
+
+### Interactive REPL
+
+```bash
+uv run python
+```
+
+```python
+from pubmatrix import pubmatrix, plot_pubmatrix_heatmap
+
+A = ["WNT1", "WNT2", "CTNNB1"]
+B = ["obesity", "diabetes", "cancer"]
+
+result = pubmatrix(A=A, B=B)
+print(result)
+
+plot_pubmatrix_heatmap(result, title="WNT × Disease")
+```
+
+### Running a script
+
+Create a file `my_analysis.py`:
+
+```python
+from pubmatrix import pubmatrix, plot_pubmatrix_heatmap
+
+A = ["WNT1", "WNT2", "WNT3A", "WNT5A", "CTNNB1"]
+B = ["obesity", "diabetes", "cancer", "inflammation"]
+
+result = pubmatrix(
+    A=A,
+    B=B,
+    database="pubmed",
+    daterange=[2010, 2024],  # optional date filter
+    outfile="results",
+    export_format="csv",  # saves results_result.csv with PubMed hyperlinks
+)
+
+print(result)
+
+plot_pubmatrix_heatmap(
+    result,
+    title="WNT Genes × Disease",
+    filename="heatmap.png",  # saves to file instead of displaying
+)
+```
+
+Run it with:
+
+```bash
+uv run python my_analysis.py
+```
+
+### Loading terms from a file
+
+Create `terms.txt`:
+
+```
+WNT1
+WNT2
+CTNNB1
+#
+obesity
+diabetes
+cancer
+```
+
+```python
+from pubmatrix import pubmatrix_from_file
+
+result = pubmatrix_from_file("terms.txt")
+print(result)
+```
+
+```bash
+uv run python my_analysis.py
+```
+
+---
+
+## API reference
+
+### `pubmatrix(A, B, ...)`
+
+Query PubMed and return a `pandas.DataFrame` (rows = B, cols = A).
+
+```python
+pubmatrix(
+    A,  # list of str — column terms
+    B,  # list of str — row terms
+    api_key=None,  # NCBI API key (10 req/s vs 3 req/s default)
+    database="pubmed",  # "pubmed" or "pmc"
+    daterange=None,  # e.g. [2015, 2024]
+    outfile=None,  # base filename for export
+    export_format=None,  # None | "csv" | "ods"
+    n_tries=2,  # retries on network failure
+    n_workers=1,  # parallel workers for concurrent queries
+    timeout=30,  # HTTP request timeout in seconds
+    cache_dir=None,  # directory to cache query results on disk
+)
+```
+
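Reviewer note on `n_tries`: the signature comment and the changelog mention retries on network failure and validation "with clear error messages", but the implementation is not part of this excerpt. The sketch below shows one plausible shape under stated assumptions: `fetch_with_retries` is a hypothetical helper, `n_tries` is treated as the total number of attempts, and `OSError` stands in for `requests.RequestException` so the example stays stdlib-only.

```python
def fetch_with_retries(fetch, n_tries=2, retry_on=(OSError,)):
    """Call fetch() up to n_tries times and re-raise the last failure.

    Assumptions: n_tries counts total attempts, and only the exception
    types listed in retry_on trigger another attempt.
    """
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(n_tries, bool) or not isinstance(n_tries, int) or n_tries < 1:
        raise ValueError(f"n_tries must be a positive integer, got {n_tries!r}")
    for attempt in range(1, n_tries + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == n_tries:
                raise  # out of attempts: surface the original error


attempts = []


def flaky():
    """Simulated request: fails on the first call, succeeds on the second."""
    attempts.append(len(attempts) + 1)
    if len(attempts) == 1:
        raise ConnectionError("simulated network failure")
    return 7


count = fetch_with_retries(flaky, n_tries=2)
```

Catching a narrow exception family rather than a bare `except` means programming errors such as a `TypeError` in parsing code still fail loudly instead of being retried.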
+### `pubmatrix_from_file(filepath, ...)`
+
+Load terms from a plain-text file and run `pubmatrix()`.
+
+File format:
+```
+WNT1
+WNT2
+#
+obesity
+diabetes
+```
+
+```python
+result = pubmatrix_from_file("terms.txt", database="pubmed")
+```
+
+### `plot_pubmatrix_heatmap(matrix, ...)`
+
+Heatmap of overlap percentages with optional hierarchical clustering. Returns `(fig, ax)`.
+
+```python
+fig, ax = plot_pubmatrix_heatmap(
+    matrix,  # DataFrame from pubmatrix()
+    title="PubMatrix Co-occurrence Heatmap",
+    cluster_rows=True,
+    cluster_cols=True,
+    show_numbers=True,
+    color_palette=None,  # list of hex colours
+    filename=None,  # save to PNG if set
+    width=10, height=8,
+    scale_font=True,
+    show=False,  # call plt.show() after plotting
+)
+```
+
+### `pubmatrix_heatmap(matrix, title=...)`
+
+Quick wrapper around `plot_pubmatrix_heatmap()` with all defaults. Returns `(fig, ax)`.
+
+---
+
+## Output files
+
+When `outfile` and `export_format` are set, results are written to
+`{outfile}_result.{extension}` (`.csv` or `.ods`). Each cell contains the
+publication count and a hyperlink to the matching PubMed search. Row names
+come from `B`, column names from `A`.
+
+ODS export requires the optional `odfpy` dependency:
+
+```bash
+pip install pubmatrixpython[ods]
+```
+
+---
+
+## NCBI API key
+
+Without a key: 3 requests/second. With a key: 10 requests/second.
+Get one at https://account.ncbi.nlm.nih.gov/
+
+```python
+result = pubmatrix(A=A, B=B, api_key="YOUR_KEY_HERE")
+```
+
+---
+
+## More documentation
+
+- [Performance notes](docs/performance.md) — rate limits, caching, concurrency
+- [Troubleshooting](docs/troubleshooting.md) — empty results, rate limiting, slow searches
+- [Full reference notebook](https://toledoem.github.io/pubmatrixp/) — every parameter and feature, with output
+
+---
+
+## License & citation
+
+This project is licensed under the MIT License — see [`LICENSE.md`](LICENSE.md).
+
+If you use PubMatrixPython in your research, please cite:
+
+> Becker KG, Hosack DA, Dennis G Jr, Lempicki RA, Bright TJ, Cheadle C, Engel J.
+> *PubMatrix: a tool for multiplex literature mining.*
+> BMC Bioinformatics. 2003 Dec 10;4:61. https://doi.org/10.1186/1471-2105-4-61
+
+**Developers:**
+- Tyler Laird (Author, original PubMatrixR)
+- Enrique Toledo (Author, maintainer)
@@ -0,0 +1,267 @@
+# PubMatrixPython v0.2
+
+<img src="https://toledoem.github.io/img/LogoPubmatrixP.png" align="right" width="150"/>
+
+
+
+
+
+Python port of the [PubMatrixR](https://github.com/ToledoEM/PubMatrixR-v2) R package.
+
+For every pair of search terms `(A, B)`, it counts how many PubMed or PMC publications mention both. Good for mapping relationships between genes, diseases, and pathways across the literature.
+
+Based on: Becker et al. (2003) *PubMatrix: a tool for multiplex literature mining*. BMC Bioinformatics 4:61. https://doi.org/10.1186/1471-2105-4-61
+
+---
+
+## Key features
+
+- **Pairwise literature search** — automatically searches every combination of terms from two lists
+- **PubMed or PMC** — query MEDLINE abstracts or PMC full text via NCBI E-utilities
+- **Heatmap visualisation** — overlap-percentage heatmaps with optional hierarchical clustering
+- **Export to CSV or ODS** — results include clickable hyperlinks to the matching PubMed search
+- **Date filtering** — restrict searches to a publication year range
+- **Flexible input** — pass term lists directly, or load them from a text file
+- **Concurrency** — `n_workers` for parallel queries, respecting NCBI rate limits
+- **Disk caching** — `cache_dir` persists query results between runs
+- **Progress tracking** — built-in progress bar for long searches
+
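Reviewer note on the pairwise search described in the feature list: a minimal sketch of how "every combination of terms from two lists" could be turned into NCBI E-utilities ESearch requests. The endpoint and the `db`, `term`, `rettype=count` and `api_key` parameters are part of the public E-utilities interface; `build_queries` is a hypothetical helper, and joining the two terms with `AND` is an assumption about how the library phrases its queries. No request is sent.

```python
from itertools import product
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint; rettype=count asks for the hit count only.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_queries(A, B, database="pubmed", api_key=None):
    """Return {(a, b): url} for every pair drawn from the two term lists.

    Hypothetical helper: the "a AND b" phrasing is an assumption.
    """
    queries = {}
    for a, b in product(A, B):
        params = {"db": database, "term": f"{a} AND {b}", "rettype": "count"}
        if api_key:
            params["api_key"] = api_key
        queries[(a, b)] = f"{ESEARCH_URL}?{urlencode(params)}"
    return queries


# Two column terms and three row terms give 2 x 3 = 6 queries.
queries = build_queries(["WNT1", "WNT2"], ["obesity", "diabetes", "cancer"])
```

The number of requests grows as `len(A) * len(B)`, which is why the rate limits, caching and concurrency options matter for larger term lists.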
+## Use cases
+
+- **Gene–disease association studies** — explore literature connections between genes and diseases
+- **Pathway analysis** — investigate co-occurrence of genes within or across biological pathways
+- **Drug–target research** — analyse relationships between compounds and potential targets
+- **Systematic literature reviews** — quantify research coverage across multiple topics
+- **Knowledge gap identification** — find under-researched combinations of terms
+- **Bibliometric analysis** — measure research activity in a domain over time
+
+---
+
+## Setup
+
+Requires [uv](https://docs.astral.sh/uv/). Install it with:
+
+```bash
+curl -LsSf https://astral.sh/uv/install.sh | sh
+```
+
+Clone and install dependencies:
+
+```bash
+git clone <repo-url>
+cd PubMatrixPython
+uv sync --all-groups
+```
+
+---
+
+## Running the notebooks
+
+All `uv` commands must be run from the **project root** (`PubMatrixPython/`), where `pyproject.toml` lives.
+
+```bash
+cd /path/to/PubMatrixPython
+uv run jupyter lab
+```
+
+Then open any notebook from the `notebooks/` folder in the browser.
+
+| Notebook | What it covers |
+|----------|---------------|
+| `01_pubmatrix.ipynb` | Basic queries, date filtering, PMC database, file input, CSV export, heatmap visualisation |
+| `02_example_wnt.ipynb` | Full worked example: WNT genes × obesity genes |
+
+---
+
+## Quick start (script or REPL)
+
+### Interactive REPL
+
+```bash
+uv run python
+```
+
+```python
+from pubmatrix import pubmatrix, plot_pubmatrix_heatmap
+
+A = ["WNT1", "WNT2", "CTNNB1"]
+B = ["obesity", "diabetes", "cancer"]
+
+result = pubmatrix(A=A, B=B)
+print(result)
+
+plot_pubmatrix_heatmap(result, title="WNT × Disease")
+```
+
+### Running a script
+
+Create a file `my_analysis.py`:
+
+```python
+from pubmatrix import pubmatrix, plot_pubmatrix_heatmap
+
+A = ["WNT1", "WNT2", "WNT3A", "WNT5A", "CTNNB1"]
+B = ["obesity", "diabetes", "cancer", "inflammation"]
+
+result = pubmatrix(
+    A=A,
+    B=B,
+    database="pubmed",
+    daterange=[2010, 2024],  # optional date filter
+    outfile="results",
+    export_format="csv",  # saves results_result.csv with PubMed hyperlinks
+)
+
+print(result)
+
+plot_pubmatrix_heatmap(
+    result,
+    title="WNT Genes × Disease",
+    filename="heatmap.png",  # saves to file instead of displaying
+)
+```
+
+Run it with:
+
+```bash
+uv run python my_analysis.py
+```
+
+### Loading terms from a file
+
+Create `terms.txt`:
+
+```
+WNT1
+WNT2
+CTNNB1
+#
+obesity
+diabetes
+cancer
+```
+
+```python
+from pubmatrix import pubmatrix_from_file
+
+result = pubmatrix_from_file("terms.txt")
+print(result)
+```
+
+```bash
+uv run python my_analysis.py
+```
+
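Reviewer note on the term-file format: the README shows the layout but not the parsing rules. The sketch below assumes that a line containing only `#` separates list A from list B, that blank lines are ignored, and that a missing separator is an error. `parse_terms` is a hypothetical helper, not the library's `pubmatrix_from_file()`.

```python
def parse_terms(text):
    """Split term-file text into (A, B) at the first line that is just '#'.

    Hypothetical helper: skipping blank lines and raising on a missing
    separator are assumptions; the library may behave differently.
    """
    lines = [line.strip() for line in text.splitlines()]
    lines = [line for line in lines if line]  # drop blank lines
    if "#" not in lines:
        raise ValueError("expected a '#' line between the two term lists")
    split_at = lines.index("#")
    return lines[:split_at], lines[split_at + 1:]


# Same content as the terms.txt example in the README.
A, B = parse_terms("WNT1\nWNT2\nCTNNB1\n#\nobesity\ndiabetes\ncancer\n")
```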
+---
+
+## API reference
+
+### `pubmatrix(A, B, ...)`
+
+Query PubMed and return a `pandas.DataFrame` (rows = B, cols = A).
+
+```python
+pubmatrix(
+    A,  # list of str — column terms
+    B,  # list of str — row terms
+    api_key=None,  # NCBI API key (10 req/s vs 3 req/s default)
+    database="pubmed",  # "pubmed" or "pmc"
+    daterange=None,  # e.g. [2015, 2024]
+    outfile=None,  # base filename for export
+    export_format=None,  # None | "csv" | "ods"
+    n_tries=2,  # retries on network failure
+    n_workers=1,  # parallel workers for concurrent queries
+    timeout=30,  # HTTP request timeout in seconds
+    cache_dir=None,  # directory to cache query results on disk
+)
+```
+
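Reviewer note on `cache_dir`: the README says query results are cached on disk so repeated runs skip redundant requests, without describing the layout. The sketch below is one possible scheme, offered as an assumption rather than the package's actual format: one JSON file per query, named by a SHA-256 hash of the database and search term. `cached_count` and `fake_fetch` are hypothetical names.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cached_count(cache_dir, database, term, fetch):
    """Return the count for (database, term), calling fetch only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(f"{database}\n{term}".encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["count"]
    count = fetch(database, term)
    record = {"database": database, "term": term, "count": count}
    path.write_text(json.dumps(record), encoding="utf-8")
    return count


calls = []


def fake_fetch(database, term):
    """Stand-in for a network request: records the call and returns a fixed count."""
    calls.append(term)
    return 42


with tempfile.TemporaryDirectory() as tmp:
    first = cached_count(tmp, "pubmed", "WNT1 AND obesity", fake_fetch)   # miss
    second = cached_count(tmp, "pubmed", "WNT1 AND obesity", fake_fetch)  # hit
```

A real cache key would also need to cover anything else that changes the result, such as the date range; the sketch leaves that out for brevity.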
+### `pubmatrix_from_file(filepath, ...)`
+
+Load terms from a plain-text file and run `pubmatrix()`.
+
+File format:
+```
+WNT1
+WNT2
+#
+obesity
+diabetes
+```
+
+```python
+result = pubmatrix_from_file("terms.txt", database="pubmed")
+```
+
+### `plot_pubmatrix_heatmap(matrix, ...)`
+
+Heatmap of overlap percentages with optional hierarchical clustering. Returns `(fig, ax)`.
+
+```python
+fig, ax = plot_pubmatrix_heatmap(
+    matrix,  # DataFrame from pubmatrix()
+    title="PubMatrix Co-occurrence Heatmap",
+    cluster_rows=True,
+    cluster_cols=True,
+    show_numbers=True,
+    color_palette=None,  # list of hex colours
+    filename=None,  # save to PNG if set
+    width=10, height=8,
+    scale_font=True,
+    show=False,  # call plt.show() after plotting
+)
+```
+
+### `pubmatrix_heatmap(matrix, title=...)`
+
+Quick wrapper around `plot_pubmatrix_heatmap()` with all defaults. Returns `(fig, ax)`.
+
+---
+
+## Output files
+
+When `outfile` and `export_format` are set, results are written to
+`{outfile}_result.{extension}` (`.csv` or `.ods`). Each cell contains the
+publication count and a hyperlink to the matching PubMed search. Row names
+come from `B`, column names from `A`.
+
+ODS export requires the optional `odfpy` dependency:
+
+```bash
+pip install pubmatrixpython[ods]
+```
+
+---
+
+## NCBI API key
+
+Without a key: 3 requests/second. With a key: 10 requests/second.
+Get one at https://account.ncbi.nlm.nih.gov/
+
+```python
+result = pubmatrix(A=A, B=B, api_key="YOUR_KEY_HERE")
+```
+
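Reviewer note on the request rates: the README quotes 3 requests/second without a key and 10 with one, and says concurrent workers respect those limits, but the throttling code is not in this excerpt. The sketch below shows one common approach under stated assumptions: a lock-protected limiter that hands each caller the next free time slot. `RateLimiter` and `limiter_for` are hypothetical names; the injectable `clock` and `sleep` exist only so the behaviour can be checked without real waiting.

```python
import threading
import time


class RateLimiter:
    """Space calls at least 1/rate seconds apart, shared across threads."""

    def __init__(self, requests_per_second, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = clock()

    def wait(self):
        """Reserve the next free slot, then sleep until it arrives."""
        with self._lock:
            now = self._clock()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self.interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)  # sleep outside the lock so other threads can reserve


def limiter_for(api_key=None):
    """Pick the README's quoted rate: 3 requests/second, or 10 with an API key."""
    return RateLimiter(10 if api_key else 3)
```

Each worker would call `limiter.wait()` immediately before sending a request, so adding workers raises concurrency without exceeding the shared rate.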
+---
+
+## More documentation
+
+- [Performance notes](docs/performance.md) — rate limits, caching, concurrency
+- [Troubleshooting](docs/troubleshooting.md) — empty results, rate limiting, slow searches
+- [Full reference notebook](https://toledoem.github.io/pubmatrixp/) — every parameter and feature, with output
+
+---
+
+## License & citation
+
+This project is licensed under the MIT License — see [`LICENSE.md`](LICENSE.md).
+
+If you use PubMatrixPython in your research, please cite:
+
+> Becker KG, Hosack DA, Dennis G Jr, Lempicki RA, Bright TJ, Cheadle C, Engel J.
+> *PubMatrix: a tool for multiplex literature mining.*
+> BMC Bioinformatics. 2003 Dec 10;4:61. https://doi.org/10.1186/1471-2105-4-61
+
+**Developers:**
+- Tyler Laird (Author, original PubMatrixR)
+- Enrique Toledo (Author, maintainer)