@fbraza/pi-cite 0.1.0 → 0.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +13 -0
- package/package.json +5 -1
- package/skills/literature/SKILL.md +208 -0
- package/skills/literature/references/full-text-access-guide.md +34 -0
- package/skills/literature/references/preclinical-extraction-guide.md +215 -0
- package/skills/literature/references/pubmed_api_reference.md +298 -0
- package/skills/literature/references/pubmed_common_queries.md +453 -0
- package/skills/literature/references/pubmed_routine.md +93 -0
- package/skills/literature/references/pubmed_search_syntax.md +436 -0
- package/skills/literature/references/scihub_routine.md +40 -0
- package/skills/literature/references/semanticscholar_routine.md +50 -0
- package/skills/literature/scripts/export_all.py +53 -0
- package/skills/literature/scripts/extract_experiments.py +401 -0
- package/skills/literature/scripts/generate_table.py +96 -0
- package/skills/literature/scripts/scihub_pdf_resolver.py +289 -0
- package/skills/literature/scripts/synthesis.py +93 -0
package/README.md
CHANGED
@@ -9,6 +9,19 @@ academic workflows. Registers four tools callable by the agent:
 - **`fetch_fulltext`** — retrieve a paper PDF via PMC → publisher OA → fallback.
 - (`semantic_scholar` helper used internally by the search tools.)
 
+## Bundled skill
+
+Ships with the **`literature`** skill (`skills/literature/`), which turns these
+tools into an end-to-end review workflow: verified-citation search, full-text
+retrieval, per-paper experiment extraction, and a structured hypothesis
+synthesis. Its frontmatter declares `allowed-tools` covering the extension's
+tools above, so the skill and extension are paired on purpose.
+
+- `references/` — PubMed/Semantic Scholar query syntax, API reference, and
+  full-text access routines.
+- `scripts/` — Python helpers (`extract_experiments.py`, `synthesis.py`,
+  `generate_table.py`, `export_all.py`) invoked by the skill.
+
 ## Install
 
 Published on npm as `@fbraza/pi-cite`:
package/package.json
CHANGED
@@ -1,11 +1,12 @@
 {
   "name": "@fbraza/pi-cite",
-  "version": "0.
+  "version": "0.2.0",
   "description": "Pi extension with PubMed, Semantic Scholar, literature search, and full-text retrieval tools.",
   "license": "MIT",
   "type": "module",
   "files": [
     "src",
+    "skills",
     "README.md"
   ],
   "keywords": [
@@ -18,6 +19,9 @@
   "pi": {
     "extensions": [
       "./src/index.ts"
+    ],
+    "skills": [
+      "./skills"
     ]
   },
   "scripts": {
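Note on the `package.json` hunks above: `skills` is added both to npm's `files` allowlist and to the `pi` block, which suggests the two must stay in sync for the skill to ship. The sketch below checks that relationship. It assumes every `pi` path is relative to the package root and that `files` holds plain directory or file names (no globs); the helper name `unpublished_paths` is hypothetical and not part of the package.

```python
from pathlib import PurePosixPath


def unpublished_paths(manifest: dict) -> list[str]:
    """Return `pi` entries that no `files` entry would include in the tarball."""
    shipped = [PurePosixPath(entry) for entry in manifest.get("files", [])]
    missing = []
    for key, paths in manifest.get("pi", {}).items():
        for raw in paths:
            path = PurePosixPath(raw)  # "./skills" is normalised to "skills"
            covered = any(path == root or root in path.parents for root in shipped)
            if not covered:
                missing.append(f"{key}: {raw}")
    return missing


manifest = {
    "files": ["src", "skills", "README.md"],
    "pi": {"extensions": ["./src/index.ts"], "skills": ["./skills"]},
}
print(unpublished_paths(manifest))
```

Dropping `"skills"` from `files` in this manifest would make the function report `skills: ./skills`, which is the situation the second hunk would create without the first.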
package/skills/literature/SKILL.md
ADDED
@@ -0,0 +1,208 @@
+---
+name: literature
+description: Unified literature search, verification, full-text retrieval, and synthesis workflow for scientific questions. Use when any biological claim needs a verified citation, when reviewing a gene/pathway/disease/drug/target, when surveying preclinical evidence for a target in a disease, when checking novelty, when retrieving full text for specific papers, or when turning a paper set into a structured hypothesis synthesis.
+allowed-tools: Read, Write, WebFetch, WebSearch, literature_search, pubmed_search, semantic_scholar_search, fetch_fulltext
+starting-prompt: Conduct a literature review on my research topic with verified citations, structured synthesis, and a per-paper summary table.
+---
+
+# Literature
+
+Unified literature skill replacing the previous split review and preclinical workflows.
+
+## What this skill covers
+
+Use this skill when you need to:
+- verify any biological claim with a real citation
+- review literature on a gene, pathway, disease, drug, or molecular target
+- survey preclinical evidence for a target in a disease context
+- check whether a finding appears novel or already published
+- retrieve full text or PDFs for key papers
+- synthesize a paper set into hypotheses, contradictions, and evidence-weighted conclusions
+
+Do not use this skill for:
+- running computational omics analyses
+- generating figures without a literature component
+- inventing or guessing citations
+
+## Hard rules
+
+- Every citation must be real and verifiable.
+- Never fabricate PMIDs, DOIs, titles, journals, years, or author lists.
+- Distinguish human, animal, and in vitro evidence.
+- Weight evidence quality by study design and replication.
+- Record how full text was obtained for each paper.
+- Use inline numbered citations like `[1]` or `[1, 2]` in narrative synthesis.
+- Never overwrite outputs from a previous literature search.
+- Never write literature-review outputs directly to generic shared paths under `results/`.
+
+## Standard workflow
+
+### Step 1 — Clarify scope
+
+Always clarify:
+- exact claim, topic, target, or disease
+- desired time range
+- species restrictions
+- study type filters
+- whether the task is **general review** or **preclinical extraction mode**
+
+### Step 2 — Create a dedicated output folder
+
+For every new literature review or literature research task, create a new dedicated folder inside `results/` before generating files.
+
+The folder name must describe the search session/topic clearly, for example:
+- `results/literature_multiomics_ML_biomarkers_PGD/`
+- `results/literature_siRNA_lung_transplant_new_treatments/`
+
+All generated files for that search session must be saved inside this dedicated folder, including:
+- `literature_report.md`
+- `paper_summary_table.csv`
+- `search_log.md`
+- `pdfs/`
+- any optional analysis/export artifacts such as `analysis_object.pkl`
+
+Never write directly to generic shared paths such as:
+- `results/literature_report.md`
+- `results/paper_summary_table.csv`
+- `results/analysis_object.pkl`
+- `results/literature_pdfs/`
+
+If a folder for a previous search already exists, create a new folder with a distinct descriptive search-session title rather than using versioned filenames.
+
+At the end of the task, clearly report the exact output folder and generated file paths to the user.
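Note on Step 2: the folder rules above (descriptive name, one folder per search session, no reuse) can be enforced mechanically. The following is a minimal sketch, not code from the package; the function name `create_session_folder` and the slug rule are assumptions made for illustration.

```python
import re
from pathlib import Path


def create_session_folder(topic: str, results_root: Path) -> Path:
    """Create <results_root>/literature_<topic>/ with a pdfs/ subfolder."""
    # Collapse every run of non-alphanumeric characters into one underscore
    slug = re.sub(r"[^A-Za-z0-9]+", "_", topic).strip("_")
    if not slug:
        raise ValueError("topic must contain at least one letter or digit")
    folder = results_root / f"literature_{slug}"
    # exist_ok=False turns reuse of a previous search folder into an error
    folder.mkdir(parents=True, exist_ok=False)
    (folder / "pdfs").mkdir()
    return folder
```

Calling it twice with the same topic raises `FileExistsError`, which forces the caller to pick a distinct session title instead of overwriting earlier outputs.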
+
+### Step 3 — Search
+
+Use the custom literature tool as the primary search path:
+- **Primary:** `literature_search`
+
+When calling `literature_search`:
+- Always construct `pubmed_query` using PubMed-specific syntax from the references below.
+- Use MeSH terms (`[mh]` / `[majr]`), title/abstract terms (`[tiab]`), publication types (`[pt]`), substance names (`[nm]`), date filters, and Boolean logic as appropriate.
+- Construct `semantic_scholar_query` separately as broader natural-language search terms when useful. Semantic Scholar is used automatically as supplementary search only when `SEMANTIC_SCHOLAR_API_KEY` is configured.
+- Do not pass a generic natural-language query as `pubmed_query` when a PubMed/MeSH query can be constructed.
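Note on Step 3: the difference between a tagged PubMed query and a natural-language one is easiest to see side by side. This sketch only builds the two strings; the helpers `tagged`, `any_of` and `all_of` are hypothetical, and the dictionary at the end is an assumed argument shape using the two parameter names the skill mentions, not the tool's documented schema.

```python
def tagged(term: str, tag: str) -> str:
    """Quote a term and append a PubMed field tag such as mh or tiab."""
    return f'"{term}"[{tag}]'


def any_of(*clauses: str) -> str:
    """Join alternatives with OR inside one parenthesised group."""
    return "(" + " OR ".join(clauses) + ")"


def all_of(*clauses: str) -> str:
    """Require every clause with AND."""
    return " AND ".join(clauses)


pubmed_query = all_of(
    any_of(tagged("Lung Transplantation", "mh"), tagged("lung transplant", "tiab")),
    any_of(tagged("RNA, Small Interfering", "mh"), tagged("siRNA", "tiab")),
    "2015:2024[dp]",  # publication-date range filter
)

# Assumed argument shape; only the two key names come from the skill text
search_args = {
    "pubmed_query": pubmed_query,
    "semantic_scholar_query": "siRNA therapy in lung transplantation",
}
print(search_args["pubmed_query"])
```

Pairing each MeSH heading with a `[tiab]` variant is a common way to catch recent records that have not been indexed with MeSH terms yet.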
+
+These extension tools are the preferred search path for this skill. Do not fall back to generic `WebFetch` / `WebSearch` first when one of these typed tools fits the task.
+
+Read these references before constructing queries:
+- `references/pubmed_routine.md`
+- `references/pubmed_search_syntax.md`
+- `references/pubmed_common_queries.md`
+- `references/semanticscholar_routine.md`
+
+### Step 4 — Screen and prioritise
+
+- Deduplicate across PubMed and Semantic Scholar sources.
+- Prioritise by relevance, recency, citation count, and study type.
+- Default to deep reading of the top 20 papers unless the user asks otherwise.
+- For preclinical requests, keep studies with experimental target perturbation evidence.
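Note on Step 4: one possible reading of "deduplicate, then prioritise" is sketched below. The `Hit` record and its fields are hypothetical, and the ranking uses only recency and citation count; relevance and study type, which the skill also lists, are left out because the diff does not say how they are scored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    title: str
    year: int
    citations: int
    pmid: Optional[str] = None
    doi: Optional[str] = None
    source: str = "pubmed"


def dedupe(hits: list[Hit]) -> list[Hit]:
    """Keep the first hit per paper, matching on DOI (case-insensitive) or PMID."""
    seen: set[str] = set()
    unique: list[Hit] = []
    for hit in hits:
        keys = set()
        if hit.doi:
            keys.add("doi:" + hit.doi.lower())
        if hit.pmid:
            keys.add("pmid:" + hit.pmid)
        if not keys:
            keys.add("title:" + hit.title.casefold())
        duplicate = bool(keys & seen)
        seen |= keys  # remember every identifier, even those of duplicates
        if not duplicate:
            unique.append(hit)
    return unique


def prioritise(hits: list[Hit], top_n: int = 20) -> list[Hit]:
    """Newest first, ties broken by citation count, truncated to top_n."""
    ranked = sorted(dedupe(hits), key=lambda h: (h.year, h.citations), reverse=True)
    return ranked[:top_n]
```

The default `top_n=20` mirrors the skill's default of deep-reading the top 20 papers.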
+
+### Step 5 — Retrieve full text
+
+Use `fetch_fulltext` for top papers. Prefer it over ad-hoc `WebFetch` PDF retrieval because it applies the defined PMC → publisher OA → Sci-Hub chain.
+
+Access chain:
+1. PMC
+2. publisher open-access page
+3. Sci-Hub fallback
+
+Read:
+- `references/full-text-access-guide.md`
+- `references/scihub_routine.md`
+
+### Step 6 — Synthesis
+
+Always produce:
+1. a narrative synthesis with inline numbered citations
+2. a per-paper structured summary table
+
+When in **preclinical extraction mode**, add:
+- Experiment Type
+- Model System
+- Assay/Endpoint
+- Finding Direction
+
+Use:
+- `scripts/synthesis.py`
+- `scripts/generate_table.py`
+- `scripts/export_all.py`
+
+For preclinical extraction details, read:
+- `references/preclinical-extraction-guide.md`
+- `scripts/extract_experiments.py`
+
+## Evidence quality framework
+
+Rank evidence broadly as:
+- **High:** replicated clinical evidence, meta-analysis, systematic review, strong human studies
+- **Moderate:** strong animal studies, coherent multi-model evidence, robust mechanistic studies
+- **Low/Preliminary:** single-study results, purely computational inference, unreplicated in vitro work
+
+### What to mark as preliminary
+- single-study findings
+- animal-only findings for human claims
+- in vitro findings without in vivo follow-up
+
+### What to refuse without qualification
+- causal claims from correlational studies
+- claims supported only by retracted work
+- claims contradicting the weight of evidence
+
+## Output format
+
+### Narrative section
+
+Use concise prose with inline citations.
+
+### Paper Summary Table
+
+```markdown
+## Paper Summary Table
+
+| # | PMID/DOI | Authors (year) | Key Message | Key Results | Key Methods | Study Type | Evidence Quality |
+|---|---|---|---|---|---|---|---|
+```
+
+### Extra columns for preclinical extraction mode
+
+```markdown
+| Experiment Type | Model System | Assay/Endpoint | Finding Direction |
+```
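Note on the output format: the two table templates above share one layout, with four extra columns in preclinical extraction mode. The sketch below renders that layout from a list of dictionaries. It is an illustration only and makes no claim about how `scripts/generate_table.py` works; `render_table` and the row keys are assumptions that reuse the column headers as dictionary keys.

```python
COLUMNS = ["#", "PMID/DOI", "Authors (year)", "Key Message", "Key Results",
           "Key Methods", "Study Type", "Evidence Quality"]
PRECLINICAL_COLUMNS = ["Experiment Type", "Model System", "Assay/Endpoint",
                       "Finding Direction"]


def render_table(rows: list[dict], preclinical: bool = False) -> str:
    """Render rows as a Markdown table, numbering papers from 1."""
    columns = COLUMNS + (PRECLINICAL_COLUMNS if preclinical else [])

    def cell(value) -> str:
        # A raw pipe would split the cell and a newline would break the row
        return str(value).replace("|", "\\|").replace("\n", " ")

    lines = [
        "| " + " | ".join(columns) + " |",
        "|" + "---|" * len(columns),
    ]
    for number, row in enumerate(rows, start=1):
        values = [number] + [row.get(name, "") for name in columns[1:]]
        lines.append("| " + " | ".join(cell(v) for v in values) + " |")
    return "\n".join(lines)
```

Missing keys become empty cells, so a row that lacks a verified identifier stays visibly blank instead of being filled with a guess.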
+
+## Hypothesis synthesis
+
+After reviewing the core paper set, optionally produce:
+- explicit hypotheses stated by authors
+- implicit mechanistic hypotheses inferred from evidence
+- contradiction matrix across papers
+- highest-confidence next-step hypotheses
+
+## Expected files
+
+Typical outputs must be placed in a dedicated search-session folder under `./results/`, for example `./results/literature_<descriptive_topic>/`:
+- `literature_report.md`
+- `paper_summary_table.csv`
+- `search_log.md`
+- `pdfs/`
+- optional `analysis_object.pkl` or other export artifacts when produced
+
+Do not write these outputs directly to `./results/` or reuse a previous search folder.
+
+## Companion references
+
+- `references/pubmed_api_reference.md`
+- `references/pubmed_routine.md`
+- `references/pubmed_search_syntax.md`
+- `references/pubmed_common_queries.md`
+- `references/semanticscholar_routine.md`
+- `references/preclinical-extraction-guide.md`
+- `references/full-text-access-guide.md`
+- `references/scihub_routine.md`
+
+## Companion scripts
+
+- `scripts/extract_experiments.py`
+- `scripts/synthesis.py`
+- `scripts/generate_table.py`
+- `scripts/export_all.py`
+- `scripts/scihub_pdf_resolver.py`
package/skills/literature/references/full-text-access-guide.md
ADDED
@@ -0,0 +1,34 @@
+# Full-Text Access Guide
+
+**Workflow:** literature
+**Purpose:** Retrieve PDFs for prioritised papers using a consistent fallback chain.
+
+## Access order
+
+1. **PubMed Central (PMC)**
+   - Preferred for PubMed-indexed papers with open full text.
+   - Use PubMed/PMC linking first when a PMID is available.
+
+2. **Publisher open-access page**
+   - Resolve DOI at `https://doi.org/<doi>`.
+   - Look for `citation_pdf_url`, explicit PDF links, or embedded PDF viewers.
+
+3. **Sci-Hub fallback**
+   - Use only as the final fallback after OA routes are exhausted.
+   - Record that Sci-Hub was used.
+
+## Per-paper logging
+
+For each paper, record:
+- PMID
+- DOI
+- source used: `pmc`, `publisher_oa`, `scihub`, or `not_found`
+- direct PDF URL if found
+- local saved path if downloaded
+- access note
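Note on per-paper logging: the six fields above map naturally onto a small record type. This is a sketch under assumptions, not the package's data model; the class name `AccessRecord`, the attribute names and the validation rules are all hypothetical. Only the four `source` labels are taken from the guide.

```python
from dataclasses import asdict, dataclass
from typing import Optional

SOURCES = ("pmc", "publisher_oa", "scihub", "not_found")


@dataclass
class AccessRecord:
    pmid: Optional[str]
    doi: Optional[str]
    source: str
    pdf_url: Optional[str] = None
    local_path: Optional[str] = None
    note: str = ""

    def __post_init__(self) -> None:
        if self.source not in SOURCES:
            raise ValueError(f"source must be one of {SOURCES}, got {self.source!r}")
        if self.source == "not_found" and (self.pdf_url or self.local_path):
            raise ValueError("a not_found record cannot carry a PDF URL or path")


# Identifiers are left empty here on purpose; real values come from search results
record = AccessRecord(pmid=None, doi=None, source="not_found", note="no open copy located")
print(asdict(record))
```

Keeping `not_found` as an explicit value matches the guide's instruction to keep such papers in the synthesis rather than dropping them.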
+
+## Notes
+
+- PMC and publisher OA should always be attempted before Sci-Hub.
+- If no DOI is known but PMID exists, try resolving identifiers from PubMed metadata first.
+- If no PDF is found, keep the paper in the synthesis and note `not_found`.
package/skills/literature/references/preclinical-extraction-guide.md
ADDED
@@ -0,0 +1,215 @@
+# Experiment Extraction Guide
+
+**Workflow:** literature
+**Purpose:** Keyword-based extraction approach, dictionaries, and limitations for classifying in vitro and in vivo experiments from paper abstracts.
+
+---
+
+## Overview
+
+The `extract_all_experiments()` function parses paper abstracts to classify each paper into one of four categories:
+- **in_vitro** — only cell-based experiments reported
+- **in_vivo** — only animal model experiments reported
+- **both** — both in vitro and in vivo experiments reported
+- **unclassified** — insufficient information in abstract to classify
+
+Extraction is keyword-based (not AI-based), operating on the abstract text. It is fast and reproducible but has known limitations (see below).
+
+---
+
+## Extraction Dictionaries
+
+### In Vitro Keywords
+
+These terms in the abstract signal cell-based experiments:
+
+```python
+IN_VITRO_KEYWORDS = [
+    # Cell lines
+    "cell line", "cell lines", "cancer cells", "tumor cells",
+    "HeLa", "MCF-7", "A549", "HCT116", "PC-3", "LNCaP", "MDA-MB",
+    "U87", "U251", "Jurkat", "THP-1", "RAW264", "HEK293",
+
+    # Assay types
+    "in vitro", "cell viability", "cell proliferation", "MTT assay",
+    "CCK-8", "colony formation", "clonogenic", "wound healing",
+    "scratch assay", "transwell", "invasion assay", "migration assay",
+    "flow cytometry", "apoptosis", "cell cycle", "western blot",
+    "immunofluorescence", "ELISA", "qPCR", "RNA-seq", "ChIP",
+
+    # Culture conditions
+    "cultured", "culture", "monolayer", "3D culture", "organoid",
+    "spheroid", "primary cells", "iPSC", "stem cells"
+]
+```
+
+### In Vivo Keywords
+
+These terms signal animal model experiments:
+
+```python
+IN_VIVO_KEYWORDS = [
+    # Animal models
+    "in vivo", "mouse model", "murine", "mice", "rat model",
+    "xenograft", "syngeneic", "PDX", "patient-derived xenograft",
+    "transgenic", "knockout mouse", "knock-in", "GEM", "GEMM",
+    "orthotopic", "subcutaneous tumor", "allograft",
+
+    # Animal procedures
+    "tumor volume", "tumor growth", "body weight", "survival",
+    "tumor regression", "metastasis model", "lung metastasis",
+    "liver metastasis", "intracranial", "intraperitoneal injection",
+    "intravenous injection", "oral gavage", "dosing",
+
+    # Animal species
+    "C57BL/6", "BALB/c", "nude mice", "SCID", "NOD-SCID",
+    "NSG", "athymic", "immunodeficient", "Sprague-Dawley",
+    "Wistar rat", "zebrafish"
+]
+```
+
+---
+
+## Extraction Logic
+
+```python
+def classify_paper(abstract: str) -> dict:
+    """
+    Classify a paper abstract into experiment categories.
+    Returns dict with: experiment_type, in_vitro_signals, in_vivo_signals,
+    cell_lines, animal_models, assays, endpoints, findings
+    """
+    abstract_lower = abstract.lower()
+
+    # Count keyword matches
+    in_vitro_hits = [kw for kw in IN_VITRO_KEYWORDS if kw.lower() in abstract_lower]
+    in_vivo_hits = [kw for kw in IN_VIVO_KEYWORDS if kw.lower() in abstract_lower]
+
+    # Classify
+    has_vitro = len(in_vitro_hits) >= 1
+    has_vivo = len(in_vivo_hits) >= 1
+
+    if has_vitro and has_vivo:
+        experiment_type = "both"
+    elif has_vitro:
+        experiment_type = "in_vitro"
+    elif has_vivo:
+        experiment_type = "in_vivo"
+    else:
+        experiment_type = "unclassified"
+
+    return {
+        "experiment_type": experiment_type,
+        "in_vitro_signals": in_vitro_hits,
+        "in_vivo_signals": in_vivo_hits,
+        "cell_lines": extract_cell_lines(abstract),
+        "animal_models": extract_animal_models(abstract),
+        "assays": extract_assays(abstract),
+        "endpoints": extract_endpoints(abstract),
+        "findings": extract_findings_sentence(abstract)
+    }
+```
+
+---
+
+## Structured Field Extraction
+
+Beyond classification, the script extracts specific structured fields:
+
+### Cell Lines
+Extracted using a curated regex pattern matching common cancer cell line names:
+```python
+CELL_LINE_PATTERN = r'\b(MCF-7|MDA-MB-\d+|A549|HCT116|PC-3|LNCaP|HeLa|U87|U251|Jurkat|THP-1|HEK293[T]?|Caco-2|SW480|HT-29|PANC-1|MiaPaCa|BxPC-3|DU145|22Rv1)\b'
+```
+
+### Animal Models
+```python
+ANIMAL_MODEL_PATTERN = r'\b(xenograft|syngeneic|PDX|patient-derived xenograft|transgenic|knockout|GEMM|orthotopic|allograft|C57BL/6|BALB/c|nude mice|NSG|SCID)\b'
+```
+
+### Assay Types
+```python
+ASSAY_PATTERN = r'\b(MTT|CCK-8|colony formation|clonogenic|wound healing|transwell|flow cytometry|western blot|ELISA|qPCR|RNA-seq|ChIP-seq|ATAC-seq|immunofluorescence|IHC|co-immunoprecipitation)\b'
+```
+
+### Endpoints
+```python
+ENDPOINT_PATTERN = r'\b(tumor volume|tumor growth|survival|body weight|metastasis|apoptosis|cell viability|proliferation|migration|invasion|angiogenesis|tumor regression)\b'
+```
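Note on the field patterns: the guide lists the regexes but not how they are applied. The sketch below shows one plausible way, using two of the patterns copied verbatim from the guide. The helper `find_terms`, the comma-joined return value and the optional case-insensitive flag are assumptions; the actual `extract_*` functions in `scripts/extract_experiments.py` may behave differently.

```python
import re

# Copied verbatim from the guide
CELL_LINE_PATTERN = r'\b(MCF-7|MDA-MB-\d+|A549|HCT116|PC-3|LNCaP|HeLa|U87|U251|Jurkat|THP-1|HEK293[T]?|Caco-2|SW480|HT-29|PANC-1|MiaPaCa|BxPC-3|DU145|22Rv1)\b'
ENDPOINT_PATTERN = r'\b(tumor volume|tumor growth|survival|body weight|metastasis|apoptosis|cell viability|proliferation|migration|invasion|angiogenesis|tumor regression)\b'


def find_terms(pattern: str, text: str, ignore_case: bool = False) -> str:
    """Return unique matches in order of first appearance, comma-separated."""
    flags = re.IGNORECASE if ignore_case else 0
    found: list[str] = []
    for match in re.finditer(pattern, text, flags):
        term = match.group(1)
        if term.lower() not in (seen.lower() for seen in found):
            found.append(term)
    return ", ".join(found)


abstract = ("MCF-7 and MDA-MB-231 cells were treated; MCF-7 cell viability "
            "decreased and Apoptosis increased.")
print(find_terms(CELL_LINE_PATTERN, abstract))
print(find_terms(ENDPOINT_PATTERN, abstract, ignore_case=True))
```

Cell line names are matched case-sensitively here because their capitalisation is part of the name, while endpoint phrases are matched case-insensitively so a sentence-initial "Apoptosis" still counts.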
+
+### Key Findings Sentence
+The script extracts the last 1-2 sentences of the abstract as a proxy for the main finding:
+```python
+def extract_findings_sentence(abstract: str) -> str:
+    sentences = abstract.split(". ")
+    return ". ".join(sentences[-2:]).strip()
+```
+
+---
+
+## Output Schema
+
+The `experiment_extraction.csv` file contains one row per paper with these columns:
+
+| Column | Type | Description |
+|--------|------|-------------|
+| `pmid` | str | PubMed ID |
+| `doi` | str | DOI |
+| `title` | str | Paper title |
+| `year` | int | Publication year |
+| `experiment_type` | str | in_vitro / in_vivo / both / unclassified |
+| `cell_lines` | str | Comma-separated cell line names found |
+| `animal_models` | str | Comma-separated animal model terms found |
+| `assays` | str | Comma-separated assay types found |
+| `endpoints` | str | Comma-separated endpoints found |
+| `in_vitro_signal_count` | int | Number of in vitro keyword matches |
+| `in_vivo_signal_count` | int | Number of in vivo keyword matches |
+| `key_findings` | str | Last 1-2 sentences of abstract |
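Note on the output schema: the table above implies a flattening step between the dictionary returned by `classify_paper` and one CSV row. A minimal sketch of that step follows. It assumes the list-like fields arrive as Python lists and that paper metadata comes in a separate dictionary; `to_row` and `write_rows` are hypothetical names, not functions from the package.

```python
import csv
import io

FIELDS = ["pmid", "doi", "title", "year", "experiment_type", "cell_lines",
          "animal_models", "assays", "endpoints", "in_vitro_signal_count",
          "in_vivo_signal_count", "key_findings"]


def to_row(paper: dict, result: dict) -> dict:
    """Flatten paper metadata plus a classification result into one CSV row."""
    return {
        "pmid": paper.get("pmid", ""),
        "doi": paper.get("doi", ""),
        "title": paper.get("title", ""),
        "year": paper.get("year", ""),
        "experiment_type": result["experiment_type"],
        "cell_lines": ", ".join(result["cell_lines"]),
        "animal_models": ", ".join(result["animal_models"]),
        "assays": ", ".join(result["assays"]),
        "endpoints": ", ".join(result["endpoints"]),
        "in_vitro_signal_count": len(result["in_vitro_signals"]),
        "in_vivo_signal_count": len(result["in_vivo_signals"]),
        "key_findings": result["findings"],
    }


def write_rows(rows: list[dict], handle) -> None:
    """Write the header and all rows in the column order of the schema."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

Because several cells contain commas, the `csv` module quotes them on write; joining fields by hand would corrupt the column layout.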
+
+---
+
+## Known Limitations
+
+### 1. Abstract-only extraction
+The script only reads abstracts, not full text. Papers that describe experiments only in the methods/results sections (not the abstract) will be misclassified as "unclassified".
+
+**Mitigation:** Step 5 of the workflow (full-text enrichment) addresses this for top papers.
+
+### 2. Keyword sensitivity
+- **False positives:** A paper mentioning "mouse model" in the introduction (not as an experiment performed) may be classified as in_vivo.
+- **False negatives:** Novel cell lines or animal models not in the dictionary will be missed.
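Note on keyword sensitivity: the `classify_paper` listing in this guide tests keywords with a lowercase substring check, which adds a further source of false positives beyond the two listed: short keywords such as "GEM", "ChIP" or "culture" also match inside longer, unrelated words. The sketch below contrasts that check with a whole-word variant. Both helper names are hypothetical, and the whole-word version is a suggested alternative, not the script's behaviour.

```python
import re


def substring_hits(keywords: list[str], text: str) -> list[str]:
    """Same test as the guide's classify_paper listing: lowercase substring match."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]


def whole_word_hits(keywords: list[str], text: str) -> list[str]:
    """Match a keyword only when it is not glued to other word characters."""
    hits = []
    for kw in keywords:
        pattern = r"(?<!\w)" + re.escape(kw) + r"(?!\w)"
        if re.search(pattern, text, re.IGNORECASE):
            hits.append(kw)
    return hits


text = ("Gemcitabine uptake was modelled computationally; "
        "no agriculture or microchip data were used.")
keywords = ["GEM", "ChIP", "culture"]
print(substring_hits(keywords, text))
print(whole_word_hits(keywords, text))
```

Whole-word matching removes these collisions but cannot tell an experiment that was performed from one that is merely mentioned, so the first limitation listed above still applies.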
+
+### 3. No quantitative extraction
+The script does not extract dosing, IC50 values, tumor volumes, or statistical significance — only qualitative classification.
+
+### 4. Language dependency
+Only English-language abstracts are supported. Non-English papers will be classified as "unclassified".
+
+### 5. High "unclassified" rate for some targets
+For niche targets or highly mechanistic papers (e.g., structural biology, computational studies), most papers may be classified as "unclassified". This is expected behavior — not a bug.
+
+---
+
+## Improving Extraction for Your Target
+
+If you find too many "unclassified" papers, you can extend the keyword dictionaries:
+
+```python
+# Add target-specific cell lines
+CUSTOM_CELL_LINES = ["your_cell_line_1", "your_cell_line_2"]
+IN_VITRO_KEYWORDS.extend(CUSTOM_CELL_LINES)
+
+# Add target-specific animal models
+CUSTOM_ANIMAL_MODELS = ["your_model_1", "your_model_2"]
+IN_VIVO_KEYWORDS.extend(CUSTOM_ANIMAL_MODELS)
+```
+
+Edit `scripts/extract_experiments.py` directly and re-run the extraction.
+
+---
+
+## References
+
+- Keyword extraction approach inspired by: Lever J, et al. (2019) *PLOS Biol* — text mining for biomedical literature
+- Cell line nomenclature: ATCC (https://www.atcc.org/)
+- Animal model terminology: NCI Thesaurus (https://ncithesaurus.nci.nih.gov/)