@fbraza/pi-cite 0.1.0 → 0.3.0
- package/README.md +16 -6
- package/package.json +7 -4
- package/skills/literature/SKILL.md +189 -0
- package/skills/literature/references/preclinical-extraction-guide.md +215 -0
- package/skills/literature/references/pubmed_api_reference.md +298 -0
- package/skills/literature/references/pubmed_common_queries.md +453 -0
- package/skills/literature/references/pubmed_routine.md +93 -0
- package/skills/literature/references/pubmed_search_syntax.md +436 -0
- package/skills/literature/scripts/export_all.py +53 -0
- package/skills/literature/scripts/extract_experiments.py +401 -0
- package/skills/literature/scripts/generate_table.py +94 -0
- package/skills/literature/scripts/synthesis.py +94 -0
- package/src/index.ts +0 -4
- package/src/literature-search.ts +2 -110
- package/src/rendering.ts +13 -23
- package/src/shared.ts +0 -21
- package/src/types.ts +0 -13
- package/src/fulltext.ts +0 -524
- package/src/semantic-scholar.ts +0 -199
package/README.md
CHANGED
|
@@ -1,13 +1,24 @@
|
|
|
1
1
|
# @fbraza/pi-cite
|
|
2
2
|
|
|
3
3
|
A standalone [Pi](https://pi.dev) extension providing literature-research tools for
|
|
4
|
-
academic workflows. Registers
|
|
4
|
+
academic workflows. Registers two tools callable by the agent:
|
|
5
5
|
|
|
6
|
-
- **`literature_search`** —
|
|
7
|
-
|
|
6
|
+
- **`literature_search`** — literature workflow search against PubMed using a
|
|
7
|
+
PubMed-ready query (MeSH `[mh]`, `[tiab]`, `[pt]`, substance `[nm]`, and Boolean
|
|
8
|
+
logic), with streaming progress and deduplicated results.
|
|
8
9
|
- **`pubmed_search`** — direct PubMed query (MeSH, `[tiab]`, `[pt]`, etc.).
|
|
9
|
-
|
|
10
|
-
|
|
10
|
+
|
|
11
|
+
## Bundled skill
|
|
12
|
+
|
|
13
|
+
Ships with the **`literature`** skill (`skills/literature/`), which turns these
|
|
14
|
+
tools into an end-to-end review workflow: verified-citation search, per-paper
|
|
15
|
+
experiment extraction, and a structured hypothesis synthesis. Its frontmatter
|
|
16
|
+
declares `allowed-tools` covering the extension's tools above, so the skill and
|
|
17
|
+
extension are paired on purpose.
|
|
18
|
+
|
|
19
|
+
- `references/` — PubMed query syntax, API reference, and common queries.
|
|
20
|
+
- `scripts/` — Python helpers (`extract_experiments.py`, `synthesis.py`,
|
|
21
|
+
`generate_table.py`, `export_all.py`) invoked by the skill.
|
|
11
22
|
|
|
12
23
|
## Install
|
|
13
24
|
|
|
@@ -41,4 +52,3 @@ npm run pack:check # preview the published tarball contents
|
|
|
41
52
|
| Variable | Purpose |
|
|
42
53
|
|---|---|
|
|
43
54
|
| `NCBI_API_KEY` / `api_key` env | PubMed rate limit + E-utilities auth |
|
|
44
|
-
| `SEMANTIC_SCHOLAR_API_KEY` | Enables Semantic Scholar supplementary search |
|
package/package.json
CHANGED
|
@@ -1,23 +1,26 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "@fbraza/pi-cite",
|
|
3
|
-
"version": "0.
|
|
4
|
-
"description": "Pi extension with PubMed
|
|
3
|
+
"version": "0.3.0",
|
|
4
|
+
"description": "Pi extension with PubMed and literature search tools.",
|
|
5
5
|
"license": "MIT",
|
|
6
6
|
"type": "module",
|
|
7
7
|
"files": [
|
|
8
8
|
"src",
|
|
9
|
+
"skills",
|
|
9
10
|
"README.md"
|
|
10
11
|
],
|
|
11
12
|
"keywords": [
|
|
12
13
|
"pi-package",
|
|
13
14
|
"pi-extension",
|
|
14
15
|
"literature",
|
|
15
|
-
"pubmed"
|
|
16
|
-
"semantic-scholar"
|
|
16
|
+
"pubmed"
|
|
17
17
|
],
|
|
18
18
|
"pi": {
|
|
19
19
|
"extensions": [
|
|
20
20
|
"./src/index.ts"
|
|
21
|
+
],
|
|
22
|
+
"skills": [
|
|
23
|
+
"./skills"
|
|
21
24
|
]
|
|
22
25
|
},
|
|
23
26
|
"scripts": {
|
|
package/skills/literature/SKILL.md
@@ -0,0 +1,189 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: literature
|
|
3
|
+
description: Unified literature search, verification, and synthesis workflow for scientific questions. Use when any biological claim needs a verified citation, when reviewing a gene/pathway/disease/drug/target, when surveying preclinical evidence for a target in a disease, when checking novelty, or when turning a paper set into a structured hypothesis synthesis.
|
|
4
|
+
allowed-tools: Read, Write, WebFetch, WebSearch, literature_search, pubmed_search
|
|
5
|
+
starting-prompt: Conduct a literature review on my research topic with verified citations, structured synthesis, and a per-paper summary table.
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
# Literature
|
|
9
|
+
|
|
10
|
+
Unified literature skill replacing the previous split review and preclinical workflows.
|
|
11
|
+
|
|
12
|
+
## What this skill covers
|
|
13
|
+
|
|
14
|
+
Use this skill when you need to:
|
|
15
|
+
- verify any biological claim with a real citation
|
|
16
|
+
- review literature on a gene, pathway, disease, drug, or molecular target
|
|
17
|
+
- survey preclinical evidence for a target in a disease context
|
|
18
|
+
- check whether a finding appears novel or already published
|
|
19
|
+
- synthesize a paper set into hypotheses, contradictions, and evidence-weighted conclusions
|
|
20
|
+
|
|
21
|
+
Do not use this skill for:
|
|
22
|
+
- running computational omics analyses
|
|
23
|
+
- generating figures without a literature component
|
|
24
|
+
- inventing or guessing citations
|
|
25
|
+
|
|
26
|
+
## Hard rules
|
|
27
|
+
|
|
28
|
+
- Every citation must be real and verifiable.
|
|
29
|
+
- Never fabricate PMIDs, DOIs, titles, journals, years, or author lists.
|
|
30
|
+
- Distinguish human, animal, and in vitro evidence.
|
|
31
|
+
- Weight evidence quality by study design and replication.
|
|
32
|
+
- Use inline numbered citations like `[1]` or `[1, 2]` in narrative synthesis.
|
|
33
|
+
- Never overwrite outputs from a previous literature search.
|
|
34
|
+
- Never write literature-review outputs directly to generic shared paths under `results/`.
|
|
35
|
+
|
|
36
|
+
## Standard workflow
|
|
37
|
+
|
|
38
|
+
### Step 1 — Clarify scope
|
|
39
|
+
|
|
40
|
+
Always clarify:
|
|
41
|
+
- exact claim, topic, target, or disease
|
|
42
|
+
- desired time range
|
|
43
|
+
- species restrictions
|
|
44
|
+
- study type filters
|
|
45
|
+
- whether the task is **general review** or **preclinical extraction mode**
|
|
46
|
+
|
|
47
|
+
### Step 2 — Create a dedicated output folder
|
|
48
|
+
|
|
49
|
+
For every new literature review or literature research task, create a new dedicated folder under `results/literature_review/` before generating files.
|
|
50
|
+
|
|
51
|
+
Use the path `results/literature_review/<subject_of_study>/`, where `<subject_of_study>` is a short **snake_case title summary of the theme** of the literature search. Derive it from the scope clarified in Step 1: lower case, words separated by single underscores, no spaces, hyphens, or punctuation. For example, a review on **trained immunity in transplantation** becomes:
|
|
52
|
+
|
|
53
|
+
- `results/literature_review/trained_immunity_in_transplantation/`
|
|
54
|
+
|
|
55
|
+
Other examples:
|
|
56
|
+
- `results/literature_review/sirna_lung_transplant_new_treatments/`
|
|
57
|
+
- `results/literature_review/multiomics_ml_biomarkers_in_pgd/`
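The naming rule above can be sketched in Python. This is a minimal illustration, not part of the package: `subject_folder` is a hypothetical helper name, and the only inputs taken from the skill are the `results/literature_review/` root and the snake_case rule.

```python
import re
from pathlib import Path


def subject_folder(title: str, root: str = "results/literature_review") -> Path:
    """Hypothetical helper: map a review theme to its dedicated folder path."""
    # Lower-case the title, then collapse every run of characters that is not
    # a letter or digit (spaces, hyphens, punctuation) into one underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")
    if not slug:
        raise ValueError("title must contain at least one letter or digit")
    return Path(root) / slug


print(subject_folder("Trained immunity in transplantation").as_posix())
# results/literature_review/trained_immunity_in_transplantation
```

Because the skill forbids reusing a previous subject folder, a caller would presumably check `subject_folder(...).exists()` and pick a more specific title before writing anything.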
|
|
58
|
+
|
|
59
|
+
All generated files for that search session must be saved inside this dedicated subject folder, including:
|
|
60
|
+
- `literature_report.md`
|
|
61
|
+
- `paper_summary_table.csv`
|
|
62
|
+
- `search_log.md`
|
|
63
|
+
- any optional analysis/export artifacts such as `analysis_object.pkl`
|
|
64
|
+
|
|
65
|
+
Never write outputs directly to the parent folder or to the `results/` root, for example:
|
|
66
|
+
- `results/literature_review/literature_report.md`
|
|
67
|
+
- `results/literature_review/paper_summary_table.csv`
|
|
68
|
+
- `results/literature_review/analysis_object.pkl`
|
|
69
|
+
- `results/literature_report.md`
|
|
70
|
+
|
|
71
|
+
If a folder for a previous search on the same subject already exists, create a new folder with a distinct descriptive `<subject_of_study>` title rather than using versioned filenames.
|
|
72
|
+
|
|
73
|
+
At the end of the task, clearly report the exact output folder and generated file paths to the user.
|
|
74
|
+
|
|
75
|
+
### Step 3 — Search
|
|
76
|
+
|
|
77
|
+
Use the custom literature tool as the primary search path:
|
|
78
|
+
- **Primary:** `literature_search`
|
|
79
|
+
|
|
80
|
+
When calling `literature_search`:
|
|
81
|
+
- Always construct `pubmed_query` using PubMed-specific syntax from the references below.
|
|
82
|
+
- Use MeSH terms (`[mh]` / `[majr]`), title/abstract terms (`[tiab]`), publication types (`[pt]`), substance names (`[nm]`), date filters, and Boolean logic as appropriate.
|
|
83
|
+
- Do not pass a generic natural-language query as `pubmed_query` when a PubMed/MeSH query can be constructed.
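As an illustration of the rules above, the sketch below assembles a `pubmed_query` string from field-tagged clauses. The helper names (`tagged`, `any_of`, `all_of`) are hypothetical, and the search terms are examples only; check MeSH headings against the references before using them.

```python
def tagged(term: str, tag: str) -> str:
    """Quote a term and append a PubMed field tag, e.g. "siRNA"[tiab]."""
    return f'"{term}"[{tag}]'


def any_of(*clauses: str) -> str:
    """Join clauses with OR and wrap them in parentheses."""
    return "(" + " OR ".join(clauses) + ")"


def all_of(*clauses: str) -> str:
    """Join clauses with AND."""
    return " AND ".join(clauses)


# Illustrative query: concept 1 AND concept 2 AND a publication-date range.
pubmed_query = all_of(
    any_of(tagged("Lung Transplantation", "mh"), tagged("lung transplant", "tiab")),
    any_of(tagged("RNA, Small Interfering", "mh"), tagged("siRNA", "tiab")),
    "2015:2025[dp]",
)
print(pubmed_query)
```

Pairing a MeSH clause with a `[tiab]` clause for each concept is a common way to catch recent records that have not been indexed with MeSH terms yet.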
|
|
84
|
+
|
|
85
|
+
These extension tools are the preferred search path for this skill. Do not fall back to generic `WebFetch` / `WebSearch` first when one of these typed tools fits the task.
|
|
86
|
+
|
|
87
|
+
Read these references before constructing queries:
|
|
88
|
+
- `references/pubmed_routine.md`
|
|
89
|
+
- `references/pubmed_search_syntax.md`
|
|
90
|
+
- `references/pubmed_common_queries.md`
|
|
91
|
+
|
|
92
|
+
### Step 4 — Screen and prioritise
|
|
93
|
+
|
|
94
|
+
- Deduplicate PubMed results.
|
|
95
|
+
- Prioritise by relevance, recency, and study type.
|
|
96
|
+
- Default to deep reading of the top 20 papers unless the user asks otherwise.
|
|
97
|
+
- For preclinical requests, keep studies with experimental target perturbation evidence.
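The deduplication step can be sketched as follows, assuming each result is a dict with optional `pmid` and `doi` keys. The record shape and the `dedupe` name are assumptions for illustration; the tools' actual return format is not shown here.

```python
def dedupe(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each PMID or DOI (hypothetical helper)."""
    seen: set[str] = set()
    unique: list[dict] = []
    for rec in records:
        # Collect every identifier this record carries; DOIs are compared
        # case-insensitively.
        keys = set()
        if rec.get("pmid"):
            keys.add(f"pmid:{rec['pmid']}")
        if rec.get("doi"):
            keys.add(f"doi:{rec['doi'].lower()}")
        is_duplicate = bool(keys & seen)
        seen |= keys
        if not is_duplicate:
            unique.append(rec)
    return unique


# Placeholder identifiers, not real papers.
hits = [
    {"pmid": "00000001", "doi": "10.0000/EXAMPLE.1"},
    {"pmid": None, "doi": "10.0000/example.1"},  # same DOI, different case
    {"pmid": "00000002", "doi": None},
]
print(len(dedupe(hits)))
```

Records that carry neither identifier are always kept, since there is nothing to compare them on.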
|
|
98
|
+
|
|
99
|
+
### Step 5 — Synthesis
|
|
100
|
+
|
|
101
|
+
Always produce:
|
|
102
|
+
1. a narrative synthesis with inline numbered citations
|
|
103
|
+
2. a per-paper structured summary table
|
|
104
|
+
|
|
105
|
+
When in **preclinical extraction mode**, add:
|
|
106
|
+
- Experiment Type
|
|
107
|
+
- Model System
|
|
108
|
+
- Assay/Endpoint
|
|
109
|
+
- Finding Direction
|
|
110
|
+
|
|
111
|
+
Use:
|
|
112
|
+
- `scripts/synthesis.py`
|
|
113
|
+
- `scripts/generate_table.py`
|
|
114
|
+
- `scripts/export_all.py`
|
|
115
|
+
|
|
116
|
+
For preclinical extraction details, read:
|
|
117
|
+
- `references/preclinical-extraction-guide.md`
|
|
118
|
+
- `scripts/extract_experiments.py`
|
|
119
|
+
|
|
120
|
+
## Evidence quality framework
|
|
121
|
+
|
|
122
|
+
Rank evidence broadly as:
|
|
123
|
+
- **High:** replicated clinical evidence, meta-analysis, systematic review, strong human studies
|
|
124
|
+
- **Moderate:** strong animal studies, coherent multi-model evidence, robust mechanistic studies
|
|
125
|
+
- **Low/Preliminary:** single-study results, purely computational inference, unreplicated in vitro work
|
|
126
|
+
|
|
127
|
+
### What to mark as preliminary
|
|
128
|
+
- single-study findings
|
|
129
|
+
- animal-only findings for human claims
|
|
130
|
+
- in vitro findings without in vivo follow-up
|
|
131
|
+
|
|
132
|
+
### What to refuse without qualification
|
|
133
|
+
- causal claims from correlational studies
|
|
134
|
+
- claims supported only by retracted work
|
|
135
|
+
- claims contradicting the weight of evidence
|
|
136
|
+
|
|
137
|
+
## Output format
|
|
138
|
+
|
|
139
|
+
### Narrative section
|
|
140
|
+
|
|
141
|
+
Use concise prose with inline citations.
|
|
142
|
+
|
|
143
|
+
### Paper Summary Table
|
|
144
|
+
|
|
145
|
+
```markdown
|
|
146
|
+
## Paper Summary Table
|
|
147
|
+
|
|
148
|
+
| # | PMID/DOI | Authors (year) | Key Message | Key Results | Key Methods | Study Type | Evidence Quality |
|
|
149
|
+
|---|---|---|---|---|---|---|---|
|
|
150
|
+
```
|
|
151
|
+
|
|
152
|
+
### Extra columns for preclinical extraction mode
|
|
153
|
+
|
|
154
|
+
```markdown
|
|
155
|
+
| Experiment Type | Model System | Assay/Endpoint | Finding Direction |
|
|
156
|
+
```
|
|
157
|
+
|
|
158
|
+
## Hypothesis synthesis
|
|
159
|
+
|
|
160
|
+
After reviewing the core paper set, optionally produce:
|
|
161
|
+
- explicit hypotheses stated by authors
|
|
162
|
+
- implicit mechanistic hypotheses inferred from evidence
|
|
163
|
+
- contradiction matrix across papers
|
|
164
|
+
- highest-confidence next-step hypotheses
|
|
165
|
+
|
|
166
|
+
## Expected files
|
|
167
|
+
|
|
168
|
+
Typical outputs must be placed in a dedicated subject folder under `./results/literature_review/`, for example `./results/literature_review/<subject_of_study>/`:
|
|
169
|
+
- `literature_report.md`
|
|
170
|
+
- `paper_summary_table.csv`
|
|
171
|
+
- `search_log.md`
|
|
172
|
+
- optional `analysis_object.pkl` or other export artifacts when produced
|
|
173
|
+
|
|
174
|
+
Do not write these outputs directly to `./results/literature_review/` or to `./results/`, and do not reuse a previous subject folder.
|
|
175
|
+
|
|
176
|
+
## Companion references
|
|
177
|
+
|
|
178
|
+
- `references/pubmed_api_reference.md`
|
|
179
|
+
- `references/pubmed_routine.md`
|
|
180
|
+
- `references/pubmed_search_syntax.md`
|
|
181
|
+
- `references/pubmed_common_queries.md`
|
|
182
|
+
- `references/preclinical-extraction-guide.md`
|
|
183
|
+
|
|
184
|
+
## Companion scripts
|
|
185
|
+
|
|
186
|
+
- `scripts/extract_experiments.py`
|
|
187
|
+
- `scripts/synthesis.py`
|
|
188
|
+
- `scripts/generate_table.py`
|
|
189
|
+
- `scripts/export_all.py`
|
|
package/skills/literature/references/preclinical-extraction-guide.md
@@ -0,0 +1,215 @@
|
|
|
1
|
+
# Experiment Extraction Guide
|
|
2
|
+
|
|
3
|
+
**Workflow:** literature
|
|
4
|
+
**Purpose:** Keyword-based extraction approach, dictionaries, and limitations for classifying in vitro and in vivo experiments from paper abstracts.
|
|
5
|
+
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
## Overview
|
|
9
|
+
|
|
10
|
+
The `extract_all_experiments()` function parses paper abstracts to classify each paper into one of four categories:
|
|
11
|
+
- **in_vitro** — only cell-based experiments reported
|
|
12
|
+
- **in_vivo** — only animal model experiments reported
|
|
13
|
+
- **both** — both in vitro and in vivo experiments reported
|
|
14
|
+
- **unclassified** — insufficient information in abstract to classify
|
|
15
|
+
|
|
16
|
+
Extraction is keyword-based (not AI-based), operating on the abstract text. It is fast and reproducible but has known limitations (see below).
|
|
17
|
+
|
|
18
|
+
---
|
|
19
|
+
|
|
20
|
+
## Extraction Dictionaries
|
|
21
|
+
|
|
22
|
+
### In Vitro Keywords
|
|
23
|
+
|
|
24
|
+
These terms in the abstract signal cell-based experiments:
|
|
25
|
+
|
|
26
|
+
```python
|
|
27
|
+
IN_VITRO_KEYWORDS = [
|
|
28
|
+
# Cell lines
|
|
29
|
+
"cell line", "cell lines", "cancer cells", "tumor cells",
|
|
30
|
+
"HeLa", "MCF-7", "A549", "HCT116", "PC-3", "LNCaP", "MDA-MB",
|
|
31
|
+
"U87", "U251", "Jurkat", "THP-1", "RAW264", "HEK293",
|
|
32
|
+
|
|
33
|
+
# Assay types
|
|
34
|
+
"in vitro", "cell viability", "cell proliferation", "MTT assay",
|
|
35
|
+
"CCK-8", "colony formation", "clonogenic", "wound healing",
|
|
36
|
+
"scratch assay", "transwell", "invasion assay", "migration assay",
|
|
37
|
+
"flow cytometry", "apoptosis", "cell cycle", "western blot",
|
|
38
|
+
"immunofluorescence", "ELISA", "qPCR", "RNA-seq", "ChIP",
|
|
39
|
+
|
|
40
|
+
# Culture conditions
|
|
41
|
+
"cultured", "culture", "monolayer", "3D culture", "organoid",
|
|
42
|
+
"spheroid", "primary cells", "iPSC", "stem cells"
|
|
43
|
+
]
|
|
44
|
+
```
|
|
45
|
+
|
|
46
|
+
### In Vivo Keywords
|
|
47
|
+
|
|
48
|
+
These terms signal animal model experiments:
|
|
49
|
+
|
|
50
|
+
```python
|
|
51
|
+
IN_VIVO_KEYWORDS = [
|
|
52
|
+
# Animal models
|
|
53
|
+
"in vivo", "mouse model", "murine", "mice", "rat model",
|
|
54
|
+
"xenograft", "syngeneic", "PDX", "patient-derived xenograft",
|
|
55
|
+
"transgenic", "knockout mouse", "knock-in", "GEM", "GEMM",
|
|
56
|
+
"orthotopic", "subcutaneous tumor", "allograft",
|
|
57
|
+
|
|
58
|
+
# Animal procedures
|
|
59
|
+
"tumor volume", "tumor growth", "body weight", "survival",
|
|
60
|
+
"tumor regression", "metastasis model", "lung metastasis",
|
|
61
|
+
"liver metastasis", "intracranial", "intraperitoneal injection",
|
|
62
|
+
"intravenous injection", "oral gavage", "dosing",
|
|
63
|
+
|
|
64
|
+
# Animal species
|
|
65
|
+
"C57BL/6", "BALB/c", "nude mice", "SCID", "NOD-SCID",
|
|
66
|
+
"NSG", "athymic", "immunodeficient", "Sprague-Dawley",
|
|
67
|
+
"Wistar rat", "zebrafish"
|
|
68
|
+
]
|
|
69
|
+
```
|
|
70
|
+
|
|
71
|
+
---
|
|
72
|
+
|
|
73
|
+
## Extraction Logic
|
|
74
|
+
|
|
75
|
+
```python
|
|
76
|
+
def classify_paper(abstract: str) -> dict:
|
|
77
|
+
"""
|
|
78
|
+
Classify a paper abstract into experiment categories.
|
|
79
|
+
Returns dict with: experiment_type, in_vitro_signals, in_vivo_signals,
|
|
80
|
+
cell_lines, animal_models, assays, endpoints, findings
|
|
81
|
+
"""
|
|
82
|
+
abstract_lower = abstract.lower()
|
|
83
|
+
|
|
84
|
+
# Count keyword matches
|
|
85
|
+
in_vitro_hits = [kw for kw in IN_VITRO_KEYWORDS if kw.lower() in abstract_lower]
|
|
86
|
+
in_vivo_hits = [kw for kw in IN_VIVO_KEYWORDS if kw.lower() in abstract_lower]
|
|
87
|
+
|
|
88
|
+
# Classify
|
|
89
|
+
has_vitro = len(in_vitro_hits) >= 1
|
|
90
|
+
has_vivo = len(in_vivo_hits) >= 1
|
|
91
|
+
|
|
92
|
+
if has_vitro and has_vivo:
|
|
93
|
+
experiment_type = "both"
|
|
94
|
+
elif has_vitro:
|
|
95
|
+
experiment_type = "in_vitro"
|
|
96
|
+
elif has_vivo:
|
|
97
|
+
experiment_type = "in_vivo"
|
|
98
|
+
else:
|
|
99
|
+
experiment_type = "unclassified"
|
|
100
|
+
|
|
101
|
+
return {
|
|
102
|
+
"experiment_type": experiment_type,
|
|
103
|
+
"in_vitro_signals": in_vitro_hits,
|
|
104
|
+
"in_vivo_signals": in_vivo_hits,
|
|
105
|
+
"cell_lines": extract_cell_lines(abstract),
|
|
106
|
+
"animal_models": extract_animal_models(abstract),
|
|
107
|
+
"assays": extract_assays(abstract),
|
|
108
|
+
"endpoints": extract_endpoints(abstract),
|
|
109
|
+
"findings": extract_findings_sentence(abstract)
|
|
110
|
+
}
|
|
111
|
+
```
|
|
112
|
+
|
|
113
|
+
---
|
|
114
|
+
|
|
115
|
+
## Structured Field Extraction
|
|
116
|
+
|
|
117
|
+
Beyond classification, the script extracts specific structured fields:
|
|
118
|
+
|
|
119
|
+
### Cell Lines
|
|
120
|
+
Extracted using a curated regex pattern matching common cancer cell line names:
|
|
121
|
+
```python
|
|
122
|
+
CELL_LINE_PATTERN = r'\b(MCF-7|MDA-MB-\d+|A549|HCT116|PC-3|LNCaP|HeLa|U87|U251|Jurkat|THP-1|HEK293[T]?|Caco-2|SW480|HT-29|PANC-1|MiaPaCa|BxPC-3|DU145|22Rv1)\b'
|
|
123
|
+
```
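How such a pattern might be applied is sketched below. The guide does not show the body of `extract_cell_lines()`, so this version is an assumption: it uses a shortened copy of the pattern, matches case-sensitively, and returns unique names in order of first appearance.

```python
import re

# Shortened copy of the pattern above, for illustration only.
CELL_LINE_PATTERN = r"\b(MCF-7|MDA-MB-\d+|A549|HeLa|HEK293[T]?)\b"


def extract_cell_lines(abstract: str) -> list[str]:
    """Return unique cell line names in order of first appearance."""
    found = re.findall(CELL_LINE_PATTERN, abstract)
    # dict.fromkeys removes duplicates while preserving insertion order.
    return list(dict.fromkeys(found))


text = "MCF-7 and MDA-MB-231 cells were treated; MCF-7 results held in HEK293T."
print(", ".join(extract_cell_lines(text)))
# MCF-7, MDA-MB-231, HEK293T
```

Joining the list with `", "` gives the comma-separated string form that the output schema describes for the `cell_lines` column.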
|
|
124
|
+
|
|
125
|
+
### Animal Models
|
|
126
|
+
```python
|
|
127
|
+
ANIMAL_MODEL_PATTERN = r'\b(xenograft|syngeneic|PDX|patient-derived xenograft|transgenic|knockout|GEMM|orthotopic|allograft|C57BL/6|BALB/c|nude mice|NSG|SCID)\b'
|
|
128
|
+
```
|
|
129
|
+
|
|
130
|
+
### Assay Types
|
|
131
|
+
```python
|
|
132
|
+
ASSAY_PATTERN = r'\b(MTT|CCK-8|colony formation|clonogenic|wound healing|transwell|flow cytometry|western blot|ELISA|qPCR|RNA-seq|ChIP-seq|ATAC-seq|immunofluorescence|IHC|co-immunoprecipitation)\b'
|
|
133
|
+
```
|
|
134
|
+
|
|
135
|
+
### Endpoints
|
|
136
|
+
```python
|
|
137
|
+
ENDPOINT_PATTERN = r'\b(tumor volume|tumor growth|survival|body weight|metastasis|apoptosis|cell viability|proliferation|migration|invasion|angiogenesis|tumor regression)\b'
|
|
138
|
+
```
|
|
139
|
+
|
|
140
|
+
### Key Findings Sentence
|
|
141
|
+
The script extracts the last 1-2 sentences of the abstract as a proxy for the main finding:
|
|
142
|
+
```python
|
|
143
|
+
def extract_findings_sentence(abstract: str) -> str:
|
|
144
|
+
sentences = abstract.split(". ")
|
|
145
|
+
return ". ".join(sentences[-2:]).strip()
|
|
146
|
+
```
|
|
147
|
+
|
|
148
|
+
---
|
|
149
|
+
|
|
150
|
+
## Output Schema
|
|
151
|
+
|
|
152
|
+
The `experiment_extraction.csv` file contains one row per paper with these columns:
|
|
153
|
+
|
|
154
|
+
| Column | Type | Description |
|
|
155
|
+
|--------|------|-------------|
|
|
156
|
+
| `pmid` | str | PubMed ID |
|
|
157
|
+
| `doi` | str | DOI |
|
|
158
|
+
| `title` | str | Paper title |
|
|
159
|
+
| `year` | int | Publication year |
|
|
160
|
+
| `experiment_type` | str | in_vitro / in_vivo / both / unclassified |
|
|
161
|
+
| `cell_lines` | str | Comma-separated cell line names found |
|
|
162
|
+
| `animal_models` | str | Comma-separated animal model terms found |
|
|
163
|
+
| `assays` | str | Comma-separated assay types found |
|
|
164
|
+
| `endpoints` | str | Comma-separated endpoints found |
|
|
165
|
+
| `in_vitro_signal_count` | int | Number of in vitro keyword matches |
|
|
166
|
+
| `in_vivo_signal_count` | int | Number of in vivo keyword matches |
|
|
167
|
+
| `key_findings` | str | Last 1-2 sentences of abstract |
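A row in this schema could be assembled from the classification dict roughly as follows. The sketch assumes the paper metadata (`pmid`, `doi`, `title`, `year`) arrives in a separate dict and that list fields are joined with `", "`; `to_row` is a hypothetical name and all values are placeholders.

```python
import csv
import io

COLUMNS = [
    "pmid", "doi", "title", "year", "experiment_type", "cell_lines",
    "animal_models", "assays", "endpoints", "in_vitro_signal_count",
    "in_vivo_signal_count", "key_findings",
]


def to_row(meta: dict, result: dict) -> dict:
    """Flatten paper metadata plus a classification result into one CSV row."""
    return {
        "pmid": meta["pmid"],
        "doi": meta["doi"],
        "title": meta["title"],
        "year": meta["year"],
        "experiment_type": result["experiment_type"],
        "cell_lines": ", ".join(result["cell_lines"]),
        "animal_models": ", ".join(result["animal_models"]),
        "assays": ", ".join(result["assays"]),
        "endpoints": ", ".join(result["endpoints"]),
        # The two count columns are the lengths of the keyword-hit lists.
        "in_vitro_signal_count": len(result["in_vitro_signals"]),
        "in_vivo_signal_count": len(result["in_vivo_signals"]),
        "key_findings": result["findings"],
    }


# Placeholder values, not a real paper.
meta = {"pmid": "00000000", "doi": "10.0000/placeholder",
        "title": "Placeholder title", "year": 2024}
result = {
    "experiment_type": "in_vitro",
    "in_vitro_signals": ["cell line", "A549"],
    "in_vivo_signals": [],
    "cell_lines": ["A549"],
    "animal_models": [],
    "assays": ["MTT"],
    "endpoints": ["cell viability"],
    "findings": "Placeholder finding sentence.",
}

row = to_row(meta, result)
buffer = io.StringIO()  # stands in for the experiment_extraction.csv file
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```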
|
|
168
|
+
|
|
169
|
+
---
|
|
170
|
+
|
|
171
|
+
## Known Limitations
|
|
172
|
+
|
|
173
|
+
### 1. Abstract-only extraction
|
|
174
|
+
The script only reads abstracts, not full text. Papers that describe experiments only in the methods/results sections (not the abstract) will be misclassified as "unclassified".
|
|
175
|
+
|
|
176
|
+
**Mitigation:** None is built in: no full-text enrichment step is currently available, so treat "unclassified" as "not determinable from the abstract" and check the full text manually where the classification matters.
|
|
177
|
+
|
|
178
|
+
### 2. Keyword sensitivity
|
|
179
|
+
- **False positives:** A paper whose abstract mentions "mouse model" only as background (not as an experiment performed) may be classified as in_vivo.
|
|
180
|
+
- **False negatives:** Novel cell lines or animal models not in the dictionary will be missed.
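A further source of false positives follows from the substring test in `classify_paper()`: a short keyword such as `GEM` also matches inside longer words. The sketch below contrasts that test with a stricter whole-word variant. `whole_word_hits` is a hypothetical alternative, not something the script is known to implement.

```python
import re


def substring_hits(text: str, keywords: list[str]) -> list[str]:
    """Same test as classify_paper(): case-insensitive substring containment."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]


def whole_word_hits(text: str, keywords: list[str]) -> list[str]:
    """Stricter variant: the keyword may not touch other letters or digits."""
    hits = []
    for kw in keywords:
        pattern = r"(?<![A-Za-z0-9])" + re.escape(kw) + r"(?![A-Za-z0-9])"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append(kw)
    return hits


abstract = "Gemcitabine resistance was profiled in vitro on a microchip device."
keywords = ["GEM", "ChIP", "in vitro"]

print(substring_hits(abstract, keywords))   # ['GEM', 'ChIP', 'in vitro']
print(whole_word_hits(abstract, keywords))  # ['in vitro']
```

The stricter test has a cost of its own: it no longer matches plurals such as "xenografts" against the keyword "xenograft", so it trades false positives for false negatives.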
|
|
181
|
+
|
|
182
|
+
### 3. No quantitative extraction
|
|
183
|
+
The script does not extract dosing, IC50 values, tumor volumes, or statistical significance — only qualitative classification.
|
|
184
|
+
|
|
185
|
+
### 4. Language dependency
|
|
186
|
+
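Only English-language abstracts are supported. Non-English abstracts will mostly be classified as "unclassified", although language-independent tokens such as cell line names (e.g. `HeLa`) can still match.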
|
|
187
|
+
|
|
188
|
+
### 5. High "unclassified" rate for some targets
|
|
189
|
+
For niche targets or highly mechanistic papers (e.g., structural biology, computational studies), most papers may be classified as "unclassified". This is expected behavior — not a bug.
|
|
190
|
+
|
|
191
|
+
---
|
|
192
|
+
|
|
193
|
+
## Improving Extraction for Your Target
|
|
194
|
+
|
|
195
|
+
If you find too many "unclassified" papers, you can extend the keyword dictionaries:
|
|
196
|
+
|
|
197
|
+
```python
|
|
198
|
+
# Add target-specific cell lines
|
|
199
|
+
CUSTOM_CELL_LINES = ["your_cell_line_1", "your_cell_line_2"]
|
|
200
|
+
IN_VITRO_KEYWORDS.extend(CUSTOM_CELL_LINES)
|
|
201
|
+
|
|
202
|
+
# Add target-specific animal models
|
|
203
|
+
CUSTOM_ANIMAL_MODELS = ["your_model_1", "your_model_2"]
|
|
204
|
+
IN_VIVO_KEYWORDS.extend(CUSTOM_ANIMAL_MODELS)
|
|
205
|
+
```
|
|
206
|
+
|
|
207
|
+
Edit `scripts/extract_experiments.py` directly and re-run Step 2.
|
|
208
|
+
|
|
209
|
+
---
|
|
210
|
+
|
|
211
|
+
## References
|
|
212
|
+
|
|
213
|
+
- Keyword extraction approach inspired by: Lever J, et al. (2019) *PLOS Biol* — text mining for biomedical literature
|
|
214
|
+
- Cell line nomenclature: ATCC (https://www.atcc.org/)
|
|
215
|
+
- Animal model terminology: NCI Thesaurus (https://ncithesaurus.nci.nih.gov/)
|