@fbraza/pi-cite 0.1.0 → 0.2.0

package/README.md CHANGED
@@ -9,6 +9,19 @@ academic workflows. Registers four tools callable by the agent:
9
9
  - **`fetch_fulltext`** — retrieve a paper PDF via PMC → publisher OA → fallback.
10
10
  - (`semantic_scholar` helper used internally by the search tools.)
11
11
 
12
+ ## Bundled skill
13
+
14
+ Ships with the **`literature`** skill (`skills/literature/`), which turns these
15
+ tools into an end-to-end review workflow: verified-citation search, full-text
16
+ retrieval, per-paper experiment extraction, and a structured hypothesis
17
+ synthesis. Its frontmatter declares `allowed-tools` covering the extension's
18
+ tools above, so the skill and extension are paired on purpose.
19
+
20
+ - `references/` — PubMed/Semantic Scholar query syntax, API reference, and
21
+ full-text access routines.
22
+ - `scripts/` — Python helpers (`extract_experiments.py`, `synthesis.py`,
23
+ `generate_table.py`, `export_all.py`) invoked by the skill.
24
+
12
25
  ## Install
13
26
 
14
27
  Published on npm as `@fbraza/pi-cite`:
package/package.json CHANGED
@@ -1,11 +1,12 @@
1
1
  {
2
2
  "name": "@fbraza/pi-cite",
3
- "version": "0.1.0",
3
+ "version": "0.2.0",
4
4
  "description": "Pi extension with PubMed, Semantic Scholar, literature search, and full-text retrieval tools.",
5
5
  "license": "MIT",
6
6
  "type": "module",
7
7
  "files": [
8
8
  "src",
9
+ "skills",
9
10
  "README.md"
10
11
  ],
11
12
  "keywords": [
@@ -18,6 +19,9 @@
18
19
  "pi": {
19
20
  "extensions": [
20
21
  "./src/index.ts"
22
+ ],
23
+ "skills": [
24
+ "./skills"
21
25
  ]
22
26
  },
23
27
  "scripts": {
@@ -0,0 +1,208 @@
1
+ ---
2
+ name: literature
3
+ description: Unified literature search, verification, full-text retrieval, and synthesis workflow for scientific questions. Use when any biological claim needs a verified citation, when reviewing a gene/pathway/disease/drug/target, when surveying preclinical evidence for a target in a disease, when checking novelty, when retrieving full text for specific papers, or when turning a paper set into a structured hypothesis synthesis.
4
+ allowed-tools: Read, Write, WebFetch, WebSearch, literature_search, pubmed_search, semantic_scholar_search, fetch_fulltext
5
+ starting-prompt: Conduct a literature review on my research topic with verified citations, structured synthesis, and a per-paper summary table.
6
+ ---
7
+
8
+ # Literature
9
+
10
+ Unified literature skill replacing the previous split review and preclinical workflows.
11
+
12
+ ## What this skill covers
13
+
14
+ Use this skill when you need to:
15
+ - verify any biological claim with a real citation
16
+ - review literature on a gene, pathway, disease, drug, or molecular target
17
+ - survey preclinical evidence for a target in a disease context
18
+ - check whether a finding appears novel or already published
19
+ - retrieve full text or PDFs for key papers
20
+ - synthesize a paper set into hypotheses, contradictions, and evidence-weighted conclusions
21
+
22
+ Do not use this skill for:
23
+ - running computational omics analyses
24
+ - generating figures without a literature component
25
+ - inventing or guessing citations
26
+
27
+ ## Hard rules
28
+
29
+ - Every citation must be real and verifiable.
30
+ - Never fabricate PMIDs, DOIs, titles, journals, years, or author lists.
31
+ - Distinguish human, animal, and in vitro evidence.
32
+ - Weight evidence quality by study design and replication.
33
+ - Record how full text was obtained for each paper.
34
+ - Use inline numbered citations like `[1]` or `[1, 2]` in narrative synthesis.
35
+ - Never overwrite outputs from a previous literature search.
36
+ - Never write literature-review outputs directly to generic shared paths under `results/`.
37
+
38
+ ## Standard workflow
39
+
40
+ ### Step 1 — Clarify scope
41
+
42
+ Always clarify:
43
+ - exact claim, topic, target, or disease
44
+ - desired time range
45
+ - species restrictions
46
+ - study type filters
47
+ - whether the task is **general review** or **preclinical extraction mode**
48
+
49
+ ### Step 2 — Create a dedicated output folder
50
+
51
+ For every new literature review or literature research task, create a new dedicated folder inside `results/` before generating files.
52
+
53
+ The folder name must describe the search session/topic clearly, for example:
54
+ - `results/literature_multiomics_ML_biomarkers_PGD/`
55
+ - `results/literature_siRNA_lung_transplant_new_treatments/`
56
+
57
+ All generated files for that search session must be saved inside this dedicated folder, including:
58
+ - `literature_report.md`
59
+ - `paper_summary_table.csv`
60
+ - `search_log.md`
61
+ - `pdfs/`
62
+ - any optional analysis/export artifacts such as `analysis_object.pkl`
63
+
64
+ Never write directly to generic shared paths such as:
65
+ - `results/literature_report.md`
66
+ - `results/paper_summary_table.csv`
67
+ - `results/analysis_object.pkl`
68
+ - `results/literature_pdfs/`
69
+
70
+ If a folder for a previous search already exists, create a new folder with a distinct descriptive search-session title rather than using versioned filenames.
71
+
72
+ At the end of the task, clearly report the exact output folder and generated file paths to the user.
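The folder rules in Step 2 can be sketched in Python as follows. This is a minimal illustration, not part of the package: the function name `make_session_folder` and the slug rule (runs of non-alphanumeric characters become `_`) are assumptions chosen to match the example folder names above.

```python
import re
from pathlib import Path


def make_session_folder(topic: str, root: Path = Path("results")) -> Path:
    """Create <root>/literature_<slug>/ with a pdfs/ subfolder (hypothetical helper)."""
    # Assumed slug rule: collapse every run of non-alphanumeric characters into "_".
    slug = re.sub(r"[^A-Za-z0-9]+", "_", topic).strip("_")
    if not slug:
        raise ValueError("topic must contain at least one letter or digit")
    folder = root / f"literature_{slug}"
    # exist_ok=False raises FileExistsError, so a previous session is never reused.
    folder.mkdir(parents=True, exist_ok=False)
    (folder / "pdfs").mkdir()
    return folder
```

A `FileExistsError` here is the signal to pick a more specific session title rather than to write into the older folder.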
73
+
74
+ ### Step 3 — Search
75
+
76
+ Use the custom literature tool as the primary search path:
77
+ - **Primary:** `literature_search`
78
+
79
+ When calling `literature_search`:
80
+ - Always construct `pubmed_query` using PubMed-specific syntax from the references below.
81
+ - Use MeSH terms (`[mh]` / `[majr]`), title/abstract terms (`[tiab]`), publication types (`[pt]`), substance names (`[nm]`), date filters, and Boolean logic as appropriate.
82
+ - Construct `semantic_scholar_query` separately as broader natural-language search terms when useful. Semantic Scholar is used automatically as a supplementary search only when `SEMANTIC_SCHOLAR_API_KEY` is configured.
83
+ - Do not pass a generic natural-language query as `pubmed_query` when a PubMed/MeSH query can be constructed.
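To illustrate the query-construction rules above, here is a minimal sketch that assembles a tagged Boolean `pubmed_query` string. The helper names `or_group` and `build_pubmed_query` are hypothetical, and the example terms are placeholders; the `[mh]`, `[tiab]`, and `[pt]` tags come from the list above, and `[dp]` is PubMed's publication-date tag.

```python
def or_group(terms, tag):
    """Quote each term, attach its field tag, and OR the alternatives together."""
    return "(" + " OR ".join(f'"{term}"[{tag}]' for term in terms) + ")"


def build_pubmed_query(mesh_terms, tiab_terms, pub_types=(), years=None):
    """AND together one clause per concept (hypothetical helper, not the package's API)."""
    clauses = [or_group(mesh_terms, "mh"), or_group(tiab_terms, "tiab")]
    if pub_types:
        clauses.append(or_group(pub_types, "pt"))
    if years:
        start, end = years
        # PubMed date ranges use the form start:end[dp].
        clauses.append(f"{start}:{end}[dp]")
    return " AND ".join(clauses)


query = build_pubmed_query(
    mesh_terms=["Lung Transplantation"],
    tiab_terms=["siRNA", "small interfering RNA"],
    pub_types=["Review"],
    years=(2019, 2024),
)
```

The resulting string would be passed as `pubmed_query`, while `semantic_scholar_query` stays a plain natural-language phrase.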
84
+
85
+ These extension tools are the preferred search path for this skill. Do not fall back to generic `WebFetch` / `WebSearch` first when one of these typed tools fits the task.
86
+
87
+ Read these references before constructing queries:
88
+ - `references/pubmed_routine.md`
89
+ - `references/pubmed_search_syntax.md`
90
+ - `references/pubmed_common_queries.md`
91
+ - `references/semanticscholar_routine.md`
92
+
93
+ ### Step 4 — Screen and prioritise
94
+
95
+ - Deduplicate across PubMed and Semantic Scholar sources.
96
+ - Prioritise by relevance, recency, citation count, and study type.
97
+ - Default to deep reading of the top 20 papers unless the user asks otherwise.
98
+ - For preclinical requests, keep studies with experimental target perturbation evidence.
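The deduplication and ranking in Step 4 might look like the sketch below. The record keys (`pmid`, `doi`, `year`, `citations`) and the ordering (newest first, then most cited) are assumptions for illustration; relevance and study type, which the list above also names, are left to the reviewer here.

```python
def dedupe(records):
    """Merge records that share a DOI (case-insensitive) or, failing that, a PMID."""
    merged = {}
    for rec in records:
        # Assumes every record carries at least a DOI or a PMID.
        key = (rec.get("doi") or "").lower() or f"pmid:{rec.get('pmid')}"
        if key in merged:
            # Fill gaps in the first-seen record instead of overwriting it.
            for field, value in rec.items():
                if merged[key].get(field) in (None, ""):
                    merged[key][field] = value
        else:
            merged[key] = dict(rec)
    return list(merged.values())


def prioritise(records, top_n=20):
    """Sort by year, then citation count, both descending, and keep the top_n."""
    ranked = sorted(
        records,
        key=lambda r: (r.get("year") or 0, r.get("citations") or 0),
        reverse=True,
    )
    return ranked[:top_n]
```

Keying on DOI first lets a PubMed record and a Semantic Scholar record for the same paper collapse into one entry even when only one of them carries a PMID.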
99
+
100
+ ### Step 5 — Retrieve full text
101
+
102
+ Use `fetch_fulltext` for top papers. Prefer it over ad-hoc `WebFetch` PDF retrieval because it applies the defined PMC → publisher OA → Sci-Hub chain.
103
+
104
+ Access chain:
105
+ 1. PMC
106
+ 2. publisher open-access page
107
+ 3. Sci-Hub fallback
108
+
109
+ Read:
110
+ - `references/full-text-access-guide.md`
111
+ - `references/scihub_routine.md`
112
+
113
+ ### Step 6 — Synthesis
114
+
115
+ Always produce:
116
+ 1. a narrative synthesis with inline numbered citations
117
+ 2. a per-paper structured summary table
118
+
119
+ When in **preclinical extraction mode**, add:
120
+ - Experiment Type
121
+ - Model System
122
+ - Assay/Endpoint
123
+ - Finding Direction
124
+
125
+ Use:
126
+ - `scripts/synthesis.py`
127
+ - `scripts/generate_table.py`
128
+ - `scripts/export_all.py`
129
+
130
+ For preclinical extraction details, read:
131
+ - `references/preclinical-extraction-guide.md`
132
+ - `scripts/extract_experiments.py`
133
+
134
+ ## Evidence quality framework
135
+
136
+ Rank evidence broadly as:
137
+ - **High:** replicated clinical evidence, meta-analysis, systematic review, strong human studies
138
+ - **Moderate:** strong animal studies, coherent multi-model evidence, robust mechanistic studies
139
+ - **Low/Preliminary:** single-study results, purely computational inference, unreplicated in vitro work
140
+
141
+ ### What to mark as preliminary
142
+ - single-study findings
143
+ - animal-only findings for human claims
144
+ - in vitro findings without in vivo follow-up
145
+
146
+ ### What to refuse without qualification
147
+ - causal claims from correlational studies
148
+ - claims supported only by retracted work
149
+ - claims contradicting the weight of evidence
150
+
151
+ ## Output format
152
+
153
+ ### Narrative section
154
+
155
+ Use concise prose with inline citations.
156
+
157
+ ### Paper Summary Table
158
+
159
+ ```markdown
160
+ ## Paper Summary Table
161
+
162
+ | # | PMID/DOI | Authors (year) | Key Message | Key Results | Key Methods | Study Type | Evidence Quality |
163
+ |---|---|---|---|---|---|---|---|
164
+ ```
165
+
166
+ ### Extra columns for preclinical extraction mode
167
+
168
+ ```markdown
169
+ | Experiment Type | Model System | Assay/Endpoint | Finding Direction |
170
+ ```
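As an illustration of the two table layouts above, the following sketch renders rows into the same Markdown format. The function name `render_table` and the dict-per-row input are assumptions; the bundled `scripts/generate_table.py` may work differently.

```python
BASE_COLUMNS = [
    "#", "PMID/DOI", "Authors (year)", "Key Message",
    "Key Results", "Key Methods", "Study Type", "Evidence Quality",
]
PRECLINICAL_COLUMNS = [
    "Experiment Type", "Model System", "Assay/Endpoint", "Finding Direction",
]


def render_table(rows, preclinical=False):
    """Render a list of dicts as a Markdown table (hypothetical helper)."""
    columns = BASE_COLUMNS + (PRECLINICAL_COLUMNS if preclinical else [])

    def cell(value):
        # A literal "|" would end the cell early, so escape it; flatten newlines.
        return str(value).replace("|", "\\|").replace("\n", " ")

    lines = ["| " + " | ".join(columns) + " |", "|" + "---|" * len(columns)]
    for row in rows:
        # Missing keys become empty cells rather than raising.
        lines.append("| " + " | ".join(cell(row.get(col, "")) for col in columns) + " |")
    return "\n".join(lines)
```

Passing `preclinical=True` appends the four extra columns so that both modes share one code path.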
171
+
172
+ ## Hypothesis synthesis
173
+
174
+ After reviewing the core paper set, optionally produce:
175
+ - explicit hypotheses stated by authors
176
+ - implicit mechanistic hypotheses inferred from evidence
177
+ - contradiction matrix across papers
178
+ - highest-confidence next-step hypotheses
179
+
180
+ ## Expected files
181
+
182
+ Typical outputs must be placed in a dedicated search-session folder under `./results/`, for example `./results/literature_<descriptive_topic>/`:
183
+ - `literature_report.md`
184
+ - `paper_summary_table.csv`
185
+ - `search_log.md`
186
+ - `pdfs/`
187
+ - optional `analysis_object.pkl` or other export artifacts when produced
188
+
189
+ Do not write these outputs directly to `./results/` or reuse a previous search folder.
190
+
191
+ ## Companion references
192
+
193
+ - `references/pubmed_api_reference.md`
194
+ - `references/pubmed_routine.md`
195
+ - `references/pubmed_search_syntax.md`
196
+ - `references/pubmed_common_queries.md`
197
+ - `references/semanticscholar_routine.md`
198
+ - `references/preclinical-extraction-guide.md`
199
+ - `references/full-text-access-guide.md`
200
+ - `references/scihub_routine.md`
201
+
202
+ ## Companion scripts
203
+
204
+ - `scripts/extract_experiments.py`
205
+ - `scripts/synthesis.py`
206
+ - `scripts/generate_table.py`
207
+ - `scripts/export_all.py`
208
+ - `scripts/scihub_pdf_resolver.py`
@@ -0,0 +1,34 @@
1
+ # Full-Text Access Guide
2
+
3
+ **Workflow:** literature
4
+ **Purpose:** Retrieve PDFs for prioritised papers using a consistent fallback chain.
5
+
6
+ ## Access order
7
+
8
+ 1. **PubMed Central (PMC)**
9
+ - Preferred for PubMed-indexed papers with open full text.
10
+ - Use PubMed/PMC linking first when a PMID is available.
11
+
12
+ 2. **Publisher open-access page**
13
+ - Resolve DOI at `https://doi.org/<doi>`.
14
+ - Look for `citation_pdf_url`, explicit PDF links, or embedded PDF viewers.
15
+
16
+ 3. **Sci-Hub fallback**
17
+ - Use only as the final fallback after OA routes are exhausted.
18
+ - Record that Sci-Hub was used.
19
+
20
+ ## Per-paper logging
21
+
22
+ For each paper, record:
23
+ - PMID
24
+ - DOI
25
+ - source used: `pmc`, `publisher_oa`, `scihub`, or `not_found`
26
+ - direct PDF URL if found
27
+ - local saved path if downloaded
28
+ - access note
29
+
30
+ ## Notes
31
+
32
+ - PMC and publisher OA should always be attempted before Sci-Hub.
33
+ - If no DOI is known but PMID exists, try resolving identifiers from PubMed metadata first.
34
+ - If no PDF is found, keep the paper in the synthesis and note `not_found`.
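The access order and per-paper log can be sketched as a small fallback loop. Everything named here (`AccessRecord`, `resolve_fulltext`, and the resolver callables) is hypothetical; the resolvers are passed in by the caller, so the sketch performs no network access and says nothing about how `fetch_fulltext` is actually implemented. It assumes Python 3.9 or later for the built-in generic types.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A resolver takes (pmid, doi) and returns a direct PDF URL, or None on failure.
Resolver = Callable[[Optional[str], Optional[str]], Optional[str]]


@dataclass
class AccessRecord:
    pmid: Optional[str]
    doi: Optional[str]
    source: str = "not_found"  # pmc, publisher_oa, scihub, or not_found
    pdf_url: Optional[str] = None
    local_path: Optional[str] = None
    note: str = ""


def resolve_fulltext(pmid, doi, resolvers: list[tuple[str, Resolver]]) -> AccessRecord:
    """Try each (source, resolver) pair in order and stop at the first hit."""
    record = AccessRecord(pmid=pmid, doi=doi)
    tried = []
    for source, resolver in resolvers:
        tried.append(source)
        url = resolver(pmid, doi)
        if url:
            record.source = source
            record.pdf_url = url
            break
    # The access note records which routes were attempted, in order.
    record.note = "tried: " + ", ".join(tried)
    return record
```

Because the default `source` is `not_found`, a paper with no retrievable PDF still yields a complete log entry and can stay in the synthesis.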
@@ -0,0 +1,215 @@
1
+ # Experiment Extraction Guide
2
+
3
+ **Workflow:** literature
4
+ **Purpose:** Keyword-based extraction approach, dictionaries, and limitations for classifying in vitro and in vivo experiments from paper abstracts.
5
+
6
+ ---
7
+
8
+ ## Overview
9
+
10
+ The `extract_all_experiments()` function parses paper abstracts to classify each paper into one of four categories:
11
+ - **in_vitro** — only cell-based experiments reported
12
+ - **in_vivo** — only animal model experiments reported
13
+ - **both** — both in vitro and in vivo experiments reported
14
+ - **unclassified** — insufficient information in abstract to classify
15
+
16
+ Extraction is keyword-based (not AI-based), operating on the abstract text. It is fast and reproducible but has known limitations (see below).
17
+
18
+ ---
19
+
20
+ ## Extraction Dictionaries
21
+
22
+ ### In Vitro Keywords
23
+
24
+ These terms in the abstract signal cell-based experiments:
25
+
26
+ ```python
27
+ IN_VITRO_KEYWORDS = [
28
+ # Cell lines
29
+ "cell line", "cell lines", "cancer cells", "tumor cells",
30
+ "HeLa", "MCF-7", "A549", "HCT116", "PC-3", "LNCaP", "MDA-MB",
31
+ "U87", "U251", "Jurkat", "THP-1", "RAW264", "HEK293",
32
+
33
+ # Assay types
34
+ "in vitro", "cell viability", "cell proliferation", "MTT assay",
35
+ "CCK-8", "colony formation", "clonogenic", "wound healing",
36
+ "scratch assay", "transwell", "invasion assay", "migration assay",
37
+ "flow cytometry", "apoptosis", "cell cycle", "western blot",
38
+ "immunofluorescence", "ELISA", "qPCR", "RNA-seq", "ChIP",
39
+
40
+ # Culture conditions
41
+ "cultured", "culture", "monolayer", "3D culture", "organoid",
42
+ "spheroid", "primary cells", "iPSC", "stem cells"
43
+ ]
44
+ ```
45
+
46
+ ### In Vivo Keywords
47
+
48
+ These terms signal animal model experiments:
49
+
50
+ ```python
51
+ IN_VIVO_KEYWORDS = [
52
+ # Animal models
53
+ "in vivo", "mouse model", "murine", "mice", "rat model",
54
+ "xenograft", "syngeneic", "PDX", "patient-derived xenograft",
55
+ "transgenic", "knockout mouse", "knock-in", "GEM", "GEMM",
56
+ "orthotopic", "subcutaneous tumor", "allograft",
57
+
58
+ # Animal procedures
59
+ "tumor volume", "tumor growth", "body weight", "survival",
60
+ "tumor regression", "metastasis model", "lung metastasis",
61
+ "liver metastasis", "intracranial", "intraperitoneal injection",
62
+ "intravenous injection", "oral gavage", "dosing",
63
+
64
+ # Animal species
65
+ "C57BL/6", "BALB/c", "nude mice", "SCID", "NOD-SCID",
66
+ "NSG", "athymic", "immunodeficient", "Sprague-Dawley",
67
+ "Wistar rat", "zebrafish"
68
+ ]
69
+ ```
70
+
71
+ ---
72
+
73
+ ## Extraction Logic
74
+
75
+ ```python
76
+ def classify_paper(abstract: str) -> dict:
77
+     """
78
+     Classify a paper abstract into experiment categories.
79
+     Returns dict with: experiment_type, in_vitro_signals, in_vivo_signals,
80
+     cell_lines, animal_models, assays, endpoints, findings
81
+     """
82
+     abstract_lower = abstract.lower()
83
+
84
+     # Count keyword matches
85
+     in_vitro_hits = [kw for kw in IN_VITRO_KEYWORDS if kw.lower() in abstract_lower]
86
+     in_vivo_hits = [kw for kw in IN_VIVO_KEYWORDS if kw.lower() in abstract_lower]
87
+
88
+     # Classify
89
+     has_vitro = len(in_vitro_hits) >= 1
90
+     has_vivo = len(in_vivo_hits) >= 1
91
+
92
+     if has_vitro and has_vivo:
93
+         experiment_type = "both"
94
+     elif has_vitro:
95
+         experiment_type = "in_vitro"
96
+     elif has_vivo:
97
+         experiment_type = "in_vivo"
98
+     else:
99
+         experiment_type = "unclassified"
100
+
101
+     return {
102
+         "experiment_type": experiment_type,
103
+         "in_vitro_signals": in_vitro_hits,
104
+         "in_vivo_signals": in_vivo_hits,
105
+         "cell_lines": extract_cell_lines(abstract),
106
+         "animal_models": extract_animal_models(abstract),
107
+         "assays": extract_assays(abstract),
108
+         "endpoints": extract_endpoints(abstract),
109
+         "findings": extract_findings_sentence(abstract)
110
+     }
111
+ ```
112
+
113
+ ---
114
+
115
+ ## Structured Field Extraction
116
+
117
+ Beyond classification, the script extracts specific structured fields:
118
+
119
+ ### Cell Lines
120
+ Extracted using a curated regex pattern matching common cancer cell line names:
121
+ ```python
122
+ CELL_LINE_PATTERN = r'\b(MCF-7|MDA-MB-\d+|A549|HCT116|PC-3|LNCaP|HeLa|U87|U251|Jurkat|THP-1|HEK293[T]?|Caco-2|SW480|HT-29|PANC-1|MiaPaCa|BxPC-3|DU145|22Rv1)\b'
123
+ ```
124
+
125
+ ### Animal Models
126
+ ```python
127
+ ANIMAL_MODEL_PATTERN = r'\b(xenograft|syngeneic|PDX|patient-derived xenograft|transgenic|knockout|GEMM|orthotopic|allograft|C57BL/6|BALB/c|nude mice|NSG|SCID)\b'
128
+ ```
129
+
130
+ ### Assay Types
131
+ ```python
132
+ ASSAY_PATTERN = r'\b(MTT|CCK-8|colony formation|clonogenic|wound healing|transwell|flow cytometry|western blot|ELISA|qPCR|RNA-seq|ChIP-seq|ATAC-seq|immunofluorescence|IHC|co-immunoprecipitation)\b'
133
+ ```
134
+
135
+ ### Endpoints
136
+ ```python
137
+ ENDPOINT_PATTERN = r'\b(tumor volume|tumor growth|survival|body weight|metastasis|apoptosis|cell viability|proliferation|migration|invasion|angiogenesis|tumor regression)\b'
138
+ ```
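The patterns above are shown without the helpers that apply them. A minimal sketch of one such helper follows; `extract_terms` is a hypothetical name, and case-insensitive matching with de-duplication is an assumption rather than the documented behaviour of `scripts/extract_experiments.py`.

```python
import re

# Copied from the Endpoints pattern above.
ENDPOINT_PATTERN = r'\b(tumor volume|tumor growth|survival|body weight|metastasis|apoptosis|cell viability|proliferation|migration|invasion|angiogenesis|tumor regression)\b'


def extract_terms(pattern: str, text: str) -> list[str]:
    """Return unique matches of `pattern` in order of first appearance."""
    seen = {}
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        # Key on the lowercased term so "Tumor Volume" and "tumor volume" merge.
        seen.setdefault(match.group(1).lower(), match.group(1))
    return list(seen.values())


endpoints = extract_terms(
    ENDPOINT_PATTERN,
    "Treatment reduced tumor volume; Tumor Volume doubling slowed and survival improved.",
)
# Join with ", " to obtain the comma-separated string stored in the output file.
endpoints_field = ", ".join(endpoints)
```

For cell line names, where case carries meaning, dropping `re.IGNORECASE` may be the safer choice.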
139
+
140
+ ### Key Findings Sentence
141
+ The script extracts the last 1-2 sentences of the abstract as a proxy for the main finding:
142
+ ```python
143
+ def extract_findings_sentence(abstract: str) -> str:
144
+     sentences = abstract.split(". ")
145
+     return ". ".join(sentences[-2:]).strip()
146
+ ```
147
+
148
+ ---
149
+
150
+ ## Output Schema
151
+
152
+ The `experiment_extraction.csv` file contains one row per paper with these columns:
153
+
154
+ | Column | Type | Description |
155
+ |--------|------|-------------|
156
+ | `pmid` | str | PubMed ID |
157
+ | `doi` | str | DOI |
158
+ | `title` | str | Paper title |
159
+ | `year` | int | Publication year |
160
+ | `experiment_type` | str | in_vitro / in_vivo / both / unclassified |
161
+ | `cell_lines` | str | Comma-separated cell line names found |
162
+ | `animal_models` | str | Comma-separated animal model terms found |
163
+ | `assays` | str | Comma-separated assay types found |
164
+ | `endpoints` | str | Comma-separated endpoints found |
165
+ | `in_vitro_signal_count` | int | Number of in vitro keyword matches |
166
+ | `in_vivo_signal_count` | int | Number of in vivo keyword matches |
167
+ | `key_findings` | str | Last 1-2 sentences of abstract |
168
+
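The schema above can be produced with the standard `csv` module, as in this sketch. The helpers `to_row` and `write_rows` are hypothetical, and the sketch assumes that `paper` holds the bibliographic fields and that `result` is the dict returned by `classify_paper` with list-valued term fields.

```python
import csv

COLUMNS = [
    "pmid", "doi", "title", "year", "experiment_type", "cell_lines",
    "animal_models", "assays", "endpoints", "in_vitro_signal_count",
    "in_vivo_signal_count", "key_findings",
]


def to_row(paper: dict, result: dict) -> dict:
    """Flatten one paper plus its classification into a CSV row."""
    return {
        "pmid": paper.get("pmid", ""),
        "doi": paper.get("doi", ""),
        "title": paper.get("title", ""),
        "year": paper.get("year", ""),
        "experiment_type": result["experiment_type"],
        # List fields become comma-separated strings, as the schema describes.
        "cell_lines": ", ".join(result["cell_lines"]),
        "animal_models": ", ".join(result["animal_models"]),
        "assays": ", ".join(result["assays"]),
        "endpoints": ", ".join(result["endpoints"]),
        "in_vitro_signal_count": len(result["in_vitro_signals"]),
        "in_vivo_signal_count": len(result["in_vivo_signals"]),
        "key_findings": result["findings"],
    }


def write_rows(rows, handle):
    """Write a header plus one line per paper to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
```

When writing to a real file, open it with `newline=""` so the `csv` module controls line endings.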
169
+ ---
170
+
171
+ ## Known Limitations
172
+
173
+ ### 1. Abstract-only extraction
174
+ The script only reads abstracts, not full text. Papers that describe experiments only in the methods/results sections (not the abstract) will be misclassified as "unclassified".
175
+
176
+ **Mitigation:** Step 5 of the workflow (full-text enrichment) addresses this for top papers.
177
+
178
+ ### 2. Keyword sensitivity
179
+ - **False positives:** A paper mentioning "mouse model" in the introduction (not as an experiment performed) may be classified as in_vivo.
180
+ - **False negatives:** Novel cell lines or animal models not in the dictionary will be missed.
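A further source of false positives follows from the substring test in `classify_paper`: a short keyword such as `GEM` also matches inside "gemcitabine", and `ChIP` matches inside "microchip". The sketch below shows one possible mitigation, matching keywords only at word boundaries. `find_keywords` is a hypothetical replacement, not code from the package.

```python
import re


def find_keywords(keywords, text):
    """Return the keywords that occur in `text` as whole terms, ignoring case."""
    hits = []
    for kw in keywords:
        # Lookarounds require a non-word character or the string edge on both
        # sides; re.escape keeps characters such as "/" or "-" literal.
        pattern = r"(?<!\w)" + re.escape(kw) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append(kw)
    return hits


text = "Gemcitabine exposure was quantified on a microchip platform."
substring_hits = [kw for kw in ["GEM", "ChIP"] if kw.lower() in text.lower()]
boundary_hits = find_keywords(["GEM", "ChIP"], text)
```

The trade-off is that boundary matching no longer catches inflected forms such as "xenografts" for `xenograft`, so plural variants would need to be listed explicitly.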
181
+
182
+ ### 3. No quantitative extraction
183
+ The script does not extract dosing, IC50 values, tumor volumes, or statistical significance — only qualitative classification.
184
+
185
+ ### 4. Language dependency
186
+ Only English-language abstracts are supported. Non-English papers will be classified as "unclassified".
187
+
188
+ ### 5. High "unclassified" rate for some targets
189
+ For niche targets or highly mechanistic papers (e.g., structural biology, computational studies), most papers may be classified as "unclassified". This is expected behavior — not a bug.
190
+
191
+ ---
192
+
193
+ ## Improving Extraction for Your Target
194
+
195
+ If you find too many "unclassified" papers, you can extend the keyword dictionaries:
196
+
197
+ ```python
198
+ # Add target-specific cell lines
199
+ CUSTOM_CELL_LINES = ["your_cell_line_1", "your_cell_line_2"]
200
+ IN_VITRO_KEYWORDS.extend(CUSTOM_CELL_LINES)
201
+
202
+ # Add target-specific animal models
203
+ CUSTOM_ANIMAL_MODELS = ["your_model_1", "your_model_2"]
204
+ IN_VIVO_KEYWORDS.extend(CUSTOM_ANIMAL_MODELS)
205
+ ```
206
+
207
+ Edit `scripts/extract_experiments.py` directly and re-run the extraction.
208
+
209
+ ---
210
+
211
+ ## References
212
+
213
+ - Keyword extraction approach inspired by: Lever J, et al. (2019) *PLOS Biol* — text mining for biomedical literature
214
+ - Cell line nomenclature: ATCC (https://www.atcc.org/)
215
+ - Animal model terminology: NCI Thesaurus (https://ncithesaurus.nci.nih.gov/)