@fbraza/pi-cite 0.1.0 → 0.3.0

package/README.md CHANGED
@@ -1,13 +1,24 @@
  # @fbraza/pi-cite
 
  A standalone [Pi](https://pi.dev) extension providing literature-research tools for
- academic workflows. Registers four tools callable by the agent:
+ academic workflows. Registers two tools callable by the agent:
 
- - **`literature_search`** — PubMed-first search with optional Semantic Scholar
- supplementary metadata.
+ - **`literature_search`** — literature workflow search against PubMed using a
+ PubMed-ready query (MeSH `[mh]`, `[tiab]`, `[pt]`, substance `[nm]`, and Boolean
+ logic), with streaming progress and deduplicated results.
  - **`pubmed_search`** — direct PubMed query (MeSH, `[tiab]`, `[pt]`, etc.).
- - **`fetch_fulltext`** — retrieve a paper PDF via PMC → publisher OA → fallback.
- - (`semantic_scholar` helper used internally by the search tools.)
+
+ ## Bundled skill
+
+ Ships with the **`literature`** skill (`skills/literature/`), which turns these
+ tools into an end-to-end review workflow: verified-citation search, per-paper
+ experiment extraction, and a structured hypothesis synthesis. Its frontmatter
+ declares `allowed-tools` covering the extension's tools above, so the skill and
+ extension are paired on purpose.
+
+ - `references/` — PubMed query syntax, API reference, and common queries.
+ - `scripts/` — Python helpers (`extract_experiments.py`, `synthesis.py`,
+ `generate_table.py`, `export_all.py`) invoked by the skill.
 
  ## Install
 
@@ -41,4 +52,3 @@ npm run pack:check # preview the published tarball contents
  | Variable | Purpose |
  |---|---|
  | `NCBI_API_KEY` / `api_key` env | PubMed rate limit + E-utilities auth |
- | `SEMANTIC_SCHOLAR_API_KEY` | Enables Semantic Scholar supplementary search |
package/package.json CHANGED
@@ -1,23 +1,26 @@
  {
    "name": "@fbraza/pi-cite",
-   "version": "0.1.0",
-   "description": "Pi extension with PubMed, Semantic Scholar, literature search, and full-text retrieval tools.",
+   "version": "0.3.0",
+   "description": "Pi extension with PubMed and literature search tools.",
    "license": "MIT",
    "type": "module",
    "files": [
      "src",
+     "skills",
      "README.md"
    ],
    "keywords": [
      "pi-package",
      "pi-extension",
      "literature",
-     "pubmed",
-     "semantic-scholar"
+     "pubmed"
    ],
    "pi": {
      "extensions": [
        "./src/index.ts"
+     ],
+     "skills": [
+       "./skills"
      ]
    },
    "scripts": {
@@ -0,0 +1,189 @@
+ ---
+ name: literature
+ description: Unified literature search, verification, and synthesis workflow for scientific questions. Use when any biological claim needs a verified citation, when reviewing a gene/pathway/disease/drug/target, when surveying preclinical evidence for a target in a disease, when checking novelty, or when turning a paper set into a structured hypothesis synthesis.
+ allowed-tools: Read, Write, WebFetch, WebSearch, literature_search, pubmed_search
+ starting-prompt: Conduct a literature review on my research topic with verified citations, structured synthesis, and a per-paper summary table.
+ ---
+
+ # Literature
+
+ Unified literature skill replacing the previous split review and preclinical workflows.
+
+ ## What this skill covers
+
+ Use this skill when you need to:
+ - verify any biological claim with a real citation
+ - review literature on a gene, pathway, disease, drug, or molecular target
+ - survey preclinical evidence for a target in a disease context
+ - check whether a finding appears novel or already published
+ - synthesize a paper set into hypotheses, contradictions, and evidence-weighted conclusions
+
+ Do not use this skill for:
+ - running computational omics analyses
+ - generating figures without a literature component
+ - inventing or guessing citations
+
+ ## Hard rules
+
+ - Every citation must be real and verifiable.
+ - Never fabricate PMIDs, DOIs, titles, journals, years, or author lists.
+ - Distinguish human, animal, and in vitro evidence.
+ - Weight evidence quality by study design and replication.
+ - Use inline numbered citations like `[1]` or `[1, 2]` in narrative synthesis.
+ - Never overwrite outputs from a previous literature search.
+ - Never write literature-review outputs directly to generic shared paths under `results/`.
+
+ ## Standard workflow
+
+ ### Step 1 — Clarify scope
+
+ Always clarify:
+ - exact claim, topic, target, or disease
+ - desired time range
+ - species restrictions
+ - study type filters
+ - whether the task is **general review** or **preclinical extraction mode**
+
+ ### Step 2 — Create a dedicated output folder
+
+ For every new literature review or literature research task, create a new dedicated folder under `results/literature_review/` before generating files.
+
+ Use the path `results/literature_review/<subject_of_study>/`, where `<subject_of_study>` is a short **snake_case title summary of the theme** of the literature search. Derive it from the scope clarified in Step 1: lower case, words separated by single underscores, no spaces, hyphens, or punctuation. For example, a review on **trained immunity in transplantation** becomes:
+
+ - `results/literature_review/trained_immunity_in_transplantation/`
+
+ Other examples:
+ - `results/literature_review/sirna_lung_transplant_new_treatments/`
+ - `results/literature_review/multiomics_ml_biomarkers_in_pgd/`
+
+ All generated files for that search session must be saved inside this dedicated subject folder, including:
+ - `literature_report.md`
+ - `paper_summary_table.csv`
+ - `search_log.md`
+ - any optional analysis/export artifacts such as `analysis_object.pkl`
+
+ Never write outputs directly to the parent folder or to the `results/` root, for example:
+ - `results/literature_review/literature_report.md`
+ - `results/literature_review/paper_summary_table.csv`
+ - `results/literature_review/analysis_object.pkl`
+ - `results/literature_report.md`
+
+ If a folder for a previous search on the same subject already exists, create a new folder with a distinct descriptive `<subject_of_study>` title rather than using versioned filenames.
+
+ At the end of the task, clearly report the exact output folder and generated file paths to the user.
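
The folder-naming rule in Step 2 can be sketched in Python as follows. This is a minimal illustration and not code shipped in the package: `subject_folder` and `create_subject_folder` are hypothetical helper names, and the sketch assumes an ASCII theme (non-ASCII letters would be collapsed into underscores).

```python
import re
from pathlib import Path


def subject_folder(theme: str, root: str = "results/literature_review") -> Path:
    """Map a free-text theme to results/literature_review/<subject_of_study> (hypothetical helper)."""
    # Lower-case, then collapse every run of non-alphanumeric characters
    # (spaces, hyphens, punctuation) into a single underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", theme.lower()).strip("_")
    if not slug:
        raise ValueError("theme must contain at least one letter or digit")
    return Path(root) / slug


def create_subject_folder(theme: str, root: str = "results/literature_review") -> Path:
    """Create the dedicated folder, refusing to reuse one from a previous search."""
    folder = subject_folder(theme, root)
    # exist_ok=False raises FileExistsError instead of silently reusing a folder.
    folder.mkdir(parents=True, exist_ok=False)
    return folder


print(subject_folder("Trained immunity in transplantation").as_posix())
```

Raising on an existing folder mirrors the rule that a previous subject folder is never reused; the caller would then pick a more specific theme.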
+
+ ### Step 3 — Search
+
+ Use the custom literature tool as the primary search path:
+ - **Primary:** `literature_search`
+
+ When calling `literature_search`:
+ - Always construct `pubmed_query` using PubMed-specific syntax from the references below.
+ - Use MeSH terms (`[mh]` / `[majr]`), title/abstract terms (`[tiab]`), publication types (`[pt]`), substance names (`[nm]`), date filters, and Boolean logic as appropriate.
+ - Do not pass a generic natural-language query as `pubmed_query` when a PubMed/MeSH query can be constructed.
+
+ These extension tools are the preferred search path for this skill. Do not fall back to generic `WebFetch` / `WebSearch` first when one of these typed tools fits the task.
+
+ Read these references before constructing queries:
+ - `references/pubmed_routine.md`
+ - `references/pubmed_search_syntax.md`
+ - `references/pubmed_common_queries.md`
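
To illustrate what a PubMed-ready `pubmed_query` string looks like compared with a natural-language question, here is a small Python sketch that assembles one from tagged clauses. The helpers `tagged`, `any_of`, and `all_of` are hypothetical and not part of the extension; only the field tags (`[mh]`, `[tiab]`, `[pt]`, `[dp]`) and Boolean operators are PubMed syntax, and the topic is borrowed from the Step 2 example.

```python
def tagged(term: str, tag: str) -> str:
    """Quote a term and attach a PubMed field tag, e.g. "Lung Transplantation"[mh]."""
    return f'"{term}"[{tag}]'


def any_of(clauses: list[str]) -> str:
    """Join alternative clauses with OR, parenthesised so AND binds around them."""
    return "(" + " OR ".join(clauses) + ")"


def all_of(clauses: list[str]) -> str:
    """Join required clauses with AND."""
    return " AND ".join(clauses)


pubmed_query = all_of([
    any_of([tagged("Lung Transplantation", "mh"), tagged("lung transplant", "tiab")]),
    tagged("trained immunity", "tiab"),
    tagged("Review", "pt"),
    "2015:2025[dp]",  # publication-date range filter
])
print(pubmed_query)
```

The printed string is `("Lung Transplantation"[mh] OR "lung transplant"[tiab]) AND "trained immunity"[tiab] AND "Review"[pt] AND 2015:2025[dp]`, which is the kind of value the skill expects in `pubmed_query`.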
+
+ ### Step 4 — Screen and prioritise
+
+ - Deduplicate PubMed results.
+ - Prioritise by relevance, recency, and study type.
+ - Default to deep reading of the top 20 papers unless the user asks otherwise.
+ - For preclinical requests, keep studies with experimental target perturbation evidence.
+
+ ### Step 5 — Synthesis
+
+ Always produce:
+ 1. a narrative synthesis with inline numbered citations
+ 2. a per-paper structured summary table
+
+ When in **preclinical extraction mode**, add:
+ - Experiment Type
+ - Model System
+ - Assay/Endpoint
+ - Finding Direction
+
+ Use:
+ - `scripts/synthesis.py`
+ - `scripts/generate_table.py`
+ - `scripts/export_all.py`
+
+ For preclinical extraction details, read:
+ - `references/preclinical-extraction-guide.md`
+ - `scripts/extract_experiments.py`
+
+ ## Evidence quality framework
+
+ Rank evidence broadly as:
+ - **High:** replicated clinical evidence, meta-analysis, systematic review, strong human studies
+ - **Moderate:** strong animal studies, coherent multi-model evidence, robust mechanistic studies
+ - **Low/Preliminary:** single-study results, purely computational inference, unreplicated in vitro work
+
+ ### What to mark as preliminary
+ - single-study findings
+ - animal-only findings for human claims
+ - in vitro findings without in vivo follow-up
+
+ ### What to refuse without qualification
+ - causal claims from correlational studies
+ - claims supported only by retracted work
+ - claims contradicting the weight of evidence
+
+ ## Output format
+
+ ### Narrative section
+
+ Use concise prose with inline citations.
+
+ ### Paper Summary Table
+
+ ```markdown
+ ## Paper Summary Table
+
+ | # | PMID/DOI | Authors (year) | Key Message | Key Results | Key Methods | Study Type | Evidence Quality |
+ |---|---|---|---|---|---|---|---|
+ ```
+
+ ### Extra columns for preclinical extraction mode
+
+ ```markdown
+ | Experiment Type | Model System | Assay/Endpoint | Finding Direction |
+ ```
+
+ ## Hypothesis synthesis
+
+ After reviewing the core paper set, optionally produce:
+ - explicit hypotheses stated by authors
+ - implicit mechanistic hypotheses inferred from evidence
+ - contradiction matrix across papers
+ - highest-confidence next-step hypotheses
+
+ ## Expected files
+
+ Typical outputs must be placed in a dedicated subject folder under `./results/literature_review/`, for example `./results/literature_review/<subject_of_study>/`:
+ - `literature_report.md`
+ - `paper_summary_table.csv`
+ - `search_log.md`
+ - optional `analysis_object.pkl` or other export artifacts when produced
+
+ Do not write these outputs directly to `./results/literature_review/` or to `./results/`, and do not reuse a previous subject folder.
+
+ ## Companion references
+
+ - `references/pubmed_api_reference.md`
+ - `references/pubmed_routine.md`
+ - `references/pubmed_search_syntax.md`
+ - `references/pubmed_common_queries.md`
+ - `references/preclinical-extraction-guide.md`
+
+ ## Companion scripts
+
+ - `scripts/extract_experiments.py`
+ - `scripts/synthesis.py`
+ - `scripts/generate_table.py`
+ - `scripts/export_all.py`
@@ -0,0 +1,215 @@
+ # Experiment Extraction Guide
+
+ **Workflow:** literature
+ **Purpose:** Keyword-based extraction approach, dictionaries, and limitations for classifying in vitro and in vivo experiments from paper abstracts.
+
+ ---
+
+ ## Overview
+
+ The `extract_all_experiments()` function parses paper abstracts to classify each paper into one of four categories:
+ - **in_vitro** — only cell-based experiments reported
+ - **in_vivo** — only animal model experiments reported
+ - **both** — both in vitro and in vivo experiments reported
+ - **unclassified** — insufficient information in abstract to classify
+
+ Extraction is keyword-based (not AI-based), operating on the abstract text. It is fast and reproducible but has known limitations (see below).
+
+ ---
+
+ ## Extraction Dictionaries
+
+ ### In Vitro Keywords
+
+ These terms in the abstract signal cell-based experiments:
+
+ ```python
+ IN_VITRO_KEYWORDS = [
+     # Cell lines
+     "cell line", "cell lines", "cancer cells", "tumor cells",
+     "HeLa", "MCF-7", "A549", "HCT116", "PC-3", "LNCaP", "MDA-MB",
+     "U87", "U251", "Jurkat", "THP-1", "RAW264", "HEK293",
+
+     # Assay types
+     "in vitro", "cell viability", "cell proliferation", "MTT assay",
+     "CCK-8", "colony formation", "clonogenic", "wound healing",
+     "scratch assay", "transwell", "invasion assay", "migration assay",
+     "flow cytometry", "apoptosis", "cell cycle", "western blot",
+     "immunofluorescence", "ELISA", "qPCR", "RNA-seq", "ChIP",
+
+     # Culture conditions
+     "cultured", "culture", "monolayer", "3D culture", "organoid",
+     "spheroid", "primary cells", "iPSC", "stem cells"
+ ]
+ ```
+
+ ### In Vivo Keywords
+
+ These terms signal animal model experiments:
+
+ ```python
+ IN_VIVO_KEYWORDS = [
+     # Animal models
+     "in vivo", "mouse model", "murine", "mice", "rat model",
+     "xenograft", "syngeneic", "PDX", "patient-derived xenograft",
+     "transgenic", "knockout mouse", "knock-in", "GEM", "GEMM",
+     "orthotopic", "subcutaneous tumor", "allograft",
+
+     # Animal procedures
+     "tumor volume", "tumor growth", "body weight", "survival",
+     "tumor regression", "metastasis model", "lung metastasis",
+     "liver metastasis", "intracranial", "intraperitoneal injection",
+     "intravenous injection", "oral gavage", "dosing",
+
+     # Animal species
+     "C57BL/6", "BALB/c", "nude mice", "SCID", "NOD-SCID",
+     "NSG", "athymic", "immunodeficient", "Sprague-Dawley",
+     "Wistar rat", "zebrafish"
+ ]
+ ```
+
+ ---
+
+ ## Extraction Logic
+
+ ```python
+ def classify_paper(abstract: str) -> dict:
+     """
+     Classify a paper abstract into experiment categories.
+     Returns dict with: experiment_type, in_vitro_signals, in_vivo_signals,
+     cell_lines, animal_models, assays, endpoints, findings
+     """
+     abstract_lower = abstract.lower()
+
+     # Count keyword matches
+     in_vitro_hits = [kw for kw in IN_VITRO_KEYWORDS if kw.lower() in abstract_lower]
+     in_vivo_hits = [kw for kw in IN_VIVO_KEYWORDS if kw.lower() in abstract_lower]
+
+     # Classify
+     has_vitro = len(in_vitro_hits) >= 1
+     has_vivo = len(in_vivo_hits) >= 1
+
+     if has_vitro and has_vivo:
+         experiment_type = "both"
+     elif has_vitro:
+         experiment_type = "in_vitro"
+     elif has_vivo:
+         experiment_type = "in_vivo"
+     else:
+         experiment_type = "unclassified"
+
+     return {
+         "experiment_type": experiment_type,
+         "in_vitro_signals": in_vitro_hits,
+         "in_vivo_signals": in_vivo_hits,
+         "cell_lines": extract_cell_lines(abstract),
+         "animal_models": extract_animal_models(abstract),
+         "assays": extract_assays(abstract),
+         "endpoints": extract_endpoints(abstract),
+         "findings": extract_findings_sentence(abstract)
+     }
+ ```
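
One consequence of the lower-cased substring test in `classify_paper` is that short keywords such as `"GEM"` or `"ChIP"` also match inside longer words ("gemcitabine", "microchip"). The sketch below shows a whole-word variant for comparison. It is an illustrative alternative, not the package's implementation: `keyword_hits` and `experiment_type` are hypothetical names, and the keyword lists are small subsets of the guide's dictionaries.

```python
import re

# Small illustrative subsets of the guide's dictionaries, not the full lists.
IN_VITRO_KEYWORDS = ["cell line", "in vitro", "ChIP", "western blot"]
IN_VIVO_KEYWORDS = ["in vivo", "xenograft", "GEM", "mice"]


def keyword_hits(text: str, keywords: list[str]) -> list[str]:
    """Return the keywords that occur in text as whole words, case-insensitively."""
    hits = []
    for kw in keywords:
        # Lookarounds behave like \b but also work for keywords that start or
        # end with a non-word character.
        pattern = r"(?<!\w)" + re.escape(kw) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append(kw)
    return hits


def experiment_type(abstract: str) -> str:
    """Same four-way decision as the guide, using whole-word matches."""
    has_vitro = bool(keyword_hits(abstract, IN_VITRO_KEYWORDS))
    has_vivo = bool(keyword_hits(abstract, IN_VIVO_KEYWORDS))
    if has_vitro and has_vivo:
        return "both"
    if has_vitro:
        return "in_vitro"
    if has_vivo:
        return "in_vivo"
    return "unclassified"


abstract = "Gemcitabine reduced viability in a pancreatic cell line."
print("gem" in abstract.lower())   # True: the substring test fires inside "Gemcitabine"
print(experiment_type(abstract))   # in_vitro: whole-word matching ignores it
```

The trade-off is that strict word boundaries miss inflected forms such as "xenografts" unless they are added to the dictionary, so this variant exchanges some false positives for some false negatives.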
+
+ ---
+
+ ## Structured Field Extraction
+
+ Beyond classification, the script extracts specific structured fields:
+
+ ### Cell Lines
+ Extracted using a curated regex pattern matching common cancer cell line names:
+ ```python
+ CELL_LINE_PATTERN = r'\b(MCF-7|MDA-MB-\d+|A549|HCT116|PC-3|LNCaP|HeLa|U87|U251|Jurkat|THP-1|HEK293[T]?|Caco-2|SW480|HT-29|PANC-1|MiaPaCa|BxPC-3|DU145|22Rv1)\b'
+ ```
+
+ ### Animal Models
+ ```python
+ ANIMAL_MODEL_PATTERN = r'\b(xenograft|syngeneic|PDX|patient-derived xenograft|transgenic|knockout|GEMM|orthotopic|allograft|C57BL/6|BALB/c|nude mice|NSG|SCID)\b'
+ ```
+
+ ### Assay Types
+ ```python
+ ASSAY_PATTERN = r'\b(MTT|CCK-8|colony formation|clonogenic|wound healing|transwell|flow cytometry|western blot|ELISA|qPCR|RNA-seq|ChIP-seq|ATAC-seq|immunofluorescence|IHC|co-immunoprecipitation)\b'
+ ```
+
+ ### Endpoints
+ ```python
+ ENDPOINT_PATTERN = r'\b(tumor volume|tumor growth|survival|body weight|metastasis|apoptosis|cell viability|proliferation|migration|invasion|angiogenesis|tumor regression)\b'
+ ```
+
+ ### Key Findings Sentence
+ The script extracts the last 1-2 sentences of the abstract as a proxy for the main finding:
+ ```python
+ def extract_findings_sentence(abstract: str) -> str:
+     sentences = abstract.split(". ")
+     return ". ".join(sentences[-2:]).strip()
+ ```
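
The guide lists the regex patterns but not the helpers that apply them, so the following is only a sketch of how such patterns could be turned into the comma-separated fields described in the output schema. `extract_terms` is a hypothetical helper, and the choice to match endpoints case-insensitively is an assumption; the two patterns are copied from the guide.

```python
import re

# Patterns copied from the guide above.
CELL_LINE_PATTERN = r'\b(MCF-7|MDA-MB-\d+|A549|HCT116|PC-3|LNCaP|HeLa|U87|U251|Jurkat|THP-1|HEK293[T]?|Caco-2|SW480|HT-29|PANC-1|MiaPaCa|BxPC-3|DU145|22Rv1)\b'
ENDPOINT_PATTERN = r'\b(tumor volume|tumor growth|survival|body weight|metastasis|apoptosis|cell viability|proliferation|migration|invasion|angiogenesis|tumor regression)\b'


def extract_terms(abstract: str, pattern: str, ignore_case: bool = False) -> str:
    """Return the unique matches of pattern, in order of first appearance, comma-separated."""
    flags = re.IGNORECASE if ignore_case else 0
    seen: list[str] = []
    for match in re.finditer(pattern, abstract, flags):
        # Normalise case-insensitive matches so "Survival" and "survival" deduplicate.
        term = match.group(1).lower() if ignore_case else match.group(1)
        if term not in seen:
            seen.append(term)
    return ", ".join(seen)


abstract = (
    "MCF-7 and MDA-MB-231 cells showed reduced proliferation; "
    "MCF-7 xenografts had lower tumor volume."
)
print(extract_terms(abstract, CELL_LINE_PATTERN))
print(extract_terms(abstract, ENDPOINT_PATTERN, ignore_case=True))
```

Cell line names are matched case-sensitively here because their capitalisation is part of the name, while endpoint phrases are ordinary English and may start a sentence.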
+
+ ---
+
+ ## Output Schema
+
+ The `experiment_extraction.csv` file contains one row per paper with these columns:
+
+ | Column | Type | Description |
+ |--------|------|-------------|
+ | `pmid` | str | PubMed ID |
+ | `doi` | str | DOI |
+ | `title` | str | Paper title |
+ | `year` | int | Publication year |
+ | `experiment_type` | str | in_vitro / in_vivo / both / unclassified |
+ | `cell_lines` | str | Comma-separated cell line names found |
+ | `animal_models` | str | Comma-separated animal model terms found |
+ | `assays` | str | Comma-separated assay types found |
+ | `endpoints` | str | Comma-separated endpoints found |
+ | `in_vitro_signal_count` | int | Number of in vitro keyword matches |
+ | `in_vivo_signal_count` | int | Number of in vivo keyword matches |
+ | `key_findings` | str | Last 1-2 sentences of abstract |
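
As a rough sketch of how a `classify_paper` result could be flattened into this schema, the Python below builds one row and writes it with the standard `csv` module. `to_row` and `write_rows` are hypothetical helpers, the record values are placeholders rather than a real paper, and the sketch assumes the structured fields arrive already comma-joined as strings.

```python
import csv
import io

# Column order taken from the schema table above.
COLUMNS = [
    "pmid", "doi", "title", "year", "experiment_type", "cell_lines",
    "animal_models", "assays", "endpoints", "in_vitro_signal_count",
    "in_vivo_signal_count", "key_findings",
]


def to_row(paper: dict, classified: dict) -> dict:
    """Merge paper metadata with a classification result into one CSV row."""
    return {
        "pmid": paper["pmid"],
        "doi": paper["doi"],
        "title": paper["title"],
        "year": paper["year"],
        "experiment_type": classified["experiment_type"],
        "cell_lines": classified["cell_lines"],
        "animal_models": classified["animal_models"],
        "assays": classified["assays"],
        "endpoints": classified["endpoints"],
        # The schema stores counts, while the classifier returns the matched keywords.
        "in_vitro_signal_count": len(classified["in_vitro_signals"]),
        "in_vivo_signal_count": len(classified["in_vivo_signals"]),
        "key_findings": classified["findings"],
    }


def write_rows(rows: list[dict], handle) -> None:
    """Write a header plus one line per paper to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


# Placeholder record for illustration only; not a real citation.
paper = {"pmid": "00000000", "doi": "10.0000/placeholder", "title": "Placeholder title", "year": 2024}
classified = {
    "experiment_type": "both",
    "in_vitro_signals": ["cell line", "western blot"],
    "in_vivo_signals": ["xenograft"],
    "cell_lines": "MCF-7, A549",
    "animal_models": "xenograft",
    "assays": "western blot",
    "endpoints": "tumor volume",
    "findings": "Placeholder finding sentence.",
}

buffer = io.StringIO()
write_rows([to_row(paper, classified)], buffer)
print(buffer.getvalue())
```

`csv.DictWriter` quotes fields that contain commas, so comma-separated values such as `MCF-7, A549` stay in a single column.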
+
+ ---
+
+ ## Known Limitations
+
+ ### 1. Abstract-only extraction
+ The script reads only abstracts, not full text. Papers that describe experiments only in the methods/results sections (not the abstract) will be labeled "unclassified".
+
+ **Mitigation:** None is currently available, because there is no full-text enrichment step. Treat "unclassified" as "not determinable from the abstract", not as evidence that no experiments were performed.
+
+ ### 2. Keyword sensitivity
+ - **False positives:** A paper mentioning "mouse model" in the introduction (not as an experiment performed) may be classified as in_vivo.
+ - **False negatives:** Novel cell lines or animal models not in the dictionary will be missed.
+
+ ### 3. No quantitative extraction
+ The script does not extract dosing, IC50 values, tumor volumes, or statistical significance — only qualitative classification.
+
+ ### 4. Language dependency
+ Only English-language abstracts are supported. Non-English papers will be classified as "unclassified".
+
+ ### 5. High "unclassified" rate for some targets
+ For niche targets or highly mechanistic papers (e.g., structural biology, computational studies), most papers may be classified as "unclassified". This is expected behavior — not a bug.
+
+ ---
+
+ ## Improving Extraction for Your Target
+
+ If you find too many "unclassified" papers, you can extend the keyword dictionaries:
+
+ ```python
+ # Add target-specific cell lines
+ CUSTOM_CELL_LINES = ["your_cell_line_1", "your_cell_line_2"]
+ IN_VITRO_KEYWORDS.extend(CUSTOM_CELL_LINES)
+
+ # Add target-specific animal models
+ CUSTOM_ANIMAL_MODELS = ["your_model_1", "your_model_2"]
+ IN_VIVO_KEYWORDS.extend(CUSTOM_ANIMAL_MODELS)
+ ```
+
+ Edit `scripts/extract_experiments.py` directly and re-run Step 2.
+
+ ---
+
+ ## References
+
+ - Keyword extraction approach inspired by: Lever J, et al. (2019) *PLOS Biol* — text mining for biomedical literature
+ - Cell line nomenclature: ATCC (https://www.atcc.org/)
+ - Animal model terminology: NCI Thesaurus (https://ncithesaurus.nci.nih.gov/)