@pencil-agent/nano-pencil 2.0.1 → 2.0.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (188)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/model/custom-providers.js +1 -1
  7. package/dist/core/model-registry.js +5 -5
  8. package/dist/extensions/builtin/AGENT.md +115 -115
  9. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  10. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  11. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  12. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  13. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  14. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  15. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  16. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  17. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  18. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  19. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  20. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  91. package/dist/extensions/builtin/browser/browser.md +73 -73
  92. package/dist/extensions/builtin/browser/install.md +142 -142
  93. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  94. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  95. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  96. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  97. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  98. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  99. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  100. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  101. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  102. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  103. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  104. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  105. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  107. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  108. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  109. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  110. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  111. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  112. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  113. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  114. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  115. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  116. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  117. package/dist/extensions/builtin/debug/index.js +9 -9
  118. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  119. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  120. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  121. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  122. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  123. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  124. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  125. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  126. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  127. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  128. package/dist/extensions/builtin/goal/README.md +67 -67
  129. package/dist/extensions/builtin/goal/index.js +6 -6
  130. package/dist/extensions/builtin/grub/README.md +112 -112
  131. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  132. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  133. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  134. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  135. package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
  136. package/dist/extensions/builtin/loop/README.md +92 -92
  137. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  138. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  139. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  140. package/dist/extensions/builtin/sal/README.md +72 -72
  141. package/dist/extensions/builtin/security-audit/README.md +289 -289
  142. package/dist/extensions/builtin/team/AGENT.md +112 -112
  143. package/dist/extensions/builtin/team/TESTING.md +299 -299
  144. package/dist/extensions/builtin/token-save/README.md +56 -56
  145. package/dist/extensions/optional/AGENT.md +10 -10
  146. package/dist/modes/interactive/controllers/input-submit-controller.js +2 -2
  147. package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
  148. package/dist/modes/interactive/interactive-mode.js +19 -19
  149. package/dist/modes/interactive/theme/dark.json +85 -85
  150. package/dist/modes/interactive/theme/light.json +84 -84
  151. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  152. package/dist/modes/interactive/theme/warm.json +81 -81
  153. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  154. package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
  155. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
  156. package/docs/SDK-TESTING.md +364 -0
  157. package/docs/codex-goal-command-impl.md +1055 -1055
  158. package/docs/codex-goal-vs-grub.md +500 -500
  159. package/docs/custom-provider.md +27 -27
  160. package/docs/extensions.md +27 -27
  161. package/docs/keybindings.md +27 -27
  162. package/docs/loop 重构完成总结.md +250 -250
  163. package/docs/loop 重构完成报告.md +122 -122
  164. package/docs/loop 重构方案.md +1222 -1222
  165. package/docs/loop 重构方案实现报告.md +158 -158
  166. package/docs/loop 重构方案对比分析.md +128 -128
  167. package/docs/loop 重构计划.md +320 -320
  168. package/docs/loop-usage-examples.md +214 -214
  169. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
  170. package/docs/models.md +27 -27
  171. package/docs/packages.md +27 -27
  172. package/docs/pi-design-philosophy.md +457 -457
  173. package/docs/planmode.md +1987 -1987
  174. package/docs/prompt-templates.md +27 -27
  175. package/docs/providers.md +27 -27
  176. package/docs/sdk.md +27 -27
  177. package/docs/skills.md +27 -27
  178. package/docs/startup-performance-optimization.md +301 -0
  179. package/docs/themes.md +27 -27
  180. package/docs/tui.md +27 -27
  181. package/docs/认知地图.md +47 -0
  182. package/package.json +190 -190
  183. package/docs/cc-agent-design.md +0 -1297
  184. package/docs/cc-tui-design.md +0 -1333
  185. package/docs/nanoPencil-学习计划.md +0 -170
  186. package/docs/scan-report.md +0 -3820
  187. package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
  188. package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +0 -261
@@ -1,421 +1,421 @@
1
- # PubMed / NCBI — Scraping & Data Extraction
2
-
3
- `https://pubmed.ncbi.nlm.nih.gov` — 37 M+ biomedical citations. **Never use the browser for PubMed.** All data is reachable via `http_get` using the NCBI E-utilities REST API. No API key required; a free key raises the rate limit from 3 to 10 req/s.
4
-
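The 3 and 10 req/s limits quoted above can be respected with a small client-side throttle. The sketch below is one possible approach, not something shipped in `helpers`: the `Throttle` class and its `clock`/`sleep` parameters are hypothetical names introduced here for illustration.

```python
import time


class Throttle:
    """Space calls at least 1/per_second seconds apart (hypothetical helper)."""

    def __init__(self, per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / per_second
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# 3 req/s without an API key, 10 req/s with one (limits stated in the text above)
throttle = Throttle(per_second=3)
```

Calling `throttle.wait()` immediately before each `http_get` keeps a loop of requests under the stated limit.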
5
- ## Do this first
6
-
7
- **ESearch → ESummary is the fastest pipeline for most tasks — two calls, JSON responses, no XML parsing.**
8
-
9
- ```python
10
- import json
11
- from helpers import http_get
12
-
13
- # Step 1: search → get PMIDs
14
- search = json.loads(http_get(
15
- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
16
- "?db=pubmed&term=deep+learning+radiology&retmax=10&retmode=json"
17
- ))
18
- pmids = search['esearchresult']['idlist'] # e.g. ['41999029', '41998456', ...]
19
- count = search['esearchresult']['count'] # total hits across all pages
20
-
21
- # Step 2: fetch lightweight metadata for all PMIDs in one call
22
- summary = json.loads(http_get(
23
- f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
24
- f"?db=pubmed&id={','.join(pmids)}&retmode=json"
25
- ))
26
- result = summary['result']
27
- for uid in result['uids']:
28
-     art = result[uid]
29
-     print(uid, art['pubdate'], art['source'])
30
-     print(" ", art['title'][:80])
31
-     print(" authors:", [a['name'] for a in art['authors'][:3]])
32
- # Confirmed output (2026-04-18):
33
- # 41999029 2026 Apr 18 Med Sci Monit
34
- # Use of Deep Learning Models in the Diagnosis of Proptosis Through Orbi
35
- # authors: ['Kesimal U', 'Akkaya HE', 'Polat Ö']
36
- # 41998456 2026 Apr 17 Sci Rep
37
- # ...
38
- ```
39
-
40
- Use **EFetch XML** when you need: full abstract text, MeSH terms, complete author names (not just "Last I"), structured abstract labels, or the DOI from within the article record.
41
-
42
- ## Common workflows
43
-
44
- ### Search PubMed (ESearch)
45
-
46
- ```python
47
- import json
48
- from helpers import http_get
49
-
50
- data = json.loads(http_get(
51
- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
52
- "?db=pubmed"
53
- "&term=large+language+models+clinical"
54
- "&retmax=5"
55
- "&retmode=json"
56
- "&sort=pub+date" # newest first; default is relevance
57
- "&datetype=pdat" # filter by publication date
58
- "&mindate=2024/01/01&maxdate=2024/12/31" # YYYY/MM/DD format
59
- ))
60
- result = data['esearchresult']
61
- print("Total hits:", result['count']) # '24160' — note: string, not int
62
- print("PMIDs:", result['idlist'])
63
- print("Query translation:", result['querytranslation'])
64
- # Confirmed output (2026-04-18):
65
- # Total hits: 24160
66
- # PMIDs: ['41996895', '41996722', '41996006', '41995888', '41995759']
67
- # Query translation: "large language models"[MeSH Terms] OR ...
68
- ```
69
-
70
- #### ESearch field tags (append to term)
71
-
72
- ```
73
- machine learning[MeSH Terms]       MeSH controlled vocabulary
74
- Hinton GE[Author]                  author last + initials
75
- attention is all you need[Title]   title words
76
- Nature[Journal]                    journal name
77
- 2024[pdat]                         publication year
78
- ```
79
-
80
- Boolean operators: `AND`, `OR`, `NOT`. Phrase search: `"exact phrase"[Title]`.
81
-
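Field tags, phrase quoting and Boolean operators can be combined programmatically before the term is URL-encoded. The following is a minimal sketch; `tagged` and `all_of` are hypothetical helper names, and only `urllib.parse.quote_plus` comes from the standard library.

```python
from urllib.parse import quote_plus


def tagged(value, field, phrase=False):
    """Return value[field]; wrap the value in quotes for an exact-phrase match."""
    text = f'"{value}"' if phrase else value
    return f"{text}[{field}]"


def all_of(*clauses):
    """Join clauses with AND, parenthesising each so nested OR/NOT stay grouped."""
    return " AND ".join(f"({c})" for c in clauses)


term = all_of(
    tagged("deep learning", "Title", phrase=True),
    tagged("2024", "pdat"),
)
encoded = quote_plus(term)   # safe to append after "&term=" in the ESearch URL

print(term)      # ("deep learning"[Title]) AND (2024[pdat])
print(encoded)
```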
82
- #### Sort options (`sort=`)
83
-
84
- | Value | Effect |
85
- |---|---|
86
- | *(omit)* | Relevance (default) |
87
- | `pub+date` | Most recent publication first |
88
- | `Author` | First author alphabetical |
89
- | `JournalName` | Journal alphabetical |
90
-
91
- ### Lightweight metadata — ESummary (JSON, no XML)
92
-
93
- ```python
94
- import json
95
- from helpers import http_get
96
-
97
- data = json.loads(http_get(
98
- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
99
- "?db=pubmed&id=41999029,41998456,41997837&retmode=json"
100
- ))
101
- result = data['result']
102
- for uid in result['uids']:
103
-     art = result[uid]
104
-     # Key fields available:
105
-     title = art['title'] # full title string
106
-     source = art['source'] # abbreviated journal name
107
-     fulljournalname = art['fulljournalname']
108
-     pubdate = art['pubdate'] # e.g. '2026 Apr 18'
109
-     epubdate = art['epubdate'] # e-pub ahead of print date (may be empty)
110
-     authors = art['authors'] # list of {'name': 'Last I', 'authtype': ...}
111
-     volume = art['volume']
112
-     issue = art['issue']
113
-     pages = art['pages']
114
-     pubtype = art['pubtype'] # list: ['Journal Article', 'Review', ...]
115
-     # Extract DOI from elocationid or articleids:
116
-     doi_field = art['elocationid'] # e.g. 'doi: 10.12659/MSM.951157'
117
-     article_ids = {x['idtype']: x['value'] for x in art['articleids']}
118
-     doi = article_ids.get('doi')
119
-     pmc_id = article_ids.get('pmc') # PMC ID if open access
120
-     print(uid, pubdate, source)
121
-     print(" ", title[:70])
122
-     print(" doi:", doi, "| pmc:", pmc_id)
123
- # Confirmed output (2026-04-18):
124
- # 41999029 2026 Apr 18 Med Sci Monit
125
- # Use of Deep Learning Models in the Diagnosis of Proptosis Through Orbi
126
- # doi: 10.12659/MSM.951157 | pmc: None
127
- ```
128
-
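The ESummary example reads the DOI from both `articleids` and `elocationid`. A small helper can try the structured list first and fall back to the free-text field. This is a sketch under the assumption that `elocationid` carries a `doi:` prefix, as in the sample value shown above; `extract_doi` is a hypothetical name and `sample` is a trimmed, hand-written record rather than a real API response.

```python
def extract_doi(art):
    """Return the DOI from an ESummary record dict, or None if it has none."""
    # Preferred source: the structured articleids list
    for item in art.get("articleids", []):
        if item.get("idtype") == "doi" and item.get("value"):
            return item["value"]
    # Fallback: elocationid, which looks like 'doi: 10.12659/MSM.951157'
    eloc = art.get("elocationid", "")
    if eloc.lower().startswith("doi:"):
        return eloc[4:].strip()
    return None


# Hand-written record with only the fields this helper reads
sample = {
    "elocationid": "doi: 10.12659/MSM.951157",
    "articleids": [{"idtype": "pubmed", "value": "41999029"}],
}
print(extract_doi(sample))
```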
129
- ### Full article metadata — EFetch XML
130
-
131
- Use this for full abstracts, complete author names, MeSH terms, structured abstract sections.
132
-
133
- ```python
134
- import xml.etree.ElementTree as ET
135
- from helpers import http_get
136
-
137
- raw = http_get(
138
- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
139
- "?db=pubmed&id=41999029,36328784&retmode=xml&rettype=abstract"
140
- )
141
- root = ET.fromstring(raw)
142
-
143
- for art in root.findall('.//PubmedArticle'):
144
-     mc = art.find('MedlineCitation')
145
-     pmid = mc.find('PMID').text
146
-     article = mc.find('Article')
147
-
148
-     # Title: use itertext() to handle embedded tags like <i>, <sub>
149
-     title = ''.join(article.find('ArticleTitle').itertext()).strip()
150
-
151
-     # Abstract: plain or structured (BACKGROUND / METHODS / RESULTS / CONCLUSION)
152
-     abstract_el = article.find('Abstract')
153
-     if abstract_el is not None:
154
-         sections = []
155
-         for t in abstract_el.findall('AbstractText'):
156
-             label = t.get('Label', '') # e.g. 'BACKGROUND', 'METHODS'
157
-             text = ''.join(t.itertext()).strip()
158
-             sections.append(f"[{label}] {text}" if label else text)
159
-         abstract = ' '.join(sections)
160
-     else:
161
-         abstract = '' # ~15% of articles have no abstract
162
-
163
-     # Journal + year
164
-     journal = article.find('Journal')
165
-     j_title = journal.find('Title').text if journal is not None else ''
166
-     pub_date = journal.find('.//PubDate') if journal is not None else None
167
-     year = '' # default so the print below never hits an undefined name
168
-     if pub_date is not None:
169
-         year_el = pub_date.find('Year')
170
-         medline_el = pub_date.find('MedlineDate') # fallback for old/seasonal dates
171
-         year = (year_el.text if year_el is not None
172
-                 else medline_el.text[:4] if medline_el is not None else '')
173
-
174
-     # DOI
175
-     doi_el = next(
176
-         (e for e in article.findall('ELocationID') if e.get('EIdType') == 'doi'),
177
-         None
178
-     )
179
-     doi = doi_el.text if doi_el is not None else ''
180
-
181
-     # Authors: handle CollectiveName (consortium/group authors)
182
-     author_list = article.find('AuthorList')
183
-     authors = []
184
-     if author_list is not None:
185
-         for a in author_list.findall('Author'):
186
-             collective = a.find('CollectiveName')
187
-             last = a.find('LastName')
188
-             fore = a.find('ForeName')
189
-             initials = a.find('Initials')
190
-             if collective is not None:
191
-                 authors.append(collective.text)
192
-             elif last is not None:
193
-                 full = last.text
194
-                 if fore is not None:
195
-                     full += f", {fore.text}"
196
-                 authors.append(full)
197
-
198
-     # MeSH controlled vocabulary terms
199
-     mesh_list = mc.find('MeshHeadingList')
200
-     mesh_terms = []
201
-     if mesh_list is not None:
202
-         mesh_terms = [
203
-             mh.find('DescriptorName').text
204
-             for mh in mesh_list.findall('MeshHeading')
205
-             if mh.find('DescriptorName') is not None
206
-         ]
207
-
208
-     print(f"PMID={pmid} ({year}) {j_title}")
209
-     print(f" Title: {title[:70]}")
210
-     print(f" Authors: {authors[:3]}")
211
-     print(f" DOI: {doi}")
212
-     print(f" MeSH: {mesh_terms[:4]}")
213
-     print(f" Abstract: {abstract[:120]}")
214
- # Confirmed output (2026-04-18):
215
- # PMID=41999029 (2026) Medical science monitor : international medical...
216
- # Title: Use of Deep Learning Models in the Diagnosis of Proptosis Thro
217
- # Authors: ['Kesimal, Uğur', 'Akkaya, Habip Eser', 'Polat, Önder']
218
- # DOI: 10.12659/MSM.951157
219
- # MeSH: ['Humans', 'Deep Learning', 'Exophthalmos', 'Magnetic Resonance Imaging']
220
- # Abstract: BACKGROUND Proptosis is a common manifestation of orbital disease...
221
- # PMID=36328784 (...)
222
- # Abstract: [OBJECTIVES] Physical inactivity and sedentary behaviour... ← structured
223
- ```
224
-
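Two parsing details in the EFetch example are easy to check offline: mixed-content titles and group authors. The sketch below runs against a hand-written XML fragment (the `SAMPLE` record, its title and the author names are invented for illustration, not fetched from NCBI) and compares `el.text` with `itertext()`.

```python
import xml.etree.ElementTree as ET

# Invented fragment that mimics the element names used in the EFetch example
SAMPLE = """
<PubmedArticle><MedlineCitation><PMID>1</PMID><Article>
  <ArticleTitle>Biofilms of <i>Staphylococcus aureus</i> in CO<sub>2</sub>-rich media.</ArticleTitle>
  <AuthorList>
    <Author><CollectiveName>Example Study Group</CollectiveName></Author>
    <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
  </AuthorList>
</Article></MedlineCitation></PubmedArticle>
"""


def author_name(author):
    """Return a display name for one <Author>, checking CollectiveName first."""
    collective = author.find("CollectiveName")
    if collective is not None:
        return collective.text
    last = author.find("LastName")
    if last is None:
        return None
    fore = author.find("ForeName")
    return f"{last.text}, {fore.text}" if fore is not None else last.text


article = ET.fromstring(SAMPLE.strip()).find("MedlineCitation/Article")
title_el = article.find("ArticleTitle")

naive_title = title_el.text                           # stops at the first child tag
full_title = "".join(title_el.itertext()).strip()     # walks text inside and after child tags
authors = [author_name(a) for a in article.findall("AuthorList/Author")]

print(repr(naive_title))   # 'Biofilms of '
print(full_title)          # Biofilms of Staphylococcus aureus in CO2-rich media.
print(authors)             # ['Example Study Group', 'Doe, Jane']
```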
225
- ### Large result sets — usehistory + WebEnv
226
-
227
- When `count` exceeds `retmax` (max 10 000), use server-side history to paginate EFetch without re-running ESearch on every page.
228
-
229
- ```python
230
- import json, xml.etree.ElementTree as ET
231
- from helpers import http_get
232
-
233
- # Step 1: ESearch with usehistory=y — NCBI holds result set on server
234
- search = json.loads(http_get(
235
- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
236
- "?db=pubmed&term=CRISPR+gene+editing&retmax=0&retmode=json&usehistory=y"
237
- ))
238
- webenv = search['esearchresult']['webenv'] # server-side session token
239
- query_key = search['esearchresult']['querykey'] # result set ID within session
240
- total = int(search['esearchresult']['count'])
241
- print(f"Total: {total}, WebEnv: {webenv[:30]}..., query_key: {query_key}")
242
- # Confirmed output (2026-04-18):
243
- # Total: 24160, WebEnv: MCID_69e4203757db89391008d6f1..., query_key: 1
244
-
245
- # Step 2: EFetch pages using WebEnv (no re-searching)
246
- batch_size = 200
247
- for start in range(0, min(total, 1000), batch_size): # cap at 1000 for demo
248
-     raw = http_get(
249
-         f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
250
-         f"?db=pubmed&query_key={query_key}&WebEnv={webenv}"
251
-         f"&retstart={start}&retmax={batch_size}&retmode=xml&rettype=abstract"
252
-     )
253
-     root = ET.fromstring(raw)
254
-     articles = root.findall('.//PubmedArticle')
255
-     print(f" Fetched {len(articles)} articles (start={start})")
256
-     # process articles here...
257
- ```
258
-
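The paging arithmetic in the loop above can be separated from the network calls, which makes it easy to verify. This is a minimal sketch: `page_offsets` and `efetch_page_url` are hypothetical helper names, and the `MCID_example` token is a placeholder rather than a real WebEnv value.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def page_offsets(count, batch_size, limit=None):
    """Return the retstart values needed to cover `count` records."""
    total = int(count)                 # ESearch returns count as a string
    if limit is not None:
        total = min(total, limit)      # optional cap, like the 1000-record demo cap
    return list(range(0, total, batch_size))


def efetch_page_url(webenv, query_key, start, batch_size):
    """Build one EFetch page URL from a stored history session."""
    params = {
        "db": "pubmed",
        "query_key": query_key,
        "WebEnv": webenv,
        "retstart": start,
        "retmax": batch_size,
        "retmode": "xml",
        "rettype": "abstract",
    }
    return f"{EFETCH}?{urlencode(params)}"


offsets = page_offsets("24160", 200, limit=1000)
print(offsets)   # [0, 200, 400, 600, 800]
print(efetch_page_url("MCID_example", "1", offsets[1], 200))
```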
259
- ### EInfo — list available NCBI databases
260
-
261
- ```python
262
- import json
263
- from helpers import http_get
264
-
265
- data = json.loads(http_get(
266
- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?retmode=json"
267
- ))
268
- dbs = data['einforesult']['dblist']
269
- print(f"Total databases: {len(dbs)}") # Confirmed: 39 (2026-04-18)
270
- print(dbs[:10])
271
- # ['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure',
272
- # 'genome', 'annotinfo', 'assembly', 'bioproject']
273
- ```
274
-
275
- Get PubMed-specific metadata (field list, link list):
276
-
277
- ```python
278
- import json
279
- from helpers import http_get
280
-
281
- data = json.loads(http_get(
282
- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pubmed&retmode=json"
283
- ))
284
- db_info = data['einforesult']['dbinfo'][0]
285
- print("DB name:", db_info['dbname'])
286
- print("Record count:", db_info['count']) # total PubMed records
287
- link_names = [l['name'] for l in db_info.get('linklist', [])]
288
- print(f"Link types ({len(link_names)}):", link_names[:5])
289
- # Confirmed (2026-04-18):
290
- # DB name: pubmed
291
- # Record count: 37620453
292
- # Link types (48): ['pubmed_assembly', 'pubmed_bioproject', ...]
293
- ```
294
-
295
- ### ELink — cross-database linking
296
-
297
- ELink connects a PubMed record to associated data in other NCBI databases. The `pubmed_pubmed` "related articles" linkname relies on a similarity server that is intermittently unavailable (returns `"Couldn't resolve #exLinkSrv2, the address table is empty."`). Use the non-similarity links below instead.
298
-
299
- ```python
300
- import json
301
- from helpers import http_get
302
-
303
- # Link a PMID to its free full-text in PMC (if open access)
304
- # linkname=pubmed_pmc — may also hit the server outage; check error field
305
- data = json.loads(http_get(
306
- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"
307
- "?dbfrom=pubmed&id=38325330&linkname=pubmed_pmc&retmode=json"
308
- ))
309
- error = data.get('ERROR', '')
310
- if error:
311
-     print("ELink error:", error) # 'Couldn't resolve #exLinkSrv2...' is an NCBI server issue
312
- else:
313
-     for ls in data.get('linksets', []):
314
-         for lsdb in ls.get('linksetdbs', []):
315
-             print(lsdb['linkname'], "→", lsdb['links'][:5])
316
- ```
317
-
318
- Available ELink linknames from pubmed (48 total):
319
-
320
- | linkname | Target |
321
- |---|---|
322
- | `pubmed_pmc` | Free full text in PMC |
323
- | `pubmed_pubmed_citedin` | Articles citing this paper |
324
- | `pubmed_pubmed_refs` | References cited by this paper |
325
- | `pubmed_gene` | Related Gene records |
326
- | `pubmed_clinvar` | Clinical variants associated with publication |
327
- | `pubmed_gds` | Related GEO datasets |
328
-
329
- **Practical alternative**: If ELink is down, extract DOI from EFetch/ESummary and use `https://doi.org/{doi}` directly for the full-text link.
330
-
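The ELink-then-DOI fallback described above could be sketched as follows. The payloads `ok` and `down` are hand-written stand-ins that copy only the keys used in the ELink example (`ERROR`, `linksets`, `linksetdbs`, `linkname`, `links`). `linked_ids` and `full_text_url` are hypothetical names, and the sketch assumes that `pubmed_pmc` links are numeric IDs that need a `PMC` prefix in the article URL.

```python
def linked_ids(data, linkname):
    """Collect link IDs for one linkname; empty list if ELink reported an error."""
    if data.get("ERROR"):
        return []
    ids = []
    for linkset in data.get("linksets", []):
        for lsdb in linkset.get("linksetdbs", []):
            if lsdb.get("linkname") == linkname:
                ids.extend(lsdb.get("links", []))
    return ids


def full_text_url(data, doi=None):
    """Prefer a PMC link from ELink; otherwise fall back to the DOI resolver."""
    pmc = linked_ids(data, "pubmed_pmc")
    if pmc:
        # Assumption: ELink returns the numeric part only, so add the PMC prefix
        return f"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC{pmc[0]}/"
    return f"https://doi.org/{doi}" if doi else None


# Hand-written payloads, not real responses
ok = {"linksets": [{"linksetdbs": [{"linkname": "pubmed_pmc", "links": ["1234567"]}]}]}
down = {"ERROR": "Couldn't resolve #exLinkSrv2, the address table is empty."}

print(full_text_url(ok))
print(full_text_url(down, doi="10.12659/MSM.951157"))
```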
331
- ## URL and parameter reference
332
-
333
- ### E-utilities base URLs
334
-
335
- ```
336
- https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi # search → PMIDs
337
- https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi # PMIDs → JSON summary
338
- https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi # PMIDs → full XML
339
- https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi # cross-db links
340
- https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi # DB metadata
341
- ```
342
-
343
- ### ESearch parameters
344
-
345
- | Parameter | Values | Notes |
346
- |---|---|---|
347
- | `db` | `pubmed` | Always `pubmed` for PubMed |
348
- | `term` | query string | Supports field tags like `[Author]`, `[Title]`, `[MeSH Terms]` |
349
- | `retmax` | integer, max 10000 | Results returned per call |
350
- | `retmode` | `json` | JSON output |
351
- | `sort` | `pub+date`, `Author`, `JournalName` | Default is relevance |
352
- | `datetype` | `pdat` (pub), `edat` (entrez), `mdat` (modified) | |
353
- | `mindate`, `maxdate` | `YYYY/MM/DD` or `YYYY` | Requires `datetype` |
354
- | `usehistory` | `y` | Store results on server; returns `webenv` + `querykey` |
355
-
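The parameter rules in the table (the `retmax` ceiling, and `mindate`/`maxdate` requiring `datetype`) can be enforced before a request is sent. The builder below is a sketch only: `esearch_url` is a hypothetical helper, and it relies on `urllib.parse.urlencode` to produce the `+` and percent escapes that the earlier examples write by hand.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, retmax=20, sort=None, datetype=None,
                mindate=None, maxdate=None, usehistory=False):
    """Build an ESearch URL, rejecting combinations the table above rules out."""
    if (mindate or maxdate) and not datetype:
        raise ValueError("mindate/maxdate require datetype")
    if not 0 <= retmax <= 10000:
        raise ValueError("retmax must be between 0 and 10000")
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    optional = {
        "sort": sort,                          # pass 'pub date'; urlencode emits pub+date
        "datetype": datetype,
        "mindate": mindate,
        "maxdate": maxdate,
        "usehistory": "y" if usehistory else None,
    }
    params.update({k: v for k, v in optional.items() if v is not None})
    return f"{ESEARCH}?{urlencode(params)}"


url = esearch_url(
    "large language models clinical",
    retmax=5, sort="pub date", datetype="pdat",
    mindate="2024/01/01", maxdate="2024/12/31",
)
print(url)
```

Slashes in the dates come out as `%2F`, which is the standard percent-encoding of the same value.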
356
- ### EFetch parameters
357
-
358
- | Parameter | Values | Notes |
359
- |---|---|---|
360
- | `db` | `pubmed` | |
361
- | `id` | `38000000,37999999` | Comma-separated PMIDs; max ~200 per call |
362
- | `query_key` + `WebEnv` | from ESearch `usehistory=y` | Alternative to `id` for large sets |
363
- | `retstart` | integer | Offset for pagination with WebEnv |
364
- | `retmax` | integer, max 10000 | Batch size |
365
- | `retmode` | `xml` | Use XML for EFetch (JSON not available for full records) |
366
- | `rettype` | `abstract` | Returns abstract + core metadata |
367
-
368
- ### PubMed article URL construction
369
-
370
- ```python
371
- pmid = "41999029"
372
- pubmed_url = f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"
373
- doi = "10.12659/MSM.951157"
374
- doi_url = f"https://doi.org/{doi}" # resolves to publisher page
375
- pmc_id = "PMC9876543" # from ESummary articleids
376
- pmc_url = f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmc_id}/"
377
- ```
378
-
379
- ## Gotchas
380
-
381
- - **`count` is a string, not int.** `search['esearchresult']['count']` returns `'24160'`, not `24160`. Always cast with `int()` before arithmetic.
382
-
383
- - **EFetch retmode must be `xml` for full records.** Unlike ESearch and ESummary, EFetch with `retmode=json` returns flat text (the MEDLINE citation text format), not structured JSON. Parse EFetch responses with `xml.etree.ElementTree`.
384
-
385
- - **`ArticleTitle` may contain embedded XML tags.** Titles with italics (`<i>Staphylococcus aureus</i>`) or math (`<sub>2</sub>`) are mixed-content nodes. Always use `''.join(el.itertext())` instead of `el.text`, which silently drops everything after the first child tag.
386
-
387
- - **~15% of articles have no abstract.** `article.find('Abstract')` returns `None` for short communications, editorials, letters, and older records. Always guard with `if abstract_el is not None`.
388
-
389
- - **Author names vary in structure — always handle `CollectiveName`.** Consortium papers list a group name (`'GeKeR Study Group'`, `'Breast Cancer Association Consortium'`) under `<CollectiveName>` instead of `<LastName>/<ForeName>`. Individual authors have `<LastName>` + optionally `<ForeName>` and `<Initials>`. Check `CollectiveName` first; falling through to `LastName` without the check produces `None` errors.
390
- - Confirmed real examples (2026-04-18): PMID 37586835 (`GeKeR Study Group`), PMID 36328784 (`Breast Cancer Association Consortium`)
391
-
392
- - **PubDate has three possible structures.** Most articles have `<Year>` + optional `<Month>` + optional `<Day>`. Seasonal journals use `<Season>` (e.g. `Jul-Aug`, `Oct-Dec`) instead of `<Month>`. A minority of older records use `<MedlineDate>` (e.g. `1995 Fall`) with no `<Year>`. Safe extraction pattern:
393
- ```python
394
- pub_date = journal.find('.//PubDate')
395
- year_el = pub_date.find('Year') if pub_date is not None else None
396
- medline_el = pub_date.find('MedlineDate') if pub_date is not None else None
397
- year = (year_el.text if year_el is not None
398
- else medline_el.text[:4] if medline_el is not None else '')
399
- ```
400
-
401
- - **Batch EFetch: keep IDs to ~200 per call.** The API accepts comma-separated IDs in `id=`, but very large batches (500+) occasionally time out or return truncated XML. For >200 articles, iterate in chunks or use `usehistory` + `WebEnv`.
402
-
403
- - **ELink `pubmed_pubmed` (related articles) is intermittently broken.** The NCBI similarity server returns `"Couldn't resolve #exLinkSrv2, the address table is empty."` — this is a persistent server-side issue as of 2026-04-18, not a rate-limit error. Other linknames (`pubmed_gene`, `pubmed_pmc`, `pubmed_clinvar`) fail with the same error. Use the DOI as a fallback link to publisher full text.
404
-
405
- - **Rate limits: 3 req/s without API key, 10 req/s with free key.** Exceeding 3 req/s returns HTTP 429. Insert `time.sleep(0.34)` between sequential calls without a key. Get a free API key at https://www.ncbi.nlm.nih.gov/account/ and append `&api_key=YOUR_KEY` to all URLs.
406
-
407
- - **`retmax` upper bound is 10 000 for ESearch.** To retrieve more than 10 000 PMIDs for a search, use `usehistory=y` and page through EFetch with `retstart` offsets. EFetch itself also accepts `retmax` up to 10 000 per call.
408
-
409
- - **`retmax=0` in ESearch returns only the count, not IDs — useful for counting.** Combine with `usehistory=y` to store the result for later paging without fetching IDs upfront:
410
- ```python
411
- search = json.loads(http_get(
412
- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
413
- "?db=pubmed&term=cancer&retmax=0&retmode=json&usehistory=y"
414
- ))
415
- total = int(search['esearchresult']['count']) # e.g. 4800000
416
- webenv = search['esearchresult']['webenv']
417
- ```
418
-
419
- - **ESummary `authors` field uses abbreviated names (`Last I`), not full names.** Use EFetch XML to get `ForeName` (e.g. `'Kesimal, Uğur'` vs ESummary `'Kesimal U'`). For bulk tasks where full names are not needed, ESummary is faster.
420
-
421
- - **`querytranslation` shows how NCBI interpreted your term.** The ESearch response includes `esearchresult.querytranslation` — a MeSH-expanded version of your query. Inspect it to verify the search matched what you intended.
1
+ # PubMed / NCBI — Scraping & Data Extraction
2
+
3
+ `https://pubmed.ncbi.nlm.nih.gov` — 37 M+ biomedical citations. **Never use the browser for PubMed.** All data is reachable via `http_get` using the NCBI E-utilities REST API. No API key required; a free key raises the rate limit from 3 to 10 req/s.
4
+
5
+ ## Do this first
6
+
7
+ **ESearch → ESummary is the fastest pipeline for most tasks — two calls, JSON responses, no XML parsing.**
8
+
9
+ ```python
10
+ import json
11
+ from helpers import http_get
12
+
13
+ # Step 1: search → get PMIDs
14
+ search = json.loads(http_get(
15
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
16
+ "?db=pubmed&term=deep+learning+radiology&retmax=10&retmode=json"
17
+ ))
18
+ pmids = search['esearchresult']['idlist'] # e.g. ['41999029', '41998456', ...]
19
+ count = int(search['esearchresult']['count']) # total hits across all pages (the API returns a string)
20
+
21
+ # Step 2: fetch lightweight metadata for all PMIDs in one call
22
+ summary = json.loads(http_get(
23
+ f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
24
+ f"?db=pubmed&id={','.join(pmids)}&retmode=json"
25
+ ))
26
+ result = summary['result']
27
+ for uid in result['uids']:
28
+ art = result[uid]
29
+ print(uid, art['pubdate'], art['source'])
30
+ print(" ", art['title'][:80])
31
+ print(" authors:", [a['name'] for a in art['authors'][:3]])
32
+ # Confirmed output (2026-04-18):
33
+ # 41999029 2026 Apr 18 Med Sci Monit
34
+ # Use of Deep Learning Models in the Diagnosis of Proptosis Through Orbi
35
+ # authors: ['Kesimal U', 'Akkaya HE', 'Polat Ö']
36
+ # 41998456 2026 Apr 17 Sci Rep
37
+ # ...
38
+ ```
39
+
40
+ Use **EFetch XML** when you need: full abstract text, MeSH terms, complete author names (not just "Last I"), structured abstract labels, or the DOI from within the article record.
41
+
42
+ ## Common workflows
43
+
44
+ ### Search PubMed (ESearch)
45
+
46
+ ```python
47
+ import json
48
+ from helpers import http_get
49
+
50
+ data = json.loads(http_get(
51
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
52
+ "?db=pubmed"
53
+ "&term=large+language+models+clinical"
54
+ "&retmax=5"
55
+ "&retmode=json"
56
+ "&sort=pub+date" # newest first; default is relevance
57
+ "&datetype=pdat" # filter by publication date
58
+ "&mindate=2024/01/01&maxdate=2024/12/31" # YYYY/MM/DD format
59
+ ))
60
+ result = data['esearchresult']
61
+ print("Total hits:", result['count']) # '24160' — note: string, not int
62
+ print("PMIDs:", result['idlist'])
63
+ print("Query translation:", result['querytranslation'])
64
+ # Confirmed output (2026-04-18):
65
+ # Total hits: 24160
66
+ # PMIDs: ['41996895', '41996722', '41996006', '41995888', '41995759']
67
+ # Query translation: "large language models"[MeSH Terms] OR ...
68
+ ```
69
+
70
+ #### ESearch field tags (append to term)
71
+
72
+ ```
73
+ machine learning[MeSH Terms] MeSH controlled vocabulary
74
+ Hinton GE[Author] author last + initials
75
+ attention is all you need[Title] title words
76
+ Nature[Journal] journal name
77
+ 2024[pdat] publication year
78
+ ```
79
+
80
+ Boolean operators: `AND`, `OR`, `NOT`. Phrase search: `"exact phrase"[Title]`.
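The field tags and Boolean operators above can also be assembled programmatically. The following is a minimal sketch, assuming a hypothetical helper named `build_esearch_url` (it is not part of `helpers`), that joins tagged clauses with `AND` and leaves the escaping to `urllib.parse.urlencode`:

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(clauses, retmax=20, sort=None):
    """Hypothetical helper: AND-join (text, field_tag) pairs into an ESearch URL."""
    # A clause with an empty tag is passed through as an untagged term.
    term = " AND ".join(f"{text}[{tag}]" if tag else text for text, tag in clauses)
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    if sort:
        params["sort"] = sort  # "pub date" is encoded as "pub+date"
    return f"{ESEARCH}?{urlencode(params)}"

url = build_esearch_url(
    [('"deep learning"', "Title"), ("Nature", "Journal"), ("2024", "pdat")],
    retmax=5,
    sort="pub date",
)
print(url)
```

The resulting URL can be passed to `http_get` as in the examples above. Building the query from a dict avoids hand-escaping quotes and brackets in `term`.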
81
+
82
+ #### Sort options (`sort=`)
83
+
84
+ | Value | Effect |
85
+ |---|---|
86
+ | *(omit)* | Relevance (default) |
87
+ | `pub+date` | Most recent publication first |
88
+ | `Author` | First author alphabetical |
89
+ | `JournalName` | Journal alphabetical |
90
+
91
+ ### Lightweight metadata — ESummary (JSON, no XML)
92
+
93
+ ```python
94
+ import json
95
+ from helpers import http_get
96
+
97
+ data = json.loads(http_get(
98
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
99
+ "?db=pubmed&id=41999029,41998456,41997837&retmode=json"
100
+ ))
101
+ result = data['result']
102
+ for uid in result['uids']:
103
+ art = result[uid]
104
+ # Key fields available:
105
+ title = art['title'] # full title string
106
+ source = art['source'] # abbreviated journal name
107
+ fulljournalname = art['fulljournalname']
108
+ pubdate = art['pubdate'] # e.g. '2026 Apr 18'
109
+ epubdate = art['epubdate'] # e-pub ahead of print date (may be empty)
110
+ authors = art['authors'] # list of {'name': 'Last I', 'authtype': ...}
111
+ volume = art['volume']
112
+ issue = art['issue']
113
+ pages = art['pages']
114
+ pubtype = art['pubtype'] # list: ['Journal Article', 'Review', ...]
115
+ # Extract DOI from elocationid or articleids:
116
+ doi_field = art['elocationid'] # e.g. 'doi: 10.12659/MSM.951157'
117
+ article_ids = {x['idtype']: x['value'] for x in art['articleids']}
118
+ doi = article_ids.get('doi')
119
+ pmc_id = article_ids.get('pmc') # PMC ID if the article is in PubMed Central, else None
120
+ print(uid, pubdate, source)
121
+ print(" ", title[:70])
122
+ print(" doi:", doi, "| pmc:", pmc_id)
123
+ # Confirmed output (2026-04-18):
124
+ # 41999029 2026 Apr 18 Med Sci Monit
125
+ # Use of Deep Learning Models in the Diagnosis of Proptosis Through Orbi
126
+ # doi: 10.12659/MSM.951157 | pmc: None
127
+ ```
128
+
129
+ ### Full article metadata — EFetch XML
130
+
131
+ Use this for full abstracts, complete author names, MeSH terms, and structured abstract sections.
132
+
133
+ ```python
134
+ import xml.etree.ElementTree as ET
135
+ from helpers import http_get
136
+
137
+ raw = http_get(
138
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
139
+ "?db=pubmed&id=41999029,36328784&retmode=xml&rettype=abstract"
140
+ )
141
+ root = ET.fromstring(raw)
142
+
143
+ for art in root.findall('.//PubmedArticle'):
144
+ mc = art.find('MedlineCitation')
145
+ pmid = mc.find('PMID').text
146
+ article = mc.find('Article')
147
+
148
+ # Title — use itertext() to handle embedded tags like <i>, <sub>
149
+ title = ''.join(article.find('ArticleTitle').itertext()).strip()
150
+
151
+ # Abstract — plain or structured (BACKGROUND / METHODS / RESULTS / CONCLUSION)
152
+ abstract_el = article.find('Abstract')
153
+ if abstract_el is not None:
154
+ sections = []
155
+ for t in abstract_el.findall('AbstractText'):
156
+ label = t.get('Label', '') # e.g. 'BACKGROUND', 'METHODS'
157
+ text = ''.join(t.itertext()).strip()
158
+ sections.append(f"[{label}] {text}" if label else text)
159
+ abstract = ' '.join(sections)
160
+ else:
161
+ abstract = '' # ~15% of articles have no abstract
162
+
163
+ # Journal + year
164
+ journal = article.find('Journal')
165
+ j_title = journal.find('Title').text if journal is not None else ''
166
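+ year = '' # default, so the print below never reads an unset or stale value when PubDate is missing
+ pub_date = journal.find('.//PubDate') if journal is not None else None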
167
+ if pub_date is not None:
168
+ year_el = pub_date.find('Year')
169
+ medline_el = pub_date.find('MedlineDate') # fallback for old/seasonal dates
170
+ season_el = pub_date.find('Season') # e.g. 'Jul-Aug', 'Oct-Dec'
171
+ year = (year_el.text if year_el is not None
172
+ else medline_el.text[:4] if medline_el is not None else '')
173
+
174
+ # DOI
175
+ doi_el = next(
176
+ (e for e in article.findall('ELocationID') if e.get('EIdType') == 'doi'),
177
+ None
178
+ )
179
+ doi = doi_el.text if doi_el is not None else ''
180
+
181
+ # Authors — handle CollectiveName (consortium/group authors)
182
+ author_list = article.find('AuthorList')
183
+ authors = []
184
+ if author_list is not None:
185
+ for a in author_list.findall('Author'):
186
+ collective = a.find('CollectiveName')
187
+ last = a.find('LastName')
188
+ fore = a.find('ForeName')
189
+ initials = a.find('Initials')
190
+ if collective is not None:
191
+ authors.append(collective.text)
192
+ elif last is not None:
193
+ full = last.text
194
+ if fore is not None:
195
+ full += f", {fore.text}"
196
+ authors.append(full)
197
+
198
+ # MeSH controlled vocabulary terms
199
+ mesh_list = mc.find('MeshHeadingList')
200
+ mesh_terms = []
201
+ if mesh_list is not None:
202
+ mesh_terms = [
203
+ mh.find('DescriptorName').text
204
+ for mh in mesh_list.findall('MeshHeading')
205
+ if mh.find('DescriptorName') is not None
206
+ ]
207
+
208
+ print(f"PMID={pmid} ({year}) {j_title}")
209
+ print(f" Title: {title[:70]}")
210
+ print(f" Authors: {authors[:3]}")
211
+ print(f" DOI: {doi}")
212
+ print(f" MeSH: {mesh_terms[:4]}")
213
+ print(f" Abstract: {abstract[:120]}")
214
+ # Confirmed output (2026-04-18):
215
+ # PMID=41999029 (2026) Medical science monitor : international medical...
216
+ # Title: Use of Deep Learning Models in the Diagnosis of Proptosis Thro
217
+ # Authors: ['Kesimal, Uğur', 'Akkaya, Habip Eser', 'Polat, Önder']
218
+ # DOI: 10.12659/MSM.951157
219
+ # MeSH: ['Humans', 'Deep Learning', 'Exophthalmos', 'Magnetic Resonance Imaging']
220
+ # Abstract: BACKGROUND Proptosis is a common manifestation of orbital disease...
221
+ # PMID=36328784 (...)
222
+ # Abstract: [OBJECTIVES] Physical inactivity and sedentary behaviour... ← structured
223
+ ```
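The EFetch example reads titles and abstract sections with `itertext()` rather than `.text`. A small offline sketch shows the difference; the XML string here is a hand-written sample that mimics a title with inline markup, not a real PubMed record:

```python
import xml.etree.ElementTree as ET

# Hand-written sample (an assumption for illustration, not fetched from NCBI)
sample = (
    "<ArticleTitle>Biofilm formation in "
    "<i>Staphylococcus aureus</i> isolates.</ArticleTitle>"
)
el = ET.fromstring(sample)

text_only = el.text                  # text before the first child tag only
full_title = "".join(el.itertext())  # element text, child text, and tails

print(repr(text_only))
print(repr(full_title))
```

`el.text` stops at the `<i>` tag, while `itertext()` walks the whole mixed-content node, which is why the parsing code uses it for `ArticleTitle` and `AbstractText`.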
224
+
225
+ ### Large result sets — usehistory + WebEnv
226
+
227
+ When `count` exceeds `retmax` (max 10 000), use server-side history to paginate EFetch without re-running ESearch on every page.
228
+
229
+ ```python
230
+ import json, xml.etree.ElementTree as ET
231
+ from helpers import http_get
232
+
233
+ # Step 1: ESearch with usehistory=y — NCBI holds result set on server
234
+ search = json.loads(http_get(
235
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
236
+ "?db=pubmed&term=CRISPR+gene+editing&retmax=0&retmode=json&usehistory=y"
237
+ ))
238
+ webenv = search['esearchresult']['webenv'] # server-side session token
239
+ query_key = search['esearchresult']['querykey'] # result set ID within session
240
+ total = int(search['esearchresult']['count'])
241
+ print(f"Total: {total}, WebEnv: {webenv[:30]}..., query_key: {query_key}")
242
+ # Confirmed output (2026-04-18):
243
+ # Total: 24160, WebEnv: MCID_69e4203757db89391008d6f1..., query_key: 1
244
+
245
+ # Step 2: EFetch pages using WebEnv (no re-searching)
246
+ batch_size = 200
247
+ for start in range(0, min(total, 1000), batch_size): # cap at 1000 for demo
248
+ raw = http_get(
249
+ f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
250
+ f"?db=pubmed&query_key={query_key}&WebEnv={webenv}"
251
+ f"&retstart={start}&retmax={batch_size}&retmode=xml&rettype=abstract"
252
+ )
253
+ root = ET.fromstring(raw)
254
+ articles = root.findall('.//PubmedArticle')
255
+ print(f" Fetched {len(articles)} articles (start={start})")
256
+ # process articles here...
257
+ ```
258
+
259
+ ### EInfo — list available NCBI databases
260
+
261
+ ```python
262
+ import json
263
+ from helpers import http_get
264
+
265
+ data = json.loads(http_get(
266
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?retmode=json"
267
+ ))
268
+ dbs = data['einforesult']['dblist']
269
+ print(f"Total databases: {len(dbs)}") # Confirmed: 39 (2026-04-18)
270
+ print(dbs[:10])
271
+ # ['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure',
272
+ # 'genome', 'annotinfo', 'assembly', 'bioproject']
273
+ ```
274
+
275
+ Get PubMed-specific metadata (field list, link list):
276
+
277
+ ```python
278
+ import json
279
+ from helpers import http_get
280
+
281
+ data = json.loads(http_get(
282
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pubmed&retmode=json"
283
+ ))
284
+ db_info = data['einforesult']['dbinfo'][0]
285
+ print("DB name:", db_info['dbname'])
286
+ print("Record count:", db_info['count']) # total PubMed records
287
+ link_names = [l['name'] for l in db_info.get('linklist', [])]
288
+ print(f"Link types ({len(link_names)}):", link_names[:5])
289
+ # Confirmed (2026-04-18):
290
+ # DB name: pubmed
291
+ # Record count: 37620453
292
+ # Link types (48): ['pubmed_assembly', 'pubmed_bioproject', ...]
293
+ ```
294
+
295
+ ### ELink — cross-database linking
296
+
297
+ ELink connects a PubMed record to associated data in other NCBI databases. The `pubmed_pubmed` "related articles" linkname relies on a similarity server that is intermittently unavailable (it returns `"Couldn't resolve #exLinkSrv2, the address table is empty."`). Prefer the non-similarity links below, but check the `ERROR` field in every response, since they can fail with the same message.
298
+
299
+ ```python
300
+ import json
301
+ from helpers import http_get
302
+
303
+ # Link a PMID to its free full-text in PMC (if open access)
304
+ # linkname=pubmed_pmc — may also hit the server outage; check error field
305
+ data = json.loads(http_get(
306
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"
307
+ "?dbfrom=pubmed&id=38325330&linkname=pubmed_pmc&retmode=json"
308
+ ))
309
+ error = data.get('ERROR', '')
310
+ if error:
311
+ print("ELink error:", error) # 'Couldn't resolve #exLinkSrv2...' — NCBI server issue
312
+ else:
313
+ for ls in data.get('linksets', []):
314
+ for lsdb in ls.get('linksetdbs', []):
315
+ print(lsdb['linkname'], "→", lsdb['links'][:5])
316
+ ```
317
+
318
+ Available ELink linknames from pubmed (48 total):
319
+
320
+ | linkname | Target |
321
+ |---|---|
322
+ | `pubmed_pmc` | Free full text in PMC |
323
+ | `pubmed_pubmed_citedin` | Articles citing this paper |
324
+ | `pubmed_pubmed_refs` | References cited by this paper |
325
+ | `pubmed_gene` | Related Gene records |
326
+ | `pubmed_clinvar` | Clinical variants associated with publication |
327
+ | `pubmed_gds` | Related GEO datasets |
328
+
329
+ **Practical alternative**: If ELink is down, extract the DOI from EFetch/ESummary and use `https://doi.org/{doi}` directly as the full-text link.
330
+
331
+ ## URL and parameter reference
332
+
333
+ ### E-utilities base URLs
334
+
335
+ ```
336
+ https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi # search → PMIDs
337
+ https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi # PMIDs → JSON summary
338
+ https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi # PMIDs → full XML
339
+ https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi # cross-db links
340
+ https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi # DB metadata
341
+ ```
342
+
343
+ ### ESearch parameters
344
+
345
+ | Parameter | Values | Notes |
346
+ |---|---|---|
347
+ | `db` | `pubmed` | Always `pubmed` for PubMed |
348
+ | `term` | query string | Supports field tags like `[Author]`, `[Title]`, `[MeSH Terms]` |
349
+ | `retmax` | integer, max 10000 | Results returned per call |
350
+ | `retmode` | `json` | JSON output |
351
+ | `sort` | `pub+date`, `Author`, `JournalName` | Default is relevance |
352
+ | `datetype` | `pdat` (pub), `edat` (entrez), `mdat` (modified) | |
353
+ | `mindate`, `maxdate` | `YYYY/MM/DD` or `YYYY` | Requires `datetype` |
354
+ | `usehistory` | `y` | Store results on server; returns `webenv` + `querykey` |
355
+
356
+ ### EFetch parameters
357
+
358
+ | Parameter | Values | Notes |
359
+ |---|---|---|
360
+ | `db` | `pubmed` | |
361
+ | `id` | `38000000,37999999` | Comma-separated PMIDs; max ~200 per call |
362
+ | `query_key` + `WebEnv` | from ESearch `usehistory=y` | Alternative to `id` for large sets |
363
+ | `retstart` | integer | Offset for pagination with WebEnv |
364
+ | `retmax` | integer, max 10000 | Batch size |
365
+ | `retmode` | `xml` | Use XML for EFetch (JSON not available for full records) |
366
+ | `rettype` | `abstract` | Returns abstract + core metadata |
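The `id` row above recommends roughly 200 PMIDs per call. One way to respect that is to slice the ID list before building URLs. This is a sketch under that assumption; `chunked` and `efetch_urls` are hypothetical names, and the PMIDs are placeholders generated for the example:

```python
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def chunked(ids, size=200):
    """Yield successive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def efetch_urls(pmids, size=200):
    """Hypothetical helper: build one EFetch URL per batch of PMIDs."""
    return [
        f"{EFETCH}?db=pubmed&id={','.join(batch)}&retmode=xml&rettype=abstract"
        for batch in chunked(pmids, size)
    ]

# 450 placeholder IDs, only to exercise the batching logic
pmids = [str(n) for n in range(38000000, 38000450)]
urls = efetch_urls(pmids)
print(len(urls))  # 3 batches: 200 + 200 + 50
```

Each URL can then be fetched with `http_get` and parsed as shown in the EFetch workflow. For very large sets, the `query_key` + `WebEnv` route avoids sending IDs at all.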
367
+
368
+ ### PubMed article URL construction
369
+
370
+ ```python
371
+ pmid = "41999029"
372
+ pubmed_url = f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"
373
+ doi = "10.12659/MSM.951157"
374
+ doi_url = f"https://doi.org/{doi}" # resolves to publisher page
375
+ pmc_id = "PMC9876543" # from ESummary articleids
376
+ pmc_url = f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmc_id}/"
377
+ ```
378
+
379
+ ## Gotchas
380
+
381
+ - **`count` is a string, not int.** `search['esearchresult']['count']` returns `'24160'`, not `24160`. Always cast with `int()` before arithmetic.
382
+
383
+ - **EFetch retmode must be `xml` for full records.** Unlike ESearch and ESummary, EFetch with `retmode=json` returns flat text (the MEDLINE citation text format), not structured JSON. Parse EFetch responses with `xml.etree.ElementTree`.
384
+
385
+ - **`ArticleTitle` may contain embedded XML tags.** Titles with italics (`<i>Staphylococcus aureus</i>`) or math (`<sub>2</sub>`) are mixed-content nodes. Always use `''.join(el.itertext())` instead of `el.text`, which silently drops everything after the first child tag.
386
+
387
+ - **~15% of articles have no abstract.** `article.find('Abstract')` returns `None` for short communications, editorials, letters, and older records. Always guard with `if abstract_el is not None`.
388
+
389
+ - **Author names vary in structure; always handle `CollectiveName`.** Consortium papers list a group name (`'GeKeR Study Group'`, `'Breast Cancer Association Consortium'`) under `<CollectiveName>` instead of `<LastName>`/`<ForeName>`. Individual authors have `<LastName>` plus, optionally, `<ForeName>` and `<Initials>`. Check `CollectiveName` first: for a group author `a.find('LastName')` returns `None`, so reading `.text` on it without the check raises `AttributeError`.
390
+ - Confirmed real examples (2026-04-18): PMID 37586835 (`GeKeR Study Group`), PMID 36328784 (`Breast Cancer Association Consortium`)
391
+
392
+ - **PubDate has three possible structures.** Most articles have `<Year>` + optional `<Month>` + optional `<Day>`. Seasonal journals use `<Season>` (e.g. `Jul-Aug`, `Oct-Dec`) instead of `<Month>`. A minority of older records use `<MedlineDate>` (e.g. `1995 Fall`) with no `<Year>`. Safe extraction pattern:
393
+ ```python
394
+ pub_date = journal.find('.//PubDate')
395
+ year_el = pub_date.find('Year') if pub_date is not None else None
396
+ medline_el = pub_date.find('MedlineDate') if pub_date is not None else None
397
+ year = (year_el.text if year_el is not None
398
+ else medline_el.text[:4] if medline_el is not None else '')
399
+ ```
400
+
401
+ - **Batch EFetch: keep IDs to ~200 per call.** The API accepts comma-separated IDs in `id=`, but very large batches (500+) occasionally time out or return truncated XML. For >200 articles, iterate in chunks or use `usehistory` + `WebEnv`.
402
+
403
+ - **ELink `pubmed_pubmed` (related articles) is intermittently broken.** The NCBI similarity server returns `"Couldn't resolve #exLinkSrv2, the address table is empty."`. This is a server-side issue (still occurring as of 2026-04-18), not a rate-limit error. Other linknames (`pubmed_gene`, `pubmed_pmc`, `pubmed_clinvar`) fail with the same error. Use the DOI as a fallback link to publisher full text.
404
+
405
+ - **Rate limits: 3 req/s without API key, 10 req/s with free key.** Exceeding 3 req/s returns HTTP 429. Insert `time.sleep(0.34)` between sequential calls without a key. Get a free API key at https://www.ncbi.nlm.nih.gov/account/ and append `&api_key=YOUR_KEY` to all URLs.
406
+
407
+ - **`retmax` upper bound is 10 000 for ESearch.** To retrieve more than 10 000 PMIDs for a search, use `usehistory=y` and page through EFetch with `retstart` offsets. EFetch itself also accepts `retmax` up to 10 000 per call.
408
+
409
+ - **`retmax=0` in ESearch returns only the count, not IDs — useful for counting.** Combine with `usehistory=y` to store the result for later paging without fetching IDs upfront:
410
+ ```python
411
+ search = json.loads(http_get(
412
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
413
+ "?db=pubmed&term=cancer&retmax=0&retmode=json&usehistory=y"
414
+ ))
415
+ total = int(search['esearchresult']['count']) # e.g. 4800000
416
+ webenv = search['esearchresult']['webenv']
417
+ ```
418
+
419
+ - **ESummary `authors` field uses abbreviated names (`Last I`), not full names.** Use EFetch XML to get `ForeName` (e.g. `'Kesimal, Uğur'` vs ESummary `'Kesimal U'`). For bulk tasks where full names are not needed, ESummary is faster.
420
+
421
+ - **`querytranslation` shows how NCBI interpreted your term.** The ESearch response includes `esearchresult.querytranslation` — a MeSH-expanded version of your query. Inspect it to verify the search matched what you intended.
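The rate-limit gotcha above suggests a fixed `time.sleep(0.34)` between calls. A slightly more flexible sketch is a small throttle that only sleeps for the time still owed since the previous call. `Throttle` is a hypothetical helper, not part of `helpers`; the injectable `clock` and `sleep` arguments exist only to make it testable:

```python
import time

class Throttle:
    """Hypothetical helper: enforce a minimum interval between calls."""

    def __init__(self, per_second=3, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # timestamp of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)  # sleep only for the time still owed
                now = self._clock()
        self._last = now

throttle = Throttle(per_second=3)  # assumption: use 10 when an API key is appended
for _ in range(3):
    throttle.wait()                # call this right before each http_get(...)
```

Compared with an unconditional sleep, this adds no delay when parsing or other work between requests already took longer than the minimum interval.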