@pencil-agent/nano-pencil 2.0.0 → 2.0.1

Files changed (195)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/mcp/mcp-client.d.ts +3 -1
  7. package/dist/core/mcp/mcp-client.js +6 -6
  8. package/dist/core/mcp/mcp-config.d.ts +3 -3
  9. package/dist/core/mcp/mcp-config.js +1 -1
  10. package/dist/core/mcp/mcp-manager.d.ts +5 -1
  11. package/dist/core/mcp/mcp-manager.js +1 -1
  12. package/dist/core/platform/config/resource-loader.d.ts +2 -0
  13. package/dist/core/platform/config/resource-loader.js +2 -2
  14. package/dist/core/runtime/agent-session.d.ts +12 -0
  15. package/dist/core/runtime/agent-session.js +8 -8
  16. package/dist/core/runtime/sdk.d.ts +8 -0
  17. package/dist/core/runtime/sdk.js +1 -1
  18. package/dist/extensions/builtin/AGENT.md +115 -115
  19. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  20. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  91. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  92. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  93. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  94. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  95. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  96. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  97. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  98. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  99. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  100. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  101. package/dist/extensions/builtin/browser/browser.md +73 -73
  102. package/dist/extensions/builtin/browser/install.md +142 -142
  103. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  104. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  105. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  107. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  108. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  109. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  110. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  111. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  112. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  113. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  114. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  115. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  116. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  117. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  118. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  119. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  120. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  121. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  122. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  123. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  124. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  125. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  126. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  127. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  128. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  129. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  130. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  131. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  132. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  133. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  134. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  135. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  136. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  137. package/dist/extensions/builtin/goal/README.md +67 -67
  138. package/dist/extensions/builtin/grub/README.md +112 -112
  139. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  140. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  141. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  142. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  143. package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
  144. package/dist/extensions/builtin/loop/README.md +92 -92
  145. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  146. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  147. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  148. package/dist/extensions/builtin/sal/README.md +72 -72
  149. package/dist/extensions/builtin/security-audit/README.md +289 -289
  150. package/dist/extensions/builtin/team/AGENT.md +112 -112
  151. package/dist/extensions/builtin/team/TESTING.md +299 -299
  152. package/dist/extensions/builtin/token-save/README.md +56 -56
  153. package/dist/extensions/optional/AGENT.md +10 -10
  154. package/dist/modes/interactive/interactive-mode.js +36 -36
  155. package/dist/modes/interactive/theme/dark.json +85 -85
  156. package/dist/modes/interactive/theme/light.json +84 -84
  157. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  158. package/dist/modes/interactive/theme/warm.json +81 -81
  159. package/dist/node_modules/@pencil-agent/agent-core/dist/agent-loop.js +3 -2
  160. package/dist/node_modules/@pencil-agent/agent-core/dist/structured-adaptive-agent-loop.js +2 -1
  161. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  162. package/docs/cc-agent-design.md +1297 -0
  163. package/docs/cc-tui-design.md +1333 -0
  164. package/docs/codex-goal-command-impl.md +1055 -1055
  165. package/docs/codex-goal-vs-grub.md +500 -500
  166. package/docs/custom-provider.md +27 -27
  167. package/docs/extensions.md +27 -27
  168. package/docs/keybindings.md +27 -27
  169. package/docs/loop /351/207/215/346/236/204/345/256/214/346/210/220/346/200/273/347/273/223.md" +250 -250
  170. package/docs/loop /351/207/215/346/236/204/345/256/214/346/210/220/346/212/245/345/221/212.md" +122 -122
  171. package/docs/loop /351/207/215/346/236/204/346/226/271/346/241/210.md" +1222 -1222
  172. package/docs/loop /351/207/215/346/236/204/346/226/271/346/241/210/345/256/236/347/216/260/346/212/245/345/221/212.md" +158 -158
  173. package/docs/loop /351/207/215/346/236/204/346/226/271/346/241/210/345/257/271/346/257/224/345/210/206/346/236/220.md" +128 -128
  174. package/docs/loop /351/207/215/346/236/204/350/256/241/345/210/222.md" +320 -320
  175. package/docs/loop-usage-examples.md +214 -214
  176. package/docs/models.md +27 -27
  177. package/docs/nanoPencil-/345/255/246/344/271/240/350/256/241/345/210/222.md +170 -0
  178. package/docs/packages.md +27 -27
  179. package/docs/pi-design-philosophy.md +457 -457
  180. package/docs/planmode.md +1987 -1987
  181. package/docs/prompt-templates.md +27 -27
  182. package/docs/providers.md +27 -27
  183. package/docs/scan-report.md +3820 -0
  184. package/docs/sdk.md +27 -27
  185. package/docs/skills.md +27 -27
  186. package/docs/themes.md +27 -27
  187. package/docs/tui.md +27 -27
  188. package/docs//345/257/271/346/240/207Claude-Code.md +1775 -0
  189. package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +261 -0
  190. package/package.json +190 -190
  191. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +0 -851
  192. package/docs/SDK-TESTING.md +0 -364
  193. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +0 -593
  194. package/docs/startup-performance-optimization.md +0 -301
  195. package/docs//350/256/244/347/237/245/345/234/260/345/233/276.md +0 -47
@@ -1,421 +1,421 @@
- # PubMed / NCBI — Scraping & Data Extraction
-
- `https://pubmed.ncbi.nlm.nih.gov` — 37 M+ biomedical citations. **Never use the browser for PubMed.** All data is reachable via `http_get` using the NCBI E-utilities REST API. No API key required; a free key raises the rate limit from 3 to 10 req/s.
-
- ## Do this first
-
- **ESearch → ESummary is the fastest pipeline for most tasks — two calls, JSON responses, no XML parsing.**
-
- ```python
- import json
- from helpers import http_get
-
- # Step 1: search → get PMIDs
- search = json.loads(http_get(
-     "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
-     "?db=pubmed&term=deep+learning+radiology&retmax=10&retmode=json"
- ))
- pmids = search['esearchresult']['idlist']  # e.g. ['41999029', '41998456', ...]
- count = search['esearchresult']['count']  # total hits across all pages
-
- # Step 2: fetch lightweight metadata for all PMIDs in one call
- summary = json.loads(http_get(
-     f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
-     f"?db=pubmed&id={','.join(pmids)}&retmode=json"
- ))
- result = summary['result']
- for uid in result['uids']:
-     art = result[uid]
-     print(uid, art['pubdate'], art['source'])
-     print(" ", art['title'][:80])
-     print(" authors:", [a['name'] for a in art['authors'][:3]])
- # Confirmed output (2026-04-18):
- # 41999029 2026 Apr 18 Med Sci Monit
- # Use of Deep Learning Models in the Diagnosis of Proptosis Through Orbi
- # authors: ['Kesimal U', 'Akkaya HE', 'Polat Ö']
- # 41998456 2026 Apr 17 Sci Rep
- # ...
- ```
-
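The intro quotes a limit of 3 requests per second without an API key (10 with one). A minimal sketch of a client-side throttle that spaces calls accordingly is given here. `Throttle` is a hypothetical helper, not part of `helpers` or of this package, and the clock and sleep functions are injectable only so the behaviour can be checked without real waiting.

```python
import time


class Throttle:
    """Space out calls so that at most `rate` of them start per second.

    Hypothetical helper for illustration; not part of the package.
    """

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
        return delay


# 3 req/s without an API key, per the limit quoted in the intro.
throttle = Throttle(rate=3)
```

Calling `throttle.wait()` immediately before each `http_get(...)` should keep a sequential script under the limit. It does not coordinate across threads or processes.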
- Use **EFetch XML** when you need: full abstract text, MeSH terms, complete author names (not just "Last I"), structured abstract labels, or the DOI from within the article record.
-
- ## Common workflows
-
- ### Search PubMed (ESearch)
-
- ```python
- import json
- from helpers import http_get
-
- data = json.loads(http_get(
-     "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
-     "?db=pubmed"
-     "&term=large+language+models+clinical"
-     "&retmax=5"
-     "&retmode=json"
-     "&sort=pub+date"  # newest first; default is relevance
-     "&datetype=pdat"  # filter by publication date
-     "&mindate=2024/01/01&maxdate=2024/12/31"  # YYYY/MM/DD format
- ))
- result = data['esearchresult']
- print("Total hits:", result['count'])  # '24160' — note: string, not int
- print("PMIDs:", result['idlist'])
- print("Query translation:", result['querytranslation'])
- # Confirmed output (2026-04-18):
- # Total hits: 24160
- # PMIDs: ['41996895', '41996722', '41996006', '41995888', '41995759']
- # Query translation: "large language models"[MeSH Terms] OR ...
- ```
-
- #### ESearch field tags (append to term)
-
- ```
- machine learning[MeSH Terms]        MeSH controlled vocabulary
- Hinton GE[Author]                   author last + initials
- attention is all you need[Title]    title words
- Nature[Journal]                     journal name
- 2024[pdat]                          publication year
- ```
-
- Boolean operators: `AND`, `OR`, `NOT`. Phrase search: `"exact phrase"[Title]`.
-
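Field tags, Boolean operators and quoted phrases all end up in a single `term` value, which then has to be URL-encoded (quotes, brackets and spaces included). A minimal sketch using the standard library's `urllib.parse.urlencode` follows. `build_esearch_url` is a hypothetical helper name, and the query itself is only an example.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=10, **extra):
    """Return an ESearch URL with `term` safely URL-encoded.

    Hypothetical helper; the parameter names are the E-utilities
    parameters already used in this file.
    """
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    params.update(extra)
    return f"{ESEARCH}?{urlencode(params)}"


# Example query combining a phrase search, field tags and Boolean operators.
term = '"deep learning"[Title] AND Nature[Journal] NOT 2024[pdat]'
url = build_esearch_url(term, retmax=5)
print(url)
```

Letting `urlencode` do the escaping avoids hand-writing `+` and `%5B`/`%5D` sequences when a query contains brackets or quotes.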
- #### Sort options (`sort=`)
-
- | Value | Effect |
- |---|---|
- | *(omit)* | Relevance (default) |
- | `pub+date` | Most recent publication first |
- | `Author` | First author alphabetical |
- | `JournalName` | Journal alphabetical |
-
- ### Lightweight metadata — ESummary (JSON, no XML)
-
- ```python
- import json
- from helpers import http_get
-
- data = json.loads(http_get(
-     "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
-     "?db=pubmed&id=41999029,41998456,41997837&retmode=json"
- ))
- result = data['result']
- for uid in result['uids']:
-     art = result[uid]
-     # Key fields available:
-     title = art['title']  # full title string
-     source = art['source']  # abbreviated journal name
-     fulljournalname = art['fulljournalname']
-     pubdate = art['pubdate']  # e.g. '2026 Apr 18'
-     epubdate = art['epubdate']  # e-pub ahead of print date (may be empty)
-     authors = art['authors']  # list of {'name': 'Last I', 'authtype': ...}
-     volume = art['volume']
-     issue = art['issue']
-     pages = art['pages']
-     pubtype = art['pubtype']  # list: ['Journal Article', 'Review', ...]
-     # Extract DOI from elocationid or articleids:
-     doi_field = art['elocationid']  # e.g. 'doi: 10.12659/MSM.951157'
-     article_ids = {x['idtype']: x['value'] for x in art['articleids']}
-     doi = article_ids.get('doi')
-     pmc_id = article_ids.get('pmc')  # PMC ID if open access
-     print(uid, pubdate, source)
-     print(" ", title[:70])
-     print(" doi:", doi, "| pmc:", pmc_id)
- # Confirmed output (2026-04-18):
- # 41999029 2026 Apr 18 Med Sci Monit
- # Use of Deep Learning Models in the Diagnosis of Proptosis Through Orbi
- # doi: 10.12659/MSM.951157 | pmc: None
- ```
-
- ### Full article metadata — EFetch XML
-
- Use this for full abstracts, complete author names, MeSH terms, structured abstract sections.
-
- ```python
- import json, xml.etree.ElementTree as ET
- from helpers import http_get
-
- raw = http_get(
-     "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
-     "?db=pubmed&id=41999029,36328784&retmode=xml&rettype=abstract"
- )
- root = ET.fromstring(raw)
-
- for art in root.findall('.//PubmedArticle'):
-     mc = art.find('MedlineCitation')
-     pmid = mc.find('PMID').text
-     article = mc.find('Article')
-
-     # Title — use itertext() to handle embedded tags like <i>, <sub>
-     title = ''.join(article.find('ArticleTitle').itertext()).strip()
-
-     # Abstract — plain or structured (BACKGROUND / METHODS / RESULTS / CONCLUSION)
-     abstract_el = article.find('Abstract')
-     if abstract_el is not None:
-         sections = []
-         for t in abstract_el.findall('AbstractText'):
-             label = t.get('Label', '')  # e.g. 'BACKGROUND', 'METHODS'
-             text = ''.join(t.itertext()).strip()
-             sections.append(f"[{label}] {text}" if label else text)
-         abstract = ' '.join(sections)
-     else:
-         abstract = ''  # ~15% of articles have no abstract
-
-     # Journal + year
-     journal = article.find('Journal')
-     j_title = journal.find('Title').text if journal is not None else ''
-     pub_date = journal.find('.//PubDate') if journal is not None else None
-     if pub_date is not None:
-         year_el = pub_date.find('Year')
-         medline_el = pub_date.find('MedlineDate')  # fallback for old/seasonal dates
-         season_el = pub_date.find('Season')  # e.g. 'Jul-Aug', 'Oct-Dec'
-         year = (year_el.text if year_el is not None
-                 else medline_el.text[:4] if medline_el is not None else '')
-
-     # DOI
-     doi_el = next(
-         (e for e in article.findall('ELocationID') if e.get('EIdType') == 'doi'),
-         None
-     )
-     doi = doi_el.text if doi_el is not None else ''
-
-     # Authors — handle CollectiveName (consortium/group authors)
-     author_list = article.find('AuthorList')
-     authors = []
-     if author_list is not None:
-         for a in author_list.findall('Author'):
-             collective = a.find('CollectiveName')
-             last = a.find('LastName')
-             fore = a.find('ForeName')
-             initials = a.find('Initials')
-             if collective is not None:
-                 authors.append(collective.text)
-             elif last is not None:
-                 full = last.text
-                 if fore is not None:
-                     full += f", {fore.text}"
-                 authors.append(full)
-
-     # MeSH controlled vocabulary terms
-     mesh_list = mc.find('MeshHeadingList')
-     mesh_terms = []
-     if mesh_list is not None:
-         mesh_terms = [
-             mh.find('DescriptorName').text
-             for mh in mesh_list.findall('MeshHeading')
-             if mh.find('DescriptorName') is not None
-         ]
-
-     print(f"PMID={pmid} ({year}) {j_title}")
-     print(f" Title: {title[:70]}")
-     print(f" Authors: {authors[:3]}")
-     print(f" DOI: {doi}")
-     print(f" MeSH: {mesh_terms[:4]}")
-     print(f" Abstract: {abstract[:120]}")
- # Confirmed output (2026-04-18):
- # PMID=41999029 (2026) Medical science monitor : international medical...
- # Title: Use of Deep Learning Models in the Diagnosis of Proptosis Thro
- # Authors: ['Kesimal, Uğur', 'Akkaya, Habip Eser', 'Polat, Önder']
- # DOI: 10.12659/MSM.951157
- # MeSH: ['Humans', 'Deep Learning', 'Exophthalmos', 'Magnetic Resonance Imaging']
- # Abstract: BACKGROUND Proptosis is a common manifestation of orbital disease...
- # PMID=36328784 (...)
- # Abstract: [OBJECTIVES] Physical inactivity and sedentary behaviour... ← structured
- ```
-
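In the EFetch loop, `year` is only assigned when a `PubDate` element exists. A record without one would reuse the value from the previous iteration, or raise `NameError` if it is the first record. One possible way to make that step total is a small helper that always returns a string. `extract_year` is a hypothetical name, and the XML fragments are made up for illustration rather than taken from real records.

```python
import xml.etree.ElementTree as ET


def extract_year(journal_el):
    """Return the 4-character publication year, or '' if unavailable."""
    if journal_el is None:
        return ''
    pub_date = journal_el.find('.//PubDate')
    if pub_date is None:
        return ''
    year_el = pub_date.find('Year')
    if year_el is not None and year_el.text:
        return year_el.text
    # Fallback for old/seasonal dates, e.g. '1998 Jul-Aug'
    medline_el = pub_date.find('MedlineDate')
    if medline_el is not None and medline_el.text:
        return medline_el.text[:4]
    return ''


# Made-up fragments covering the three cases.
with_year = ET.fromstring(
    '<Journal><JournalIssue><PubDate><Year>2024</Year></PubDate>'
    '</JournalIssue></Journal>')
with_medline = ET.fromstring(
    '<Journal><JournalIssue><PubDate><MedlineDate>1998 Jul-Aug</MedlineDate>'
    '</PubDate></JournalIssue></Journal>')
no_date = ET.fromstring('<Journal><Title>Example</Title></Journal>')

print(extract_year(with_year), extract_year(with_medline),
      repr(extract_year(no_date)))
```

In the loop this would replace the `if pub_date is not None:` block with `year = extract_year(journal)`.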
- ### Large result sets — usehistory + WebEnv
-
- When `count` exceeds `retmax` (max 10 000), use server-side history to paginate EFetch without re-running ESearch on every page.
-
- ```python
- import json, xml.etree.ElementTree as ET
- from helpers import http_get
-
- # Step 1: ESearch with usehistory=y — NCBI holds result set on server
- search = json.loads(http_get(
-     "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
-     "?db=pubmed&term=CRISPR+gene+editing&retmax=0&retmode=json&usehistory=y"
- ))
- webenv = search['esearchresult']['webenv']  # server-side session token
- query_key = search['esearchresult']['querykey']  # result set ID within session
- total = int(search['esearchresult']['count'])
- print(f"Total: {total}, WebEnv: {webenv[:30]}..., query_key: {query_key}")
- # Confirmed output (2026-04-18):
- # Total: 24160, WebEnv: MCID_69e4203757db89391008d6f1..., query_key: 1
-
- # Step 2: EFetch pages using WebEnv (no re-searching)
- batch_size = 200
- for start in range(0, min(total, 1000), batch_size):  # cap at 1000 for demo
-     raw = http_get(
-         f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
-         f"?db=pubmed&query_key={query_key}&WebEnv={webenv}"
-         f"&retstart={start}&retmax={batch_size}&retmode=xml&rettype=abstract"
-     )
-     root = ET.fromstring(raw)
-     articles = root.findall('.//PubmedArticle')
-     print(f" Fetched {len(articles)} articles (start={start})")
-     # process articles here...
- ```
-
- ### EInfo — list available NCBI databases
-
- ```python
- import json
- from helpers import http_get
-
- data = json.loads(http_get(
-     "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?retmode=json"
- ))
- dbs = data['einforesult']['dblist']
- print(f"Total databases: {len(dbs)}")  # Confirmed: 39 (2026-04-18)
- print(dbs[:10])
- # ['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure',
- # 'genome', 'annotinfo', 'assembly', 'bioproject']
- ```
-
- Get PubMed-specific metadata (field list, link list):
-
- ```python
- import json
- from helpers import http_get
-
- data = json.loads(http_get(
-     "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pubmed&retmode=json"
- ))
- db_info = data['einforesult']['dbinfo'][0]
- print("DB name:", db_info['dbname'])
- print("Record count:", db_info['count'])  # total PubMed records
- link_names = [l['name'] for l in db_info.get('linklist', [])]
- print(f"Link types ({len(link_names)}):", link_names[:5])
- # Confirmed (2026-04-18):
- # DB name: pubmed
- # Record count: 37620453
- # Link types (48): ['pubmed_assembly', 'pubmed_bioproject', ...]
- ```
-
- ### ELink — cross-database linking
-
- ELink connects a PubMed record to associated data in other NCBI databases. The `pubmed_pubmed` "related articles" linkname relies on a similarity server that is intermittently unavailable (returns `"Couldn't resolve #exLinkSrv2, the address table is empty."`). Use the non-similarity links below instead.
-
- ```python
- import json
- from helpers import http_get
-
- # Link a PMID to its free full-text in PMC (if open access)
- # linkname=pubmed_pmc — may also hit the server outage; check error field
- data = json.loads(http_get(
-     "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"
-     "?dbfrom=pubmed&id=38325330&linkname=pubmed_pmc&retmode=json"
- ))
- error = data.get('ERROR', '')
- if error:
-     print("ELink error:", error)  # 'Couldn't resolve #exLinkSrv2...' — NCBI server issue
- else:
-     for ls in data.get('linksets', []):
-         for lsdb in ls.get('linksetdbs', []):
-             print(lsdb['linkname'], "→", lsdb['links'][:5])
- ```
-
- Available ELink linknames from pubmed (48 total):
-
- | linkname | Target |
- |---|---|
- | `pubmed_pmc` | Free full text in PMC |
- | `pubmed_pubmed_citedin` | Articles citing this paper |
- | `pubmed_pubmed_refs` | References cited by this paper |
- | `pubmed_gene` | Related Gene records |
- | `pubmed_clinvar` | Clinical variants associated with publication |
- | `pubmed_gds` | Related GEO datasets |
-
- **Practical alternative**: If ELink is down, extract DOI from EFetch/ESummary and use `https://doi.org/{doi}` directly for the full-text link.
-
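The DOI fallback described above can be sketched as a small function over the `articleids` list of an ESummary record. `fallback_link` is a hypothetical name. Falling back to the PubMed page when no DOI is present is an assumption of this sketch, not something the file prescribes.

```python
def fallback_link(pmid, article_ids):
    """Return a DOI link if the record has a DOI, else the PubMed page.

    `article_ids` is the list of {'idtype': ..., 'value': ...} dicts found
    under 'articleids' in an ESummary record. Hypothetical helper.
    """
    ids = {x['idtype']: x['value'] for x in article_ids}
    doi = ids.get('doi')
    if doi:
        return f"https://doi.org/{doi}"
    # Assumption of this sketch: the PubMed page is the last resort.
    return f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"


# PMID and DOI are the ones quoted in the ESummary example in this file.
print(fallback_link("41999029",
                    [{'idtype': 'doi', 'value': '10.12659/MSM.951157'}]))
print(fallback_link("41999029", []))
```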
- ## URL and parameter reference
-
- ### E-utilities base URLs
-
- ```
- https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi    # search → PMIDs
- https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi   # PMIDs → JSON summary
- https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi     # PMIDs → full XML
- https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi      # cross-db links
- https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi      # DB metadata
- ```
-
- ### ESearch parameters
-
- | Parameter | Values | Notes |
- |---|---|---|
- | `db` | `pubmed` | Always `pubmed` for PubMed |
- | `term` | query string | Supports field tags like `[Author]`, `[Title]`, `[MeSH Terms]` |
- | `retmax` | integer, max 10000 | Results returned per call |
- | `retmode` | `json` | JSON output |
- | `sort` | `pub+date`, `Author`, `JournalName` | Default is relevance |
- | `datetype` | `pdat` (pub), `edat` (entrez), `mdat` (modified) | |
- | `mindate`, `maxdate` | `YYYY/MM/DD` or `YYYY` | Requires `datetype` |
- | `usehistory` | `y` | Store results on server; returns `webenv` + `querykey` |
-
- ### EFetch parameters
-
- | Parameter | Values | Notes |
- |---|---|---|
- | `db` | `pubmed` | |
- | `id` | `38000000,37999999` | Comma-separated PMIDs; max ~200 per call |
- | `query_key` + `WebEnv` | from ESearch `usehistory=y` | Alternative to `id` for large sets |
- | `retstart` | integer | Offset for pagination with WebEnv |
- | `retmax` | integer, max 10000 | Batch size |
- | `retmode` | `xml` | Use XML for EFetch (JSON not available for full records) |
- | `rettype` | `abstract` | Returns abstract + core metadata |
-
368
- ### PubMed article URL construction
369
-
370
- ```python
371
- pmid = "41999029"
372
- pubmed_url = f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"
373
- doi = "10.12659/MSM.951157"
374
- doi_url = f"https://doi.org/{doi}" # resolves to publisher page
375
- pmc_id = "PMC9876543" # from ESummary articleids
376
- pmc_url = f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmc_id}/"
377
- ```
378
-
379
- ## Gotchas
380
-
381
- - **`count` is a string, not int.** `search['esearchresult']['count']` returns `'24160'`, not `24160`. Always cast with `int()` before arithmetic.
382
-
383
- - **EFetch retmode must be `xml` for full records.** Unlike ESearch and ESummary, EFetch with `retmode=json` returns flat text (the MEDLINE citation text format), not structured JSON. Parse EFetch responses with `xml.etree.ElementTree`.
384
-
385
- - **`ArticleTitle` may contain embedded XML tags.** Titles with italics (`<i>Staphylococcus aureus</i>`) or math (`<sub>2</sub>`) are mixed-content nodes. Always use `''.join(el.itertext())` instead of `el.text`, which silently drops everything after the first child tag.
386
-
387
- - **~15% of articles have no abstract.** `article.find('Abstract')` returns `None` for short communications, editorials, letters, and older records. Always guard with `if abstract_el is not None`.
388
-
389
- - **Author names vary in structure, so always handle `CollectiveName`.** Consortium papers list a group name (`'GeKeR Study Group'`, `'Breast Cancer Association Consortium'`) under `<CollectiveName>` instead of `<LastName>/<ForeName>`. Individual authors have `<LastName>` + optionally `<ForeName>` and `<Initials>`. Check `CollectiveName` first; reading `LastName.text` without that check raises `AttributeError`, because `find('LastName')` returns `None` for group authors.
390
- - Confirmed real examples (2026-04-18): PMID 37586835 (`GeKeR Study Group`), PMID 36328784 (`Breast Cancer Association Consortium`)
391
-
392
- - **PubDate has three possible structures.** Most articles have `<Year>` + optional `<Month>` + optional `<Day>`. Seasonal journals use `<Season>` (e.g. `Jul-Aug`, `Oct-Dec`) instead of `<Month>`. A minority of older records use `<MedlineDate>` (e.g. `1995 Fall`) with no `<Year>`. Safe extraction pattern:
393
- ```python
394
- pub_date = journal.find('.//PubDate')
395
- year_el = pub_date.find('Year') if pub_date is not None else None
396
- medline_el = pub_date.find('MedlineDate') if pub_date is not None else None
397
- year = (year_el.text if year_el is not None
398
- else medline_el.text[:4] if medline_el is not None else '')
399
- ```
400
-
401
- - **Batch EFetch: keep IDs to ~200 per call.** The API accepts comma-separated IDs in `id=`, but very large batches (500+) occasionally time out or return truncated XML. For >200 articles, iterate in chunks or use `usehistory` + `WebEnv`.
402
-
403
- - **ELink `pubmed_pubmed` (related articles) is unreliable.** The NCBI similarity server returns `"Couldn't resolve #exLinkSrv2, the address table is empty."`; the failure was still recurring as of 2026-04-18 and is a server-side issue, not a rate-limit error. Other linknames (`pubmed_gene`, `pubmed_pmc`, `pubmed_clinvar`) fail with the same error. Use the DOI as a fallback link to publisher full text.
404
-
405
- - **Rate limits: 3 req/s without API key, 10 req/s with free key.** Exceeding 3 req/s returns HTTP 429. Insert `time.sleep(0.34)` between sequential calls without a key. Get a free API key at https://www.ncbi.nlm.nih.gov/account/ and append `&api_key=YOUR_KEY` to all URLs.
406
-
407
- - **`retmax` upper bound is 10 000 for ESearch.** To retrieve more than 10 000 PMIDs for a search, use `usehistory=y` and page through EFetch with `retstart` offsets. EFetch itself also accepts `retmax` up to 10 000 per call.
408
-
409
- - **`retmax=0` in ESearch returns only the count, not IDs — useful for counting.** Combine with `usehistory=y` to store the result for later paging without fetching IDs upfront:
410
- ```python
411
- search = json.loads(http_get(
412
- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
413
- "?db=pubmed&term=cancer&retmax=0&retmode=json&usehistory=y"
414
- ))
415
- total = int(search['esearchresult']['count']) # e.g. 4800000
416
- webenv = search['esearchresult']['webenv']
417
- ```
418
-
419
- - **ESummary `authors` field uses abbreviated names (`Last I`), not full names.** Use EFetch XML to get `ForeName` (e.g. `'Kesimal, Uğur'` vs ESummary `'Kesimal U'`). For bulk tasks where full names are not needed, ESummary is faster.
420
-
421
- - **`querytranslation` shows how NCBI interpreted your term.** The ESearch response includes `esearchresult.querytranslation` — a MeSH-expanded version of your query. Inspect it to verify the search matched what you intended.
1
+ # PubMed / NCBI — Scraping & Data Extraction
2
+
3
+ `https://pubmed.ncbi.nlm.nih.gov` — 37 M+ biomedical citations. **Never use the browser for PubMed.** All data is reachable via `http_get` using the NCBI E-utilities REST API. No API key required; a free key raises the rate limit from 3 to 10 req/s.
4
+
5
+ ## Do this first
6
+
7
+ **ESearch → ESummary is the fastest pipeline for most tasks — two calls, JSON responses, no XML parsing.**
8
+
9
+ ```python
10
+ import json
11
+ from helpers import http_get
12
+
13
+ # Step 1: search → get PMIDs
14
+ search = json.loads(http_get(
15
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
16
+ "?db=pubmed&term=deep+learning+radiology&retmax=10&retmode=json"
17
+ ))
18
+ pmids = search['esearchresult']['idlist'] # e.g. ['41999029', '41998456', ...]
19
+ count = int(search['esearchresult']['count'])  # total hits across all pages; the API returns a string
20
+
21
+ # Step 2: fetch lightweight metadata for all PMIDs in one call
22
+ summary = json.loads(http_get(
23
+ f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
24
+ f"?db=pubmed&id={','.join(pmids)}&retmode=json"
25
+ ))
26
+ result = summary['result']
27
+ for uid in result['uids']:
28
+     art = result[uid]
29
+     print(uid, art['pubdate'], art['source'])
30
+     print(" ", art['title'][:80])
31
+     print(" authors:", [a['name'] for a in art['authors'][:3]])
32
+ # Confirmed output (2026-04-18):
33
+ # 41999029 2026 Apr 18 Med Sci Monit
34
+ # Use of Deep Learning Models in the Diagnosis of Proptosis Through Orbi
35
+ # authors: ['Kesimal U', 'Akkaya HE', 'Polat Ö']
36
+ # 41998456 2026 Apr 17 Sci Rep
37
+ # ...
38
+ ```
39
+
40
+ Use **EFetch XML** when you need: full abstract text, MeSH terms, complete author names (not just "Last I"), structured abstract labels, or the DOI from within the article record.
41
+
42
+ ## Common workflows
43
+
44
+ ### Search PubMed (ESearch)
45
+
46
+ ```python
47
+ import json
48
+ from helpers import http_get
49
+
50
+ data = json.loads(http_get(
51
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
52
+ "?db=pubmed"
53
+ "&term=large+language+models+clinical"
54
+ "&retmax=5"
55
+ "&retmode=json"
56
+ "&sort=pub+date" # newest first; default is relevance
57
+ "&datetype=pdat" # filter by publication date
58
+ "&mindate=2024/01/01&maxdate=2024/12/31" # YYYY/MM/DD format
59
+ ))
60
+ result = data['esearchresult']
61
+ print("Total hits:", result['count']) # '24160' — note: string, not int
62
+ print("PMIDs:", result['idlist'])
63
+ print("Query translation:", result['querytranslation'])
64
+ # Confirmed output (2026-04-18):
65
+ # Total hits: 24160
66
+ # PMIDs: ['41996895', '41996722', '41996006', '41995888', '41995759']
67
+ # Query translation: "large language models"[MeSH Terms] OR ...
68
+ ```
69
+
70
+ #### ESearch field tags (append to term)
71
+
72
+ ```
73
+ machine learning[MeSH Terms] MeSH controlled vocabulary
74
+ Hinton GE[Author] author last + initials
75
+ attention is all you need[Title] title words
76
+ Nature[Journal] journal name
77
+ 2024[pdat] publication year
78
+ ```
79
+
80
+ Boolean operators: `AND`, `OR`, `NOT`. Phrase search: `"exact phrase"[Title]`.
81
+
82
+ #### Sort options (`sort=`)
83
+
84
+ | Value | Effect |
85
+ |---|---|
86
+ | *(omit)* | Relevance (default) |
87
+ | `pub+date` | Most recent publication first |
88
+ | `Author` | First author alphabetical |
89
+ | `JournalName` | Journal alphabetical |
90
+
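The field tags, Boolean operators, and sort values above all end up in one query string, so it can help to let the standard library do the escaping. The following is a minimal sketch, not part of the original skill file: `build_esearch_url` is a hypothetical helper name, and it assumes NCBI accepts percent-encoded brackets, quotes, and slashes (standard URL encoding). Note that `sort=pub+date` is simply the encoded form of `pub date`.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=20, sort=None, mindate=None, maxdate=None,
                      datetype="pdat"):
    """Hypothetical helper: assemble an ESearch URL with proper escaping."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    if sort:
        # Pass the human-readable value ("pub date"); urlencode emits pub+date.
        params["sort"] = sort
    if mindate and maxdate:
        # mindate/maxdate only take effect together with datetype.
        params.update({"datetype": datetype, "mindate": mindate, "maxdate": maxdate})
    return f"{ESEARCH}?{urlencode(params)}"


url = build_esearch_url(
    '"large language models"[Title] AND 2024[pdat]', retmax=5, sort="pub date"
)
print(url)
```

The resulting string can be passed to `http_get` exactly like the hand-written URLs in the examples above.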
91
+ ### Lightweight metadata — ESummary (JSON, no XML)
92
+
93
+ ```python
94
+ import json
95
+ from helpers import http_get
96
+
97
+ data = json.loads(http_get(
98
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
99
+ "?db=pubmed&id=41999029,41998456,41997837&retmode=json"
100
+ ))
101
+ result = data['result']
102
+ for uid in result['uids']:
103
+     art = result[uid]
104
+     # Key fields available:
105
+     title = art['title']  # full title string
106
+     source = art['source']  # abbreviated journal name
107
+     fulljournalname = art['fulljournalname']
108
+     pubdate = art['pubdate']  # e.g. '2026 Apr 18'
109
+     epubdate = art['epubdate']  # e-pub ahead of print date (may be empty)
110
+     authors = art['authors']  # list of {'name': 'Last I', 'authtype': ...}
111
+     volume = art['volume']
112
+     issue = art['issue']
113
+     pages = art['pages']
114
+     pubtype = art['pubtype']  # list: ['Journal Article', 'Review', ...]
115
+     # Extract DOI from elocationid or articleids:
116
+     doi_field = art['elocationid']  # e.g. 'doi: 10.12659/MSM.951157'
117
+     article_ids = {x['idtype']: x['value'] for x in art['articleids']}
118
+     doi = article_ids.get('doi')
119
+     pmc_id = article_ids.get('pmc')  # PMC ID if open access
120
+     print(uid, pubdate, source)
121
+     print(" ", title[:70])
122
+     print(" doi:", doi, "| pmc:", pmc_id)
123
+ # Confirmed output (2026-04-18):
124
+ # 41999029 2026 Apr 18 Med Sci Monit
125
+ # Use of Deep Learning Models in the Diagnosis of Proptosis Through Orbi
126
+ # doi: 10.12659/MSM.951157 | pmc: None
127
+ ```
128
+
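The ESummary example reads the DOI from `articleids` and mentions `elocationid` as a second source. A small sketch of combining the two follows; `extract_ids` is a hypothetical helper, the sample record is hand-written to mirror the fields shown above, and the `doi:` prefix handling assumes `elocationid` has the form quoted in the example.

```python
def extract_ids(art):
    """Return (doi, pmc_id) from one ESummary record dict; either may be None."""
    ids = {x.get("idtype"): x.get("value") for x in art.get("articleids", [])}
    doi = ids.get("doi")
    if not doi:
        # Fallback: elocationid looks like 'doi: 10.12659/MSM.951157'.
        eloc = art.get("elocationid", "")
        if eloc.lower().startswith("doi:"):
            doi = eloc[4:].strip() or None
    return doi, ids.get("pmc")


# Hand-written sample shaped like an ESummary record (not fetched live).
sample = {
    "elocationid": "doi: 10.12659/MSM.951157",
    "articleids": [{"idtype": "pubmed", "value": "41999029"}],
}
print(extract_ids(sample))  # ('10.12659/MSM.951157', None)
```

Using `.get()` throughout keeps the helper from raising on records that lack one of the fields.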
129
+ ### Full article metadata — EFetch XML
130
+
131
+ Use this for full abstracts, complete author names, MeSH terms, and structured abstract sections.
132
+
133
+ ```python
134
+ import xml.etree.ElementTree as ET
135
+ from helpers import http_get
136
+
137
+ raw = http_get(
138
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
139
+ "?db=pubmed&id=41999029,36328784&retmode=xml&rettype=abstract"
140
+ )
141
+ root = ET.fromstring(raw)
142
+
143
+ for art in root.findall('.//PubmedArticle'):
144
+     mc = art.find('MedlineCitation')
145
+     pmid = mc.find('PMID').text
146
+     article = mc.find('Article')
147
+
148
+     # Title: use itertext() to handle embedded tags like <i>, <sub>
149
+     title = ''.join(article.find('ArticleTitle').itertext()).strip()
150
+
151
+     # Abstract: plain or structured (BACKGROUND / METHODS / RESULTS / CONCLUSION)
152
+     abstract_el = article.find('Abstract')
153
+     if abstract_el is not None:
154
+         sections = []
155
+         for t in abstract_el.findall('AbstractText'):
156
+             label = t.get('Label', '')  # e.g. 'BACKGROUND', 'METHODS'
157
+             text = ''.join(t.itertext()).strip()
158
+             sections.append(f"[{label}] {text}" if label else text)
159
+         abstract = ' '.join(sections)
160
+     else:
161
+         abstract = ''  # ~15% of articles have no abstract
162
+
163
+     # Journal + year
164
+     journal = article.find('Journal')
165
+     j_title = journal.find('Title').text if journal is not None else ''
166
+     pub_date = journal.find('.//PubDate') if journal is not None else None
167
+     # <Year> is usual; <MedlineDate> is the fallback for old/seasonal dates
168
+     year_el = pub_date.find('Year') if pub_date is not None else None
169
+     medline_el = pub_date.find('MedlineDate') if pub_date is not None else None
170
+     # Falls back to '' so `year` is always bound, even when PubDate is missing
171
+     year = (year_el.text if year_el is not None
172
+             else medline_el.text[:4] if medline_el is not None else '')
173
+
174
+     # DOI
175
+     doi_el = next(
176
+         (e for e in article.findall('ELocationID') if e.get('EIdType') == 'doi'),
177
+         None
178
+     )
179
+     doi = doi_el.text if doi_el is not None else ''
180
+
181
+     # Authors: handle CollectiveName (consortium/group authors)
182
+     author_list = article.find('AuthorList')
183
+     authors = []
184
+     if author_list is not None:
185
+         for a in author_list.findall('Author'):
186
+             collective = a.find('CollectiveName')
187
+             last = a.find('LastName')
188
+             fore = a.find('ForeName')
189
+             initials = a.find('Initials')
190
+             if collective is not None:
191
+                 authors.append(collective.text)
192
+             elif last is not None:
193
+                 full = last.text
194
+                 if fore is not None:
195
+                     full += f", {fore.text}"
196
+                 authors.append(full)
197
+
198
+     # MeSH controlled vocabulary terms
199
+     mesh_list = mc.find('MeshHeadingList')
200
+     mesh_terms = []
201
+     if mesh_list is not None:
202
+         mesh_terms = [
203
+             mh.find('DescriptorName').text
204
+             for mh in mesh_list.findall('MeshHeading')
205
+             if mh.find('DescriptorName') is not None
206
+         ]
207
+
208
+     print(f"PMID={pmid} ({year}) {j_title}")
209
+     print(f" Title: {title[:70]}")
210
+     print(f" Authors: {authors[:3]}")
211
+     print(f" DOI: {doi}")
212
+     print(f" MeSH: {mesh_terms[:4]}")
213
+     print(f" Abstract: {abstract[:120]}")
214
+ # Confirmed output (2026-04-18):
215
+ # PMID=41999029 (2026) Medical science monitor : international medical...
216
+ # Title: Use of Deep Learning Models in the Diagnosis of Proptosis Thro
217
+ # Authors: ['Kesimal, Uğur', 'Akkaya, Habip Eser', 'Polat, Önder']
218
+ # DOI: 10.12659/MSM.951157
219
+ # MeSH: ['Humans', 'Deep Learning', 'Exophthalmos', 'Magnetic Resonance Imaging']
220
+ # Abstract: BACKGROUND Proptosis is a common manifestation of orbital disease...
221
+ # PMID=36328784 (...)
222
+ # Abstract: [OBJECTIVES] Physical inactivity and sedentary behaviour... ← structured
223
+ ```
224
+
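The EFetch example relies on `itertext()` for titles and abstract sections. The offline sketch below shows why, using a hand-written XML fragment (an illustration, not a real PubMed record): `.text` stops at the first child element, while `itertext()` walks the whole mixed-content node.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only; element names follow the
# PubMed XML used in the example above.
fragment = (
    "<Article>"
    "<ArticleTitle>Biofilms of <i>Staphylococcus aureus</i> under "
    "CO<sub>2</sub> stress.</ArticleTitle>"
    "<Abstract>"
    "<AbstractText Label='BACKGROUND'>Short background.</AbstractText>"
    "<AbstractText Label='RESULTS'>Growth fell by <b>half</b>.</AbstractText>"
    "</Abstract>"
    "</Article>"
)
article = ET.fromstring(fragment)
title_el = article.find("ArticleTitle")

naive_title = title_el.text                           # text before the first child tag only
full_title = "".join(title_el.itertext()).strip()     # all text, tags flattened

sections = [
    f"[{t.get('Label')}] {''.join(t.itertext()).strip()}"
    for t in article.findall("Abstract/AbstractText")
]

print(repr(naive_title))
print(full_title)
print(sections)
```

Here `naive_title` is just `'Biofilms of '`, which is the silent truncation the title handling in the loop is written to avoid.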
225
+ ### Large result sets — usehistory + WebEnv
226
+
227
+ When `count` exceeds `retmax` (max 10 000), use server-side history to paginate EFetch without re-running ESearch on every page.
228
+
229
+ ```python
230
+ import json, xml.etree.ElementTree as ET
231
+ from helpers import http_get
232
+
233
+ # Step 1: ESearch with usehistory=y — NCBI holds result set on server
234
+ search = json.loads(http_get(
235
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
236
+ "?db=pubmed&term=CRISPR+gene+editing&retmax=0&retmode=json&usehistory=y"
237
+ ))
238
+ webenv = search['esearchresult']['webenv'] # server-side session token
239
+ query_key = search['esearchresult']['querykey'] # result set ID within session
240
+ total = int(search['esearchresult']['count'])
241
+ print(f"Total: {total}, WebEnv: {webenv[:30]}..., query_key: {query_key}")
242
+ # Confirmed output (2026-04-18):
243
+ # Total: 24160, WebEnv: MCID_69e4203757db89391008d6f1..., query_key: 1
244
+
245
+ # Step 2: EFetch pages using WebEnv (no re-searching)
246
+ batch_size = 200
247
+ for start in range(0, min(total, 1000), batch_size):  # cap at 1000 for demo
248
+     raw = http_get(
249
+         f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
250
+         f"?db=pubmed&query_key={query_key}&WebEnv={webenv}"
251
+         f"&retstart={start}&retmax={batch_size}&retmode=xml&rettype=abstract"
252
+     )
253
+     root = ET.fromstring(raw)
254
+     articles = root.findall('.//PubmedArticle')
255
+     print(f" Fetched {len(articles)} articles (start={start})")
256
+     # process articles here...
257
+ ```
258
+
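The paging arithmetic in the loop above, and the alternative of chunking an explicit PMID list, can be isolated into two small functions. This is a sketch under the batch size of 200 used in this document; `chunked` and `page_offsets` are hypothetical helper names, and nothing here touches the network.

```python
def chunked(ids, size=200):
    """Yield successive slices of at most `size` IDs for the id= parameter."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]


def page_offsets(total, batch_size=200, limit=None):
    """Return the retstart values needed to cover `total` records (optionally capped)."""
    end = total if limit is None else min(total, limit)
    return list(range(0, end, batch_size))


pmids = [str(n) for n in range(450)]        # placeholder IDs, not real PMIDs
batches = [",".join(c) for c in chunked(pmids)]
print(len(batches))                          # 3
print(page_offsets(24160, 200, limit=1000))  # [0, 200, 400, 600, 800]
```

Each string in `batches` is ready to drop into `id=...`, while the offsets feed `retstart` when paging through a `WebEnv` result set.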
259
+ ### EInfo — list available NCBI databases
260
+
261
+ ```python
262
+ import json
263
+ from helpers import http_get
264
+
265
+ data = json.loads(http_get(
266
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?retmode=json"
267
+ ))
268
+ dbs = data['einforesult']['dblist']
269
+ print(f"Total databases: {len(dbs)}") # Confirmed: 39 (2026-04-18)
270
+ print(dbs[:10])
271
+ # ['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure',
272
+ # 'genome', 'annotinfo', 'assembly', 'bioproject']
273
+ ```
274
+
275
+ Get PubMed-specific metadata (field list, link list):
276
+
277
+ ```python
278
+ import json
279
+ from helpers import http_get
280
+
281
+ data = json.loads(http_get(
282
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pubmed&retmode=json"
283
+ ))
284
+ db_info = data['einforesult']['dbinfo'][0]
285
+ print("DB name:", db_info['dbname'])
286
+ print("Record count:", db_info['count']) # total PubMed records
287
+ link_names = [l['name'] for l in db_info.get('linklist', [])]
288
+ print(f"Link types ({len(link_names)}):", link_names[:5])
289
+ # Confirmed (2026-04-18):
290
+ # DB name: pubmed
291
+ # Record count: 37620453
292
+ # Link types (48): ['pubmed_assembly', 'pubmed_bioproject', ...]
293
+ ```
294
+
295
+ ### ELink — cross-database linking
296
+
297
+ ELink connects a PubMed record to associated data in other NCBI databases. The `pubmed_pubmed` "related articles" linkname relies on a similarity server that is intermittently unavailable (returns `"Couldn't resolve #exLinkSrv2, the address table is empty."`). Prefer the non-similarity links below, but always check the `ERROR` field, because they can hit the same outage.
298
+
299
+ ```python
300
+ import json
301
+ from helpers import http_get
302
+
303
+ # Link a PMID to its free full-text in PMC (if open access)
304
+ # linkname=pubmed_pmc — may also hit the server outage; check error field
305
+ data = json.loads(http_get(
306
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"
307
+ "?dbfrom=pubmed&id=38325330&linkname=pubmed_pmc&retmode=json"
308
+ ))
309
+ error = data.get('ERROR', '')
310
+ if error:
311
+     print("ELink error:", error)  # 'Couldn't resolve #exLinkSrv2...' is an NCBI server issue
312
+ else:
313
+     for ls in data.get('linksets', []):
314
+         for lsdb in ls.get('linksetdbs', []):
315
+             print(lsdb['linkname'], "→", lsdb['links'][:5])
316
+ ```
317
+
318
+ Available ELink linknames from pubmed (48 total):
319
+
320
+ | linkname | Target |
321
+ |---|---|
322
+ | `pubmed_pmc` | Free full text in PMC |
323
+ | `pubmed_pubmed_citedin` | Articles citing this paper |
324
+ | `pubmed_pubmed_refs` | References cited by this paper |
325
+ | `pubmed_gene` | Related Gene records |
326
+ | `pubmed_clinvar` | Clinical variants associated with publication |
327
+ | `pubmed_gds` | Related GEO datasets |
328
+
329
+ **Practical alternative**: If ELink is down, extract DOI from EFetch/ESummary and use `https://doi.org/{doi}` directly for the full-text link.
330
+
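The DOI fallback described above can be folded into one function that takes an already-parsed ELink response. This is a sketch only: `full_text_url` is a hypothetical helper, the two sample responses are hand-written to match the keys used in the ELink example (`ERROR`, `linksets`, `linksetdbs`, `links`), and it assumes `pubmed_pmc` links are numeric PMC UIDs that take a `PMC` prefix in article URLs.

```python
def full_text_url(elink_json, doi=None):
    """Prefer a PMC URL from an ELink response; otherwise fall back to the DOI."""
    if not elink_json.get("ERROR"):
        for ls in elink_json.get("linksets", []):
            for lsdb in ls.get("linksetdbs", []):
                if lsdb.get("linkname") == "pubmed_pmc" and lsdb.get("links"):
                    # Assumption: the link value is the numeric part of the PMCID.
                    return f"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC{lsdb['links'][0]}/"
    return f"https://doi.org/{doi}" if doi else None


# Hand-written sample responses (not captured from the live API).
ok = {"linksets": [{"linksetdbs": [{"linkname": "pubmed_pmc", "links": ["1234567"]}]}]}
down = {"ERROR": "Couldn't resolve #exLinkSrv2, the address table is empty."}

print(full_text_url(ok, "10.12659/MSM.951157"))
print(full_text_url(down, "10.12659/MSM.951157"))
```

Returning `None` when neither source is available leaves the caller to decide whether to link to the PubMed page instead.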
331
+ ## URL and parameter reference
332
+
333
+ ### E-utilities base URLs
334
+
335
+ ```
336
+ https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi # search → PMIDs
337
+ https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi # PMIDs → JSON summary
338
+ https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi # PMIDs → full XML
339
+ https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi # cross-db links
340
+ https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi # DB metadata
341
+ ```
342
+
343
+ ### ESearch parameters
344
+
345
+ | Parameter | Values | Notes |
346
+ |---|---|---|
347
+ | `db` | `pubmed` | Always `pubmed` for PubMed |
348
+ | `term` | query string | Supports field tags like `[Author]`, `[Title]`, `[MeSH Terms]` |
349
+ | `retmax` | integer, max 10000 | Results returned per call |
350
+ | `retmode` | `json` | JSON output |
351
+ | `sort` | `pub+date`, `Author`, `JournalName` | Default is relevance |
352
+ | `datetype` | `pdat` (pub), `edat` (entrez), `mdat` (modified) | |
353
+ | `mindate`, `maxdate` | `YYYY/MM/DD` or `YYYY` | Requires `datetype` |
354
+ | `usehistory` | `y` | Store results on server; returns `webenv` + `querykey` |
355
+
356
+ ### EFetch parameters
357
+
358
+ | Parameter | Values | Notes |
359
+ |---|---|---|
360
+ | `db` | `pubmed` | |
361
+ | `id` | `38000000,37999999` | Comma-separated PMIDs; max ~200 per call |
362
+ | `query_key` + `WebEnv` | from ESearch `usehistory=y` | Alternative to `id` for large sets |
363
+ | `retstart` | integer | Offset for pagination with WebEnv |
364
+ | `retmax` | integer, max 10000 | Batch size |
365
+ | `retmode` | `xml` | Use XML for EFetch (JSON not available for full records) |
366
+ | `rettype` | `abstract` | Returns abstract + core metadata |
367
+
368
+ ### PubMed article URL construction
369
+
370
+ ```python
371
+ pmid = "41999029"
372
+ pubmed_url = f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"
373
+ doi = "10.12659/MSM.951157"
374
+ doi_url = f"https://doi.org/{doi}" # resolves to publisher page
375
+ pmc_id = "PMC9876543" # from ESummary articleids
376
+ pmc_url = f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmc_id}/"
377
+ ```
378
+
379
+ ## Gotchas
380
+
381
+ - **`count` is a string, not int.** `search['esearchresult']['count']` returns `'24160'`, not `24160`. Always cast with `int()` before arithmetic.
382
+
383
+ - **EFetch retmode must be `xml` for full records.** Unlike ESearch and ESummary, EFetch with `retmode=json` returns flat text (the MEDLINE citation text format), not structured JSON. Parse EFetch responses with `xml.etree.ElementTree`.
384
+
385
+ - **`ArticleTitle` may contain embedded XML tags.** Titles with italics (`<i>Staphylococcus aureus</i>`) or math (`<sub>2</sub>`) are mixed-content nodes. Always use `''.join(el.itertext())` instead of `el.text`, which silently drops everything after the first child tag.
386
+
387
+ - **~15% of articles have no abstract.** `article.find('Abstract')` returns `None` for short communications, editorials, letters, and older records. Always guard with `if abstract_el is not None`.
388
+
389
+ - **Author names vary in structure, so always handle `CollectiveName`.** Consortium papers list a group name (`'GeKeR Study Group'`, `'Breast Cancer Association Consortium'`) under `<CollectiveName>` instead of `<LastName>/<ForeName>`. Individual authors have `<LastName>` + optionally `<ForeName>` and `<Initials>`. Check `CollectiveName` first; reading `LastName.text` without that check raises `AttributeError`, because `find('LastName')` returns `None` for group authors.
390
+ - Confirmed real examples (2026-04-18): PMID 37586835 (`GeKeR Study Group`), PMID 36328784 (`Breast Cancer Association Consortium`)
391
+
392
+ - **PubDate has three possible structures.** Most articles have `<Year>` + optional `<Month>` + optional `<Day>`. Seasonal journals use `<Season>` (e.g. `Jul-Aug`, `Oct-Dec`) instead of `<Month>`. A minority of older records use `<MedlineDate>` (e.g. `1995 Fall`) with no `<Year>`. Safe extraction pattern:
393
+ ```python
394
+ pub_date = journal.find('.//PubDate')
395
+ year_el = pub_date.find('Year') if pub_date is not None else None
396
+ medline_el = pub_date.find('MedlineDate') if pub_date is not None else None
397
+ year = (year_el.text if year_el is not None
398
+ else medline_el.text[:4] if medline_el is not None else '')
399
+ ```
400
+
401
+ - **Batch EFetch: keep IDs to ~200 per call.** The API accepts comma-separated IDs in `id=`, but very large batches (500+) occasionally time out or return truncated XML. For >200 articles, iterate in chunks or use `usehistory` + `WebEnv`.
402
+
403
+ - **ELink `pubmed_pubmed` (related articles) is unreliable.** The NCBI similarity server returns `"Couldn't resolve #exLinkSrv2, the address table is empty."`; the failure was still recurring as of 2026-04-18 and is a server-side issue, not a rate-limit error. Other linknames (`pubmed_gene`, `pubmed_pmc`, `pubmed_clinvar`) fail with the same error. Use the DOI as a fallback link to publisher full text.
404
+
405
+ - **Rate limits: 3 req/s without API key, 10 req/s with free key.** Exceeding 3 req/s returns HTTP 429. Insert `time.sleep(0.34)` between sequential calls without a key. Get a free API key at https://www.ncbi.nlm.nih.gov/account/ and append `&api_key=YOUR_KEY` to all URLs.
406
+
407
+ - **`retmax` upper bound is 10 000 for ESearch.** To retrieve more than 10 000 PMIDs for a search, use `usehistory=y` and page through EFetch with `retstart` offsets. EFetch itself also accepts `retmax` up to 10 000 per call.
408
+
409
+ - **`retmax=0` in ESearch returns only the count, not IDs — useful for counting.** Combine with `usehistory=y` to store the result for later paging without fetching IDs upfront:
410
+ ```python
411
+ search = json.loads(http_get(
412
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
413
+ "?db=pubmed&term=cancer&retmax=0&retmode=json&usehistory=y"
414
+ ))
415
+ total = int(search['esearchresult']['count']) # e.g. 4800000
416
+ webenv = search['esearchresult']['webenv']
417
+ ```
418
+
419
+ - **ESummary `authors` field uses abbreviated names (`Last I`), not full names.** Use EFetch XML to get `ForeName` (e.g. `'Kesimal, Uğur'` vs ESummary `'Kesimal U'`). For bulk tasks where full names are not needed, ESummary is faster.
420
+
421
+ - **`querytranslation` shows how NCBI interpreted your term.** The ESearch response includes `esearchresult.querytranslation` — a MeSH-expanded version of your query. Inspect it to verify the search matched what you intended.
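The rate-limit gotcha above suggests a fixed `time.sleep(0.34)` between calls. A slightly more flexible sketch is a throttle that only sleeps for whatever part of the interval has not already elapsed. `Throttle` is a hypothetical class, not part of the original helpers; the injectable `clock` and `sleep` arguments exist only to make it testable without waiting.

```python
import time


class Throttle:
    """Space out calls so that at most `rate` requests start per second."""

    def __init__(self, rate=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._last = None  # start time of the previous call, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            delay = self.min_interval - (now - self._last)
            if delay > 0:
                self._sleep(delay)
                now = self._clock()
        self._last = now


throttle = Throttle(rate=3)   # use rate=10 when sending an api_key
# Typical use (commented out so the sketch makes no network calls):
# for url in urls:
#     throttle.wait()
#     raw = http_get(url)
```

Because slow responses already consume part of the interval, this usually sleeps less than a fixed delay while still staying under the documented limit.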