@pencil-agent/nano-pencil 2.0.0-beta.8 → 2.0.0

Files changed (241)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/extensions-host/index.d.ts +1 -1
  7. package/dist/core/extensions-host/loader.js +1 -1
  8. package/dist/core/extensions-host/runner.d.ts +1 -0
  9. package/dist/core/extensions-host/runner.js +2 -2
  10. package/dist/core/extensions-host/types.d.ts +17 -22
  11. package/dist/core/lib/ai/src/types.d.ts +12 -2
  12. package/dist/core/persona/persona-manager.js +5 -2
  13. package/dist/core/runtime/agent-session.js +3 -3
  14. package/dist/core/runtime/extension-core-bindings.d.ts +1 -0
  15. package/dist/core/runtime/extension-core-bindings.js +2 -2
  16. package/dist/extensions/builtin/AGENT.md +115 -115
  17. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  18. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  19. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  20. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  91. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  92. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  93. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  94. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  95. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  96. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  97. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  98. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  99. package/dist/extensions/builtin/browser/browser.md +73 -73
  100. package/dist/extensions/builtin/browser/install.md +142 -142
  101. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  102. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  103. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  104. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  105. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  107. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  108. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  109. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  110. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  111. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  112. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  113. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  114. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  115. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  116. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  117. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  118. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  119. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  120. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  121. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  122. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  123. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  124. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  125. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  126. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  127. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  128. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  129. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  130. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  131. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  132. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  133. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  134. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  135. package/dist/extensions/builtin/goal/README.md +67 -67
  136. package/dist/extensions/builtin/goal/goal-controller.d.ts +39 -10
  137. package/dist/extensions/builtin/goal/goal-controller.js +1 -1
  138. package/dist/extensions/builtin/goal/goal-format.js +1 -1
  139. package/dist/extensions/builtin/goal/goal-prompts.d.ts +2 -0
  140. package/dist/extensions/builtin/goal/goal-prompts.js +5 -4
  141. package/dist/extensions/builtin/goal/goal-store.js +1 -1
  142. package/dist/extensions/builtin/goal/index.d.ts +1 -1
  143. package/dist/extensions/builtin/goal/index.js +10 -7
  144. package/dist/extensions/builtin/grub/README.md +112 -112
  145. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  146. package/dist/extensions/builtin/link-world/index.js +6 -6
  147. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  148. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  149. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  150. package/dist/extensions/builtin/link-world/{network-routing.md → network-routing/network-routing.md} +67 -67
  151. package/dist/extensions/builtin/loop/README.md +92 -92
  152. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  153. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  154. package/dist/extensions/builtin/plan/index.js +1 -1
  155. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  156. package/dist/extensions/builtin/sal/README.md +72 -72
  157. package/dist/extensions/builtin/security-audit/README.md +289 -289
  158. package/dist/extensions/builtin/task/task-store.d.ts +4 -0
  159. package/dist/extensions/builtin/task/task-store.js +1 -1
  160. package/dist/extensions/builtin/team/AGENT.md +112 -112
  161. package/dist/extensions/builtin/team/TESTING.md +299 -299
  162. package/dist/extensions/builtin/token-save/README.md +56 -56
  163. package/dist/extensions/optional/AGENT.md +10 -10
  164. package/dist/index.d.ts +5 -30
  165. package/dist/index.js +1 -1
  166. package/dist/models.d.ts +7 -0
  167. package/dist/models.js +1 -0
  168. package/dist/modes/interactive/components/footer.js +1 -1
  169. package/dist/modes/interactive/components/task-status-panel.d.ts +36 -0
  170. package/dist/modes/interactive/components/task-status-panel.js +1 -0
  171. package/dist/modes/interactive/controllers/stream-render-controller.d.ts +7 -0
  172. package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
  173. package/dist/modes/interactive/interactive-mode.js +40 -40
  174. package/dist/modes/interactive/state/interactive-state.d.ts +2 -0
  175. package/dist/modes/interactive/state/interactive-state.js +1 -1
  176. package/dist/modes/interactive/theme/dark.json +85 -85
  177. package/dist/modes/interactive/theme/light.json +84 -84
  178. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  179. package/dist/modes/interactive/theme/warm.json +81 -81
  180. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  181. package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
  182. package/dist/node_modules/@pencil-agent/ai/dist/providers/anthropic.js +2 -2
  183. package/dist/node_modules/@pencil-agent/ai/dist/providers/openai-completions.js +5 -5
  184. package/dist/node_modules/@pencil-agent/ai/dist/providers/openai-responses.js +1 -1
  185. package/dist/node_modules/@pencil-agent/ai/dist/stream.js +1 -1
  186. package/dist/packages/protocol/src/commands.d.ts +33 -0
  187. package/dist/packages/protocol/src/flags.d.ts +20 -0
  188. package/dist/packages/protocol/src/hooks.d.ts +17 -0
  189. package/dist/packages/protocol/src/hooks.js +0 -0
  190. package/dist/packages/{extension-sdk → protocol}/src/index.d.ts +7 -4
  191. package/dist/packages/protocol/src/index.js +1 -0
  192. package/dist/packages/{extension-sdk → protocol}/src/lifecycle.d.ts +15 -27
  193. package/dist/packages/protocol/src/lifecycle.js +0 -0
  194. package/dist/packages/{extension-sdk → protocol}/src/tools.d.ts +1 -1
  195. package/dist/packages/protocol/src/tools.js +0 -0
  196. package/dist/public-config.d.ts +12 -0
  197. package/dist/public-config.js +1 -0
  198. package/dist/runtime.d.ts +9 -0
  199. package/dist/runtime.js +1 -0
  200. package/dist/session-compaction.d.ts +7 -0
  201. package/dist/session-compaction.js +1 -0
  202. package/dist/session.d.ts +7 -0
  203. package/dist/session.js +1 -0
  204. package/dist/skills.d.ts +7 -0
  205. package/dist/skills.js +1 -0
  206. package/dist/tools.d.ts +7 -0
  207. package/dist/tools.js +1 -0
  208. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
  209. package/docs/SDK-TESTING.md +364 -0
  210. package/docs/codex-goal-command-impl.md +1055 -1055
  211. package/docs/codex-goal-vs-grub.md +500 -500
  212. package/docs/custom-provider.md +27 -27
  213. package/docs/extensions.md +27 -27
  214. package/docs/keybindings.md +27 -27
  215. package/docs/loop 重构完成总结.md +250 -250
  216. package/docs/loop 重构完成报告.md +122 -122
  217. package/docs/loop 重构方案.md +1222 -1222
  218. package/docs/loop 重构方案实现报告.md +158 -158
  219. package/docs/loop 重构方案对比分析.md +128 -128
  220. package/docs/loop 重构计划.md +320 -320
  221. package/docs/loop-usage-examples.md +214 -214
  222. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
  223. package/docs/models.md +27 -27
  224. package/docs/packages.md +27 -27
  225. package/docs/pi-design-philosophy.md +457 -457
  226. package/docs/planmode.md +1987 -1987
  227. package/docs/prompt-templates.md +27 -27
  228. package/docs/providers.md +27 -27
  229. package/docs/sdk.md +27 -27
  230. package/docs/skills.md +27 -27
  231. package/docs/startup-performance-optimization.md +301 -0
  232. package/docs/themes.md +27 -27
  233. package/docs/tui.md +27 -27
  234. package/docs/认知地图.md +47 -0
  235. package/package.json +190 -162
  236. package/dist/packages/extension-sdk/src/index.js +0 -1
  237. package/docs/cc-agent-design.md +0 -1297
  238. package/docs/cc-tui-design.md +0 -1333
  239. package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
  240. /package/dist/packages/{extension-sdk/src/lifecycle.js → protocol/src/commands.js} +0 -0
  241. /package/dist/packages/{extension-sdk/src/tools.js → protocol/src/flags.js} +0 -0
@@ -1,421 +1,421 @@
1
- # PubMed / NCBI — Scraping & Data Extraction
2
-
3
- `https://pubmed.ncbi.nlm.nih.gov` — 37 M+ biomedical citations. **Never use the browser for PubMed.** All data is reachable via `http_get` using the NCBI E-utilities REST API. No API key required; a free key raises the rate limit from 3 to 10 req/s.
4
-
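The rate limits quoted above (3 req/s without a key, 10 req/s with one) are easy to exceed in a loop. The following is a minimal sketch of a client-side throttle, not part of the package: the `Throttle` class and its parameters are hypothetical names introduced here for illustration.

```python
import time


class Throttle:
    """Space out calls so that at most `rate` of them start per second.

    Hypothetical helper for illustration; not part of the diffed package.
    """

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate   # minimum gap between two calls
        self.clock = clock           # injectable for testing
        self.sleep = sleep           # injectable for testing
        self.next_allowed = None     # earliest time the next call may start

    def wait(self):
        """Block until the next request is allowed, then reserve a slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.clock()       # re-read in case sleep overshot
        self.next_allowed = now + self.interval


# 3 req/s without an API key, 10 req/s with one (limits as stated above).
throttle = Throttle(rate=3)
```

Calling `throttle.wait()` immediately before each `http_get(...)` keeps a batch job under the limit without sleeping on the first request.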
5
- ## Do this first
6
-
7
- **ESearch → ESummary is the fastest pipeline for most tasks — two calls, JSON responses, no XML parsing.**
8
-
9
- ```python
10
- import json
11
- from helpers import http_get
12
-
13
- # Step 1: search → get PMIDs
14
- search = json.loads(http_get(
15
- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
16
- "?db=pubmed&term=deep+learning+radiology&retmax=10&retmode=json"
17
- ))
18
- pmids = search['esearchresult']['idlist'] # e.g. ['41999029', '41998456', ...]
19
- count = search['esearchresult']['count'] # total hits across all pages
20
-
21
- # Step 2: fetch lightweight metadata for all PMIDs in one call
22
- summary = json.loads(http_get(
23
- f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
24
- f"?db=pubmed&id={','.join(pmids)}&retmode=json"
25
- ))
26
- result = summary['result']
27
- for uid in result['uids']:
28
-     art = result[uid]
29
-     print(uid, art['pubdate'], art['source'])
30
-     print(" ", art['title'][:80])
31
-     print(" authors:", [a['name'] for a in art['authors'][:3]])
32
- # Confirmed output (2026-04-18):
33
- # 41999029 2026 Apr 18 Med Sci Monit
34
- # Use of Deep Learning Models in the Diagnosis of Proptosis Through Orbi
35
- # authors: ['Kesimal U', 'Akkaya HE', 'Polat Ö']
36
- # 41998456 2026 Apr 17 Sci Rep
37
- # ...
38
- ```
39
-
40
- Use **EFetch XML** when you need: full abstract text, MeSH terms, complete author names (not just "Last I"), structured abstract labels, or the DOI from within the article record.
41
-
42
- ## Common workflows
43
-
44
- ### Search PubMed (ESearch)
45
-
46
- ```python
47
- import json
48
- from helpers import http_get
49
-
50
- data = json.loads(http_get(
51
- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
52
- "?db=pubmed"
53
- "&term=large+language+models+clinical"
54
- "&retmax=5"
55
- "&retmode=json"
56
- "&sort=pub+date" # newest first; default is relevance
57
- "&datetype=pdat" # filter by publication date
58
- "&mindate=2024/01/01&maxdate=2024/12/31" # YYYY/MM/DD format
59
- ))
60
- result = data['esearchresult']
61
- print("Total hits:", result['count']) # '24160' — note: string, not int
62
- print("PMIDs:", result['idlist'])
63
- print("Query translation:", result['querytranslation'])
64
- # Confirmed output (2026-04-18):
65
- # Total hits: 24160
66
- # PMIDs: ['41996895', '41996722', '41996006', '41995888', '41995759']
67
- # Query translation: "large language models"[MeSH Terms] OR ...
68
- ```
69
-
70
- #### ESearch field tags (append to term)
71
-
72
- ```
73
- machine learning[MeSH Terms]        MeSH controlled vocabulary
74
- Hinton GE[Author]                   author last + initials
75
- attention is all you need[Title]    title words
76
- Nature[Journal]                     journal name
77
- 2024[pdat]                          publication year
78
- ```
79
-
80
- Boolean operators: `AND`, `OR`, `NOT`. Phrase search: `"exact phrase"[Title]`.
81
-
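Field tags and Boolean operators contain spaces, quotes, and brackets, so hand-built query strings are easy to get wrong. As a hedged sketch, the helpers below (`tagged` and `esearch_url` are hypothetical names, not part of the package) compose a term and let the standard library do the URL encoding.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged(text, field, phrase=False):
    """Return `text[field]`, quoting the text when an exact phrase is wanted."""
    body = f'"{text}"' if phrase else text
    return f"{body}[{field}]"


def esearch_url(term, retmax=10):
    """Build an ESearch URL; urlencode escapes spaces, quotes and brackets."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


# Combine clauses with a Boolean operator from the list above.
term = " AND ".join([
    tagged("machine learning", "MeSH Terms"),
    tagged("Nature", "Journal"),
    tagged("2024", "pdat"),
])
url = esearch_url(term, retmax=5)
```

The resulting `url` can be passed to `http_get` in place of the concatenated string literals used in the examples.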
82
- #### Sort options (`sort=`)
83
-
84
- | Value | Effect |
85
- |---|---|
86
- | *(omit)* | Relevance (default) |
87
- | `pub+date` | Most recent publication first |
88
- | `Author` | First author alphabetical |
89
- | `JournalName` | Journal alphabetical |
90
-
91
- ### Lightweight metadata — ESummary (JSON, no XML)
92
-
93
- ```python
94
- import json
95
- from helpers import http_get
96
-
97
- data = json.loads(http_get(
98
- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
99
- "?db=pubmed&id=41999029,41998456,41997837&retmode=json"
100
- ))
101
- result = data['result']
102
- for uid in result['uids']:
103
-     art = result[uid]
104
-     # Key fields available:
105
-     title = art['title'] # full title string
106
-     source = art['source'] # abbreviated journal name
107
-     fulljournalname = art['fulljournalname']
108
-     pubdate = art['pubdate'] # e.g. '2026 Apr 18'
109
-     epubdate = art['epubdate'] # e-pub ahead of print date (may be empty)
110
-     authors = art['authors'] # list of {'name': 'Last I', 'authtype': ...}
111
-     volume = art['volume']
112
-     issue = art['issue']
113
-     pages = art['pages']
114
-     pubtype = art['pubtype'] # list: ['Journal Article', 'Review', ...]
115
-     # Extract DOI from elocationid or articleids:
116
-     doi_field = art['elocationid'] # e.g. 'doi: 10.12659/MSM.951157'
117
-     article_ids = {x['idtype']: x['value'] for x in art['articleids']}
118
-     doi = article_ids.get('doi')
119
-     pmc_id = article_ids.get('pmc') # PMC ID if open access
120
-     print(uid, pubdate, source)
121
-     print(" ", title[:70])
122
-     print(" doi:", doi, "| pmc:", pmc_id)
123
- # Confirmed output (2026-04-18):
124
- # 41999029 2026 Apr 18 Med Sci Monit
125
- # Use of Deep Learning Models in the Diagnosis of Proptosis Through Orbi
126
- # doi: 10.12659/MSM.951157 | pmc: None
127
- ```
128
-
129
- ### Full article metadata — EFetch XML
130
-
131
- Use this for full abstracts, complete author names, MeSH terms, structured abstract sections.
132
-
133
- ```python
134
- import json, xml.etree.ElementTree as ET
135
- from helpers import http_get
136
-
137
- raw = http_get(
138
- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
139
- "?db=pubmed&id=41999029,36328784&retmode=xml&rettype=abstract"
140
- )
141
- root = ET.fromstring(raw)
142
-
143
- for art in root.findall('.//PubmedArticle'):
144
-     mc = art.find('MedlineCitation')
145
-     pmid = mc.find('PMID').text
146
-     article = mc.find('Article')
147
-
148
-     # Title — use itertext() to handle embedded tags like <i>, <sub>
149
-     title = ''.join(article.find('ArticleTitle').itertext()).strip()
150
-
151
-     # Abstract — plain or structured (BACKGROUND / METHODS / RESULTS / CONCLUSION)
152
-     abstract_el = article.find('Abstract')
153
-     if abstract_el is not None:
154
-         sections = []
155
-         for t in abstract_el.findall('AbstractText'):
156
-             label = t.get('Label', '') # e.g. 'BACKGROUND', 'METHODS'
157
-             text = ''.join(t.itertext()).strip()
158
-             sections.append(f"[{label}] {text}" if label else text)
159
-         abstract = ' '.join(sections)
160
-     else:
161
-         abstract = '' # ~15% of articles have no abstract
162
-
163
-     # Journal + year
164
-     journal = article.find('Journal')
165
-     j_title = journal.find('Title').text if journal is not None else ''
166
-     pub_date = journal.find('.//PubDate') if journal is not None else None
167
-     if pub_date is not None:
168
-         year_el = pub_date.find('Year')
169
-         medline_el = pub_date.find('MedlineDate') # fallback for old/seasonal dates
170
-         season_el = pub_date.find('Season') # e.g. 'Jul-Aug', 'Oct-Dec'
171
-         year = (year_el.text if year_el is not None
172
-                 else medline_el.text[:4] if medline_el is not None else '')
173
-
174
-     # DOI
175
-     doi_el = next(
176
-         (e for e in article.findall('ELocationID') if e.get('EIdType') == 'doi'),
177
-         None
178
-     )
179
-     doi = doi_el.text if doi_el is not None else ''
180
-
181
-     # Authors — handle CollectiveName (consortium/group authors)
182
-     author_list = article.find('AuthorList')
183
-     authors = []
184
-     if author_list is not None:
185
-         for a in author_list.findall('Author'):
186
-             collective = a.find('CollectiveName')
187
-             last = a.find('LastName')
188
-             fore = a.find('ForeName')
189
-             initials = a.find('Initials')
190
-             if collective is not None:
191
-                 authors.append(collective.text)
192
-             elif last is not None:
193
-                 full = last.text
194
-                 if fore is not None:
195
-                     full += f", {fore.text}"
196
-                 authors.append(full)
197
-
198
-     # MeSH controlled vocabulary terms
199
-     mesh_list = mc.find('MeshHeadingList')
200
-     mesh_terms = []
201
-     if mesh_list is not None:
202
-         mesh_terms = [
203
-             mh.find('DescriptorName').text
204
-             for mh in mesh_list.findall('MeshHeading')
205
-             if mh.find('DescriptorName') is not None
206
-         ]
207
-
208
-     print(f"PMID={pmid} ({year}) {j_title}")
209
-     print(f" Title: {title[:70]}")
210
-     print(f" Authors: {authors[:3]}")
211
-     print(f" DOI: {doi}")
212
-     print(f" MeSH: {mesh_terms[:4]}")
213
-     print(f" Abstract: {abstract[:120]}")
214
- # Confirmed output (2026-04-18):
215
- # PMID=41999029 (2026) Medical science monitor : international medical...
216
- # Title: Use of Deep Learning Models in the Diagnosis of Proptosis Thro
217
- # Authors: ['Kesimal, Uğur', 'Akkaya, Habip Eser', 'Polat, Önder']
218
- # DOI: 10.12659/MSM.951157
219
- # MeSH: ['Humans', 'Deep Learning', 'Exophthalmos', 'Magnetic Resonance Imaging']
220
- # Abstract: BACKGROUND Proptosis is a common manifestation of orbital disease...
221
- # PMID=36328784 (...)
222
- # Abstract: [OBJECTIVES] Physical inactivity and sedentary behaviour... ← structured
223
- ```
224
-
225
- ### Large result sets — usehistory + WebEnv
226
-
227
- When `count` exceeds `retmax` (max 10 000), use server-side history to paginate EFetch without re-running ESearch on every page.
228
-
229
- ```python
230
- import json, xml.etree.ElementTree as ET
231
- from helpers import http_get
232
-
233
- # Step 1: ESearch with usehistory=y — NCBI holds result set on server
234
- search = json.loads(http_get(
235
- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
236
- "?db=pubmed&term=CRISPR+gene+editing&retmax=0&retmode=json&usehistory=y"
237
- ))
238
- webenv = search['esearchresult']['webenv'] # server-side session token
239
- query_key = search['esearchresult']['querykey'] # result set ID within session
240
- total = int(search['esearchresult']['count'])
241
- print(f"Total: {total}, WebEnv: {webenv[:30]}..., query_key: {query_key}")
242
- # Confirmed output (2026-04-18):
243
- # Total: 24160, WebEnv: MCID_69e4203757db89391008d6f1..., query_key: 1
244
-
245
- # Step 2: EFetch pages using WebEnv (no re-searching)
246
- batch_size = 200
247
- for start in range(0, min(total, 1000), batch_size): # cap at 1000 for demo
248
-     raw = http_get(
249
-         f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
250
-         f"?db=pubmed&query_key={query_key}&WebEnv={webenv}"
251
-         f"&retstart={start}&retmax={batch_size}&retmode=xml&rettype=abstract"
252
-     )
253
-     root = ET.fromstring(raw)
254
-     articles = root.findall('.//PubmedArticle')
255
-     print(f" Fetched {len(articles)} articles (start={start})")
256
-     # process articles here...
257
- ```
258
-
259
- ### EInfo — list available NCBI databases
260
-
261
- ```python
262
- import json
263
- from helpers import http_get
264
-
265
- data = json.loads(http_get(
266
- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?retmode=json"
267
- ))
268
- dbs = data['einforesult']['dblist']
269
- print(f"Total databases: {len(dbs)}") # Confirmed: 39 (2026-04-18)
270
- print(dbs[:10])
271
- # ['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure',
272
- # 'genome', 'annotinfo', 'assembly', 'bioproject']
273
- ```
274
-
275
- Get PubMed-specific metadata (field list, link list):
276
-
277
- ```python
278
- import json
279
- from helpers import http_get
280
-
281
- data = json.loads(http_get(
282
- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pubmed&retmode=json"
283
- ))
284
- db_info = data['einforesult']['dbinfo'][0]
285
- print("DB name:", db_info['dbname'])
286
- print("Record count:", db_info['count']) # total PubMed records
287
- link_names = [l['name'] for l in db_info.get('linklist', [])]
288
- print(f"Link types ({len(link_names)}):", link_names[:5])
289
- # Confirmed (2026-04-18):
290
- # DB name: pubmed
291
- # Record count: 37620453
292
- # Link types (48): ['pubmed_assembly', 'pubmed_bioproject', ...]
293
- ```
294
-
295
- ### ELink — cross-database linking
296
-
297
- ELink connects a PubMed record to associated data in other NCBI databases. The `pubmed_pubmed` "related articles" linkname relies on a similarity server that is intermittently unavailable (returns `"Couldn't resolve #exLinkSrv2, the address table is empty."`). Use the non-similarity links below instead.
298
-
299
- ```python
300
- import json
301
- from helpers import http_get
302
-
303
- # Link a PMID to its free full-text in PMC (if open access)
304
- # linkname=pubmed_pmc — may also hit the server outage; check error field
305
- data = json.loads(http_get(
306
- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"
307
- "?dbfrom=pubmed&id=38325330&linkname=pubmed_pmc&retmode=json"
308
- ))
309
- error = data.get('ERROR', '')
310
- if error:
311
- print("ELink error:", error) # 'Couldn't resolve #exLinkSrv2...' — NCBI server issue
312
- else:
313
- for ls in data.get('linksets', []):
314
- for lsdb in ls.get('linksetdbs', []):
315
- print(lsdb['linkname'], "→", lsdb['links'][:5])
316
- ```
317
-
318
- Available ELink linknames from pubmed (48 total):
319
-
320
- | linkname | Target |
321
- |---|---|
322
- | `pubmed_pmc` | Free full text in PMC |
323
- | `pubmed_pubmed_citedin` | Articles citing this paper |
324
- | `pubmed_pubmed_refs` | References cited by this paper |
325
- | `pubmed_gene` | Related Gene records |
326
- | `pubmed_clinvar` | Clinical variants associated with publication |
327
- | `pubmed_gds` | Related GEO datasets |
328
-
329
- **Practical alternative**: If ELink is down, extract DOI from EFetch/ESummary and use `https://doi.org/{doi}` directly for the full-text link.
330
-
331
- ## URL and parameter reference
332
-
333
- ### E-utilities base URLs
334
-
335
- ```
336
- https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi # search → PMIDs
337
- https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi # PMIDs → JSON summary
338
- https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi # PMIDs → full XML
339
- https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi # cross-db links
340
- https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi # DB metadata
341
- ```
342
-
343
- ### ESearch parameters
344
-
345
- | Parameter | Values | Notes |
346
- |---|---|---|
347
- | `db` | `pubmed` | Always `pubmed` for PubMed |
348
- | `term` | query string | Supports field tags like `[Author]`, `[Title]`, `[MeSH Terms]` |
349
- | `retmax` | integer, max 10000 | Results returned per call |
350
- | `retmode` | `json` | JSON output |
351
- | `sort` | `pub+date`, `Author`, `JournalName` | Default is relevance |
352
- | `datetype` | `pdat` (pub), `edat` (entrez), `mdat` (modified) | |
353
- | `mindate`, `maxdate` | `YYYY/MM/DD` or `YYYY` | Requires `datetype` |
354
- | `usehistory` | `y` | Store results on server; returns `webenv` + `querykey` |
355
-
356
- ### EFetch parameters
357
-
358
- | Parameter | Values | Notes |
359
- |---|---|---|
360
- | `db` | `pubmed` | |
361
- | `id` | `38000000,37999999` | Comma-separated PMIDs; max ~200 per call |
362
- | `query_key` + `WebEnv` | from ESearch `usehistory=y` | Alternative to `id` for large sets |
363
- | `retstart` | integer | Offset for pagination with WebEnv |
364
- | `retmax` | integer, max 10000 | Batch size |
365
- | `retmode` | `xml` | Use XML for EFetch (JSON not available for full records) |
366
- | `rettype` | `abstract` | Returns abstract + core metadata |
367
-
368
- ### PubMed article URL construction
369
-
370
- ```python
371
- pmid = "41999029"
372
- pubmed_url = f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"
373
- doi = "10.12659/MSM.951157"
374
- doi_url = f"https://doi.org/{doi}" # resolves to publisher page
375
- pmc_id = "PMC9876543" # from ESummary articleids
376
- pmc_url = f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmc_id}/"
377
- ```
378
-
379
- ## Gotchas
380
-
381
- - **`count` is a string, not int.** `search['esearchresult']['count']` returns `'24160'`, not `24160`. Always cast with `int()` before arithmetic.
382
-
383
- - **EFetch retmode must be `xml` for full records.** Unlike ESearch and ESummary, EFetch with `retmode=json` returns flat text (the MEDLINE citation text format), not structured JSON. Parse EFetch responses with `xml.etree.ElementTree`.
384
-
385
- - **`ArticleTitle` may contain embedded XML tags.** Titles with italics (`<i>Staphylococcus aureus</i>`) or math (`<sub>2</sub>`) are mixed-content nodes. Always use `''.join(el.itertext())` instead of `el.text`, which silently drops everything after the first child tag.
386
-
387
- - **~15% of articles have no abstract.** `article.find('Abstract')` returns `None` for short communications, editorials, letters, and older records. Always guard with `if abstract_el is not None`.
388
-
389
- - **Author names vary in structure — always handle `CollectiveName`.** Consortium papers list a group name (`'GeKeR Study Group'`, `'Breast Cancer Association Consortium'`) under `<CollectiveName>` instead of `<LastName>/<ForeName>`. Individual authors have `<LastName>` + optionally `<ForeName>` and `<Initials>`. Check `CollectiveName` first; falling through to `LastName` without the check produces `None` errors.
390
- - Confirmed real examples (2026-04-18): PMID 37586835 (`GeKeR Study Group`), PMID 36328784 (`Breast Cancer Association Consortium`)
391
-
392
- - **PubDate has three possible structures.** Most articles have `<Year>` + optional `<Month>` + optional `<Day>`. Seasonal journals use `<Season>` (e.g. `Jul-Aug`, `Oct-Dec`) instead of `<Month>`. A minority of older records use `<MedlineDate>` (e.g. `1995 Fall`) with no `<Year>`. Safe extraction pattern:
393
- ```python
394
- pub_date = journal.find('.//PubDate')
395
- year_el = pub_date.find('Year') if pub_date is not None else None
396
- medline_el = pub_date.find('MedlineDate') if pub_date is not None else None
397
- year = (year_el.text if year_el is not None
398
- else medline_el.text[:4] if medline_el is not None else '')
399
- ```
400
-
401
- - **Batch EFetch: keep IDs to ~200 per call.** The API accepts comma-separated IDs in `id=`, but very large batches (500+) occasionally time out or return truncated XML. For >200 articles, iterate in chunks or use `usehistory` + `WebEnv`.
402
-
403
- - **ELink `pubmed_pubmed` (related articles) is intermittently broken.** The NCBI similarity server returns `"Couldn't resolve #exLinkSrv2, the address table is empty."` — this is a persistent server-side issue as of 2026-04-18, not a rate-limit error. Other linknames (`pubmed_gene`, `pubmed_pmc`, `pubmed_clinvar`) fail with the same error. Use the DOI as a fallback link to publisher full text.
404
-
405
- - **Rate limits: 3 req/s without API key, 10 req/s with free key.** Exceeding 3 req/s returns HTTP 429. Insert `time.sleep(0.34)` between sequential calls without a key. Get a free API key at https://www.ncbi.nlm.nih.gov/account/ and append `&api_key=YOUR_KEY` to all URLs.
406
-
407
- - **`retmax` upper bound is 10 000 for ESearch.** To retrieve more than 10 000 PMIDs for a search, use `usehistory=y` and page through EFetch with `retstart` offsets. EFetch itself also accepts `retmax` up to 10 000 per call.
408
-
409
- - **`retmax=0` in ESearch returns only the count, not IDs — useful for counting.** Combine with `usehistory=y` to store the result for later paging without fetching IDs upfront:
410
- ```python
411
- search = json.loads(http_get(
412
- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
413
- "?db=pubmed&term=cancer&retmax=0&retmode=json&usehistory=y"
414
- ))
415
- total = int(search['esearchresult']['count']) # e.g. 4800000
416
- webenv = search['esearchresult']['webenv']
417
- ```
418
-
419
- - **ESummary `authors` field uses abbreviated names (`Last I`), not full names.** Use EFetch XML to get `ForeName` (e.g. `'Kesimal, Uğur'` vs ESummary `'Kesimal U'`). For bulk tasks where full names are not needed, ESummary is faster.
420
-
421
- - **`querytranslation` shows how NCBI interpreted your term.** The ESearch response includes `esearchresult.querytranslation` — a MeSH-expanded version of your query. Inspect it to verify the search matched what you intended.
1
+ # PubMed / NCBI — Scraping & Data Extraction
2
+
3
+ `https://pubmed.ncbi.nlm.nih.gov` — 37 M+ biomedical citations. **Never use the browser for PubMed.** All data is reachable via `http_get` using the NCBI E-utilities REST API. No API key required; a free key raises the rate limit from 3 to 10 req/s.
4
+
5
+ ## Do this first
6
+
7
+ **ESearch → ESummary is the fastest pipeline for most tasks — two calls, JSON responses, no XML parsing.**
8
+
9
+ ```python
10
+ import json
11
+ from helpers import http_get
12
+
13
+ # Step 1: search → get PMIDs
14
+ search = json.loads(http_get(
15
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
16
+ "?db=pubmed&term=deep+learning+radiology&retmax=10&retmode=json"
17
+ ))
18
+ pmids = search['esearchresult']['idlist'] # e.g. ['41999029', '41998456', ...]
19
+ count = search['esearchresult']['count']  # total hits across all pages (a string; cast with int())
20
+
21
+ # Step 2: fetch lightweight metadata for all PMIDs in one call
22
+ summary = json.loads(http_get(
23
+ f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
24
+ f"?db=pubmed&id={','.join(pmids)}&retmode=json"
25
+ ))
26
+ result = summary['result']
27
+ for uid in result['uids']:
28
+     art = result[uid]
29
+     print(uid, art['pubdate'], art['source'])
30
+     print(" ", art['title'][:80])
31
+     print(" authors:", [a['name'] for a in art['authors'][:3]])
32
+ # Confirmed output (2026-04-18):
33
+ # 41999029 2026 Apr 18 Med Sci Monit
34
+ # Use of Deep Learning Models in the Diagnosis of Proptosis Through Orbi
35
+ # authors: ['Kesimal U', 'Akkaya HE', 'Polat Ö']
36
+ # 41998456 2026 Apr 17 Sci Rep
37
+ # ...
38
+ ```
39
+
40
+ Use **EFetch XML** when you need: full abstract text, MeSH terms, complete author names (not just "Last I"), structured abstract labels, or the DOI from within the article record.
41
+
42
+ ## Common workflows
43
+
44
+ ### Search PubMed (ESearch)
45
+
46
+ ```python
47
+ import json
48
+ from helpers import http_get
49
+
50
+ data = json.loads(http_get(
51
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
52
+ "?db=pubmed"
53
+ "&term=large+language+models+clinical"
54
+ "&retmax=5"
55
+ "&retmode=json"
56
+ "&sort=pub+date" # newest first; default is relevance
57
+ "&datetype=pdat" # filter by publication date
58
+ "&mindate=2024/01/01&maxdate=2024/12/31" # YYYY/MM/DD format
59
+ ))
60
+ result = data['esearchresult']
61
+ print("Total hits:", result['count']) # '24160' — note: string, not int
62
+ print("PMIDs:", result['idlist'])
63
+ print("Query translation:", result['querytranslation'])
64
+ # Confirmed output (2026-04-18):
65
+ # Total hits: 24160
66
+ # PMIDs: ['41996895', '41996722', '41996006', '41995888', '41995759']
67
+ # Query translation: "large language models"[MeSH Terms] OR ...
68
+ ```
69
+
70
+ #### ESearch field tags (append to term)
71
+
72
+ ```
73
+ machine learning[MeSH Terms] MeSH controlled vocabulary
74
+ Hinton GE[Author] author last + initials
75
+ attention is all you need[Title] title words
76
+ Nature[Journal] journal name
77
+ 2024[pdat] publication year
78
+ ```
79
+
80
+ Boolean operators: `AND`, `OR`, `NOT`. Phrase search: `"exact phrase"[Title]`.
81
+
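The field tags and Boolean operators above can be combined programmatically. The sketch below is one possible way to do it, not part of the original skill: the helper names `tagged` and `esearch_url` are hypothetical, and it assumes `urllib.parse.urlencode` is an acceptable way to encode the query (it turns spaces into `+`, which matches the `pub+date` style used in the examples).

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged(value, field, phrase=False):
    """Return 'value[field]'; wrap the value in quotes for an exact phrase search."""
    text = f'"{value}"' if phrase else value
    return f"{text}[{field}]"


def esearch_url(term, **params):
    """Build an ESearch URL (hypothetical helper); extra params override nothing by default."""
    query = {"db": "pubmed", "term": term, "retmode": "json", **params}
    return f"{ESEARCH}?{urlencode(query)}"


# Combine three tagged clauses with AND
term = " AND ".join([
    tagged("machine learning", "MeSH Terms"),
    tagged("Nature", "Journal"),
    tagged("2024", "pdat"),
])
url = esearch_url(term, retmax=5, sort="pub date")
print(term)
print(url)
```

Passing the resulting `url` to `http_get` should behave like the hand-written URLs in this section; inspect `querytranslation` in the response to confirm the tags were interpreted as intended.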
82
+ #### Sort options (`sort=`)
83
+
84
+ | Value | Effect |
85
+ |---|---|
86
+ | *(omit)* | Relevance (default) |
87
+ | `pub+date` | Most recent publication first |
88
+ | `Author` | First author alphabetical |
89
+ | `JournalName` | Journal alphabetical |
90
+
91
+ ### Lightweight metadata — ESummary (JSON, no XML)
92
+
93
+ ```python
94
+ import json
95
+ from helpers import http_get
96
+
97
+ data = json.loads(http_get(
98
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
99
+ "?db=pubmed&id=41999029,41998456,41997837&retmode=json"
100
+ ))
101
+ result = data['result']
102
+ for uid in result['uids']:
103
+     art = result[uid]
104
+     # Key fields available:
105
+     title = art['title']  # full title string
106
+     source = art['source']  # abbreviated journal name
107
+     fulljournalname = art['fulljournalname']
108
+     pubdate = art['pubdate']  # e.g. '2026 Apr 18'
109
+     epubdate = art['epubdate']  # e-pub ahead of print date (may be empty)
110
+     authors = art['authors']  # list of {'name': 'Last I', 'authtype': ...}
111
+     volume = art['volume']
112
+     issue = art['issue']
113
+     pages = art['pages']
114
+     pubtype = art['pubtype']  # list: ['Journal Article', 'Review', ...]
115
+     # Extract DOI from elocationid or articleids:
116
+     doi_field = art['elocationid']  # e.g. 'doi: 10.12659/MSM.951157'
117
+     article_ids = {x['idtype']: x['value'] for x in art['articleids']}
118
+     doi = article_ids.get('doi')
119
+     pmc_id = article_ids.get('pmc')  # PMC ID if open access
120
+     print(uid, pubdate, source)
121
+     print(" ", title[:70])
122
+     print(" doi:", doi, "| pmc:", pmc_id)
123
+ # Confirmed output (2026-04-18):
124
+ # 41999029 2026 Apr 18 Med Sci Monit
125
+ # Use of Deep Learning Models in the Diagnosis of Proptosis Through Orbi
126
+ # doi: 10.12659/MSM.951157 | pmc: None
127
+ ```
128
+
129
+ ### Full article metadata — EFetch XML
130
+
131
+ Use this for full abstracts, complete author names, MeSH terms, structured abstract sections.
132
+
133
+ ```python
134
+ import xml.etree.ElementTree as ET
135
+ from helpers import http_get
136
+
137
+ raw = http_get(
138
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
139
+ "?db=pubmed&id=41999029,36328784&retmode=xml&rettype=abstract"
140
+ )
141
+ root = ET.fromstring(raw)
142
+
143
+ for art in root.findall('.//PubmedArticle'):
144
+     mc = art.find('MedlineCitation')
145
+     pmid = mc.find('PMID').text
146
+     article = mc.find('Article')
147
+
148
+     # Title: use itertext() to handle embedded tags like <i>, <sub>
149
+     title = ''.join(article.find('ArticleTitle').itertext()).strip()
150
+
151
+     # Abstract: plain or structured (BACKGROUND / METHODS / RESULTS / CONCLUSION)
152
+     abstract_el = article.find('Abstract')
153
+     if abstract_el is not None:
154
+         sections = []
155
+         for t in abstract_el.findall('AbstractText'):
156
+             label = t.get('Label', '')  # e.g. 'BACKGROUND', 'METHODS'
157
+             text = ''.join(t.itertext()).strip()
158
+             sections.append(f"[{label}] {text}" if label else text)
159
+         abstract = ' '.join(sections)
160
+     else:
161
+         abstract = ''  # ~15% of articles have no abstract
162
+
163
+     # Journal + year
164
+     journal = article.find('Journal')
165
+     j_title = journal.find('Title').text if journal is not None else ''
166
+     pub_date = journal.find('.//PubDate') if journal is not None else None
167
+     # Guard every lookup so `year` is always defined, even when PubDate is missing
168
+     year_el = pub_date.find('Year') if pub_date is not None else None
169
+     medline_el = pub_date.find('MedlineDate') if pub_date is not None else None  # fallback for old/seasonal dates
170
+     season_el = pub_date.find('Season') if pub_date is not None else None  # e.g. 'Jul-Aug', 'Oct-Dec'
171
+     year = (year_el.text if year_el is not None
172
+             else medline_el.text[:4] if medline_el is not None else '')
173
+
174
+     # DOI
175
+     doi_el = next(
176
+         (e for e in article.findall('ELocationID') if e.get('EIdType') == 'doi'),
177
+         None
178
+     )
179
+     doi = doi_el.text if doi_el is not None else ''
180
+
181
+     # Authors: handle CollectiveName (consortium/group authors)
182
+     author_list = article.find('AuthorList')
183
+     authors = []
184
+     if author_list is not None:
185
+         for a in author_list.findall('Author'):
186
+             collective = a.find('CollectiveName')
187
+             last = a.find('LastName')
188
+             fore = a.find('ForeName')
189
+             initials = a.find('Initials')
190
+             if collective is not None:
191
+                 authors.append(collective.text)
192
+             elif last is not None:
193
+                 full = last.text
194
+                 if fore is not None:
195
+                     full += f", {fore.text}"
196
+                 authors.append(full)
197
+
198
+     # MeSH controlled vocabulary terms
199
+     mesh_list = mc.find('MeshHeadingList')
200
+     mesh_terms = []
201
+     if mesh_list is not None:
202
+         mesh_terms = [
203
+             mh.find('DescriptorName').text
204
+             for mh in mesh_list.findall('MeshHeading')
205
+             if mh.find('DescriptorName') is not None
206
+         ]
207
+
208
+     print(f"PMID={pmid} ({year}) {j_title}")
209
+     print(f" Title: {title[:70]}")
210
+     print(f" Authors: {authors[:3]}")
211
+     print(f" DOI: {doi}")
212
+     print(f" MeSH: {mesh_terms[:4]}")
213
+     print(f" Abstract: {abstract[:120]}")
214
+ # Confirmed output (2026-04-18):
215
+ # PMID=41999029 (2026) Medical science monitor : international medical...
216
+ # Title: Use of Deep Learning Models in the Diagnosis of Proptosis Thro
217
+ # Authors: ['Kesimal, Uğur', 'Akkaya, Habip Eser', 'Polat, Önder']
218
+ # DOI: 10.12659/MSM.951157
219
+ # MeSH: ['Humans', 'Deep Learning', 'Exophthalmos', 'Magnetic Resonance Imaging']
220
+ # Abstract: BACKGROUND Proptosis is a common manifestation of orbital disease...
221
+ # PMID=36328784 (...)
222
+ # Abstract: [OBJECTIVES] Physical inactivity and sedentary behaviour... ← structured
223
+ ```
224
+
225
+ ### Large result sets — usehistory + WebEnv
226
+
227
+ When `count` exceeds `retmax` (max 10 000), use server-side history to paginate EFetch without re-running ESearch on every page.
228
+
229
+ ```python
230
+ import json, xml.etree.ElementTree as ET
231
+ from helpers import http_get
232
+
233
+ # Step 1: ESearch with usehistory=y — NCBI holds result set on server
234
+ search = json.loads(http_get(
235
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
236
+ "?db=pubmed&term=CRISPR+gene+editing&retmax=0&retmode=json&usehistory=y"
237
+ ))
238
+ webenv = search['esearchresult']['webenv'] # server-side session token
239
+ query_key = search['esearchresult']['querykey'] # result set ID within session
240
+ total = int(search['esearchresult']['count'])
241
+ print(f"Total: {total}, WebEnv: {webenv[:30]}..., query_key: {query_key}")
242
+ # Confirmed output (2026-04-18):
243
+ # Total: 24160, WebEnv: MCID_69e4203757db89391008d6f1..., query_key: 1
244
+
245
+ # Step 2: EFetch pages using WebEnv (no re-searching)
246
+ batch_size = 200
247
+ for start in range(0, min(total, 1000), batch_size): # cap at 1000 for demo
248
+     raw = http_get(
249
+         f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
250
+         f"?db=pubmed&query_key={query_key}&WebEnv={webenv}"
251
+         f"&retstart={start}&retmax={batch_size}&retmode=xml&rettype=abstract"
252
+     )
253
+     root = ET.fromstring(raw)
254
+     articles = root.findall('.//PubmedArticle')
255
+     print(f" Fetched {len(articles)} articles (start={start})")
256
+     # process articles here...
257
+ ```
258
+
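A paging loop like the one above issues requests back to back, which can exceed the 3 req/s limit that applies without an API key. The following is a minimal throttle sketch, not part of the original skill: the `Throttle` class and its parameters are hypothetical, and the 0.34 s default simply mirrors the `time.sleep(0.34)` advice given under Gotchas.

```python
import time


class Throttle:
    """Space successive calls at least `min_interval` seconds apart."""

    def __init__(self, min_interval=0.34, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Short demo with a small interval so it finishes quickly
throttle = Throttle(min_interval=0.05)
started = time.monotonic()
for _ in range(3):
    throttle.wait()          # in real use, call http_get(...) right after this
elapsed = time.monotonic() - started
print(f"3 calls spaced over {elapsed:.2f}s")
```

In the paging loop, calling `throttle.wait()` immediately before each `http_get(...)` would keep sequential requests under the limit; with an API key the interval could be lowered to roughly 0.1 s.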
259
+ ### EInfo — list available NCBI databases
260
+
261
+ ```python
262
+ import json
263
+ from helpers import http_get
264
+
265
+ data = json.loads(http_get(
266
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?retmode=json"
267
+ ))
268
+ dbs = data['einforesult']['dblist']
269
+ print(f"Total databases: {len(dbs)}") # Confirmed: 39 (2026-04-18)
270
+ print(dbs[:10])
271
+ # ['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure',
272
+ # 'genome', 'annotinfo', 'assembly', 'bioproject']
273
+ ```
274
+
275
+ Get PubMed-specific metadata (field list, link list):
276
+
277
+ ```python
278
+ import json
279
+ from helpers import http_get
280
+
281
+ data = json.loads(http_get(
282
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pubmed&retmode=json"
283
+ ))
284
+ db_info = data['einforesult']['dbinfo'][0]
285
+ print("DB name:", db_info['dbname'])
286
+ print("Record count:", db_info['count']) # total PubMed records
287
+ link_names = [l['name'] for l in db_info.get('linklist', [])]
288
+ print(f"Link types ({len(link_names)}):", link_names[:5])
289
+ # Confirmed (2026-04-18):
290
+ # DB name: pubmed
291
+ # Record count: 37620453
292
+ # Link types (48): ['pubmed_assembly', 'pubmed_bioproject', ...]
293
+ ```
294
+
295
+ ### ELink — cross-database linking
296
+
297
+ ELink connects a PubMed record to associated data in other NCBI databases. The `pubmed_pubmed` "related articles" linkname relies on a similarity server that is intermittently unavailable (returns `"Couldn't resolve #exLinkSrv2, the address table is empty."`). Prefer the non-similarity links below, but always check the response for an `ERROR` field, because they have been observed failing with the same message.
298
+
299
+ ```python
300
+ import json
301
+ from helpers import http_get
302
+
303
+ # Link a PMID to its free full-text in PMC (if open access)
304
+ # linkname=pubmed_pmc — may also hit the server outage; check error field
305
+ data = json.loads(http_get(
306
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"
307
+ "?dbfrom=pubmed&id=38325330&linkname=pubmed_pmc&retmode=json"
308
+ ))
309
+ error = data.get('ERROR', '')
310
+ if error:
311
+     print("ELink error:", error)  # 'Couldn't resolve #exLinkSrv2...' is an NCBI server issue
312
+ else:
313
+     for ls in data.get('linksets', []):
314
+         for lsdb in ls.get('linksetdbs', []):
315
+             print(lsdb['linkname'], "→", lsdb['links'][:5])
316
+ ```
317
+
318
+ Available ELink linknames from pubmed (48 total):
319
+
320
+ | linkname | Target |
321
+ |---|---|
322
+ | `pubmed_pmc` | Free full text in PMC |
323
+ | `pubmed_pubmed_citedin` | Articles citing this paper |
324
+ | `pubmed_pubmed_refs` | References cited by this paper |
325
+ | `pubmed_gene` | Related Gene records |
326
+ | `pubmed_clinvar` | Clinical variants associated with publication |
327
+ | `pubmed_gds` | Related GEO datasets |
328
+
329
+ **Practical alternative**: If ELink is down, extract DOI from EFetch/ESummary and use `https://doi.org/{doi}` directly for the full-text link.
330
+
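The fallback described above can be sketched as a small helper that picks the best available link from an ESummary record. This is an illustrative sketch only: `full_text_url` is a hypothetical name, the sample records are hand-written rather than real API output, and it assumes the `articleids` shape shown in the ESummary example (a list of `{'idtype': ..., 'value': ...}` entries, with the `pmc` value already carrying the `PMC` prefix).

```python
def full_text_url(art, pmid):
    """Prefer PMC, then the DOI resolver, then the PubMed landing page."""
    ids = {x["idtype"]: x["value"] for x in art.get("articleids", [])}
    if ids.get("pmc"):
        return f"https://www.ncbi.nlm.nih.gov/pmc/articles/{ids['pmc']}/"
    if ids.get("doi"):
        return f"https://doi.org/{ids['doi']}"
    return f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"


# Hand-written sample records (assumed shape, not real responses)
with_pmc = {"articleids": [
    {"idtype": "pubmed", "value": "38325330"},
    {"idtype": "pmc", "value": "PMC9876543"},
    {"idtype": "doi", "value": "10.12659/MSM.951157"},
]}
doi_only = {"articleids": [{"idtype": "doi", "value": "10.12659/MSM.951157"}]}
no_ids = {"articleids": []}

print(full_text_url(with_pmc, "38325330"))
print(full_text_url(doi_only, "41999029"))
print(full_text_url(no_ids, "41999029"))
```

Because it only reads ESummary fields, this path needs no ELink call at all, so it keeps working while the link server is unavailable.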
331
+ ## URL and parameter reference
332
+
333
+ ### E-utilities base URLs
334
+
335
+ ```
336
+ https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi # search → PMIDs
337
+ https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi # PMIDs → JSON summary
338
+ https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi # PMIDs → full XML
339
+ https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi # cross-db links
340
+ https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi # DB metadata
341
+ ```
342
+
343
+ ### ESearch parameters
344
+
345
+ | Parameter | Values | Notes |
346
+ |---|---|---|
347
+ | `db` | `pubmed` | Always `pubmed` for PubMed |
348
+ | `term` | query string | Supports field tags like `[Author]`, `[Title]`, `[MeSH Terms]` |
349
+ | `retmax` | integer, max 10000 | Results returned per call |
350
+ | `retmode` | `json` | JSON output |
351
+ | `sort` | `pub+date`, `Author`, `JournalName` | Default is relevance |
352
+ | `datetype` | `pdat` (pub), `edat` (entrez), `mdat` (modified) | |
353
+ | `mindate`, `maxdate` | `YYYY/MM/DD` or `YYYY` | Requires `datetype` |
354
+ | `usehistory` | `y` | Store results on server; returns `webenv` + `querykey` |
355
+
356
+ ### EFetch parameters
357
+
358
+ | Parameter | Values | Notes |
359
+ |---|---|---|
360
+ | `db` | `pubmed` | |
361
+ | `id` | `38000000,37999999` | Comma-separated PMIDs; max ~200 per call |
362
+ | `query_key` + `WebEnv` | from ESearch `usehistory=y` | Alternative to `id` for large sets |
363
+ | `retstart` | integer | Offset for pagination with WebEnv |
364
+ | `retmax` | integer, max 10000 | Batch size |
365
+ | `retmode` | `xml` | Use XML for EFetch (JSON not available for full records) |
366
+ | `rettype` | `abstract` | Returns abstract + core metadata |
367
+
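The EFetch parameters in the table combine naturally into a page-URL generator for history-based paging. The sketch below is an assumption-laden illustration rather than the skill's own code: `efetch_page_urls` is a hypothetical helper, the `MCID_example` token is a placeholder, and the default batch size of 200 follows the guidance given elsewhere in this document.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_page_urls(total, query_key, webenv, batch_size=200, limit=None):
    """Yield one EFetch URL per page of a stored ESearch result set."""
    stop = total if limit is None else min(total, limit)
    for start in range(0, stop, batch_size):
        params = {
            "db": "pubmed",
            "query_key": query_key,
            "WebEnv": webenv,
            "retstart": start,
            "retmax": min(batch_size, stop - start),  # shrink the final page
            "retmode": "xml",
            "rettype": "abstract",
        }
        yield f"{EFETCH}?{urlencode(params)}"


# Placeholder values; real ones come from ESearch with usehistory=y
urls = list(efetch_page_urls(total=450, query_key="1", webenv="MCID_example"))
for u in urls:
    print(u)
```

Remember that `count` arrives as a string, so cast it with `int()` before passing it as `total`.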
368
+ ### PubMed article URL construction
369
+
370
+ ```python
371
+ pmid = "41999029"
372
+ pubmed_url = f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"
373
+ doi = "10.12659/MSM.951157"
374
+ doi_url = f"https://doi.org/{doi}" # resolves to publisher page
375
+ pmc_id = "PMC9876543" # from ESummary articleids
376
+ pmc_url = f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmc_id}/"
377
+ ```
378
+
379
+ ## Gotchas
380
+
381
+ - **`count` is a string, not int.** `search['esearchresult']['count']` returns `'24160'`, not `24160`. Always cast with `int()` before arithmetic.
382
+
383
+ - **EFetch retmode must be `xml` for full records.** Unlike ESearch and ESummary, EFetch with `retmode=json` returns flat text (the MEDLINE citation text format), not structured JSON. Parse EFetch responses with `xml.etree.ElementTree`.
384
+
385
+ - **`ArticleTitle` may contain embedded XML tags.** Titles with italics (`<i>Staphylococcus aureus</i>`) or math (`<sub>2</sub>`) are mixed-content nodes. Always use `''.join(el.itertext())` instead of `el.text`, which silently drops everything after the first child tag.
386
+
387
+ - **~15% of articles have no abstract.** `article.find('Abstract')` returns `None` for short communications, editorials, letters, and older records. Always guard with `if abstract_el is not None`.
388
+
389
+ - **Author names vary in structure: always handle `CollectiveName`.** Consortium papers list a group name (`'GeKeR Study Group'`, `'Breast Cancer Association Consortium'`) under `<CollectiveName>` instead of `<LastName>/<ForeName>`. Individual authors have `<LastName>` + optionally `<ForeName>` and `<Initials>`. Check `CollectiveName` first; for a group author `find('LastName')` returns `None`, so reading `.text` on it without the check raises `AttributeError`.
390
+ - Confirmed real examples (2026-04-18): PMID 37586835 (`GeKeR Study Group`), PMID 36328784 (`Breast Cancer Association Consortium`)
391
+
392
+ - **PubDate has three possible structures.** Most articles have `<Year>` + optional `<Month>` + optional `<Day>`. Seasonal journals use `<Season>` (e.g. `Jul-Aug`, `Oct-Dec`) instead of `<Month>`. A minority of older records use `<MedlineDate>` (e.g. `1995 Fall`) with no `<Year>`. Safe extraction pattern:
393
+ ```python
394
+ pub_date = journal.find('.//PubDate')
395
+ year_el = pub_date.find('Year') if pub_date is not None else None
396
+ medline_el = pub_date.find('MedlineDate') if pub_date is not None else None
397
+ year = (year_el.text if year_el is not None
398
+ else medline_el.text[:4] if medline_el is not None else '')
399
+ ```
400
+
401
+ - **Batch EFetch: keep IDs to ~200 per call.** The API accepts comma-separated IDs in `id=`, but very large batches (500+) occasionally time out or return truncated XML. For >200 articles, iterate in chunks or use `usehistory` + `WebEnv`.
402
+
403
+ - **ELink `pubmed_pubmed` (related articles) is intermittently broken.** The NCBI similarity server returns `"Couldn't resolve #exLinkSrv2, the address table is empty."` This is a recurring server-side issue (observed as of 2026-04-18), not a rate-limit error. Other linknames (`pubmed_gene`, `pubmed_pmc`, `pubmed_clinvar`) can fail with the same error, so check the `ERROR` field on every ELink response. Use the DOI as a fallback link to publisher full text.
404
+
405
+ - **Rate limits: 3 req/s without API key, 10 req/s with free key.** Exceeding 3 req/s returns HTTP 429. Insert `time.sleep(0.34)` between sequential calls without a key. Get a free API key at https://www.ncbi.nlm.nih.gov/account/ and append `&api_key=YOUR_KEY` to all URLs.
406
+
407
+ - **`retmax` upper bound is 10 000 for ESearch.** To retrieve more than 10 000 PMIDs for a search, use `usehistory=y` and page through EFetch with `retstart` offsets. EFetch itself also accepts `retmax` up to 10 000 per call.
408
+
409
+ - **`retmax=0` in ESearch returns only the count, not IDs — useful for counting.** Combine with `usehistory=y` to store the result for later paging without fetching IDs upfront:
410
+ ```python
411
+ search = json.loads(http_get(
412
+ "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
413
+ "?db=pubmed&term=cancer&retmax=0&retmode=json&usehistory=y"
414
+ ))
415
+ total = int(search['esearchresult']['count']) # e.g. 4800000
416
+ webenv = search['esearchresult']['webenv']
417
+ ```
418
+
419
+ - **ESummary `authors` field uses abbreviated names (`Last I`), not full names.** Use EFetch XML to get `ForeName` (e.g. `'Kesimal, Uğur'` vs ESummary `'Kesimal U'`). For bulk tasks where full names are not needed, ESummary is faster.
420
+
421
+ - **`querytranslation` shows how NCBI interpreted your term.** The ESearch response includes `esearchresult.querytranslation` — a MeSH-expanded version of your query. Inspect it to verify the search matched what you intended.
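Several of the XML gotchas above (mixed-content titles, `MedlineDate`-only dates, `CollectiveName` authors) can be exercised offline. The sketch below uses a hand-written XML fragment, not a real EFetch response, and the helper names `text_of`, `pub_year`, and `author_names` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment that mimics the tricky cases (not real PubMed data)
SAMPLE = """
<Article>
  <ArticleTitle>Growth of <i>Staphylococcus aureus</i> under CO<sub>2</sub></ArticleTitle>
  <Journal><JournalIssue><PubDate><MedlineDate>1995 Fall</MedlineDate></PubDate></JournalIssue></Journal>
  <AuthorList>
    <Author><CollectiveName>Example Study Group</CollectiveName></Author>
    <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
    <Author><LastName>Roe</LastName></Author>
  </AuthorList>
</Article>
"""


def text_of(el):
    """Full text of a mixed-content node, including text inside child tags."""
    return ''.join(el.itertext()).strip() if el is not None else ''


def pub_year(article):
    """Year from <Year>, else the first four characters of <MedlineDate>, else ''."""
    pub_date = article.find('.//PubDate')
    if pub_date is None:
        return ''
    year_el = pub_date.find('Year')
    if year_el is not None and year_el.text:
        return year_el.text
    medline_el = pub_date.find('MedlineDate')
    return medline_el.text[:4] if medline_el is not None and medline_el.text else ''


def author_names(article):
    """Group names as-is; individuals as 'Last, Fore' or just 'Last'."""
    names = []
    for a in article.findall('.//Author'):
        collective = a.find('CollectiveName')
        last = a.find('LastName')
        fore = a.find('ForeName')
        if collective is not None:
            names.append(text_of(collective))
        elif last is not None:
            names.append(f"{last.text}, {fore.text}" if fore is not None else last.text)
    return names


article = ET.fromstring(SAMPLE.strip())
title_el = article.find('ArticleTitle')
print(repr(title_el.text))      # only the text before the first child tag
print(text_of(title_el))        # complete title
print(pub_year(article))
print(author_names(article))
```

The first `print` shows why `el.text` is unsafe for titles: it stops at the first embedded tag, while `itertext()` recovers the whole string.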