@pencil-agent/nano-pencil 2.0.0 → 2.0.1

Files changed (195)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/mcp/mcp-client.d.ts +3 -1
  7. package/dist/core/mcp/mcp-client.js +6 -6
  8. package/dist/core/mcp/mcp-config.d.ts +3 -3
  9. package/dist/core/mcp/mcp-config.js +1 -1
  10. package/dist/core/mcp/mcp-manager.d.ts +5 -1
  11. package/dist/core/mcp/mcp-manager.js +1 -1
  12. package/dist/core/platform/config/resource-loader.d.ts +2 -0
  13. package/dist/core/platform/config/resource-loader.js +2 -2
  14. package/dist/core/runtime/agent-session.d.ts +12 -0
  15. package/dist/core/runtime/agent-session.js +8 -8
  16. package/dist/core/runtime/sdk.d.ts +8 -0
  17. package/dist/core/runtime/sdk.js +1 -1
  18. package/dist/extensions/builtin/AGENT.md +115 -115
  19. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  20. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  91. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  92. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  93. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  94. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  95. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  96. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  97. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  98. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  99. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  100. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  101. package/dist/extensions/builtin/browser/browser.md +73 -73
  102. package/dist/extensions/builtin/browser/install.md +142 -142
  103. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  104. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  105. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  107. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  108. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  109. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  110. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  111. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  112. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  113. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  114. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  115. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  116. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  117. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  118. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  119. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  120. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  121. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  122. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  123. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  124. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  125. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  126. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  127. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  128. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  129. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  130. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  131. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  132. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  133. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  134. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  135. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  136. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  137. package/dist/extensions/builtin/goal/README.md +67 -67
  138. package/dist/extensions/builtin/grub/README.md +112 -112
  139. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  140. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  141. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  142. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  143. package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
  144. package/dist/extensions/builtin/loop/README.md +92 -92
  145. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  146. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  147. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  148. package/dist/extensions/builtin/sal/README.md +72 -72
  149. package/dist/extensions/builtin/security-audit/README.md +289 -289
  150. package/dist/extensions/builtin/team/AGENT.md +112 -112
  151. package/dist/extensions/builtin/team/TESTING.md +299 -299
  152. package/dist/extensions/builtin/token-save/README.md +56 -56
  153. package/dist/extensions/optional/AGENT.md +10 -10
  154. package/dist/modes/interactive/interactive-mode.js +36 -36
  155. package/dist/modes/interactive/theme/dark.json +85 -85
  156. package/dist/modes/interactive/theme/light.json +84 -84
  157. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  158. package/dist/modes/interactive/theme/warm.json +81 -81
  159. package/dist/node_modules/@pencil-agent/agent-core/dist/agent-loop.js +3 -2
  160. package/dist/node_modules/@pencil-agent/agent-core/dist/structured-adaptive-agent-loop.js +2 -1
  161. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  162. package/docs/cc-agent-design.md +1297 -0
  163. package/docs/cc-tui-design.md +1333 -0
  164. package/docs/codex-goal-command-impl.md +1055 -1055
  165. package/docs/codex-goal-vs-grub.md +500 -500
  166. package/docs/custom-provider.md +27 -27
  167. package/docs/extensions.md +27 -27
  168. package/docs/keybindings.md +27 -27
  169. package/docs/loop 重构完成总结.md +250 -250
  170. package/docs/loop 重构完成报告.md +122 -122
  171. package/docs/loop 重构方案.md +1222 -1222
  172. package/docs/loop 重构方案实现报告.md +158 -158
  173. package/docs/loop 重构方案对比分析.md +128 -128
  174. package/docs/loop 重构计划.md +320 -320
  175. package/docs/loop-usage-examples.md +214 -214
  176. package/docs/models.md +27 -27
  177. package/docs/nanoPencil-学习计划.md +170 -0
  178. package/docs/packages.md +27 -27
  179. package/docs/pi-design-philosophy.md +457 -457
  180. package/docs/planmode.md +1987 -1987
  181. package/docs/prompt-templates.md +27 -27
  182. package/docs/providers.md +27 -27
  183. package/docs/scan-report.md +3820 -0
  184. package/docs/sdk.md +27 -27
  185. package/docs/skills.md +27 -27
  186. package/docs/themes.md +27 -27
  187. package/docs/tui.md +27 -27
  188. package/docs//345/257/271/346/240/207Claude-Code.md +1775 -0
  189. package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +261 -0
  190. package/package.json +190 -190
  191. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +0 -851
  192. package/docs/SDK-TESTING.md +0 -364
  193. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +0 -593
  194. package/docs/startup-performance-optimization.md +0 -301
  195. package/docs/认知地图.md +0 -47
@@ -1,568 +1,568 @@
- # CrossRef — Scraping & Data Extraction
-
- `https://api.crossref.org` — scholarly DOI and citation metadata. **Never use the browser for CrossRef.** Completely free, no auth required. All workflows use `http_get`.
-
- ## Do this first
-
- **Always add `mailto=your@email.com` to every request** — it moves you into the polite pool, which doubles the rate limit and concurrency allowance. The difference is measurable and the cost is zero.
-
- ```python
- from helpers import http_get
- import json
-
- MAILTO = "mailto=your@email.com" # set once, append to every URL
-
- # Single DOI lookup — fastest way to get metadata for a known paper
- data = json.loads(http_get(f"https://api.crossref.org/works/10.1038/s41586-021-03819-2?{MAILTO}"))
- msg = data['message']
- # msg keys: DOI, title, author, published, type, container-title, volume, issue,
- # page, is-referenced-by-count, references-count, abstract (optional), ...
- ```
-
- ## Common workflows
-
- ### DOI lookup — single paper
-
- ```python
- from helpers import http_get
- import json, re
-
- MAILTO = "mailto=your@email.com"
-
- def fetch_work(doi):
-     data = json.loads(http_get(f"https://api.crossref.org/works/{doi}?{MAILTO}"))
-     return data['message']
-
- def parse_date(d):
-     """[[2021, 7, 15]] -> '2021-7-15'. Handles partial dates like [[2021]]."""
-     if not d: return None
-     parts = d.get('date-parts', [[]])[0]
-     return '-'.join(str(p) for p in parts if p is not None)
-
- def clean_abstract(raw):
-     """Strip JATS XML tags. Abstract field contains tags like <jats:p>, <jats:italic>."""
-     return re.sub(r'<[^>]+>', ' ', raw).strip() if raw else None
-
- w = fetch_work("10.1038/s41586-021-03819-2") # AlphaFold2
-
- print("DOI:", w['DOI']) # 10.1038/s41586-021-03819-2
- print("Title:", w['title'][0]) # Highly accurate protein structure...
- print("Type:", w['type']) # journal-article
- print("Publisher:", w['publisher']) # Springer Science and Business Media LLC
- print("Journal:", w.get('container-title', [''])[0]) # Nature
- print("Volume:", w.get('volume')) # 596
- print("Issue:", w.get('issue')) # 7873
- print("Page:", w.get('page')) # 583-589
- print("published:", parse_date(w.get('published'))) # 2021-7-15 (online date)
- print("published-online:", parse_date(w.get('published-online'))) # 2021-7-15
- print("published-print:", parse_date(w.get('published-print'))) # 2021-8-26
- print("Citations:", w.get('is-referenced-by-count')) # 40260
- print("References:", w.get('references-count')) # 84
- print("Abstract:", clean_abstract(w.get('abstract', ''))[:100] if w.get('abstract') else None)
- # Confirmed output (2026-04-18):
- # DOI: 10.1038/s41586-021-03819-2
- # Title: Highly accurate protein structure prediction with AlphaFold
- # Type: journal-article
- # Journal: Nature
- # Volume: 596 | Issue: 7873 | Page: 583-589
- # published: 2021-7-15 | published-print: 2021-8-26
- # Citations: 40260
- ```
-
- ### DOI lookup — extract authors with ORCID
-
- ```python
- from helpers import http_get
- import json
-
- MAILTO = "mailto=your@email.com"
- data = json.loads(http_get(f"https://api.crossref.org/works/10.1038/s41586-021-03819-2?{MAILTO}"))
- authors = data['message'].get('author', [])
-
- for a in authors[:3]:
-     name = f"{a.get('given', '')} {a.get('family', '')}".strip()
-     # ORCID is a full URL, not a bare ID — strip the prefix
-     orcid_url = a.get('ORCID') # e.g. 'https://orcid.org/0000-0001-6169-6580'
-     orcid_id = orcid_url.replace('https://orcid.org/', '') if orcid_url else None
-     authenticated = a.get('authenticated-orcid', False) # False = self-reported, True = verified
-     affiliations = [aff.get('name', '') for aff in a.get('affiliation', [])]
-     print(f"{name} | ORCID: {orcid_id} | auth={authenticated} | seq={a['sequence']}")
- # Confirmed output:
- # John Jumper | ORCID: 0000-0001-6169-6580 | auth=False | seq=first
- # Richard Evans | ORCID: None | auth=False | seq=additional
- # Alexander Pritzel | ORCID: None | auth=False | seq=additional
- ```
-
- ### Batch DOI lookup (parallel — 5 calls in ~0.3s)
-
- ```python
- from helpers import http_get
- from concurrent.futures import ThreadPoolExecutor
- import json
-
- MAILTO = "mailto=your@email.com"
-
- def fetch_work(doi):
-     try:
-         data = json.loads(http_get(f"https://api.crossref.org/works/{doi}?{MAILTO}"))
-         msg = data['message']
-         return {
-             'doi': doi,
-             'title': msg.get('title', [''])[0],
-             'year': (msg.get('published', {}).get('date-parts') or [[None]])[0][0],
-             'citations': msg.get('is-referenced-by-count'),
-             'type': msg.get('type'),
-         }
-     except Exception as e:
-         return {'doi': doi, 'error': str(e)}
-
- dois = [
-     "10.1038/nature12345",
-     "10.1038/s41586-021-03819-2",
-     "10.1056/NEJMoa2034577",
-     "10.1126/science.1260419",
-     "10.1038/s41586-024-07487-w",
- ]
-
- # max_workers=5 safe; polite pool: 10 req/s, concurrency=3 (see Rate limits)
- with ThreadPoolExecutor(max_workers=5) as ex:
-     results = list(ex.map(fetch_work, dois))
-
- for r in results:
-     print(r['year'], f"cites={r['citations']}", r['title'][:50])
- # Confirmed output (2026-04-18, ~0.296s total):
- # 2013 cites=465 LRG1 promotes angiogenesis by modulating endotheli
- # 2021 cites=40260 Highly accurate protein structure prediction with
- # 2020 cites=13752 Safety and Efficacy of the BNT162b2 mRNA Covid-19
- # 2015 cites=13553 Tissue-based map of the human proteome
- # 2024 cites=12037 Accurate structure prediction of biomolecular inte
- ```
-
- ### Search works by keyword
-
- ```python
- from helpers import http_get
- import json
-
- MAILTO = "mailto=your@email.com"
-
- # Broad keyword search
- data = json.loads(http_get(
-     f"https://api.crossref.org/works?query=machine+learning&rows=5&{MAILTO}"
- ))
- msg = data['message']
- print("Total results:", msg['total-results']) # 2,805,391
- for item in msg['items']:
-     title = item.get('title', ['(no title)'])[0][:60]
-     doi = item.get('DOI', '')
-     year = (item.get('published', {}).get('date-parts') or [[None]])[0][0]
-     type_ = item.get('type', '')
-     print(f" [{type_}] {year} {title}")
-     print(f" DOI: {doi}")
- ```
-
- ### Search by author + title (targeted)
-
- ```python
- from helpers import http_get
- import json
-
- MAILTO = "mailto=your@email.com"
-
- data = json.loads(http_get(
-     f"https://api.crossref.org/works?query.author=Lecun&query.title=deep+learning&rows=5&{MAILTO}"
- ))
- msg = data['message']
- print("Total results:", msg['total-results']) # 62
- for item in msg['items'][:3]:
-     title = item.get('title', [''])[0][:60]
-     authors = ', '.join(a.get('family', '') for a in item.get('author', [])[:2])
-     year = (item.get('published', {}).get('date-parts') or [[None]])[0][0]
-     print(f" {year} {title}")
-     print(f" Authors: {authors} DOI: {item.get('DOI')}")
- # Confirmed output:
- # 2015 Deep learning & convolutional networks
- # Authors: LeCun DOI: 10.1109/hotchips.2015.7477328
- ```
-
- ### Filter by date, type, and sort by citations
-
- ```python
- from helpers import http_get
- import json
-
- MAILTO = "mailto=your@email.com"
-
- data = json.loads(http_get(
-     f"https://api.crossref.org/works"
-     f"?filter=from-pub-date:2024-01-01,type:journal-article"
-     f"&rows=5&sort=is-referenced-by-count&order=desc&{MAILTO}"
- ))
- msg = data['message']
- print("Total 2024+ journal articles:", msg['total-results']) # 14,565,456
- for item in msg['items'][:3]:
-     title = item.get('title', [''])[0][:60]
-     cites = item.get('is-referenced-by-count', 0)
-     year = (item.get('published', {}).get('date-parts') or [[None]])[0][0]
-     print(f" {year} cites={cites} {title}")
- # Confirmed output:
- # 2024 cites=17371 Global cancer statistics 2022: GLOBOCAN estimates...
- # 2024 cites=12037 Accurate structure prediction of biomolecular int...
- ```
-
- ### Filter with `has-abstract:true`
-
- ```python
- from helpers import http_get
- import json
-
- MAILTO = "mailto=your@email.com"
-
- # Only return works that have an abstract (useful since ~30-70% do not)
- data = json.loads(http_get(
-     f"https://api.crossref.org/works"
-     f"?filter=from-pub-date:2023-01-01,until-pub-date:2023-12-31"
-     f",type:journal-article,has-abstract:true"
-     f"&rows=3&sort=is-referenced-by-count&order=desc&{MAILTO}"
- ))
- msg = data['message']
- print("2023 journal articles with abstract:", msg['total-results']) # 3,041,841
- for item in msg['items']:
-     print(item.get('title', [''])[0][:60], '| cites:', item.get('is-referenced-by-count'))
- # Confirmed output:
- # Cancer statistics, 2023 | cites: 12919
- # Evolutionary-scale prediction of atomic-level protein struct | cites: 4352
- ```
-
- ### Cursor pagination (large result sets)
-
- Standard offset pagination (`offset=`) is capped at 10,000 results. Use a cursor for full sweeps.
-
- ```python
- from helpers import http_get
- from urllib.parse import quote
- import json
-
- MAILTO = "mailto=your@email.com"
-
- # First page: cursor=*
- data = json.loads(http_get(
-     f"https://api.crossref.org/works?query=covid&rows=100&cursor=*&{MAILTO}"
- ))
- msg = data['message']
- print("Total results:", msg['total-results']) # 897,660
- items = msg['items']
- next_cursor = msg['next-cursor'] # base64 string like "DnF1ZXJ5VGhlbkZldGNoJA..."
-
- # Next pages: pass URL-encoded cursor
- while next_cursor and items:
-     data = json.loads(http_get(
-         f"https://api.crossref.org/works?query=covid&rows=100"
-         f"&cursor={quote(next_cursor)}&{MAILTO}"
-     ))
-     msg = data['message']
-     items = msg.get('items', [])
-     next_cursor = msg.get('next-cursor')
-     # process items...
-     break # remove for full sweep
- ```
-
- ### Fetch specific fields only (`select=`)
-
- Reduces response size significantly for bulk operations:
-
- ```python
- from helpers import http_get
- import json
-
- MAILTO = "mailto=your@email.com"
-
- data = json.loads(http_get(
-     f"https://api.crossref.org/works?query=cancer&rows=5"
-     f"&select=DOI,title,author&{MAILTO}"
- ))
- # Warning: if a field is absent for a record, it simply won't appear in that item
- for item in data['message']['items']:
-     print(list(item.keys())) # only ['DOI', 'title'] or ['DOI', 'title', 'author']
- # Note: select= does NOT guarantee the field appears — absent fields are just omitted
- ```
-
- ### Count by type using facets
-
- ```python
- from helpers import http_get
- import json
-
- MAILTO = "mailto=your@email.com"
-
- data = json.loads(http_get(
-     f"https://api.crossref.org/works?query=machine+learning&rows=0"
-     f"&facet=type-name:*&{MAILTO}"
- ))
- msg = data['message']
- type_facet = msg['facets']['type-name']
- for k, v in sorted(type_facet['values'].items(), key=lambda x: -x[1]):
-     print(f" {k}: {v:,}")
- # Confirmed output (all CrossRef, 2026-04-18):
- # Journal Article: 1,628,997 (for query=machine+learning scope)
- # Conference Paper: 501,433
- # Chapter: 455,907
- # Posted Content: 87,937
- # ...
- ```
-
- ### Journal info by ISSN
-
- ```python
- from helpers import http_get
- import json
-
- MAILTO = "mailto=your@email.com"
-
- # Nature (ISSN 0028-0836)
- data = json.loads(http_get(f"https://api.crossref.org/journals/0028-0836?{MAILTO}"))
- msg = data['message']
- print("Title:", msg['title']) # Nature
- print("Publisher:", msg['publisher']) # Springer Science and Business Media LLC
- print("ISSN:", msg['ISSN']) # ['0028-0836', '1476-4687']
- print("Total DOIs:", msg['counts']['total-dois']) # 445,417
- print("Subjects:", msg.get('subjects', [])) # [] (not always populated)
-
- # Search journals by name
- data2 = json.loads(http_get(f"https://api.crossref.org/journals?query=nature&rows=3&{MAILTO}"))
- for j in data2['message']['items']:
-     print(f"{j.get('title')} | ISSN: {j.get('ISSN')} | DOIs: {j.get('counts', {}).get('total-dois')}")
- # Confirmed output:
- # NatureJobs | ISSN: [] | DOIs: 0
- # Naturen | ISSN: ['0028-0887', '1504-3118'] | DOIs: 1055
- ```
-
- ### Funder search
-
- ```python
- from helpers import http_get
- import json
-
- MAILTO = "mailto=your@email.com"
-
- data = json.loads(http_get(
-     f"https://api.crossref.org/funders?query=national+science+foundation&rows=3&{MAILTO}"
- ))
- msg = data['message']
- print("Total funders:", msg['total-results']) # 108
- for f in msg['items']:
-     print(f" ID: {f['id']} | {f['name']}")
-     print(f" Alt names: {f.get('alt-names', [])[:2]}")
-     print(f" URI: {f.get('uri')}")
- # Confirmed output:
- # ID: 501100001711 | Schweizerischer Nationalfonds zur Förderung...
- # ID: 100000143 | Division of Computing and Communication Foundations
- ```
-
- ### DOI content negotiation (alternative, no CrossRef API needed)
-
- The `doi.org` resolver can return formatted metadata directly via `Accept` header:
-
- ```python
- import urllib.request, json
-
- def doi_to_csl(doi):
-     """Fetch CSL-JSON via DOI content negotiation. Same data as CrossRef API."""
-     req = urllib.request.Request(
-         f"https://doi.org/{doi}",
-         headers={"Accept": "application/vnd.citationstyles.csl+json",
-                  "User-Agent": "Mozilla/5.0"}
-     )
-     with urllib.request.urlopen(req, timeout=20) as r:
-         return json.loads(r.read().decode())
-
- def doi_to_bibtex(doi):
-     """Fetch BibTeX via DOI content negotiation."""
-     req = urllib.request.Request(
-         f"https://doi.org/{doi}",
-         headers={"Accept": "application/x-bibtex", "User-Agent": "Mozilla/5.0"}
-     )
-     with urllib.request.urlopen(req, timeout=20) as r:
-         return r.read().decode()
-
- csl = doi_to_csl("10.1038/nature12345")
- print("Title:", csl['title']) # LRG1 promotes angiogenesis...
- print("Type:", csl['type']) # journal-article
-
- bib = doi_to_bibtex("10.1038/nature12345")
- print(bib[:200])
- # @article{Wang_2013, title={LRG1 promotes angiogenesis...
- ```
-
397
- ## Field reference
398
-
399
- ### Work object — complete field list
400
-
401
- Fields marked (R) are always present; every other field may be absent.
402
-
403
- | Field | Type | Notes |
404
- |---|---|---|
405
- | `DOI` (R) | string | e.g. `"10.1038/s41586-021-03819-2"` |
406
- | `URL` (R) | string | `"https://doi.org/10.1038/s41586-021-03819-2"` |
407
- | `title` (R) | list[str] | Always a list; access `title[0]` |
408
- | `type` (R) | string | e.g. `"journal-article"` — see type table below |
409
- | `publisher` | string | |
410
- | `container-title` | list[str] | Journal name; access `[0]` |
411
- | `short-container-title` | list[str] | Abbreviated journal name |
412
- | `ISSN` | list[str] | May contain print and online ISSN |
413
- | `volume` | string | Note: string not int (`"596"`) |
414
- | `issue` | string | |
415
- | `page` | string | e.g. `"583-589"` |
416
- | `author` | list[object] | See author fields below |
417
- | `published` | date-object | Best single date — use this |
418
- | `published-online` | date-object | Online-first date |
419
- | `published-print` | date-object | Print edition date |
420
- | `issued` | date-object | Usually same as `published` |
421
- | `is-referenced-by-count` | int | Inbound citations to this work |
422
- | `references-count` | int | Outbound references from this work |
423
- | `reference` | list[object] | Full reference list (when deposited) |
424
- | `abstract` | string | JATS XML markup; ~30-70% of works; strip tags before use |
425
- | `subject` | list[str] | Subject classification (often empty) |
426
- | `language` | string | e.g. `"en"` |
427
- | `license` | list[object] | Each: `{URL, start, delay-in-days, content-version}` |
428
- | `funder` | list[object] | Each: `{name, DOI, award}` |
429
- | `link` | list[object] | Full-text links |
430
- | `relation` | object | Related DOIs (e.g. preprint → article) |
431
- | `assertion` | list[object] | Publisher-specific metadata |
432
- | `alternative-id` | list[str] | Publisher's internal IDs |
433
- | `member` | string | CrossRef member ID |
434
- | `prefix` | string | DOI prefix |
435
- | `score` | float | Relevance score (search results only) |
436
- | `source` | string | e.g. `"Crossref"` |
437
- | `indexed` | date-object | When CrossRef indexed this record |
438
- | `deposited` | date-object | When publisher last deposited metadata |
439
- | `created` | date-object | When CrossRef record was first created |
440
-
441
- ### Author object fields
442
-
443
- | Field | Notes |
444
- |---|---|
445
- | `given` | Given/first name |
446
- | `family` | Family/last name |
447
- | `sequence` | `"first"` or `"additional"` |
448
- | `affiliation` | list of `{name, place}` — usually `[]` |
449
- | `ORCID` | Full URL `"https://orcid.org/0000-0001-..."` — strip prefix to get bare ID |
450
- | `authenticated-orcid` | `true` = verified via ORCID OAuth; `false` = self-reported |
451
- | `name` | Used instead of given/family for organizations |
452
-
453
- ### Date object structure
454
-
455
- ```python
456
- # All date fields share this structure:
457
- date_obj = {
458
- "date-parts": [[2021, 7, 15]], # [[year, month, day]] — month/day may be absent
459
- "date-time": "2021-07-15T00:00:00Z", # not always present
460
- "timestamp": 1626307200000 # not always present
461
- }
462
-
463
- # Safe extraction (handles [[2021]] or [[2021, 7]] partial dates):
464
- def parse_date(d):
465
- if not d: return None
466
- parts = (d.get('date-parts') or [[]])[0]
467
- return '-'.join(str(p) for p in parts if p is not None)
468
- ```
469
-
470
- ### Type identifiers (filter param values vs facet display names)
471
-
472
- Use these exact strings in `filter=type:...`. The facet `type-name` values are display names only.
473
-
474
- | filter `type:` value | Facet display name | Count (all CrossRef) |
475
- |---|---|---|
476
- | `journal-article` | Journal Article | 121,030,194 |
477
- | `book-chapter` | Chapter | 24,359,059 |
478
- | `proceedings-article` | Conference Paper | 9,744,754 |
479
- | `dataset` | Dataset | 3,424,142 |
480
- | `posted-content` | Posted Content (preprints) | 3,203,320 |
481
- | `dissertation` | Dissertation | 1,044,461 |
482
- | `peer-review` | Peer Review | 1,028,287 |
483
- | `report` | Report | 906,301 |
484
- | `book` | Book | 870,949 |
485
- | `monograph` | Monograph | 788,401 |
486
-
487
- ### Query parameters reference
488
-
489
- | Parameter | Notes |
490
- |---|---|
491
- | `query` | Full-text keyword search across title, abstract, author |
492
- | `query.author` | Author name search only |
493
- | `query.title` | Title search only |
494
- | `query.bibliographic` | Combined title + author + journal search |
495
- | `rows` | Results per page (default 20, max 1000) |
496
- | `offset` | Offset for pagination (max ~10,000 effective) |
497
- | `cursor` | Use `cursor=*` for first page, then URL-encode `next-cursor` value |
498
- | `sort` | `relevance`, `is-referenced-by-count`, `published`, `indexed` |
499
- | `order` | `asc` or `desc` |
500
- | `filter` | Comma-separated `key:value` pairs (see filters below) |
501
- | `select` | Comma-separated field names to return |
502
- | `facet` | `type-name:*` for type counts; `publisher-name:10` for top publishers |
503
- | `mailto` | Your email — enables polite pool (higher limits) |
504
-
505
- ### Filter keys reference
506
-
507
- | Filter key | Example | Notes |
508
- |---|---|---|
509
- | `doi` | `doi:10.1038/nature12345` | Exact DOI match |
510
- | `type` | `type:journal-article` | See type table above for valid values |
511
- | `from-pub-date` | `from-pub-date:2024-01-01` | ISO date or `YYYY` |
512
- | `until-pub-date` | `until-pub-date:2024-12-31` | |
513
- | `from-index-date` | `from-index-date:2024-01-01` | When CrossRef indexed it |
514
- | `has-abstract` | `has-abstract:true` | Only works with deposited abstract |
515
- | `has-orcid` | `has-orcid:true` | At least one author has ORCID |
516
- | `has-full-text` | `has-full-text:true` | Has full-text link |
517
- | `has-references` | `has-references:true` | Has deposited reference list |
518
- | `is-update` | `is-update:true` | Corrections, retractions |
519
- | `issn` | `issn:0028-0836` | Filter by journal ISSN |
520
- | `publisher-name` | `publisher-name:elsevier` | Partial match |
521
- | `funder` | `funder:100000001` | Funder DOI or CrossRef funder ID |
522
-
523
- ## Rate limits
524
-
525
- CrossRef has two pools based on whether `mailto=` is present:
526
-
527
- | Pool | Triggered by | Rate limit | Concurrency |
528
- |---|---|---|---|
529
- | **polite** | `mailto=` param present | 10 req/s | 3 concurrent |
530
- | **public** | no `mailto=` | 5 req/s | 1 concurrent |
531
-
532
- Headers returned: `x-rate-limit-limit`, `x-rate-limit-interval`, `x-concurrency-limit`, `x-api-pool`.
533
-
534
- In practice with polite pool: 10 rapid sequential calls complete in ~2.7s (avg 0.27s/req) with no throttling. 5 parallel calls complete in ~0.3s. `max_workers=5` ran without throttling in these tests, but it exceeds the advertised concurrency limit of 3; use `max_workers=3` to stay strictly within it.
535
-
536
- No per-day or per-hour cap. If you exceed limits, responses slow or return HTTP 429. No ban. Add `time.sleep(0.1)` between calls for sustained bulk crawls.
537
-
538
- ## Gotchas
539
-
540
- - **`mailto=` doubles your rate limit and triples concurrency.** Public pool: 5 req/s, concurrency=1. Polite pool: 10 req/s, concurrency=3. Always add `mailto=your@email.com` to every request; the `x-api-pool` response header shows which pool served it.
541
-
542
- - **`title`, `container-title`, `ISSN` are always lists, not strings.** Access with `title[0]`, `container-title[0]` etc. Do not rely on there being only one entry — `container-title` can have multiple values.
543
-
544
- - **Abstract contains JATS XML markup.** The `abstract` field is not plain text; it contains tags like `<jats:p>`, `<jats:italic>`, `<jats:sup>`. Strip with `re.sub(r'<[^>]+>', ' ', abstract)`. Only about 30-70% of works have an abstract at all: for 2023 journal articles, the `has-abstract:true` filter matched 3,041,841 of ~5.5M records (~55%).
545
-
546
- - **ORCID is a full URL, not just the ID.** `a['ORCID']` = `"https://orcid.org/0000-0001-6169-6580"`. Strip with `.replace('https://orcid.org/', '')` to get the bare ID. `authenticated-orcid: false` means self-asserted (not verified via OAuth).
547
-
548
- - **`published` vs `published-print` vs `published-online`.** Online-first is common in journals — a paper may be online months before its print issue. `published` is CrossRef's best single date and equals `published-online` when both exist. For preprints (`posted-content` type), look for `posted` instead of `published-print` — it may only have `posted` and `published`. Partial dates like `[[2023]]` (year only) are valid — always use `parse_date()` to handle missing month/day.
549
-
550
- - **404 raises `HTTPError`, not a JSON error response.** An invalid DOI (e.g. `10.9999/doesnotexist`) raises `urllib.error.HTTPError: HTTP Error 404: Not Found`. Wrap `fetch_work()` in try/except for any untrusted DOI list.
551
-
552
- - **`volume` and `issue` are strings, not integers.** CrossRef stores them as strings — `"596"`, not `596`. Don't compare with `==` to an int.
553
-
554
- - **Filter type values are hyphenated lowercase, not the facet display names.** `filter=type:journal-article` works. `filter=type:journal article`, `filter=type:Journal Article`, and `filter=type:conference-paper` all return HTTP 400. Conference papers are `proceedings-article`.
555
-
556
- - **`select=` does not guarantee field presence.** When you `select=DOI,title,author`, a record that has no author still omits the `author` key — it doesn't return `author: []`. Always use `.get()`.
557
-
558
- - **Cursor pagination required for >10,000 results.** Offset pagination (`offset=`) is limited to around 10,000 results. For bulk sweeps, use `cursor=*` for the first page, then URL-encode the returned `next-cursor` value with `urllib.parse.quote()`. The cursor expires if unused for too long.
559
-
560
- - **`rows` max is 1000 per call.** Requesting more silently returns 1000. For cursor-based sweeps of large result sets (millions of records), `rows=1000` with cursor is the most efficient approach.
561
-
562
- - **HTML entities in titles.** Titles may contain HTML entities like `&amp;` — `"Deep learning &amp; convolutional networks"`. Decode with `html.unescape()` if needed.
563
-
564
- - **`funder` search `works-count` field is `None`.** The funder search result object has a `works-count` key that is always `None` in the search response. To get actual work counts for a funder, fetch the funder directly: `GET /funders/{id}`.
565
-
566
- - **`subject` is often an empty list.** The `subject` field in works is populated inconsistently — many journal articles have `subject: []` even for well-indexed journals like Nature.
567
-
568
- - **Affiliation is usually empty.** `author[i]['affiliation']` is `[]` for the majority of records, even for papers published in 2024. CrossRef has been working on affiliation deposit, but coverage is inconsistent.
1
+ # CrossRef — Scraping & Data Extraction
2
+
3
+ `https://api.crossref.org` — scholarly DOI and citation metadata. **Never use the browser for CrossRef.** Completely free, no auth required. All workflows use `http_get`.
4
+
5
+ ## Do this first
6
+
7
+ **Always add `mailto=your@email.com` to every request.** It moves you into the polite pool, which doubles the rate limit and triples the concurrency allowance (see Rate limits). The difference is measurable and the cost is zero.
8
+
9
+ ```python
10
+ from helpers import http_get
11
+ import json
12
+
13
+ MAILTO = "mailto=your@email.com" # set once, append to every URL
14
+
15
+ # Single DOI lookup — fastest way to get metadata for a known paper
16
+ data = json.loads(http_get(f"https://api.crossref.org/works/10.1038/s41586-021-03819-2?{MAILTO}"))
17
+ msg = data['message']
18
+ # msg keys: DOI, title, author, published, type, container-title, volume, issue,
19
+ # page, is-referenced-by-count, references-count, abstract (optional), ...
20
+ ```
21
+
22
+ ## Common workflows
23
+
24
+ ### DOI lookup — single paper
25
+
26
+ ```python
27
+ from helpers import http_get
28
+ import json, re
29
+
30
+ MAILTO = "mailto=your@email.com"
31
+
32
+ def fetch_work(doi):
33
+ data = json.loads(http_get(f"https://api.crossref.org/works/{doi}?{MAILTO}"))
34
+ return data['message']
35
+
36
+ def parse_date(d):
37
+ """[[2021, 7, 15]] -> '2021-7-15'. Handles partial dates like [[2021]]."""
38
+ if not d: return None
39
+ parts = (d.get('date-parts') or [[]])[0]  # `or` also covers an explicit null value
40
+ return '-'.join(str(p) for p in parts if p is not None)
41
+
42
+ def clean_abstract(raw):
43
+ """Strip JATS XML tags. Abstract field contains tags like <jats:p>, <jats:italic>."""
44
+ return re.sub(r'<[^>]+>', ' ', raw).strip() if raw else None
45
+
46
+ w = fetch_work("10.1038/s41586-021-03819-2") # AlphaFold2
47
+
48
+ print("DOI:", w['DOI']) # 10.1038/s41586-021-03819-2
49
+ print("Title:", w['title'][0]) # Highly accurate protein structure...
50
+ print("Type:", w['type']) # journal-article
51
+ print("Publisher:", w['publisher']) # Springer Science and Business Media LLC
52
+ print("Journal:", w.get('container-title', [''])[0]) # Nature
53
+ print("Volume:", w.get('volume')) # 596
54
+ print("Issue:", w.get('issue')) # 7873
55
+ print("Page:", w.get('page')) # 583-589
56
+ print("published:", parse_date(w.get('published'))) # 2021-7-15 (online date)
57
+ print("published-online:", parse_date(w.get('published-online'))) # 2021-7-15
58
+ print("published-print:", parse_date(w.get('published-print'))) # 2021-8-26
59
+ print("Citations:", w.get('is-referenced-by-count')) # 40260
60
+ print("References:", w.get('references-count')) # 84
61
+ print("Abstract:", clean_abstract(w.get('abstract', ''))[:100] if w.get('abstract') else None)
62
+ # Confirmed output (2026-04-18):
63
+ # DOI: 10.1038/s41586-021-03819-2
64
+ # Title: Highly accurate protein structure prediction with AlphaFold
65
+ # Type: journal-article
66
+ # Journal: Nature
67
+ # Volume: 596 | Issue: 7873 | Page: 583-589
68
+ # published: 2021-7-15 | published-print: 2021-8-26
69
+ # Citations: 40260
70
+ ```
71
+
72
+ ### DOI lookup — extract authors with ORCID
73
+
74
+ ```python
75
+ from helpers import http_get
76
+ import json
77
+
78
+ MAILTO = "mailto=your@email.com"
79
+ data = json.loads(http_get(f"https://api.crossref.org/works/10.1038/s41586-021-03819-2?{MAILTO}"))
80
+ authors = data['message'].get('author', [])
81
+
82
+ for a in authors[:3]:
83
+ name = f"{a.get('given', '')} {a.get('family', '')}".strip()
84
+ # ORCID is a full URL, not a bare ID — strip the prefix
85
+ orcid_url = a.get('ORCID') # e.g. 'https://orcid.org/0000-0001-6169-6580'
86
+ orcid_id = orcid_url.replace('https://orcid.org/', '') if orcid_url else None
87
+ authenticated = a.get('authenticated-orcid', False) # False = self-reported, True = verified
88
+ affiliations = [aff.get('name', '') for aff in a.get('affiliation', [])]
89
+ print(f"{name} | ORCID: {orcid_id} | auth={authenticated} | seq={a['sequence']}")
90
+ # Confirmed output:
91
+ # John Jumper | ORCID: 0000-0001-6169-6580 | auth=False | seq=first
92
+ # Richard Evans | ORCID: None | auth=False | seq=additional
93
+ # Alexander Pritzel | ORCID: None | auth=False | seq=additional
94
+ ```
95
+
96
+ ### Batch DOI lookup (parallel — 5 calls in ~0.3s)
97
+
98
+ ```python
99
+ from helpers import http_get
100
+ from concurrent.futures import ThreadPoolExecutor
101
+ import json
102
+
103
+ MAILTO = "mailto=your@email.com"
104
+
105
+ def fetch_work(doi):
106
+ try:
107
+ data = json.loads(http_get(f"https://api.crossref.org/works/{doi}?{MAILTO}"))
108
+ msg = data['message']
109
+ return {
110
+ 'doi': doi,
111
+ 'title': msg.get('title', [''])[0],
112
+ 'year': (msg.get('published', {}).get('date-parts') or [[None]])[0][0],
113
+ 'citations': msg.get('is-referenced-by-count'),
114
+ 'type': msg.get('type'),
115
+ }
116
+ except Exception as e:
117
+ return {'doi': doi, 'error': str(e)}
118
+
119
+ dois = [
120
+ "10.1038/nature12345",
121
+ "10.1038/s41586-021-03819-2",
122
+ "10.1056/NEJMoa2034577",
123
+ "10.1126/science.1260419",
124
+ "10.1038/s41586-024-07487-w",
125
+ ]
126
+
127
+ # max_workers=5 ran without throttling in testing; the polite pool advertises 10 req/s, concurrency=3 (see Rate limits), so drop to 3 to stay strictly within it
128
+ with ThreadPoolExecutor(max_workers=5) as ex:
129
+ results = list(ex.map(fetch_work, dois))
130
+
131
+ for r in results:
132
+ print(r['year'], f"cites={r['citations']}", r['title'][:50])
133
+ # Confirmed output (2026-04-18, ~0.296s total):
134
+ # 2013 cites=465 LRG1 promotes angiogenesis by modulating endotheli
135
+ # 2021 cites=40260 Highly accurate protein structure prediction with
136
+ # 2020 cites=13752 Safety and Efficacy of the BNT162b2 mRNA Covid-19
137
+ # 2015 cites=13553 Tissue-based map of the human proteome
138
+ # 2024 cites=12037 Accurate structure prediction of biomolecular inte
139
+ ```
140
+
141
+ ### Search works by keyword
142
+
143
+ ```python
144
+ from helpers import http_get
145
+ import json
146
+
147
+ MAILTO = "mailto=your@email.com"
148
+
149
+ # Broad keyword search
150
+ data = json.loads(http_get(
151
+ f"https://api.crossref.org/works?query=machine+learning&rows=5&{MAILTO}"
152
+ ))
153
+ msg = data['message']
154
+ print("Total results:", msg['total-results']) # 2,805,391
155
+ for item in msg['items']:
156
+ title = item.get('title', ['(no title)'])[0][:60]
157
+ doi = item.get('DOI', '')
158
+ year = (item.get('published', {}).get('date-parts') or [[None]])[0][0]
159
+ type_ = item.get('type', '')
160
+ print(f" [{type_}] {year} {title}")
161
+ print(f" DOI: {doi}")
162
+ ```
163
+
164
+ ### Search by author + title (targeted)
165
+
166
+ ```python
167
+ from helpers import http_get
168
+ import json
169
+
170
+ MAILTO = "mailto=your@email.com"
171
+
172
+ data = json.loads(http_get(
173
+ f"https://api.crossref.org/works?query.author=Lecun&query.title=deep+learning&rows=5&{MAILTO}"
174
+ ))
175
+ msg = data['message']
176
+ print("Total results:", msg['total-results']) # 62
177
+ for item in msg['items'][:3]:
178
+ title = item.get('title', [''])[0][:60]
179
+ authors = ', '.join(a.get('family', '') for a in item.get('author', [])[:2])
180
+ year = (item.get('published', {}).get('date-parts') or [[None]])[0][0]
181
+ print(f" {year} {title}")
182
+ print(f" Authors: {authors} DOI: {item.get('DOI')}")
183
+ # Confirmed output:
184
+ # 2015 Deep learning & convolutional networks
185
+ # Authors: LeCun DOI: 10.1109/hotchips.2015.7477328
186
+ ```
187
+
188
+ ### Filter by date, type, and sort by citations
189
+
190
+ ```python
191
+ from helpers import http_get
192
+ import json
193
+
194
+ MAILTO = "mailto=your@email.com"
195
+
196
+ data = json.loads(http_get(
197
+ f"https://api.crossref.org/works"
198
+ f"?filter=from-pub-date:2024-01-01,type:journal-article"
199
+ f"&rows=5&sort=is-referenced-by-count&order=desc&{MAILTO}"
200
+ ))
201
+ msg = data['message']
202
+ print("Total 2024+ journal articles:", msg['total-results']) # 14,565,456
203
+ for item in msg['items'][:3]:
204
+ title = item.get('title', [''])[0][:60]
205
+ cites = item.get('is-referenced-by-count', 0)
206
+ year = (item.get('published', {}).get('date-parts') or [[None]])[0][0]
207
+ print(f" {year} cites={cites} {title}")
208
+ # Confirmed output:
209
+ # 2024 cites=17371 Global cancer statistics 2022: GLOBOCAN estimates...
210
+ # 2024 cites=12037 Accurate structure prediction of biomolecular int...
211
+ ```
212
+
213
+ ### Filter with `has-abstract:true`
214
+
215
+ ```python
216
+ from helpers import http_get
217
+ import json
218
+
219
+ MAILTO = "mailto=your@email.com"
220
+
221
+ # Only return works that have an abstract (useful since ~30-70% do not)
222
+ data = json.loads(http_get(
223
+ f"https://api.crossref.org/works"
224
+ f"?filter=from-pub-date:2023-01-01,until-pub-date:2023-12-31"
225
+ f",type:journal-article,has-abstract:true"
226
+ f"&rows=3&sort=is-referenced-by-count&order=desc&{MAILTO}"
227
+ ))
228
+ msg = data['message']
229
+ print("2023 journal articles with abstract:", msg['total-results']) # 3,041,841
230
+ for item in msg['items']:
231
+ print(item.get('title', [''])[0][:60], '| cites:', item.get('is-referenced-by-count'))
232
+ # Confirmed output:
233
+ # Cancer statistics, 2023 | cites: 12919
234
+ # Evolutionary-scale prediction of atomic-level protein struct | cites: 4352
235
+ ```
236
+
237
+ ### Cursor pagination (large result sets)
238
+
239
+ Standard offset pagination (`offset=`) caps at around 10,000 results. Use cursor for full sweeps.
240
+
241
+ ```python
242
+ from helpers import http_get
243
+ from urllib.parse import quote
244
+ import json
245
+
246
+ MAILTO = "mailto=your@email.com"
247
+
248
+ # First page: cursor=*
249
+ data = json.loads(http_get(
250
+ f"https://api.crossref.org/works?query=covid&rows=100&cursor=*&{MAILTO}"
251
+ ))
252
+ msg = data['message']
253
+ print("Total results:", msg['total-results']) # 897,660
254
+ items = msg['items']
255
+ next_cursor = msg['next-cursor'] # base64 string like "DnF1ZXJ5VGhlbkZldGNoJA..."
256
+
257
+ # Next pages: pass URL-encoded cursor
258
+ while next_cursor and items:
259
+ data = json.loads(http_get(
260
+ f"https://api.crossref.org/works?query=covid&rows=100"
261
+ f"&cursor={quote(next_cursor)}&{MAILTO}"
262
+ ))
263
+ msg = data['message']
264
+ items = msg.get('items', [])
265
+ next_cursor = msg.get('next-cursor')
266
+ # process items...
267
+ break # remove for full sweep
268
+ ```
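The loop above can be wrapped in a generator so callers iterate over records without managing the cursor themselves. This is a minimal sketch, not part of the original skill: `iter_works` and its `fetch` parameter are hypothetical names, and `fetch` is assumed to behave like `http_get` (URL in, response body string out).

```python
import json
from urllib.parse import quote


def iter_works(fetch, query, rows=100, mailto="your@email.com", max_pages=None):
    """Yield work items page by page using cursor pagination.

    fetch: callable taking a URL and returning the response body as a string
           (assumed to behave like the http_get helper used above).
    """
    cursor = "*"  # the first page always starts with cursor=*
    pages = 0
    while cursor and (max_pages is None or pages < max_pages):
        url = (
            "https://api.crossref.org/works"
            f"?query={quote(query)}&rows={rows}"
            # safe="*" keeps the initial cursor literal; later cursors are percent-encoded
            f"&cursor={quote(cursor, safe='*')}&mailto={quote(mailto, safe='@')}"
        )
        msg = json.loads(fetch(url))["message"]
        items = msg.get("items", [])
        if not items:  # an empty page means the sweep is finished
            return
        yield from items
        cursor = msg.get("next-cursor")
        pages += 1
```

Passing `max_pages=1` plays the same role as the `break` in the loop above: it stops after the first page while the rest of the pipeline is being tested.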
269
+
270
+ ### Fetch specific fields only (`select=`)
271
+
272
+ Reduces response size significantly for bulk operations:
273
+
274
+ ```python
275
+ from helpers import http_get
276
+ import json
277
+
278
+ MAILTO = "mailto=your@email.com"
279
+
280
+ data = json.loads(http_get(
281
+ f"https://api.crossref.org/works?query=cancer&rows=5"
282
+ f"&select=DOI,title,author&{MAILTO}"
283
+ ))
284
+ # Warning: if a field is absent for a record, it simply won't appear in that item
285
+ for item in data['message']['items']:
286
+ print(list(item.keys())) # only ['DOI', 'title'] or ['DOI', 'title', 'author']
287
+ # Note: select= does NOT guarantee the field appears — absent fields are just omitted
288
+ ```
289
+
290
+ ### Count by type using facets
291
+
292
+ ```python
293
+ from helpers import http_get
294
+ import json
295
+
296
+ MAILTO = "mailto=your@email.com"
297
+
298
+ data = json.loads(http_get(
299
+ f"https://api.crossref.org/works?query=machine+learning&rows=0"
300
+ f"&facet=type-name:*&{MAILTO}"
301
+ ))
302
+ msg = data['message']
303
+ type_facet = msg['facets']['type-name']
304
+ for k, v in sorted(type_facet['values'].items(), key=lambda x: -x[1]):
305
+ print(f" {k}: {v:,}")
306
+ # Confirmed output (scoped to query=machine+learning, 2026-04-18):
307
+ # Journal Article: 1,628,997
308
+ # Conference Paper: 501,433
309
+ # Chapter: 455,907
310
+ # Posted Content: 87,937
311
+ # ...
312
+ ```
313
+
314
+ ### Journal info by ISSN
315
+
316
+ ```python
317
+ from helpers import http_get
318
+ import json
319
+
320
+ MAILTO = "mailto=your@email.com"
321
+
322
+ # Nature (ISSN 0028-0836)
323
+ data = json.loads(http_get(f"https://api.crossref.org/journals/0028-0836?{MAILTO}"))
324
+ msg = data['message']
325
+ print("Title:", msg['title']) # Nature
326
+ print("Publisher:", msg['publisher']) # Springer Science and Business Media LLC
327
+ print("ISSN:", msg['ISSN']) # ['0028-0836', '1476-4687']
328
+ print("Total DOIs:", msg['counts']['total-dois']) # 445,417
329
+ print("Subjects:", msg.get('subjects', [])) # [] (not always populated)
330
+
331
+ # Search journals by name
332
+ data2 = json.loads(http_get(f"https://api.crossref.org/journals?query=nature&rows=3&{MAILTO}"))
333
+ for j in data2['message']['items']:
334
+ print(f"{j.get('title')} | ISSN: {j.get('ISSN')} | DOIs: {j.get('counts', {}).get('total-dois')}")
335
+ # Confirmed output:
336
+ # NatureJobs | ISSN: [] | DOIs: 0
337
+ # Naturen | ISSN: ['0028-0887', '1504-3118'] | DOIs: 1055
338
+ ```
339
+
340
+ ### Funder search
341
+
342
+ ```python
343
+ from helpers import http_get
344
+ import json
345
+
346
+ MAILTO = "mailto=your@email.com"
347
+
348
+ data = json.loads(http_get(
349
+ f"https://api.crossref.org/funders?query=national+science+foundation&rows=3&{MAILTO}"
350
+ ))
351
+ msg = data['message']
352
+ print("Total funders:", msg['total-results']) # 108
353
+ for f in msg['items']:
354
+ print(f" ID: {f['id']} | {f['name']}")
355
+ print(f" Alt names: {f.get('alt-names', [])[:2]}")
356
+ print(f" URI: {f.get('uri')}")
357
+ # Confirmed output:
358
+ # ID: 501100001711 | Schweizerischer Nationalfonds zur Förderung...
359
+ # ID: 100000143 | Division of Computing and Communication Foundations
360
+ ```
361
+
362
+ ### DOI content negotiation (alternative, no CrossRef API needed)
363
+
364
+ The `doi.org` resolver can return formatted metadata directly via `Accept` header:
365
+
366
+ ```python
367
+ import urllib.request, json
368
+
369
+ def doi_to_csl(doi):
370
+ """Fetch CSL-JSON via DOI content negotiation. Same record as the CrossRef API, but in CSL-JSON shape (title is a string, not a list)."""
371
+ req = urllib.request.Request(
372
+ f"https://doi.org/{doi}",
373
+ headers={"Accept": "application/vnd.citationstyles.csl+json",
374
+ "User-Agent": "Mozilla/5.0"}
375
+ )
376
+ with urllib.request.urlopen(req, timeout=20) as r:
377
+ return json.loads(r.read().decode())
378
+
379
+ def doi_to_bibtex(doi):
380
+ """Fetch BibTeX via DOI content negotiation."""
381
+ req = urllib.request.Request(
382
+ f"https://doi.org/{doi}",
383
+ headers={"Accept": "application/x-bibtex", "User-Agent": "Mozilla/5.0"}
384
+ )
385
+ with urllib.request.urlopen(req, timeout=20) as r:
386
+ return r.read().decode()
387
+
388
+ csl = doi_to_csl("10.1038/nature12345")
389
+ print("Title:", csl['title']) # LRG1 promotes angiogenesis...
390
+ print("Type:", csl['type']) # journal-article
391
+
392
+ bib = doi_to_bibtex("10.1038/nature12345")
393
+ print(bib[:200])
394
+ # @article{Wang_2013, title={LRG1 promotes angiogenesis...
395
+ ```
396
+
397
+ ## Field reference
398
+
399
+ ### Work object — complete field list
400
+
401
+ Fields marked (R) are always present; every other field may be absent.
402
+
403
+ | Field | Type | Notes |
404
+ |---|---|---|
405
+ | `DOI` (R) | string | e.g. `"10.1038/s41586-021-03819-2"` |
406
+ | `URL` (R) | string | `"https://doi.org/10.1038/s41586-021-03819-2"` |
407
+ | `title` (R) | list[str] | Always a list; access `title[0]` |
408
+ | `type` (R) | string | e.g. `"journal-article"` — see type table below |
409
+ | `publisher` | string | |
410
+ | `container-title` | list[str] | Journal name; access `[0]` |
411
+ | `short-container-title` | list[str] | Abbreviated journal name |
412
+ | `ISSN` | list[str] | May contain print and online ISSN |
413
+ | `volume` | string | Note: string not int (`"596"`) |
414
+ | `issue` | string | |
415
+ | `page` | string | e.g. `"583-589"` |
416
+ | `author` | list[object] | See author fields below |
417
+ | `published` | date-object | Best single date — use this |
418
+ | `published-online` | date-object | Online-first date |
419
+ | `published-print` | date-object | Print edition date |
420
+ | `issued` | date-object | Usually same as `published` |
421
+ | `is-referenced-by-count` | int | Inbound citations to this work |
422
+ | `references-count` | int | Outbound references from this work |
423
+ | `reference` | list[object] | Full reference list (when deposited) |
424
+ | `abstract` | string | JATS XML markup; ~30-70% of works; strip tags before use |
425
+ | `subject` | list[str] | Subject classification (often empty) |
426
+ | `language` | string | e.g. `"en"` |
427
+ | `license` | list[object] | Each: `{URL, start, delay-in-days, content-version}` |
428
+ | `funder` | list[object] | Each: `{name, DOI, award}` |
429
+ | `link` | list[object] | Full-text links |
430
+ | `relation` | object | Related DOIs (e.g. preprint → article) |
431
+ | `assertion` | list[object] | Publisher-specific metadata |
432
+ | `alternative-id` | list[str] | Publisher's internal IDs |
433
+ | `member` | string | CrossRef member ID |
434
+ | `prefix` | string | DOI prefix |
435
+ | `score` | float | Relevance score (search results only) |
436
+ | `source` | string | e.g. `"Crossref"` |
437
+ | `indexed` | date-object | When CrossRef indexed this record |
438
+ | `deposited` | date-object | When publisher last deposited metadata |
439
+ | `created` | date-object | When CrossRef record was first created |
440
+
441
+ ### Author object fields
442
+
443
+ | Field | Notes |
444
+ |---|---|
445
+ | `given` | Given/first name |
446
+ | `family` | Family/last name |
447
+ | `sequence` | `"first"` or `"additional"` |
448
+ | `affiliation` | list of `{name, place}` — usually `[]` |
449
+ | `ORCID` | Full URL `"https://orcid.org/0000-0001-..."` — strip prefix to get bare ID |
450
+ | `authenticated-orcid` | `true` = verified via ORCID OAuth; `false` = self-reported |
451
+ | `name` | Used instead of given/family for organizations |
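The author fields in the table can be flattened into one consistent record per author. The sketch below is illustrative only: `normalize_author` and its output keys are hypothetical names, and it assumes the ORCID value is a URL whose last path segment is the bare ID.

```python
def normalize_author(a):
    """Flatten one CrossRef author object into a plain dict."""
    # Organizations carry `name` instead of given/family.
    full = f"{a.get('given', '')} {a.get('family', '')}".strip() or a.get("name")
    orcid = a.get("ORCID")
    if orcid:
        # Keep only the bare ID after the last slash, whatever the URL prefix.
        orcid = orcid.rstrip("/").rsplit("/", 1)[-1]
    return {
        "name": full,
        "orcid": orcid,
        "orcid_verified": bool(a.get("authenticated-orcid", False)),
        "sequence": a.get("sequence"),
        # Affiliation is usually [], so this list is often empty.
        "affiliations": [
            aff.get("name") for aff in a.get("affiliation", []) if aff.get("name")
        ],
    }
```

Every key is read with `.get()`, so a record missing `sequence` or `affiliation` yields `None` or `[]` rather than raising `KeyError`.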
452
+
453
+ ### Date object structure
454
+
455
+ ```python
456
+ # All date fields share this structure:
457
+ date_obj = {
458
+ "date-parts": [[2021, 7, 15]], # [[year, month, day]] — month/day may be absent
459
+ "date-time": "2021-07-15T00:00:00Z", # not always present
460
+ "timestamp": 1626307200000 # not always present
461
+ }
462
+
463
+ # Safe extraction (handles [[2021]] or [[2021, 7]] partial dates):
464
+ def parse_date(d):
465
+ if not d: return None
466
+ parts = (d.get('date-parts') or [[]])[0]
467
+ return '-'.join(str(p) for p in parts if p is not None)
468
+ ```
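`parse_date()` joins the parts without padding, so it returns `2021-7-15` rather than an ISO 8601 string. Where sortable, zero-padded dates are needed, a variant along these lines may help. This is a sketch: `iso_date` is a hypothetical helper, and it assumes the values in `date-parts` are integers (or numeric strings).

```python
def iso_date(d):
    """Return 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD' from a CrossRef date object, else None."""
    if not d:
        return None
    parts = (d.get("date-parts") or [[]])[0]
    parts = [int(p) for p in parts if p is not None]
    if not parts:
        return None
    year, rest = parts[0], parts[1:]
    # Zero-pad month and day so the strings sort chronologically.
    return "-".join([f"{year:04d}"] + [f"{p:02d}" for p in rest])


print(iso_date({"date-parts": [[2021, 7, 15]]}))  # 2021-07-15
print(iso_date({"date-parts": [[2021]]}))         # 2021
```

Unlike `parse_date()`, this version returns `None` instead of an empty string when `date-parts` holds no usable values.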
469
+
470
+ ### Type identifiers (filter param values vs facet display names)
471
+
472
+ Use these exact strings in `filter=type:...`. The facet `type-name` values are display names only.
473
+
474
+ | filter `type:` value | Facet display name | Count (all CrossRef) |
475
+ |---|---|---|
476
+ | `journal-article` | Journal Article | 121,030,194 |
477
+ | `book-chapter` | Chapter | 24,359,059 |
478
+ | `proceedings-article` | Conference Paper | 9,744,754 |
479
+ | `dataset` | Dataset | 3,424,142 |
480
+ | `posted-content` | Posted Content (preprints) | 3,203,320 |
481
+ | `dissertation` | Dissertation | 1,044,461 |
482
+ | `peer-review` | Peer Review | 1,028,287 |
483
+ | `report` | Report | 906,301 |
484
+ | `book` | Book | 870,949 |
485
+ | `monograph` | Monograph | 788,401 |
486
+
487
+ ### Query parameters reference
488
+
489
+ | Parameter | Notes |
490
+ |---|---|
491
+ | `query` | Full-text keyword search across title, abstract, author |
492
+ | `query.author` | Author name search only |
493
+ | `query.title` | Title search only |
494
+ | `query.bibliographic` | Combined title + author + journal search |
495
+ | `rows` | Results per page (default 20, max 1000) |
496
+ | `offset` | Offset for pagination (max ~10,000 effective) |
497
+ | `cursor` | Use `cursor=*` for first page, then URL-encode `next-cursor` value |
498
+ | `sort` | `relevance`, `is-referenced-by-count`, `published`, `indexed` |
499
+ | `order` | `asc` or `desc` |
500
+ | `filter` | Comma-separated `key:value` pairs (see filters below) |
501
+ | `select` | Comma-separated field names to return |
502
+ | `facet` | `type-name:*` for type counts; `publisher-name:10` for top publishers |
503
+ | `mailto` | Your email — enables polite pool (higher limits) |
504
+
505
+ ### Filter keys reference
506
+
507
+ | Filter key | Example | Notes |
508
+ |---|---|---|
509
+ | `doi` | `doi:10.1038/nature12345` | Exact DOI match |
510
+ | `type` | `type:journal-article` | See type table above for valid values |
511
+ | `from-pub-date` | `from-pub-date:2024-01-01` | ISO date or `YYYY` |
512
+ | `until-pub-date` | `until-pub-date:2024-12-31` | |
513
+ | `from-index-date` | `from-index-date:2024-01-01` | When CrossRef indexed it |
514
+ | `has-abstract` | `has-abstract:true` | Matches only records with a deposited abstract |
515
+ | `has-orcid` | `has-orcid:true` | At least one author has ORCID |
516
+ | `has-full-text` | `has-full-text:true` | Has full-text link |
517
+ | `has-references` | `has-references:true` | Has deposited reference list |
518
+ | `is-update` | `is-update:true` | Corrections, retractions |
519
+ | `issn` | `issn:0028-0836` | Filter by journal ISSN |
520
+ | `publisher-name` | `publisher-name:elsevier` | Partial match |
521
+ | `funder` | `funder:100000001` | Funder DOI or CrossRef funder ID |
522
+
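Since `filter` takes comma-separated `key:value` pairs, a tiny builder avoids hand-concatenating strings. This is a minimal sketch; `build_filter()` is a hypothetical name, and it assumes values contain no commas.

```python
def build_filter(filters: dict) -> str:
    """Join filter keys and values into a `filter=` parameter value."""
    parts = []
    for key, value in filters.items():
        # The API expects lowercase true/false, not Python's True/False.
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{key}:{value}")
    return ",".join(parts)


filter_value = build_filter(
    {"type": "journal-article", "from-pub-date": "2024-01-01", "has-abstract": True}
)
```

The result is meant to be passed as the `filter` query parameter and URL-encoded along with the rest of the query string.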
523
+ ## Rate limits
524
+
525
+ CrossRef has two pools based on whether `mailto=` is present:
526
+
527
+ | Pool | Triggered by | Rate limit | Concurrency |
528
+ |---|---|---|---|
529
+ | **polite** | `mailto=` param present | 10 req/s | 3 concurrent |
530
+ | **public** | no `mailto=` | 5 req/s | 1 concurrent |
531
+
532
+ Headers returned: `x-rate-limit-limit`, `x-rate-limit-interval`, `x-concurrency-limit`, `x-api-pool`.
533
+
534
+ In practice, with the polite pool, 10 rapid sequential calls complete in ~2.7s (avg 0.27s/req) with no throttling, and 5 parallel calls complete in ~0.3s. Even so, stay at `max_workers=3` to respect the stated concurrency limit of 3.
535
+
536
+ There is no per-day or per-hour cap. If you exceed the limits, responses slow down or return HTTP 429; there is no ban. Add `time.sleep(0.1)` between calls for sustained bulk crawls.
537
+
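One way to stay inside these limits is to cap the worker pool at the polite-pool concurrency and back off on HTTP 429. The sketch below is an assumption-laden illustration: `fetch_with_backoff()` and `fetch_all()` are hypothetical helpers, and the exponential delay is a generic choice, not something CrossRef prescribes.

```python
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_with_backoff(url, opener=urllib.request.urlopen, retries=3,
                       base_delay=1.0, sleep=time.sleep):
    """Fetch a URL, retrying with exponential backoff when the API returns 429."""
    for attempt in range(retries + 1):
        try:
            with opener(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Re-raise anything that is not throttling, or the final failed attempt.
            if err.code != 429 or attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)


def fetch_all(urls, max_workers=3, **kwargs):
    """Fetch many URLs with at most `max_workers` requests in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda u: fetch_with_backoff(u, **kwargs), urls))
```

The `opener` and `sleep` parameters exist only so the retry logic can be exercised without network access; in normal use the defaults apply.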
538
+ ## Gotchas
539
+
540
+ - **`mailto=` doubles your rate limit and triples your concurrency.** Public pool: 5 req/s, concurrency=1. Polite pool: 10 req/s, concurrency=3. Always add `?mailto=your@email.com` to every request. The pool in use can be confirmed by reading the `x-api-pool` response header.
541
+
542
+ - **`title`, `container-title`, `ISSN` are always lists, not strings.** Access with `title[0]`, `container-title[0]` etc. Do not rely on there being only one entry — `container-title` can have multiple values.
543
+
544
+ - **Abstract contains JATS XML markup.** The `abstract` field is not plain text: it contains tags like `<jats:p>`, `<jats:italic>`, `<jats:sup>`. Strip with `re.sub(r'<[^>]+>', ' ', abstract)`. Abstract coverage is partial, roughly 30-70% of works. For example, 2023 journal articles matching the `has-abstract:true` filter: 3,041,841 of ~5.5M total, or ~55%.
545
+
546
+ - **ORCID is a full URL, not just the ID.** `a['ORCID']` = `"https://orcid.org/0000-0001-6169-6580"`. Strip with `.replace('https://orcid.org/', '')` to get the bare ID. `authenticated-orcid: false` means self-asserted (not verified via OAuth).
547
+
548
+ - **`published` vs `published-print` vs `published-online`.** Online-first is common in journals: a paper may be online months before its print issue. `published` is CrossRef's best single date; when both `published-online` and `published-print` exist, it equals `published-online`. Preprints (`posted-content` type) may carry only `posted` and `published`, so look for `posted` rather than `published-print`. Partial dates like `[[2023]]` (year only) are valid, so always use `parse_date()` to handle a missing month or day.
549
+
550
+ - **404 raises `HTTPError`, not a JSON error response.** An invalid DOI (e.g. `10.9999/doesnotexist`) raises `urllib.error.HTTPError: HTTP Error 404: Not Found`. Wrap `fetch_work()` in try/except for any untrusted DOI list.
551
+
552
+ - **`volume` and `issue` are strings, not integers.** CrossRef stores them as strings: `"596"`, not `596`. Comparing them to an int with `==` is always false, so compare against a string or convert first.
553
+
554
+ - **Filter type values are hyphenated lowercase, not the facet display names.** `filter=type:journal-article` works. `filter=type:journal article`, `filter=type:Journal Article`, and `filter=type:conference-paper` all return HTTP 400. Conference papers are `proceedings-article`.
555
+
556
+ - **`select=` does not guarantee field presence.** When you `select=DOI,title,author`, a record that has no author still omits the `author` key — it doesn't return `author: []`. Always use `.get()`.
557
+
558
+ - **Cursor pagination required for >10,000 results.** Offset pagination (`offset=`) is limited to around 10,000 results. For bulk sweeps, use `cursor=*` for the first page, then URL-encode the returned `next-cursor` value with `urllib.parse.quote()`. The cursor expires if unused for too long.
559
+
560
+ - **`rows` max is 1000 per call.** Requesting more silently returns 1000. For cursor-based sweeps of large result sets (millions of records), `rows=1000` with cursor is the most efficient approach.
561
+
562
+ - **HTML entities in titles.** Titles may contain HTML entities like `&amp;` — `"Deep learning &amp; convolutional networks"`. Decode with `html.unescape()` if needed.
563
+
564
+ - **`funder` search `works-count` field is `None`.** The funder search result object has a `works-count` key that is always `None` in the search response. To get actual work counts for a funder, fetch the funder directly: `GET /funders/{id}`.
565
+
566
+ - **`subject` is often an empty list.** The `subject` field in works is populated inconsistently — many journal articles have `subject: []` even for well-indexed journals like Nature.
567
+
568
+ - **Affiliation is usually empty.** `author[i]['affiliation']` is `[]` for the majority of records, even for papers published in 2024. CrossRef has been working on affiliation deposit, but coverage is inconsistent.
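
Several of the gotchas above (list-valued fields, JATS markup, ORCID URLs, HTML entities, missing keys) can be handled in one normalization pass. The following is a minimal sketch under those assumptions; `first()` and `normalize_work()` are hypothetical helpers, and the output keys are illustrative rather than any CrossRef schema.

```python
import html
import re

ORCID_PREFIX = "https://orcid.org/"


def first(values):
    """Return the first element of a list-valued field, or None if absent/empty."""
    return values[0] if values else None


def normalize_work(item: dict) -> dict:
    """Flatten one works record into plain Python values."""
    title = first(item.get("title"))
    abstract = item.get("abstract")
    authors = []
    for author in item.get("author", []):  # key is omitted when no authors exist
        orcid = author.get("ORCID")
        name_parts = (author.get("given"), author.get("family"))
        authors.append({
            "name": " ".join(part for part in name_parts if part),
            "orcid": orcid.replace(ORCID_PREFIX, "") if orcid else None,
        })
    if abstract:
        # Strip JATS tags, then collapse the whitespace the stripping leaves behind.
        abstract = re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", abstract)).strip()
    return {
        "doi": item.get("DOI"),
        "title": html.unescape(title) if title else None,
        "journal": first(item.get("container-title")),
        "abstract": abstract or None,
        "volume": item.get("volume"),  # stays a string, e.g. "596"
        "authors": authors,
    }
```

Because every lookup goes through `.get()`, the same function works on records fetched with a narrow `select=` list.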