@pencil-agent/nano-pencil 2.0.1 → 2.0.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (188)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/model/custom-providers.js +1 -1
  7. package/dist/core/model-registry.js +5 -5
  8. package/dist/extensions/builtin/AGENT.md +115 -115
  9. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  10. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  11. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  12. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  13. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  14. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  15. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  16. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  17. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  18. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  19. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  20. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  91. package/dist/extensions/builtin/browser/browser.md +73 -73
  92. package/dist/extensions/builtin/browser/install.md +142 -142
  93. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  94. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  95. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  96. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  97. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  98. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  99. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  100. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  101. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  102. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  103. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  104. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  105. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  107. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  108. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  109. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  110. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  111. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  112. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  113. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  114. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  115. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  116. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  117. package/dist/extensions/builtin/debug/index.js +9 -9
  118. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  119. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  120. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  121. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  122. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  123. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  124. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  125. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  126. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  127. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  128. package/dist/extensions/builtin/goal/README.md +67 -67
  129. package/dist/extensions/builtin/goal/index.js +6 -6
  130. package/dist/extensions/builtin/grub/README.md +112 -112
  131. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  132. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  133. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  134. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  135. package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
  136. package/dist/extensions/builtin/loop/README.md +92 -92
  137. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  138. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  139. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  140. package/dist/extensions/builtin/sal/README.md +72 -72
  141. package/dist/extensions/builtin/security-audit/README.md +289 -289
  142. package/dist/extensions/builtin/team/AGENT.md +112 -112
  143. package/dist/extensions/builtin/team/TESTING.md +299 -299
  144. package/dist/extensions/builtin/token-save/README.md +56 -56
  145. package/dist/extensions/optional/AGENT.md +10 -10
  146. package/dist/modes/interactive/controllers/input-submit-controller.js +2 -2
  147. package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
  148. package/dist/modes/interactive/interactive-mode.js +19 -19
  149. package/dist/modes/interactive/theme/dark.json +85 -85
  150. package/dist/modes/interactive/theme/light.json +84 -84
  151. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  152. package/dist/modes/interactive/theme/warm.json +81 -81
  153. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  154. package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
  155. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
  156. package/docs/SDK-TESTING.md +364 -0
  157. package/docs/codex-goal-command-impl.md +1055 -1055
  158. package/docs/codex-goal-vs-grub.md +500 -500
  159. package/docs/custom-provider.md +27 -27
  160. package/docs/extensions.md +27 -27
  161. package/docs/keybindings.md +27 -27
  162. package/docs/loop 重构完成总结.md +250 -250
  163. package/docs/loop 重构完成报告.md +122 -122
  164. package/docs/loop 重构方案.md +1222 -1222
  165. package/docs/loop 重构方案实现报告.md +158 -158
  166. package/docs/loop 重构方案对比分析.md +128 -128
  167. package/docs/loop 重构计划.md +320 -320
  168. package/docs/loop-usage-examples.md +214 -214
  169. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
  170. package/docs/models.md +27 -27
  171. package/docs/packages.md +27 -27
  172. package/docs/pi-design-philosophy.md +457 -457
  173. package/docs/planmode.md +1987 -1987
  174. package/docs/prompt-templates.md +27 -27
  175. package/docs/providers.md +27 -27
  176. package/docs/sdk.md +27 -27
  177. package/docs/skills.md +27 -27
  178. package/docs/startup-performance-optimization.md +301 -0
  179. package/docs/themes.md +27 -27
  180. package/docs/tui.md +27 -27
  181. package/docs/认知地图.md +47 -0
  182. package/package.json +190 -190
  183. package/docs/cc-agent-design.md +0 -1297
  184. package/docs/cc-tui-design.md +0 -1333
  185. package/docs/nanoPencil-学习计划.md +0 -170
  186. package/docs/scan-report.md +0 -3820
  187. package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
  188. package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +0 -261
@@ -1,568 +1,568 @@
1
- # CrossRef — Scraping & Data Extraction
2
-
3
- `https://api.crossref.org` — scholarly DOI and citation metadata. **Never use the browser for CrossRef.** Completely free, no auth required. All workflows use `http_get`.
4
-
5
- ## Do this first
6
-
7
- **Always add `mailto=your@email.com` to every request** — it moves you into the polite pool, which doubles the rate limit and concurrency allowance. The difference is measurable and the cost is zero.
8
-
9
- ```python
10
- from helpers import http_get
11
- import json
12
-
13
- MAILTO = "mailto=your@email.com" # set once, append to every URL
14
-
15
- # Single DOI lookup — fastest way to get metadata for a known paper
16
- data = json.loads(http_get(f"https://api.crossref.org/works/10.1038/s41586-021-03819-2?{MAILTO}"))
17
- msg = data['message']
18
- # msg keys: DOI, title, author, published, type, container-title, volume, issue,
19
- # page, is-referenced-by-count, references-count, abstract (optional), ...
20
- ```
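The `mailto` rule above is easier to follow when every URL is built through one helper. A minimal sketch, assuming a hypothetical `crossref_url` function (it is not part of `helpers`) and the standard library's `urlencode`; the set of characters left unescaped is an illustrative choice that keeps filter syntax and `cursor=*` readable:

```python
from urllib.parse import urlencode

BASE = "https://api.crossref.org"  # API root used throughout this document


def crossref_url(path, mailto, **params):
    """Build a CrossRef URL with percent-encoded parameters and mailto appended.

    Hypothetical helper for illustration only.
    """
    query = dict(params)
    query["mailto"] = mailto  # always last, so it cannot be forgotten
    # Leave ':' ',' '*' '@' unescaped so filters, cursor=* and the address stay readable.
    return f"{BASE}/{path.lstrip('/')}?{urlencode(query, safe=':,*@')}"


print(crossref_url("works", "your@email.com", query="machine learning", rows=5))
```

The resulting string can be passed to `http_get` in place of the hand-assembled f-strings.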
21
-
22
- ## Common workflows
23
-
24
- ### DOI lookup — single paper
25
-
26
- ```python
27
- from helpers import http_get
28
- import json, re
29
-
30
- MAILTO = "mailto=your@email.com"
31
-
32
- def fetch_work(doi):
33
-     data = json.loads(http_get(f"https://api.crossref.org/works/{doi}?{MAILTO}"))
34
-     return data['message']
35
-
36
- def parse_date(d):
37
-     """[[2021, 7, 15]] -> '2021-7-15'. Handles partial dates like [[2021]]."""
38
-     if not d: return None
39
-     parts = d.get('date-parts', [[]])[0]
40
-     return '-'.join(str(p) for p in parts if p is not None)
41
-
42
- def clean_abstract(raw):
43
-     """Strip JATS XML tags. Abstract field contains tags like <jats:p>, <jats:italic>."""
44
-     return re.sub(r'<[^>]+>', ' ', raw).strip() if raw else None
45
-
46
- w = fetch_work("10.1038/s41586-021-03819-2") # AlphaFold2
47
-
48
- print("DOI:", w['DOI']) # 10.1038/s41586-021-03819-2
49
- print("Title:", w['title'][0]) # Highly accurate protein structure...
50
- print("Type:", w['type']) # journal-article
51
- print("Publisher:", w['publisher']) # Springer Science and Business Media LLC
52
- print("Journal:", w.get('container-title', [''])[0]) # Nature
53
- print("Volume:", w.get('volume')) # 596
54
- print("Issue:", w.get('issue')) # 7873
55
- print("Page:", w.get('page')) # 583-589
56
- print("published:", parse_date(w.get('published'))) # 2021-7-15 (online date)
57
- print("published-online:", parse_date(w.get('published-online'))) # 2021-7-15
58
- print("published-print:", parse_date(w.get('published-print'))) # 2021-8-26
59
- print("Citations:", w.get('is-referenced-by-count')) # 40260
60
- print("References:", w.get('references-count')) # 84
61
- print("Abstract:", clean_abstract(w.get('abstract', ''))[:100] if w.get('abstract') else None)
62
- # Confirmed output (2026-04-18):
63
- # DOI: 10.1038/s41586-021-03819-2
64
- # Title: Highly accurate protein structure prediction with AlphaFold
65
- # Type: journal-article
66
- # Journal: Nature
67
- # Volume: 596 | Issue: 7873 | Page: 583-589
68
- # published: 2021-7-15 | published-print: 2021-8-26
69
- # Citations: 40260
70
- ```
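`parse_date` above returns unpadded strings such as `2021-7-15`, which do not sort correctly as text. A minimal sketch of a zero-padded variant, assuming a hypothetical `iso_date` helper and integer date parts as shown in the example output:

```python
def iso_date(date_obj):
    """Format a CrossRef date object as a zero-padded string.

    {'date-parts': [[2021, 7, 15]]} -> '2021-07-15'; partial dates stay partial.
    Hypothetical helper for illustration only.
    """
    if not date_obj:
        return None
    parts = (date_obj.get("date-parts") or [[]])[0]
    parts = [p for p in parts if p is not None]  # drop placeholder None entries
    if not parts:
        return None
    year, rest = parts[0], parts[1:]
    return "-".join([f"{year:04d}"] + [f"{p:02d}" for p in rest])


print(iso_date({"date-parts": [[2021, 7, 15]]}))
```

Partial dates are kept at their original precision instead of being padded with an invented month or day.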
71
-
72
- ### DOI lookup — extract authors with ORCID
73
-
74
- ```python
75
- from helpers import http_get
76
- import json
77
-
78
- MAILTO = "mailto=your@email.com"
79
- data = json.loads(http_get(f"https://api.crossref.org/works/10.1038/s41586-021-03819-2?{MAILTO}"))
80
- authors = data['message'].get('author', [])
81
-
82
- for a in authors[:3]:
83
-     name = f"{a.get('given', '')} {a.get('family', '')}".strip()
84
-     # ORCID is a full URL, not a bare ID; strip the prefix
85
-     orcid_url = a.get('ORCID') # e.g. 'https://orcid.org/0000-0001-6169-6580'
86
-     orcid_id = orcid_url.replace('https://orcid.org/', '') if orcid_url else None
87
-     authenticated = a.get('authenticated-orcid', False) # False = self-reported, True = verified
88
-     affiliations = [aff.get('name', '') for aff in a.get('affiliation', [])]
89
-     print(f"{name} | ORCID: {orcid_id} | auth={authenticated} | seq={a['sequence']}")
90
- # Confirmed output:
91
- # John Jumper | ORCID: 0000-0001-6169-6580 | auth=False | seq=first
92
- # Richard Evans | ORCID: None | auth=False | seq=additional
93
- # Alexander Pritzel | ORCID: None | auth=False | seq=additional
94
- ```
95
-
96
- ### Batch DOI lookup (parallel — 5 calls in ~0.3s)
97
-
98
- ```python
99
- from helpers import http_get
100
- from concurrent.futures import ThreadPoolExecutor
101
- import json
102
-
103
- MAILTO = "mailto=your@email.com"
104
-
105
- def fetch_work(doi):
106
-     try:
107
-         data = json.loads(http_get(f"https://api.crossref.org/works/{doi}?{MAILTO}"))
108
-         msg = data['message']
109
-         return {
110
-             'doi': doi,
111
-             'title': msg.get('title', [''])[0],
112
-             'year': (msg.get('published', {}).get('date-parts') or [[None]])[0][0],
113
-             'citations': msg.get('is-referenced-by-count'),
114
-             'type': msg.get('type'),
115
-         }
116
-     except Exception as e:
117
-         return {'doi': doi, 'error': str(e)}
118
-
119
- dois = [
120
-     "10.1038/nature12345",
121
-     "10.1038/s41586-021-03819-2",
122
-     "10.1056/NEJMoa2034577",
123
-     "10.1126/science.1260419",
124
-     "10.1038/s41586-024-07487-w",
125
- ]
126
-
127
- # max_workers=5 safe; polite pool: 10 req/s, concurrency=3 (see Rate limits)
128
- with ThreadPoolExecutor(max_workers=5) as ex:
129
-     results = list(ex.map(fetch_work, dois))
130
-
131
- for r in results:
132
-     print(r.get('year'), f"cites={r.get('citations')}", (r.get('title') or r.get('error', ''))[:50])
133
- # Confirmed output (2026-04-18, ~0.296s total):
134
- # 2013 cites=465 LRG1 promotes angiogenesis by modulating endotheli
135
- # 2021 cites=40260 Highly accurate protein structure prediction with
136
- # 2020 cites=13752 Safety and Efficacy of the BNT162b2 mRNA Covid-19
137
- # 2015 cites=13553 Tissue-based map of the human proteome
138
- # 2024 cites=12037 Accurate structure prediction of biomolecular inte
139
- ```
140
-
141
- ### Search works by keyword
142
-
143
- ```python
144
- from helpers import http_get
145
- import json
146
-
147
- MAILTO = "mailto=your@email.com"
148
-
149
- # Broad keyword search
150
- data = json.loads(http_get(
151
-     f"https://api.crossref.org/works?query=machine+learning&rows=5&{MAILTO}"
152
- ))
153
- msg = data['message']
154
- print("Total results:", msg['total-results']) # 2,805,391
155
- for item in msg['items']:
156
-     title = item.get('title', ['(no title)'])[0][:60]
157
-     doi = item.get('DOI', '')
158
-     year = (item.get('published', {}).get('date-parts') or [[None]])[0][0]
159
-     type_ = item.get('type', '')
160
-     print(f" [{type_}] {year} {title}")
161
-     print(f" DOI: {doi}")
162
- ```
163
-
164
- ### Search by author + title (targeted)
165
-
166
- ```python
167
- from helpers import http_get
168
- import json
169
-
170
- MAILTO = "mailto=your@email.com"
171
-
172
- data = json.loads(http_get(
173
-     f"https://api.crossref.org/works?query.author=Lecun&query.title=deep+learning&rows=5&{MAILTO}"
174
- ))
175
- msg = data['message']
176
- print("Total results:", msg['total-results']) # 62
177
- for item in msg['items'][:3]:
178
-     title = item.get('title', [''])[0][:60]
179
-     authors = ', '.join(a.get('family', '') for a in item.get('author', [])[:2])
180
-     year = (item.get('published', {}).get('date-parts') or [[None]])[0][0]
181
-     print(f" {year} {title}")
182
-     print(f" Authors: {authors} DOI: {item.get('DOI')}")
183
- # Confirmed output:
184
- # 2015 Deep learning & convolutional networks
185
- # Authors: LeCun DOI: 10.1109/hotchips.2015.7477328
186
- ```
187
-
188
- ### Filter by date, type, and sort by citations
189
-
190
- ```python
191
- from helpers import http_get
192
- import json
193
-
194
- MAILTO = "mailto=your@email.com"
195
-
196
- data = json.loads(http_get(
197
-     f"https://api.crossref.org/works"
198
-     f"?filter=from-pub-date:2024-01-01,type:journal-article"
199
-     f"&rows=5&sort=is-referenced-by-count&order=desc&{MAILTO}"
200
- ))
201
- msg = data['message']
202
- print("Total 2024+ journal articles:", msg['total-results']) # 14,565,456
203
- for item in msg['items'][:3]:
204
-     title = item.get('title', [''])[0][:60]
205
-     cites = item.get('is-referenced-by-count', 0)
206
-     year = (item.get('published', {}).get('date-parts') or [[None]])[0][0]
207
-     print(f" {year} cites={cites} {title}")
208
- # Confirmed output:
209
- # 2024 cites=17371 Global cancer statistics 2022: GLOBOCAN estimates...
210
- # 2024 cites=12037 Accurate structure prediction of biomolecular int...
211
- ```
212
-
213
- ### Filter with `has-abstract:true`
214
-
215
- ```python
216
- from helpers import http_get
217
- import json
218
-
219
- MAILTO = "mailto=your@email.com"
220
-
221
- # Only return works that have an abstract (useful since ~30-70% do not)
222
- data = json.loads(http_get(
223
-     f"https://api.crossref.org/works"
224
-     f"?filter=from-pub-date:2023-01-01,until-pub-date:2023-12-31"
225
-     f",type:journal-article,has-abstract:true"
226
-     f"&rows=3&sort=is-referenced-by-count&order=desc&{MAILTO}"
227
- ))
228
- msg = data['message']
229
- print("2023 journal articles with abstract:", msg['total-results']) # 3,041,841
230
- for item in msg['items']:
231
-     print(item.get('title', [''])[0][:60], '| cites:', item.get('is-referenced-by-count'))
232
- # Confirmed output:
233
- # Cancer statistics, 2023 | cites: 12919
234
- # Evolutionary-scale prediction of atomic-level protein struct | cites: 4352
235
- ```
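The filter examples above all use the same syntax: `name:value` pairs joined by commas. A minimal sketch of assembling that string from keyword arguments, assuming a hypothetical `build_filter` helper; mapping underscores to hyphens is a convenience of this sketch, not something the API requires:

```python
def build_filter(**conditions):
    """Turn keyword arguments into a CrossRef filter string.

    Hypothetical helper for illustration only.
    """
    parts = []
    for name, value in conditions.items():
        key = name.replace("_", "-")  # from_pub_date -> from-pub-date
        if isinstance(value, bool):
            value = "true" if value else "false"  # the examples use lowercase booleans
        parts.append(f"{key}:{value}")
    return ",".join(parts)


print(build_filter(from_pub_date="2023-01-01", until_pub_date="2023-12-31",
                   type="journal-article", has_abstract=True))
```

The returned string goes after `filter=` in the request URL, exactly as in the hand-written examples.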
236
-
237
- ### Cursor pagination (large result sets)
238
-
239
- Standard offset pagination (`start=`) caps at a few thousand results. Use cursor for full sweeps.
240
-
241
- ```python
242
- from helpers import http_get
243
- from urllib.parse import quote
244
- import json
245
-
246
- MAILTO = "mailto=your@email.com"
247
-
248
- # First page: cursor=*
249
- data = json.loads(http_get(
250
-     f"https://api.crossref.org/works?query=covid&rows=100&cursor=*&{MAILTO}"
251
- ))
252
- msg = data['message']
253
- print("Total results:", msg['total-results']) # 897,660
254
- items = msg['items']
255
- next_cursor = msg['next-cursor'] # base64 string like "DnF1ZXJ5VGhlbkZldGNoJA..."
256
-
257
- # Next pages: pass URL-encoded cursor
258
- while next_cursor and items:
259
-     data = json.loads(http_get(
260
-         f"https://api.crossref.org/works?query=covid&rows=100"
261
-         f"&cursor={quote(next_cursor)}&{MAILTO}"
262
-     ))
263
-     msg = data['message']
264
-     items = msg.get('items', [])
265
-     next_cursor = msg.get('next-cursor')
266
-     # process items...
267
-     break  # remove for full sweep
268
- ```
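The cursor loop above can be wrapped in a generator so callers simply iterate over works. A minimal sketch, assuming a hypothetical `iter_works` function whose `fetch` argument is any callable that takes a URL and returns the response body as text (for example `http_get`); the `max_pages` guard is an addition of this sketch:

```python
import json
from urllib.parse import quote


def iter_works(fetch, base_url, rows=100, max_pages=None):
    """Yield work items one by one using cursor pagination.

    base_url must already contain the query and mailto parameters,
    e.g. "https://api.crossref.org/works?query=covid&mailto=your@email.com".
    Hypothetical helper for illustration only.
    """
    cursor, pages = "*", 0
    while cursor and (max_pages is None or pages < max_pages):
        # URL-encode the cursor; '*' is left as-is for the first page.
        url = f"{base_url}&rows={rows}&cursor={quote(cursor, safe='*')}"
        msg = json.loads(fetch(url))["message"]
        items = msg.get("items", [])
        if not items:  # an empty page ends the sweep
            break
        yield from items
        cursor = msg.get("next-cursor")
        pages += 1
```

Passing the fetch function in keeps the loop testable without network access and lets the caller decide how to throttle requests.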
269
-
270
- ### Fetch specific fields only (`select=`)
271
-
272
- Reduces response size significantly for bulk operations:
273
-
274
- ```python
275
- from helpers import http_get
276
- import json
277
-
278
- MAILTO = "mailto=your@email.com"
279
-
280
- data = json.loads(http_get(
281
-     f"https://api.crossref.org/works?query=cancer&rows=5"
282
-     f"&select=DOI,title,author&{MAILTO}"
283
- ))
284
- # Warning: if a field is absent for a record, it simply won't appear in that item
285
- for item in data['message']['items']:
286
-     print(list(item.keys())) # only ['DOI', 'title'] or ['DOI', 'title', 'author']
287
- # Note: select= does NOT guarantee the field appears — absent fields are just omitted
288
- ```
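Because selected fields can be missing from individual items, direct indexing such as `item['title'][0]` can raise. A minimal sketch of defensive accessors, assuming hypothetical helpers `first` and `author_names`:

```python
def first(item, field, default=None):
    """Return the first element of a list-valued field, or default if absent or empty.

    Hypothetical helper for illustration only.
    """
    value = item.get(field)
    return value[0] if value else default


def author_names(item):
    """Return 'given family' strings; an omitted author field yields an empty list."""
    return [
        f"{a.get('given', '')} {a.get('family', '')}".strip()
        for a in item.get("author", [])
    ]


record = {"DOI": "10.1038/s41586-021-03819-2"}  # title and author omitted
print(first(record, "title", "(no title)"), author_names(record))
```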
289
-
290
- ### Count by type using facets
291
-
292
- ```python
293
- from helpers import http_get
294
- import json
295
-
296
- MAILTO = "mailto=your@email.com"
297
-
298
- data = json.loads(http_get(
299
-     f"https://api.crossref.org/works?query=machine+learning&rows=0"
300
-     f"&facet=type-name:*&{MAILTO}"
301
- ))
302
- msg = data['message']
303
- type_facet = msg['facets']['type-name']
304
- for k, v in sorted(type_facet['values'].items(), key=lambda x: -x[1]):
305
-     print(f" {k}: {v:,}")
306
- # Confirmed output (all CrossRef, 2026-04-18):
307
- # Journal Article: 1,628,997 (for query=machine+learning scope)
308
- # Conference Paper: 501,433
309
- # Chapter: 455,907
310
- # Posted Content: 87,937
311
- # ...
312
- ```
313
-
314
- ### Journal info by ISSN
315
-
316
- ```python
317
- from helpers import http_get
318
- import json
319
-
320
- MAILTO = "mailto=your@email.com"
321
-
322
- # Nature (ISSN 0028-0836)
323
- data = json.loads(http_get(f"https://api.crossref.org/journals/0028-0836?{MAILTO}"))
324
- msg = data['message']
325
- print("Title:", msg['title']) # Nature
326
- print("Publisher:", msg['publisher']) # Springer Science and Business Media LLC
327
- print("ISSN:", msg['ISSN']) # ['0028-0836', '1476-4687']
328
- print("Total DOIs:", msg['counts']['total-dois']) # 445,417
329
- print("Subjects:", msg.get('subjects', [])) # [] (not always populated)
330
-
331
- # Search journals by name
332
- data2 = json.loads(http_get(f"https://api.crossref.org/journals?query=nature&rows=3&{MAILTO}"))
333
- for j in data2['message']['items']:
334
-     print(f"{j.get('title')} | ISSN: {j.get('ISSN')} | DOIs: {j.get('counts', {}).get('total-dois')}")
335
- # Confirmed output:
336
- # NatureJobs | ISSN: [] | DOIs: 0
337
- # Naturen | ISSN: ['0028-0887', '1504-3118'] | DOIs: 1055
338
- ```
339
-
340
- ### Funder search
341
-
342
- ```python
343
- from helpers import http_get
344
- import json
345
-
346
- MAILTO = "mailto=your@email.com"
347
-
348
- data = json.loads(http_get(
349
-     f"https://api.crossref.org/funders?query=national+science+foundation&rows=3&{MAILTO}"
350
- ))
351
- msg = data['message']
352
- print("Total funders:", msg['total-results']) # 108
353
- for f in msg['items']:
354
-     print(f" ID: {f['id']} | {f['name']}")
355
-     print(f" Alt names: {f.get('alt-names', [])[:2]}")
356
-     print(f" URI: {f.get('uri')}")
357
- # Confirmed output:
358
- # ID: 501100001711 | Schweizerischer Nationalfonds zur Förderung...
359
- # ID: 100000143 | Division of Computing and Communication Foundations
360
- ```
361
-
362
- ### DOI content negotiation (alternative, no CrossRef API needed)
363
-
364
- The `doi.org` resolver can return formatted metadata directly via `Accept` header:
365
-
366
- ```python
367
- import urllib.request, json
368
-
369
- def doi_to_csl(doi):
370
-     """Fetch CSL-JSON via DOI content negotiation. Same data as CrossRef API."""
371
-     req = urllib.request.Request(
372
-         f"https://doi.org/{doi}",
373
-         headers={"Accept": "application/vnd.citationstyles.csl+json",
374
-                  "User-Agent": "Mozilla/5.0"}
375
-     )
376
-     with urllib.request.urlopen(req, timeout=20) as r:
377
-         return json.loads(r.read().decode())
378
-
379
- def doi_to_bibtex(doi):
380
-     """Fetch BibTeX via DOI content negotiation."""
381
-     req = urllib.request.Request(
382
-         f"https://doi.org/{doi}",
383
-         headers={"Accept": "application/x-bibtex", "User-Agent": "Mozilla/5.0"}
384
-     )
385
-     with urllib.request.urlopen(req, timeout=20) as r:
386
-         return r.read().decode()
387
-
388
- csl = doi_to_csl("10.1038/nature12345")
389
- print("Title:", csl['title']) # LRG1 promotes angiogenesis...
390
- print("Type:", csl['type']) # journal-article
391
-
392
- bib = doi_to_bibtex("10.1038/nature12345")
393
- print(bib[:200])
394
- # @article{Wang_2013, title={LRG1 promotes angiogenesis...
395
- ```
396
-
397
- ## Field reference
398
-
399
- ### Work object — complete field list
400
-
401
- All fields are potentially absent unless marked required. Fields marked (R) are always present.
402
-
403
- | Field | Type | Notes |
404
- |---|---|---|
405
- | `DOI` (R) | string | e.g. `"10.1038/s41586-021-03819-2"` |
406
- | `URL` (R) | string | `"https://doi.org/10.1038/s41586-021-03819-2"` |
407
- | `title` (R) | list[str] | Always a list; access `title[0]` |
408
- | `type` (R) | string | e.g. `"journal-article"` — see type table below |
409
- | `publisher` | string | |
410
- | `container-title` | list[str] | Journal name; access `[0]` |
411
- | `short-container-title` | list[str] | Abbreviated journal name |
412
- | `ISSN` | list[str] | May contain print and online ISSN |
413
- | `volume` | string | Note: string not int (`"596"`) |
414
- | `issue` | string | |
415
- | `page` | string | e.g. `"583-589"` |
416
- | `author` | list[object] | See author fields below |
417
- | `published` | date-object | Best single date — use this |
418
- | `published-online` | date-object | Online-first date |
419
- | `published-print` | date-object | Print edition date |
420
- | `issued` | date-object | Usually same as `published` |
421
- | `is-referenced-by-count` | int | Inbound citations to this work |
422
- | `references-count` | int | Outbound references from this work |
423
- | `reference` | list[object] | Full reference list (when deposited) |
424
- | `abstract` | string | JATS XML markup; ~30-70% of works; strip tags before use |
425
- | `subject` | list[str] | Subject classification (often empty) |
426
- | `language` | string | e.g. `"en"` |
427
- | `license` | list[object] | Each: `{URL, start, delay-in-days, content-version}` |
428
- | `funder` | list[object] | Each: `{name, DOI, award}` |
429
- | `link` | list[object] | Full-text links |
430
- | `relation` | object | Related DOIs (e.g. preprint → article) |
431
- | `assertion` | list[object] | Publisher-specific metadata |
432
- | `alternative-id` | list[str] | Publisher's internal IDs |
433
- | `member` | string | CrossRef member ID |
434
- | `prefix` | string | DOI prefix |
435
- | `score` | float | Relevance score (search results only) |
436
- | `source` | string | e.g. `"Crossref"` |
437
- | `indexed` | date-object | When CrossRef indexed this record |
438
- | `deposited` | date-object | When publisher last deposited metadata |
439
- | `created` | date-object | When CrossRef record was first created |
440
-
441
- ### Author object fields
442
-
443
- | Field | Notes |
444
- |---|---|
445
- | `given` | Given/first name |
446
- | `family` | Family/last name |
447
- | `sequence` | `"first"` or `"additional"` |
448
- | `affiliation` | list of `{name, place}` — usually `[]` |
449
- | `ORCID` | Full URL `"https://orcid.org/0000-0001-..."` — strip prefix to get bare ID |
450
- | `authenticated-orcid` | `true` = verified via ORCID OAuth; `false` = self-reported |
451
- | `name` | Used instead of given/family for organizations |
452
-
453
- ### Date object structure
454
-
455
- ```python
456
- # All date fields share this structure:
457
- date_obj = {
458
- "date-parts": [[2021, 7, 15]], # [[year, month, day]] — month/day may be absent
459
- "date-time": "2021-07-15T00:00:00Z", # not always present
460
- "timestamp": 1626307200000 # not always present
461
- }
462
-
463
- # Safe extraction (handles [[2021]] or [[2021, 7]] partial dates):
464
- def parse_date(d):
465
- if not d: return None
466
- parts = (d.get('date-parts') or [[]])[0]
467
- return '-'.join(str(p) for p in parts if p is not None)
468
- ```
469
-
470
- ### Type identifiers (filter param values vs facet display names)
471
-
472
- Use these exact strings in `filter=type:...`. The facet `type-name` values are display names only.
473
-
474
- | filter `type:` value | Facet display name | Count (all CrossRef) |
475
- |---|---|---|
476
- | `journal-article` | Journal Article | 121,030,194 |
477
- | `book-chapter` | Chapter | 24,359,059 |
478
- | `proceedings-article` | Conference Paper | 9,744,754 |
479
- | `dataset` | Dataset | 3,424,142 |
480
- | `posted-content` | Posted Content (preprints) | 3,203,320 |
481
- | `dissertation` | Dissertation | 1,044,461 |
482
- | `peer-review` | Peer Review | 1,028,287 |
483
- | `report` | Report | 906,301 |
484
- | `book` | Book | 870,949 |
485
- | `monograph` | Monograph | 788,401 |
486
-
487
- ### Query parameters reference
488
-
489
- | Parameter | Notes |
490
- |---|---|
491
- | `query` | Full-text keyword search across title, abstract, author |
492
- | `query.author` | Author name search only |
493
- | `query.title` | Title search only |
494
- | `query.bibliographic` | Combined title + author + journal search |
495
- | `rows` | Results per page (default 20, max 1000) |
496
- | `offset` | Offset for pagination (max ~10,000 effective) |
497
- | `cursor` | Use `cursor=*` for first page, then URL-encode `next-cursor` value |
498
- | `sort` | `relevance`, `is-referenced-by-count`, `published`, `indexed` |
499
- | `order` | `asc` or `desc` |
500
- | `filter` | Comma-separated `key:value` pairs (see filters below) |
501
- | `select` | Comma-separated field names to return |
502
- | `facet` | `type-name:*` for type counts; `publisher-name:10` for top publishers |
503
- | `mailto` | Your email — enables polite pool (higher limits) |
504
-
505
- ### Filter keys reference
506
-
507
- | Filter key | Example | Notes |
508
- |---|---|---|
509
- | `doi` | `doi:10.1038/nature12345` | Exact DOI match |
510
- | `type` | `type:journal-article` | See type table above for valid values |
511
- | `from-pub-date` | `from-pub-date:2024-01-01` | ISO date or `YYYY` |
512
- | `until-pub-date` | `until-pub-date:2024-12-31` | |
513
- | `from-index-date` | `from-index-date:2024-01-01` | When CrossRef indexed it |
514
- | `has-abstract` | `has-abstract:true` | Only works with deposited abstract |
515
- | `has-orcid` | `has-orcid:true` | At least one author has ORCID |
516
- | `has-full-text` | `has-full-text:true` | Has full-text link |
517
- | `has-references` | `has-references:true` | Has deposited reference list |
518
- | `is-update` | `is-update:true` | Corrections, retractions |
519
- | `issn` | `issn:0028-0836` | Filter by journal ISSN |
520
- | `publisher-name` | `publisher-name:elsevier` | Partial match |
521
- | `funder` | `funder:100000001` | Funder DOI or CrossRef funder ID |
522
-
523
- ## Rate limits
524
-
525
- CrossRef has two pools based on whether `mailto=` is present:
526
-
527
- | Pool | Triggered by | Rate limit | Concurrency |
528
- |---|---|---|---|
529
- | **polite** | `mailto=` param present | 10 req/s | 3 concurrent |
530
- | **public** | no `mailto=` | 5 req/s | 1 concurrent |
531
-
532
- Headers returned: `x-rate-limit-limit`, `x-rate-limit-interval`, `x-concurrency-limit`, `x-api-pool`.
533
-
534
- In practice with polite pool: 10 rapid sequential calls complete in ~2.7s (avg 0.27s/req) with no throttling. 5 parallel calls complete in ~0.3s. `max_workers=5` worked in that test, but it exceeds the listed concurrency limit of 3; use `max_workers=3` to stay strictly within it.
535
-
536
- No per-day or per-hour cap. If you exceed limits, responses slow or return HTTP 429. No ban. Add `time.sleep(0.1)` between calls for sustained bulk crawls.
537
-
538
- ## Gotchas
539
-
540
- - **`mailto=` doubles your rate limit and triples your concurrency.** Public pool: 5 req/s, concurrency=1. Polite pool: 10 req/s, concurrency=3. Always add `mailto=your@email.com` to every request (confirmed by reading the `x-api-pool` response header).
541
-
542
- - **`title`, `container-title`, `ISSN` are always lists, not strings.** Access with `title[0]`, `container-title[0]` etc. Do not rely on there being only one entry — `container-title` can have multiple values.
543
-
544
- - **Abstract contains JATS XML markup.** The `abstract` field is not plain text; it contains tags like `<jats:p>`, `<jats:italic>`, `<jats:sup>`. Strip with `re.sub(r'<[^>]+>', ' ', abstract)`. Only about 30-70% of works have an abstract at all. For 2023 journal articles, the `has-abstract:true` filter returned 3,041,841 of ~5.5M total (~55%).
545
-
546
- - **ORCID is a full URL, not just the ID.** `a['ORCID']` = `"https://orcid.org/0000-0001-6169-6580"`. Strip with `.replace('https://orcid.org/', '')` to get the bare ID. `authenticated-orcid: false` means self-asserted (not verified via OAuth).
547
-
548
- - **`published` vs `published-print` vs `published-online`.** Online-first is common in journals — a paper may be online months before its print issue. `published` is CrossRef's best single date and equals `published-online` when both exist. For preprints (`posted-content` type), look for `posted` instead of `published-print` — it may only have `posted` and `published`. Partial dates like `[[2023]]` (year only) are valid — always use `parse_date()` to handle missing month/day.
549
-
550
- - **404 raises `HTTPError`, not a JSON error response.** An invalid DOI (e.g. `10.9999/doesnotexist`) raises `urllib.error.HTTPError: HTTP Error 404: Not Found`. Wrap `fetch_work()` in try/except for any untrusted DOI list.
551
-
552
- - **`volume` and `issue` are strings, not integers.** CrossRef stores them as strings — `"596"`, not `596`. Don't compare with `==` to an int.
553
-
554
- - **Filter type values are hyphenated lowercase, not the facet display names.** `filter=type:journal-article` works. `filter=type:journal article`, `filter=type:Journal Article`, and `filter=type:conference-paper` all return HTTP 400. Conference papers are `proceedings-article`.
555
-
556
- - **`select=` does not guarantee field presence.** When you `select=DOI,title,author`, a record that has no author still omits the `author` key — it doesn't return `author: []`. Always use `.get()`.
557
-
558
- - **Cursor pagination required for >10,000 results.** Offset pagination (`offset=`) is limited to around 10,000 results. For bulk sweeps, use `cursor=*` for the first page, then URL-encode the returned `next-cursor` value with `urllib.parse.quote()`. The cursor expires if unused for too long.
559
-
560
- - **`rows` max is 1000 per call.** Requesting more silently returns 1000. For cursor-based sweeps of large result sets (millions of records), `rows=1000` with cursor is the most efficient approach.
561
-
562
- - **HTML entities in titles.** Titles may contain HTML entities like `&amp;` — `"Deep learning &amp; convolutional networks"`. Decode with `html.unescape()` if needed.
563
-
564
- - **`funder` search `works-count` field is `None`.** The funder search result object has a `works-count` key that is always `None` in the search response. To get actual work counts for a funder, fetch the funder directly: `GET /funders/{id}`.
565
-
566
- - **`subject` is often an empty list.** The `subject` field in works is populated inconsistently — many journal articles have `subject: []` even for well-indexed journals like Nature.
567
-
568
- - **Affiliation is usually empty.** `author[i]['affiliation']` is `[]` for the majority of records, even for papers published in 2024. CrossRef has been working on affiliation deposit, but coverage is inconsistent.
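The text-related gotchas above (JATS tags in abstracts, HTML entities in titles, list-valued fields) can be handled together. The following is a minimal sketch, not part of the package; `clean_text` and `first` are hypothetical helper names introduced here for illustration.

```python
import html
import re

_TAG = re.compile(r"<[^>]+>")  # same tag-stripping pattern the gotcha recommends

def clean_text(raw):
    """Strip JATS/XML tags, decode HTML entities, and collapse whitespace."""
    if not raw:
        return None
    no_tags = _TAG.sub(" ", raw)           # remove tags first ...
    decoded = html.unescape(no_tags)       # ... then decode &amp; and friends
    return " ".join(decoded.split())

def first(values, default=None):
    """Return the first entry of a list-valued field such as `title`, or a default."""
    return values[0] if values else default
```

Tags are removed before entities are decoded so that an escaped `&lt;` in the text is not mistaken for markup.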
1
+ # CrossRef — Scraping & Data Extraction
2
+
3
+ `https://api.crossref.org` — scholarly DOI and citation metadata. **Never use the browser for CrossRef.** Completely free, no auth required. All workflows use `http_get`.
4
+
5
+ ## Do this first
6
+
7
+ **Always add `mailto=your@email.com` to every request.** It moves you into the polite pool, which doubles the rate limit (5 to 10 req/s) and triples the concurrency allowance (1 to 3). The cost is zero.
8
+
9
+ ```python
10
+ from helpers import http_get
11
+ import json
12
+
13
+ MAILTO = "mailto=your@email.com" # set once, append to every URL
14
+
15
+ # Single DOI lookup — fastest way to get metadata for a known paper
16
+ data = json.loads(http_get(f"https://api.crossref.org/works/10.1038/s41586-021-03819-2?{MAILTO}"))
17
+ msg = data['message']
18
+ # msg keys: DOI, title, author, published, type, container-title, volume, issue,
19
+ # page, is-referenced-by-count, references-count, abstract (optional), ...
20
+ ```
21
+
22
+ ## Common workflows
23
+
24
+ ### DOI lookup — single paper
25
+
26
+ ```python
27
+ from helpers import http_get
28
+ import json, re
29
+
30
+ MAILTO = "mailto=your@email.com"
31
+
32
+ def fetch_work(doi):
33
+     data = json.loads(http_get(f"https://api.crossref.org/works/{doi}?{MAILTO}"))
34
+     return data['message']
35
+
36
+ def parse_date(d):
37
+     """[[2021, 7, 15]] -> '2021-7-15'. Handles partial dates like [[2021]]."""
38
+     if not d: return None
39
+     parts = (d.get('date-parts') or [[]])[0]  # also covers date-parts being None
40
+     return '-'.join(str(p) for p in parts if p is not None)
41
+
42
+ def clean_abstract(raw):
43
+     """Strip JATS XML tags. Abstract field contains tags like <jats:p>, <jats:italic>."""
44
+     return re.sub(r'<[^>]+>', ' ', raw).strip() if raw else None
45
+
46
+ w = fetch_work("10.1038/s41586-021-03819-2") # AlphaFold2
47
+
48
+ print("DOI:", w['DOI']) # 10.1038/s41586-021-03819-2
49
+ print("Title:", w['title'][0]) # Highly accurate protein structure...
50
+ print("Type:", w['type']) # journal-article
51
+ print("Publisher:", w['publisher']) # Springer Science and Business Media LLC
52
+ print("Journal:", w.get('container-title', [''])[0]) # Nature
53
+ print("Volume:", w.get('volume')) # 596
54
+ print("Issue:", w.get('issue')) # 7873
55
+ print("Page:", w.get('page')) # 583-589
56
+ print("published:", parse_date(w.get('published'))) # 2021-7-15 (online date)
57
+ print("published-online:", parse_date(w.get('published-online'))) # 2021-7-15
58
+ print("published-print:", parse_date(w.get('published-print'))) # 2021-8-26
59
+ print("Citations:", w.get('is-referenced-by-count')) # 40260
60
+ print("References:", w.get('references-count')) # 84
61
+ print("Abstract:", clean_abstract(w.get('abstract', ''))[:100] if w.get('abstract') else None)
62
+ # Confirmed output (2026-04-18):
63
+ # DOI: 10.1038/s41586-021-03819-2
64
+ # Title: Highly accurate protein structure prediction with AlphaFold
65
+ # Type: journal-article
66
+ # Journal: Nature
67
+ # Volume: 596 | Issue: 7873 | Page: 583-589
68
+ # published: 2021-7-15 | published-print: 2021-8-26
69
+ # Citations: 40260
70
+ ```
71
+
72
+ ### DOI lookup — extract authors with ORCID
73
+
74
+ ```python
75
+ from helpers import http_get
76
+ import json
77
+
78
+ MAILTO = "mailto=your@email.com"
79
+ data = json.loads(http_get(f"https://api.crossref.org/works/10.1038/s41586-021-03819-2?{MAILTO}"))
80
+ authors = data['message'].get('author', [])
81
+
82
+ for a in authors[:3]:
83
+ name = f"{a.get('given', '')} {a.get('family', '')}".strip()
84
+ # ORCID is a full URL, not a bare ID — strip the prefix
85
+ orcid_url = a.get('ORCID') # e.g. 'https://orcid.org/0000-0001-6169-6580'
86
+ orcid_id = orcid_url.replace('https://orcid.org/', '') if orcid_url else None
87
+ authenticated = a.get('authenticated-orcid', False) # False = self-reported, True = verified
88
+ affiliations = [aff.get('name', '') for aff in a.get('affiliation', [])]
89
+ print(f"{name} | ORCID: {orcid_id} | auth={authenticated} | seq={a['sequence']}")
90
+ # Confirmed output:
91
+ # John Jumper | ORCID: 0000-0001-6169-6580 | auth=False | seq=first
92
+ # Richard Evans | ORCID: None | auth=False | seq=additional
93
+ # Alexander Pritzel | ORCID: None | auth=False | seq=additional
94
+ ```
95
+
96
+ ### Batch DOI lookup (parallel — 5 calls in ~0.3s)
97
+
98
+ ```python
99
+ from helpers import http_get
100
+ from concurrent.futures import ThreadPoolExecutor
101
+ import json
102
+
103
+ MAILTO = "mailto=your@email.com"
104
+
105
+ def fetch_work(doi):
106
+     try:
107
+         data = json.loads(http_get(f"https://api.crossref.org/works/{doi}?{MAILTO}"))
108
+         msg = data['message']
109
+         return {
110
+             'doi': doi,
111
+             'title': msg.get('title', [''])[0],
112
+             'year': (msg.get('published', {}).get('date-parts') or [[None]])[0][0],
113
+             'citations': msg.get('is-referenced-by-count'),
114
+             'type': msg.get('type'),
115
+         }
116
+     except Exception as e:
117
+         return {'doi': doi, 'error': str(e)}
118
+
119
+ dois = [
120
+ "10.1038/nature12345",
121
+ "10.1038/s41586-021-03819-2",
122
+ "10.1056/NEJMoa2034577",
123
+ "10.1126/science.1260419",
124
+ "10.1038/s41586-024-07487-w",
125
+ ]
126
+
127
+ # max_workers=5 safe; polite pool: 10 req/s, concurrency=3 (see Rate limits)
128
+ with ThreadPoolExecutor(max_workers=5) as ex:
129
+ results = list(ex.map(fetch_work, dois))
130
+
131
+ for r in results:
132
+ print(r['year'], f"cites={r['citations']}", r['title'][:50])
133
+ # Confirmed output (2026-04-18, ~0.296s total):
134
+ # 2013 cites=465 LRG1 promotes angiogenesis by modulating endotheli
135
+ # 2021 cites=40260 Highly accurate protein structure prediction with
136
+ # 2020 cites=13752 Safety and Efficacy of the BNT162b2 mRNA Covid-19
137
+ # 2015 cites=13553 Tissue-based map of the human proteome
138
+ # 2024 cites=12037 Accurate structure prediction of biomolecular inte
139
+ ```
140
+
141
+ ### Search works by keyword
142
+
143
+ ```python
144
+ from helpers import http_get
145
+ import json
146
+
147
+ MAILTO = "mailto=your@email.com"
148
+
149
+ # Broad keyword search
150
+ data = json.loads(http_get(
151
+ f"https://api.crossref.org/works?query=machine+learning&rows=5&{MAILTO}"
152
+ ))
153
+ msg = data['message']
154
+ print("Total results:", msg['total-results']) # 2,805,391
155
+ for item in msg['items']:
156
+ title = item.get('title', ['(no title)'])[0][:60]
157
+ doi = item.get('DOI', '')
158
+ year = (item.get('published', {}).get('date-parts') or [[None]])[0][0]
159
+ type_ = item.get('type', '')
160
+ print(f" [{type_}] {year} {title}")
161
+ print(f" DOI: {doi}")
162
+ ```
163
+
164
+ ### Search by author + title (targeted)
165
+
166
+ ```python
167
+ from helpers import http_get
168
+ import json
169
+
170
+ MAILTO = "mailto=your@email.com"
171
+
172
+ data = json.loads(http_get(
173
+ f"https://api.crossref.org/works?query.author=Lecun&query.title=deep+learning&rows=5&{MAILTO}"
174
+ ))
175
+ msg = data['message']
176
+ print("Total results:", msg['total-results']) # 62
177
+ for item in msg['items'][:3]:
178
+ title = item.get('title', [''])[0][:60]
179
+ authors = ', '.join(a.get('family', '') for a in item.get('author', [])[:2])
180
+ year = (item.get('published', {}).get('date-parts') or [[None]])[0][0]
181
+ print(f" {year} {title}")
182
+ print(f" Authors: {authors} DOI: {item.get('DOI')}")
183
+ # Confirmed output:
184
+ # 2015 Deep learning & convolutional networks
185
+ # Authors: LeCun DOI: 10.1109/hotchips.2015.7477328
186
+ ```
187
+
188
+ ### Filter by date, type, and sort by citations
189
+
190
+ ```python
191
+ from helpers import http_get
192
+ import json
193
+
194
+ MAILTO = "mailto=your@email.com"
195
+
196
+ data = json.loads(http_get(
197
+ f"https://api.crossref.org/works"
198
+ f"?filter=from-pub-date:2024-01-01,type:journal-article"
199
+ f"&rows=5&sort=is-referenced-by-count&order=desc&{MAILTO}"
200
+ ))
201
+ msg = data['message']
202
+ print("Total 2024+ journal articles:", msg['total-results']) # 14,565,456
203
+ for item in msg['items'][:3]:
204
+ title = item.get('title', [''])[0][:60]
205
+ cites = item.get('is-referenced-by-count', 0)
206
+ year = (item.get('published', {}).get('date-parts') or [[None]])[0][0]
207
+ print(f" {year} cites={cites} {title}")
208
+ # Confirmed output:
209
+ # 2024 cites=17371 Global cancer statistics 2022: GLOBOCAN estimates...
210
+ # 2024 cites=12037 Accurate structure prediction of biomolecular int...
211
+ ```
212
+
213
+ ### Filter with `has-abstract:true`
214
+
215
+ ```python
216
+ from helpers import http_get
217
+ import json
218
+
219
+ MAILTO = "mailto=your@email.com"
220
+
221
+ # Only return works that have an abstract (useful since ~30-70% do not)
222
+ data = json.loads(http_get(
223
+ f"https://api.crossref.org/works"
224
+ f"?filter=from-pub-date:2023-01-01,until-pub-date:2023-12-31"
225
+ f",type:journal-article,has-abstract:true"
226
+ f"&rows=3&sort=is-referenced-by-count&order=desc&{MAILTO}"
227
+ ))
228
+ msg = data['message']
229
+ print("2023 journal articles with abstract:", msg['total-results']) # 3,041,841
230
+ for item in msg['items']:
231
+ print(item.get('title', [''])[0][:60], '| cites:', item.get('is-referenced-by-count'))
232
+ # Confirmed output:
233
+ # Cancer statistics, 2023 | cites: 12919
234
+ # Evolutionary-scale prediction of atomic-level protein struct | cites: 4352
235
+ ```
236
+
237
+ ### Cursor pagination (large result sets)
238
+
239
+ Offset pagination (`offset=`) caps at around 10,000 results. Use a cursor for full sweeps.
240
+
241
+ ```python
242
+ from helpers import http_get
243
+ from urllib.parse import quote
244
+ import json
245
+
246
+ MAILTO = "mailto=your@email.com"
247
+
248
+ # First page: cursor=*
249
+ data = json.loads(http_get(
250
+ f"https://api.crossref.org/works?query=covid&rows=100&cursor=*&{MAILTO}"
251
+ ))
252
+ msg = data['message']
253
+ print("Total results:", msg['total-results']) # 897,660
254
+ items = msg['items']
255
+ next_cursor = msg['next-cursor'] # base64 string like "DnF1ZXJ5VGhlbkZldGNoJA..."
256
+
257
+ # Next pages: pass URL-encoded cursor
258
+ while next_cursor and items:
259
+     data = json.loads(http_get(
260
+         f"https://api.crossref.org/works?query=covid&rows=100"
261
+         f"&cursor={quote(next_cursor)}&{MAILTO}"
262
+     ))
263
+     msg = data['message']
264
+     items = msg.get('items', [])
265
+     next_cursor = msg.get('next-cursor')
266
+     # process items...
267
+     break  # remove for full sweep
268
+ ```
269
+
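The cursor loop above can be wrapped in a generator so callers iterate over items without tracking the cursor themselves. This is a sketch under stated assumptions, not part of the package: `iter_cursor_pages` is a hypothetical name, and `fetch` stands for any callable that takes a URL and returns the response body as text (such as the `http_get` helper).

```python
import json
from urllib.parse import quote

def iter_cursor_pages(fetch, base_url, max_pages=None):
    """Yield work items across pages using cursor pagination.

    `base_url` must already carry a query string,
    e.g. "https://api.crossref.org/works?query=covid&rows=100".
    """
    cursor = "*"  # first page
    pages = 0
    while cursor and (max_pages is None or pages < max_pages):
        # Send "*" literally; URL-encode every later cursor value.
        cursor_param = "*" if cursor == "*" else quote(cursor, safe="")
        msg = json.loads(fetch(f"{base_url}&cursor={cursor_param}"))["message"]
        items = msg.get("items", [])
        if not items:  # an empty page ends the sweep
            break
        yield from items
        cursor = msg.get("next-cursor")
        pages += 1
```

Passing `max_pages=1` reproduces the single-page behaviour of the `break` in the loop above, which is useful while testing a query.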
270
+ ### Fetch specific fields only (`select=`)
271
+
272
+ Reduces response size significantly for bulk operations:
273
+
274
+ ```python
275
+ from helpers import http_get
276
+ import json
277
+
278
+ MAILTO = "mailto=your@email.com"
279
+
280
+ data = json.loads(http_get(
281
+ f"https://api.crossref.org/works?query=cancer&rows=5"
282
+ f"&select=DOI,title,author&{MAILTO}"
283
+ ))
284
+ # Warning: if a field is absent for a record, it simply won't appear in that item
285
+ for item in data['message']['items']:
286
+ print(list(item.keys())) # only ['DOI', 'title'] or ['DOI', 'title', 'author']
287
+ # Note: select= does NOT guarantee the field appears — absent fields are just omitted
288
+ ```
289
+
290
+ ### Count by type using facets
291
+
292
+ ```python
293
+ from helpers import http_get
294
+ import json
295
+
296
+ MAILTO = "mailto=your@email.com"
297
+
298
+ data = json.loads(http_get(
299
+ f"https://api.crossref.org/works?query=machine+learning&rows=0"
300
+ f"&facet=type-name:*&{MAILTO}"
301
+ ))
302
+ msg = data['message']
303
+ type_facet = msg['facets']['type-name']
304
+ for k, v in sorted(type_facet['values'].items(), key=lambda x: -x[1]):
305
+ print(f" {k}: {v:,}")
306
+ # Confirmed output (query=machine+learning, 2026-04-18):
307
+ # Journal Article: 1,628,997
308
+ # Conference Paper: 501,433
309
+ # Chapter: 455,907
310
+ # Posted Content: 87,937
311
+ # ...
312
+ ```
313
+
314
+ ### Journal info by ISSN
315
+
316
+ ```python
317
+ from helpers import http_get
318
+ import json
319
+
320
+ MAILTO = "mailto=your@email.com"
321
+
322
+ # Nature (ISSN 0028-0836)
323
+ data = json.loads(http_get(f"https://api.crossref.org/journals/0028-0836?{MAILTO}"))
324
+ msg = data['message']
325
+ print("Title:", msg['title']) # Nature
326
+ print("Publisher:", msg['publisher']) # Springer Science and Business Media LLC
327
+ print("ISSN:", msg['ISSN']) # ['0028-0836', '1476-4687']
328
+ print("Total DOIs:", msg['counts']['total-dois']) # 445,417
329
+ print("Subjects:", msg.get('subjects', [])) # [] (not always populated)
330
+
331
+ # Search journals by name
332
+ data2 = json.loads(http_get(f"https://api.crossref.org/journals?query=nature&rows=3&{MAILTO}"))
333
+ for j in data2['message']['items']:
334
+ print(f"{j.get('title')} | ISSN: {j.get('ISSN')} | DOIs: {j.get('counts', {}).get('total-dois')}")
335
+ # Confirmed output:
336
+ # NatureJobs | ISSN: [] | DOIs: 0
337
+ # Naturen | ISSN: ['0028-0887', '1504-3118'] | DOIs: 1055
338
+ ```
339
+
340
+ ### Funder search
341
+
342
+ ```python
343
+ from helpers import http_get
344
+ import json
345
+
346
+ MAILTO = "mailto=your@email.com"
347
+
348
+ data = json.loads(http_get(
349
+ f"https://api.crossref.org/funders?query=national+science+foundation&rows=3&{MAILTO}"
350
+ ))
351
+ msg = data['message']
352
+ print("Total funders:", msg['total-results']) # 108
353
+ for f in msg['items']:
354
+ print(f" ID: {f['id']} | {f['name']}")
355
+ print(f" Alt names: {f.get('alt-names', [])[:2]}")
356
+ print(f" URI: {f.get('uri')}")
357
+ # Confirmed output:
358
+ # ID: 501100001711 | Schweizerischer Nationalfonds zur Förderung...
359
+ # ID: 100000143 | Division of Computing and Communication Foundations
360
+ ```
361
+
362
+ ### DOI content negotiation (alternative, no CrossRef API needed)
363
+
364
+ The `doi.org` resolver can return formatted metadata directly via `Accept` header:
365
+
366
+ ```python
367
+ import urllib.request, json
368
+
369
+ def doi_to_csl(doi):
370
+     """Fetch CSL-JSON via DOI content negotiation. Similar metadata to the CrossRef API, in CSL-JSON form."""
371
+     req = urllib.request.Request(
372
+         f"https://doi.org/{doi}",
373
+         headers={"Accept": "application/vnd.citationstyles.csl+json",
374
+                  "User-Agent": "Mozilla/5.0"}
375
+     )
376
+     with urllib.request.urlopen(req, timeout=20) as r:
377
+         return json.loads(r.read().decode())
378
+
379
+ def doi_to_bibtex(doi):
380
+     """Fetch BibTeX via DOI content negotiation."""
381
+     req = urllib.request.Request(
382
+         f"https://doi.org/{doi}",
383
+         headers={"Accept": "application/x-bibtex", "User-Agent": "Mozilla/5.0"}
384
+     )
385
+     with urllib.request.urlopen(req, timeout=20) as r:
386
+         return r.read().decode()
387
+
388
+ csl = doi_to_csl("10.1038/nature12345")
389
+ print("Title:", csl['title']) # LRG1 promotes angiogenesis...
390
+ print("Type:", csl['type']) # journal-article
391
+
392
+ bib = doi_to_bibtex("10.1038/nature12345")
393
+ print(bib[:200])
394
+ # @article{Wang_2013, title={LRG1 promotes angiogenesis...
395
+ ```
396
+
397
+ ## Field reference
398
+
399
+ ### Work object — complete field list
400
+
401
+ All fields are potentially absent unless marked required. Fields marked (R) are always present.
402
+
403
+ | Field | Type | Notes |
404
+ |---|---|---|
405
+ | `DOI` (R) | string | e.g. `"10.1038/s41586-021-03819-2"` |
406
+ | `URL` (R) | string | `"https://doi.org/10.1038/s41586-021-03819-2"` |
407
+ | `title` (R) | list[str] | Always a list; access `title[0]` |
408
+ | `type` (R) | string | e.g. `"journal-article"` — see type table below |
409
+ | `publisher` | string | |
410
+ | `container-title` | list[str] | Journal name; access `[0]` |
411
+ | `short-container-title` | list[str] | Abbreviated journal name |
412
+ | `ISSN` | list[str] | May contain print and online ISSN |
413
+ | `volume` | string | Note: string not int (`"596"`) |
414
+ | `issue` | string | |
415
+ | `page` | string | e.g. `"583-589"` |
416
+ | `author` | list[object] | See author fields below |
417
+ | `published` | date-object | Best single date — use this |
418
+ | `published-online` | date-object | Online-first date |
419
+ | `published-print` | date-object | Print edition date |
420
+ | `issued` | date-object | Usually same as `published` |
421
+ | `is-referenced-by-count` | int | Inbound citations to this work |
422
+ | `references-count` | int | Outbound references from this work |
423
+ | `reference` | list[object] | Full reference list (when deposited) |
424
+ | `abstract` | string | JATS XML markup; ~30-70% of works; strip tags before use |
425
+ | `subject` | list[str] | Subject classification (often empty) |
426
+ | `language` | string | e.g. `"en"` |
427
+ | `license` | list[object] | Each: `{URL, start, delay-in-days, content-version}` |
428
+ | `funder` | list[object] | Each: `{name, DOI, award}` |
429
+ | `link` | list[object] | Full-text links |
430
+ | `relation` | object | Related DOIs (e.g. preprint → article) |
431
+ | `assertion` | list[object] | Publisher-specific metadata |
432
+ | `alternative-id` | list[str] | Publisher's internal IDs |
433
+ | `member` | string | CrossRef member ID |
434
+ | `prefix` | string | DOI prefix |
435
+ | `score` | float | Relevance score (search results only) |
436
+ | `source` | string | e.g. `"Crossref"` |
437
+ | `indexed` | date-object | When CrossRef indexed this record |
438
+ | `deposited` | date-object | When publisher last deposited metadata |
439
+ | `created` | date-object | When CrossRef record was first created |
440
+
441
+ ### Author object fields
442
+
443
+ | Field | Notes |
444
+ |---|---|
445
+ | `given` | Given/first name |
446
+ | `family` | Family/last name |
447
+ | `sequence` | `"first"` or `"additional"` |
448
+ | `affiliation` | list of `{name, place}` — usually `[]` |
449
+ | `ORCID` | Full URL `"https://orcid.org/0000-0001-..."` — strip prefix to get bare ID |
450
+ | `authenticated-orcid` | `true` = verified via ORCID OAuth; `false` = self-reported |
451
+ | `name` | Used instead of given/family for organizations |
452
+
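The author fields in the table above can be turned into display values with two small helpers. This is an illustrative sketch that works on plain dicts; `author_display_name` and `bare_orcid` are hypothetical names, not part of the package.

```python
def author_display_name(author):
    """Return a printable name for an author object (a plain dict)."""
    if author.get("name"):  # organizations use `name` instead of given/family
        return author["name"]
    parts = [author.get("given", ""), author.get("family", "")]
    return " ".join(p for p in parts if p)

def bare_orcid(author):
    """Return the bare ORCID iD from the full-URL `ORCID` field, or None."""
    url = author.get("ORCID")
    if not url:
        return None
    # Keep only the last path segment of the URL.
    return url.rstrip("/").rsplit("/", 1)[-1]
```

Taking the last path segment rather than replacing a fixed prefix is a design choice made here so that the helper does not depend on the exact URL scheme.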
453
+ ### Date object structure
454
+
455
+ ```python
456
+ # All date fields share this structure:
457
+ date_obj = {
458
+ "date-parts": [[2021, 7, 15]], # [[year, month, day]] — month/day may be absent
459
+ "date-time": "2021-07-15T00:00:00Z", # not always present
460
+ "timestamp": 1626307200000 # not always present
461
+ }
462
+
463
+ # Safe extraction (handles [[2021]] or [[2021, 7]] partial dates):
464
+ def parse_date(d):
465
+     if not d: return None
466
+     parts = (d.get('date-parts') or [[]])[0]
467
+     return '-'.join(str(p) for p in parts if p is not None)
468
+ ```
469
+
470
+ ### Type identifiers (filter param values vs facet display names)
471
+
472
+ Use these exact strings in `filter=type:...`. The facet `type-name` values are display names only.
473
+
474
+ | filter `type:` value | Facet display name | Count (all CrossRef) |
475
+ |---|---|---|
476
+ | `journal-article` | Journal Article | 121,030,194 |
477
+ | `book-chapter` | Chapter | 24,359,059 |
478
+ | `proceedings-article` | Conference Paper | 9,744,754 |
479
+ | `dataset` | Dataset | 3,424,142 |
480
+ | `posted-content` | Posted Content (preprints) | 3,203,320 |
481
+ | `dissertation` | Dissertation | 1,044,461 |
482
+ | `peer-review` | Peer Review | 1,028,287 |
483
+ | `report` | Report | 906,301 |
484
+ | `book` | Book | 870,949 |
485
+ | `monograph` | Monograph | 788,401 |
486
+
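Because facet display names and filter values differ, a lookup table helps when going from facet output to a follow-up filtered query. The mapping below is copied from the table above; the `type_filter` helper is a hypothetical name used only for illustration.

```python
# Facet display name -> filter `type:` value, taken from the table above.
FACET_NAME_TO_TYPE = {
    "Journal Article": "journal-article",
    "Chapter": "book-chapter",
    "Conference Paper": "proceedings-article",
    "Dataset": "dataset",
    "Posted Content": "posted-content",
    "Dissertation": "dissertation",
    "Peer Review": "peer-review",
    "Report": "report",
    "Book": "book",
    "Monograph": "monograph",
}

def type_filter(name_or_id):
    """Build a `type:...` filter fragment from a display name or a filter value."""
    if name_or_id in FACET_NAME_TO_TYPE.values():
        return f"type:{name_or_id}"  # already a valid filter value
    try:
        return f"type:{FACET_NAME_TO_TYPE[name_or_id]}"
    except KeyError:
        raise ValueError(f"unknown CrossRef type: {name_or_id!r}") from None
```

The table only lists the ten largest types, so the mapping is deliberately incomplete and raises on anything it does not know.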
487
+ ### Query parameters reference
488
+
489
+ | Parameter | Notes |
490
+ |---|---|
491
+ | `query` | Full-text keyword search across title, abstract, author |
492
+ | `query.author` | Author name search only |
493
+ | `query.title` | Title search only |
494
+ | `query.bibliographic` | Combined title + author + journal search |
495
+ | `rows` | Results per page (default 20, max 1000) |
496
+ | `offset` | Offset for pagination (max ~10,000 effective) |
497
+ | `cursor` | Use `cursor=*` for first page, then URL-encode `next-cursor` value |
498
+ | `sort` | `relevance`, `is-referenced-by-count`, `published`, `indexed` |
499
+ | `order` | `asc` or `desc` |
500
+ | `filter` | Comma-separated `key:value` pairs (see filters below) |
501
+ | `select` | Comma-separated field names to return |
502
+ | `facet` | `type-name:*` for type counts; `publisher-name:10` for top publishers |
503
+ | `mailto` | Your email — enables polite pool (higher limits) |
504
+
505
+ ### Filter keys reference
506
+
507
+ | Filter key | Example | Notes |
508
+ |---|---|---|
509
+ | `doi` | `doi:10.1038/nature12345` | Exact DOI match |
510
+ | `type` | `type:journal-article` | See type table above for valid values |
511
+ | `from-pub-date` | `from-pub-date:2024-01-01` | ISO date or `YYYY` |
512
+ | `until-pub-date` | `until-pub-date:2024-12-31` | |
513
+ | `from-index-date` | `from-index-date:2024-01-01` | When CrossRef indexed it |
514
+ | `has-abstract` | `has-abstract:true` | Only works with deposited abstract |
515
+ | `has-orcid` | `has-orcid:true` | At least one author has ORCID |
516
+ | `has-full-text` | `has-full-text:true` | Has full-text link |
517
+ | `has-references` | `has-references:true` | Has deposited reference list |
518
+ | `is-update` | `is-update:true` | Corrections, retractions |
519
+ | `issn` | `issn:0028-0836` | Filter by journal ISSN |
520
+ | `publisher-name` | `publisher-name:elsevier` | Partial match |
521
+ | `funder` | `funder:100000001` | Funder DOI or CrossRef funder ID |
522
+
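Since `filter` takes comma-separated `key:value` pairs, it can be convenient to assemble it from a dict. The sketch below is illustrative only: `build_filter` is a hypothetical helper, and it assumes booleans should be sent as lowercase `true`/`false`, matching the examples in the table.

```python
def build_filter(filters):
    """Join a dict of filter keys into the comma-separated key:value form."""
    parts = []
    for key, value in filters.items():
        if isinstance(value, bool):
            # The table's examples use lowercase literals, e.g. has-abstract:true
            value = str(value).lower()
        parts.append(f"{key}:{value}")
    return ",".join(parts)


filter_value = build_filter({
    "type": "journal-article",
    "has-abstract": True,
    "from-pub-date": "2024-01-01",
})
```

The resulting string is passed as the value of the `filter` query parameter.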
523
+ ## Rate limits
524
+
525
+ CrossRef has two pools based on whether `mailto=` is present:
526
+
527
+ | Pool | Triggered by | Rate limit | Concurrency |
528
+ |---|---|---|---|
529
+ | **polite** | `mailto=` param present | 10 req/s | 3 concurrent |
530
+ | **public** | no `mailto=` | 5 req/s | 1 concurrent |
531
+
532
+ Headers returned: `x-rate-limit-limit`, `x-rate-limit-interval`, `x-concurrency-limit`, `x-api-pool`.
533
+
534
+ In practice with polite pool: 10 rapid sequential calls complete in ~2.7s (avg 0.27s/req) with no throttling. 5 parallel calls also complete in ~0.3s, but keep `max_workers` at 3 or below to stay within the polite pool's stated concurrency limit.
535
+
536
+ No per-day or per-hour cap. If you exceed limits, responses slow or return HTTP 429. No ban. Add `time.sleep(0.1)` between calls for sustained bulk crawls.
537
+
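One way to stay inside these limits during a bulk crawl is to cap the worker count, pause briefly after each call, and retry on HTTP 429. The sketch below is an assumption-laden illustration rather than a prescribed pattern: `call_with_backoff` and `run_throttled` are hypothetical names, `fn` stands for whatever fetch function you already have, and the defaults (3 workers, 0.1s pause, doubling backoff) simply mirror the figures quoted above.

```python
import time
import urllib.error
from concurrent.futures import ThreadPoolExecutor


def call_with_backoff(fn, arg, retries=3, base_delay=0.5):
    """Call fn(arg); on HTTP 429, sleep and retry with a doubling delay."""
    for attempt in range(retries + 1):
        try:
            return fn(arg)
        except urllib.error.HTTPError as err:
            # Re-raise anything that is not a 429, or a 429 on the last attempt
            if err.code != 429 or attempt == retries:
                raise
            time.sleep(base_delay * 2 ** attempt)


def run_throttled(fn, items, max_workers=3, delay=0.1):
    """Map fn over items with bounded concurrency and a pause after each call."""
    def task(item):
        result = call_with_backoff(fn, item)
        time.sleep(delay)  # spacing between calls for sustained crawls
        return result

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, items))  # results keep the input order
```

Usage would look like `run_throttled(fetch_one, doi_list)`, where `fetch_one` is your own single-request function.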
538
+ ## Gotchas
539
+
540
+ - **`mailto=` doubles your rate limit and triples your concurrency.** Public pool: 5 req/s, concurrency=1. Polite pool: 10 req/s, concurrency=3. Always add `?mailto=your@email.com` to every request; the `x-api-pool` response header confirms which pool served the call.
541
+
542
+ - **`title`, `container-title`, `ISSN` are always lists, not strings.** Access with `title[0]`, `container-title[0]` etc. Do not rely on there being only one entry — `container-title` can have multiple values.
543
+
544
+ - **Abstract contains JATS XML markup.** The `abstract` field is not plain text: it contains tags like `<jats:p>`, `<jats:italic>`, `<jats:sup>`. Strip with `re.sub(r'<[^>]+>', ' ', abstract)`. Only about 30-70% of works have an abstract at all. For example, among 2023 journal articles the `has-abstract:true` filter matched 3,041,841 of ~5.5M total (~55%).
545
+
546
+ - **ORCID is a full URL, not just the ID.** `a['ORCID']` = `"https://orcid.org/0000-0001-6169-6580"`. Strip with `.replace('https://orcid.org/', '')` to get the bare ID. `authenticated-orcid: false` means self-asserted (not verified via OAuth).
547
+
548
+ - **`published` vs `published-print` vs `published-online`.** Online-first is common in journals: a paper may be online months before its print issue. `published` is CrossRef's best single date and equals `published-online` when both exist. Preprints (`posted-content` type) may carry only `posted` and `published`, so look for `posted` instead of `published-print`. Partial dates like `[[2023]]` (year only) are valid, so always use `parse_date()` to handle a missing month/day.
549
+
550
+ - **404 raises `HTTPError`, not a JSON error response.** An invalid DOI (e.g. `10.9999/doesnotexist`) raises `urllib.error.HTTPError: HTTP Error 404: Not Found`. Wrap `fetch_work()` in try/except for any untrusted DOI list.
551
+
552
+ - **`volume` and `issue` are strings, not integers.** CrossRef stores them as strings — `"596"`, not `596`. Don't compare with `==` to an int.
553
+
554
+ - **Filter type values are hyphenated lowercase, not the facet display names.** `filter=type:journal-article` works. `filter=type:journal article`, `filter=type:Journal Article`, and `filter=type:conference-paper` all return HTTP 400. Conference papers are `proceedings-article`.
555
+
556
+ - **`select=` does not guarantee field presence.** When you `select=DOI,title,author`, a record that has no author still omits the `author` key — it doesn't return `author: []`. Always use `.get()`.
557
+
558
+ - **Cursor pagination required for >10,000 results.** Offset pagination (`offset=`) is limited to around 10,000 results. For bulk sweeps, use `cursor=*` for the first page, then URL-encode the returned `next-cursor` value with `urllib.parse.quote()`. A cursor expires if it goes unused (CrossRef's documentation gives 5 minutes), so request the next page promptly.
559
+
560
+ - **`rows` max is 1000 per call.** Requesting more silently returns 1000. For cursor-based sweeps of large result sets (millions of records), `rows=1000` with cursor is the most efficient approach.
561
+
562
+ - **HTML entities in titles.** Titles may contain HTML entities like `&amp;` — `"Deep learning &amp; convolutional networks"`. Decode with `html.unescape()` if needed.
563
+
564
+ - **`funder` search `works-count` field is `None`.** The funder search result object has a `works-count` key that is always `None` in the search response. To get actual work counts for a funder, fetch the funder directly: `GET /funders/{id}`.
565
+
566
+ - **`subject` is often an empty list.** The `subject` field in works is populated inconsistently — many journal articles have `subject: []` even for well-indexed journals like Nature.
567
+
568
+ - **Affiliation is usually empty.** `author[i]['affiliation']` is `[]` for the majority of records, even for papers published in 2024. CrossRef has been working on affiliation deposit, but coverage is inconsistent.
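Several of the gotchas above (list-valued fields, JATS markup, ORCID URLs, HTML entities, missing keys) can be handled in one normalization pass. The following is a minimal sketch under stated assumptions: `first`, `strip_jats`, and `normalize_work` are hypothetical helper names, the output keys are chosen for illustration, and the `given`/`family` author keys are assumed to be present on person authors.

```python
import html
import re

ORCID_PREFIX = "https://orcid.org/"


def first(values):
    """Return the first entry of a list-valued field, or None if absent/empty."""
    return values[0] if values else None


def strip_jats(abstract):
    """Drop JATS/XML tags, decode entities, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", abstract)
    return " ".join(html.unescape(text).split())


def normalize_work(item):
    """Flatten one work record into plain values, tolerating missing keys."""
    title = first(item.get("title"))
    abstract = item.get("abstract")
    authors = []
    for a in item.get("author", []):  # key is omitted when there are no authors
        orcid = a.get("ORCID")
        name = " ".join(p for p in (a.get("given"), a.get("family")) if p)
        authors.append({
            "name": name,
            "orcid": orcid.replace(ORCID_PREFIX, "") if orcid else None,
        })
    return {
        "doi": item.get("DOI"),
        "title": html.unescape(title) if title else None,
        "journal": first(item.get("container-title")),  # may have several values
        "abstract": strip_jats(abstract) if abstract else None,
        "volume": item.get("volume"),  # stays a string, e.g. "596"
        "authors": authors,
    }
```

Every lookup goes through `.get()`, so a record trimmed by `select=` or lacking an abstract still normalizes without raising.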