@pencil-agent/nano-pencil 2.0.1 → 2.0.3

Files changed (188)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/model/custom-providers.js +1 -1
  7. package/dist/core/model-registry.js +5 -5
  8. package/dist/extensions/builtin/AGENT.md +115 -115
  9. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  10. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  11. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  12. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  13. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  14. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  15. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  16. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  17. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  18. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  19. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  20. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  91. package/dist/extensions/builtin/browser/browser.md +73 -73
  92. package/dist/extensions/builtin/browser/install.md +142 -142
  93. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  94. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  95. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  96. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  97. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  98. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  99. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  100. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  101. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  102. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  103. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  104. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  105. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  107. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  108. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  109. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  110. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  111. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  112. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  113. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  114. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  115. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  116. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  117. package/dist/extensions/builtin/debug/index.js +9 -9
  118. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  119. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  120. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  121. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  122. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  123. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  124. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  125. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  126. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  127. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  128. package/dist/extensions/builtin/goal/README.md +67 -67
  129. package/dist/extensions/builtin/goal/index.js +6 -6
  130. package/dist/extensions/builtin/grub/README.md +112 -112
  131. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  132. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  133. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  134. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  135. package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
  136. package/dist/extensions/builtin/loop/README.md +92 -92
  137. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  138. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  139. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  140. package/dist/extensions/builtin/sal/README.md +72 -72
  141. package/dist/extensions/builtin/security-audit/README.md +289 -289
  142. package/dist/extensions/builtin/team/AGENT.md +112 -112
  143. package/dist/extensions/builtin/team/TESTING.md +299 -299
  144. package/dist/extensions/builtin/token-save/README.md +56 -56
  145. package/dist/extensions/optional/AGENT.md +10 -10
  146. package/dist/modes/interactive/controllers/input-submit-controller.js +2 -2
  147. package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
  148. package/dist/modes/interactive/interactive-mode.js +19 -19
  149. package/dist/modes/interactive/theme/dark.json +85 -85
  150. package/dist/modes/interactive/theme/light.json +84 -84
  151. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  152. package/dist/modes/interactive/theme/warm.json +81 -81
  153. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  154. package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
  155. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
  156. package/docs/SDK-TESTING.md +364 -0
  157. package/docs/codex-goal-command-impl.md +1055 -1055
  158. package/docs/codex-goal-vs-grub.md +500 -500
  159. package/docs/custom-provider.md +27 -27
  160. package/docs/extensions.md +27 -27
  161. package/docs/keybindings.md +27 -27
  162. package/docs/loop 重构完成总结.md +250 -250
  163. package/docs/loop 重构完成报告.md +122 -122
  164. package/docs/loop 重构方案.md +1222 -1222
  165. package/docs/loop 重构方案实现报告.md +158 -158
  166. package/docs/loop 重构方案对比分析.md +128 -128
  167. package/docs/loop 重构计划.md +320 -320
  168. package/docs/loop-usage-examples.md +214 -214
  169. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
  170. package/docs/models.md +27 -27
  171. package/docs/packages.md +27 -27
  172. package/docs/pi-design-philosophy.md +457 -457
  173. package/docs/planmode.md +1987 -1987
  174. package/docs/prompt-templates.md +27 -27
  175. package/docs/providers.md +27 -27
  176. package/docs/sdk.md +27 -27
  177. package/docs/skills.md +27 -27
  178. package/docs/startup-performance-optimization.md +301 -0
  179. package/docs/themes.md +27 -27
  180. package/docs/tui.md +27 -27
  181. package/docs/认知地图.md +47 -0
  182. package/package.json +190 -190
  183. package/docs/cc-agent-design.md +0 -1297
  184. package/docs/cc-tui-design.md +0 -1333
  185. package/docs/nanoPencil-学习计划.md +0 -170
  186. package/docs/scan-report.md +0 -3820
  187. package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
  188. package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +0 -261
@@ -1,470 +1,470 @@
1
- # OpenAlex — Scraping & Data Extraction
2
-
3
- `https://api.openalex.org` is an open academic knowledge graph covering 260M+ works, 90M+ authors, and 110K+ institutions. **Never use the browser for OpenAlex.** The entire API is JSON over HTTPS, free, and requires no API key. Add `mailto=you@example.com` to every request to use the polite pool (10 req/s versus 100 req/s in the common pool, but more reliable).
4
-
5
- ## Do this first
6
-
7
- **Use `http_get` with the REST JSON API — one call, JSON response, no auth, no parsing library.**
8
-
9
- ```python
10
- from helpers import http_get
11
- import json
12
-
13
- data = json.loads(http_get(
14
- "https://api.openalex.org/works?search=transformer+attention&per-page=5&mailto=you@example.com"
15
- ))
16
- works = data["results"]
17
- total = data["meta"]["count"]
18
- ```
19
-
20
- Always include `mailto=` to stay in the polite pool. Always parse with `json.loads()`.
21
-
22
- ## Common workflows
23
-
24
- ### Search papers (works)
25
-
26
- ```python
27
- from helpers import http_get
28
- import json
29
-
30
- data = json.loads(http_get(
31
- "https://api.openalex.org/works"
32
- "?search=transformer+attention"
33
- "&per-page=5"
34
- "&sort=cited_by_count:desc"
35
- "&select=id,doi,display_name,publication_year,cited_by_count,open_access,primary_location"
36
- "&mailto=you@example.com"
37
- ))
38
- print("total matching:", data["meta"]["count"])
39
- for w in data["results"]:
40
-     oa = w["open_access"]
41
-     loc = w["primary_location"] or {}
42
-     src = loc.get("source") or {}
43
-     print(w["id"].split("/")[-1], w["publication_year"], w["cited_by_count"], w["display_name"][:60])
44
-     print(" doi:", w["doi"])
45
-     print(" open access:", oa["is_oa"], "| pdf:", oa["oa_url"])
46
-     print(" journal:", src.get("display_name"))
47
- # Confirmed output (2026-04-18):
48
- # W3151130473 2021 1887 CrossViT: Cross-Attention Multi-Scale Vision Transformer for I
49
- # doi: https://doi.org/10.1109/iccv48922.2021.00041
50
- # open access: False | pdf: None
51
- # journal: 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
52
- ```
53
-
54
- ### Fetch single paper by OpenAlex ID or DOI
55
-
56
- ```python
57
- from helpers import http_get
58
- import json
59
-
60
- # By OpenAlex ID (bare or full URL form both work)
61
- w = json.loads(http_get("https://api.openalex.org/works/W2626778328?mailto=you@example.com"))
62
- print(w["display_name"], w["cited_by_count"])
63
- # Confirmed: Attention Is All You Need 6526
64
-
65
- # By DOI (pass the full DOI URL as the entity ID)
66
- w = json.loads(http_get(
67
- "https://api.openalex.org/works/https://doi.org/10.1038/nature14539?mailto=you@example.com"
68
- ))
69
- print(w["display_name"], w["cited_by_count"])
70
- # Confirmed: Deep learning 79790
71
- ```
72
-
73
- ### Reconstruct abstract from inverted index
74
-
75
- OpenAlex does not return abstracts as plain strings — they come as an inverted index (`{word: [position, ...], ...}`) due to publisher agreements. Reconstruct as follows:
76
-
77
- ```python
78
- from helpers import http_get
79
- import json
80
-
81
- w = json.loads(http_get(
82
- "https://api.openalex.org/works/W2626778328"
83
- "?select=id,display_name,abstract_inverted_index"
84
- "&mailto=you@example.com"
85
- ))
86
- aii = w.get("abstract_inverted_index") or {}
87
- words_pos = [(pos, word) for word, positions in aii.items() for pos in positions]
88
- abstract = " ".join(word for _, word in sorted(words_pos))
89
- print(abstract[:200])
90
- # Confirmed: The dominant sequence transduction models are based on complex recurrent
91
- # or convolutional neural networks in an encoder-decoder configuration...
92
- ```
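The reconstruction step can also be checked offline. The sketch below wraps the same logic in a function and runs it on a made-up index; the name `rebuild_abstract` and the toy data are illustrative assumptions, not part of the skill file or the OpenAlex API.

```python
def rebuild_abstract(inverted_index):
    """Rebuild plain text from an OpenAlex-style {word: [positions]} index."""
    if not inverted_index:
        return ""
    # Flatten to (position, word) pairs, then order them by position.
    pairs = [(pos, word) for word, positions in inverted_index.items() for pos in positions]
    return " ".join(word for _, word in sorted(pairs))


# Toy index (invented for illustration): "the" occurs at positions 0 and 3.
toy = {"the": [0, 3], "cat": [1], "chased": [2], "mouse": [4]}
print(rebuild_abstract(toy))  # the cat chased the mouse
```

A word that appears several times contributes one pair per position, which is why the flattening loop iterates over `positions` rather than taking only the first entry.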
93
-
94
- ### Author lookup
95
-
96
- ```python
97
- from helpers import http_get
98
- import json
99
-
100
- # Search by name
101
- data = json.loads(http_get(
102
- "https://api.openalex.org/authors?search=geoffrey+hinton&per-page=3&mailto=you@example.com"
103
- ))
104
- for a in data["results"]:
105
-     bare_id = a["id"].split("/")[-1] # e.g. A5108093963
106
-     print(bare_id, a["display_name"], a["works_count"], "works |", a["cited_by_count"], "cites")
107
-     affils = a.get("affiliations", [])
108
-     if affils:
109
-         print(" latest affil:", affils[0]["institution"]["display_name"])
110
- # Confirmed:
111
- # A5108093963 Geoffrey E. Hinton 384 works | 446018 cites
112
- # latest affil: University of New Brunswick
113
-
114
- # Fetch by bare ID
115
- a = json.loads(http_get("https://api.openalex.org/authors/A5108093963?mailto=you@example.com"))
116
- print(a["display_name"], a["works_count"])
117
- # Confirmed: Geoffrey E. Hinton 384
118
-
119
- # Get all works by this author (sorted by citations)
120
- works_data = json.loads(http_get(
121
- "https://api.openalex.org/works"
122
- "?filter=author.id:A5108093963"
123
- "&per-page=5&sort=cited_by_count:desc"
124
- "&select=id,display_name,cited_by_count,publication_year"
125
- "&mailto=you@example.com"
126
- ))
127
- for w in works_data["results"]:
128
-     print(w["publication_year"], w["display_name"][:55], w["cited_by_count"])
129
- # Confirmed:
130
- # 2015 Deep learning 79790
131
- # 2017 ImageNet classification with deep convolutional neural netwo 75670
132
- # 2008 Visualizing Data using t-SNE 35710
133
- ```
134
-
135
- ### Institution lookup
136
-
137
- ```python
138
- from helpers import http_get
139
- import json
140
-
141
- data = json.loads(http_get(
142
- "https://api.openalex.org/institutions?search=MIT&per-page=3&mailto=you@example.com"
143
- ))
144
- for inst in data["results"]:
145
-     bare_id = inst["id"].split("/")[-1] # e.g. I63966007
146
-     print(bare_id, inst["display_name"], inst["country_code"], inst["works_count"], "works")
147
- # Confirmed: I63966007 Massachusetts Institute of Technology US 340302 works
148
-
149
- # Works from an institution
150
- works = json.loads(http_get(
151
- "https://api.openalex.org/works"
152
- "?filter=institutions.id:I63966007"
153
- "&per-page=3&sort=cited_by_count:desc"
154
- "&select=id,display_name,cited_by_count,publication_year"
155
- "&mailto=you@example.com"
156
- ))
157
- print("total MIT works:", works["meta"]["count"])
158
- # Confirmed: 323992
159
- ```
160
-
161
- ### Concept/Topic lookup
162
-
163
- Concepts (legacy, level-based hierarchy) and Topics (newer, 4-level hierarchy) are both available.
164
-
165
- ```python
166
- from helpers import http_get
167
- import json
168
-
169
- # Concepts endpoint (Wikidata-linked)
170
- data = json.loads(http_get(
171
- "https://api.openalex.org/concepts?search=machine+learning&per-page=5&mailto=you@example.com"
172
- ))
173
- for c in data["results"]:
174
-     bare_id = c["id"].split("/")[-1] # e.g. C119857082
175
-     print(bare_id, c["display_name"], "level:", c["level"], "works:", c["works_count"])
176
- # Confirmed: C119857082 Machine learning level: 1 works: 4960536
177
-
178
- # Topics endpoint (newer: domain > field > subfield > topic)
179
- data2 = json.loads(http_get(
180
- "https://api.openalex.org/topics?search=machine+learning&per-page=3&mailto=you@example.com"
181
- ))
182
- for t in data2["results"]:
183
-     print(t["id"].split("/")[-1], t["display_name"])
184
-     print(" ", t.get("domain", {}).get("display_name"), ">",
185
-           t.get("field", {}).get("display_name"), ">",
186
-           t.get("subfield", {}).get("display_name"))
187
- # Confirmed: T11948 Machine Learning in Materials Science
188
- # Physical Sciences > Materials Science > Materials Chemistry
189
- ```
190
-
191
- ### Source (journal/venue) lookup
192
-
193
- ```python
194
- from helpers import http_get
195
- import json
196
-
197
- data = json.loads(http_get(
198
- "https://api.openalex.org/sources?search=nature&per-page=3&mailto=you@example.com"
199
- ))
200
- for s in data["results"]:
201
-     bare_id = s["id"].split("/")[-1] # e.g. S137773608
202
-     print(bare_id, s["display_name"], s["type"], "issn:", s["issn"], "oa:", s["is_oa"])
203
- # Confirmed: S137773608 Nature journal issn: ['0028-0836', '1476-4687'] oa: False
204
-
205
- # Works in a source
206
- works = json.loads(http_get(
207
- "https://api.openalex.org/works?filter=primary_location.source.id:S137773608"
208
- "&per-page=3&sort=cited_by_count:desc"
209
- "&select=id,display_name,cited_by_count"
210
- "&mailto=you@example.com"
211
- ))
212
- print("Nature works:", works["meta"]["count"])
213
- ```
214
-
215
- ### Funder lookup
216
-
217
- ```python
218
- from helpers import http_get
219
- import json
220
-
221
- data = json.loads(http_get(
222
- "https://api.openalex.org/funders?search=national+science+foundation&per-page=3&mailto=you@example.com"
223
- ))
224
- for f in data["results"]:
225
-     bare_id = f["id"].split("/")[-1] # e.g. F4320306076
226
-     print(bare_id, f["display_name"], f["country_code"], f["works_count"], "works")
227
- # Confirmed: F4320306076 National Science Foundation US
228
- ```
229
-
230
- ### Citation traversal
231
-
232
- ```python
233
- from helpers import http_get
234
- import json
235
-
236
- paper_id = "W2626778328" # Attention Is All You Need
237
-
238
- # Papers that CITE this paper (forward citations)
239
- citing = json.loads(http_get(
240
- f"https://api.openalex.org/works?filter=cites:{paper_id}"
241
- "&per-page=5&sort=cited_by_count:desc"
242
- "&select=id,display_name,publication_year,cited_by_count"
243
- "&mailto=you@example.com"
244
- ))
245
- print("papers citing Attention:", citing["meta"]["count"])
246
- for w in citing["results"]:
247
-     print(f" {w['publication_year']} {w['display_name'][:55]} ({w['cited_by_count']} cites)")
248
- # Confirmed: 6536 papers cite it; top: AlphaFold2 (43435), ViT (21409)
249
-
250
- # Papers THIS paper cites (backward — list of IDs in the work object)
251
- paper = json.loads(http_get(
252
- f"https://api.openalex.org/works/{paper_id}?select=referenced_works&mailto=you@example.com"
253
- ))
254
- refs = paper.get("referenced_works", [])
255
- ref_ids = [r.split("/")[-1] for r in refs] # bare IDs like W1632114991
256
- print(f"references {len(ref_ids)} works:", ref_ids[:3])
257
- # Confirmed: references 28 works
258
- ```
259
-
260
- ### Cursor pagination (bulk harvest)
261
-
262
- Use cursor pagination (not page-based) for more than 10,000 results. Page-based fails with HTTP 400 beyond page 50 at per-page=200.
263
-
264
- ```python
265
- from helpers import http_get
266
- import json, urllib.parse
267
-
268
- def harvest_works(query_filter, max_results=1000, mailto="you@example.com"):
268
-     """Yield work dicts using cursor pagination."""
269
-     cursor = "*"
270
-     collected = 0
271
-     while collected < max_results:
272
-         per_page = min(200, max_results - collected)
273
-         encoded_cursor = urllib.parse.quote(cursor, safe="")
274
-         url = (
275
-             f"https://api.openalex.org/works"
276
-             f"?filter={query_filter}"
277
-             f"&per-page={per_page}"
278
-             f"&cursor={encoded_cursor}"
279
-             f"&select=id,display_name,publication_year,cited_by_count"
280
-             f"&mailto={mailto}"
281
-         )
282
-         data = json.loads(http_get(url))
283
-         results = data.get("results", [])
284
-         if not results:
285
-             break
286
-         for w in results:
287
-             yield w
288
-         collected += len(results)
289
-         next_cursor = data["meta"].get("next_cursor")
290
-         if not next_cursor:
291
-             break
292
-         cursor = next_cursor
294
-
295
- for w in harvest_works("concepts.id:C119857082,publication_year:2023", max_results=400):
296
-     print(w["id"].split("/")[-1], w["display_name"][:55])
297
- ```
298
-
299
- ### Group-by analytics
300
-
301
- ```python
302
- from helpers import http_get
303
- import json
304
-
305
- # Publication counts by year for machine learning papers
306
- data = json.loads(http_get(
307
- "https://api.openalex.org/works"
308
- "?filter=concepts.id:C119857082" # C119857082 = Machine learning concept
309
- "&group_by=publication_year"
310
- "&mailto=you@example.com"
311
- ))
312
- print("groups_count:", data["meta"]["groups_count"])
313
- for g in data.get("group_by", [])[:5]:
314
-     print(f" {g['key']}: {g['count']:,} works")
315
- # Confirmed (2026-04-18):
316
- # 2026: 5,678,538 works
317
- # 2025: 5,332,194 works
318
- # 2020: 3,966,880 works
319
-
320
- # Other useful group_by fields: open_access.oa_status, type, institutions.country_code,
321
- # authorships.institutions.country_code, primary_location.source.id
322
- ```
323
-
324
- ## Filter syntax reference
325
-
326
- Filters go in the `filter=` param as comma-separated `field:value` pairs. All conditions are AND-ed.
327
-
328
- ```
329
- # Exact match
330
- filter=publication_year:2023
331
-
332
- # Full-text search on a field
333
- filter=title.search:deep+learning
334
-
335
- # Combine multiple (AND)
336
- filter=title.search:CRISPR,publication_year:2022,open_access.is_oa:true
337
-
338
- # OR within one field (pipe operator)
339
- filter=publication_year:2022|2023
340
-
341
- # Negation
342
- filter=publication_year:!2020
343
-
344
- # Range
345
- filter=cited_by_count:>1000
346
- filter=publication_year:<2010
347
- filter=cited_by_count:100-500
348
-
349
- # Nested field access
350
- filter=author.id:A5108093963
351
- filter=institutions.id:I63966007
352
- filter=concepts.id:C119857082
353
- filter=primary_location.source.id:S137773608
354
- filter=open_access.is_oa:true
355
- filter=cites:W2626778328 # papers citing this work
356
- ```
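The filter rules above (comma for AND, pipe for OR within one field) can be assembled programmatically. This is a minimal sketch under stated assumptions: `build_filter` is a hypothetical helper, not something shipped in `helpers`, and it does not URL-encode values, so values containing spaces would still need `+` or percent-encoding before being placed in a URL.

```python
def build_filter(conditions):
    """Join {field: value} pairs into an OpenAlex filter string.

    A list or tuple value becomes an OR within that field (pipe-separated);
    separate fields are AND-ed (comma-separated).
    """
    parts = []
    for field, value in conditions.items():
        if isinstance(value, (list, tuple)):
            value = "|".join(str(v) for v in value)
        elif isinstance(value, bool):
            # The filter examples use lowercase true/false.
            value = "true" if value else "false"
        parts.append(f"{field}:{value}")
    return ",".join(parts)


flt = build_filter({
    "title.search": "CRISPR",
    "publication_year": [2022, 2023],
    "open_access.is_oa": True,
    "cited_by_count": ">100",
})
print(flt)
# title.search:CRISPR,publication_year:2022|2023,open_access.is_oa:true,cited_by_count:>100
```

The resulting string goes after `filter=` in the request URL, alongside `mailto=` and any `select=` fields.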
357
-
358
- Commonly useful filter fields for works:
359
-
360
- | Filter field | Example | Notes |
361
- |---|---|---|
362
- | `title.search` | `title.search:machine+learning` | Full-text on title |
363
- | `abstract.search` | `abstract.search:attention` | Full-text on abstract |
364
- | `publication_year` | `publication_year:2023` | Exact year |
365
- | `from_publication_date` | `from_publication_date:2023-01-01` | Date range start |
366
- | `to_publication_date` | `to_publication_date:2023-12-31` | Date range end |
367
- | `cited_by_count` | `cited_by_count:>500` | Range with `>`, `<`, `-` |
368
- | `open_access.is_oa` | `open_access.is_oa:true` | OA filter |
369
- | `author.id` | `author.id:A5108093963` | By author OpenAlex ID |
370
- | `institutions.id` | `institutions.id:I63966007` | By institution ID |
371
- | `concepts.id` | `concepts.id:C119857082` | By concept ID |
372
- | `primary_location.source.id` | `primary_location.source.id:S137773608` | By journal/source |
373
- | `type` | `type:journal-article` | Work type |
374
- | `language` | `language:en` | ISO 639-1 language code |
375
- | `cites` | `cites:W2626778328` | Works citing this paper |
376
- | `doi` | `doi:10.1038/nature14539` | By DOI (no `https://doi.org/` prefix) |
377
-
378
- ## URL and parameter reference
379
-
380
- ### API base
381
-
382
- ```
383
- https://api.openalex.org/{entity_type}
384
- ```
385
-
386
- Entity types: `works`, `authors`, `institutions`, `sources`, `concepts`, `topics`, `funders`, `publishers`
387
-
388
- ### Query parameters
389
-
390
- | Parameter | Example | Notes |
391
- |---|---|---|
392
- | `search` | `search=deep+learning` | Full-text relevance search across entity |
393
- | `filter` | `filter=publication_year:2023` | Structured filters (see above) |
394
- | `sort` | `sort=cited_by_count:desc` | Sort field + direction; use `relevance_score:desc` with `search` |
395
- | `per-page` | `per-page=200` | Max 200 per page |
396
- | `page` | `page=2` | Page number; fails (HTTP 400) if `per-page * page > 10000` |
397
- | `cursor` | `cursor=*` | Cursor for bulk pagination; `*` = first page |
398
- | `select` | `select=id,doi,display_name` | Return only these fields (reduces payload) |
399
- | `group_by` | `group_by=publication_year` | Aggregate counts by field (returns `group_by` array) |
400
- | `mailto` | `mailto=you@example.com` | **Always include** — enables polite pool |
401
-
402
- ### Entity ID prefix convention
403
-
404
- OpenAlex IDs use a letter prefix on the numeric ID:
405
-
406
- | Prefix | Entity | Example |
407
- |---|---|---|
408
- | `W` | Work (paper) | `W2626778328` |
409
- | `A` | Author | `A5108093963` |
410
- | `I` | Institution | `I63966007` |
411
- | `S` | Source (journal) | `S137773608` |
412
- | `C` | Concept | `C119857082` |
413
- | `T` | Topic | `T11948` |
414
- | `F` | Funder | `F4320306076` |
415
- | `P` | Publisher | `P4310319965` |
416
-
417
- Full entity URLs: `https://openalex.org/{ID}` (canonical form returned in `id` field).
418
- Bare ID is always `entity_url.split("/")[-1]`.
419
-
420
- ### Sort fields
421
-
422
- Works: `cited_by_count`, `publication_date`, `relevance_score` (only with `search`), `fwci`
423
- Authors: `cited_by_count`, `works_count`
424
- All: append `:desc` or `:asc`
425
-
426
- ## Rate limits
427
-
428
- | Pool | Rate | Daily cap |
429
- |---|---|---|
430
- | Polite pool (with `mailto=`) | 10 req/s | 100,000 req/day |
431
- | Common pool (no `mailto`) | 100 req/s | 100,000 req/day |
432
-
433
- - No API key required — the polite pool is opt-in via `mailto=`.
434
- - Response includes `meta.cost_usd` (typically $0.001 per call).
435
- - No `Retry-After` header when throttled — just add a short sleep on 429.
436
- - For bulk harvesting >1,000 results, use cursor pagination + respect the polite pool.
437
-
438
- ## Gotchas
439
-
440
- - **Never use the browser for OpenAlex.** The API returns complete structured JSON for all entity types. No HTML scraping needed.
441
-
442
- - **`mailto=` goes in every call, not just once.** It is a query parameter, not a header. There is no session. Omitting it puts you in the common pool (higher contention, less predictable).
443
-
444
- - **OpenAlex IDs in the `id` field are full URLs, not bare IDs.** The field returns `https://openalex.org/W2626778328`. Always `.split("/")[-1]` to get the bare `W2626778328` form needed for `filter=cites:`, `filter=author.id:`, etc.
445
-
446
- - **DOI lookup uses the full DOI URL as the path parameter.** Correct: `GET /works/https://doi.org/10.1038/nature14539`. Incorrect: `GET /works/10.1038/nature14539` (returns 404).
447
-
448
- - **Page-based pagination hard stops at 10,000 results.** `per-page=200&page=51` returns HTTP 400. Use `cursor=*` pagination for harvesting more than 10K results — it has no such limit.
449
-
450
- - **The `next_cursor` value must be URL-encoded before reuse.** It contains `+`, `=`, and `/` characters, so always pass it through `urllib.parse.quote(cursor, safe="")` before interpolating it into the URL.
451
-
452
- - **`group_by` and `page` are incompatible — use `group_by` without `page`/`cursor`.** Group-by returns a `group_by` list, not `results`. The `per-page` param sets max groups returned (default 200).
453
-
454
- - **`abstract_inverted_index` may be `null` for some papers.** Publisher agreements prevent OpenAlex from providing abstracts for many closed-access works. Always check `if aii:` before reconstructing.
455
-
456
- - **`select` significantly reduces response size and latency.** A full work object has 50+ fields; specifying `select=id,doi,display_name,cited_by_count` cuts payload by ~90%. Always use `select=` in bulk harvests.
457
-
458
- - **`sort=relevance_score:desc` only works with `search=`.** Using it without a `search` param returns results in undefined order. Use `cited_by_count:desc` or `publication_date:desc` for filter-only queries.
459
-
460
- - **The `concepts` field is deprecated in favor of `topics`.** Concepts (Wikidata-linked, levels 0 to 5) are still populated and useful, but OpenAlex now recommends `topics` (4-level hierarchy: domain > field > subfield > topic) going forward.
461
-
462
- - **`open_access.oa_url` can be `null` even when `is_oa=true`.** Check `best_oa_location.pdf_url` instead — it is more reliably populated when an OA PDF exists.
463
-
464
- - **Negation filter syntax is `field:!value`, not `field!=value`.** Example: `filter=publication_year:!2020` excludes 2020.
465
-
466
- - **Author disambiguation is imperfect.** The same person may appear as multiple author entities. Use ORCID (`ids.orcid`) when available to cross-reference. The `display_name_alternatives` field lists name variants.
467
-
468
- - **The `grants_count` field on funder objects comes back as `null`** (`None` after `json.loads()`) despite the docs mentioning it. Use `works_count` and `cited_by_count` instead for funder-level metrics.
469
-
470
- - **`per-page` with `cursor=*` ignores `page=`.** When cursor pagination is active, `page` is set to `null` in `meta`. Do not combine cursor + page.
1
+ # OpenAlex — Scraping & Data Extraction
2
+
3
+ `https://api.openalex.org`: an open academic knowledge graph covering 260M+ works, 90M+ authors, and 110K+ institutions. **Never use the browser for OpenAlex.** The entire API is JSON over HTTPS, free to use, and requires no API key. Add `mailto=your@email.com` to every request to join the polite pool, which is the more reliable of the two pools (per-pool limits are listed under Rate limits).
4
+
5
+ ## Do this first
6
+
7
+ **Use `http_get` with the REST JSON API — one call, JSON response, no auth, no parsing library.**
8
+
9
+ ```python
10
+ from helpers import http_get
11
+ import json
12
+
13
+ data = json.loads(http_get(
14
+ "https://api.openalex.org/works?search=transformer+attention&per-page=5&mailto=you@example.com"
15
+ ))
16
+ works = data["results"]
17
+ total = data["meta"]["count"]
18
+ ```
19
+
20
+ Always include `mailto=` to stay in the polite pool. Always parse with `json.loads()`.
21
+
22
+ ## Common workflows
23
+
24
+ ### Search papers (works)
25
+
26
+ ```python
27
+ from helpers import http_get
28
+ import json
29
+
30
+ data = json.loads(http_get(
31
+ "https://api.openalex.org/works"
32
+ "?search=transformer+attention"
33
+ "&per-page=5"
34
+ "&sort=cited_by_count:desc"
35
+ "&select=id,doi,display_name,publication_year,cited_by_count,open_access,primary_location"
36
+ "&mailto=you@example.com"
37
+ ))
38
+ print("total matching:", data["meta"]["count"])
39
+ for w in data["results"]:
40
+     oa = w["open_access"]
41
+     loc = w["primary_location"] or {}
42
+     src = loc.get("source") or {}
43
+     print(w["id"].split("/")[-1], w["publication_year"], w["cited_by_count"], w["display_name"][:60])
44
+     print(" doi:", w["doi"])
45
+     print(" open access:", oa["is_oa"], "| pdf:", oa["oa_url"])
46
+     print(" journal:", src.get("display_name"))
47
+ # Confirmed output (2026-04-18):
48
+ # W3151130473 2021 1887 CrossViT: Cross-Attention Multi-Scale Vision Transformer for I
49
+ # doi: https://doi.org/10.1109/iccv48922.2021.00041
50
+ # open access: False | pdf: None
51
+ # journal: 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
52
+ ```
53
+
54
+ ### Fetch single paper by OpenAlex ID or DOI
55
+
56
+ ```python
57
+ from helpers import http_get
58
+ import json
59
+
60
+ # By OpenAlex ID (bare or full URL form both work)
61
+ w = json.loads(http_get("https://api.openalex.org/works/W2626778328?mailto=you@example.com"))
62
+ print(w["display_name"], w["cited_by_count"])
63
+ # Confirmed: Attention Is All You Need 6526
64
+
65
+ # By DOI (pass the full DOI URL as the entity ID)
66
+ w = json.loads(http_get(
67
+ "https://api.openalex.org/works/https://doi.org/10.1038/nature14539?mailto=you@example.com"
68
+ ))
69
+ print(w["display_name"], w["cited_by_count"])
70
+ # Confirmed: Deep learning 79790
71
+ ```
72
+
73
+ ### Reconstruct abstract from inverted index
74
+
75
+ OpenAlex does not return abstracts as plain strings — they come as an inverted index (`{word: [position, ...], ...}`) due to publisher agreements. Reconstruct as follows:
76
+
77
+ ```python
78
+ from helpers import http_get
79
+ import json
80
+
81
+ w = json.loads(http_get(
82
+ "https://api.openalex.org/works/W2626778328"
83
+ "?select=id,display_name,abstract_inverted_index"
84
+ "&mailto=you@example.com"
85
+ ))
86
+ aii = w.get("abstract_inverted_index") or {}
87
+ words_pos = [(pos, word) for word, positions in aii.items() for pos in positions]
88
+ abstract = " ".join(word for _, word in sorted(words_pos))
89
+ print(abstract[:200])
90
+ # Confirmed: The dominant sequence transduction models are based on complex recurrent
91
+ # or convolutional neural networks in an encoder-decoder configuration...
92
+ ```
93
+
94
+ ### Author lookup
95
+
96
+ ```python
97
+ from helpers import http_get
98
+ import json
99
+
100
+ # Search by name
101
+ data = json.loads(http_get(
102
+ "https://api.openalex.org/authors?search=geoffrey+hinton&per-page=3&mailto=you@example.com"
103
+ ))
104
+ for a in data["results"]:
105
+     bare_id = a["id"].split("/")[-1] # e.g. A5108093963
106
+     print(bare_id, a["display_name"], a["works_count"], "works |", a["cited_by_count"], "cites")
107
+     affils = a.get("affiliations", [])
108
+     if affils:
109
+         print(" latest affil:", affils[0]["institution"]["display_name"])
110
+ # Confirmed:
111
+ # A5108093963 Geoffrey E. Hinton 384 works | 446018 cites
112
+ # latest affil: University of New Brunswick
113
+
114
+ # Fetch by bare ID
115
+ a = json.loads(http_get("https://api.openalex.org/authors/A5108093963?mailto=you@example.com"))
116
+ print(a["display_name"], a["works_count"])
117
+ # Confirmed: Geoffrey E. Hinton 384
118
+
119
+ # Get all works by this author (sorted by citations)
120
+ works_data = json.loads(http_get(
121
+ "https://api.openalex.org/works"
122
+ "?filter=author.id:A5108093963"
123
+ "&per-page=5&sort=cited_by_count:desc"
124
+ "&select=id,display_name,cited_by_count,publication_year"
125
+ "&mailto=you@example.com"
126
+ ))
127
+ for w in works_data["results"]:
128
+     print(w["publication_year"], w["display_name"][:55], w["cited_by_count"])
129
+ # Confirmed:
130
+ # 2015 Deep learning 79790
131
+ # 2017 ImageNet classification with deep convolutional neural netwo 75670
132
+ # 2008 Visualizing Data using t-SNE 35710
133
+ ```
134
+
135
+ ### Institution lookup
136
+
137
+ ```python
138
+ from helpers import http_get
139
+ import json
140
+
141
+ data = json.loads(http_get(
142
+ "https://api.openalex.org/institutions?search=MIT&per-page=3&mailto=you@example.com"
143
+ ))
144
+ for inst in data["results"]:
145
+     bare_id = inst["id"].split("/")[-1] # e.g. I63966007
146
+     print(bare_id, inst["display_name"], inst["country_code"], inst["works_count"], "works")
147
+ # Confirmed: I63966007 Massachusetts Institute of Technology US 340302 works
148
+
149
+ # Works from an institution
150
+ works = json.loads(http_get(
151
+ "https://api.openalex.org/works"
152
+ "?filter=institutions.id:I63966007"
153
+ "&per-page=3&sort=cited_by_count:desc"
154
+ "&select=id,display_name,cited_by_count,publication_year"
155
+ "&mailto=you@example.com"
156
+ ))
157
+ print("total MIT works:", works["meta"]["count"])
158
+ # Confirmed: 323992
159
+ ```
160
+
161
+ ### Concept/Topic lookup
162
+
163
+ Concepts (legacy, level-based hierarchy) and Topics (newer, 4-level hierarchy) are both available.
164
+
165
+ ```python
166
+ from helpers import http_get
167
+ import json
168
+
169
+ # Concepts endpoint (Wikidata-linked)
170
+ data = json.loads(http_get(
171
+ "https://api.openalex.org/concepts?search=machine+learning&per-page=5&mailto=you@example.com"
172
+ ))
173
+ for c in data["results"]:
174
+     bare_id = c["id"].split("/")[-1] # e.g. C119857082
175
+     print(bare_id, c["display_name"], "level:", c["level"], "works:", c["works_count"])
176
+ # Confirmed: C119857082 Machine learning level: 1 works: 4960536
177
+
178
+ # Topics endpoint (newer: domain > field > subfield > topic)
179
+ data2 = json.loads(http_get(
180
+ "https://api.openalex.org/topics?search=machine+learning&per-page=3&mailto=you@example.com"
181
+ ))
182
+ for t in data2["results"]:
183
+     print(t["id"].split("/")[-1], t["display_name"])
184
+     # `or {}` guards against a field that is present but null
+     print(" ", (t.get("domain") or {}).get("display_name"), ">",
185
+           (t.get("field") or {}).get("display_name"), ">",
186
+           (t.get("subfield") or {}).get("display_name"))
187
+ # Confirmed: T11948 Machine Learning in Materials Science
188
+ # Physical Sciences > Materials Science > Materials Chemistry
189
+ ```
190
+
191
+ ### Source (journal/venue) lookup
192
+
193
+ ```python
194
+ from helpers import http_get
195
+ import json
196
+
197
+ data = json.loads(http_get(
198
+ "https://api.openalex.org/sources?search=nature&per-page=3&mailto=you@example.com"
199
+ ))
200
+ for s in data["results"]:
201
+     bare_id = s["id"].split("/")[-1] # e.g. S137773608
202
+     print(bare_id, s["display_name"], s["type"], "issn:", s["issn"], "oa:", s["is_oa"])
203
+ # Confirmed: S137773608 Nature journal issn: ['0028-0836', '1476-4687'] oa: False
204
+
205
+ # Works in a source
206
+ works = json.loads(http_get(
207
+ "https://api.openalex.org/works?filter=primary_location.source.id:S137773608"
208
+ "&per-page=3&sort=cited_by_count:desc"
209
+ "&select=id,display_name,cited_by_count"
210
+ "&mailto=you@example.com"
211
+ ))
212
+ print("Nature works:", works["meta"]["count"])
213
+ ```
214
+
215
+ ### Funder lookup
216
+
217
+ ```python
218
+ from helpers import http_get
219
+ import json
220
+
221
+ data = json.loads(http_get(
222
+ "https://api.openalex.org/funders?search=national+science+foundation&per-page=3&mailto=you@example.com"
223
+ ))
224
+ for f in data["results"]:
225
+     bare_id = f["id"].split("/")[-1] # e.g. F4320306076
226
+     print(bare_id, f["display_name"], f["country_code"], f["works_count"], "works")
227
+ # Confirmed: F4320306076 National Science Foundation US
228
+ ```
229
+
230
+ ### Citation traversal
231
+
232
+ ```python
233
+ from helpers import http_get
234
+ import json
235
+
236
+ paper_id = "W2626778328" # Attention Is All You Need
237
+
238
+ # Papers that CITE this paper (forward citations)
239
+ citing = json.loads(http_get(
240
+ f"https://api.openalex.org/works?filter=cites:{paper_id}"
241
+ "&per-page=5&sort=cited_by_count:desc"
242
+ "&select=id,display_name,publication_year,cited_by_count"
243
+ "&mailto=you@example.com"
244
+ ))
245
+ print("papers citing Attention:", citing["meta"]["count"])
246
+ for w in citing["results"]:
247
+     print(f" {w['publication_year']} {w['display_name'][:55]} ({w['cited_by_count']} cites)")
248
+ # Confirmed: 6536 papers cite it; top: AlphaFold2 (43435), ViT (21409)
249
+
250
+ # Papers THIS paper cites (backward — list of IDs in the work object)
251
+ paper = json.loads(http_get(
252
+ f"https://api.openalex.org/works/{paper_id}?select=referenced_works&mailto=you@example.com"
253
+ ))
254
+ refs = paper.get("referenced_works", [])
255
+ ref_ids = [r.split("/")[-1] for r in refs] # bare IDs like W1632114991
256
+ print(f"references {len(ref_ids)} works:", ref_ids[:3])
257
+ # Confirmed: references 28 works
258
+ ```
259
+
260
+ ### Cursor pagination (bulk harvest)
261
+
262
+ Use cursor pagination (not page-based) for more than 10,000 results. Page-based fails with HTTP 400 beyond page 50 at per-page=200.
263
+
264
+ ```python
265
+ from helpers import http_get
266
+ import json, urllib.parse
267
+
268
+ def harvest_works(query_filter, max_results=1000, mailto="you@example.com"):
269
+     """Yield work dicts using cursor pagination."""
270
+     cursor = "*"
271
+     collected = 0
272
+     while collected < max_results:
273
+         per_page = min(200, max_results - collected)
274
+         encoded_cursor = urllib.parse.quote(cursor, safe="")
275
+         url = (
276
+             f"https://api.openalex.org/works"
277
+             f"?filter={query_filter}"
278
+             f"&per-page={per_page}"
279
+             f"&cursor={encoded_cursor}"
280
+             f"&select=id,display_name,publication_year,cited_by_count"
281
+             f"&mailto={mailto}"
282
+         )
283
+         data = json.loads(http_get(url))
284
+         results = data.get("results", [])
285
+         if not results:
286
+             break
287
+         for w in results:
288
+             yield w
289
+         collected += len(results)
290
+         next_cursor = data["meta"].get("next_cursor")
291
+         if not next_cursor:
292
+             break
293
+         cursor = next_cursor
294
+
295
+ for w in harvest_works("concepts.id:C119857082,publication_year:2023", max_results=400):
296
+     print(w["id"].split("/")[-1], w["display_name"][:55])
297
+ ```
298
+
299
+ ### Group-by analytics
300
+
301
+ ```python
302
+ from helpers import http_get
303
+ import json
304
+
305
+ # Publication counts by year for machine learning papers
306
+ data = json.loads(http_get(
307
+ "https://api.openalex.org/works"
308
+ "?filter=concepts.id:C119857082" # C119857082 = Machine learning concept
309
+ "&group_by=publication_year"
310
+ "&mailto=you@example.com"
311
+ ))
312
+ print("groups_count:", data["meta"]["groups_count"])
313
+ for g in data.get("group_by", [])[:5]:
314
+     print(f" {g['key']}: {g['count']:,} works")
315
+ # Confirmed (2026-04-18):
316
+ # 2026: 5,678,538 works
317
+ # 2025: 5,332,194 works
318
+ # 2020: 3,966,880 works
319
+
320
+ # Other useful group_by fields: open_access.oa_status, type, institutions.country_code,
321
+ # authorships.institutions.country_code, primary_location.source.id
322
+ ```
323
+
324
+ ## Filter syntax reference
325
+
326
+ Filters go in the `filter=` param as comma-separated `field:value` pairs. All conditions are AND-ed.
327
+
328
+ ```
329
+ # Exact match
330
+ filter=publication_year:2023
331
+
332
+ # Full-text search on a field
333
+ filter=title.search:deep+learning
334
+
335
+ # Combine multiple (AND)
336
+ filter=title.search:CRISPR,publication_year:2022,open_access.is_oa:true
337
+
338
+ # OR within one field (pipe operator)
339
+ filter=publication_year:2022|2023
340
+
341
+ # Negation
342
+ filter=publication_year:!2020
343
+
344
+ # Range
345
+ filter=cited_by_count:>1000
346
+ filter=publication_year:<2010
347
+ filter=cited_by_count:100-500
348
+
349
+ # Nested field access
350
+ filter=author.id:A5108093963
351
+ filter=institutions.id:I63966007
352
+ filter=concepts.id:C119857082
353
+ filter=primary_location.source.id:S137773608
354
+ filter=open_access.is_oa:true
355
+ filter=cites:W2626778328 # papers citing this work
356
+ ```
357
+
358
+ Commonly useful filter fields for works:
359
+
360
+ | Filter field | Example | Notes |
361
+ |---|---|---|
362
+ | `title.search` | `title.search:machine+learning` | Full-text on title |
363
+ | `abstract.search` | `abstract.search:attention` | Full-text on abstract |
364
+ | `publication_year` | `publication_year:2023` | Exact year |
365
+ | `from_publication_date` | `from_publication_date:2023-01-01` | Date range start |
366
+ | `to_publication_date` | `to_publication_date:2023-12-31` | Date range end |
367
+ | `cited_by_count` | `cited_by_count:>500` | Range with `>`, `<`, `-` |
368
+ | `open_access.is_oa` | `open_access.is_oa:true` | OA filter |
369
+ | `author.id` | `author.id:A5108093963` | By author OpenAlex ID |
370
+ | `institutions.id` | `institutions.id:I63966007` | By institution ID |
371
+ | `concepts.id` | `concepts.id:C119857082` | By concept ID |
372
+ | `primary_location.source.id` | `primary_location.source.id:S137773608` | By journal/source |
373
+ | `type` | `type:journal-article` | Work type |
374
+ | `language` | `language:en` | ISO 639-1 language code |
375
+ | `cites` | `cites:W2626778328` | Works citing this paper |
376
+ | `doi` | `doi:10.1038/nature14539` | By DOI (no `https://doi.org/` prefix) |
377
+
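The filter rules above (comma for AND, pipe for OR, `!` for negation) can be assembled programmatically. The sketch below is one possible helper, not part of OpenAlex or of `helpers`; the name `build_filter` and its `exclude` argument are assumptions made for illustration, and values are not URL-encoded here.

```python
def build_filter(conditions, exclude=None):
    """Join field/value pairs into a filter string (hypothetical helper).

    A list or tuple value becomes an OR group joined with "|";
    entries in `exclude` are negated with a leading "!".
    """
    parts = []
    for field, value in conditions.items():
        if isinstance(value, (list, tuple)):
            value = "|".join(str(v) for v in value)
        elif isinstance(value, bool):
            # The filter examples above use lowercase true/false
            value = "true" if value else "false"
        parts.append(f"{field}:{value}")
    for field, value in (exclude or {}).items():
        parts.append(f"{field}:!{value}")
    # All comma-separated conditions are AND-ed
    return ",".join(parts)


combined = build_filter(
    {"title.search": "CRISPR", "publication_year": [2022, 2023], "open_access.is_oa": True}
)
negated = build_filter({"author.id": "A5108093963"}, exclude={"publication_year": 2020})
print(combined)
print(negated)
```

The result is meant to be placed after `filter=` in a request URL; range values such as `>500` or `100-500` pass through unchanged as strings.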
378
+ ## URL and parameter reference
379
+
380
+ ### API base
381
+
382
+ ```
383
+ https://api.openalex.org/{entity_type}
384
+ ```
385
+
386
+ Entity types: `works`, `authors`, `institutions`, `sources`, `concepts`, `topics`, `funders`, `publishers`
387
+
388
+ ### Query parameters
389
+
390
+ | Parameter | Example | Notes |
391
+ |---|---|---|
392
+ | `search` | `search=deep+learning` | Full-text relevance search across entity |
393
+ | `filter` | `filter=publication_year:2023` | Structured filters (see above) |
394
+ | `sort` | `sort=cited_by_count:desc` | Sort field + direction; use `relevance_score:desc` with `search` |
395
+ | `per-page` | `per-page=200` | Max 200 per page |
396
+ | `page` | `page=2` | Page number; fails (HTTP 400) if `per-page * page > 10000` |
397
+ | `cursor` | `cursor=*` | Cursor for bulk pagination; `*` = first page |
398
+ | `select` | `select=id,doi,display_name` | Return only these fields (reduces payload) |
399
+ | `group_by` | `group_by=publication_year` | Aggregate counts by field (returns `group_by` array) |
400
+ | `mailto` | `mailto=you@example.com` | **Always include** — enables polite pool |
401
+
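As a rough illustration of how the parameters in the table combine, the sketch below builds a request URL with the standard library and rejects two combinations the table warns about. `build_url` is a hypothetical helper, not an OpenAlex or `helpers` function; it percent-encodes every value, which a standard query-string parser decodes back to the original text.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_BASE = "https://api.openalex.org"
MAX_PAGED_RESULTS = 10_000  # page-based limit from the table above


def build_url(entity_type, params, mailto="you@example.com"):
    """Return a request URL for `entity_type` (hypothetical helper)."""
    if "page" in params and "cursor" in params:
        raise ValueError("use either page or cursor, not both")
    if "page" in params and "per-page" in params:
        if int(params["page"]) * int(params["per-page"]) > MAX_PAGED_RESULTS:
            raise ValueError("per-page * page exceeds 10,000; use cursor pagination")
    # mailto is appended to every request
    query = dict(params, mailto=mailto)
    return f"{API_BASE}/{entity_type}?{urlencode(query)}"


url = build_url("works", {"filter": "publication_year:2023", "per-page": 200, "cursor": "*"})
print(url)
# Decoding the query string recovers the original values
print(parse_qs(urlsplit(url).query)["cursor"])
```

Checking the page limit before sending the request avoids spending a call on a guaranteed HTTP 400.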
402
+ ### Entity ID prefix convention
403
+
404
+ OpenAlex IDs use a letter prefix on the numeric ID:
405
+
406
+ | Prefix | Entity | Example |
407
+ |---|---|---|
408
+ | `W` | Work (paper) | `W2626778328` |
409
+ | `A` | Author | `A5108093963` |
410
+ | `I` | Institution | `I63966007` |
411
+ | `S` | Source (journal) | `S137773608` |
412
+ | `C` | Concept | `C119857082` |
413
+ | `T` | Topic | `T11948` |
414
+ | `F` | Funder | `F4320306076` |
415
+ | `P` | Publisher | `P4310319965` |
416
+
417
+ Full entity URLs: `https://openalex.org/{ID}` (canonical form returned in `id` field).
418
+ Bare ID is always `entity_url.split("/")[-1]`.
419
+
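The prefix table lends itself to a small lookup. The following is a minimal sketch, assuming IDs always consist of one prefix letter followed by digits as in the examples above; `parse_openalex_id` is a hypothetical name.

```python
ENTITY_BY_PREFIX = {
    "W": "work",
    "A": "author",
    "I": "institution",
    "S": "source",
    "C": "concept",
    "T": "topic",
    "F": "funder",
    "P": "publisher",
}


def parse_openalex_id(value):
    """Return (bare_id, entity_name) for a bare ID or a full entity URL."""
    bare = value.rstrip("/").split("/")[-1]
    prefix, digits = bare[:1], bare[1:]
    if prefix not in ENTITY_BY_PREFIX or not digits.isdigit():
        raise ValueError(f"not an OpenAlex ID: {value!r}")
    return bare, ENTITY_BY_PREFIX[prefix]


print(parse_openalex_id("https://openalex.org/W2626778328"))
print(parse_openalex_id("T11948"))
```

Validating the prefix early catches mistakes such as passing an author ID to `filter=cites:`.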
420
+ ### Sort fields
421
+
422
+ Works: `cited_by_count`, `publication_date`, `relevance_score` (only with `search`), `fwci`
423
+ Authors: `cited_by_count`, `works_count`
424
+ All: append `:desc` or `:asc`
425
+
426
+ ## Rate limits
427
+
428
+ | Pool | Rate | Daily cap |
429
+ |---|---|---|
430
+ | Polite pool (with `mailto=`) | 10 req/s | 100,000 req/day |
431
+ | Common pool (no `mailto`) | 100 req/s | 100,000 req/day |
432
+
433
+ - No API key required — the polite pool is opt-in via `mailto=`.
434
+ - Response includes `meta.cost_usd` (typically $0.001 per call).
435
+ - No `Retry-After` header when throttled — just add a short sleep on 429.
436
+ - For bulk harvesting >1,000 results, use cursor pagination + respect the polite pool.
437
+
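Since no `Retry-After` header is sent, the caller has to choose its own wait on a 429. The sketch below uses exponential backoff, which is one reasonable choice rather than anything OpenAlex prescribes. How `http_get` reports a 429 is not specified here, so the `Throttled` exception and the `request` callable are assumptions made for illustration.

```python
import time


class Throttled(Exception):
    """Hypothetical signal that the server answered HTTP 429."""


def call_with_backoff(request, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `request()` and retry with doubling delays while it raises Throttled."""
    for attempt in range(max_attempts):
        try:
            return request()
        except Throttled:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)


# Demo with a fake request that is throttled twice, then succeeds.
# Delays are recorded instead of slept so the demo runs instantly.
delays = []
calls = {"n": 0}


def fake_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise Throttled()
    return "ok"


result = call_with_backoff(fake_request, sleep=delays.append)
print(result, delays)
```

In real use, `request` would wrap the HTTP call and raise `Throttled` when the response status is 429.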
438
+ ## Gotchas
439
+
440
+ - **Never use the browser for OpenAlex.** The API returns complete structured JSON for all entity types. No HTML scraping needed.
441
+
442
+ - **`mailto=` goes in every call, not just once.** It is a query parameter, not a header. There is no session. Omitting it puts you in the common pool (higher contention, less predictable).
443
+
444
+ - **OpenAlex IDs in the `id` field are full URLs, not bare IDs.** The field returns `https://openalex.org/W2626778328`. Always `.split("/")[-1]` to get the bare `W2626778328` form needed for `filter=cites:`, `filter=author.id:`, etc.
445
+
446
+ - **DOI lookup uses the full DOI URL as the path parameter.** Correct: `GET /works/https://doi.org/10.1038/nature14539`. Incorrect: `GET /works/10.1038/nature14539` (returns 404).
447
+
448
+ - **Page-based pagination hard stops at 10,000 results.** `per-page=200&page=51` returns HTTP 400. Use `cursor=*` pagination for harvesting more than 10K results — it has no such limit.
449
+
450
+ - **The `next_cursor` value must be URL-encoded before reuse.** It contains `+`, `=`, and `/` characters, so always pass it through `urllib.parse.quote(cursor, safe="")` before interpolating it into the URL.
451
+
452
+ - **`group_by` and `page` are incompatible — use `group_by` without `page`/`cursor`.** Group-by returns a `group_by` list, not `results`. The `per-page` param sets max groups returned (default 200).
453
+
454
+ - **`abstract_inverted_index` may be `null` for some papers.** Publisher agreements prevent OpenAlex from providing abstracts for many closed-access works. Always check `if aii:` before reconstructing.
455
+
456
+ - **`select` significantly reduces response size and latency.** A full work object has 50+ fields; specifying `select=id,doi,display_name,cited_by_count` cuts payload by ~90%. Always use `select=` in bulk harvests.
457
+
458
+ - **`sort=relevance_score:desc` only works with `search=`.** Using it without a `search` param returns results in undefined order. Use `cited_by_count:desc` or `publication_date:desc` for filter-only queries.
459
+
460
+ - **The `concepts` field is deprecated in favor of `topics`.** Concepts (Wikidata-linked, levels 0 to 5) are still populated and useful, but OpenAlex now recommends `topics` (4-level hierarchy: domain > field > subfield > topic) going forward.
461
+
462
+ - **`open_access.oa_url` can be `null` even when `is_oa=true`.** Check `best_oa_location.pdf_url` instead — it is more reliably populated when an OA PDF exists.
463
+
464
+ - **Negation filter syntax is `field:!value`, not `field!=value`.** Example: `filter=publication_year:!2020` excludes 2020.
465
+
466
+ - **Author disambiguation is imperfect.** The same person may appear as multiple author entities. Use ORCID (`ids.orcid`) when available to cross-reference. The `display_name_alternatives` field lists name variants.
467
+
468
+ - **The `grants_count` field on funder objects comes back as `null`** (`None` after `json.loads()`) despite the docs mentioning it. Use `works_count` and `cited_by_count` instead for funder-level metrics.
469
+
470
+ - **`per-page` with `cursor=*` ignores `page=`.** When cursor pagination is active, `page` is set to `null` in `meta`. Do not combine cursor + page.
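Two of the gotchas above concern fields that may be `null`: `abstract_inverted_index` and `open_access.oa_url`. A minimal sketch of null-safe accessors follows; the function names are hypothetical and the `sample` dict is made-up data shaped like a work object, not a real API response.

```python
def reconstruct_abstract(inverted_index):
    """Rebuild abstract text from {word: [positions]}; None if unavailable."""
    if not inverted_index:
        return None
    positioned = [
        (pos, word)
        for word, positions in inverted_index.items()
        for pos in positions
    ]
    return " ".join(word for _, word in sorted(positioned))


def best_pdf_url(work):
    """Prefer best_oa_location.pdf_url, falling back to open_access.oa_url."""
    best = work.get("best_oa_location") or {}
    oa = work.get("open_access") or {}
    return best.get("pdf_url") or oa.get("oa_url")


# Made-up sample for illustration only
sample = {
    "abstract_inverted_index": {"models": [1], "Attention": [0], "work": [2]},
    "open_access": {"is_oa": True, "oa_url": None},
    "best_oa_location": {"pdf_url": "https://example.org/paper.pdf"},
}
print(reconstruct_abstract(sample["abstract_inverted_index"]))
print(best_pdf_url(sample))
print(reconstruct_abstract(None))
```

Returning `None` instead of an empty string keeps "no abstract available" distinguishable from an abstract that is genuinely empty.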