@pencil-agent/nano-pencil 2.0.0-beta.8 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (241)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/extensions-host/index.d.ts +1 -1
  7. package/dist/core/extensions-host/loader.js +1 -1
  8. package/dist/core/extensions-host/runner.d.ts +1 -0
  9. package/dist/core/extensions-host/runner.js +2 -2
  10. package/dist/core/extensions-host/types.d.ts +17 -22
  11. package/dist/core/lib/ai/src/types.d.ts +12 -2
  12. package/dist/core/persona/persona-manager.js +5 -2
  13. package/dist/core/runtime/agent-session.js +3 -3
  14. package/dist/core/runtime/extension-core-bindings.d.ts +1 -0
  15. package/dist/core/runtime/extension-core-bindings.js +2 -2
  16. package/dist/extensions/builtin/AGENT.md +115 -115
  17. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  18. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  19. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  20. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  91. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  92. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  93. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  94. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  95. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  96. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  97. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  98. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  99. package/dist/extensions/builtin/browser/browser.md +73 -73
  100. package/dist/extensions/builtin/browser/install.md +142 -142
  101. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  102. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  103. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  104. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  105. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  107. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  108. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  109. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  110. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  111. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  112. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  113. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  114. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  115. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  116. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  117. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  118. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  119. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  120. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  121. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  122. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  123. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  124. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  125. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  126. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  127. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  128. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  129. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  130. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  131. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  132. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  133. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  134. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  135. package/dist/extensions/builtin/goal/README.md +67 -67
  136. package/dist/extensions/builtin/goal/goal-controller.d.ts +39 -10
  137. package/dist/extensions/builtin/goal/goal-controller.js +1 -1
  138. package/dist/extensions/builtin/goal/goal-format.js +1 -1
  139. package/dist/extensions/builtin/goal/goal-prompts.d.ts +2 -0
  140. package/dist/extensions/builtin/goal/goal-prompts.js +5 -4
  141. package/dist/extensions/builtin/goal/goal-store.js +1 -1
  142. package/dist/extensions/builtin/goal/index.d.ts +1 -1
  143. package/dist/extensions/builtin/goal/index.js +10 -7
  144. package/dist/extensions/builtin/grub/README.md +112 -112
  145. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  146. package/dist/extensions/builtin/link-world/index.js +6 -6
  147. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  148. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  149. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  150. package/dist/extensions/builtin/link-world/{network-routing.md → network-routing/network-routing.md} +67 -67
  151. package/dist/extensions/builtin/loop/README.md +92 -92
  152. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  153. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  154. package/dist/extensions/builtin/plan/index.js +1 -1
  155. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  156. package/dist/extensions/builtin/sal/README.md +72 -72
  157. package/dist/extensions/builtin/security-audit/README.md +289 -289
  158. package/dist/extensions/builtin/task/task-store.d.ts +4 -0
  159. package/dist/extensions/builtin/task/task-store.js +1 -1
  160. package/dist/extensions/builtin/team/AGENT.md +112 -112
  161. package/dist/extensions/builtin/team/TESTING.md +299 -299
  162. package/dist/extensions/builtin/token-save/README.md +56 -56
  163. package/dist/extensions/optional/AGENT.md +10 -10
  164. package/dist/index.d.ts +5 -30
  165. package/dist/index.js +1 -1
  166. package/dist/models.d.ts +7 -0
  167. package/dist/models.js +1 -0
  168. package/dist/modes/interactive/components/footer.js +1 -1
  169. package/dist/modes/interactive/components/task-status-panel.d.ts +36 -0
  170. package/dist/modes/interactive/components/task-status-panel.js +1 -0
  171. package/dist/modes/interactive/controllers/stream-render-controller.d.ts +7 -0
  172. package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
  173. package/dist/modes/interactive/interactive-mode.js +40 -40
  174. package/dist/modes/interactive/state/interactive-state.d.ts +2 -0
  175. package/dist/modes/interactive/state/interactive-state.js +1 -1
  176. package/dist/modes/interactive/theme/dark.json +85 -85
  177. package/dist/modes/interactive/theme/light.json +84 -84
  178. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  179. package/dist/modes/interactive/theme/warm.json +81 -81
  180. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  181. package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
  182. package/dist/node_modules/@pencil-agent/ai/dist/providers/anthropic.js +2 -2
  183. package/dist/node_modules/@pencil-agent/ai/dist/providers/openai-completions.js +5 -5
  184. package/dist/node_modules/@pencil-agent/ai/dist/providers/openai-responses.js +1 -1
  185. package/dist/node_modules/@pencil-agent/ai/dist/stream.js +1 -1
  186. package/dist/packages/protocol/src/commands.d.ts +33 -0
  187. package/dist/packages/protocol/src/flags.d.ts +20 -0
  188. package/dist/packages/protocol/src/hooks.d.ts +17 -0
  189. package/dist/packages/protocol/src/hooks.js +0 -0
  190. package/dist/packages/{extension-sdk → protocol}/src/index.d.ts +7 -4
  191. package/dist/packages/protocol/src/index.js +1 -0
  192. package/dist/packages/{extension-sdk → protocol}/src/lifecycle.d.ts +15 -27
  193. package/dist/packages/protocol/src/lifecycle.js +0 -0
  194. package/dist/packages/{extension-sdk → protocol}/src/tools.d.ts +1 -1
  195. package/dist/packages/protocol/src/tools.js +0 -0
  196. package/dist/public-config.d.ts +12 -0
  197. package/dist/public-config.js +1 -0
  198. package/dist/runtime.d.ts +9 -0
  199. package/dist/runtime.js +1 -0
  200. package/dist/session-compaction.d.ts +7 -0
  201. package/dist/session-compaction.js +1 -0
  202. package/dist/session.d.ts +7 -0
  203. package/dist/session.js +1 -0
  204. package/dist/skills.d.ts +7 -0
  205. package/dist/skills.js +1 -0
  206. package/dist/tools.d.ts +7 -0
  207. package/dist/tools.js +1 -0
  208. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
  209. package/docs/SDK-TESTING.md +364 -0
  210. package/docs/codex-goal-command-impl.md +1055 -1055
  211. package/docs/codex-goal-vs-grub.md +500 -500
  212. package/docs/custom-provider.md +27 -27
  213. package/docs/extensions.md +27 -27
  214. package/docs/keybindings.md +27 -27
  215. package/docs/loop 重构完成总结.md +250 -250
  216. package/docs/loop 重构完成报告.md +122 -122
  217. package/docs/loop 重构方案.md +1222 -1222
  218. package/docs/loop 重构方案实现报告.md +158 -158
  219. package/docs/loop 重构方案对比分析.md +128 -128
  220. package/docs/loop 重构计划.md +320 -320
  221. package/docs/loop-usage-examples.md +214 -214
  222. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
  223. package/docs/models.md +27 -27
  224. package/docs/packages.md +27 -27
  225. package/docs/pi-design-philosophy.md +457 -457
  226. package/docs/planmode.md +1987 -1987
  227. package/docs/prompt-templates.md +27 -27
  228. package/docs/providers.md +27 -27
  229. package/docs/sdk.md +27 -27
  230. package/docs/skills.md +27 -27
  231. package/docs/startup-performance-optimization.md +301 -0
  232. package/docs/themes.md +27 -27
  233. package/docs/tui.md +27 -27
  234. package/docs/认知地图.md +47 -0
  235. package/package.json +190 -162
  236. package/dist/packages/extension-sdk/src/index.js +0 -1
  237. package/docs/cc-agent-design.md +0 -1297
  238. package/docs/cc-tui-design.md +0 -1333
  239. package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
  240. /package/dist/packages/{extension-sdk/src/lifecycle.js → protocol/src/commands.js} +0 -0
  241. /package/dist/packages/{extension-sdk/src/tools.js → protocol/src/flags.js} +0 -0
@@ -1,414 +1,414 @@
1
- # Medium — Data Extraction
2
-
3
- `https://medium.com` — blogging platform. Three access paths tested and validated: the undocumented `?format=json` endpoint (fastest for article + publication data), the undocumented GraphQL API (best for targeted metric lookups), and RSS feeds (best for recent posts lists without auth). No browser needed for any read-only task.
4
-
5
- ## Do this first: pick your access path
6
-
7
- | Goal | Best approach | Latency |
8
- |------|--------------|---------|
9
- | Article metadata + full body | `?format=json` on article URL | ~400ms |
10
- | Article metrics only (claps, visibility) | GraphQL `post(id:)` | ~275ms |
11
- | Author profile + follower count | GraphQL `user(username:)` | ~220ms |
12
- | Recent posts for a user (up to 10) | `?format=json` on profile URL | ~240ms |
13
- | Recent posts for a publication | `?format=json` on publication URL | ~300ms |
14
- | Latest 10 posts as a feed (no pagination) | RSS feed | ~260ms |
15
- | Full article body as HTML | RSS `content:encoded` field | ~260ms |
16
- | Publication subscriber count | `?format=json` on publication URL | ~300ms |
17
-
18
- **Never use a browser for read-only Medium tasks.** All article content, metadata, and metrics are available over HTTP. Browser is only needed for authenticated actions (clapping, posting, account management).
19
-
20
- ---
21
-
22
- ## The XSSI prefix
23
-
24
- Every `?format=json` response starts with the anti-hijacking prefix `])}while(1);</x>` before the JSON. **Strip it before parsing.** The helper below handles this.
25
-
- ```python
- import urllib.request, gzip, json, re
-
- def medium_json(url):
-     """Fetch any Medium URL with ?format=json and return parsed dict.
-     Strips the XSSI prefix ])}while(1);</x> automatically.
-     Works on: article URLs, user profile URLs, publication URLs.
-     Does NOT work on: search pages, /latest, profile stream API.
-     """
-     sep = '&' if '?' in url else '?'
-     req = urllib.request.Request(
-         url + sep + 'format=json',
-         headers={
-             "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
-             "Accept": "application/json, */*",
-             "Accept-Encoding": "gzip",
-         }
-     )
-     with urllib.request.urlopen(req, timeout=20) as r:
-         raw = r.read()
-         if r.headers.get("Content-Encoding") == "gzip":
-             raw = gzip.decompress(raw)
-     text = raw.decode()
-     # Strip everything before the first {
-     return json.loads(re.sub(r'^[^\{]+', '', text))
- ```
52
-
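
The prefix handling can be checked offline. The sketch below is a hypothetical variant, not part of the original helper: `strip_xssi` and the sample body are illustrative names and data. It removes the documented `])}while(1);</x>` prefix explicitly before parsing, rather than discarding everything up to the first `{`.

```python
import json

# Anti-hijacking prefix documented above for ?format=json responses.
XSSI_PREFIX = "])}while(1);</x>"

def strip_xssi(text: str) -> dict:
    """Remove the XSSI prefix if present, then parse the remaining JSON."""
    if text.startswith(XSSI_PREFIX):
        text = text[len(XSSI_PREFIX):]
    return json.loads(text)

# Hypothetical response body, shaped like the payload described in this file.
sample = '])}while(1);</x>{"success": true, "payload": {"value": {"title": "Example"}}}'
parsed = strip_xssi(sample)
print(parsed["payload"]["value"]["title"])
```

Matching the exact prefix fails loudly (with a JSON decode error) if the response is not what was expected, whereas the regex approach would silently skip any leading text.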
53
- ---
54
-
55
- ## Path 1: `?format=json` — article metadata + body (fastest for articles)
56
-
57
- Append `?format=json` to any article URL. It returns full metadata, virtuals (metrics), and the complete article body in a structured `bodyModel`. No auth is required for public or subscriber-locked articles: the metadata and full body are always returned, even where a browser would show a truncated body for a paywalled article.
58
-
59
- ```python
60
- data = medium_json("https://medium.com/@karpathy/software-2-0-a64152b37c35")
61
- payload = data['payload']
62
- val = payload['value'] # article fields
63
- refs = payload['references'] # User, Social, SocialStats dicts keyed by ID
64
-
65
- # --- Article fields ---
66
- title = val['title'] # "Software 2.0"
67
- article_id = val['id'] # "a64152b37c35"
68
- creator_id = val['creatorId'] # "ac9d9a35533e"
69
- slug = val['uniqueSlug'] # "software-2-0-a64152b37c35"
70
- url = val['canonicalUrl'] # "https://medium.com/@karpathy/..."
71
- first_pub = val['firstPublishedAt'] # unix ms: 1510438733751
72
- last_pub = val['latestPublishedAt'] # unix ms: 1615659523264
73
- visibility = val['visibility'] # 0=public, 2=subscriber-locked
74
- is_locked = val['isSubscriptionLocked'] # True if paywalled
75
- locked_src = val['lockedPostSource'] # 0=free, 1=Medium Partner Program
76
-
77
- # --- Metrics (in val['virtuals']) ---
78
- virtuals = val['virtuals']
79
- clap_count = virtuals['totalClapCount'] # 60865 (all claps, including multi-clap)
80
- recommends = virtuals['recommends'] # 8846 (unique clappers)
81
- read_time = virtuals['readingTime'] # 8.79811320754717 (minutes, float)
82
- word_count = virtuals['wordCount'] # 2146
83
-
84
- # --- Tags ---
85
- tags = [t['slug'] for t in virtuals['tags']]
86
- # ['machine-learning', 'artificial-intelligence', 'programming', 'software-development', 'future']
87
-
88
- # --- Author (from references) ---
89
- user = refs['User'][creator_id]
90
- author_name = user['name'] # "Andrej Karpathy"
91
- author_handle = user['username'] # "karpathy"
92
- author_bio = user['bio'] # "I like to train deep neural nets on large datasets."
93
- author_twitter = user['twitterScreenName'] # "karpathy"
94
-
95
- # --- Follower count (from SocialStats) ---
96
- ss = refs['SocialStats'][creator_id]
97
- follower_count = ss['usersFollowedByCount'] # 60027
98
- following_count = ss['usersFollowedCount'] # 183
99
- ```
100
-
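
The `firstPublishedAt` and `latestPublishedAt` values above are Unix timestamps in milliseconds. A minimal sketch of converting them to timezone-aware datetimes follows; `from_unix_ms` is a hypothetical helper name, and integer arithmetic is used so no float rounding is involved.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def from_unix_ms(ms: int) -> datetime:
    """Convert a Unix timestamp in milliseconds to an aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)

# firstPublishedAt value quoted in the example above.
first_pub = from_unix_ms(1510438733751)
print(first_pub.isoformat())
```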
101
- ### Detect paywall
102
-
103
- ```python
104
- # Paywalled (Medium Partner Program): isSubscriptionLocked=True, visibility=2, lockedPostSource=1
105
- # Free: isSubscriptionLocked=False, visibility=0, lockedPostSource=0
106
- is_paywalled = val['isSubscriptionLocked'] # True / False
107
- ```
108
-
109
- Confirmed on real TDS articles: paywalled articles return `isSubscriptionLocked=True`, `visibility=2`, `lockedPostSource=1`. Free articles: all three are `False`/`0`.
110
-
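
The three fields can be cross-checked rather than trusting `isSubscriptionLocked` alone. The sketch below assumes only the two combinations documented above; `paywall_status` and the sample dicts are hypothetical, and any other combination is reported as `"unknown"` instead of being guessed.

```python
def paywall_status(val: dict) -> str:
    """Classify an article `value` dict using the three paywall fields."""
    locked = bool(val.get("isSubscriptionLocked"))
    visibility = val.get("visibility")
    source = val.get("lockedPostSource")
    if locked and visibility == 2 and source == 1:
        return "paywalled"
    if not locked and visibility == 0 and source == 0:
        return "free"
    # Combination not documented above; do not guess.
    return "unknown"

# Hypothetical inputs mirroring the two documented combinations.
paywalled = {"isSubscriptionLocked": True, "visibility": 2, "lockedPostSource": 1}
free = {"isSubscriptionLocked": False, "visibility": 0, "lockedPostSource": 0}
print(paywall_status(paywalled), paywall_status(free))
```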
111
- ### Article body
112
-
113
- The full body is in `val['content']['bodyModel']['paragraphs']` — a list of dicts:
114
-
115
- ```python
116
- paragraphs = val['content']['bodyModel']['paragraphs']
117
-
118
- # Paragraph types (confirmed for this article):
119
- # type=1 -> body text (P)
120
- # type=3 -> heading (H1/H2)
121
- # type=4 -> image (text is empty; metadata has image ID)
122
-
123
- # Reconstruct plain text:
124
- text_paras = [p['text'] for p in paragraphs if p.get('text')]
125
- full_text = '\n\n'.join(text_paras)
126
- ```
127
-
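
The paragraph types listed above can also drive a slightly richer reconstruction. This is a sketch under the assumption that only types 1, 3 and 4 matter; `render_paragraphs` and the sample list are hypothetical, and rendering headings with a `## ` marker is an illustrative choice, not something Medium returns.

```python
# Paragraph type codes as listed above.
PARAGRAPH_BODY, PARAGRAPH_HEADING, PARAGRAPH_IMAGE = 1, 3, 4

def render_paragraphs(paragraphs: list) -> str:
    """Join bodyModel paragraphs into text, marking headings and skipping images."""
    lines = []
    for p in paragraphs:
        kind, text = p.get("type"), p.get("text", "")
        if kind == PARAGRAPH_IMAGE or not text:
            continue  # images carry no text
        lines.append(f"## {text}" if kind == PARAGRAPH_HEADING else text)
    return "\n\n".join(lines)

# Hypothetical paragraphs shaped like val['content']['bodyModel']['paragraphs'].
sample = [
    {"type": 3, "text": "Software 2.0"},
    {"type": 1, "text": "First paragraph."},
    {"type": 4, "text": ""},
    {"type": 1, "text": "Second paragraph."},
]
rendered = render_paragraphs(sample)
print(rendered)
```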
128
- ---
129
-
130
- ## Path 2: GraphQL API — targeted metric lookups
131
-
132
- `POST https://medium.com/_/graphql` with a JSON body. No auth, no CSRF token required.
133
- Returns HTTP 200 with JSON even for unauthenticated queries. Invalid fields return HTTP 400 — do not assume a field exists without testing first.
134
-
- ```python
- import json, urllib.request, gzip
-
- def gql(query):
-     body = json.dumps({"query": query}).encode()
-     req = urllib.request.Request(
-         "https://medium.com/_/graphql",
-         data=body,
-         headers={
-             "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
-             "Content-Type": "application/json",
-             "Accept": "application/json",
-             "Accept-Encoding": "gzip",
-         },
-         method="POST",
-     )
-     with urllib.request.urlopen(req, timeout=20) as r:
-         raw = r.read()
-         if r.headers.get("Content-Encoding") == "gzip":
-             raw = gzip.decompress(raw)
-     return json.loads(raw.decode())
- ```
157
-
158
- ### Fetch article metrics (fastest)
159
-
160
- ```python
161
- result = gql("""
162
- {
163
- post(id: "a64152b37c35") {
164
- title
165
- id
166
- firstPublishedAt
167
- latestPublishedAt
168
- visibility
169
- uniqueSlug
170
- canonicalUrl
171
- mediumUrl
172
- isLocked
173
- clapCount
174
- readingTime
175
- wordCount
176
- }
177
- }
178
- """)
179
- post = result['data']['post']
180
- # post['visibility'] -> "PUBLIC" | "LOCKED" (string, not numeric)
181
- # post['isLocked'] -> False | True
182
- # post['clapCount'] -> 60865 (same as totalClapCount in format=json)
183
- # post['readingTime'] -> 8.79811320754717 (minutes)
184
- # post['wordCount'] -> 2146
185
- ```
186
-
187
- **Confirmed working `post()` fields:** `title`, `id`, `createdAt`, `updatedAt`, `firstPublishedAt`, `latestPublishedAt`, `visibility`, `uniqueSlug`, `canonicalUrl`, `mediumUrl`, `isLocked`, `clapCount`, `readingTime`, `wordCount`
188
-
189
- **Nested objects that work:** `topics { name slug }`, `creator { name username }`, `collection { name id slug description domain creator { name username } }`
190
-
191
- **Fields that return HTTP 400 (not available):** `tags`, `author`, `recommends`, `content`, `publication`, `responses`, `sequence`
192
-
193
- ### Fetch author profile
194
-
195
- ```python
196
- result = gql("""
197
- {
198
- user(username: "karpathy") {
199
- name
200
- username
201
- id
202
- bio
203
- imageId
204
- twitterScreenName
205
- mediumMemberAt
206
- socialStats {
207
- followerCount
208
- followingCount
209
- }
210
- }
211
- }
212
- """)
213
- user = result['data']['user']
214
- # user['name'] -> "Andrej Karpathy"
215
- # user['id'] -> "ac9d9a35533e"
216
- # user['bio'] -> "I like to train deep neural nets on large datasets."
217
- # user['twitterScreenName'] -> "karpathy"
218
- # user['socialStats']['followerCount'] -> 60028
219
- # user['mediumMemberAt'] -> 0 (not a member); nonzero = unix ms join date
220
- ```
221
-
222
- **Confirmed working `user()` fields:** `name`, `username`, `id`, `bio`, `imageId`, `twitterScreenName`, `mediumMemberAt`, `socialStats { followerCount followingCount }`
223
-
224
- **Fields that return HTTP 400:** `followerCount` (top-level), `followingCount` (top-level), `postCount`
225
-
226
- ### Fetch collection (publication) by ID
227
-
228
- The GraphQL `collection()` query only accepts `id`, not `slug`. Get the ID from `?format=json` on the publication page.
229
-
230
- ```python
231
- # TDS Archive id: 7f60cf5620c9 (from medium.com/towards-data-science?format=json)
232
- result = gql("""
233
- {
234
- collection(id: "7f60cf5620c9") {
235
- name
236
- id
237
- slug
238
- description
239
- domain
240
- creator { name username }
241
- }
242
- }
243
- """)
244
- coll = result['data']['collection']
245
- # coll['name'] -> "TDS Archive"
246
- # coll['slug'] -> "data-science"
247
- ```
248
-
249
- ---
250
-
251
- ## Path 3: RSS feeds (best for recent posts list + article bodies)
252
-
253
- Works with plain `http_get`. Returns up to 10 most recent posts. Full article HTML is in `content:encoded`. No clap count or visibility info in RSS.
254
-
- ```python
- import re
- from helpers import http_get
-
- def parse_rss_items(rss_xml):
-     """Extract items from Medium RSS feed. Returns list of dicts."""
-     def cdata(tag, text):
-         m = re.search(rf'<{tag}[^>]*><!\[CDATA\[(.*?)\]\]></{tag}>', text, re.DOTALL)
-         return m.group(1).strip() if m else None
-
-     items = []
-     for raw in re.findall(r'<item>(.*?)</item>', rss_xml, re.DOTALL):
-         # link is plain text (not CDATA)
-         link_m = re.search(r'<link>(.*?)</link>', raw, re.DOTALL)
-         items.append({
-             'title': cdata('title', raw),
-             'link': link_m.group(1).strip() if link_m else None,
-             'pubDate': cdata('pubDate', raw),
-             'creator': cdata('dc:creator', raw),
-             'tags': re.findall(r'<category><!\[CDATA\[(.*?)\]\]></category>', raw),
-             'body_html': cdata('content:encoded', raw),  # full article HTML
-         })
-     return items
-
- # User feed (up to 10 latest posts)
- rss = http_get("https://medium.com/feed/@karpathy")
- posts = parse_rss_items(rss)
- # posts[0]['title'] -> "Software 2.0"
- # posts[0]['pubDate'] -> "Sat, 11 Nov 2017 22:18:53 GMT"
- # posts[0]['creator'] -> "Andrej Karpathy"
- # posts[0]['tags'] -> ['programming', 'software-development', 'artificial-intelligence', 'future', 'machine-learning']
- # posts[0]['link'] -> "https://karpathy.medium.com/software-2-0-a64152b37c35?source=rss-..."
- # posts[0]['body_html'] -> full article body as HTML string (~15KB for this article)
-
- # Publication feed (up to 10 latest posts)
- rss_pub = http_get("https://medium.com/feed/towards-data-science")
- pub_posts = parse_rss_items(rss_pub)
- ```
293
-
294
- **RSS limitations:**
295
- - RSS does not include clap count, view count, or paywall status.
296
- - `body_html` contains the full article body as HTML, including `<p>`, `<strong>`, `<a>`, `<img>` tags.
297
- - Pagination is not supported — RSS always returns the 10 most recent posts.
298
-
299
- ---
300
-
301
- ## Path 4: `?format=json` on user profile — recent posts with metrics
302
-
303
- Better than RSS when you need clap counts alongside post list. Returns up to `limit` posts (default 10) plus full author metadata.
304
-
305
- ```python
306
- data = medium_json("https://medium.com/@karpathy?limit=10")
307
- payload = data['payload']
308
-
309
- user = payload['user']
310
- # user['name'] -> "Andrej Karpathy"
311
- # user['username'] -> "karpathy"
312
- # user['bio'] -> "I like to train deep neural nets on large datasets."
313
-
314
- refs = payload['references']
315
- ss = refs['SocialStats'][user['userId']]
316
- # ss['usersFollowedByCount'] -> 60028 (followers)
317
- # ss['usersFollowedCount'] -> 183 (following)
318
-
319
- posts = refs.get('Post', {}) # dict keyed by post ID
- for pid, p in posts.items():
-     v = p['virtuals']
-     print(p['title'], v['totalClapCount'], round(v['readingTime'], 1))
323
-
324
- # Paginate: use paging['next'] from payload
325
- paging = payload['paging']
326
- next_params = paging['next']
327
- # next_params = {'limit': 10, 'to': '1495652975362', 'source': 'overview', 'page': 2, 'ignoredIds': []}
328
- # Append as query params to the same profile URL to get next page
329
- next_url = (
330
- f"https://medium.com/@{user['username']}"
331
- f"?limit={next_params['limit']}&to={next_params['to']}"
332
- f"&source={next_params['source']}&page={next_params['page']}"
333
- )
334
- data2 = medium_json(next_url)
335
- # Note: karpathy has only 8 total posts — pagination returns same refs on page 2
336
- ```
337
-
338
- ---
339
-
340
- ## Path 5: `?format=json` on publication page
341
-
342
- Returns publication metadata and recent posts with metrics.
343
-
344
- ```python
345
- data = medium_json("https://medium.com/towards-data-science")
346
- payload = data['payload']
347
-
348
- coll = payload['collection']
349
- # coll['name'] -> "TDS Archive"
350
- # coll['slug'] -> "data-science"
351
- # coll['description'] -> full description string
352
- # coll['subscriberCount'] -> 828527
353
- # coll['metadata']['followerCount'] -> 828527
354
- # coll['tags'] -> ['DATA SCIENCE', 'MACHINE LEARNING', ...]
355
-
356
- posts = payload['references'].get('Post', {})
- for pid, p in posts.items():
-     v = p['virtuals']
-     print(p['title'], v['totalClapCount'], p['isSubscriptionLocked'])
-     # Also includes: p['visibility'] (0=free, 2=paywalled)
361
-
362
- # Paginate (same pattern as user profile)
363
- paging = payload['paging']
364
- # paging['next'] = {'to': '1738573325936', 'page': 3}
365
- ```
366
-
367
- ---
368
-
369
- ## Retrieving the article ID from a URL
370
-
371
- The `id` is the last 12 hex chars of a Medium article URL slug:
372
-
373
- ```python
374
- import re
375
-
376
- url = "https://medium.com/@karpathy/software-2-0-a64152b37c35"
377
- article_id = re.search(r'-([a-f0-9]{12})$', url.rstrip('/').split('?')[0])
- if article_id:
-     article_id = article_id.group(1)  # "a64152b37c35"
380
- ```
381
-
382
- This ID is the same across all URL forms (`medium.com/@user/slug`, `user.medium.com/slug`, `medium.com/publication/slug`).
383
-
384
- ---
385
-
386
- ## Gotchas
387
-
388
- - **HTTP 403 on plain `http_get`** — The default `http_get` helper sends `User-Agent: Mozilla/5.0` which Medium accepts for most endpoints, but article HTML pages (without `?format=json`) return 403. Always use `?format=json` for article and profile pages.
389
-
390
- - **`?format=json` works; profile stream API does not** — `https://medium.com/_/api/users/{id}/profile/stream` returns HTTP 403 for unauthenticated requests. Use `?format=json` on the profile URL instead.
391
-
392
- - **`?format=json` on search pages returns 403 or broken JSON** — `medium.com/search?q=...&format=json` and `medium.com/search/posts?q=...&format=json` both fail. Search is not available without auth.
393
-
394
- - **GraphQL `collection()` requires ID, not slug** — `collection(slug: "towards-data-science")` returns HTTP 400. You must use the 12-character hex ID (e.g. `"7f60cf5620c9"`). Get it from `?format=json` on the publication page: `payload['collection']['id']`.
395
-
396
- - **GraphQL `tags` field on `post()` returns HTTP 400** — Use `topics { name slug }` instead. Topics are a subset of tags but work without auth.
397
-
398
- - **GraphQL visibility is a string, not a number** — `post().visibility` returns `"PUBLIC"` or `"LOCKED"` (string). The `?format=json` `value.visibility` field uses integers: `0`=public, `2`=locked. Both agree on the lock status.
399
-
400
- - **`totalClapCount` vs `recommends`** — `totalClapCount` (60865) counts all claps (Medium allows up to 50 claps per reader). `recommends` (8846) counts unique clappers. The GraphQL `clapCount` field equals `totalClapCount`, not `recommends`.
401
-
402
- - **RSS returns at most 10 items, no clap counts** — RSS is best for getting recent article links + full HTML body. Use `?format=json` profile if you need metrics.
403
-
404
- - **RSS link contains tracking params** — `posts[0]['link']` includes `?source=rss-{userId}------2`. Strip with `.split('?')[0]` if you need a clean URL.
405
-
406
- - **`content:encoded` in RSS is full HTML, not plaintext** — Strip HTML tags if you want plaintext: `re.sub(r'<[^>]+>', '', body_html)`.
407
-
408
- - **Medium subdomains** — Some users have custom subdomains (`karpathy.medium.com`). Both `medium.com/@karpathy/...` and `karpathy.medium.com/...` resolve to the same article; `?format=json` works on both.
409
-
410
- - **towardsdatascience.com is no longer Medium** — TDS moved to its own WordPress site. `towardsdatascience.com/article-slug?format=json` returns full WordPress HTML, not Medium JSON. Use `medium.com/towards-data-science` for the archived Medium publication.
411
-
412
- - **No public search API** — Medium has no Algolia equivalent. Finding articles by keyword requires either a browser, or fetching a user/publication feed and filtering locally.
413
-
414
- - **Timestamps are unix milliseconds** — `firstPublishedAt`, `createdAt`, `latestPublishedAt` are all in milliseconds. Convert: `datetime.fromtimestamp(val['firstPublishedAt'] / 1000, tz=timezone.utc)`.
1
+ # Medium — Data Extraction
2
+
3
+ `https://medium.com` — blogging platform. Three access paths tested and validated: the undocumented `?format=json` endpoint (fastest for article + publication data), the undocumented GraphQL API (best for targeted metric lookups), and RSS feeds (best for recent posts lists without auth). No browser needed for any read-only task.
4
+
5
+ ## Do this first: pick your access path
6
+
7
+ | Goal | Best approach | Latency |
8
+ |------|--------------|---------|
9
+ | Article metadata + full body | `?format=json` on article URL | ~400ms |
10
+ | Article metrics only (claps, visibility) | GraphQL `post(id:)` | ~275ms |
11
+ | Author profile + follower count | GraphQL `user(username:)` | ~220ms |
12
+ | Recent posts for a user (up to 10) | `?format=json` on profile URL | ~240ms |
13
+ | Recent posts for a publication | `?format=json` on publication URL | ~300ms |
14
+ | Recent posts list via feed (up to 10, no pagination) | RSS feed | ~260ms |
15
+ | Full article body as HTML | RSS `content:encoded` field | ~260ms |
16
+ | Publication subscriber count | `?format=json` on publication URL | ~300ms |
17
+
18
+ **Never use a browser for read-only Medium tasks.** All article content, metadata, and metrics are available over HTTP. Browser is only needed for authenticated actions (clapping, posting, account management).
19
+
20
+ ---
21
+
22
+ ## The XSSI prefix
23
+
24
+ Every `?format=json` response starts with the anti-hijacking prefix `])}while(1);</x>` before the JSON. **Strip it before parsing.** The helper below handles this.
25
+
26
+ ```python
+ import urllib.request, gzip, json, re
+
+ def medium_json(url):
+     """Fetch any Medium URL with ?format=json and return parsed dict.
+     Strips the XSSI prefix ])}while(1);</x> automatically.
+     Works on: article URLs, user profile URLs, publication URLs.
+     Does NOT work on: search pages, /latest, profile stream API.
+     """
+     sep = '&' if '?' in url else '?'
+     req = urllib.request.Request(
+         url + sep + 'format=json',
+         headers={
+             "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
+             "Accept": "application/json, */*",
+             "Accept-Encoding": "gzip",
+         }
+     )
+     with urllib.request.urlopen(req, timeout=20) as r:
+         raw = r.read()
+         if r.headers.get("Content-Encoding") == "gzip":
+             raw = gzip.decompress(raw)
+     text = raw.decode()
+     # Strip everything before the first {
+     return json.loads(re.sub(r'^[^\{]+', '', text))
51
+ ```
52
+
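The prefix-stripping step can be exercised on its own, without touching the network. The sketch below is a minimal illustration, not part of the original helper: `strip_xssi` and the sample response string are hypothetical, and the fallback to the first `{` mirrors the regex used in `medium_json`.

```python
import json

# Prefix quoted in the section above.
XSSI_PREFIX = "])}while(1);</x>"

def strip_xssi(text):
    """Return the JSON portion of a ?format=json response body (hypothetical helper)."""
    if text.startswith(XSSI_PREFIX):
        return text[len(XSSI_PREFIX):]
    # Fall back to the first "{" in case the prefix differs from the one above.
    brace = text.find("{")
    if brace == -1:
        raise ValueError("no JSON object found in response")
    return text[brace:]

# Made-up response body that only imitates the payload shape described here.
sample = '])}while(1);</x>{"payload": {"value": {"title": "Software 2.0"}}}'
parsed = json.loads(strip_xssi(sample))
print(parsed["payload"]["value"]["title"])  # Software 2.0
```

Checking for the exact prefix first keeps the happy path explicit, while the fallback avoids a hard failure if the prefix ever changes.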
53
+ ---
54
+
55
+ ## Path 1: `?format=json` — article metadata + body (fastest for articles)
56
+
57
+ Append `?format=json` to any article URL. It returns full metadata, virtuals (metrics), and the complete article body in a structured `bodyModel`. No auth is required for public or subscriber-locked articles: the metadata and full body are always returned, even though a browser would truncate the body of a paywalled article.
58
+
59
+ ```python
60
+ data = medium_json("https://medium.com/@karpathy/software-2-0-a64152b37c35")
61
+ payload = data['payload']
62
+ val = payload['value'] # article fields
63
+ refs = payload['references'] # User, Social, SocialStats dicts keyed by ID
64
+
65
+ # --- Article fields ---
66
+ title = val['title'] # "Software 2.0"
67
+ article_id = val['id'] # "a64152b37c35"
68
+ creator_id = val['creatorId'] # "ac9d9a35533e"
69
+ slug = val['uniqueSlug'] # "software-2-0-a64152b37c35"
70
+ url = val['canonicalUrl'] # "https://medium.com/@karpathy/..."
71
+ first_pub = val['firstPublishedAt'] # unix ms: 1510438733751
72
+ last_pub = val['latestPublishedAt'] # unix ms: 1615659523264
73
+ visibility = val['visibility'] # 0=public, 2=subscriber-locked
74
+ is_locked = val['isSubscriptionLocked'] # True if paywalled
75
+ locked_src = val['lockedPostSource'] # 0=free, 1=Medium Partner Program
76
+
77
+ # --- Metrics (in val['virtuals']) ---
78
+ virtuals = val['virtuals']
79
+ clap_count = virtuals['totalClapCount'] # 60865 (all claps, including multi-clap)
80
+ recommends = virtuals['recommends'] # 8846 (unique clappers)
81
+ read_time = virtuals['readingTime'] # 8.79811320754717 (minutes, float)
82
+ word_count = virtuals['wordCount'] # 2146
83
+
84
+ # --- Tags ---
85
+ tags = [t['slug'] for t in virtuals['tags']]
86
+ # ['machine-learning', 'artificial-intelligence', 'programming', 'software-development', 'future']
87
+
88
+ # --- Author (from references) ---
89
+ user = refs['User'][creator_id]
90
+ author_name = user['name'] # "Andrej Karpathy"
91
+ author_handle = user['username'] # "karpathy"
92
+ author_bio = user['bio'] # "I like to train deep neural nets on large datasets."
93
+ author_twitter = user['twitterScreenName'] # "karpathy"
94
+
95
+ # --- Follower count (from SocialStats) ---
96
+ ss = refs['SocialStats'][creator_id]
97
+ follower_count = ss['usersFollowedByCount'] # 60027
98
+ following_count = ss['usersFollowedCount'] # 183
99
+ ```
100
+
101
+ ### Detect paywall
102
+
103
+ ```python
104
+ # Paywalled (Medium Partner Program): isSubscriptionLocked=True, visibility=2, lockedPostSource=1
105
+ # Free: isSubscriptionLocked=False, visibility=0, lockedPostSource=0
106
+ is_paywalled = val['isSubscriptionLocked'] # True / False
107
+ ```
108
+
109
+ Confirmed on real TDS articles: paywalled articles return `isSubscriptionLocked=True`, `visibility=2`, `lockedPostSource=1`. Free articles: all three are `False`/`0`.
110
+
111
+ ### Article body
112
+
113
+ The full body is in `val['content']['bodyModel']['paragraphs']` — a list of dicts:
114
+
115
+ ```python
116
+ paragraphs = val['content']['bodyModel']['paragraphs']
117
+
118
+ # Paragraph types (confirmed for this article):
119
+ # type=1 -> body text (P)
120
+ # type=3 -> heading (H1/H2)
121
+ # type=4 -> image (text is empty; metadata has image ID)
122
+
123
+ # Reconstruct plain text:
124
+ text_paras = [p['text'] for p in paragraphs if p.get('text')]
125
+ full_text = '\n\n'.join(text_paras)
126
+ ```
127
+
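The paragraph type codes can also drive a slightly richer rendering than plain text. This is a rough sketch under assumptions: only the three type codes listed above are handled, any other type is treated as body text, and both `paragraphs_to_markdown` and the sample paragraphs are hypothetical.

```python
def paragraphs_to_markdown(paragraphs):
    """Render bodyModel paragraphs as rough Markdown (hypothetical helper).

    Type codes follow the ones listed above: 1=text, 3=heading, 4=image.
    Treating every other type as plain text is an assumption.
    """
    out = []
    for p in paragraphs:
        text = (p.get("text") or "").strip()
        ptype = p.get("type")
        if ptype == 4:
            # Image paragraphs carry no text, so emit a placeholder.
            out.append("[image]")
        elif not text:
            continue
        elif ptype == 3:
            out.append("## " + text)
        else:
            out.append(text)
    return "\n\n".join(out)

# Made-up paragraphs that only imitate the structure described above.
sample = [
    {"type": 3, "text": "Software 2.0"},
    {"type": 1, "text": "Neural networks are not just another classifier."},
    {"type": 4, "text": ""},
]
print(paragraphs_to_markdown(sample))
```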
128
+ ---
129
+
130
+ ## Path 2: GraphQL API — targeted metric lookups
131
+
132
+ `POST https://medium.com/_/graphql` with a JSON body. No auth, no CSRF token required.
133
+ Returns HTTP 200 with JSON even for unauthenticated queries. Invalid fields return HTTP 400 — do not assume a field exists without testing first.
134
+
135
+ ```python
+ import json, urllib.request, gzip
+
+ def gql(query):
+     body = json.dumps({"query": query}).encode()
+     req = urllib.request.Request(
+         "https://medium.com/_/graphql",
+         data=body,
+         headers={
+             "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
+             "Content-Type": "application/json",
+             "Accept": "application/json",
+             "Accept-Encoding": "gzip",
+         },
+         method="POST",
+     )
+     with urllib.request.urlopen(req, timeout=20) as r:
+         raw = r.read()
+         if r.headers.get("Content-Encoding") == "gzip":
+             raw = gzip.decompress(raw)
+     return json.loads(raw.decode())
156
+ ```
157
+
158
+ ### Fetch article metrics (fastest)
159
+
160
+ ```python
161
+ result = gql("""
162
+ {
163
+ post(id: "a64152b37c35") {
164
+ title
165
+ id
166
+ firstPublishedAt
167
+ latestPublishedAt
168
+ visibility
169
+ uniqueSlug
170
+ canonicalUrl
171
+ mediumUrl
172
+ isLocked
173
+ clapCount
174
+ readingTime
175
+ wordCount
176
+ }
177
+ }
178
+ """)
179
+ post = result['data']['post']
180
+ # post['visibility'] -> "PUBLIC" | "LOCKED" (string, not numeric)
181
+ # post['isLocked'] -> False | True
182
+ # post['clapCount'] -> 60865 (same as totalClapCount in format=json)
183
+ # post['readingTime'] -> 8.79811320754717 (minutes)
184
+ # post['wordCount'] -> 2146
185
+ ```
186
+
187
+ **Confirmed working `post()` fields:** `title`, `id`, `createdAt`, `updatedAt`, `firstPublishedAt`, `latestPublishedAt`, `visibility`, `uniqueSlug`, `canonicalUrl`, `mediumUrl`, `isLocked`, `clapCount`, `readingTime`, `wordCount`
188
+
189
+ **Nested objects that work:** `topics { name slug }`, `creator { name username }`, `collection { name id slug description domain creator { name username } }`
190
+
191
+ **Fields that return HTTP 400 (not available):** `tags`, `author`, `recommends`, `content`, `publication`, `responses`, `sequence`
192
+
193
+ ### Fetch author profile
194
+
195
+ ```python
196
+ result = gql("""
197
+ {
198
+ user(username: "karpathy") {
199
+ name
200
+ username
201
+ id
202
+ bio
203
+ imageId
204
+ twitterScreenName
205
+ mediumMemberAt
206
+ socialStats {
207
+ followerCount
208
+ followingCount
209
+ }
210
+ }
211
+ }
212
+ """)
213
+ user = result['data']['user']
214
+ # user['name'] -> "Andrej Karpathy"
215
+ # user['id'] -> "ac9d9a35533e"
216
+ # user['bio'] -> "I like to train deep neural nets on large datasets."
217
+ # user['twitterScreenName'] -> "karpathy"
218
+ # user['socialStats']['followerCount'] -> 60028
219
+ # user['mediumMemberAt'] -> 0 (not a member); nonzero = unix ms join date
220
+ ```
221
+
222
+ **Confirmed working `user()` fields:** `name`, `username`, `id`, `bio`, `imageId`, `twitterScreenName`, `mediumMemberAt`, `socialStats { followerCount followingCount }`
223
+
224
+ **Fields that return HTTP 400:** `followerCount` (top-level), `followingCount` (top-level), `postCount`
225
+
226
+ ### Fetch collection (publication) by ID
227
+
228
+ The GraphQL `collection()` query only accepts `id`, not `slug`. Get the ID from `?format=json` on the publication page.
229
+
230
+ ```python
231
+ # TDS Archive id: 7f60cf5620c9 (from medium.com/towards-data-science?format=json)
232
+ result = gql("""
233
+ {
234
+ collection(id: "7f60cf5620c9") {
235
+ name
236
+ id
237
+ slug
238
+ description
239
+ domain
240
+ creator { name username }
241
+ }
242
+ }
243
+ """)
244
+ coll = result['data']['collection']
245
+ # coll['name'] -> "TDS Archive"
246
+ # coll['slug'] -> "data-science"
247
+ ```
248
+
249
+ ---
250
+
251
+ ## Path 3: RSS feeds (best for recent posts list + article bodies)
252
+
253
+ Works with plain `http_get`. Returns up to 10 most recent posts. Full article HTML is in `content:encoded`. No clap count or visibility info in RSS.
254
+
255
+ ```python
+ import re
+ from helpers import http_get
+
+ def parse_rss_items(rss_xml):
+     """Extract items from Medium RSS feed. Returns list of dicts."""
+     def cdata(tag, text):
+         m = re.search(rf'<{tag}[^>]*><!\[CDATA\[(.*?)\]\]></{tag}>', text, re.DOTALL)
+         return m.group(1).strip() if m else None
+
+     items = []
+     for raw in re.findall(r'<item>(.*?)</item>', rss_xml, re.DOTALL):
+         # link is plain text (not CDATA)
+         link_m = re.search(r'<link>(.*?)</link>', raw, re.DOTALL)
+         items.append({
+             'title': cdata('title', raw),
+             'link': link_m.group(1).strip() if link_m else None,
+             'pubDate': cdata('pubDate', raw),
+             'creator': cdata('dc:creator', raw),
+             'tags': re.findall(r'<category><!\[CDATA\[(.*?)\]\]></category>', raw),
+             'body_html': cdata('content:encoded', raw),  # full article HTML
+         })
+     return items
278
+
279
+ # User feed (up to 10 latest posts)
280
+ rss = http_get("https://medium.com/feed/@karpathy")
281
+ posts = parse_rss_items(rss)
282
+ # posts[0]['title'] -> "Software 2.0"
283
+ # posts[0]['pubDate'] -> "Sat, 11 Nov 2017 22:18:53 GMT"
284
+ # posts[0]['creator'] -> "Andrej Karpathy"
285
+ # posts[0]['tags'] -> ['programming', 'software-development', 'artificial-intelligence', 'future', 'machine-learning']
286
+ # posts[0]['link'] -> "https://karpathy.medium.com/software-2-0-a64152b37c35?source=rss-..."
287
+ # posts[0]['body_html'] -> full article body as HTML string (~15KB for this article)
288
+
289
+ # Publication feed (up to 10 latest posts)
290
+ rss_pub = http_get("https://medium.com/feed/towards-data-science")
291
+ pub_posts = parse_rss_items(rss_pub)
292
+ ```
293
+
294
+ **RSS limitations:**
295
+ - RSS does not include clap count, view count, or paywall status.
296
+ - `body_html` contains the full article body as HTML, including `<p>`, `<strong>`, `<a>`, `<img>` tags.
297
+ - Pagination is not supported — RSS always returns the 10 most recent posts.
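If the regex approach proves brittle, the same fields can be read with the standard library XML parser, which unwraps CDATA sections automatically. This is an alternative sketch, not the method used above: `parse_rss_items_xml` and the sample feed are hypothetical, and the namespace URIs are the standard Dublin Core and RSS content-module ones, assumed to match what the feed declares.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for dc:creator and content:encoded (assumed to match the feed).
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}

def parse_rss_items_xml(rss_xml):
    """Parse RSS items with ElementTree instead of regexes (hypothetical helper)."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "pubDate": item.findtext("pubDate"),
            "creator": item.findtext("dc:creator", namespaces=NS),
            "tags": [c.text for c in item.findall("category")],
            "body_html": item.findtext("content:encoded", namespaces=NS),
        })
    return items

# Made-up one-item feed that imitates the fields described above.
SAMPLE = """<rss version="2.0"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title><![CDATA[Software 2.0]]></title>
      <link>https://karpathy.medium.com/software-2-0-a64152b37c35?source=rss-ac9d9a35533e------2</link>
      <category><![CDATA[programming]]></category>
      <category><![CDATA[future]]></category>
      <dc:creator><![CDATA[Andrej Karpathy]]></dc:creator>
      <pubDate>Sat, 11 Nov 2017 22:18:53 GMT</pubDate>
      <content:encoded><![CDATA[<p>Hello</p>]]></content:encoded>
    </item>
  </channel>
</rss>"""

items = parse_rss_items_xml(SAMPLE)
print(items[0]["title"], items[0]["tags"])
```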
298
+
299
+ ---
300
+
301
+ ## Path 4: `?format=json` on user profile — recent posts with metrics
302
+
303
+ Better than RSS when you need clap counts alongside post list. Returns up to `limit` posts (default 10) plus full author metadata.
304
+
305
+ ```python
306
+ data = medium_json("https://medium.com/@karpathy?limit=10")
307
+ payload = data['payload']
308
+
309
+ user = payload['user']
310
+ # user['name'] -> "Andrej Karpathy"
311
+ # user['username'] -> "karpathy"
312
+ # user['bio'] -> "I like to train deep neural nets on large datasets."
313
+
314
+ refs = payload['references']
315
+ ss = refs['SocialStats'][user['userId']]
316
+ # ss['usersFollowedByCount'] -> 60028 (followers)
317
+ # ss['usersFollowedCount'] -> 183 (following)
318
+
319
+ posts = refs.get('Post', {}) # dict keyed by post ID
+ for pid, p in posts.items():
+     v = p['virtuals']
+     print(p['title'], v['totalClapCount'], round(v['readingTime'], 1))
323
+
324
+ # Paginate: use paging['next'] from payload
325
+ paging = payload['paging']
326
+ next_params = paging['next']
327
+ # next_params = {'limit': 10, 'to': '1495652975362', 'source': 'overview', 'page': 2, 'ignoredIds': []}
328
+ # Append as query params to the same profile URL to get next page
329
+ next_url = (
330
+ f"https://medium.com/@{user['username']}"
331
+ f"?limit={next_params['limit']}&to={next_params['to']}"
332
+ f"&source={next_params['source']}&page={next_params['page']}"
333
+ )
334
+ data2 = medium_json(next_url)
335
+ # Note: karpathy has only 8 total posts — pagination returns same refs on page 2
336
+ ```
337
+
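The pagination pattern above can be wrapped in a loop that stops once a page brings no new post IDs, which is the behaviour noted for short profiles. This is a sketch under assumptions: `iter_profile_posts` is a hypothetical helper, `fetch` stands in for any callable such as `medium_json`, the base URL is assumed to carry no query string, and list-valued paging parameters like `ignoredIds` are simply skipped.

```python
from urllib.parse import urlencode

def iter_profile_posts(profile_url, fetch, max_pages=5):
    """Yield (post_id, post) pairs across pages (hypothetical helper).

    `fetch` takes a URL and returns the parsed ?format=json dict.
    `profile_url` is assumed to have no query string of its own.
    """
    seen = set()
    url = profile_url
    for _ in range(max_pages):
        payload = fetch(url)["payload"]
        posts = payload.get("references", {}).get("Post", {})
        new_ids = [pid for pid in posts if pid not in seen]
        if not new_ids:
            break  # the page repeated earlier posts, so nothing is left
        for pid in new_ids:
            seen.add(pid)
            yield pid, posts[pid]
        nxt = (payload.get("paging") or {}).get("next")
        if not nxt:
            break
        # Keep scalar params only; list-valued ones such as ignoredIds are skipped.
        params = {k: v for k, v in nxt.items() if isinstance(v, (str, int))}
        url = profile_url + "?" + urlencode(params)

# Stand-in for medium_json so the loop can be tried offline.
def fake_fetch(url):
    if "page=2" in url:
        return {"payload": {"references": {"Post": {"a": {"title": "A"}}}, "paging": {}}}
    return {"payload": {
        "references": {"Post": {"a": {"title": "A"}, "b": {"title": "B"}}},
        "paging": {"next": {"limit": 10, "to": "1495652975362",
                            "source": "overview", "page": 2, "ignoredIds": []}},
    }}

titles = [p["title"] for _, p in iter_profile_posts("https://medium.com/@someone", fake_fetch)]
print(titles)  # ['A', 'B']
```

Passing the fetcher in as a parameter keeps the loop testable offline; the `max_pages` cap is a guard against paging that never terminates.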
338
+ ---
339
+
340
+ ## Path 5: `?format=json` on publication page
341
+
342
+ Returns publication metadata and recent posts with metrics.
343
+
344
+ ```python
345
+ data = medium_json("https://medium.com/towards-data-science")
346
+ payload = data['payload']
347
+
348
+ coll = payload['collection']
349
+ # coll['name'] -> "TDS Archive"
350
+ # coll['slug'] -> "data-science"
351
+ # coll['description'] -> full description string
352
+ # coll['subscriberCount'] -> 828527
353
+ # coll['metadata']['followerCount'] -> 828527
354
+ # coll['tags'] -> ['DATA SCIENCE', 'MACHINE LEARNING', ...]
355
+
356
+ posts = payload['references'].get('Post', {})
+ for pid, p in posts.items():
+     v = p['virtuals']
+     print(p['title'], v['totalClapCount'], p['isSubscriptionLocked'])
+     # Also includes: p['visibility'] (0=free, 2=paywalled)
361
+
362
+ # Paginate (same pattern as user profile)
363
+ paging = payload['paging']
364
+ # paging['next'] = {'to': '1738573325936', 'page': 3}
365
+ ```
366
+
367
+ ---
368
+
369
+ ## Retrieving the article ID from a URL
370
+
371
+ The `id` is the last 12 hex chars of a Medium article URL slug:
372
+
373
+ ```python
374
+ import re
375
+
376
+ url = "https://medium.com/@karpathy/software-2-0-a64152b37c35"
377
+ article_id = re.search(r'-([a-f0-9]{12})$', url.rstrip('/').split('?')[0])
+ if article_id:
+     article_id = article_id.group(1)  # "a64152b37c35"
380
+ ```
381
+
382
+ This ID is the same across all URL forms (`medium.com/@user/slug`, `user.medium.com/slug`, `medium.com/publication/slug`).
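As a small illustration of that claim, the extraction can be wrapped in a function that parses the URL first, so query strings and trailing slashes do not matter. `medium_article_id` is a hypothetical name, and the publication URL in the example is made up.

```python
import re
from urllib.parse import urlsplit

def medium_article_id(url):
    """Return the trailing 12-hex-char article ID, or None (hypothetical helper)."""
    # urlsplit drops the query string and fragment from the path for us.
    path = urlsplit(url).path.rstrip("/")
    m = re.search(r"-([a-f0-9]{12})$", path)
    return m.group(1) if m else None

urls = [
    "https://medium.com/@karpathy/software-2-0-a64152b37c35",
    "https://karpathy.medium.com/software-2-0-a64152b37c35?source=rss-ac9d9a35533e------2",
    "https://medium.com/some-publication/software-2-0-a64152b37c35/",  # made-up publication path
]
ids = [medium_article_id(u) for u in urls]
print(ids)  # ['a64152b37c35', 'a64152b37c35', 'a64152b37c35']
```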
383
+
384
+ ---
385
+
386
+ ## Gotchas
387
+
388
+ - **HTTP 403 on plain `http_get`** — The default `http_get` helper sends `User-Agent: Mozilla/5.0` which Medium accepts for most endpoints, but article HTML pages (without `?format=json`) return 403. Always use `?format=json` for article and profile pages.
389
+
390
+ - **`?format=json` works; profile stream API does not** — `https://medium.com/_/api/users/{id}/profile/stream` returns HTTP 403 for unauthenticated requests. Use `?format=json` on the profile URL instead.
391
+
392
+ - **`?format=json` on search pages returns 403 or broken JSON** — `medium.com/search?q=...&format=json` and `medium.com/search/posts?q=...&format=json` both fail. Search is not available without auth.
393
+
394
+ - **GraphQL `collection()` requires ID, not slug** — `collection(slug: "towards-data-science")` returns HTTP 400. You must use the 12-character hex ID (e.g. `"7f60cf5620c9"`). Get it from `?format=json` on the publication page: `payload['collection']['id']`.
395
+
396
+ - **GraphQL `tags` field on `post()` returns HTTP 400** — Use `topics { name slug }` instead. Topics are a subset of tags but work without auth.
397
+
398
+ - **GraphQL visibility is a string, not a number** — `post().visibility` returns `"PUBLIC"` or `"LOCKED"` (string). The `?format=json` `value.visibility` field uses integers: `0`=public, `2`=locked. Both agree on the lock status.
399
+
400
+ - **`totalClapCount` vs `recommends`** — `totalClapCount` (60865) counts all claps (Medium allows up to 50 claps per reader). `recommends` (8846) counts unique clappers. The GraphQL `clapCount` field equals `totalClapCount`, not `recommends`.
401
+
402
+ - **RSS returns at most 10 items, no clap counts** — RSS is best for getting recent article links + full HTML body. Use `?format=json` profile if you need metrics.
403
+
404
+ - **RSS link contains tracking params** — `posts[0]['link']` includes `?source=rss-{userId}------2`. Strip with `.split('?')[0]` if you need a clean URL.
405
+
406
+ - **`content:encoded` in RSS is full HTML, not plaintext** — Strip HTML tags if you want plaintext: `re.sub(r'<[^>]+>', '', body_html)`.
407
+
408
+ - **Medium subdomains** — Some users have custom subdomains (`karpathy.medium.com`). Both `medium.com/@karpathy/...` and `karpathy.medium.com/...` resolve to the same article; `?format=json` works on both.
409
+
410
+ - **towardsdatascience.com is no longer Medium** — TDS moved to its own WordPress site. `towardsdatascience.com/article-slug?format=json` returns full WordPress HTML, not Medium JSON. Use `medium.com/towards-data-science` for the archived Medium publication.
411
+
412
+ - **No public search API** — Medium has no Algolia equivalent. Finding articles by keyword requires either a browser, or fetching a user/publication feed and filtering locally.
413
+
414
+ - **Timestamps are unix milliseconds** — `firstPublishedAt`, `createdAt`, `latestPublishedAt` are all in milliseconds. Convert: `datetime.fromtimestamp(val['firstPublishedAt'] / 1000, tz=timezone.utc)`.
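Several of the gotchas above reduce to one-line normalisations. The helpers below are a sketch of how they might be collected in one place; all four function names are hypothetical, and `is_locked` assumes the only values are the ones listed above (`"PUBLIC"`/`"LOCKED"` and `0`/`2`).

```python
import re
from datetime import datetime, timezone

def ms_to_datetime(ms):
    """Convert a unix-milliseconds timestamp to an aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

def clean_link(link):
    """Drop tracking params such as ?source=rss-... from an RSS link."""
    return link.split("?")[0]

def is_locked(visibility):
    """Accept either the GraphQL string or the ?format=json integer."""
    if isinstance(visibility, str):
        return visibility == "LOCKED"
    return visibility == 2

def html_to_text(body_html):
    """Crude tag stripping, same regex as in the gotcha above."""
    return re.sub(r"<[^>]+>", "", body_html)

# firstPublishedAt value quoted earlier for the Software 2.0 article.
print(ms_to_datetime(1510438733751).strftime("%Y-%m-%d %H:%M:%S"))  # 2017-11-11 22:18:53
print(is_locked("LOCKED"), is_locked(0))  # True False
```

The printed time matches the `pubDate` that the RSS feed reports for the same article, which is a quick sanity check on the milliseconds assumption.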