@pencil-agent/nano-pencil 2.0.0 → 2.0.1

Files changed (195)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/mcp/mcp-client.d.ts +3 -1
  7. package/dist/core/mcp/mcp-client.js +6 -6
  8. package/dist/core/mcp/mcp-config.d.ts +3 -3
  9. package/dist/core/mcp/mcp-config.js +1 -1
  10. package/dist/core/mcp/mcp-manager.d.ts +5 -1
  11. package/dist/core/mcp/mcp-manager.js +1 -1
  12. package/dist/core/platform/config/resource-loader.d.ts +2 -0
  13. package/dist/core/platform/config/resource-loader.js +2 -2
  14. package/dist/core/runtime/agent-session.d.ts +12 -0
  15. package/dist/core/runtime/agent-session.js +8 -8
  16. package/dist/core/runtime/sdk.d.ts +8 -0
  17. package/dist/core/runtime/sdk.js +1 -1
  18. package/dist/extensions/builtin/AGENT.md +115 -115
  19. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  20. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  91. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  92. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  93. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  94. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  95. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  96. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  97. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  98. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  99. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  100. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  101. package/dist/extensions/builtin/browser/browser.md +73 -73
  102. package/dist/extensions/builtin/browser/install.md +142 -142
  103. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  104. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  105. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  107. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  108. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  109. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  110. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  111. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  112. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  113. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  114. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  115. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  116. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  117. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  118. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  119. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  120. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  121. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  122. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  123. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  124. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  125. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  126. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  127. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  128. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  129. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  130. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  131. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  132. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  133. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  134. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  135. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  136. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  137. package/dist/extensions/builtin/goal/README.md +67 -67
  138. package/dist/extensions/builtin/grub/README.md +112 -112
  139. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  140. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  141. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  142. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  143. package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
  144. package/dist/extensions/builtin/loop/README.md +92 -92
  145. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  146. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  147. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  148. package/dist/extensions/builtin/sal/README.md +72 -72
  149. package/dist/extensions/builtin/security-audit/README.md +289 -289
  150. package/dist/extensions/builtin/team/AGENT.md +112 -112
  151. package/dist/extensions/builtin/team/TESTING.md +299 -299
  152. package/dist/extensions/builtin/token-save/README.md +56 -56
  153. package/dist/extensions/optional/AGENT.md +10 -10
  154. package/dist/modes/interactive/interactive-mode.js +36 -36
  155. package/dist/modes/interactive/theme/dark.json +85 -85
  156. package/dist/modes/interactive/theme/light.json +84 -84
  157. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  158. package/dist/modes/interactive/theme/warm.json +81 -81
  159. package/dist/node_modules/@pencil-agent/agent-core/dist/agent-loop.js +3 -2
  160. package/dist/node_modules/@pencil-agent/agent-core/dist/structured-adaptive-agent-loop.js +2 -1
  161. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  162. package/docs/cc-agent-design.md +1297 -0
  163. package/docs/cc-tui-design.md +1333 -0
  164. package/docs/codex-goal-command-impl.md +1055 -1055
  165. package/docs/codex-goal-vs-grub.md +500 -500
  166. package/docs/custom-provider.md +27 -27
  167. package/docs/extensions.md +27 -27
  168. package/docs/keybindings.md +27 -27
  169. package/docs/loop 重构完成总结.md +250 -250
  170. package/docs/loop 重构完成报告.md +122 -122
  171. package/docs/loop 重构方案.md +1222 -1222
  172. package/docs/loop 重构方案实现报告.md +158 -158
  173. package/docs/loop 重构方案对比分析.md +128 -128
  174. package/docs/loop 重构计划.md +320 -320
  175. package/docs/loop-usage-examples.md +214 -214
  176. package/docs/models.md +27 -27
  177. package/docs/nanoPencil-学习计划.md +170 -0
  178. package/docs/packages.md +27 -27
  179. package/docs/pi-design-philosophy.md +457 -457
  180. package/docs/planmode.md +1987 -1987
  181. package/docs/prompt-templates.md +27 -27
  182. package/docs/providers.md +27 -27
  183. package/docs/scan-report.md +3820 -0
  184. package/docs/sdk.md +27 -27
  185. package/docs/skills.md +27 -27
  186. package/docs/themes.md +27 -27
  187. package/docs/tui.md +27 -27
  188. package/docs//345/257/271/346/240/207Claude-Code.md +1775 -0
  189. package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +261 -0
  190. package/package.json +190 -190
  191. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +0 -851
  192. package/docs/SDK-TESTING.md +0 -364
  193. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +0 -593
  194. package/docs/startup-performance-optimization.md +0 -301
  195. package/docs/认知地图.md +0 -47
@@ -1,414 +1,414 @@
1
- # Medium — Data Extraction
2
-
3
- `https://medium.com` — blogging platform. Three access paths tested and validated: the undocumented `?format=json` endpoint (fastest for article + publication data), the undocumented GraphQL API (best for targeted metric lookups), and RSS feeds (best for recent posts lists without auth). No browser needed for any read-only task.
4
-
5
- ## Do this first: pick your access path
6
-
7
- | Goal | Best approach | Latency |
8
- |------|--------------|---------|
9
- | Article metadata + full body | `?format=json` on article URL | ~400ms |
10
- | Article metrics only (claps, visibility) | GraphQL `post(id:)` | ~275ms |
11
- | Author profile + follower count | GraphQL `user(username:)` | ~220ms |
12
- | Recent posts for a user (up to 10) | `?format=json` on profile URL | ~240ms |
13
- | Recent posts for a publication | `?format=json` on publication URL | ~300ms |
14
- | Latest post list (feed, 10 most recent, no pagination) | RSS feed | ~260ms |
15
- | Full article body as HTML | RSS `content:encoded` field | ~260ms |
16
- | Publication subscriber count | `?format=json` on publication URL | ~300ms |
17
-
18
- **Never use a browser for read-only Medium tasks.** All article content, metadata, and metrics are available over HTTP. Browser is only needed for authenticated actions (clapping, posting, account management).
19
-
20
- ---
21
-
22
- ## The XSSI prefix
23
-
24
- Every `?format=json` response starts with the anti-hijacking prefix `])}while(1);</x>` before the JSON. **Strip it before parsing.** The helper below handles this.
25
-
26
- ```python
27
- import urllib.request, gzip, json, re
28
-
29
- def medium_json(url):
30
-     """Fetch any Medium URL with ?format=json and return parsed dict.
31
-     Strips the XSSI prefix ])}while(1);</x> automatically.
32
-     Works on: article URLs, user profile URLs, publication URLs.
33
-     Does NOT work on: search pages, /latest, profile stream API.
34
-     """
35
-     sep = '&' if '?' in url else '?'
36
-     req = urllib.request.Request(
37
-         url + sep + 'format=json',
38
-         headers={
39
-             "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
40
-             "Accept": "application/json, */*",
41
-             "Accept-Encoding": "gzip",
42
-         }
43
-     )
44
-     with urllib.request.urlopen(req, timeout=20) as r:
45
-         raw = r.read()
46
-         if r.headers.get("Content-Encoding") == "gzip":
47
-             raw = gzip.decompress(raw)
48
-     text = raw.decode()
49
-     # Strip everything before the first {
50
-     return json.loads(re.sub(r'^[^\{]+', '', text))
51
- ```
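The prefix handling can be checked offline, without a network call. The following is a minimal sketch, assuming the prefix is exactly `])}while(1);</x>` as described above; `strip_xssi` is a hypothetical helper name and the sample string is made up for illustration.

```python
import json

XSSI_PREFIX = "])}while(1);</x>"  # prefix as described in the section above

def strip_xssi(text):
    """Parse a ?format=json response body, removing the XSSI prefix if present."""
    if text.startswith(XSSI_PREFIX):
        text = text[len(XSSI_PREFIX):]
    return json.loads(text)

# Made-up response body shaped like the payload described in this document
sample = '])}while(1);</x>{"payload": {"value": {"title": "Demo"}}}'
parsed = strip_xssi(sample)
print(parsed["payload"]["value"]["title"])  # Demo
```

Matching the literal prefix is stricter than deleting everything before the first `{`; which behaviour is preferable depends on how stable the prefix turns out to be.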
52
-
53
- ---
54
-
55
- ## Path 1: `?format=json` — article metadata + body (fastest for articles)
56
-
57
- Append `?format=json` to any article URL. Returns full metadata, virtuals (metrics), and the complete article body in a structured `bodyModel`. No auth is required for either public or subscriber-locked articles: the metadata and full body are always returned, even though a browser would show only a truncated body for a paywalled article.
58
-
59
- ```python
60
- data = medium_json("https://medium.com/@karpathy/software-2-0-a64152b37c35")
61
- payload = data['payload']
62
- val = payload['value'] # article fields
63
- refs = payload['references'] # User, Social, SocialStats dicts keyed by ID
64
-
65
- # --- Article fields ---
66
- title = val['title'] # "Software 2.0"
67
- article_id = val['id'] # "a64152b37c35"
68
- creator_id = val['creatorId'] # "ac9d9a35533e"
69
- slug = val['uniqueSlug'] # "software-2-0-a64152b37c35"
70
- url = val['canonicalUrl'] # "https://medium.com/@karpathy/..."
71
- first_pub = val['firstPublishedAt'] # unix ms: 1510438733751
72
- last_pub = val['latestPublishedAt'] # unix ms: 1615659523264
73
- visibility = val['visibility'] # 0=public, 2=subscriber-locked
74
- is_locked = val['isSubscriptionLocked'] # True if paywalled
75
- locked_src = val['lockedPostSource'] # 0=free, 1=Medium Partner Program
76
-
77
- # --- Metrics (in val['virtuals']) ---
78
- virtuals = val['virtuals']
79
- clap_count = virtuals['totalClapCount'] # 60865 (all claps, including multi-clap)
80
- recommends = virtuals['recommends'] # 8846 (unique clappers)
81
- read_time = virtuals['readingTime'] # 8.79811320754717 (minutes, float)
82
- word_count = virtuals['wordCount'] # 2146
83
-
84
- # --- Tags ---
85
- tags = [t['slug'] for t in virtuals['tags']]
86
- # ['machine-learning', 'artificial-intelligence', 'programming', 'software-development', 'future']
87
-
88
- # --- Author (from references) ---
89
- user = refs['User'][creator_id]
90
- author_name = user['name'] # "Andrej Karpathy"
91
- author_handle = user['username'] # "karpathy"
92
- author_bio = user['bio'] # "I like to train deep neural nets on large datasets."
93
- author_twitter = user['twitterScreenName'] # "karpathy"
94
-
95
- # --- Follower count (from SocialStats) ---
96
- ss = refs['SocialStats'][creator_id]
97
- follower_count = ss['usersFollowedByCount'] # 60027
98
- following_count = ss['usersFollowedCount'] # 183
99
- ```
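The `firstPublishedAt` and `latestPublishedAt` values are Unix timestamps in milliseconds, so they need converting before they can be compared with human-readable dates. A minimal sketch of that conversion, using only the standard library; `ms_to_utc` is a hypothetical helper name, not part of the original helpers.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def ms_to_utc(ms):
    """Convert a Unix timestamp in milliseconds to an aware UTC datetime."""
    # Integer timedelta arithmetic avoids float rounding on the millisecond part.
    return EPOCH + timedelta(milliseconds=ms)

first_pub = ms_to_utc(1510438733751)  # firstPublishedAt from the example above
print(first_pub.isoformat())          # 2017-11-11T22:18:53.751000+00:00
```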
100
-
101
- ### Detect paywall
102
-
103
- ```python
104
- # Paywalled (Medium Partner Program): isSubscriptionLocked=True, visibility=2, lockedPostSource=1
105
- # Free: isSubscriptionLocked=False, visibility=0, lockedPostSource=0
106
- is_paywalled = val['isSubscriptionLocked'] # True / False
107
- ```
108
-
109
- Confirmed on real TDS articles: paywalled articles return `isSubscriptionLocked=True`, `visibility=2`, `lockedPostSource=1`. Free articles: all three are `False`/`0`.
110
-
111
- ### Article body
112
-
113
- The full body is in `val['content']['bodyModel']['paragraphs']` — a list of dicts:
114
-
115
- ```python
116
- paragraphs = val['content']['bodyModel']['paragraphs']
117
-
118
- # Paragraph types (confirmed for this article):
119
- # type=1 -> body text (P)
120
- # type=3 -> heading (H1/H2)
121
- # type=4 -> image (text is empty; metadata has image ID)
122
-
123
- # Reconstruct plain text:
124
- text_paras = [p['text'] for p in paragraphs if p.get('text')]
125
- full_text = '\n\n'.join(text_paras)
126
- ```
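Where headings should stay distinguishable from body text, the paragraph `type` codes can drive a slightly richer rendering. This is a sketch under assumptions: it covers only the three type codes listed above, treats any other code as body text, and `paragraphs_to_markdown` is a hypothetical name. The sample list is made up.

```python
# Type codes taken from the notes above; other codes may exist and are not covered.
BODY, HEADING, IMAGE = 1, 3, 4

def paragraphs_to_markdown(paragraphs):
    """Render bodyModel paragraphs as Markdown-style text, skipping images."""
    lines = []
    for p in paragraphs:
        text = p.get("text", "")
        if p.get("type") == IMAGE or not text:
            continue  # images carry no text
        if p.get("type") == HEADING:
            lines.append("## " + text)
        else:
            lines.append(text)  # body text and any unrecognised type
    return "\n\n".join(lines)

# Made-up sample shaped like val['content']['bodyModel']['paragraphs']
sample = [
    {"type": 3, "text": "Intro"},
    {"type": 1, "text": "First paragraph."},
    {"type": 4, "text": ""},
    {"type": 1, "text": "Second paragraph."},
]
print(paragraphs_to_markdown(sample))
```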
127
-
128
- ---
129
-
130
- ## Path 2: GraphQL API — targeted metric lookups
131
-
132
- `POST https://medium.com/_/graphql` with a JSON body. No auth, no CSRF token required.
133
- Returns HTTP 200 with JSON even for unauthenticated queries. Invalid fields return HTTP 400 — do not assume a field exists without testing first.
134
-
135
- ```python
136
- import json, urllib.request, gzip
137
-
138
- def gql(query):
139
-     body = json.dumps({"query": query}).encode()
140
-     req = urllib.request.Request(
141
-         "https://medium.com/_/graphql",
142
-         data=body,
143
-         headers={
144
-             "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
145
-             "Content-Type": "application/json",
146
-             "Accept": "application/json",
147
-             "Accept-Encoding": "gzip",
148
-         },
149
-         method="POST",
150
-     )
151
-     with urllib.request.urlopen(req, timeout=20) as r:
152
-         raw = r.read()
153
-         if r.headers.get("Content-Encoding") == "gzip":
154
-             raw = gzip.decompress(raw)
155
-     return json.loads(raw.decode())
156
- ```
157
-
158
- ### Fetch article metrics (fastest)
159
-
160
- ```python
161
- result = gql("""
162
- {
163
- post(id: "a64152b37c35") {
164
- title
165
- id
166
- firstPublishedAt
167
- latestPublishedAt
168
- visibility
169
- uniqueSlug
170
- canonicalUrl
171
- mediumUrl
172
- isLocked
173
- clapCount
174
- readingTime
175
- wordCount
176
- }
177
- }
178
- """)
179
- post = result['data']['post']
180
- # post['visibility'] -> "PUBLIC" | "LOCKED" (string, not numeric)
181
- # post['isLocked'] -> False | True
182
- # post['clapCount'] -> 60865 (same as totalClapCount in format=json)
183
- # post['readingTime'] -> 8.79811320754717 (minutes)
184
- # post['wordCount'] -> 2146
185
- ```
186
-
187
- **Confirmed working `post()` fields:** `title`, `id`, `createdAt`, `updatedAt`, `firstPublishedAt`, `latestPublishedAt`, `visibility`, `uniqueSlug`, `canonicalUrl`, `mediumUrl`, `isLocked`, `clapCount`, `readingTime`, `wordCount`
188
-
189
- **Nested objects that work:** `topics { name slug }`, `creator { name username }`, `collection { name id slug description domain creator { name username } }`
190
-
191
- **Fields that return HTTP 400 (not available):** `tags`, `author`, `recommends`, `content`, `publication`, `responses`, `sequence`
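Since one unsupported field fails the whole request with HTTP 400, it can help to validate field names before sending anything. The sketch below is one possible approach, not part of the original helper set: `build_post_query` and the two set names are hypothetical, and the sets simply copy the lists above. Nested selections such as `topics { name slug }` are out of scope here.

```python
import json

# Field names copied from the confirmed-working and HTTP 400 lists above.
WORKING_POST_FIELDS = {
    "title", "id", "createdAt", "updatedAt", "firstPublishedAt",
    "latestPublishedAt", "visibility", "uniqueSlug", "canonicalUrl",
    "mediumUrl", "isLocked", "clapCount", "readingTime", "wordCount",
}
REJECTED_POST_FIELDS = {
    "tags", "author", "recommends", "content", "publication",
    "responses", "sequence",
}

def build_post_query(post_id, fields):
    """Build a post(id:) query string, refusing fields known or likely to fail."""
    bad = [f for f in fields if f in REJECTED_POST_FIELDS]
    if bad:
        raise ValueError(f"fields known to return HTTP 400: {bad}")
    unknown = [f for f in fields if f not in WORKING_POST_FIELDS]
    if unknown:
        raise ValueError(f"untested fields, verify manually first: {unknown}")
    # json.dumps yields a quoted string literal, which is enough for plain hex IDs.
    return "{ post(id: %s) { %s } }" % (json.dumps(post_id), " ".join(fields))

print(build_post_query("a64152b37c35", ["title", "clapCount"]))
# { post(id: "a64152b37c35") { title clapCount } }
```

The resulting string can be passed to the `gql()` helper defined earlier in this section.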
192
-
193
- ### Fetch author profile
194
-
195
- ```python
196
- result = gql("""
197
- {
198
- user(username: "karpathy") {
199
- name
200
- username
201
- id
202
- bio
203
- imageId
204
- twitterScreenName
205
- mediumMemberAt
206
- socialStats {
207
- followerCount
208
- followingCount
209
- }
210
- }
211
- }
212
- """)
213
- user = result['data']['user']
214
- # user['name'] -> "Andrej Karpathy"
215
- # user['id'] -> "ac9d9a35533e"
216
- # user['bio'] -> "I like to train deep neural nets on large datasets."
217
- # user['twitterScreenName'] -> "karpathy"
218
- # user['socialStats']['followerCount'] -> 60028
219
- # user['mediumMemberAt'] -> 0 (not a member); nonzero = unix ms join date
220
- ```
221
-
222
- **Confirmed working `user()` fields:** `name`, `username`, `id`, `bio`, `imageId`, `twitterScreenName`, `mediumMemberAt`, `socialStats { followerCount followingCount }`
223
-
224
- **Fields that return HTTP 400:** `followerCount` (top-level), `followingCount` (top-level), `postCount`
225
-
226
- ### Fetch collection (publication) by ID
227
-
228
- The GraphQL `collection()` query only accepts `id`, not `slug`. Get the ID from `?format=json` on the publication page.
229
-
230
- ```python
231
- # TDS Archive id: 7f60cf5620c9 (from medium.com/towards-data-science?format=json)
232
- result = gql("""
233
- {
234
- collection(id: "7f60cf5620c9") {
235
- name
236
- id
237
- slug
238
- description
239
- domain
240
- creator { name username }
241
- }
242
- }
243
- """)
244
- coll = result['data']['collection']
245
- # coll['name'] -> "TDS Archive"
246
- # coll['slug'] -> "data-science"
247
- ```
248
-
249
- ---
250
-
251
- ## Path 3: RSS feeds (best for recent posts list + article bodies)
252
-
253
- Works with plain `http_get`. Returns up to 10 most recent posts. Full article HTML is in `content:encoded`. No clap count or visibility info in RSS.
254
-
255
- ```python
256
- import re
257
- from helpers import http_get
258
-
259
- def parse_rss_items(rss_xml):
260
-     """Extract items from Medium RSS feed. Returns list of dicts."""
261
-     def cdata(tag, text):
262
-         m = re.search(rf'<{tag}[^>]*><!\[CDATA\[(.*?)\]\]></{tag}>', text, re.DOTALL)
263
-         return m.group(1).strip() if m else None
264
-
265
-     items = []
266
-     for raw in re.findall(r'<item>(.*?)</item>', rss_xml, re.DOTALL):
267
-         # link is plain text (not CDATA)
268
-         link_m = re.search(r'<link>(.*?)</link>', raw, re.DOTALL)
269
-         items.append({
270
-             'title': cdata('title', raw),
271
-             'link': link_m.group(1).strip() if link_m else None,
272
-             'pubDate': cdata('pubDate', raw),
273
-             'creator': cdata('dc:creator', raw),
274
-             'tags': re.findall(r'<category><!\[CDATA\[(.*?)\]\]></category>', raw),
275
-             'body_html': cdata('content:encoded', raw), # full article HTML
276
-         })
277
-     return items
278
-
279
- # User feed (up to 10 latest posts)
280
- rss = http_get("https://medium.com/feed/@karpathy")
281
- posts = parse_rss_items(rss)
282
- # posts[0]['title'] -> "Software 2.0"
283
- # posts[0]['pubDate'] -> "Sat, 11 Nov 2017 22:18:53 GMT"
284
- # posts[0]['creator'] -> "Andrej Karpathy"
285
- # posts[0]['tags'] -> ['programming', 'software-development', 'artificial-intelligence', 'future', 'machine-learning']
286
- # posts[0]['link'] -> "https://karpathy.medium.com/software-2-0-a64152b37c35?source=rss-..."
287
- # posts[0]['body_html'] -> full article body as HTML string (~15KB for this article)
288
-
289
- # Publication feed (up to 10 latest posts)
290
- rss_pub = http_get("https://medium.com/feed/towards-data-science")
291
- pub_posts = parse_rss_items(rss_pub)
292
- ```
293
-
294
- **RSS limitations:**
295
- - RSS does not include clap count, view count, or paywall status.
296
- - `body_html` contains the full article body as HTML, including `<p>`, `<strong>`, `<a>`, `<img>` tags.
297
- - Pagination is not supported — RSS always returns the 10 most recent posts.
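The regex parser above depends on fields being wrapped in CDATA. As an alternative, the same fields can be read with the standard library's XML parser, which handles CDATA and plain text alike. This is a sketch under assumptions: it presumes the feed declares the conventional `dc` and `content` namespace URIs (not verified in the notes above), `parse_rss_items_et` is a hypothetical name, and the sample feed is made up.

```python
import xml.etree.ElementTree as ET

# Conventional namespace URIs for the dc: and content: prefixes (assumed here).
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}

def parse_rss_items_et(rss_xml):
    """Extract items from an RSS string using an XML parser instead of regexes."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "pubDate": item.findtext("pubDate"),
            "creator": item.findtext("dc:creator", namespaces=NS),
            "tags": [c.text for c in item.findall("category")],
            "body_html": item.findtext("content:encoded", namespaces=NS),
        })
    return items

# Made-up feed with the same element names the regex parser looks for
sample = """<rss version="2.0"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title><![CDATA[Demo post]]></title>
      <link>https://example.com/demo-post</link>
      <dc:creator><![CDATA[Jane Doe]]></dc:creator>
      <pubDate>Sat, 11 Nov 2017 22:18:53 GMT</pubDate>
      <category><![CDATA[programming]]></category>
      <category><![CDATA[future]]></category>
      <content:encoded><![CDATA[<p>Hello</p>]]></content:encoded>
    </item>
  </channel>
</rss>"""
items = parse_rss_items_et(sample)
print(items[0]["title"], items[0]["tags"])
```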
298
-
299
- ---
300
-
301
- ## Path 4: `?format=json` on user profile — recent posts with metrics
302
-
303
- Better than RSS when you need clap counts alongside post list. Returns up to `limit` posts (default 10) plus full author metadata.
304
-
305
- ```python
306
- data = medium_json("https://medium.com/@karpathy?limit=10")
307
- payload = data['payload']
308
-
309
- user = payload['user']
310
- # user['name'] -> "Andrej Karpathy"
311
- # user['username'] -> "karpathy"
312
- # user['bio'] -> "I like to train deep neural nets on large datasets."
313
-
314
- refs = payload['references']
315
- ss = refs['SocialStats'][user['userId']]
316
- # ss['usersFollowedByCount'] -> 60028 (followers)
317
- # ss['usersFollowedCount'] -> 183 (following)
318
-
319
- posts = refs.get('Post', {}) # dict keyed by post ID
320
- for pid, p in posts.items():
321
-     v = p['virtuals']
322
-     print(p['title'], v['totalClapCount'], round(v['readingTime'], 1))
323
-
324
- # Paginate: use paging['next'] from payload
325
- paging = payload['paging']
326
- next_params = paging['next']
327
- # next_params = {'limit': 10, 'to': '1495652975362', 'source': 'overview', 'page': 2, 'ignoredIds': []}
328
- # Append as query params to the same profile URL to get next page
329
- next_url = (
330
- f"https://medium.com/@{user['username']}"
331
- f"?limit={next_params['limit']}&to={next_params['to']}"
332
- f"&source={next_params['source']}&page={next_params['page']}"
333
- )
334
- data2 = medium_json(next_url)
335
- # Note: karpathy has only 8 total posts — pagination returns same refs on page 2
336
- ```
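The next-page URL can also be built with `urllib.parse.urlencode`, which takes care of escaping and avoids hand-written f-strings. A minimal sketch, assuming the `paging['next']` shape shown above; `next_page_url` is a hypothetical helper, and dropping empty lists such as `ignoredIds` is a choice made for this sketch rather than documented Medium behaviour.

```python
from urllib.parse import urlencode

def next_page_url(profile_url, next_params):
    """Build the next-page URL from paging['next'], URL-encoding each value."""
    # Empty lists and None values are omitted in this sketch.
    params = {k: v for k, v in next_params.items() if v not in ([], None)}
    return profile_url + "?" + urlencode(params, doseq=True)

# paging['next'] value copied from the example above
next_params = {'limit': 10, 'to': '1495652975362', 'source': 'overview', 'page': 2, 'ignoredIds': []}
print(next_page_url("https://medium.com/@karpathy", next_params))
# https://medium.com/@karpathy?limit=10&to=1495652975362&source=overview&page=2
```

The result can be passed straight to `medium_json`, which appends `&format=json` when the URL already contains a query string.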
337
-
338
- ---
339
-
340
- ## Path 5: `?format=json` on publication page
341
-
342
- Returns publication metadata and recent posts with metrics.
343
-
344
- ```python
345
- data = medium_json("https://medium.com/towards-data-science")
346
- payload = data['payload']
347
-
348
- coll = payload['collection']
349
- # coll['name'] -> "TDS Archive"
350
- # coll['slug'] -> "data-science"
351
- # coll['description'] -> full description string
352
- # coll['subscriberCount'] -> 828527
353
- # coll['metadata']['followerCount'] -> 828527
354
- # coll['tags'] -> ['DATA SCIENCE', 'MACHINE LEARNING', ...]
355
-
356
- posts = payload['references'].get('Post', {})
357
- for pid, p in posts.items():
358
-     v = p['virtuals']
359
-     print(p['title'], v['totalClapCount'], p['isSubscriptionLocked'])
360
-     # Also includes: p['visibility'] (0=free, 2=paywalled)
361
-
362
- # Paginate (same pattern as user profile)
363
- paging = payload['paging']
364
- # paging['next'] = {'to': '1738573325936', 'page': 3}
365
- ```
366
-
367
- ---
368
-
369
- ## Retrieving the article ID from a URL
370
-
371
- The `id` is the last 12 hex chars of a Medium article URL slug:
372
-
373
- ```python
374
- import re
375
-
376
- url = "https://medium.com/@karpathy/software-2-0-a64152b37c35"
377
- article_id = re.search(r'-([a-f0-9]{12})$', url.rstrip('/').split('?')[0])
378
- if article_id:
379
- article_id = article_id.group(1) # "a64152b37c35"
380
- ```
381
-
382
- This ID is the same across all URL forms (`medium.com/@user/slug`, `user.medium.com/slug`, `medium.com/publication/slug`).
383
-
384
- ---
385
-
386
- ## Gotchas
387
-
388
- - **HTTP 403 on plain `http_get`** — The default `http_get` helper sends `User-Agent: Mozilla/5.0` which Medium accepts for most endpoints, but article HTML pages (without `?format=json`) return 403. Always use `?format=json` for article and profile pages.
389
-
390
- - **`?format=json` works; profile stream API does not** — `https://medium.com/_/api/users/{id}/profile/stream` returns HTTP 403 for unauthenticated requests. Use `?format=json` on the profile URL instead.
391
-
392
- - **`?format=json` on search pages returns 403 or broken JSON** — `medium.com/search?q=...&format=json` and `medium.com/search/posts?q=...&format=json` both fail. Search is not available without auth.
393
-
394
- - **GraphQL `collection()` requires ID, not slug** — `collection(slug: "towards-data-science")` returns HTTP 400. You must use the 12-character hex ID (e.g. `"7f60cf5620c9"`). Get it from `?format=json` on the publication page: `payload['collection']['id']`.
395
-
396
- - **GraphQL `tags` field on `post()` returns HTTP 400** — Use `topics { name slug }` instead. Topics are a subset of tags but work without auth.
397
-
398
- - **GraphQL visibility is a string, not a number** — `post().visibility` returns `"PUBLIC"` or `"LOCKED"` (string). The `?format=json` `value.visibility` field uses integers: `0`=public, `2`=locked. Both agree on the lock status.
399
-
400
- - **`totalClapCount` vs `recommends`** — `totalClapCount` (60865) counts all claps (Medium allows up to 50 claps per reader). `recommends` (8846) counts unique clappers. The GraphQL `clapCount` field equals `totalClapCount`, not `recommends`.
401
-
402
- - **RSS returns at most 10 items, no clap counts** — RSS is best for getting recent article links + full HTML body. Use `?format=json` profile if you need metrics.
403
-
404
- - **RSS link contains tracking params** — `posts[0]['link']` includes `?source=rss-{userId}------2`. Strip with `.split('?')[0]` if you need a clean URL.
405
-
406
- - **`content:encoded` in RSS is full HTML, not plaintext** — Strip HTML tags if you want plaintext: `re.sub(r'<[^>]+>', '', body_html)`.
407
-
408
- - **Medium subdomains** — Some users have custom subdomains (`karpathy.medium.com`). Both `medium.com/@karpathy/...` and `karpathy.medium.com/...` resolve to the same article; `?format=json` works on both.
409
-
410
- - **towardsdatascience.com is no longer Medium** — TDS moved to its own WordPress site. `towardsdatascience.com/article-slug?format=json` returns full WordPress HTML, not Medium JSON. Use `medium.com/towards-data-science` for the archived Medium publication.
411
-
412
- - **No public search API** — Medium has no Algolia equivalent. Finding articles by keyword requires either a browser, or fetching a user/publication feed and filtering locally.
413
-
414
- - **Timestamps are unix milliseconds** — `firstPublishedAt`, `createdAt`, `latestPublishedAt` are all in milliseconds. Convert: `datetime.fromtimestamp(val['firstPublishedAt'] / 1000, tz=timezone.utc)`.
1
+ # Medium — Data Extraction
2
+
3
+ `https://medium.com` — blogging platform. Three access paths were tested and validated, none requiring auth: the undocumented `?format=json` endpoint (fastest for article and publication data), the undocumented GraphQL API (best for targeted metric lookups), and RSS feeds (best for a list of recent posts with full HTML bodies). No browser is needed for any read-only task.
4
+
5
+ ## Do this first: pick your access path
6
+
7
+ | Goal | Best approach | Latency |
8
+ |------|--------------|---------|
9
+ | Article metadata + full body | `?format=json` on article URL | ~400ms |
10
+ | Article metrics only (claps, visibility) | GraphQL `post(id:)` | ~275ms |
11
+ | Author profile + follower count | GraphQL `user(username:)` | ~220ms |
12
+ | Recent posts for a user (up to 10) | `?format=json` on profile URL | ~240ms |
13
+ | Recent posts for a publication | `?format=json` on publication URL | ~300ms |
14
+ | Latest posts as a feed (up to 10, no pagination) | RSS feed | ~260ms |
15
+ | Full article body as HTML | RSS `content:encoded` field | ~260ms |
16
+ | Publication subscriber count | `?format=json` on publication URL | ~300ms |
17
+
18
+ **Never use a browser for read-only Medium tasks.** All article content, metadata, and metrics are available over HTTP. A browser is only needed for authenticated actions (clapping, posting, account management).
19
+
20
+ ---
21
+
22
+ ## The XSSI prefix
23
+
24
+ Every `?format=json` response starts with the anti-hijacking prefix `])}while(1);</x>` before the JSON. **Strip it before parsing.** The helper below handles this.
25
+
26
+ ```python
27
+ import urllib.request, gzip, json, re
28
+
29
+ def medium_json(url):
30
+ """Fetch any Medium URL with ?format=json and return parsed dict.
31
+ Strips the XSSI prefix ])}while(1);</x> automatically.
32
+ Works on: article URLs, user profile URLs, publication URLs.
33
+ Does NOT work on: search pages, /latest, profile stream API.
34
+ """
35
+ sep = '&' if '?' in url else '?'
36
+ req = urllib.request.Request(
37
+ url + sep + 'format=json',
38
+ headers={
39
+ "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
40
+ "Accept": "application/json, */*",
41
+ "Accept-Encoding": "gzip",
42
+ }
43
+ )
44
+ with urllib.request.urlopen(req, timeout=20) as r:
45
+ raw = r.read()
46
+ if r.headers.get("Content-Encoding") == "gzip":
47
+ raw = gzip.decompress(raw)
48
+ text = raw.decode()
49
+ # Strip everything before the first {
50
+ return json.loads(re.sub(r'^[^\{]+', '', text))
51
+ ```
52
+
53
+ ---
54
+
55
+ ## Path 1: `?format=json` — article metadata + body (fastest for articles)
56
+
57
+ Append `?format=json` to any article URL. The response contains full metadata, virtuals (metrics), and the complete article body in a structured `bodyModel`. No auth is required for public or subscriber-locked articles: the metadata and full body are always returned, even for paywalled articles whose body a browser would show truncated.
58
+
59
+ ```python
60
+ data = medium_json("https://medium.com/@karpathy/software-2-0-a64152b37c35")
61
+ payload = data['payload']
62
+ val = payload['value'] # article fields
63
+ refs = payload['references'] # User, Social, SocialStats dicts keyed by ID
64
+
65
+ # --- Article fields ---
66
+ title = val['title'] # "Software 2.0"
67
+ article_id = val['id'] # "a64152b37c35"
68
+ creator_id = val['creatorId'] # "ac9d9a35533e"
69
+ slug = val['uniqueSlug'] # "software-2-0-a64152b37c35"
70
+ url = val['canonicalUrl'] # "https://medium.com/@karpathy/..."
71
+ first_pub = val['firstPublishedAt'] # unix ms: 1510438733751
72
+ last_pub = val['latestPublishedAt'] # unix ms: 1615659523264
73
+ visibility = val['visibility'] # 0=public, 2=subscriber-locked
74
+ is_locked = val['isSubscriptionLocked'] # True if paywalled
75
+ locked_src = val['lockedPostSource'] # 0=free, 1=Medium Partner Program
76
+
77
+ # --- Metrics (in val['virtuals']) ---
78
+ virtuals = val['virtuals']
79
+ clap_count = virtuals['totalClapCount'] # 60865 (all claps, including multi-clap)
80
+ recommends = virtuals['recommends'] # 8846 (unique clappers)
81
+ read_time = virtuals['readingTime'] # 8.79811320754717 (minutes, float)
82
+ word_count = virtuals['wordCount'] # 2146
83
+
84
+ # --- Tags ---
85
+ tags = [t['slug'] for t in virtuals['tags']]
86
+ # ['machine-learning', 'artificial-intelligence', 'programming', 'software-development', 'future']
87
+
88
+ # --- Author (from references) ---
89
+ user = refs['User'][creator_id]
90
+ author_name = user['name'] # "Andrej Karpathy"
91
+ author_handle = user['username'] # "karpathy"
92
+ author_bio = user['bio'] # "I like to train deep neural nets on large datasets."
93
+ author_twitter = user['twitterScreenName'] # "karpathy"
94
+
95
+ # --- Follower count (from SocialStats) ---
96
+ ss = refs['SocialStats'][creator_id]
97
+ follower_count = ss['usersFollowedByCount'] # 60027
98
+ following_count = ss['usersFollowedCount'] # 183
99
+ ```
100
+
101
+ ### Detect paywall
102
+
103
+ ```python
104
+ # Paywalled (Medium Partner Program): isSubscriptionLocked=True, visibility=2, lockedPostSource=1
105
+ # Free: isSubscriptionLocked=False, visibility=0, lockedPostSource=0
106
+ is_paywalled = val['isSubscriptionLocked'] # True / False
107
+ ```
108
+
109
+ Confirmed on real TDS articles: paywalled articles return `isSubscriptionLocked=True`, `visibility=2`, `lockedPostSource=1`. Free articles: all three are `False`/`0`.
110
+
111
+ ### Article body
112
+
113
+ The full body is in `val['content']['bodyModel']['paragraphs']` — a list of dicts:
114
+
115
+ ```python
116
+ paragraphs = val['content']['bodyModel']['paragraphs']
117
+
118
+ # Paragraph types (confirmed for this article):
119
+ # type=1 -> body text (P)
120
+ # type=3 -> heading (H1/H2)
121
+ # type=4 -> image (text is empty; metadata has image ID)
122
+
123
+ # Reconstruct plain text:
124
+ text_paras = [p['text'] for p in paragraphs if p.get('text')]
125
+ full_text = '\n\n'.join(text_paras)
126
+ ```
127
+
128
+ ---
129
+
130
+ ## Path 2: GraphQL API — targeted metric lookups
131
+
132
+ `POST https://medium.com/_/graphql` with a JSON body. No auth, no CSRF token required.
133
+ Returns HTTP 200 with JSON even for unauthenticated queries. Invalid fields return HTTP 400, which `urllib.request.urlopen` raises as `urllib.error.HTTPError`, so do not assume a field exists without testing first.
134
+
135
+ ```python
136
+ import json, urllib.request, gzip
137
+
138
+ def gql(query):
139
+ body = json.dumps({"query": query}).encode()
140
+ req = urllib.request.Request(
141
+ "https://medium.com/_/graphql",
142
+ data=body,
143
+ headers={
144
+ "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
145
+ "Content-Type": "application/json",
146
+ "Accept": "application/json",
147
+ "Accept-Encoding": "gzip",
148
+ },
149
+ method="POST",
150
+ )
151
+ with urllib.request.urlopen(req, timeout=20) as r:
152
+ raw = r.read()
153
+ if r.headers.get("Content-Encoding") == "gzip":
154
+ raw = gzip.decompress(raw)
155
+ return json.loads(raw.decode())
156
+ ```
157
+
158
+ ### Fetch article metrics (fastest)
159
+
160
+ ```python
161
+ result = gql("""
162
+ {
163
+ post(id: "a64152b37c35") {
164
+ title
165
+ id
166
+ firstPublishedAt
167
+ latestPublishedAt
168
+ visibility
169
+ uniqueSlug
170
+ canonicalUrl
171
+ mediumUrl
172
+ isLocked
173
+ clapCount
174
+ readingTime
175
+ wordCount
176
+ }
177
+ }
178
+ """)
179
+ post = result['data']['post']
180
+ # post['visibility'] -> "PUBLIC" | "LOCKED" (string, not numeric)
181
+ # post['isLocked'] -> False | True
182
+ # post['clapCount'] -> 60865 (same as totalClapCount in format=json)
183
+ # post['readingTime'] -> 8.79811320754717 (minutes)
184
+ # post['wordCount'] -> 2146
185
+ ```
186
+
187
+ **Confirmed working `post()` fields:** `title`, `id`, `createdAt`, `updatedAt`, `firstPublishedAt`, `latestPublishedAt`, `visibility`, `uniqueSlug`, `canonicalUrl`, `mediumUrl`, `isLocked`, `clapCount`, `readingTime`, `wordCount`
188
+
189
+ **Nested objects that work:** `topics { name slug }`, `creator { name username }`, `collection { name id slug description domain creator { name username } }`
190
+
191
+ **Fields that return HTTP 400 (not available):** `tags`, `author`, `recommends`, `content`, `publication`, `responses`, `sequence`
192
+
193
+ ### Fetch author profile
194
+
195
+ ```python
196
+ result = gql("""
197
+ {
198
+ user(username: "karpathy") {
199
+ name
200
+ username
201
+ id
202
+ bio
203
+ imageId
204
+ twitterScreenName
205
+ mediumMemberAt
206
+ socialStats {
207
+ followerCount
208
+ followingCount
209
+ }
210
+ }
211
+ }
212
+ """)
213
+ user = result['data']['user']
214
+ # user['name'] -> "Andrej Karpathy"
215
+ # user['id'] -> "ac9d9a35533e"
216
+ # user['bio'] -> "I like to train deep neural nets on large datasets."
217
+ # user['twitterScreenName'] -> "karpathy"
218
+ # user['socialStats']['followerCount'] -> 60028
219
+ # user['mediumMemberAt'] -> 0 (not a member); nonzero = unix ms join date
220
+ ```
221
+
222
+ **Confirmed working `user()` fields:** `name`, `username`, `id`, `bio`, `imageId`, `twitterScreenName`, `mediumMemberAt`, `socialStats { followerCount followingCount }`
223
+
224
+ **Fields that return HTTP 400:** `followerCount` (top-level), `followingCount` (top-level), `postCount`
225
+
226
+ ### Fetch collection (publication) by ID
227
+
228
+ The GraphQL `collection()` query only accepts `id`, not `slug`. Get the ID from `?format=json` on the publication page.
229
+
230
+ ```python
231
+ # TDS Archive id: 7f60cf5620c9 (from medium.com/towards-data-science?format=json)
232
+ result = gql("""
233
+ {
234
+ collection(id: "7f60cf5620c9") {
235
+ name
236
+ id
237
+ slug
238
+ description
239
+ domain
240
+ creator { name username }
241
+ }
242
+ }
243
+ """)
244
+ coll = result['data']['collection']
245
+ # coll['name'] -> "TDS Archive"
246
+ # coll['slug'] -> "data-science"
247
+ ```
248
+
249
+ ---
250
+
251
+ ## Path 3: RSS feeds (best for recent posts list + article bodies)
252
+
253
+ Works with plain `http_get`. Returns up to 10 most recent posts. Full article HTML is in `content:encoded`. No clap count or visibility info in RSS.
254
+
255
+ ```python
256
+ import re
257
+ from helpers import http_get
258
+
259
+ def parse_rss_items(rss_xml):
260
+ """Extract items from Medium RSS feed. Returns list of dicts."""
261
+ def cdata(tag, text):
262
+ m = re.search(rf'<{tag}[^>]*><!\[CDATA\[(.*?)\]\]></{tag}>', text, re.DOTALL)
263
+ return m.group(1).strip() if m else None
264
+
265
+ items = []
266
+ for raw in re.findall(r'<item>(.*?)</item>', rss_xml, re.DOTALL):
267
+ # link is plain text (not CDATA)
268
+ link_m = re.search(r'<link>(.*?)</link>', raw, re.DOTALL)
269
+ items.append({
270
+ 'title': cdata('title', raw),
271
+ 'link': link_m.group(1).strip() if link_m else None,
272
+ 'pubDate': cdata('pubDate', raw),
273
+ 'creator': cdata('dc:creator', raw),
274
+ 'tags': re.findall(r'<category><!\[CDATA\[(.*?)\]\]></category>', raw),
275
+ 'body_html': cdata('content:encoded', raw), # full article HTML
276
+ })
277
+ return items
278
+
279
+ # User feed (up to 10 latest posts)
280
+ rss = http_get("https://medium.com/feed/@karpathy")
281
+ posts = parse_rss_items(rss)
282
+ # posts[0]['title'] -> "Software 2.0"
283
+ # posts[0]['pubDate'] -> "Sat, 11 Nov 2017 22:18:53 GMT"
284
+ # posts[0]['creator'] -> "Andrej Karpathy"
285
+ # posts[0]['tags'] -> ['programming', 'software-development', 'artificial-intelligence', 'future', 'machine-learning']
286
+ # posts[0]['link'] -> "https://karpathy.medium.com/software-2-0-a64152b37c35?source=rss-..."
287
+ # posts[0]['body_html'] -> full article body as HTML string (~15KB for this article)
288
+
289
+ # Publication feed (up to 10 latest posts)
290
+ rss_pub = http_get("https://medium.com/feed/towards-data-science")
291
+ pub_posts = parse_rss_items(rss_pub)
292
+ ```
293
+
294
+ **RSS limitations:**
295
+ - RSS does not include clap count, view count, or paywall status.
296
+ - `body_html` contains the full article body as HTML, including `<p>`, `<strong>`, `<a>`, `<img>` tags.
297
+ - Pagination is not supported — RSS always returns the 10 most recent posts.
298
+
299
+ ---
300
+
301
+ ## Path 4: `?format=json` on user profile — recent posts with metrics
302
+
303
+ Better than RSS when you need clap counts alongside the post list. Returns up to `limit` posts (default 10) plus full author metadata.
304
+
305
+ ```python
306
+ data = medium_json("https://medium.com/@karpathy?limit=10")
307
+ payload = data['payload']
308
+
309
+ user = payload['user']
310
+ # user['name'] -> "Andrej Karpathy"
311
+ # user['username'] -> "karpathy"
312
+ # user['bio'] -> "I like to train deep neural nets on large datasets."
313
+
314
+ refs = payload['references']
315
+ ss = refs['SocialStats'][user['userId']]
316
+ # ss['usersFollowedByCount'] -> 60028 (followers)
317
+ # ss['usersFollowedCount'] -> 183 (following)
318
+
319
+ posts = refs.get('Post', {}) # dict keyed by post ID
320
+ for pid, p in posts.items():
321
+ v = p['virtuals']
322
+ print(p['title'], v['totalClapCount'], round(v['readingTime'], 1))
323
+
324
+ # Paginate: use paging['next'] from payload
325
+ paging = payload['paging']
326
+ next_params = paging['next']
327
+ # next_params = {'limit': 10, 'to': '1495652975362', 'source': 'overview', 'page': 2, 'ignoredIds': []}
328
+ # Append as query params to the same profile URL to get next page
329
+ next_url = (
330
+ f"https://medium.com/@{user['username']}"
331
+ f"?limit={next_params['limit']}&to={next_params['to']}"
332
+ f"&source={next_params['source']}&page={next_params['page']}"
333
+ )
334
+ data2 = medium_json(next_url)
335
+ # Note: karpathy has only 8 total posts — pagination returns same refs on page 2
336
+ ```
337
+
338
+ ---
339
+
340
+ ## Path 5: `?format=json` on publication page
341
+
342
+ Returns publication metadata and recent posts with metrics.
343
+
344
+ ```python
345
+ data = medium_json("https://medium.com/towards-data-science")
346
+ payload = data['payload']
347
+
348
+ coll = payload['collection']
349
+ # coll['name'] -> "TDS Archive"
350
+ # coll['slug'] -> "data-science"
351
+ # coll['description'] -> full description string
352
+ # coll['subscriberCount'] -> 828527
353
+ # coll['metadata']['followerCount'] -> 828527
354
+ # coll['tags'] -> ['DATA SCIENCE', 'MACHINE LEARNING', ...]
355
+
356
+ posts = payload['references'].get('Post', {})
357
+ for pid, p in posts.items():
358
+ v = p['virtuals']
359
+ print(p['title'], v['totalClapCount'], p['isSubscriptionLocked'])
360
+ # Also includes: p['visibility'] (0=free, 2=paywalled)
361
+
362
+ # Paginate (same pattern as user profile)
363
+ paging = payload['paging']
364
+ # paging['next'] = {'to': '1738573325936', 'page': 3}
365
+ ```
366
+
367
+ ---
368
+
369
+ ## Retrieving the article ID from a URL
370
+
371
+ The `id` is the last 12 hex chars of a Medium article URL slug:
372
+
373
+ ```python
374
+ import re
375
+
376
+ url = "https://medium.com/@karpathy/software-2-0-a64152b37c35"
377
+ article_id = re.search(r'-([a-f0-9]{12})$', url.rstrip('/').split('?')[0])
378
+ if article_id:
379
+ article_id = article_id.group(1) # "a64152b37c35"
380
+ ```
381
+
382
+ This ID is the same across all URL forms (`medium.com/@user/slug`, `user.medium.com/slug`, `medium.com/publication/slug`).
383
+
384
+ ---
385
+
386
+ ## Gotchas
387
+
388
+ - **HTTP 403 on plain `http_get`** — The default `http_get` helper sends `User-Agent: Mozilla/5.0` which Medium accepts for most endpoints, but article HTML pages (without `?format=json`) return 403. Always use `?format=json` for article and profile pages.
389
+
390
+ - **`?format=json` works; profile stream API does not** — `https://medium.com/_/api/users/{id}/profile/stream` returns HTTP 403 for unauthenticated requests. Use `?format=json` on the profile URL instead.
391
+
392
+ - **`?format=json` on search pages returns 403 or broken JSON** — `medium.com/search?q=...&format=json` and `medium.com/search/posts?q=...&format=json` both fail. Search is not available without auth.
393
+
394
+ - **GraphQL `collection()` requires ID, not slug** — `collection(slug: "towards-data-science")` returns HTTP 400. You must use the 12-character hex ID (e.g. `"7f60cf5620c9"`). Get it from `?format=json` on the publication page: `payload['collection']['id']`.
395
+
396
+ - **GraphQL `tags` field on `post()` returns HTTP 400** — Use `topics { name slug }` instead. Topics are a subset of tags but work without auth.
397
+
398
+ - **GraphQL visibility is a string, not a number** — `post().visibility` returns `"PUBLIC"` or `"LOCKED"` (string). The `?format=json` `value.visibility` field uses integers: `0`=public, `2`=locked. Both agree on the lock status.
399
+
400
+ - **`totalClapCount` vs `recommends`** — `totalClapCount` (60865) counts all claps (Medium allows up to 50 claps per reader). `recommends` (8846) counts unique clappers. The GraphQL `clapCount` field equals `totalClapCount`, not `recommends`.
401
+
402
+ - **RSS returns at most 10 items, no clap counts** — RSS is best for getting recent article links + full HTML body. Use `?format=json` profile if you need metrics.
403
+
404
+ - **RSS link contains tracking params** — `posts[0]['link']` includes `?source=rss-{userId}------2`. Strip with `.split('?')[0]` if you need a clean URL.
405
+
406
+ - **`content:encoded` in RSS is full HTML, not plaintext** — Strip HTML tags if you want plaintext: `re.sub(r'<[^>]+>', '', body_html)`.
407
+
408
+ - **Medium subdomains** — Some users have custom subdomains (`karpathy.medium.com`). Both `medium.com/@karpathy/...` and `karpathy.medium.com/...` resolve to the same article; `?format=json` works on both.
409
+
410
+ - **towardsdatascience.com is no longer Medium** — TDS moved to its own WordPress site. `towardsdatascience.com/article-slug?format=json` returns full WordPress HTML, not Medium JSON. Use `medium.com/towards-data-science` for the archived Medium publication.
411
+
412
+ - **No public search API** — Medium has no Algolia equivalent. Finding articles by keyword requires either a browser, or fetching a user/publication feed and filtering locally.
413
+
414
+ - **Timestamps are unix milliseconds** — `firstPublishedAt`, `createdAt`, `latestPublishedAt` are all in milliseconds. Convert: `datetime.fromtimestamp(val['firstPublishedAt'] / 1000, tz=timezone.utc)`.
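
Three of the gotchas above (RSS tracking params, HTML in `content:encoded`, millisecond timestamps) reduce to one-line conversions. The sketch below collects them as small helpers. It is not part of the package: the function names `clean_link`, `html_to_text`, and `ms_to_datetime` are hypothetical, and the `html.unescape` call is an addition on top of the tag-stripping regex the list suggests.

```python
import re
from datetime import datetime, timezone
from html import unescape


def clean_link(link):
    """Drop the query string (e.g. ?source=rss-...) from a Medium URL."""
    return link.split('?')[0]


def html_to_text(body_html):
    """Crude plaintext conversion: strip tags with the regex from the
    gotcha list, then decode HTML entities (assumed extra step)."""
    return unescape(re.sub(r'<[^>]+>', '', body_html)).strip()


def ms_to_datetime(ms):
    """Convert a Medium unix-millisecond timestamp to an aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# Example inputs modeled on the values quoted in this file.
print(clean_link("https://karpathy.medium.com/software-2-0-a64152b37c35?source=rss-abc------2"))
print(html_to_text("<p>Hello <strong>world</strong> &amp; more</p>"))
print(ms_to_datetime(1510438733751).strftime("%Y-%m-%d %H:%M:%S %Z"))
```

The regex-based tag stripper is only adequate for quick previews. For anything that must preserve structure (lists, code blocks, links), a real HTML parser is the safer choice.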