@pencil-agent/nano-pencil 2.0.1 → 2.0.2

Files changed (186)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/model/custom-providers.js +1 -1
  7. package/dist/core/model-registry.js +5 -5
  8. package/dist/extensions/builtin/AGENT.md +115 -115
  9. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  10. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  11. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  12. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  13. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  14. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  15. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  16. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  17. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  18. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  19. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  20. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  91. package/dist/extensions/builtin/browser/browser.md +73 -73
  92. package/dist/extensions/builtin/browser/install.md +142 -142
  93. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  94. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  95. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  96. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  97. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  98. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  99. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  100. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  101. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  102. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  103. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  104. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  105. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  107. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  108. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  109. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  110. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  111. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  112. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  113. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  114. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  115. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  116. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  117. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  118. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  119. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  120. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  121. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  122. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  123. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  124. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  125. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  126. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  127. package/dist/extensions/builtin/goal/README.md +67 -67
  128. package/dist/extensions/builtin/grub/README.md +112 -112
  129. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  130. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  131. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  132. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  133. package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
  134. package/dist/extensions/builtin/loop/README.md +92 -92
  135. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  136. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  137. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  138. package/dist/extensions/builtin/sal/README.md +72 -72
  139. package/dist/extensions/builtin/security-audit/README.md +289 -289
  140. package/dist/extensions/builtin/team/AGENT.md +112 -112
  141. package/dist/extensions/builtin/team/TESTING.md +299 -299
  142. package/dist/extensions/builtin/token-save/README.md +56 -56
  143. package/dist/extensions/optional/AGENT.md +10 -10
  144. package/dist/modes/interactive/controllers/input-submit-controller.js +2 -2
  145. package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
  146. package/dist/modes/interactive/interactive-mode.js +19 -19
  147. package/dist/modes/interactive/theme/dark.json +85 -85
  148. package/dist/modes/interactive/theme/light.json +84 -84
  149. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  150. package/dist/modes/interactive/theme/warm.json +81 -81
  151. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  152. package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
  153. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
  154. package/docs/SDK-TESTING.md +364 -0
  155. package/docs/codex-goal-command-impl.md +1055 -1055
  156. package/docs/codex-goal-vs-grub.md +500 -500
  157. package/docs/custom-provider.md +27 -27
  158. package/docs/extensions.md +27 -27
  159. package/docs/keybindings.md +27 -27
  160. package/docs/loop 重构完成总结.md +250 -250
  161. package/docs/loop 重构完成报告.md +122 -122
  162. package/docs/loop 重构方案.md +1222 -1222
  163. package/docs/loop 重构方案实现报告.md +158 -158
  164. package/docs/loop 重构方案对比分析.md +128 -128
  165. package/docs/loop 重构计划.md +320 -320
  166. package/docs/loop-usage-examples.md +214 -214
  167. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
  168. package/docs/models.md +27 -27
  169. package/docs/packages.md +27 -27
  170. package/docs/pi-design-philosophy.md +457 -457
  171. package/docs/planmode.md +1987 -1987
  172. package/docs/prompt-templates.md +27 -27
  173. package/docs/providers.md +27 -27
  174. package/docs/sdk.md +27 -27
  175. package/docs/skills.md +27 -27
  176. package/docs/startup-performance-optimization.md +301 -0
  177. package/docs/themes.md +27 -27
  178. package/docs/tui.md +27 -27
  179. package/docs/认知地图.md +47 -0
  180. package/package.json +190 -190
  181. package/docs/cc-agent-design.md +0 -1297
  182. package/docs/cc-tui-design.md +0 -1333
  183. package/docs/nanoPencil-学习计划.md +0 -170
  184. package/docs/scan-report.md +0 -3820
  185. package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
  186. package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +0 -261
@@ -1,414 +1,414 @@
1
- # Medium — Data Extraction
2
-
3
- `https://medium.com` — blogging platform. Three access paths tested and validated: the undocumented `?format=json` endpoint (fastest for article + publication data), the undocumented GraphQL API (best for targeted metric lookups), and RSS feeds (best for recent posts lists without auth). No browser needed for any read-only task.
4
-
5
- ## Do this first: pick your access path
6
-
7
- | Goal | Best approach | Latency |
8
- |------|--------------|---------|
9
- | Article metadata + full body | `?format=json` on article URL | ~400ms |
10
- | Article metrics only (claps, visibility) | GraphQL `post(id:)` | ~275ms |
11
- | Author profile + follower count | GraphQL `user(username:)` | ~220ms |
12
- | Recent posts for a user (up to 10) | `?format=json` on profile URL | ~240ms |
13
- | Recent posts for a publication | `?format=json` on publication URL | ~300ms |
14
- | Latest posts as a feed (10 most recent, no pagination) | RSS feed | ~260ms |
15
- | Full article body as HTML | RSS `content:encoded` field | ~260ms |
16
- | Publication subscriber count | `?format=json` on publication URL | ~300ms |
17
-
18
- **Never use a browser for read-only Medium tasks.** All article content, metadata, and metrics are available over HTTP. Browser is only needed for authenticated actions (clapping, posting, account management).
19
-
20
- ---
21
-
22
- ## The XSSI prefix
23
-
24
- Every `?format=json` response starts with the anti-hijacking prefix `])}while(1);</x>` before the JSON. **Strip it before parsing.** The helper below handles this.
25
-
26
- ```python
27
- import urllib.request, gzip, json, re
28
-
29
- def medium_json(url):
30
-     """Fetch any Medium URL with ?format=json and return parsed dict.
31
-     Strips the XSSI prefix ])}while(1);</x> automatically.
32
-     Works on: article URLs, user profile URLs, publication URLs.
33
-     Does NOT work on: search pages, /latest, profile stream API.
34
-     """
35
-     sep = '&' if '?' in url else '?'
36
-     req = urllib.request.Request(
37
-         url + sep + 'format=json',
38
-         headers={
39
-             "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
40
-             "Accept": "application/json, */*",
41
-             "Accept-Encoding": "gzip",
42
-         }
43
-     )
44
-     with urllib.request.urlopen(req, timeout=20) as r:
45
-         raw = r.read()
46
-         if r.headers.get("Content-Encoding") == "gzip":
47
-             raw = gzip.decompress(raw)
48
-     text = raw.decode()
49
-     # Strip everything before the first {
50
-     return json.loads(re.sub(r'^[^\{]+', '', text))
51
- ```
52
-
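As an alternative to the regex strip, the prefix can be removed by exact match. The sketch below is a minimal offline illustration: `strip_xssi`, `XSSI_PREFIX`, and the sample body are hypothetical names and data introduced here, not part of Medium's API or of the helper above.

```python
import json

# Prefix quoted in the section above; everything else here is illustrative.
XSSI_PREFIX = "])}while(1);</x>"

def strip_xssi(text: str) -> dict:
    """Parse a ?format=json body, dropping the anti-hijacking prefix if present."""
    if text.startswith(XSSI_PREFIX):
        text = text[len(XSSI_PREFIX):]
    # Anything that is not JSON after the prefix raises json.JSONDecodeError.
    return json.loads(text)

# Made-up body shaped like the responses described above.
sample = '])}while(1);</x>{"payload": {"value": {"title": "Software 2.0"}}}'
parsed = strip_xssi(sample)
print(parsed["payload"]["value"]["title"])
```

An exact match leaves an unexpected body untouched, so a non-JSON response surfaces as a `json.JSONDecodeError` instead of being partially trimmed.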
53
- ---
54
-
55
- ## Path 1: `?format=json` — article metadata + body (fastest for articles)
56
-
57
- Append `?format=json` to any article URL. The response contains the full metadata, `virtuals` (metrics), and the complete article body in a structured `bodyModel`. No auth is required: public and subscriber-locked articles both return the metadata and the full body, even though a browser would show a truncated body for a paywalled article.
58
-
59
- ```python
60
- data = medium_json("https://medium.com/@karpathy/software-2-0-a64152b37c35")
61
- payload = data['payload']
62
- val = payload['value'] # article fields
63
- refs = payload['references'] # User, Social, SocialStats dicts keyed by ID
64
-
65
- # --- Article fields ---
66
- title = val['title'] # "Software 2.0"
67
- article_id = val['id'] # "a64152b37c35"
68
- creator_id = val['creatorId'] # "ac9d9a35533e"
69
- slug = val['uniqueSlug'] # "software-2-0-a64152b37c35"
70
- url = val['canonicalUrl'] # "https://medium.com/@karpathy/..."
71
- first_pub = val['firstPublishedAt'] # unix ms: 1510438733751
72
- last_pub = val['latestPublishedAt'] # unix ms: 1615659523264
73
- visibility = val['visibility'] # 0=public, 2=subscriber-locked
74
- is_locked = val['isSubscriptionLocked'] # True if paywalled
75
- locked_src = val['lockedPostSource'] # 0=free, 1=Medium Partner Program
76
-
77
- # --- Metrics (in val['virtuals']) ---
78
- virtuals = val['virtuals']
79
- clap_count = virtuals['totalClapCount'] # 60865 (all claps, including multi-clap)
80
- recommends = virtuals['recommends'] # 8846 (unique clappers)
81
- read_time = virtuals['readingTime'] # 8.79811320754717 (minutes, float)
82
- word_count = virtuals['wordCount'] # 2146
83
-
84
- # --- Tags ---
85
- tags = [t['slug'] for t in virtuals['tags']]
86
- # ['machine-learning', 'artificial-intelligence', 'programming', 'software-development', 'future']
87
-
88
- # --- Author (from references) ---
89
- user = refs['User'][creator_id]
90
- author_name = user['name'] # "Andrej Karpathy"
91
- author_handle = user['username'] # "karpathy"
92
- author_bio = user['bio'] # "I like to train deep neural nets on large datasets."
93
- author_twitter = user['twitterScreenName'] # "karpathy"
94
-
95
- # --- Follower count (from SocialStats) ---
96
- ss = refs['SocialStats'][creator_id]
97
- follower_count = ss['usersFollowedByCount'] # 60027
98
- following_count = ss['usersFollowedCount'] # 183
99
- ```
100
-
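The `firstPublishedAt` and `latestPublishedAt` values are Unix timestamps in milliseconds. A minimal sketch of converting them to timezone-aware datetimes follows; `ms_to_utc` is a hypothetical helper name, and the input is the `firstPublishedAt` value shown in the example above.

```python
from datetime import datetime, timedelta, timezone

def ms_to_utc(ms: int) -> datetime:
    """Convert a Unix timestamp in milliseconds to an aware UTC datetime."""
    # Integer split avoids float rounding on the millisecond part.
    seconds, millis = divmod(ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)

first_pub = ms_to_utc(1510438733751)  # firstPublishedAt from the example above
print(first_pub.isoformat())          # 2017-11-11T22:18:53.751000+00:00
```

The result lines up with the RSS `pubDate` quoted for the same article (`Sat, 11 Nov 2017 22:18:53 GMT`).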
101
- ### Detect paywall
102
-
103
- ```python
104
- # Paywalled (Medium Partner Program): isSubscriptionLocked=True, visibility=2, lockedPostSource=1
105
- # Free: isSubscriptionLocked=False, visibility=0, lockedPostSource=0
106
- is_paywalled = val['isSubscriptionLocked'] # True / False
107
- ```
108
-
109
- Confirmed on real TDS (Towards Data Science) articles: paywalled articles return `isSubscriptionLocked=True`, `visibility=2`, `lockedPostSource=1`; free articles return `isSubscriptionLocked=False`, `visibility=0`, `lockedPostSource=0`.
110
-
111
- ### Article body
112
-
113
- The full body is in `val['content']['bodyModel']['paragraphs']` — a list of dicts:
114
-
115
- ```python
116
- paragraphs = val['content']['bodyModel']['paragraphs']
117
-
118
- # Paragraph types (confirmed for this article):
119
- # type=1 -> body text (P)
120
- # type=3 -> heading (H1/H2)
121
- # type=4 -> image (text is empty; metadata has image ID)
122
-
123
- # Reconstruct plain text:
124
- text_paras = [p['text'] for p in paragraphs if p.get('text')]
125
- full_text = '\n\n'.join(text_paras)
126
- ```
127
-
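The paragraph types listed above can also drive a slightly richer rendering than a plain join. The sketch below is one possible approach, not Medium's own format: `paragraphs_to_text` is a hypothetical helper, the `## ` heading marker is an arbitrary choice, the sample paragraphs are placeholder data, and any type other than 3 or 4 is assumed to be body text.

```python
def paragraphs_to_text(paragraphs):
    """Render bodyModel paragraphs as text, marking headings with '## '."""
    lines = []
    for p in paragraphs:
        text = (p.get("text") or "").strip()
        if not text:
            continue  # e.g. type=4 images, whose text is empty
        if p.get("type") == 3:
            lines.append("## " + text)  # heading (H1/H2)
        else:
            lines.append(text)  # type=1 body text; other types assumed similar
    return "\n\n".join(lines)

# Placeholder paragraphs, not real article content.
sample_paragraphs = [
    {"type": 3, "text": "Section title"},
    {"type": 1, "text": "First paragraph."},
    {"type": 4, "text": ""},
]
print(paragraphs_to_text(sample_paragraphs))
```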
128
- ---
129
-
130
- ## Path 2: GraphQL API — targeted metric lookups
131
-
132
- `POST https://medium.com/_/graphql` with a JSON body. No auth, no CSRF token required.
133
- Returns HTTP 200 with JSON even for unauthenticated queries. Invalid fields return HTTP 400 — do not assume a field exists without testing first.
134
-
135
- ```python
136
- import json, urllib.request, gzip
137
-
138
- def gql(query):
139
-     body = json.dumps({"query": query}).encode()
140
-     req = urllib.request.Request(
141
-         "https://medium.com/_/graphql",
142
-         data=body,
143
-         headers={
144
-             "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
145
-             "Content-Type": "application/json",
146
-             "Accept": "application/json",
147
-             "Accept-Encoding": "gzip",
148
-         },
149
-         method="POST",
150
-     )
151
-     with urllib.request.urlopen(req, timeout=20) as r:
152
-         raw = r.read()
153
-         if r.headers.get("Content-Encoding") == "gzip":
154
-             raw = gzip.decompress(raw)
155
-     return json.loads(raw.decode())
156
- ```
157
-
158
- ### Fetch article metrics (fastest)
159
-
160
- ```python
161
- result = gql("""
162
- {
163
- post(id: "a64152b37c35") {
164
- title
165
- id
166
- firstPublishedAt
167
- latestPublishedAt
168
- visibility
169
- uniqueSlug
170
- canonicalUrl
171
- mediumUrl
172
- isLocked
173
- clapCount
174
- readingTime
175
- wordCount
176
- }
177
- }
178
- """)
179
- post = result['data']['post']
180
- # post['visibility'] -> "PUBLIC" | "LOCKED" (string, not numeric)
181
- # post['isLocked'] -> False | True
182
- # post['clapCount'] -> 60865 (same as totalClapCount in format=json)
183
- # post['readingTime'] -> 8.79811320754717 (minutes)
184
- # post['wordCount'] -> 2146
185
- ```
186
-
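Because the two access paths encode lock status differently (integers in `?format=json`, strings in GraphQL), it can help to normalize both to a boolean. This is a minimal sketch under the field values reported in this document; both function names are hypothetical.

```python
def is_locked_format_json(val: dict) -> bool:
    """Lock status from a ?format=json payload['value'] dict (visibility 2 = locked)."""
    return bool(val.get("isSubscriptionLocked")) or val.get("visibility") == 2

def is_locked_graphql(post: dict) -> bool:
    """Lock status from a GraphQL post object (visibility 'LOCKED' = locked)."""
    return bool(post.get("isLocked")) or post.get("visibility") == "LOCKED"

# Values mirror the free/paywalled combinations described in this document.
print(is_locked_format_json({"isSubscriptionLocked": True, "visibility": 2}))
print(is_locked_graphql({"isLocked": False, "visibility": "PUBLIC"}))
```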
187
- **Confirmed working `post()` fields:** `title`, `id`, `createdAt`, `updatedAt`, `firstPublishedAt`, `latestPublishedAt`, `visibility`, `uniqueSlug`, `canonicalUrl`, `mediumUrl`, `isLocked`, `clapCount`, `readingTime`, `wordCount`
188
-
189
- **Nested objects that work:** `topics { name slug }`, `creator { name username }`, `collection { name id slug description domain creator { name username } }`
190
-
191
- **Fields that return HTTP 400 (not available):** `tags`, `author`, `recommends`, `content`, `publication`, `responses`, `sequence`
192
-
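Since an invalid field fails the whole request with HTTP 400, one option is to check field names locally before sending. The sketch below builds a `post()` query from the flat scalar fields listed above; `build_post_query` and the two set names are hypothetical, and nested selections such as `topics { name slug }` are deliberately out of scope.

```python
import json

# Field lists copied from the confirmed/rejected lists above.
WORKING_POST_FIELDS = {
    "title", "id", "createdAt", "updatedAt", "firstPublishedAt",
    "latestPublishedAt", "visibility", "uniqueSlug", "canonicalUrl",
    "mediumUrl", "isLocked", "clapCount", "readingTime", "wordCount",
}
REJECTED_POST_FIELDS = {
    "tags", "author", "recommends", "content", "publication",
    "responses", "sequence",
}

def build_post_query(post_id, fields):
    """Return a GraphQL query string for post(id:) using only tested fields."""
    rejected = sorted(set(fields) & REJECTED_POST_FIELDS)
    if rejected:
        raise ValueError(f"fields known to return HTTP 400: {rejected}")
    untested = sorted(set(fields) - WORKING_POST_FIELDS)
    if untested:
        raise ValueError(f"untested fields: {untested}")
    # json.dumps yields a double-quoted, escaped string literal for the ID.
    return "{ post(id: %s) { %s } }" % (json.dumps(post_id), " ".join(fields))

query = build_post_query("a64152b37c35", ["title", "clapCount"])
print(query)
```

The returned string can be passed to the `gql()` helper defined earlier in this section.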
193
- ### Fetch author profile
194
-
195
- ```python
196
- result = gql("""
197
- {
198
- user(username: "karpathy") {
199
- name
200
- username
201
- id
202
- bio
203
- imageId
204
- twitterScreenName
205
- mediumMemberAt
206
- socialStats {
207
- followerCount
208
- followingCount
209
- }
210
- }
211
- }
212
- """)
213
- user = result['data']['user']
214
- # user['name'] -> "Andrej Karpathy"
215
- # user['id'] -> "ac9d9a35533e"
216
- # user['bio'] -> "I like to train deep neural nets on large datasets."
217
- # user['twitterScreenName'] -> "karpathy"
218
- # user['socialStats']['followerCount'] -> 60028
219
- # user['mediumMemberAt'] -> 0 (not a member); nonzero = unix ms join date
220
- ```
221
-
222
- **Confirmed working `user()` fields:** `name`, `username`, `id`, `bio`, `imageId`, `twitterScreenName`, `mediumMemberAt`, `socialStats { followerCount followingCount }`
223
-
224
- **Fields that return HTTP 400:** `followerCount` (top-level), `followingCount` (top-level), `postCount`
225
-
226
- ### Fetch collection (publication) by ID
227
-
228
- The GraphQL `collection()` query only accepts `id`, not `slug`. Get the ID from `?format=json` on the publication page.
229
-
230
- ```python
231
- # TDS Archive id: 7f60cf5620c9 (from medium.com/towards-data-science?format=json)
232
- result = gql("""
233
- {
234
- collection(id: "7f60cf5620c9") {
235
- name
236
- id
237
- slug
238
- description
239
- domain
240
- creator { name username }
241
- }
242
- }
243
- """)
244
- coll = result['data']['collection']
245
- # coll['name'] -> "TDS Archive"
246
- # coll['slug'] -> "data-science"
247
- ```
248
-
249
- ---
250
-
251
- ## Path 3: RSS feeds (best for recent posts list + article bodies)
252
-
253
- Works with plain `http_get`. Returns up to 10 most recent posts. Full article HTML is in `content:encoded`. No clap count or visibility info in RSS.
254
-
255
- ```python
256
- import re
257
- from helpers import http_get
258
-
259
- def parse_rss_items(rss_xml):
260
-     """Extract items from Medium RSS feed. Returns list of dicts."""
261
-     def cdata(tag, text):
262
-         m = re.search(rf'<{tag}[^>]*><!\[CDATA\[(.*?)\]\]></{tag}>', text, re.DOTALL)
263
-         return m.group(1).strip() if m else None
264
- 
265
-     items = []
266
-     for raw in re.findall(r'<item>(.*?)</item>', rss_xml, re.DOTALL):
267
-         # link is plain text (not CDATA)
268
-         link_m = re.search(r'<link>(.*?)</link>', raw, re.DOTALL)
269
-         items.append({
270
-             'title': cdata('title', raw),
271
-             'link': link_m.group(1).strip() if link_m else None,
272
-             'pubDate': cdata('pubDate', raw),
273
-             'creator': cdata('dc:creator', raw),
274
-             'tags': re.findall(r'<category><!\[CDATA\[(.*?)\]\]></category>', raw),
275
-             'body_html': cdata('content:encoded', raw),  # full article HTML
276
-         })
277
-     return items
278
-
279
- # User feed (up to 10 latest posts)
280
- rss = http_get("https://medium.com/feed/@karpathy")
281
- posts = parse_rss_items(rss)
282
- # posts[0]['title'] -> "Software 2.0"
283
- # posts[0]['pubDate'] -> "Sat, 11 Nov 2017 22:18:53 GMT"
284
- # posts[0]['creator'] -> "Andrej Karpathy"
285
- # posts[0]['tags'] -> ['programming', 'software-development', 'artificial-intelligence', 'future', 'machine-learning']
286
- # posts[0]['link'] -> "https://karpathy.medium.com/software-2-0-a64152b37c35?source=rss-..."
287
- # posts[0]['body_html'] -> full article body as HTML string (~15KB for this article)
288
-
289
- # Publication feed (up to 10 latest posts)
290
- rss_pub = http_get("https://medium.com/feed/towards-data-science")
291
- pub_posts = parse_rss_items(rss_pub)
292
- ```
293
-
294
- **RSS limitations:**
295
- - RSS does not include clap count, view count, or paywall status.
296
- - `body_html` contains the full article body as HTML, including `<p>`, `<strong>`, `<a>`, `<img>` tags.
297
- - Pagination is not supported — RSS always returns the 10 most recent posts.
298
-
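Items returned by `parse_rss_items` carry the date as an RFC 2822 string and the link with a `?source=rss-...` query string. A small post-processing sketch follows; `clean_rss_item` is a hypothetical helper, and the `source` value in the sample link is a placeholder, not a real tracking ID.

```python
from email.utils import parsedate_to_datetime
from urllib.parse import urlsplit, urlunsplit

def clean_rss_item(item):
    """Return a copy of an RSS item with a parsed date and a query-free link."""
    parts = urlsplit(item["link"])
    clean_link = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    published = parsedate_to_datetime(item["pubDate"])  # aware datetime for GMT dates
    return {**item, "link": clean_link, "published": published}

sample_item = {
    "title": "Software 2.0",
    "link": "https://karpathy.medium.com/software-2-0-a64152b37c35?source=rss-placeholder",
    "pubDate": "Sat, 11 Nov 2017 22:18:53 GMT",
}
cleaned = clean_rss_item(sample_item)
print(cleaned["link"], cleaned["published"].isoformat())
```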
299
- ---
300
-
301
- ## Path 4: `?format=json` on user profile — recent posts with metrics
302
-
303
- Better than RSS when you need clap counts alongside post list. Returns up to `limit` posts (default 10) plus full author metadata.
304
-
305
- ```python
306
- data = medium_json("https://medium.com/@karpathy?limit=10")
307
- payload = data['payload']
308
-
309
- user = payload['user']
310
- # user['name'] -> "Andrej Karpathy"
311
- # user['username'] -> "karpathy"
312
- # user['bio'] -> "I like to train deep neural nets on large datasets."
313
-
314
- refs = payload['references']
315
- ss = refs['SocialStats'][user['userId']]
316
- # ss['usersFollowedByCount'] -> 60028 (followers)
317
- # ss['usersFollowedCount'] -> 183 (following)
318
-
319
- posts = refs.get('Post', {}) # dict keyed by post ID
320
- for pid, p in posts.items():
321
-     v = p['virtuals']
322
-     print(p['title'], v['totalClapCount'], round(v['readingTime'], 1))
323
-
324
- # Paginate: use paging['next'] from payload
325
- paging = payload['paging']
326
- next_params = paging['next']
327
- # next_params = {'limit': 10, 'to': '1495652975362', 'source': 'overview', 'page': 2, 'ignoredIds': []}
328
- # Append as query params to the same profile URL to get next page
329
- next_url = (
330
- f"https://medium.com/@{user['username']}"
331
- f"?limit={next_params['limit']}&to={next_params['to']}"
332
- f"&source={next_params['source']}&page={next_params['page']}"
333
- )
334
- data2 = medium_json(next_url)
335
- # Note: karpathy has only 8 total posts — pagination returns same refs on page 2
336
- ```
337
-
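The next-page URL can also be assembled with `urllib.parse.urlencode` instead of a hand-written f-string. The sketch below is one way to do it: `next_page_url` is a hypothetical helper, and dropping list-valued entries such as `ignoredIds` is an assumption that mirrors the f-string version above rather than documented Medium behavior.

```python
from urllib.parse import urlencode

def next_page_url(profile_url, next_params):
    """Build the next-page URL from payload['paging']['next']."""
    # Keep scalar values only; list/dict entries (e.g. ignoredIds) are skipped.
    query = {k: v for k, v in next_params.items() if not isinstance(v, (list, dict))}
    return profile_url + "?" + urlencode(query)

# paging['next'] as shown in the example above.
next_params = {"limit": 10, "to": "1495652975362", "source": "overview",
               "page": 2, "ignoredIds": []}
url = next_page_url("https://medium.com/@karpathy", next_params)
print(url)
```

The resulting URL already contains a `?`, so `medium_json` appends `&format=json` to it.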
338
- ---
339
-
340
- ## Path 5: `?format=json` on publication page
341
-
342
- Returns publication metadata and recent posts with metrics.
343
-
344
- ```python
345
- data = medium_json("https://medium.com/towards-data-science")
346
- payload = data['payload']
347
-
348
- coll = payload['collection']
349
- # coll['name'] -> "TDS Archive"
350
- # coll['slug'] -> "data-science"
351
- # coll['description'] -> full description string
352
- # coll['subscriberCount'] -> 828527
353
- # coll['metadata']['followerCount'] -> 828527
354
- # coll['tags'] -> ['DATA SCIENCE', 'MACHINE LEARNING', ...]
355
-
356
- posts = payload['references'].get('Post', {})
357
- for pid, p in posts.items():
358
-     v = p['virtuals']
359
-     print(p['title'], v['totalClapCount'], p['isSubscriptionLocked'])
360
- # Also includes: p['visibility'] (0=free, 2=paywalled)
361
-
362
- # Paginate (same pattern as user profile)
363
- paging = payload['paging']
364
- # paging['next'] = {'to': '1738573325936', 'page': 3}
365
- ```
366
-
367
- ---
368
-
369
- ## Retrieving the article ID from a URL
370
-
371
- The `id` is the last 12 hex chars of a Medium article URL slug:
372
-
373
- ```python
374
- import re
375
-
376
- url = "https://medium.com/@karpathy/software-2-0-a64152b37c35"
377
- article_id = re.search(r'-([a-f0-9]{12})$', url.rstrip('/').split('?')[0])
378
- if article_id:
379
-     article_id = article_id.group(1)  # "a64152b37c35"
380
- ```
381
-
382
- This ID is the same across all URL forms (`medium.com/@user/slug`, `user.medium.com/slug`, `medium.com/publication/slug`).
383
-
384
- ---
385
-
386
- ## Gotchas
387
-
388
- - **HTTP 403 on plain `http_get`** — The default `http_get` helper sends `User-Agent: Mozilla/5.0` which Medium accepts for most endpoints, but article HTML pages (without `?format=json`) return 403. Always use `?format=json` for article and profile pages.
389
-
390
- - **`?format=json` works; profile stream API does not** — `https://medium.com/_/api/users/{id}/profile/stream` returns HTTP 403 for unauthenticated requests. Use `?format=json` on the profile URL instead.
391
-
392
- - **`?format=json` on search pages returns 403 or broken JSON** — `medium.com/search?q=...&format=json` and `medium.com/search/posts?q=...&format=json` both fail. Search is not available without auth.
393
-
394
- - **GraphQL `collection()` requires ID, not slug** — `collection(slug: "towards-data-science")` returns HTTP 400. You must use the numeric ID (e.g. `"7f60cf5620c9"`). Get it from `?format=json` on the publication page: `payload['collection']['id']`.
395
-
396
- - **GraphQL `tags` field on `post()` returns HTTP 400** — Use `topics { name slug }` instead. Topics are a subset of tags but work without auth.
397
-
398
- - **GraphQL visibility is a string, not a number** — `post().visibility` returns `"PUBLIC"` or `"LOCKED"` (string). The `?format=json` `value.visibility` field uses integers: `0`=public, `2`=locked. Both agree on the lock status.
399
-
400
- - **`totalClapCount` vs `recommends`** — `totalClapCount` (60865) counts all claps (Medium allows up to 50 claps per reader). `recommends` (8846) counts unique clappers. The GraphQL `clapCount` field equals `totalClapCount`, not `recommends`.
401
-
402
- - **RSS returns at most 10 items, no clap counts** — RSS is best for getting recent article links + full HTML body. Use `?format=json` profile if you need metrics.
403
-
404
- - **RSS link contains tracking params** — `posts[0]['link']` includes `?source=rss-{userId}------2`. Strip with `.split('?')[0]` if you need a clean URL.
405
-
406
- - **`content:encoded` in RSS is full HTML, not plaintext** — Strip HTML tags if you want plaintext: `re.sub(r'<[^>]+>', '', body_html)`.
407
-
408
- - **Medium subdomains** — Some users have custom subdomains (`karpathy.medium.com`). Both `medium.com/@karpathy/...` and `karpathy.medium.com/...` resolve to the same article; `?format=json` works on both.
409
-
410
- - **towardsdatascience.com is no longer Medium** — TDS moved to its own WordPress site. `towardsdatascience.com/article-slug?format=json` returns full WordPress HTML, not Medium JSON. Use `medium.com/towards-data-science` for the archived Medium publication.
411
-
412
- - **No public search API** — Medium has no Algolia equivalent. Finding articles by keyword requires either a browser, or fetching a user/publication feed and filtering locally.
413
-
414
- - **Timestamps are unix milliseconds** — `firstPublishedAt`, `createdAt`, `latestPublishedAt` are all in milliseconds. Convert: `datetime.fromtimestamp(val['firstPublishedAt'] / 1000, tz=timezone.utc)`.
1
+ # Medium — Data Extraction
2
+
3
+ `https://medium.com` — blogging platform. Three access paths tested and validated: the undocumented `?format=json` endpoint (fastest for article + publication data), the undocumented GraphQL API (best for targeted metric lookups), and RSS feeds (best for recent posts lists without auth). No browser needed for any read-only task.
4
+
5
+ ## Do this first: pick your access path
6
+
7
+ | Goal | Best approach | Latency |
8
+ |------|--------------|---------|
9
+ | Article metadata + full body | `?format=json` on article URL | ~400ms |
10
+ | Article metrics only (claps, visibility) | GraphQL `post(id:)` | ~275ms |
11
+ | Author profile + follower count | GraphQL `user(username:)` | ~220ms |
12
+ | Recent posts for a user (up to 10) | `?format=json` on profile URL | ~240ms |
13
+ | Recent posts for a publication | `?format=json` on publication URL | ~300ms |
14
+ | Latest posts feed (max 10 items, no pagination) | RSS feed | ~260ms |
15
+ | Full article body as HTML | RSS `content:encoded` field | ~260ms |
16
+ | Publication subscriber count | `?format=json` on publication URL | ~300ms |
17
+
18
+ **Never use a browser for read-only Medium tasks.** All article content, metadata, and metrics are available over HTTP. A browser is only needed for authenticated actions (clapping, posting, account management).
19
+
20
+ ---
21
+
22
+ ## The XSSI prefix
23
+
24
+ Every `?format=json` response starts with the anti-hijacking prefix `])}while(1);</x>` before the JSON. **Strip it before parsing.** The helper below handles this.
25
+
26
+ ```python
27
+ import urllib.request, gzip, json, re
28
+
29
+ def medium_json(url):
30
+     """Fetch any Medium URL with ?format=json and return parsed dict.
31
+     Strips the XSSI prefix ])}while(1);</x> automatically.
32
+     Works on: article URLs, user profile URLs, publication URLs.
33
+     Does NOT work on: search pages, /latest, profile stream API.
34
+     """
35
+     sep = '&' if '?' in url else '?'
36
+     req = urllib.request.Request(
37
+         url + sep + 'format=json',
38
+         headers={
39
+             "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
40
+             "Accept": "application/json, */*",
41
+             "Accept-Encoding": "gzip",
42
+         }
43
+     )
44
+     with urllib.request.urlopen(req, timeout=20) as r:
45
+         raw = r.read()
46
+         if r.headers.get("Content-Encoding") == "gzip":
47
+             raw = gzip.decompress(raw)
48
+     text = raw.decode()
49
+     # Strip everything before the first {
50
+     return json.loads(re.sub(r'^[^\{]+', '', text))
51
+ ```
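
The prefix handling can also be checked offline, without touching the network. The sketch below is an illustration only: `strip_xssi` and the sample string are hypothetical names and data made up for this example, and it assumes the prefix is exactly the `])}while(1);</x>` string quoted above.

```python
import json

# Prefix quoted in the section above; assumed to be exact.
XSSI_PREFIX = "])}while(1);</x>"

def strip_xssi(text):
    """Parse a ?format=json response body, dropping the XSSI prefix if present."""
    if text.startswith(XSSI_PREFIX):
        text = text[len(XSSI_PREFIX):]
    return json.loads(text)

# Made-up sample body shaped like the responses described in this document.
sample = '])}while(1);</x>{"payload": {"value": {"id": "a64152b37c35"}}}'
parsed = strip_xssi(sample)
print(parsed["payload"]["value"]["id"])
```

Matching the literal prefix is stricter than stripping everything before the first `{`, so a changed prefix would surface as a JSON parse error rather than being silently skipped.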
52
+
53
+ ---
54
+
55
+ ## Path 1: `?format=json` — article metadata + body (fastest for articles)
56
+
57
+ Append `?format=json` to any article URL. The response contains full metadata, virtuals (metrics), and the complete article body in a structured `bodyModel`. No auth is required for public or subscriber-locked articles: metadata and the full body are returned in both cases, even though a browser would show a truncated body for a paywalled article.
58
+
59
+ ```python
60
+ data = medium_json("https://medium.com/@karpathy/software-2-0-a64152b37c35")
61
+ payload = data['payload']
62
+ val = payload['value'] # article fields
63
+ refs = payload['references'] # User, Social, SocialStats dicts keyed by ID
64
+
65
+ # --- Article fields ---
66
+ title = val['title'] # "Software 2.0"
67
+ article_id = val['id'] # "a64152b37c35"
68
+ creator_id = val['creatorId'] # "ac9d9a35533e"
69
+ slug = val['uniqueSlug'] # "software-2-0-a64152b37c35"
70
+ url = val['canonicalUrl'] # "https://medium.com/@karpathy/..."
71
+ first_pub = val['firstPublishedAt'] # unix ms: 1510438733751
72
+ last_pub = val['latestPublishedAt'] # unix ms: 1615659523264
73
+ visibility = val['visibility'] # 0=public, 2=subscriber-locked
74
+ is_locked = val['isSubscriptionLocked'] # True if paywalled
75
+ locked_src = val['lockedPostSource'] # 0=free, 1=Medium Partner Program
76
+
77
+ # --- Metrics (in val['virtuals']) ---
78
+ virtuals = val['virtuals']
79
+ clap_count = virtuals['totalClapCount'] # 60865 (all claps, including multi-clap)
80
+ recommends = virtuals['recommends'] # 8846 (unique clappers)
81
+ read_time = virtuals['readingTime'] # 8.79811320754717 (minutes, float)
82
+ word_count = virtuals['wordCount'] # 2146
83
+
84
+ # --- Tags ---
85
+ tags = [t['slug'] for t in virtuals['tags']]
86
+ # ['machine-learning', 'artificial-intelligence', 'programming', 'software-development', 'future']
87
+
88
+ # --- Author (from references) ---
89
+ user = refs['User'][creator_id]
90
+ author_name = user['name'] # "Andrej Karpathy"
91
+ author_handle = user['username'] # "karpathy"
92
+ author_bio = user['bio'] # "I like to train deep neural nets on large datasets."
93
+ author_twitter = user['twitterScreenName'] # "karpathy"
94
+
95
+ # --- Follower count (from SocialStats) ---
96
+ ss = refs['SocialStats'][creator_id]
97
+ follower_count = ss['usersFollowedByCount'] # 60027
98
+ following_count = ss['usersFollowedCount'] # 183
99
+ ```
100
+
101
+ ### Detect paywall
102
+
103
+ ```python
104
+ # Paywalled (Medium Partner Program): isSubscriptionLocked=True, visibility=2, lockedPostSource=1
105
+ # Free: isSubscriptionLocked=False, visibility=0, lockedPostSource=0
106
+ is_paywalled = val['isSubscriptionLocked'] # True / False
107
+ ```
108
+
109
+ Confirmed on real TDS articles: paywalled articles return `isSubscriptionLocked=True`, `visibility=2`, `lockedPostSource=1`. Free articles: all three are `False`/`0`.
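
Since the three fields are reported to move together, a small cross-check can flag responses where they disagree. This is a sketch under that assumption; `paywall_status` is a hypothetical helper, and the two sample dicts simply restate the field values listed above.

```python
def paywall_status(val):
    """Return the paywall flag plus whether the three related fields agree."""
    locked = bool(val.get("isSubscriptionLocked"))
    # Per the observations above: locked posts have visibility=2 and
    # lockedPostSource=1, free posts have 0 for both.
    consistent = (
        (val.get("visibility") == 2) == locked
        and (val.get("lockedPostSource") == 1) == locked
    )
    return {"paywalled": locked, "consistent": consistent}

paywalled_val = {"isSubscriptionLocked": True, "visibility": 2, "lockedPostSource": 1}
free_val = {"isSubscriptionLocked": False, "visibility": 0, "lockedPostSource": 0}
print(paywall_status(paywalled_val))
print(paywall_status(free_val))
```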
110
+
111
+ ### Article body
112
+
113
+ The full body is in `val['content']['bodyModel']['paragraphs']` — a list of dicts:
114
+
115
+ ```python
116
+ paragraphs = val['content']['bodyModel']['paragraphs']
117
+
118
+ # Paragraph types (confirmed for this article):
119
+ # type=1 -> body text (P)
120
+ # type=3 -> heading (H1/H2)
121
+ # type=4 -> image (text is empty; metadata has image ID)
122
+
123
+ # Reconstruct plain text:
124
+ text_paras = [p['text'] for p in paragraphs if p.get('text')]
125
+ full_text = '\n\n'.join(text_paras)
126
+ ```
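
If headings should survive the conversion, the paragraph types listed above can drive a rough Markdown rendering. The sketch below assumes only the three types confirmed for this article (1, 3, 4); `paragraphs_to_markdown` and the sample list are hypothetical, and any other type is passed through as plain text.

```python
def paragraphs_to_markdown(paragraphs):
    """Render bodyModel paragraphs as Markdown-ish text (types 1, 3, 4 only)."""
    lines = []
    for p in paragraphs:
        text = p.get("text", "")
        ptype = p.get("type")
        if ptype == 4:
            continue  # image paragraph: text is empty, so skip it
        if not text:
            continue
        if ptype == 3:
            lines.append("## " + text)  # heading
        else:
            lines.append(text)  # body text, or an unconfirmed type
    return "\n\n".join(lines)

# Made-up sample shaped like val['content']['bodyModel']['paragraphs'].
sample_paragraphs = [
    {"type": 3, "text": "Software 2.0"},
    {"type": 1, "text": "First paragraph."},
    {"type": 4, "text": ""},
    {"type": 1, "text": "Second paragraph."},
]
print(paragraphs_to_markdown(sample_paragraphs))
```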
127
+
128
+ ---
129
+
130
+ ## Path 2: GraphQL API — targeted metric lookups
131
+
132
+ `POST https://medium.com/_/graphql` with a JSON body. No auth, no CSRF token required.
133
+ Returns HTTP 200 with JSON even for unauthenticated queries. Invalid fields return HTTP 400 — do not assume a field exists without testing first.
134
+
135
+ ```python
136
+ import json, urllib.request, gzip
137
+
138
+ def gql(query):
139
+     body = json.dumps({"query": query}).encode()
140
+     req = urllib.request.Request(
141
+         "https://medium.com/_/graphql",
142
+         data=body,
143
+         headers={
144
+             "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
145
+             "Content-Type": "application/json",
146
+             "Accept": "application/json",
147
+             "Accept-Encoding": "gzip",
148
+         },
149
+         method="POST",
150
+     )
151
+     with urllib.request.urlopen(req, timeout=20) as r:
152
+         raw = r.read()
153
+         if r.headers.get("Content-Encoding") == "gzip":
154
+             raw = gzip.decompress(raw)
155
+     return json.loads(raw.decode())
156
+ ```
157
+
158
+ ### Fetch article metrics (fastest)
159
+
160
+ ```python
161
+ result = gql("""
162
+ {
163
+ post(id: "a64152b37c35") {
164
+ title
165
+ id
166
+ firstPublishedAt
167
+ latestPublishedAt
168
+ visibility
169
+ uniqueSlug
170
+ canonicalUrl
171
+ mediumUrl
172
+ isLocked
173
+ clapCount
174
+ readingTime
175
+ wordCount
176
+ }
177
+ }
178
+ """)
179
+ post = result['data']['post']
180
+ # post['visibility'] -> "PUBLIC" | "LOCKED" (string, not numeric)
181
+ # post['isLocked'] -> False | True
182
+ # post['clapCount'] -> 60865 (same as totalClapCount in format=json)
183
+ # post['readingTime'] -> 8.79811320754717 (minutes)
184
+ # post['wordCount'] -> 2146
185
+ ```
186
+
187
+ **Confirmed working `post()` fields:** `title`, `id`, `createdAt`, `updatedAt`, `firstPublishedAt`, `latestPublishedAt`, `visibility`, `uniqueSlug`, `canonicalUrl`, `mediumUrl`, `isLocked`, `clapCount`, `readingTime`, `wordCount`
188
+
189
+ **Nested objects that work:** `topics { name slug }`, `creator { name username }`, `collection { name id slug description domain creator { name username } }`
190
+
191
+ **Fields that return HTTP 400 (not available):** `tags`, `author`, `recommends`, `content`, `publication`, `responses`, `sequence`
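
Because a single unknown field fails the whole request with HTTP 400, it may help to validate field names locally before sending anything. The following is a minimal sketch, not part of the original helpers: `build_post_query` and `CONFIRMED_POST_FIELDS` are hypothetical names, and the allow-list only mirrors the scalar fields reported as working above.

```python
import json
import re

# Scalar post() fields listed above as confirmed working.
CONFIRMED_POST_FIELDS = {
    "title", "id", "createdAt", "updatedAt", "firstPublishedAt",
    "latestPublishedAt", "visibility", "uniqueSlug", "canonicalUrl",
    "mediumUrl", "isLocked", "clapCount", "readingTime", "wordCount",
}

def build_post_query(post_id, fields):
    """Build a post(id:) query string, rejecting fields not on the allow-list."""
    if not re.fullmatch(r"[a-f0-9]{12}", post_id):
        raise ValueError(f"not a 12-char hex post id: {post_id!r}")
    unknown = [f for f in fields if f not in CONFIRMED_POST_FIELDS]
    if unknown:
        raise ValueError(f"unconfirmed post() fields: {unknown}")
    # json.dumps gives a correctly quoted string literal for the id.
    return "{ post(id: %s) { %s } }" % (json.dumps(post_id), " ".join(fields))

query = build_post_query("a64152b37c35", ["title", "clapCount"])
print(query)
```

The resulting string can be passed to the `gql()` helper defined earlier; nested selections such as `topics { name slug }` are outside what this sketch validates.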
192
+
193
+ ### Fetch author profile
194
+
195
+ ```python
196
+ result = gql("""
197
+ {
198
+ user(username: "karpathy") {
199
+ name
200
+ username
201
+ id
202
+ bio
203
+ imageId
204
+ twitterScreenName
205
+ mediumMemberAt
206
+ socialStats {
207
+ followerCount
208
+ followingCount
209
+ }
210
+ }
211
+ }
212
+ """)
213
+ user = result['data']['user']
214
+ # user['name'] -> "Andrej Karpathy"
215
+ # user['id'] -> "ac9d9a35533e"
216
+ # user['bio'] -> "I like to train deep neural nets on large datasets."
217
+ # user['twitterScreenName'] -> "karpathy"
218
+ # user['socialStats']['followerCount'] -> 60028
219
+ # user['mediumMemberAt'] -> 0 (not a member); nonzero = unix ms join date
220
+ ```
221
+
222
+ **Confirmed working `user()` fields:** `name`, `username`, `id`, `bio`, `imageId`, `twitterScreenName`, `mediumMemberAt`, `socialStats { followerCount followingCount }`
223
+
224
+ **Fields that return HTTP 400:** `followerCount` (top-level), `followingCount` (top-level), `postCount`
225
+
226
+ ### Fetch collection (publication) by ID
227
+
228
+ The GraphQL `collection()` query only accepts `id`, not `slug`. Get the ID from `?format=json` on the publication page.
229
+
230
+ ```python
231
+ # TDS Archive id: 7f60cf5620c9 (from medium.com/towards-data-science?format=json)
232
+ result = gql("""
233
+ {
234
+ collection(id: "7f60cf5620c9") {
235
+ name
236
+ id
237
+ slug
238
+ description
239
+ domain
240
+ creator { name username }
241
+ }
242
+ }
243
+ """)
244
+ coll = result['data']['collection']
245
+ # coll['name'] -> "TDS Archive"
246
+ # coll['slug'] -> "data-science"
247
+ ```
248
+
249
+ ---
250
+
251
+ ## Path 3: RSS feeds (best for recent posts list + article bodies)
252
+
253
+ Works with plain `http_get`. Returns up to 10 most recent posts. Full article HTML is in `content:encoded`. No clap count or visibility info in RSS.
254
+
255
+ ```python
256
+ import re
257
+ from helpers import http_get
258
+
259
+ def parse_rss_items(rss_xml):
260
+     """Extract items from Medium RSS feed. Returns list of dicts."""
261
+     def cdata(tag, text):
262
+         m = re.search(rf'<{tag}[^>]*><!\[CDATA\[(.*?)\]\]></{tag}>', text, re.DOTALL)
263
+         return m.group(1).strip() if m else None
264
+
265
+     items = []
266
+     for raw in re.findall(r'<item>(.*?)</item>', rss_xml, re.DOTALL):
267
+         # link is plain text (not CDATA)
268
+         link_m = re.search(r'<link>(.*?)</link>', raw, re.DOTALL)
269
+         items.append({
270
+             'title': cdata('title', raw),
271
+             'link': link_m.group(1).strip() if link_m else None,
272
+             'pubDate': cdata('pubDate', raw),
273
+             'creator': cdata('dc:creator', raw),
274
+             'tags': re.findall(r'<category><!\[CDATA\[(.*?)\]\]></category>', raw),
275
+             'body_html': cdata('content:encoded', raw), # full article HTML
276
+         })
277
+     return items
278
+
279
+ # User feed (up to 10 latest posts)
280
+ rss = http_get("https://medium.com/feed/@karpathy")
281
+ posts = parse_rss_items(rss)
282
+ # posts[0]['title'] -> "Software 2.0"
283
+ # posts[0]['pubDate'] -> "Sat, 11 Nov 2017 22:18:53 GMT"
284
+ # posts[0]['creator'] -> "Andrej Karpathy"
285
+ # posts[0]['tags'] -> ['programming', 'software-development', 'artificial-intelligence', 'future', 'machine-learning']
286
+ # posts[0]['link'] -> "https://karpathy.medium.com/software-2-0-a64152b37c35?source=rss-..."
287
+ # posts[0]['body_html'] -> full article body as HTML string (~15KB for this article)
288
+
289
+ # Publication feed (up to 10 latest posts)
290
+ rss_pub = http_get("https://medium.com/feed/towards-data-science")
291
+ pub_posts = parse_rss_items(rss_pub)
292
+ ```
293
+
294
+ **RSS limitations:**
295
+ - RSS does not include clap count, view count, or paywall status.
296
+ - `body_html` contains the full article body as HTML, including `<p>`, `<strong>`, `<a>`, `<img>` tags.
297
+ - Pagination is not supported — RSS always returns the 10 most recent posts.
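
A parsed RSS item usually needs a little cleanup before use: a link without tracking parameters, a real datetime, and plain text. The sketch below shows one way to do that with the standard library. `clean_rss_item` and the sample item are hypothetical, the sample `source` value is made up, and the regex tag stripping is a rough approximation rather than a full HTML parser.

```python
import html
import re
from email.utils import parsedate_to_datetime

def clean_rss_item(item):
    """Normalize one dict shaped like the output of parse_rss_items()."""
    body = item.get("body_html") or ""
    # Crude tag removal, then decode entities such as &amp;
    text = html.unescape(re.sub(r"<[^>]+>", "", body))
    return {
        "title": item["title"],
        "url": item["link"].split("?")[0],  # drop ?source=rss-... tracking
        "published": parsedate_to_datetime(item["pubDate"]),
        "text": text,
    }

# Made-up item; title and pubDate reuse the values shown above.
sample_item = {
    "title": "Software 2.0",
    "link": "https://karpathy.medium.com/software-2-0-a64152b37c35?source=rss-abc------2",
    "pubDate": "Sat, 11 Nov 2017 22:18:53 GMT",
    "body_html": "<p>Hello &amp; <strong>world</strong></p>",
}
cleaned = clean_rss_item(sample_item)
print(cleaned["url"], cleaned["published"].isoformat(), cleaned["text"])
```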
298
+
299
+ ---
300
+
301
+ ## Path 4: `?format=json` on user profile — recent posts with metrics
302
+
303
+ Better than RSS when you need clap counts alongside the post list. Returns up to `limit` posts (default 10) plus full author metadata.
304
+
305
+ ```python
306
+ data = medium_json("https://medium.com/@karpathy?limit=10")
307
+ payload = data['payload']
308
+
309
+ user = payload['user']
310
+ # user['name'] -> "Andrej Karpathy"
311
+ # user['username'] -> "karpathy"
312
+ # user['bio'] -> "I like to train deep neural nets on large datasets."
313
+
314
+ refs = payload['references']
315
+ ss = refs['SocialStats'][user['userId']]
316
+ # ss['usersFollowedByCount'] -> 60028 (followers)
317
+ # ss['usersFollowedCount'] -> 183 (following)
318
+
319
+ posts = refs.get('Post', {}) # dict keyed by post ID
320
+ for pid, p in posts.items():
321
+     v = p['virtuals']
322
+     print(p['title'], v['totalClapCount'], round(v['readingTime'], 1))
323
+
324
+ # Paginate: use paging['next'] from payload
325
+ paging = payload['paging']
326
+ next_params = paging['next']
327
+ # next_params = {'limit': 10, 'to': '1495652975362', 'source': 'overview', 'page': 2, 'ignoredIds': []}
328
+ # Append as query params to the same profile URL to get next page
329
+ next_url = (
330
+     f"https://medium.com/@{user['username']}"
331
+     f"?limit={next_params['limit']}&to={next_params['to']}"
332
+     f"&source={next_params['source']}&page={next_params['page']}"
333
+ )
334
+ data2 = medium_json(next_url)
335
+ # Note: karpathy has only 8 total posts — pagination returns same refs on page 2
336
+ ```
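
The next-page URL can also be built with `urllib.parse.urlencode` rather than by hand, which handles escaping and tolerates missing keys. This is a sketch only: `next_page_url` is a hypothetical helper, and dropping empty values such as `ignoredIds: []` is an assumption based on the hand-built URL above, which omits that key.

```python
from urllib.parse import urlencode

def next_page_url(profile_url, next_params):
    """Append paging['next'] to a profile URL as query parameters."""
    # Assumption: empty values (e.g. ignoredIds=[]) can be left out.
    params = {k: v for k, v in next_params.items() if v not in (None, "", [])}
    return profile_url + "?" + urlencode(params, doseq=True)

# Sample values copied from the comment in the block above.
sample_next = {
    "limit": 10,
    "to": "1495652975362",
    "source": "overview",
    "page": 2,
    "ignoredIds": [],
}
print(next_page_url("https://medium.com/@karpathy", sample_next))
```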
337
+
338
+ ---
339
+
340
+ ## Path 5: `?format=json` on publication page
341
+
342
+ Returns publication metadata and recent posts with metrics.
343
+
344
+ ```python
345
+ data = medium_json("https://medium.com/towards-data-science")
346
+ payload = data['payload']
347
+
348
+ coll = payload['collection']
349
+ # coll['name'] -> "TDS Archive"
350
+ # coll['slug'] -> "data-science"
351
+ # coll['description'] -> full description string
352
+ # coll['subscriberCount'] -> 828527
353
+ # coll['metadata']['followerCount'] -> 828527
354
+ # coll['tags'] -> ['DATA SCIENCE', 'MACHINE LEARNING', ...]
355
+
356
+ posts = payload['references'].get('Post', {})
357
+ for pid, p in posts.items():
358
+     v = p['virtuals']
359
+     print(p['title'], v['totalClapCount'], p['isSubscriptionLocked'])
360
+     # Also includes: p['visibility'] (0=free, 2=paywalled)
361
+
362
+ # Paginate (same pattern as user profile)
363
+ paging = payload['paging']
364
+ # paging['next'] = {'to': '1738573325936', 'page': 3}
365
+ ```
366
+
367
+ ---
368
+
369
+ ## Retrieving the article ID from a URL
370
+
371
+ The `id` is the last 12 hex chars of a Medium article URL slug:
372
+
373
+ ```python
374
+ import re
375
+
376
+ url = "https://medium.com/@karpathy/software-2-0-a64152b37c35"
377
+ article_id = re.search(r'-([a-f0-9]{12})$', url.split('?')[0].rstrip('/'))
378
+ if article_id:
379
+     article_id = article_id.group(1) # "a64152b37c35"
380
+ ```
381
+
382
+ This ID is the same across all URL forms (`medium.com/@user/slug`, `user.medium.com/slug`, `medium.com/publication/slug`).
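
To illustrate that the same ID comes out of every URL form, the extraction can be wrapped in a function that parses the URL first, so query strings and trailing slashes do not matter. `medium_article_id` is a hypothetical helper, and the publication URL in the usage lines is a made-up example.

```python
import re
from urllib.parse import urlsplit

def medium_article_id(url):
    """Return the trailing 12-hex-char article ID, or None if absent."""
    path = urlsplit(url).path.rstrip("/")  # ignores ?query and #fragment
    m = re.search(r"-([a-f0-9]{12})$", path)
    return m.group(1) if m else None

urls = [
    "https://medium.com/@karpathy/software-2-0-a64152b37c35",
    "https://karpathy.medium.com/software-2-0-a64152b37c35?source=rss-abc------2",
    "https://medium.com/some-publication/software-2-0-a64152b37c35/",  # made-up publication path
]
for u in urls:
    print(medium_article_id(u))
```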
383
+
384
+ ---
385
+
386
+ ## Gotchas
387
+
388
+ - **HTTP 403 on plain `http_get`** — The default `http_get` helper sends `User-Agent: Mozilla/5.0` which Medium accepts for most endpoints, but article HTML pages (without `?format=json`) return 403. Always use `?format=json` for article and profile pages.
389
+
390
+ - **`?format=json` works; profile stream API does not** — `https://medium.com/_/api/users/{id}/profile/stream` returns HTTP 403 for unauthenticated requests. Use `?format=json` on the profile URL instead.
391
+
392
+ - **`?format=json` on search pages returns 403 or broken JSON** — `medium.com/search?q=...&format=json` and `medium.com/search/posts?q=...&format=json` both fail. Search is not available without auth.
393
+
394
+ - **GraphQL `collection()` requires ID, not slug** — `collection(slug: "towards-data-science")` returns HTTP 400. You must use the 12-character hex ID (e.g. `"7f60cf5620c9"`). Get it from `?format=json` on the publication page: `payload['collection']['id']`.
395
+
396
+ - **GraphQL `tags` field on `post()` returns HTTP 400** — Use `topics { name slug }` instead. Topics are a subset of tags but work without auth.
397
+
398
+ - **GraphQL visibility is a string, not a number** — `post().visibility` returns `"PUBLIC"` or `"LOCKED"` (string). The `?format=json` `value.visibility` field uses integers: `0`=public, `2`=locked. Both agree on the lock status.
399
+
400
+ - **`totalClapCount` vs `recommends`** — `totalClapCount` (60865) counts all claps (Medium allows up to 50 claps per reader). `recommends` (8846) counts unique clappers. The GraphQL `clapCount` field equals `totalClapCount`, not `recommends`.
401
+
402
+ - **RSS returns at most 10 items, no clap counts** — RSS is best for getting recent article links + full HTML body. Use `?format=json` profile if you need metrics.
403
+
404
+ - **RSS link contains tracking params** — `posts[0]['link']` includes `?source=rss-{userId}------2`. Strip with `.split('?')[0]` if you need a clean URL.
405
+
406
+ - **`content:encoded` in RSS is full HTML, not plaintext** — Strip HTML tags if you want plaintext: `re.sub(r'<[^>]+>', '', body_html)`.
407
+
408
+ - **Medium subdomains** — Some users have custom subdomains (`karpathy.medium.com`). Both `medium.com/@karpathy/...` and `karpathy.medium.com/...` resolve to the same article; `?format=json` works on both.
409
+
410
+ - **towardsdatascience.com is no longer Medium** — TDS moved to its own WordPress site. `towardsdatascience.com/article-slug?format=json` returns full WordPress HTML, not Medium JSON. Use `medium.com/towards-data-science` for the archived Medium publication.
411
+
412
+ - **No public search API** — Medium has no Algolia equivalent. Finding articles by keyword requires either a browser, or fetching a user/publication feed and filtering locally.
413
+
414
+ - **Timestamps are unix milliseconds** — `firstPublishedAt`, `createdAt`, `latestPublishedAt` are all in milliseconds. Convert: `datetime.fromtimestamp(val['firstPublishedAt'] / 1000, tz=timezone.utc)`.
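
Two of the gotchas above reduce to short arithmetic, sketched here with the values quoted in this document. `ms_to_utc` and `claps_per_clapper` are hypothetical helper names; the numbers are the `firstPublishedAt`, `totalClapCount`, and `recommends` values shown earlier.

```python
from datetime import datetime, timezone

def ms_to_utc(ms):
    """Convert a Medium unix-millisecond timestamp to an aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

def claps_per_clapper(total_clap_count, recommends):
    """Average claps per unique clapper; None when nobody has clapped."""
    return total_clap_count / recommends if recommends else None

first_pub = ms_to_utc(1510438733751)
print(first_pub.replace(microsecond=0).isoformat())  # 2017-11-11T22:18:53+00:00
print(round(claps_per_clapper(60865, 8846), 1))      # 6.9
```

The converted time matches the `pubDate` the RSS feed reports for the same article.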