@pencil-agent/nano-pencil 2.0.1 → 2.0.3
- package/README.md +267 -267
- package/dist/build-meta.json +3 -3
- package/dist/core/export-html/AGENT.md +11 -11
- package/dist/core/export-html/template.css +971 -971
- package/dist/core/export-html/template.html +54 -54
- package/dist/core/model/custom-providers.js +1 -1
- package/dist/core/model-registry.js +5 -5
- package/dist/extensions/builtin/AGENT.md +115 -115
- package/dist/extensions/builtin/browser/AGENT.md +17 -17
- package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
- package/dist/extensions/builtin/browser/browser.md +73 -73
- package/dist/extensions/builtin/browser/install.md +142 -142
- package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
- package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
- package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
- package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
- package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
- package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
- package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
- package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
- package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
- package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
- package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
- package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
- package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
- package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
- package/dist/extensions/builtin/debug/index.js +9 -9
- package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
- package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
- package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
- package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
- package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
- package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
- package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
- package/dist/extensions/builtin/goal/README.md +67 -67
- package/dist/extensions/builtin/goal/index.js +6 -6
- package/dist/extensions/builtin/grub/README.md +112 -112
- package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
- package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
- package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
- package/dist/extensions/builtin/link-world/linkworld.md +313 -313
- package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
- package/dist/extensions/builtin/loop/README.md +92 -92
- package/dist/extensions/builtin/mcp/figma-design.md +68 -68
- package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
- package/dist/extensions/builtin/recap/AGENT.md +15 -15
- package/dist/extensions/builtin/sal/README.md +72 -72
- package/dist/extensions/builtin/security-audit/README.md +289 -289
- package/dist/extensions/builtin/team/AGENT.md +112 -112
- package/dist/extensions/builtin/team/TESTING.md +299 -299
- package/dist/extensions/builtin/token-save/README.md +56 -56
- package/dist/extensions/optional/AGENT.md +10 -10
- package/dist/modes/interactive/controllers/input-submit-controller.js +2 -2
- package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
- package/dist/modes/interactive/interactive-mode.js +19 -19
- package/dist/modes/interactive/theme/dark.json +85 -85
- package/dist/modes/interactive/theme/light.json +84 -84
- package/dist/modes/interactive/theme/theme-schema.json +335 -335
- package/dist/modes/interactive/theme/warm.json +81 -81
- package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
- package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
- package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
- package/docs/SDK-TESTING.md +364 -0
- package/docs/codex-goal-command-impl.md +1055 -1055
- package/docs/codex-goal-vs-grub.md +500 -500
- package/docs/custom-provider.md +27 -27
- package/docs/extensions.md +27 -27
- package/docs/keybindings.md +27 -27
- package/docs/loop 重构完成总结.md +250 -250
- package/docs/loop 重构完成报告.md +122 -122
- package/docs/loop 重构方案.md +1222 -1222
- package/docs/loop 重构方案实现报告.md +158 -158
- package/docs/loop 重构方案对比分析.md +128 -128
- package/docs/loop 重构计划.md +320 -320
- package/docs/loop-usage-examples.md +214 -214
- package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
- package/docs/models.md +27 -27
- package/docs/packages.md +27 -27
- package/docs/pi-design-philosophy.md +457 -457
- package/docs/planmode.md +1987 -1987
- package/docs/prompt-templates.md +27 -27
- package/docs/providers.md +27 -27
- package/docs/sdk.md +27 -27
- package/docs/skills.md +27 -27
- package/docs/startup-performance-optimization.md +301 -0
- package/docs/themes.md +27 -27
- package/docs/tui.md +27 -27
- package/docs/认知地图.md +47 -0
- package/package.json +190 -190
- package/docs/cc-agent-design.md +0 -1297
- package/docs/cc-tui-design.md +0 -1333
- package/docs/nanoPencil-学习计划.md +0 -170
- package/docs/scan-report.md +0 -3820
- package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
- package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +0 -261
@@ -1,414 +1,414 @@

# Medium — Data Extraction

`https://medium.com` is a blogging platform. Three access paths have been tested and validated: the undocumented `?format=json` endpoint (fastest for article and publication data), the undocumented GraphQL API (best for targeted metric lookups), and RSS feeds (best for listing recent posts without auth). No browser is needed for any read-only task.

## Do this first: pick your access path

| Goal | Best approach | Latency |
|------|--------------|---------|
| Article metadata + full body | `?format=json` on article URL | ~400ms |
| Article metrics only (claps, visibility) | GraphQL `post(id:)` | ~275ms |
| Author profile + follower count | GraphQL `user(username:)` | ~220ms |
| Recent posts for a user (up to 10) | `?format=json` on profile URL | ~240ms |
| Recent posts for a publication | `?format=json` on publication URL | ~300ms |
| Recent post list as a feed (latest 10, no pagination) | RSS feed | ~260ms |
| Full article body as HTML | RSS `content:encoded` field | ~260ms |
| Publication subscriber count | `?format=json` on publication URL | ~300ms |

**Never use a browser for read-only Medium tasks.** All article content, metadata, and metrics are available over HTTP. A browser is only needed for authenticated actions (clapping, posting, account management).

---

## The XSSI prefix

Every `?format=json` response starts with the anti-hijacking prefix `])}while(1);</x>` before the JSON. **Strip it before parsing.** The helper below handles this.

```python
import urllib.request, gzip, json, re

def medium_json(url):
    """Fetch any Medium URL with ?format=json and return parsed dict.
    Strips the XSSI prefix ])}while(1);</x> automatically.
    Works on: article URLs, user profile URLs, publication URLs.
    Does NOT work on: search pages, /latest, profile stream API.
    """
    sep = '&' if '?' in url else '?'
    req = urllib.request.Request(
        url + sep + 'format=json',
        headers={
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
            "Accept": "application/json, */*",
            "Accept-Encoding": "gzip",
        }
    )
    with urllib.request.urlopen(req, timeout=20) as r:
        raw = r.read()
        if r.headers.get("Content-Encoding") == "gzip":
            raw = gzip.decompress(raw)
    text = raw.decode()
    # Strip everything before the first {
    return json.loads(re.sub(r'^[^\{]+', '', text))
```
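
As a complement to the regex used in `medium_json`, here is a minimal sketch that removes the prefix by exact match instead of discarding everything before the first `{`. The names `XSSI_PREFIX`, `strip_xssi`, and `parse_medium_body` are illustrative and not part of the package, and the sample body is made up for the demonstration.

```python
import json

# Literal prefix described above; assumed to appear at the very start of the body.
XSSI_PREFIX = "])}while(1);</x>"

def strip_xssi(text):
    """Return text without the anti-hijacking prefix; unchanged if the prefix is absent."""
    if text.startswith(XSSI_PREFIX):
        return text[len(XSSI_PREFIX):]
    return text

def parse_medium_body(text):
    """Parse a ?format=json response body (already decoded to str) into a dict."""
    return json.loads(strip_xssi(text.lstrip()))

# Hypothetical, heavily shortened response body.
sample = '])}while(1);</x>{"payload": {"value": {"id": "a64152b37c35"}}}'
parsed = parse_medium_body(sample)
print(parsed["payload"]["value"]["id"])  # a64152b37c35
```

Matching the literal prefix leaves an unexpected body (for example an HTML error page) untouched, so the failure surfaces at `json.loads` with the full body still available for debugging.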

---

## Path 1: `?format=json` — article metadata + body (fastest for articles)

Append `?format=json` to any article URL. The response contains full metadata, `virtuals` (metrics), and the complete article body in a structured `bodyModel`. No auth is required for public or subscriber-locked articles: the metadata and full body are always returned, whereas a browser would show a truncated body for paywalled articles.

```python
data = medium_json("https://medium.com/@karpathy/software-2-0-a64152b37c35")
payload = data['payload']
val = payload['value']                      # article fields
refs = payload['references']                # User, Social, SocialStats dicts keyed by ID

# --- Article fields ---
title = val['title']                        # "Software 2.0"
article_id = val['id']                      # "a64152b37c35"
creator_id = val['creatorId']               # "ac9d9a35533e"
slug = val['uniqueSlug']                    # "software-2-0-a64152b37c35"
url = val['canonicalUrl']                   # "https://medium.com/@karpathy/..."
first_pub = val['firstPublishedAt']         # unix ms: 1510438733751
last_pub = val['latestPublishedAt']         # unix ms: 1615659523264
visibility = val['visibility']              # 0=public, 2=subscriber-locked
is_locked = val['isSubscriptionLocked']     # True if paywalled
locked_src = val['lockedPostSource']        # 0=free, 1=Medium Partner Program

# --- Metrics (in val['virtuals']) ---
virtuals = val['virtuals']
clap_count = virtuals['totalClapCount']     # 60865 (all claps, including multi-clap)
recommends = virtuals['recommends']         # 8846 (unique clappers)
read_time = virtuals['readingTime']         # 8.79811320754717 (minutes, float)
word_count = virtuals['wordCount']          # 2146

# --- Tags ---
tags = [t['slug'] for t in virtuals['tags']]
# ['machine-learning', 'artificial-intelligence', 'programming', 'software-development', 'future']

# --- Author (from references) ---
user = refs['User'][creator_id]
author_name = user['name']                  # "Andrej Karpathy"
author_handle = user['username']            # "karpathy"
author_bio = user['bio']                    # "I like to train deep neural nets on large datasets."
author_twitter = user['twitterScreenName']  # "karpathy"

# --- Follower count (from SocialStats) ---
ss = refs['SocialStats'][creator_id]
follower_count = ss['usersFollowedByCount']  # 60027
following_count = ss['usersFollowedCount']   # 183
```

### Detect paywall

```python
# Paywalled (Medium Partner Program): isSubscriptionLocked=True, visibility=2, lockedPostSource=1
# Free: isSubscriptionLocked=False, visibility=0, lockedPostSource=0
is_paywalled = val['isSubscriptionLocked']  # True / False
```

Confirmed on real TDS (Towards Data Science) articles: paywalled articles return `isSubscriptionLocked=True`, `visibility=2`, `lockedPostSource=1`. For free articles the three fields are `False`, `0`, and `0` respectively.
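
The three fields can also be cross-checked against each other, so that a combination not covered by the observations above is flagged rather than silently treated as free. This is a minimal sketch; `paywall_status` is a hypothetical helper name, and the `'unknown'` category is an assumption for combinations the document does not describe.

```python
def paywall_status(val):
    """Classify an article `value` dict as 'paywalled', 'free', or 'unknown'.

    Only the two combinations confirmed in this document are recognized;
    anything else is reported as 'unknown' instead of being guessed.
    """
    locked = val.get("isSubscriptionLocked")
    visibility = val.get("visibility")
    source = val.get("lockedPostSource")
    if locked is True and visibility == 2 and source == 1:
        return "paywalled"
    if locked is False and visibility == 0 and source == 0:
        return "free"
    return "unknown"

# Hypothetical minimal dicts carrying only the three relevant fields.
print(paywall_status({"isSubscriptionLocked": True, "visibility": 2, "lockedPostSource": 1}))   # paywalled
print(paywall_status({"isSubscriptionLocked": False, "visibility": 0, "lockedPostSource": 0}))  # free
```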

### Article body

The full body is in `val['content']['bodyModel']['paragraphs']`, a list of dicts:

```python
paragraphs = val['content']['bodyModel']['paragraphs']

# Paragraph types (confirmed for this article):
#   type=1 -> body text (P)
#   type=3 -> heading (H1/H2)
#   type=4 -> image (text is empty; metadata has image ID)

# Reconstruct plain text:
text_paras = [p['text'] for p in paragraphs if p.get('text')]
full_text = '\n\n'.join(text_paras)
```
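
Where headings and image positions should survive the conversion, the paragraph `type` can drive the output. The sketch below covers only the three types listed above and treats any other type that carries text as body text, which is an assumption. `paragraphs_to_markdown` and the `[image]` placeholder are illustrative choices, and the sample list is made up.

```python
def paragraphs_to_markdown(paragraphs):
    """Render bodyModel paragraphs as Markdown-like text.

    type=3 becomes a '## ' heading, type=4 becomes an '[image]' placeholder,
    and every other paragraph with non-empty text is emitted as plain text.
    """
    blocks = []
    for p in paragraphs:
        text = p.get("text") or ""
        ptype = p.get("type")
        if ptype == 3 and text:
            blocks.append("## " + text)
        elif ptype == 4:
            blocks.append("[image]")
        elif text:
            blocks.append(text)
    return "\n\n".join(blocks)

# Hypothetical paragraphs shaped like the dicts described above.
sample = [
    {"type": 3, "text": "Software 2.0"},
    {"type": 1, "text": "Body."},
    {"type": 4, "text": ""},
    {"type": 1, "text": "More."},
]
print(paragraphs_to_markdown(sample))
```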

---

## Path 2: GraphQL API — targeted metric lookups

Send `POST https://medium.com/_/graphql` with a JSON body. No auth or CSRF token is required: unauthenticated queries return HTTP 200 with JSON. Invalid fields return HTTP 400, so do not assume a field exists without testing it first.

```python
import json, urllib.request, gzip

def gql(query):
    body = json.dumps({"query": query}).encode()
    req = urllib.request.Request(
        "https://medium.com/_/graphql",
        data=body,
        headers={
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
            "Content-Type": "application/json",
            "Accept": "application/json",
            "Accept-Encoding": "gzip",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=20) as r:
        raw = r.read()
        if r.headers.get("Content-Encoding") == "gzip":
            raw = gzip.decompress(raw)
    return json.loads(raw.decode())
```

### Fetch article metrics (fastest)

```python
result = gql("""
{
  post(id: "a64152b37c35") {
    title
    id
    firstPublishedAt
    latestPublishedAt
    visibility
    uniqueSlug
    canonicalUrl
    mediumUrl
    isLocked
    clapCount
    readingTime
    wordCount
  }
}
""")
post = result['data']['post']
# post['visibility']  -> "PUBLIC" | "LOCKED" (string, not numeric)
# post['isLocked']    -> False | True
# post['clapCount']   -> 60865 (same as totalClapCount in format=json)
# post['readingTime'] -> 8.79811320754717 (minutes)
# post['wordCount']   -> 2146
```

**Confirmed working `post()` fields:** `title`, `id`, `createdAt`, `updatedAt`, `firstPublishedAt`, `latestPublishedAt`, `visibility`, `uniqueSlug`, `canonicalUrl`, `mediumUrl`, `isLocked`, `clapCount`, `readingTime`, `wordCount`

**Nested objects that work:** `topics { name slug }`, `creator { name username }`, `collection { name id slug description domain creator { name username } }`

**Fields that return HTTP 400 (not available):** `tags`, `author`, `recommends`, `content`, `publication`, `responses`, `sequence`
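
Because one unconfirmed field fails the whole request with HTTP 400, it can help to build `post()` queries only from the scalar fields listed above. A minimal sketch, assuming a hypothetical helper `build_post_query` and a `CONFIRMED_POST_FIELDS` set copied from that list (neither exists in the package); nested selections such as `creator { ... }` are out of scope here.

```python
import json

# Scalar post() fields reported as working in this document.
CONFIRMED_POST_FIELDS = {
    "title", "id", "createdAt", "updatedAt", "firstPublishedAt",
    "latestPublishedAt", "visibility", "uniqueSlug", "canonicalUrl",
    "mediumUrl", "isLocked", "clapCount", "readingTime", "wordCount",
}

def build_post_query(post_id, fields):
    """Return a GraphQL query string for post(id:), rejecting unconfirmed fields."""
    unknown = [f for f in fields if f not in CONFIRMED_POST_FIELDS]
    if unknown:
        raise ValueError("unconfirmed post() fields: " + ", ".join(unknown))
    # json.dumps produces a double-quoted literal, which is enough for hex post IDs.
    return "{ post(id: %s) { %s } }" % (json.dumps(post_id), " ".join(fields))

print(build_post_query("a64152b37c35", ["title", "clapCount"]))
# { post(id: "a64152b37c35") { title clapCount } }
```

The resulting string can be passed to the `gql` helper; the check only reflects what was observed at the time of writing, so the set may need updating if Medium changes its schema.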

### Fetch author profile

```python
result = gql("""
{
  user(username: "karpathy") {
    name
    username
    id
    bio
    imageId
    twitterScreenName
    mediumMemberAt
    socialStats {
      followerCount
      followingCount
    }
  }
}
""")
user = result['data']['user']
# user['name']                        -> "Andrej Karpathy"
# user['id']                          -> "ac9d9a35533e"
# user['bio']                         -> "I like to train deep neural nets on large datasets."
# user['twitterScreenName']           -> "karpathy"
# user['socialStats']['followerCount'] -> 60028
# user['mediumMemberAt']              -> 0 (not a member); nonzero = unix ms join date
```

**Confirmed working `user()` fields:** `name`, `username`, `id`, `bio`, `imageId`, `twitterScreenName`, `mediumMemberAt`, `socialStats { followerCount followingCount }`

**Fields that return HTTP 400:** `followerCount` (top-level), `followingCount` (top-level), `postCount`
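
Timestamps such as `firstPublishedAt`, `latestPublishedAt`, and `mediumMemberAt` are unix milliseconds, with `0` used for "not set" in the `mediumMemberAt` case. A minimal conversion sketch follows; `ms_to_utc` is a hypothetical helper name, and mapping `0` to `None` is a design choice made here, not something the API prescribes.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def ms_to_utc(ms):
    """Convert unix milliseconds to an aware UTC datetime; 0 or None becomes None."""
    if not ms:
        return None
    # Integer timedelta arithmetic avoids float rounding of the millisecond part.
    return EPOCH + timedelta(milliseconds=ms)

print(ms_to_utc(1510438733751).isoformat())  # 2017-11-11T22:18:53.751000+00:00
print(ms_to_utc(0))                          # None
```

The first value is the `firstPublishedAt` of the example article and agrees with the `pubDate` reported for it in the RSS feed.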

### Fetch collection (publication) by ID

The GraphQL `collection()` query only accepts `id`, not `slug`. Get the ID from `?format=json` on the publication page.

```python
# TDS Archive id: 7f60cf5620c9 (from medium.com/towards-data-science?format=json)
result = gql("""
{
  collection(id: "7f60cf5620c9") {
    name
    id
    slug
    description
    domain
    creator { name username }
  }
}
""")
coll = result['data']['collection']
# coll['name'] -> "TDS Archive"
# coll['slug'] -> "data-science"
```

---

## Path 3: RSS feeds (best for recent posts list + article bodies)

RSS works with plain `http_get` and returns up to the 10 most recent posts. The full article HTML is in `content:encoded`. RSS carries no clap count or visibility info.

```python
import re
from helpers import http_get

def parse_rss_items(rss_xml):
    """Extract items from Medium RSS feed. Returns list of dicts."""
    def cdata(tag, text):
        m = re.search(rf'<{tag}[^>]*><!\[CDATA\[(.*?)\]\]></{tag}>', text, re.DOTALL)
        return m.group(1).strip() if m else None

    items = []
    for raw in re.findall(r'<item>(.*?)</item>', rss_xml, re.DOTALL):
        # link is plain text (not CDATA)
        link_m = re.search(r'<link>(.*?)</link>', raw, re.DOTALL)
        items.append({
            'title': cdata('title', raw),
            'link': link_m.group(1).strip() if link_m else None,
            'pubDate': cdata('pubDate', raw),
            'creator': cdata('dc:creator', raw),
            'tags': re.findall(r'<category><!\[CDATA\[(.*?)\]\]></category>', raw),
            'body_html': cdata('content:encoded', raw),  # full article HTML
        })
    return items

# User feed (up to 10 latest posts)
rss = http_get("https://medium.com/feed/@karpathy")
posts = parse_rss_items(rss)
# posts[0]['title']     -> "Software 2.0"
# posts[0]['pubDate']   -> "Sat, 11 Nov 2017 22:18:53 GMT"
# posts[0]['creator']   -> "Andrej Karpathy"
# posts[0]['tags']      -> ['programming', 'software-development', 'artificial-intelligence', 'future', 'machine-learning']
# posts[0]['link']      -> "https://karpathy.medium.com/software-2-0-a64152b37c35?source=rss-..."
# posts[0]['body_html'] -> full article body as HTML string (~15KB for this article)

# Publication feed (up to 10 latest posts)
rss_pub = http_get("https://medium.com/feed/towards-data-science")
pub_posts = parse_rss_items(rss_pub)
```

**RSS limitations:**
- RSS does not include clap count, view count, or paywall status.
- `body_html` contains the full article body as HTML, including `<p>`, `<strong>`, `<a>`, `<img>` tags.
- Pagination is not supported; RSS always returns the 10 most recent posts.
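
For feeds where the regex approach proves brittle, the same fields can be read with the standard library's XML parser. This is a sketch under assumptions: `parse_rss_items_et` is a hypothetical name, the two namespace URIs are the conventional ones for `dc:` and `content:` and should be checked against the feed's root element, and the sample feed is made up. Unlike the regex version, `pubDate` is returned as a `datetime`.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Conventional namespace URIs; assumed to match the declarations in the feed.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}

def parse_rss_items_et(rss_xml):
    """Parse RSS <item> elements into dicts using ElementTree (CDATA is unwrapped by the parser)."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "pubDate": parsedate_to_datetime(pub) if pub else None,
            "creator": item.findtext("dc:creator", namespaces=NS),
            "tags": [c.text for c in item.findall("category")],
            "body_html": item.findtext("content:encoded", namespaces=NS),
        })
    return items

# Hypothetical one-item feed with the same element layout as described above.
sample = """<rss xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0">
  <channel>
    <item>
      <title><![CDATA[Software 2.0]]></title>
      <link>https://example.com/software-2-0-a64152b37c35</link>
      <category><![CDATA[programming]]></category>
      <dc:creator><![CDATA[Andrej Karpathy]]></dc:creator>
      <pubDate>Sat, 11 Nov 2017 22:18:53 GMT</pubDate>
      <content:encoded><![CDATA[<p>Hello</p>]]></content:encoded>
    </item>
  </channel>
</rss>"""

posts = parse_rss_items_et(sample)
print(posts[0]["title"], posts[0]["pubDate"].year)
```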

---

## Path 4: `?format=json` on user profile — recent posts with metrics

Better than RSS when you need clap counts alongside the post list. Returns up to `limit` posts (default 10) plus full author metadata.

```python
data = medium_json("https://medium.com/@karpathy?limit=10")
payload = data['payload']

user = payload['user']
# user['name']     -> "Andrej Karpathy"
# user['username'] -> "karpathy"
# user['bio']      -> "I like to train deep neural nets on large datasets."

refs = payload['references']
ss = refs['SocialStats'][user['userId']]
# ss['usersFollowedByCount'] -> 60028 (followers)
# ss['usersFollowedCount']   -> 183 (following)

posts = refs.get('Post', {})  # dict keyed by post ID
for pid, p in posts.items():
    v = p['virtuals']
    print(p['title'], v['totalClapCount'], round(v['readingTime'], 1))

# Paginate: use paging['next'] from payload
paging = payload['paging']
next_params = paging['next']
# next_params = {'limit': 10, 'to': '1495652975362', 'source': 'overview', 'page': 2, 'ignoredIds': []}
# Append as query params to the same profile URL to get next page
next_url = (
    f"https://medium.com/@{user['username']}"
    f"?limit={next_params['limit']}&to={next_params['to']}"
    f"&source={next_params['source']}&page={next_params['page']}"
)
data2 = medium_json(next_url)
# Note: karpathy has only 8 total posts; pagination returns the same refs on page 2
```
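
The `paging['next']` dict does not always carry the same keys (the publication page, for instance, returns only `to` and `page`), so building the query string from whatever keys are present can be more robust than naming each one. A minimal sketch, assuming a hypothetical helper `next_page_url`; list and dict values such as `ignoredIds` are skipped because this document does not establish how they should be serialized.

```python
from urllib.parse import urlencode

def next_page_url(base_url, next_params):
    """Build the next-page URL from payload['paging']['next']; None if there is no next page."""
    if not next_params:
        return None
    # Keep scalar values only; containers like ignoredIds=[] are left out (assumption).
    query = {k: v for k, v in next_params.items() if not isinstance(v, (list, dict))}
    return base_url + "?" + urlencode(query)

profile_next = {"limit": 10, "to": "1495652975362", "source": "overview", "page": 2, "ignoredIds": []}
print(next_page_url("https://medium.com/@karpathy", profile_next))
# https://medium.com/@karpathy?limit=10&to=1495652975362&source=overview&page=2
```

The returned URL contains a `?`, so `medium_json` appends `&format=json` to it rather than a second `?`.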

---

## Path 5: `?format=json` on publication page

Returns publication metadata and recent posts with metrics.

```python
data = medium_json("https://medium.com/towards-data-science")
payload = data['payload']

coll = payload['collection']
# coll['name']                      -> "TDS Archive"
# coll['slug']                      -> "data-science"
# coll['description']               -> full description string
# coll['subscriberCount']           -> 828527
# coll['metadata']['followerCount'] -> 828527
# coll['tags']                      -> ['DATA SCIENCE', 'MACHINE LEARNING', ...]

posts = payload['references'].get('Post', {})
for pid, p in posts.items():
    v = p['virtuals']
    print(p['title'], v['totalClapCount'], p['isSubscriptionLocked'])
    # Also includes: p['visibility'] (0=free, 2=paywalled)

# Paginate (same pattern as user profile)
paging = payload['paging']
# paging['next'] = {'to': '1738573325936', 'page': 3}
```
|
|
366
|
-
|
|
367
|
-
---
|
|
368
|
-
|
|
369
|
-
## Retrieving the article ID from a URL
|
|
370
|
-
|
|
371
|
-
The `id` is the last 12 hex chars of a Medium article URL slug:
|
|
372
|
-
|
|
373
|
-
```python
|
|
374
|
-
import re
|
|
375
|
-
|
|
376
|
-
url = "https://medium.com/@karpathy/software-2-0-a64152b37c35"
|
|
377
|
-
article_id = re.search(r'-([a-f0-9]{12})$', url.split('?')[0].rstrip('/'))  # drop the query first, then any trailing slash
|
|
378
|
-
if article_id:
|
|
379
|
-
article_id = article_id.group(1) # "a64152b37c35"
|
|
380
|
-
```
|
|
381
|
-
|
|
382
|
-
This ID is the same across all URL forms (`medium.com/@user/slug`, `user.medium.com/slug`, `medium.com/publication/slug`).
|
|
383
|
-
|
|
384
|
-
---
|
|
385
|
-
|
|
386
|
-
## Gotchas
|
|
387
|
-
|
|
388
|
-
- **HTTP 403 on plain `http_get`** — The default `http_get` helper sends `User-Agent: Mozilla/5.0` which Medium accepts for most endpoints, but article HTML pages (without `?format=json`) return 403. Always use `?format=json` for article and profile pages.
|
|
389
|
-
|
|
390
|
-
- **`?format=json` works; profile stream API does not** — `https://medium.com/_/api/users/{id}/profile/stream` returns HTTP 403 for unauthenticated requests. Use `?format=json` on the profile URL instead.
|
|
391
|
-
|
|
392
|
-
- **`?format=json` on search pages returns 403 or broken JSON** — `medium.com/search?q=...&format=json` and `medium.com/search/posts?q=...&format=json` both fail. Search is not available without auth.
|
|
393
|
-
|
|
394
|
-
- **GraphQL `collection()` requires ID, not slug** — `collection(slug: "towards-data-science")` returns HTTP 400. You must use the 12-character hex ID (e.g. `"7f60cf5620c9"`). Get it from `?format=json` on the publication page: `payload['collection']['id']`.
|
|
395
|
-
|
|
396
|
-
- **GraphQL `tags` field on `post()` returns HTTP 400** — Use `topics { name slug }` instead. Topics are a subset of tags but work without auth.
|
|
397
|
-
|
|
398
|
-
- **GraphQL visibility is a string, not a number** — `post().visibility` returns `"PUBLIC"` or `"LOCKED"` (string). The `?format=json` `value.visibility` field uses integers: `0`=public, `2`=locked. Both agree on the lock status.
|
|
399
|
-
|
|
400
|
-
- **`totalClapCount` vs `recommends`** — `totalClapCount` (60865) counts all claps (Medium allows up to 50 claps per reader). `recommends` (8846) counts unique clappers. The GraphQL `clapCount` field equals `totalClapCount`, not `recommends`.
|
|
401
|
-
|
|
402
|
-
- **RSS returns at most 10 items, no clap counts** — RSS is best for getting recent article links + full HTML body. Use `?format=json` profile if you need metrics.
|
|
403
|
-
|
|
404
|
-
- **RSS link contains tracking params** — `posts[0]['link']` includes `?source=rss-{userId}------2`. Strip with `.split('?')[0]` if you need a clean URL.
|
|
405
|
-
|
|
406
|
-
- **`content:encoded` in RSS is full HTML, not plaintext** — Strip HTML tags if you want plaintext: `re.sub(r'<[^>]+>', '', body_html)`.
|
|
407
|
-
|
|
408
|
-
- **Medium subdomains** — Some users have custom subdomains (`karpathy.medium.com`). Both `medium.com/@karpathy/...` and `karpathy.medium.com/...` resolve to the same article; `?format=json` works on both.
|
|
409
|
-
|
|
410
|
-
- **towardsdatascience.com is no longer Medium** — TDS moved to its own WordPress site. `towardsdatascience.com/article-slug?format=json` returns full WordPress HTML, not Medium JSON. Use `medium.com/towards-data-science` for the archived Medium publication.
|
|
411
|
-
|
|
412
|
-
- **No public search API** — Medium has no Algolia equivalent. Finding articles by keyword requires either a browser, or fetching a user/publication feed and filtering locally.
|
|
413
|
-
|
|
414
|
-
- **Timestamps are unix milliseconds** — `firstPublishedAt`, `createdAt`, `latestPublishedAt` are all in milliseconds. Convert (after `from datetime import datetime, timezone`): `datetime.fromtimestamp(val['firstPublishedAt'] / 1000, tz=timezone.utc)`.
|
|
1
|
+
# Medium — Data Extraction
|
|
2
|
+
|
|
3
|
+
`https://medium.com` — blogging platform. Three access paths tested and validated: the undocumented `?format=json` endpoint (fastest for article + publication data), the undocumented GraphQL API (best for targeted metric lookups), and RSS feeds (best for recent posts lists without auth). No browser needed for any read-only task.
|
|
4
|
+
|
|
5
|
+
## Do this first: pick your access path
|
|
6
|
+
|
|
7
|
+
| Goal | Best approach | Latency |
|
|
8
|
+
|------|--------------|---------|
|
|
9
|
+
| Article metadata + full body | `?format=json` on article URL | ~400ms |
|
|
10
|
+
| Article metrics only (claps, visibility) | GraphQL `post(id:)` | ~275ms |
|
|
11
|
+
| Author profile + follower count | GraphQL `user(username:)` | ~220ms |
|
|
12
|
+
| Recent posts for a user (up to 10) | `?format=json` on profile URL | ~240ms |
|
|
13
|
+
| Recent posts for a publication | `?format=json` on publication URL | ~300ms |
|
|
14
|
+
| Latest posts as a feed (up to 10, no pagination) | RSS feed | ~260ms |
|
|
15
|
+
| Full article body as HTML | RSS `content:encoded` field | ~260ms |
|
|
16
|
+
| Publication subscriber count | `?format=json` on publication URL | ~300ms |
|
|
17
|
+
|
|
18
|
+
**Never use a browser for read-only Medium tasks.** All article content, metadata, and metrics are available over HTTP. A browser is only needed for authenticated actions (clapping, posting, account management) and for keyword search, which has no unauthenticated endpoint.
|
|
19
|
+
|
|
20
|
+
---
|
|
21
|
+
|
|
22
|
+
## The XSSI prefix
|
|
23
|
+
|
|
24
|
+
Every `?format=json` response starts with the anti-hijacking prefix `])}while(1);</x>` before the JSON. **Strip it before parsing.** The helper below handles this.
|
|
25
|
+
|
|
26
|
+
```python
|
|
27
|
+
import urllib.request, gzip, json, re
|
|
28
|
+
|
|
29
|
+
def medium_json(url):
|
|
30
|
+
"""Fetch any Medium URL with ?format=json and return parsed dict.
|
|
31
|
+
Strips the XSSI prefix ])}while(1);</x> automatically.
|
|
32
|
+
Works on: article URLs, user profile URLs, publication URLs.
|
|
33
|
+
Does NOT work on: search pages, /latest, profile stream API.
|
|
34
|
+
"""
|
|
35
|
+
sep = '&' if '?' in url else '?'
|
|
36
|
+
req = urllib.request.Request(
|
|
37
|
+
url + sep + 'format=json',
|
|
38
|
+
headers={
|
|
39
|
+
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
|
|
40
|
+
"Accept": "application/json, */*",
|
|
41
|
+
"Accept-Encoding": "gzip",
|
|
42
|
+
}
|
|
43
|
+
)
|
|
44
|
+
with urllib.request.urlopen(req, timeout=20) as r:
|
|
45
|
+
raw = r.read()
|
|
46
|
+
if r.headers.get("Content-Encoding") == "gzip":
|
|
47
|
+
raw = gzip.decompress(raw)
|
|
48
|
+
text = raw.decode()
|
|
49
|
+
# Strip everything before the first {
|
|
50
|
+
return json.loads(re.sub(r'^[^\{]+', '', text))
|
|
51
|
+
```
|
|
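To see the prefix handling in isolation, the sketch below applies the same strip-to-first-brace idea to a hard-coded string instead of a live response. The sample body and the name `strip_xssi` are illustrative assumptions; only the prefix itself comes from the text above.

```python
import json

XSSI_PREFIX = "])}while(1);</x>"

def strip_xssi(text):
    """Drop everything before the first '{' and parse the rest as JSON."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in response text")
    return json.loads(text[start:])

# Hypothetical response body, shaped like the prefix described above.
sample = XSSI_PREFIX + '{"success": true, "payload": {"value": {"id": "a64152b37c35"}}}'
parsed = strip_xssi(sample)
print(parsed["payload"]["value"]["id"])
```

Raising on a missing `{` makes an HTML error page or an empty body fail loudly rather than surface as a confusing JSON decode error.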
52
|
+
|
|
53
|
+
---
|
|
54
|
+
|
|
55
|
+
## Path 1: `?format=json` — article metadata + body (fastest for articles)
|
|
56
|
+
|
|
57
|
+
Append `?format=json` to any article URL. The response contains full metadata, virtuals (metrics), and the complete article body in a structured `bodyModel`. No auth is required for public or subscriber-locked articles: the metadata and full body are always returned, even for paywalled articles whose body a browser would show truncated.
|
|
58
|
+
|
|
59
|
+
```python
|
|
60
|
+
data = medium_json("https://medium.com/@karpathy/software-2-0-a64152b37c35")
|
|
61
|
+
payload = data['payload']
|
|
62
|
+
val = payload['value'] # article fields
|
|
63
|
+
refs = payload['references'] # User, Social, SocialStats dicts keyed by ID
|
|
64
|
+
|
|
65
|
+
# --- Article fields ---
|
|
66
|
+
title = val['title'] # "Software 2.0"
|
|
67
|
+
article_id = val['id'] # "a64152b37c35"
|
|
68
|
+
creator_id = val['creatorId'] # "ac9d9a35533e"
|
|
69
|
+
slug = val['uniqueSlug'] # "software-2-0-a64152b37c35"
|
|
70
|
+
url = val['canonicalUrl'] # "https://medium.com/@karpathy/..."
|
|
71
|
+
first_pub = val['firstPublishedAt'] # unix ms: 1510438733751
|
|
72
|
+
last_pub = val['latestPublishedAt'] # unix ms: 1615659523264
|
|
73
|
+
visibility = val['visibility'] # 0=public, 2=subscriber-locked
|
|
74
|
+
is_locked = val['isSubscriptionLocked'] # True if paywalled
|
|
75
|
+
locked_src = val['lockedPostSource'] # 0=free, 1=Medium Partner Program
|
|
76
|
+
|
|
77
|
+
# --- Metrics (in val['virtuals']) ---
|
|
78
|
+
virtuals = val['virtuals']
|
|
79
|
+
clap_count = virtuals['totalClapCount'] # 60865 (all claps, including multi-clap)
|
|
80
|
+
recommends = virtuals['recommends'] # 8846 (unique clappers)
|
|
81
|
+
read_time = virtuals['readingTime'] # 8.79811320754717 (minutes, float)
|
|
82
|
+
word_count = virtuals['wordCount'] # 2146
|
|
83
|
+
|
|
84
|
+
# --- Tags ---
|
|
85
|
+
tags = [t['slug'] for t in virtuals['tags']]
|
|
86
|
+
# ['machine-learning', 'artificial-intelligence', 'programming', 'software-development', 'future']
|
|
87
|
+
|
|
88
|
+
# --- Author (from references) ---
|
|
89
|
+
user = refs['User'][creator_id]
|
|
90
|
+
author_name = user['name'] # "Andrej Karpathy"
|
|
91
|
+
author_handle = user['username'] # "karpathy"
|
|
92
|
+
author_bio = user['bio'] # "I like to train deep neural nets on large datasets."
|
|
93
|
+
author_twitter = user['twitterScreenName'] # "karpathy"
|
|
94
|
+
|
|
95
|
+
# --- Follower count (from SocialStats) ---
|
|
96
|
+
ss = refs['SocialStats'][creator_id]
|
|
97
|
+
follower_count = ss['usersFollowedByCount'] # 60027
|
|
98
|
+
following_count = ss['usersFollowedCount'] # 183
|
|
99
|
+
```
|
|
100
|
+
|
|
101
|
+
### Detect paywall
|
|
102
|
+
|
|
103
|
+
```python
|
|
104
|
+
# Paywalled (Medium Partner Program): isSubscriptionLocked=True, visibility=2, lockedPostSource=1
|
|
105
|
+
# Free: isSubscriptionLocked=False, visibility=0, lockedPostSource=0
|
|
106
|
+
is_paywalled = val['isSubscriptionLocked'] # True / False
|
|
107
|
+
```
|
|
108
|
+
|
|
109
|
+
Confirmed on real TDS articles: paywalled articles return `isSubscriptionLocked=True`, `visibility=2`, `lockedPostSource=1`. Free articles: all three are `False`/`0`.
|
|
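As a rough sketch of how the three paywall fields could be checked together, the helper below summarises them and flags combinations that differ from the two patterns reported above. The function name `paywall_status` and the sample dicts are hypothetical; only the field names and values come from the text.

```python
def paywall_status(val):
    """Summarise paywall fields from a ?format=json article `value` dict."""
    locked = bool(val.get("isSubscriptionLocked", False))
    visibility = val.get("visibility")
    source = val.get("lockedPostSource")
    # The two combinations observed in the text: (True, 2, 1) and (False, 0, 0).
    consistent = (locked, visibility, source) in {(True, 2, 1), (False, 0, 0)}
    return {
        "is_paywalled": locked,
        "visibility": visibility,
        "locked_source": source,
        "consistent": consistent,
    }

# Hypothetical inputs mirroring the two observed patterns.
free = paywall_status({"isSubscriptionLocked": False, "visibility": 0, "lockedPostSource": 0})
paid = paywall_status({"isSubscriptionLocked": True, "visibility": 2, "lockedPostSource": 1})
print(free["is_paywalled"], paid["is_paywalled"])
```

An inconsistent result does not necessarily mean an error; it only signals a combination that was not among those tested.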
110
|
+
|
|
111
|
+
### Article body
|
|
112
|
+
|
|
113
|
+
The full body is in `val['content']['bodyModel']['paragraphs']` — a list of dicts:
|
|
114
|
+
|
|
115
|
+
```python
|
|
116
|
+
paragraphs = val['content']['bodyModel']['paragraphs']
|
|
117
|
+
|
|
118
|
+
# Paragraph types (confirmed for this article):
|
|
119
|
+
# type=1 -> body text (P)
|
|
120
|
+
# type=3 -> heading (H1/H2)
|
|
121
|
+
# type=4 -> image (text is empty; metadata has image ID)
|
|
122
|
+
|
|
123
|
+
# Reconstruct plain text:
|
|
124
|
+
text_paras = [p['text'] for p in paragraphs if p.get('text')]
|
|
125
|
+
full_text = '\n\n'.join(text_paras)
|
|
126
|
+
```
|
|
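If headings and images should stay visible in the reconstructed text, the paragraph types listed above can be used to label them. This is a minimal sketch: the `## ` heading marker and `[image]` placeholder are arbitrary output choices, the sample paragraphs are made up, and any type other than 3 or 4 is treated as plain text because only three types were confirmed.

```python
def paragraphs_to_text(paragraphs):
    """Render bodyModel paragraphs as plain text with simple markers."""
    lines = []
    for p in paragraphs:
        ptype = p.get("type")
        text = p.get("text", "")
        if ptype == 3 and text:
            lines.append("## " + text)   # heading
        elif ptype == 4:
            lines.append("[image]")      # image paragraphs carry no text
        elif text:
            lines.append(text)           # body text and unconfirmed types
    return "\n\n".join(lines)

# Hypothetical paragraphs in the shape described above.
sample_paragraphs = [
    {"type": 3, "text": "Software 2.0"},
    {"type": 1, "text": "First body paragraph."},
    {"type": 4, "text": ""},
]
print(paragraphs_to_text(sample_paragraphs))
```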
127
|
+
|
|
128
|
+
---
|
|
129
|
+
|
|
130
|
+
## Path 2: GraphQL API — targeted metric lookups
|
|
131
|
+
|
|
132
|
+
`POST https://medium.com/_/graphql` with a JSON body. No auth, no CSRF token required.
|
|
133
|
+
Returns HTTP 200 with JSON even for unauthenticated queries. Invalid fields return HTTP 400 — do not assume a field exists without testing first.
|
|
134
|
+
|
|
135
|
+
```python
|
|
136
|
+
import json, urllib.request, gzip
|
|
137
|
+
|
|
138
|
+
def gql(query):
|
|
139
|
+
body = json.dumps({"query": query}).encode()
|
|
140
|
+
req = urllib.request.Request(
|
|
141
|
+
"https://medium.com/_/graphql",
|
|
142
|
+
data=body,
|
|
143
|
+
headers={
|
|
144
|
+
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
|
|
145
|
+
"Content-Type": "application/json",
|
|
146
|
+
"Accept": "application/json",
|
|
147
|
+
"Accept-Encoding": "gzip",
|
|
148
|
+
},
|
|
149
|
+
method="POST",
|
|
150
|
+
)
|
|
151
|
+
with urllib.request.urlopen(req, timeout=20) as r:
|
|
152
|
+
raw = r.read()
|
|
153
|
+
if r.headers.get("Content-Encoding") == "gzip":
|
|
154
|
+
raw = gzip.decompress(raw)
|
|
155
|
+
return json.loads(raw.decode())
|
|
156
|
+
```
|
|
157
|
+
|
|
158
|
+
### Fetch article metrics (fastest)
|
|
159
|
+
|
|
160
|
+
```python
|
|
161
|
+
result = gql("""
|
|
162
|
+
{
|
|
163
|
+
post(id: "a64152b37c35") {
|
|
164
|
+
title
|
|
165
|
+
id
|
|
166
|
+
firstPublishedAt
|
|
167
|
+
latestPublishedAt
|
|
168
|
+
visibility
|
|
169
|
+
uniqueSlug
|
|
170
|
+
canonicalUrl
|
|
171
|
+
mediumUrl
|
|
172
|
+
isLocked
|
|
173
|
+
clapCount
|
|
174
|
+
readingTime
|
|
175
|
+
wordCount
|
|
176
|
+
}
|
|
177
|
+
}
|
|
178
|
+
""")
|
|
179
|
+
post = result['data']['post']
|
|
180
|
+
# post['visibility'] -> "PUBLIC" | "LOCKED" (string, not numeric)
|
|
181
|
+
# post['isLocked'] -> False | True
|
|
182
|
+
# post['clapCount'] -> 60865 (same as totalClapCount in format=json)
|
|
183
|
+
# post['readingTime'] -> 8.79811320754717 (minutes)
|
|
184
|
+
# post['wordCount'] -> 2146
|
|
185
|
+
```
|
|
186
|
+
|
|
187
|
+
**Confirmed working `post()` fields:** `title`, `id`, `createdAt`, `updatedAt`, `firstPublishedAt`, `latestPublishedAt`, `visibility`, `uniqueSlug`, `canonicalUrl`, `mediumUrl`, `isLocked`, `clapCount`, `readingTime`, `wordCount`
|
|
188
|
+
|
|
189
|
+
**Nested objects that work:** `topics { name slug }`, `creator { name username }`, `collection { name id slug description domain creator { name username } }`
|
|
190
|
+
|
|
191
|
+
**Fields that return HTTP 400 (not available):** `tags`, `author`, `recommends`, `content`, `publication`, `responses`, `sequence`
|
|
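Because an unknown field fails the whole request with HTTP 400, one option is to build queries only from the fields listed above as confirmed. The sketch below assumes that approach; `build_post_query` and `CONFIRMED_POST_FIELDS` are hypothetical names, and the field set is copied from the confirmed list rather than from any published schema.

```python
import json

CONFIRMED_POST_FIELDS = {
    "title", "id", "createdAt", "updatedAt", "firstPublishedAt",
    "latestPublishedAt", "visibility", "uniqueSlug", "canonicalUrl",
    "mediumUrl", "isLocked", "clapCount", "readingTime", "wordCount",
}

def build_post_query(post_id, fields):
    """Return a post() query string, rejecting fields not known to work."""
    unknown = [f for f in fields if f not in CONFIRMED_POST_FIELDS]
    if unknown:
        raise ValueError(f"unconfirmed post() fields: {unknown}")
    selection = "\n    ".join(fields)
    # json.dumps quotes and escapes the ID as a string literal.
    return "{\n  post(id: %s) {\n    %s\n  }\n}" % (json.dumps(post_id), selection)

query = build_post_query("a64152b37c35", ["title", "clapCount", "isLocked"])
print(query)
```

The resulting string can be passed to the `gql()` helper defined earlier; the field set will need updating if Medium changes its schema.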
192
|
+
|
|
193
|
+
### Fetch author profile
|
|
194
|
+
|
|
195
|
+
```python
|
|
196
|
+
result = gql("""
|
|
197
|
+
{
|
|
198
|
+
user(username: "karpathy") {
|
|
199
|
+
name
|
|
200
|
+
username
|
|
201
|
+
id
|
|
202
|
+
bio
|
|
203
|
+
imageId
|
|
204
|
+
twitterScreenName
|
|
205
|
+
mediumMemberAt
|
|
206
|
+
socialStats {
|
|
207
|
+
followerCount
|
|
208
|
+
followingCount
|
|
209
|
+
}
|
|
210
|
+
}
|
|
211
|
+
}
|
|
212
|
+
""")
|
|
213
|
+
user = result['data']['user']
|
|
214
|
+
# user['name'] -> "Andrej Karpathy"
|
|
215
|
+
# user['id'] -> "ac9d9a35533e"
|
|
216
|
+
# user['bio'] -> "I like to train deep neural nets on large datasets."
|
|
217
|
+
# user['twitterScreenName'] -> "karpathy"
|
|
218
|
+
# user['socialStats']['followerCount'] -> 60028
|
|
219
|
+
# user['mediumMemberAt'] -> 0 (not a member); nonzero = unix ms join date
|
|
220
|
+
```
|
|
221
|
+
|
|
222
|
+
**Confirmed working `user()` fields:** `name`, `username`, `id`, `bio`, `imageId`, `twitterScreenName`, `mediumMemberAt`, `socialStats { followerCount followingCount }`
|
|
223
|
+
|
|
224
|
+
**Fields that return HTTP 400:** `followerCount` (top-level), `followingCount` (top-level), `postCount`
|
|
225
|
+
|
|
226
|
+
### Fetch collection (publication) by ID
|
|
227
|
+
|
|
228
|
+
The GraphQL `collection()` query only accepts `id`, not `slug`. Get the ID from `?format=json` on the publication page.
|
|
229
|
+
|
|
230
|
+
```python
|
|
231
|
+
# TDS Archive id: 7f60cf5620c9 (from medium.com/towards-data-science?format=json)
|
|
232
|
+
result = gql("""
|
|
233
|
+
{
|
|
234
|
+
collection(id: "7f60cf5620c9") {
|
|
235
|
+
name
|
|
236
|
+
id
|
|
237
|
+
slug
|
|
238
|
+
description
|
|
239
|
+
domain
|
|
240
|
+
creator { name username }
|
|
241
|
+
}
|
|
242
|
+
}
|
|
243
|
+
""")
|
|
244
|
+
coll = result['data']['collection']
|
|
245
|
+
# coll['name'] -> "TDS Archive"
|
|
246
|
+
# coll['slug'] -> "data-science"
|
|
247
|
+
```
|
|
248
|
+
|
|
249
|
+
---
|
|
250
|
+
|
|
251
|
+
## Path 3: RSS feeds (best for recent posts list + article bodies)
|
|
252
|
+
|
|
253
|
+
Works with plain `http_get`. Returns up to 10 most recent posts. Full article HTML is in `content:encoded`. No clap count or visibility info in RSS.
|
|
254
|
+
|
|
255
|
+
```python
|
|
256
|
+
import re
|
|
257
|
+
from helpers import http_get
|
|
258
|
+
|
|
259
|
+
def parse_rss_items(rss_xml):
|
|
260
|
+
"""Extract items from Medium RSS feed. Returns list of dicts."""
|
|
261
|
+
def cdata(tag, text):
|
|
262
|
+
m = re.search(rf'<{tag}[^>]*><!\[CDATA\[(.*?)\]\]></{tag}>', text, re.DOTALL)
|
|
263
|
+
return m.group(1).strip() if m else None
|
|
264
|
+
|
|
265
|
+
items = []
|
|
266
|
+
for raw in re.findall(r'<item>(.*?)</item>', rss_xml, re.DOTALL):
|
|
267
|
+
# link is plain text (not CDATA)
|
|
268
|
+
link_m = re.search(r'<link>(.*?)</link>', raw, re.DOTALL)
|
|
269
|
+
items.append({
|
|
270
|
+
'title': cdata('title', raw),
|
|
271
|
+
'link': link_m.group(1).strip() if link_m else None,
|
|
272
|
+
'pubDate': cdata('pubDate', raw),
|
|
273
|
+
'creator': cdata('dc:creator', raw),
|
|
274
|
+
'tags': re.findall(r'<category><!\[CDATA\[(.*?)\]\]></category>', raw),
|
|
275
|
+
'body_html': cdata('content:encoded', raw), # full article HTML
|
|
276
|
+
})
|
|
277
|
+
return items
|
|
278
|
+
|
|
279
|
+
# User feed (up to 10 latest posts)
|
|
280
|
+
rss = http_get("https://medium.com/feed/@karpathy")
|
|
281
|
+
posts = parse_rss_items(rss)
|
|
282
|
+
# posts[0]['title'] -> "Software 2.0"
|
|
283
|
+
# posts[0]['pubDate'] -> "Sat, 11 Nov 2017 22:18:53 GMT"
|
|
284
|
+
# posts[0]['creator'] -> "Andrej Karpathy"
|
|
285
|
+
# posts[0]['tags'] -> ['programming', 'software-development', 'artificial-intelligence', 'future', 'machine-learning']
|
|
286
|
+
# posts[0]['link'] -> "https://karpathy.medium.com/software-2-0-a64152b37c35?source=rss-..."
|
|
287
|
+
# posts[0]['body_html'] -> full article body as HTML string (~15KB for this article)
|
|
288
|
+
|
|
289
|
+
# Publication feed (up to 10 latest posts)
|
|
290
|
+
rss_pub = http_get("https://medium.com/feed/towards-data-science")
|
|
291
|
+
pub_posts = parse_rss_items(rss_pub)
|
|
292
|
+
```
|
|
293
|
+
|
|
294
|
+
**RSS limitations:**
|
|
295
|
+
- RSS does not include clap count, view count, or paywall status.
|
|
296
|
+
- `body_html` contains the full article body as HTML, including `<p>`, `<strong>`, `<a>`, `<img>` tags.
|
|
297
|
+
- Pagination is not supported — RSS always returns the 10 most recent posts.
|
|
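Two clean-up steps often follow RSS parsing: removing the tracking query from each link and reducing `body_html` to plain text. The sketch below shows one way to do both with the standard library; the helper names and sample strings are illustrative, and the tag-stripping regex is a rough approximation, not a full HTML parser.

```python
import re
from html import unescape

def clean_rss_link(link):
    """Drop the ?source=rss-... tracking query from an RSS item link."""
    return link.split("?")[0]

def html_to_text(body_html):
    """Crudely strip tags, decode entities, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", body_html)   # each tag becomes a space
    return re.sub(r"\s+", " ", unescape(text)).strip()

# Hypothetical sample values in the shape of the parsed RSS items above.
link = "https://karpathy.medium.com/software-2-0-a64152b37c35?source=rss-abc------2"
print(clean_rss_link(link))
print(html_to_text("<p>Hello <strong>world</strong> &amp; more</p>"))
```

Replacing tags with spaces keeps adjacent paragraphs from running together, at the cost of splitting a word that has inline markup in the middle.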
298
|
+
|
|
299
|
+
---
|
|
300
|
+
|
|
301
|
+
## Path 4: `?format=json` on user profile — recent posts with metrics
|
|
302
|
+
|
|
303
|
+
Better than RSS when you need clap counts alongside the post list. Returns up to `limit` posts (default 10) plus full author metadata.
|
|
304
|
+
|
|
305
|
+
```python
|
|
306
|
+
data = medium_json("https://medium.com/@karpathy?limit=10")
|
|
307
|
+
payload = data['payload']
|
|
308
|
+
|
|
309
|
+
user = payload['user']
|
|
310
|
+
# user['name'] -> "Andrej Karpathy"
|
|
311
|
+
# user['username'] -> "karpathy"
|
|
312
|
+
# user['bio'] -> "I like to train deep neural nets on large datasets."
|
|
313
|
+
|
|
314
|
+
refs = payload['references']
|
|
315
|
+
ss = refs['SocialStats'][user['userId']]
|
|
316
|
+
# ss['usersFollowedByCount'] -> 60028 (followers)
|
|
317
|
+
# ss['usersFollowedCount'] -> 183 (following)
|
|
318
|
+
|
|
319
|
+
posts = refs.get('Post', {}) # dict keyed by post ID
|
|
320
|
+
for pid, p in posts.items():
|
|
321
|
+
v = p['virtuals']
|
|
322
|
+
print(p['title'], v['totalClapCount'], round(v['readingTime'], 1))
|
|
323
|
+
|
|
324
|
+
# Paginate: use paging['next'] from payload
|
|
325
|
+
paging = payload['paging']
|
|
326
|
+
next_params = paging['next']
|
|
327
|
+
# next_params = {'limit': 10, 'to': '1495652975362', 'source': 'overview', 'page': 2, 'ignoredIds': []}
|
|
328
|
+
# Append as query params to the same profile URL to get next page
|
|
329
|
+
next_url = (
|
|
330
|
+
f"https://medium.com/@{user['username']}"
|
|
331
|
+
f"?limit={next_params['limit']}&to={next_params['to']}"
|
|
332
|
+
f"&source={next_params['source']}&page={next_params['page']}"
|
|
333
|
+
)
|
|
334
|
+
data2 = medium_json(next_url)
|
|
335
|
+
# Note: karpathy has only 8 total posts — pagination returns same refs on page 2
|
|
336
|
+
```
|
|
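The hand-built URL above can also be produced with `urllib.parse.urlencode`, which takes care of escaping. This is a sketch under the assumption that only scalar values from `paging['next']` need to be forwarded, as in the example above where `ignoredIds` is left out; `next_page_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

def next_page_url(username, next_params):
    """Build the next-page profile URL from a paging['next'] dict."""
    # Skip list/dict values such as 'ignoredIds', matching the manual URL above.
    query = {k: v for k, v in next_params.items() if not isinstance(v, (list, dict))}
    return f"https://medium.com/@{username}?{urlencode(query)}"

next_params = {"limit": 10, "to": "1495652975362", "source": "overview",
               "page": 2, "ignoredIds": []}
print(next_page_url("karpathy", next_params))
```

The result can be passed to `medium_json()`, which appends `&format=json` because the URL already contains a query string.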
337
|
+
|
|
338
|
+
---
|
|
339
|
+
|
|
340
|
+
## Path 5: `?format=json` on publication page
|
|
341
|
+
|
|
342
|
+
Returns publication metadata and recent posts with metrics.
|
|
343
|
+
|
|
344
|
+
```python
|
|
345
|
+
data = medium_json("https://medium.com/towards-data-science")
|
|
346
|
+
payload = data['payload']
|
|
347
|
+
|
|
348
|
+
coll = payload['collection']
|
|
349
|
+
# coll['name'] -> "TDS Archive"
|
|
350
|
+
# coll['slug'] -> "data-science"
|
|
351
|
+
# coll['description'] -> full description string
|
|
352
|
+
# coll['subscriberCount'] -> 828527
|
|
353
|
+
# coll['metadata']['followerCount'] -> 828527
|
|
354
|
+
# coll['tags'] -> ['DATA SCIENCE', 'MACHINE LEARNING', ...]
|
|
355
|
+
|
|
356
|
+
posts = payload['references'].get('Post', {})
|
|
357
|
+
for pid, p in posts.items():
|
|
358
|
+
v = p['virtuals']
|
|
359
|
+
print(p['title'], v['totalClapCount'], p['isSubscriptionLocked'])
|
|
360
|
+
# Also includes: p['visibility'] (0=free, 2=paywalled)
|
|
361
|
+
|
|
362
|
+
# Paginate (same pattern as user profile)
|
|
363
|
+
paging = payload['paging']
|
|
364
|
+
# paging['next'] = {'to': '1738573325936', 'page': 3}
|
|
365
|
+
```
|
|
366
|
+
|
|
367
|
+
---
|
|
368
|
+
|
|
369
|
+
## Retrieving the article ID from a URL
|
|
370
|
+
|
|
371
|
+
The `id` is the last 12 hex chars of a Medium article URL slug:
|
|
372
|
+
|
|
373
|
+
```python
|
|
374
|
+
import re
|
|
375
|
+
|
|
376
|
+
url = "https://medium.com/@karpathy/software-2-0-a64152b37c35"
|
|
377
|
+
article_id = re.search(r'-([a-f0-9]{12})$', url.split('?')[0].rstrip('/'))  # drop the query first, then any trailing slash
|
|
378
|
+
if article_id:
|
|
379
|
+
article_id = article_id.group(1) # "a64152b37c35"
|
|
380
|
+
```
|
|
381
|
+
|
|
382
|
+
This ID is the same across all URL forms (`medium.com/@user/slug`, `user.medium.com/slug`, `medium.com/publication/slug`).
|
|
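A small function makes the extraction reusable across those URL forms. The sketch below parses the URL first so that query strings and trailing slashes do not interfere; `extract_article_id` is a hypothetical name, the publication URL in the example is made up, and the 12-hex-character pattern is taken from the description above.

```python
import re
from urllib.parse import urlsplit

def extract_article_id(url):
    """Return the trailing 12-hex-char article ID, or None if absent."""
    path = urlsplit(url).path.rstrip("/")
    match = re.search(r"-([a-f0-9]{12})$", path)
    return match.group(1) if match else None

urls = [
    "https://medium.com/@karpathy/software-2-0-a64152b37c35",
    "https://karpathy.medium.com/software-2-0-a64152b37c35?source=rss-abc",
    "https://medium.com/some-publication/software-2-0-a64152b37c35/",
    "https://medium.com/towards-data-science",  # publication page, no article ID
]
for u in urls:
    print(extract_article_id(u))
```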
383
|
+
|
|
384
|
+
---
|
|
385
|
+
|
|
386
|
+
## Gotchas
|
|
387
|
+
|
|
388
|
+
- **HTTP 403 on plain `http_get`** — The default `http_get` helper sends `User-Agent: Mozilla/5.0` which Medium accepts for most endpoints, but article HTML pages (without `?format=json`) return 403. Always use `?format=json` for article and profile pages.
|
|
389
|
+
|
|
390
|
+
- **`?format=json` works; profile stream API does not** — `https://medium.com/_/api/users/{id}/profile/stream` returns HTTP 403 for unauthenticated requests. Use `?format=json` on the profile URL instead.
|
|
391
|
+
|
|
392
|
+
- **`?format=json` on search pages returns 403 or broken JSON** — `medium.com/search?q=...&format=json` and `medium.com/search/posts?q=...&format=json` both fail. Search is not available without auth.
|
|
393
|
+
|
|
394
|
+
- **GraphQL `collection()` requires ID, not slug** — `collection(slug: "towards-data-science")` returns HTTP 400. You must use the 12-character hex ID (e.g. `"7f60cf5620c9"`). Get it from `?format=json` on the publication page: `payload['collection']['id']`.
|
|
395
|
+
|
|
396
|
+
- **GraphQL `tags` field on `post()` returns HTTP 400** — Use `topics { name slug }` instead. Topics are a subset of tags but work without auth.
|
|
397
|
+
|
|
398
|
+
- **GraphQL visibility is a string, not a number** — `post().visibility` returns `"PUBLIC"` or `"LOCKED"` (string). The `?format=json` `value.visibility` field uses integers: `0`=public, `2`=locked. Both agree on the lock status.
|
|
399
|
+
|
|
400
|
+
- **`totalClapCount` vs `recommends`** — `totalClapCount` (60865) counts all claps (Medium allows up to 50 claps per reader). `recommends` (8846) counts unique clappers. The GraphQL `clapCount` field equals `totalClapCount`, not `recommends`.
|
|
401
|
+
|
|
402
|
+
- **RSS returns at most 10 items, no clap counts** — RSS is best for getting recent article links + full HTML body. Use `?format=json` profile if you need metrics.
|
|
403
|
+
|
|
404
|
+
- **RSS link contains tracking params** — `posts[0]['link']` includes `?source=rss-{userId}------2`. Strip with `.split('?')[0]` if you need a clean URL.
|
|
405
|
+
|
|
406
|
+
- **`content:encoded` in RSS is full HTML, not plaintext** — Strip HTML tags if you want plaintext: `re.sub(r'<[^>]+>', '', body_html)`.
|
|
407
|
+
|
|
408
|
+
- **Medium subdomains** — Some users have custom subdomains (`karpathy.medium.com`). Both `medium.com/@karpathy/...` and `karpathy.medium.com/...` resolve to the same article; `?format=json` works on both.
|
|
409
|
+
|
|
410
|
+
- **towardsdatascience.com is no longer Medium** — TDS moved to its own WordPress site. `towardsdatascience.com/article-slug?format=json` returns full WordPress HTML, not Medium JSON. Use `medium.com/towards-data-science` for the archived Medium publication.
|
|
411
|
+
|
|
412
|
+
- **No public search API** — Medium has no Algolia equivalent. Finding articles by keyword requires either a browser, or fetching a user/publication feed and filtering locally.
|
|
413
|
+
|
|
414
|
+
- **Timestamps are unix milliseconds** — `firstPublishedAt`, `createdAt`, `latestPublishedAt` are all in milliseconds. Convert (after `from datetime import datetime, timezone`): `datetime.fromtimestamp(val['firstPublishedAt'] / 1000, tz=timezone.utc)`.
|
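For the timestamp conversion in the last gotcha, a self-contained sketch with its imports might look like the following. The helper name `ms_to_datetime` is hypothetical; the sample value is the `firstPublishedAt` shown earlier for the "Software 2.0" article.

```python
from datetime import datetime, timezone

def ms_to_datetime(ms):
    """Convert a unix-millisecond timestamp to an aware UTC datetime."""
    return datetime.fromtimestamp(int(ms) / 1000, tz=timezone.utc)

first_pub = ms_to_datetime(1510438733751)
print(first_pub.strftime("%Y-%m-%d %H:%M:%S %Z"))
```

Accepting strings as well as integers is deliberate, since paging values such as `to` arrive as strings in the examples above.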