@pencil-agent/nano-pencil 2.0.1 → 2.0.2
- package/README.md +267 -267
- package/dist/build-meta.json +3 -3
- package/dist/core/export-html/AGENT.md +11 -11
- package/dist/core/export-html/template.css +971 -971
- package/dist/core/export-html/template.html +54 -54
- package/dist/core/model/custom-providers.js +1 -1
- package/dist/core/model-registry.js +5 -5
- package/dist/extensions/builtin/AGENT.md +115 -115
- package/dist/extensions/builtin/browser/AGENT.md +17 -17
- package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
- package/dist/extensions/builtin/browser/browser.md +73 -73
- package/dist/extensions/builtin/browser/install.md +142 -142
- package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
- package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
- package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
- package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
- package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
- package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
- package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
- package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
- package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
- package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
- package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
- package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
- package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
- package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
- package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
- package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
- package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
- package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
- package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
- package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
- package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
- package/dist/extensions/builtin/goal/README.md +67 -67
- package/dist/extensions/builtin/grub/README.md +112 -112
- package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
- package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
- package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
- package/dist/extensions/builtin/link-world/linkworld.md +313 -313
- package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
- package/dist/extensions/builtin/loop/README.md +92 -92
- package/dist/extensions/builtin/mcp/figma-design.md +68 -68
- package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
- package/dist/extensions/builtin/recap/AGENT.md +15 -15
- package/dist/extensions/builtin/sal/README.md +72 -72
- package/dist/extensions/builtin/security-audit/README.md +289 -289
- package/dist/extensions/builtin/team/AGENT.md +112 -112
- package/dist/extensions/builtin/team/TESTING.md +299 -299
- package/dist/extensions/builtin/token-save/README.md +56 -56
- package/dist/extensions/optional/AGENT.md +10 -10
- package/dist/modes/interactive/controllers/input-submit-controller.js +2 -2
- package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
- package/dist/modes/interactive/interactive-mode.js +19 -19
- package/dist/modes/interactive/theme/dark.json +85 -85
- package/dist/modes/interactive/theme/light.json +84 -84
- package/dist/modes/interactive/theme/theme-schema.json +335 -335
- package/dist/modes/interactive/theme/warm.json +81 -81
- package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
- package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
- package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
- package/docs/SDK-TESTING.md +364 -0
- package/docs/codex-goal-command-impl.md +1055 -1055
- package/docs/codex-goal-vs-grub.md +500 -500
- package/docs/custom-provider.md +27 -27
- package/docs/extensions.md +27 -27
- package/docs/keybindings.md +27 -27
- package/docs/loop 重构完成总结.md +250 -250
- package/docs/loop 重构完成报告.md +122 -122
- package/docs/loop 重构方案.md +1222 -1222
- package/docs/loop 重构方案实现报告.md +158 -158
- package/docs/loop 重构方案对比分析.md +128 -128
- package/docs/loop 重构计划.md +320 -320
- package/docs/loop-usage-examples.md +214 -214
- package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
- package/docs/models.md +27 -27
- package/docs/packages.md +27 -27
- package/docs/pi-design-philosophy.md +457 -457
- package/docs/planmode.md +1987 -1987
- package/docs/prompt-templates.md +27 -27
- package/docs/providers.md +27 -27
- package/docs/sdk.md +27 -27
- package/docs/skills.md +27 -27
- package/docs/startup-performance-optimization.md +301 -0
- package/docs/themes.md +27 -27
- package/docs/tui.md +27 -27
- package/docs/认知地图.md +47 -0
- package/package.json +190 -190
- package/docs/cc-agent-design.md +0 -1297
- package/docs/cc-tui-design.md +0 -1333
- package/docs/nanoPencil-学习计划.md +0 -170
- package/docs/scan-report.md +0 -3820
- package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
- package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +0 -261
@@ -1,360 +1,360 @@
-# Coursera — Course & Catalog Data Extraction
-
-Field-tested against coursera.org and api.coursera.org on 2026-04-18.
-No authentication required for the public catalog API.
-
-## TL;DR — Fastest Approach
-
-Use `http_get` against `api.coursera.org`. The public REST API returns clean JSON with no
-auth, no bot-detection, and sub-600ms latency. Use `q=search` with a keyword
-only when you need full-text search (requires a browser POST workaround — see below).
-For bulk enumeration, iterate the catalog list with `start` pagination.
-
----
-
-## 1. Catalog List (http_get — always works)
-
-The default list query (`q=list` implied) returns ALL courses in Coursera's catalog —
-20,659 as of the test date.
-
-```python
-from helpers import http_get
-import json
-
-resp = http_get(
-    "https://api.coursera.org/api/courses.v1"
-    "?fields=name,slug,description,primaryLanguages,workload,"
-    "partnerIds,courseType,instructorIds,domainTypes,photoUrl,certificates"
-    "&limit=100&start=0"
-)
-data = json.loads(resp)
-courses = data["elements"]  # list of dicts
-next_start = data["paging"].get("next")  # e.g. "100", None when exhausted
-total = data["paging"].get("total")  # 20659
-```
-
-### Response structure (confirmed field names)
-
-```json
-{
-  "courseType": "v2.ondemand",
-  "description": "Gamification is the application of game elements...",
-  "domainTypes": [
-    {"domainId": "computer-science", "subdomainId": "design-and-product"},
-    {"domainId": "business", "subdomainId": "marketing"}
-  ],
-  "photoUrl": "https://d3njjcbhbojbot.cloudfront.net/api/utilities/v1/imageproxy/https://coursera-course-photos.s3.amazonaws.com/...",
-  "id": "69Bku0KoEeWZtA4u62x6lQ",
-  "slug": "gamification",
-  "instructorIds": ["226710"],
-  "specializations": [],
-  "workload": "4-8 hours/week",
-  "primaryLanguages": ["en"],
-  "partnerIds": ["6"],
-  "certificates": ["VerifiedCert"],
-  "name": "Gamification"
-}
-```
-
-Field notes:
-- `id` — opaque base64-ish string, stable identifier. Use for batch lookups and linking.
-- `slug` — URL-safe identifier. Course page: `https://www.coursera.org/learn/{slug}`
-- `courseType` — always `"v2.ondemand"` for self-paced courses in practice.
-- `workload` — free-text string, e.g. `"4-8 hours/week"`, `"1 hour 30 minutes"`, `"4 weeks of study, 1-2 hours/week"`. Not normalized.
-- `primaryLanguages` — ISO 639-1 list, e.g. `["en"]`, `["fr"]`.
-- `partnerIds` — list of partner (university/org) IDs. Join to `partners.v1` by id.
-- `instructorIds` — list of instructor IDs. Join to `instructors.v1` by id.
-- `domainTypes` — list of `{domainId, subdomainId}` objects. Domain IDs include `"data-science"`, `"computer-science"`, `"business"`, `"information-technology"`.
-- `certificates` — list of cert types, typically `["VerifiedCert"]`.
-- `photoUrl` — direct CDN URL to course image. Works without auth.
-- `specializations` — list of specialization IDs this course belongs to (often empty; not always populated here — use `onDemandSpecializations.v1` instead).
-- `previewLink` — field exists but was empty in all tested records; skip it.
-- `avgRating` — field does NOT appear in public API responses; not available.
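
Because `workload` is free text, any numeric use of it needs a parsing step first. The following is a minimal sketch that assumes only the sample formats quoted in the field notes; `parse_hours_per_week` is a hypothetical helper (not part of the harness), and it returns `None` for any string that does not state a per-week rate.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern: matches "N hours/week" or "N-M hours/week" anywhere in the text.
_HOURS_PER_WEEK = re.compile(
    r"(\d+)(?:\s*-\s*(\d+))?\s*hours?\s*/\s*week", re.IGNORECASE
)


def parse_hours_per_week(workload: str) -> Optional[Tuple[int, int]]:
    """Return (low, high) hours per week, or None when no weekly rate is stated."""
    match = _HOURS_PER_WEEK.search(workload or "")
    if match is None:
        return None  # e.g. "1 hour 30 minutes" carries no per-week rate
    low = int(match.group(1))
    high = int(match.group(2)) if match.group(2) else low
    return (low, high)
```

Returning `None` rather than guessing keeps unrecognized formats visible to the caller, which seems safer than coercing them to a number.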
-
-### Pagination
-
-```python
-def iter_all_courses(fields=None, page_size=100):
-    base_fields = "name,slug,description,primaryLanguages,workload,partnerIds,courseType,domainTypes,photoUrl"
-    if fields:
-        base_fields = fields
-    start = 0
-    while True:
-        url = (
-            f"https://api.coursera.org/api/courses.v1"
-            f"?fields={base_fields}&limit={page_size}&start={start}"
-        )
-        data = json.loads(http_get(url))
-        yield from data["elements"]
-        nxt = data["paging"].get("next")
-        if nxt is None:
-            break
-        start = int(nxt)
-```
-
-- `paging.next` is a string offset (e.g. `"100"`), or absent when exhausted.
-- `paging.total` is present on the first page (e.g. `20659`) but absent on subsequent pages.
-- `limit` up to at least 1000 works (tested: 1000 returned 1000 items). Use 100–500 for safe batches.
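
The paging rules above can be exercised without network access by injecting the fetch step. This is a sketch under the response shape described in this section; `iter_pages`, `fetch`, and `max_pages` are illustrative names, and in the harness `fetch` would presumably wrap `http_get` plus `json.loads`.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_pages(
    fetch: Callable[[int], Dict], max_pages: Optional[int] = None
) -> Iterator[Dict]:
    """Yield elements page by page until paging.next is absent."""
    start = 0
    pages = 0
    while max_pages is None or pages < max_pages:
        data = fetch(start)
        yield from data.get("elements", [])
        nxt = data.get("paging", {}).get("next")
        if nxt is None:
            break  # no "next" key means the list is exhausted
        start = int(nxt)  # "next" is a string offset such as "100"
        pages += 1
```

The optional `max_pages` cap is an addition for safety during experiments; it is not something the API requires.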
-
----
-
-## 2. Partners API (http_get — works)
-
-422 partners (universities, companies) as of test date.
-
-```python
-resp = http_get(
-    "https://api.coursera.org/api/partners.v1"
-    "?fields=name,squareLogo,description,shortName&limit=50&start=0"
-)
-data = json.loads(resp)
-partners = data["elements"]
-# paging.next and paging.total follow same structure as courses
-```
-
-### Partner record structure
-
-```json
-{
-  "id": "6",
-  "name": "University of Pennsylvania",
-  "shortName": "penn",
-  "description": "The University of Pennsylvania (commonly referred to as Penn)...",
-  "squareLogo": "http://coursera-university-assets.s3.amazonaws.com/.../logo.png"
-}
-```
-
-### Partner by ID (with courseIds)
-
-```python
-resp = http_get(
-    "https://api.coursera.org/api/partners.v1"
-    "?ids=6&fields=name,squareLogo,description,shortName,courseIds"
-)
-data = json.loads(resp)
-partner = data["elements"][0]
-# partner["courseIds"] is a list of course ID strings (150+ for large universities)
-```
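
Since a course record carries only `partnerIds`, resolving partner names takes a second lookup and a client-side join. A minimal sketch, assuming both record lists have already been fetched; `attach_partner_names` and the `partnerNames` key are hypothetical and do not exist in the API.

```python
from typing import Dict, List


def attach_partner_names(courses: List[Dict], partners: List[Dict]) -> List[Dict]:
    """Return copies of the course records with a partnerNames list added."""
    names = {p["id"]: p.get("name", "") for p in partners}
    return [
        {
            **course,
            # Unknown partner IDs are skipped rather than raising.
            "partnerNames": [
                names[pid] for pid in course.get("partnerIds", []) if pid in names
            ],
        }
        for course in courses
    ]
```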
-
----
-
-## 3. Specializations API (http_get — works)
-
-```python
-resp = http_get(
-    "https://api.coursera.org/api/onDemandSpecializations.v1"
-    "?fields=name,slug,description,partnerIds,courseIds,tagline&limit=100&start=0"
-)
-data = json.loads(resp)
-specs = data["elements"]
-```
-
-### Specialization record structure
-
-```json
-{
-  "id": "AbCdEfGhIjKl",
-  "name": "SIEM Splunk",
-  "slug": "siem-splunk",
-  "tagline": "Learn SIEM fundamentals with Splunk",
-  "description": "Course Overview:\n\nIn the \"SIEM Splunk\" specialization course...",
-  "partnerIds": ["1441"],
-  "courseIds": ["pu2XQCuEEe6qTBJCf71DPw", "Xc46mVFkEe6a4wrvTcwXPw", "YH1ok1FXEe62cBI5JZME2w"]
-}
-```
-
-Note: Specializations paging does NOT include `paging.total` — iterate until `paging.next` is absent.
-
----
-
-## 4. Instructors API (http_get — works)
-
-Only useful for lookups by ID (from course `instructorIds`). The plain list endpoint
-returns many empty records (empty name/bio).
-
-```python
-# Lookup specific instructors by ID
-resp = http_get(
-    "https://api.coursera.org/api/instructors.v1"
-    "?ids=226710&fields=fullName,bio,department,title,photo"
-)
-data = json.loads(resp)
-instructor = data["elements"][0]
-```
-
-### Instructor record structure
-
-```json
-{
-  "id": "226710",
-  "fullName": "Kevin Werbach",
-  "title": "Professor of Legal Studies and Business Ethics",
-  "department": "Legal Studies and Business Ethics",
-  "bio": "Kevin Werbach is professor of Legal Studies...",
-  "photo": "https://d3njjcbhbojbot.cloudfront.net/api/utilities/v1/imageproxy/..."
-}
-```
-
----
-
-## 5. Batch ID Lookup
-
-Fetch multiple courses (or partners/instructors) in one request by passing a comma-separated `ids` list:
-
-```python
-ids = ",".join(["69Bku0KoEeWZtA4u62x6lQ", "hOzhxVNuEfCW8Q55q1kSNQ", "0HiU7Oe4EeWTAQ4yevf_oQ"])
-resp = http_get(
-    f"https://api.coursera.org/api/courses.v1"
-    f"?ids={ids}&fields=name,slug,description,primaryLanguages,workload,partnerIds"
-)
-data = json.loads(resp)
-# data["elements"] has exactly the courses you asked for
-```
-
-Batches of up to 3 IDs were tested and succeeded; larger batches were not tried, so the per-request ID limit is unknown.
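
Since only three IDs per request were actually tested, it may be safer to split long ID lists into fixed-size chunks than to rely on an unverified upper bound. A sketch; `batch_urls` is a hypothetical helper and the default chunk size of 50 is an assumption, not a value confirmed against the API.

```python
from typing import Iterator, List

BASE = "https://api.coursera.org/api/courses.v1"


def batch_urls(ids: List[str], fields: str, chunk_size: int = 50) -> Iterator[str]:
    """Yield one lookup URL per chunk of at most chunk_size IDs."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for i in range(0, len(ids), chunk_size):
        chunk = ",".join(ids[i:i + chunk_size])
        yield f"{BASE}?ids={chunk}&fields={fields}"
```

Each yielded URL can then be passed to `http_get` and the `elements` lists concatenated.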
-
----
-
-## 6. Keyword Search — BLOCKED for GET (405)
-
-`q=search&query=...` returns **HTTP 405 Method Not Allowed** on GET.
-This applies to all three resource types:
-- `courses.v1?q=search&query=python` → 405
-- `onDemandSpecializations.v1?q=search&query=data+science` → 405
-- `partners.v1?q=search&query=stanford` → 405
-
-The search endpoint requires a POST request (Coursera's public Autocomplete/Search
-service). For keyword-based discovery without a browser, use the catalog list and filter
-client-side, or use the browser approach below.
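
The client-side filtering mentioned above could look like the following sketch. It assumes records fetched with at least the `name` and `description` fields; `filter_courses` is a hypothetical helper, and plain case-insensitive substring matching stands in for real full-text search.

```python
from typing import Dict, Iterable, Iterator


def filter_courses(courses: Iterable[Dict], keyword: str) -> Iterator[Dict]:
    """Yield courses whose name or description contains the keyword."""
    needle = keyword.casefold()
    for course in courses:
        haystack = f"{course.get('name', '')} {course.get('description', '')}".casefold()
        if needle in haystack:
            yield course
```

Because it accepts any iterable, it can consume a paginating generator directly and filter records as pages arrive.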
-
-### Browser fallback for keyword search
-
-```python
-new_tab("https://www.coursera.org/search?query=machine+learning")
-wait_for_load()
-wait(3)  # Results load asynchronously via React
-capture_screenshot()
-```
-
-Note: The search results page (`/search?query=...`) is a client-rendered React app. The
-HTML returned by `http_get` does NOT contain course cards — it's a bare shell with no
-`__NEXT_DATA__` or embedded JSON. A live browser is required to see rendered results.
-
----
-
-## 7. Course Detail HTML Page (http_get — works, limited data)
-
-```python
-html = http_get("https://www.coursera.org/learn/machine-learning")
-# html is ~980KB of server-rendered HTML (no NEXT_DATA, no Apollo state)
-```
-
-The course detail page IS served as full HTML (no JS-gate), but contains very
-little machine-readable course data. What you can extract:
-
-```python
-import re, json
-
-# Page title (includes course name)
-title = re.search(r'<title[^>]*>(.*?)</title>', html).group(1)
-# "Supervised Machine Learning: Regression and Classification | Coursera"
-
-# JSON-LD blocks (2 present)
-jsonld_blocks = re.findall(r'<script type="application/ld\+json">(.*?)</script>', html, re.DOTALL)
-# Block 0: FAQPage schema (common Q&A about how courses work)
-# Block 1: BreadcrumbList (category path, e.g. Browse > Data Science > Machine Learning)
-faq = json.loads(jsonld_blocks[0])  # {"@type": "FAQPage", "mainEntity": [...]}
-crumb = json.loads(jsonld_blocks[1])  # {"@type": "BreadcrumbList", "itemListElement": [...]}
-
-# Extract breadcrumb categories
-categories = [item["item"]["name"] for item in crumb["@graph"][0]["itemListElement"]]
-# e.g. ["Browse", "Data Science", "Machine Learning"]
-```
-
-The HTML does NOT embed: description, rating, instructor names, enrollment count,
-price, or any course-specific metadata as machine-readable fields.
-Use the API (`courses.v1?ids=...`) to get those from the slug.
-
-### Slug-to-ID lookup pattern
-
-```python
-# Get course data from slug (need ID first — get it from catalog or search)
-# Pattern: enumerate catalog, match by slug (first page shown; advance `start` until found)
-resp = http_get("https://api.coursera.org/api/courses.v1?fields=name,slug,description&limit=100&start=0")
-data = json.loads(resp)
-by_slug = {el["slug"]: el for el in data["elements"]}
-course = by_slug.get("machine-learning")  # None if the slug is not on this page
-```
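
The snippet above inspects a single page of 100 records, so a given slug may well be missing from it. One way to cover the whole catalog while stopping at the first match is a lazy search over a paginating generator. A sketch; `find_by_slug` is a hypothetical helper.

```python
from typing import Dict, Iterable, Optional


def find_by_slug(courses: Iterable[Dict], slug: str) -> Optional[Dict]:
    """Return the first record with a matching slug, or None if none matches."""
    # next() over a generator expression stops consuming the input at the first hit.
    return next((c for c in courses if c.get("slug") == slug), None)
```

Used with the `iter_all_courses` generator from the Pagination section, for example `find_by_slug(iter_all_courses(fields="name,slug"), "machine-learning")`, no further pages are requested once the slug is found.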
-
----
-
-## Endpoints Summary
-
-| Endpoint | Method | Result |
-|---|---|---|
-| `courses.v1` (list) | GET | 200 OK — full catalog, 20,659 courses |
-| `courses.v1?ids=...` | GET | 200 OK — batch lookup by ID |
-| `courses.v1?q=search&query=...` | GET | **405 Method Not Allowed** |
-| `partners.v1` (list) | GET | 200 OK — 422 partners |
-| `partners.v1?ids=...` | GET | 200 OK — with courseIds |
-| `partners.v1?q=search&query=...` | GET | **405 Method Not Allowed** |
-| `onDemandSpecializations.v1` (list) | GET | 200 OK — paginated (no total) |
-| `onDemandSpecializations.v1?q=search&query=...` | GET | **405 Method Not Allowed** |
-| `instructors.v1?ids=...` | GET | 200 OK — rich records by ID |
-| `instructors.v1` (list) | GET | 200 OK — mostly empty records |
-| `degrees.v1` | GET | 403 Forbidden |
-| `/search?query=...` page HTML | GET | 200 OK — React shell only, no data |
-| `/learn/{slug}` page HTML | GET | 200 OK — HTML with JSON-LD breadcrumb only |
-
----
-
-## Rate Limits
-
-No rate limiting observed in testing:
-- 5 consecutive requests with no delay: all succeeded, avg 0.55s each.
-- No `X-RateLimit-*` or `Retry-After` headers in responses.
-- No auth headers needed for any working endpoint.
-
-Response headers that are present: `X-Coursera-Request-Id`, `X-Coursera-Trace-Id-Hex`,
-`x-envoy-upstream-service-time`. No rate-limit indicators.
-
-Use a small delay (0.5s) between requests if doing bulk enumeration of the full 20K+
-catalog as a courtesy, but no hard cap was observed.
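
The courtesy delay can be kept out of the fetching logic by wrapping whatever is being iterated (URLs, offsets, ID chunks). A sketch; `throttled` is a hypothetical helper, and the injectable `sleep` parameter exists only so the behaviour can be checked without actually waiting.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def throttled(
    items: Iterable[T],
    delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield items unchanged, pausing for `delay` seconds between consecutive items."""
    first = True
    for item in items:
        if not first:
            sleep(delay)  # no pause before the very first item
        first = False
        yield item
```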
-
----
-
-## Gotchas
-
-- **`q=search` is POST-only**: All three resource types (courses, specializations,
-partners) return 405 on GET when `q=search` is added. There is no documented public
-POST endpoint. For keyword filtering, enumerate the catalog and filter client-side.
-
-- **`paging.total` absent after page 1**: Only the first page response includes
-`paging.total`. Subsequent pages have only `paging.next`. Check for the `"next"` key
-being absent to detect end-of-list.
-
-- **Specializations never include `paging.total`**: The `onDemandSpecializations.v1`
-endpoint never returns `paging.total` in any page. Iterate until `"next"` is absent.
-
-- **`workload` is free-text, unnormalized**: Values include `"4-8 hours/week"`,
-`"1 hour 30 minutes"`, `"4 weeks of study, 1-2 hours/week"`. Do not parse as a number
-without normalization logic.
-
-- **`instructors.v1` list returns empty records**: The plain list endpoint returns many
-instructors with empty `fullName`, `bio`, `title`. Always look up by `ids=` using
-IDs from course records.
-
-- **`degrees.v1` is 403**: Degree programs are not accessible via the public API.
-
-- **HTML pages contain no embedded course data**: The search page is client-rendered
-React, so `http_get` on `/search?query=...` returns an HTML shell with no course
-listings. The course detail page is server-rendered, but `http_get` on `/learn/{slug}`
-yields only a FAQ JSON-LD and a breadcrumb JSON-LD, with no course description,
-rating, price, or enrollment data as machine-readable fields.
-
-- **`linked` resources don't populate**: Passing `includes=partners.v1` to the courses
-endpoint returns an empty `linked: {}` object. Cross-resource joins require separate
-requests by IDs.
-
-- **`previewLink` and `avgRating` fields**: These field names are accepted without error
-but return no data in the response objects. Do not request them.
|
|
1
|
+
# Coursera — Course & Catalog Data Extraction
|
|
2
|
+
|
|
3
|
+
Field-tested against coursera.org and api.coursera.org on 2026-04-18.
|
|
4
|
+
No authentication required for the public catalog API.
|
|
5
|
+
|
|
6
|
+
## TL;DR — Fastest Approach
|
|
7
|
+
|
|
8
|
+
Use `http_get` against `api.coursera.org`. The public REST API returns clean JSON with no
|
|
9
|
+
auth, no bot-detection, and sub-600ms latency. Keyword search via `q=search` returns
|
|
10
|
+
405 on GET, so full-text search needs the browser fallback described in section 6.
|
|
11
|
+
For bulk enumeration, iterate the catalog list with `start` pagination.
|
|
12
|
+
|
|
13
|
+
---
|
|
14
|
+
|
|
15
|
+
## 1. Catalog List (http_get — always works)
|
|
16
|
+
|
|
17
|
+
The default list query (`q=list` implied) returns ALL courses in Coursera's catalog —
|
|
18
|
+
20,659 as of the test date.
|
|
19
|
+
|
|
20
|
+
```python
|
|
21
|
+
from helpers import http_get
|
|
22
|
+
import json
|
|
23
|
+
|
|
24
|
+
resp = http_get(
|
|
25
|
+
"https://api.coursera.org/api/courses.v1"
|
|
26
|
+
"?fields=name,slug,description,primaryLanguages,workload,"
|
|
27
|
+
"partnerIds,courseType,instructorIds,domainTypes,photoUrl,certificates"
|
|
28
|
+
"&limit=100&start=0"
|
|
29
|
+
)
|
|
30
|
+
data = json.loads(resp)
|
|
31
|
+
courses = data["elements"] # list of dicts
|
|
32
|
+
next_start = data["paging"].get("next") # e.g. "100", None when exhausted
|
|
33
|
+
total = data["paging"].get("total") # 20659
|
|
34
|
+
```
|
|
35
|
+
|
|
36
|
+
### Response structure (confirmed field names)
|
|
37
|
+
|
|
38
|
+
```json
|
|
39
|
+
{
|
|
40
|
+
"courseType": "v2.ondemand",
|
|
41
|
+
"description": "Gamification is the application of game elements...",
|
|
42
|
+
"domainTypes": [
|
|
43
|
+
{"domainId": "computer-science", "subdomainId": "design-and-product"},
|
|
44
|
+
{"domainId": "business", "subdomainId": "marketing"}
|
|
45
|
+
],
|
|
46
|
+
"photoUrl": "https://d3njjcbhbojbot.cloudfront.net/api/utilities/v1/imageproxy/https://coursera-course-photos.s3.amazonaws.com/...",
|
|
47
|
+
"id": "69Bku0KoEeWZtA4u62x6lQ",
|
|
48
|
+
"slug": "gamification",
|
|
49
|
+
"instructorIds": ["226710"],
|
|
50
|
+
"specializations": [],
|
|
51
|
+
"workload": "4-8 hours/week",
|
|
52
|
+
"primaryLanguages": ["en"],
|
|
53
|
+
"partnerIds": ["6"],
|
|
54
|
+
"certificates": ["VerifiedCert"],
|
|
55
|
+
"name": "Gamification"
|
|
56
|
+
}
|
|
57
|
+
```
|
|
58
|
+
|
|
59
|
+
Field notes:
|
|
60
|
+
- `id` — opaque base64-ish string, stable identifier. Use for batch lookups and linking.
|
|
61
|
+
- `slug` — URL-safe identifier. Course page: `https://www.coursera.org/learn/{slug}`
|
|
62
|
+
- `courseType` — always `"v2.ondemand"` for self-paced courses in practice.
|
|
63
|
+
- `workload` — free-text string, e.g. `"4-8 hours/week"`, `"1 hour 30 minutes"`, `"4 weeks of study, 1-2 hours/week"`. Not normalized.
|
|
64
|
+
- `primaryLanguages` — ISO 639-1 list, e.g. `["en"]`, `["fr"]`.
|
|
65
|
+
- `partnerIds` — list of partner (university/org) IDs. Join to `partners.v1` by id.
|
|
66
|
+
- `instructorIds` — list of instructor IDs. Join to `instructors.v1` by id.
|
|
67
|
+
- `domainTypes` — list of `{domainId, subdomainId}` objects. Domain IDs include `"data-science"`, `"computer-science"`, `"business"`, `"information-technology"`.
|
|
68
|
+
- `certificates` — list of cert types, typically `["VerifiedCert"]`.
|
|
69
|
+
- `photoUrl` — direct CDN URL to course image. Works without auth.
|
|
70
|
+
- `specializations` — list of specialization IDs this course belongs to (often empty; not always populated here — use `onDemandSpecializations.v1` instead).
|
|
71
|
+
- `previewLink` — field exists but was empty in all tested records; skip it.
|
|
72
|
+
- `avgRating` — field does NOT appear in public API responses; not available.
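
The field notes above can be turned into a small flattening step before storing records. The following is a minimal sketch, assuming records shaped like the sample response; `flatten_course` and the output keys are hypothetical names chosen for illustration, not part of the Coursera API.

```python
def flatten_course(record):
    """Flatten one courses.v1 element into a simple row (hypothetical helper)."""
    # domainTypes is a list of {domainId, subdomainId}; keep unique domain IDs only.
    domains = sorted({d["domainId"] for d in record.get("domainTypes", [])})
    return {
        "id": record["id"],
        "name": record.get("name", ""),
        # Course page URL pattern taken from the slug field note.
        "url": f"https://www.coursera.org/learn/{record['slug']}",
        "languages": record.get("primaryLanguages", []),
        "domains": domains,
        "partner_ids": record.get("partnerIds", []),
        "instructor_ids": record.get("instructorIds", []),
    }


# Trimmed copy of the sample record shown in the response structure.
sample = {
    "id": "69Bku0KoEeWZtA4u62x6lQ",
    "slug": "gamification",
    "name": "Gamification",
    "primaryLanguages": ["en"],
    "partnerIds": ["6"],
    "instructorIds": ["226710"],
    "domainTypes": [
        {"domainId": "computer-science", "subdomainId": "design-and-product"},
        {"domainId": "business", "subdomainId": "marketing"},
    ],
}
row = flatten_course(sample)
```

Optional fields are read with `.get()` so that records missing a field still flatten cleanly.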
|
|
73
|
+
|
|
74
|
+
### Pagination
|
|
75
|
+
|
|
76
|
+
```python
|
|
77
|
+
def iter_all_courses(fields=None, page_size=100):
|
|
78
|
+
base_fields = "name,slug,description,primaryLanguages,workload,partnerIds,courseType,domainTypes,photoUrl"
|
|
79
|
+
if fields:
|
|
80
|
+
base_fields = fields
|
|
81
|
+
start = 0
|
|
82
|
+
while True:
|
|
83
|
+
url = (
|
|
84
|
+
f"https://api.coursera.org/api/courses.v1"
|
|
85
|
+
f"?fields={base_fields}&limit={page_size}&start={start}"
|
|
86
|
+
)
|
|
87
|
+
data = json.loads(http_get(url))
|
|
88
|
+
yield from data["elements"]
|
|
89
|
+
nxt = data["paging"].get("next")
|
|
90
|
+
if nxt is None:
|
|
91
|
+
break
|
|
92
|
+
start = int(nxt)
|
|
93
|
+
```
|
|
94
|
+
|
|
95
|
+
- `paging.next` is a string offset (e.g. `"100"`), or absent when exhausted.
|
|
96
|
+
- `paging.total` is present on the first page (e.g. `20659`) but absent on subsequent pages.
|
|
97
|
+
- `limit` up to at least 1000 works (tested: 1000 returned 1000 items). Use 100–500 for safe batches.
|
|
98
|
+
|
|
99
|
+
---
|
|
100
|
+
|
|
101
|
+
## 2. Partners API (http_get — works)
|
|
102
|
+
|
|
103
|
+
422 partners (universities, companies) as of test date.
|
|
104
|
+
|
|
105
|
+
```python
|
|
106
|
+
resp = http_get(
|
|
107
|
+
"https://api.coursera.org/api/partners.v1"
|
|
108
|
+
"?fields=name,squareLogo,description,shortName&limit=50&start=0"
|
|
109
|
+
)
|
|
110
|
+
data = json.loads(resp)
|
|
111
|
+
partners = data["elements"]
|
|
112
|
+
# paging.next and paging.total follow same structure as courses
|
|
113
|
+
```
|
|
114
|
+
|
|
115
|
+
### Partner record structure
|
|
116
|
+
|
|
117
|
+
```json
|
|
118
|
+
{
|
|
119
|
+
"id": "6",
|
|
120
|
+
"name": "University of Pennsylvania",
|
|
121
|
+
"shortName": "penn",
|
|
122
|
+
"description": "The University of Pennsylvania (commonly referred to as Penn)...",
|
|
123
|
+
"squareLogo": "http://coursera-university-assets.s3.amazonaws.com/.../logo.png"
|
|
124
|
+
}
|
|
125
|
+
```
|
|
126
|
+
|
|
127
|
+
### Partner by ID (with courseIds)
|
|
128
|
+
|
|
129
|
+
```python
|
|
130
|
+
resp = http_get(
|
|
131
|
+
"https://api.coursera.org/api/partners.v1"
|
|
132
|
+
"?ids=6&fields=name,squareLogo,description,shortName,courseIds"
|
|
133
|
+
)
|
|
134
|
+
data = json.loads(resp)
|
|
135
|
+
partner = data["elements"][0]
|
|
136
|
+
# partner["courseIds"] is a list of course ID strings (150+ for large universities)
|
|
137
|
+
```
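
Because course records carry only `partnerIds`, partner names have to be joined client-side after fetching both resources. A minimal sketch of that join, assuming the record shapes shown above; `attach_partner_names` and the `partnerNames` key are hypothetical names, and the input lists stand in for data already fetched from `courses.v1` and `partners.v1`.

```python
def attach_partner_names(courses, partners):
    """Return copies of the course dicts with a partnerNames list added."""
    names_by_id = {p["id"]: p["name"] for p in partners}
    joined = []
    for course in courses:
        row = dict(course)  # shallow copy so the input records stay untouched
        row["partnerNames"] = [
            names_by_id.get(pid, "(unknown partner)")
            for pid in course.get("partnerIds", [])
        ]
        joined.append(row)
    return joined


# Stand-in data shaped like the API records in this document.
courses = [{"id": "69Bku0KoEeWZtA4u62x6lQ", "name": "Gamification", "partnerIds": ["6"]}]
partners = [{"id": "6", "name": "University of Pennsylvania", "shortName": "penn"}]
joined = attach_partner_names(courses, partners)
```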
|
|
138
|
+
|
|
139
|
+
---
|
|
140
|
+
|
|
141
|
+
## 3. Specializations API (http_get — works)
|
|
142
|
+
|
|
143
|
+
```python
|
|
144
|
+
resp = http_get(
|
|
145
|
+
"https://api.coursera.org/api/onDemandSpecializations.v1"
|
|
146
|
+
"?fields=name,slug,description,partnerIds,courseIds,tagline&limit=100&start=0"
|
|
147
|
+
)
|
|
148
|
+
data = json.loads(resp)
|
|
149
|
+
specs = data["elements"]
|
|
150
|
+
```
|
|
151
|
+
|
|
152
|
+
### Specialization record structure
|
|
153
|
+
|
|
154
|
+
```json
|
|
155
|
+
{
|
|
156
|
+
"id": "AbCdEfGhIjKl",
|
|
157
|
+
"name": "SIEM Splunk",
|
|
158
|
+
"slug": "siem-splunk",
|
|
159
|
+
"tagline": "Learn SIEM fundamentals with Splunk",
|
|
160
|
+
"description": "Course Overview:\n\nIn the \"SIEM Splunk\" specialization course...",
|
|
161
|
+
"partnerIds": ["1441"],
|
|
162
|
+
"courseIds": ["pu2XQCuEEe6qTBJCf71DPw", "Xc46mVFkEe6a4wrvTcwXPw", "YH1ok1FXEe62cBI5JZME2w"]
|
|
163
|
+
}
|
|
164
|
+
```
|
|
165
|
+
|
|
166
|
+
Note: Specializations paging does NOT include `paging.total` — iterate until `paging.next` is absent.
|
|
167
|
+
|
|
168
|
+
---
|
|
169
|
+
|
|
170
|
+
## 4. Instructors API (http_get — works)
|
|
171
|
+
|
|
172
|
+
Only useful for lookups by ID (from course `instructorIds`). The plain list endpoint
|
|
173
|
+
returns many empty records (empty name/bio).
|
|
174
|
+
|
|
175
|
+
```python
|
|
176
|
+
# Lookup specific instructors by ID
|
|
177
|
+
resp = http_get(
|
|
178
|
+
"https://api.coursera.org/api/instructors.v1"
|
|
179
|
+
"?ids=226710&fields=fullName,bio,department,title,photo"
|
|
180
|
+
)
|
|
181
|
+
data = json.loads(resp)
|
|
182
|
+
instructor = data["elements"][0]
|
|
183
|
+
```
|
|
184
|
+
|
|
185
|
+
### Instructor record structure
|
|
186
|
+
|
|
187
|
+
```json
|
|
188
|
+
{
|
|
189
|
+
"id": "226710",
|
|
190
|
+
"fullName": "Kevin Werbach",
|
|
191
|
+
"title": "Professor of Legal Studies and Business Ethics",
|
|
192
|
+
"department": "Legal Studies and Business Ethics",
|
|
193
|
+
"bio": "Kevin Werbach is professor of Legal Studies...",
|
|
194
|
+
"photo": "https://d3njjcbhbojbot.cloudfront.net/api/utilities/v1/imageproxy/..."
|
|
195
|
+
}
|
|
196
|
+
```
|
|
197
|
+
|
|
198
|
+
---
|
|
199
|
+
|
|
200
|
+
## 5. Batch ID Lookup
|
|
201
|
+
|
|
202
|
+
Fetch multiple courses (or partners/instructors) in one request by passing a comma-separated `ids` list:
|
|
203
|
+
|
|
204
|
+
```python
|
|
205
|
+
ids = ",".join(["69Bku0KoEeWZtA4u62x6lQ", "hOzhxVNuEfCW8Q55q1kSNQ", "0HiU7Oe4EeWTAQ4yevf_oQ"])
|
|
206
|
+
resp = http_get(
|
|
207
|
+
f"https://api.coursera.org/api/courses.v1"
|
|
208
|
+
f"?ids={ids}&fields=name,slug,description,primaryLanguages,workload,partnerIds"
|
|
209
|
+
)
|
|
210
|
+
data = json.loads(resp)
|
|
211
|
+
# data["elements"] has exactly the courses you asked for
|
|
212
|
+
```
|
|
213
|
+
|
|
214
|
+
Only batches of up to 3 IDs were tested. No limit was hit, but the maximum number of IDs per request is unverified.
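
Since larger batches were not tested, one cautious option is to split long ID lists into small chunks. The sketch below only builds the request URLs; `batched_lookup_urls` is a hypothetical helper, and the default `batch_size=3` simply mirrors the largest batch tested here rather than any documented limit.

```python
from urllib.parse import urlencode


def batched_lookup_urls(resource, ids, fields, batch_size=3):
    """Build one ids= lookup URL per chunk of IDs (hypothetical helper)."""
    base = f"https://api.coursera.org/api/{resource}"
    urls = []
    for i in range(0, len(ids), batch_size):
        chunk = ids[i:i + batch_size]
        # safe="," keeps the comma separators readable instead of %2C.
        query = urlencode(
            {"ids": ",".join(chunk), "fields": ",".join(fields)}, safe=","
        )
        urls.append(f"{base}?{query}")
    return urls


urls = batched_lookup_urls("courses.v1", ["a", "b", "c", "d"], ["name", "slug"])
```

Each URL can then be passed to `http_get` and the `elements` lists concatenated.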
|
|
215
|
+
|
|
216
|
+
---
|
|
217
|
+
|
|
218
|
+
## 6. Keyword Search — BLOCKED for GET (405)
|
|
219
|
+
|
|
220
|
+
`q=search&query=...` returns **HTTP 405 Method Not Allowed** on GET.
|
|
221
|
+
This applies to all three resource types:
|
|
222
|
+
- `courses.v1?q=search&query=python` → 405
|
|
223
|
+
- `onDemandSpecializations.v1?q=search&query=data+science` → 405
|
|
224
|
+
- `partners.v1?q=search&query=stanford` → 405
|
|
225
|
+
|
|
226
|
+
The search endpoint requires a POST request (Coursera's public Autocomplete/Search
|
|
227
|
+
service). For keyword-based discovery without a browser, use the catalog list and filter
|
|
228
|
+
client-side, or use the browser approach below.
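
Client-side filtering over the catalog list could look like the sketch below. It assumes course dicts with `name` and `description` fields as requested earlier; `filter_courses` is a hypothetical helper, and plain substring matching is a deliberate simplification rather than a substitute for Coursera's own search ranking.

```python
def filter_courses(courses, keyword, fields=("name", "description")):
    """Yield courses whose selected text fields contain keyword (case-insensitive)."""
    needle = keyword.casefold()
    for course in courses:
        haystack = " ".join(str(course.get(f, "")) for f in fields).casefold()
        if needle in haystack:
            yield course


# Stand-in records; in practice pass the catalog iterator from section 1.
catalog = [
    {"name": "Gamification", "description": "Gamification is the application of game elements..."},
    {"name": "Python Basics", "description": "An introduction to programming."},
    {"name": "Data Analysis", "description": "Uses Python and spreadsheets."},
]
matches = [c["name"] for c in filter_courses(catalog, "python")]
```

The function is a generator, so it can wrap a paginated catalog iterator without loading all records first.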
|
|
229
|
+
|
|
230
|
+
### Browser fallback for keyword search
|
|
231
|
+
|
|
232
|
+
```python
|
|
233
|
+
new_tab("https://www.coursera.org/search?query=machine+learning")
|
|
234
|
+
wait_for_load()
|
|
235
|
+
wait(3) # Results load asynchronously via React
|
|
236
|
+
capture_screenshot()
|
|
237
|
+
```
|
|
238
|
+
|
|
239
|
+
Note: The search results page (`/search?query=...`) is a client-rendered React app. The
|
|
240
|
+
HTML returned by `http_get` does NOT contain course cards — it's a bare shell with no
|
|
241
|
+
`__NEXT_DATA__` or embedded JSON. A live browser is required to see rendered results.
|
|
242
|
+
|
|
243
|
+
---
|
|
244
|
+
|
|
245
|
+
## 7. Course Detail HTML Page (http_get — works, limited data)
|
|
246
|
+
|
|
247
|
+
```python
|
|
248
|
+
html = http_get("https://www.coursera.org/learn/machine-learning")
|
|
249
|
+
# html is ~980KB of server-rendered HTML (no NEXT_DATA, no Apollo state)
|
|
250
|
+
```
|
|
251
|
+
|
|
252
|
+
The course detail page IS served as full HTML (no JS-gate), but contains very
|
|
253
|
+
little machine-readable course data. What you can extract:
|
|
254
|
+
|
|
255
|
+
```python
|
|
256
|
+
import re, json
|
|
257
|
+
|
|
258
|
+
# Page title (includes course name)
|
|
259
|
+
title = re.search(r'<title[^>]*>(.*?)</title>', html).group(1)
|
|
260
|
+
# "Supervised Machine Learning: Regression and Classification | Coursera"
|
|
261
|
+
|
|
262
|
+
# JSON-LD blocks (2 present)
|
|
263
|
+
jsonld_blocks = re.findall(r'<script type="application/ld\+json">(.*?)</script>', html, re.DOTALL)
|
|
264
|
+
# Block 0: FAQPage schema (common Q&A about how courses work)
|
|
265
|
+
# Block 1: BreadcrumbList (category path, e.g. Browse > Data Science > Machine Learning)
|
|
266
|
+
faq = json.loads(jsonld_blocks[0]) # {"@type": "FAQPage", "mainEntity": [...]}
|
|
267
|
+
crumb = json.loads(jsonld_blocks[1]) # {"@type": "BreadcrumbList", "itemListElement": [...]}
|
|
268
|
+
|
|
269
|
+
# Extract breadcrumb categories
|
|
270
|
+
categories = [item["item"]["name"] for item in crumb.get("@graph", [crumb])[0]["itemListElement"]]  # handles a bare BreadcrumbList or one wrapped in "@graph"
|
|
271
|
+
# e.g. ["Browse", "Data Science", "Machine Learning"]
|
|
272
|
+
```
|
|
273
|
+
|
|
274
|
+
The HTML does NOT embed: description, rating, instructor names, enrollment count,
|
|
275
|
+
price, or any course-specific metadata as machine-readable fields.
|
|
276
|
+
Use the API (`courses.v1?ids=...`) to get those from the slug.
|
|
277
|
+
|
|
278
|
+
### Slug-to-ID lookup pattern
|
|
279
|
+
|
|
280
|
+
```python
|
|
281
|
+
# Get course data from slug (need ID first — get it from catalog or search)
|
|
282
|
+
# Pattern: enumerate catalog, match by slug (this fetches only the first page of 100; keep paginating until the slug is found)
|
|
283
|
+
resp = http_get("https://api.coursera.org/api/courses.v1?fields=name,slug,description&limit=100&start=0")
|
|
284
|
+
data = json.loads(resp)
|
|
285
|
+
by_slug = {el["slug"]: el for el in data["elements"]}
|
|
286
|
+
course = by_slug.get("machine-learning")
|
|
287
|
+
```
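
The snippet above indexes a single page, so a slug further into the catalog will not be found. A minimal sketch of a full scan that stops at the first match is shown here; `find_course_by_slug` is a hypothetical helper, and `demo_catalog` stands in for a real paginated iterator such as `iter_all_courses`.

```python
def find_course_by_slug(courses, slug):
    """Return the first course dict whose slug matches, or None."""
    for course in courses:
        if course.get("slug") == slug:
            return course
    return None


def demo_catalog():
    # Stand-in for a paginated catalog generator.
    yield {"id": "1", "slug": "gamification"}
    yield {"id": "2", "slug": "machine-learning"}
    yield {"id": "3", "slug": "never-reached"}


catalog = demo_catalog()
match = find_course_by_slug(catalog, "machine-learning")
remaining = list(catalog)  # records the scan never had to consume
```

Because the scan returns as soon as the slug is found, a lazy catalog iterator stops issuing page requests at that point.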
|
|
288
|
+
|
|
289
|
+
---
|
|
290
|
+
|
|
291
|
+
## Endpoints Summary
|
|
292
|
+
|
|
293
|
+
| Endpoint | Method | Result |
|
|
294
|
+
|---|---|---|
|
|
295
|
+
| `courses.v1` (list) | GET | 200 OK — full catalog, 20,659 courses |
|
|
296
|
+
| `courses.v1?ids=...` | GET | 200 OK — batch lookup by ID |
|
|
297
|
+
| `courses.v1?q=search&query=...` | GET | **405 Method Not Allowed** |
|
|
298
|
+
| `partners.v1` (list) | GET | 200 OK — 422 partners |
|
|
299
|
+
| `partners.v1?ids=...` | GET | 200 OK — with courseIds |
|
|
300
|
+
| `partners.v1?q=search&query=...` | GET | **405 Method Not Allowed** |
|
|
301
|
+
| `onDemandSpecializations.v1` (list) | GET | 200 OK — paginated (no total) |
|
|
302
|
+
| `onDemandSpecializations.v1?q=search&query=...` | GET | **405 Method Not Allowed** |
|
|
303
|
+
| `instructors.v1?ids=...` | GET | 200 OK — rich records by ID |
|
|
304
|
+
| `instructors.v1` (list) | GET | 200 OK — mostly empty records |
|
|
305
|
+
| `degrees.v1` | GET | 403 Forbidden |
|
|
306
|
+
| `/search?query=...` page HTML | GET | 200 OK — React shell only, no data |
|
|
307
|
+
| `/learn/{slug}` page HTML | GET | 200 OK — HTML with FAQ and breadcrumb JSON-LD only |
|
|
308
|
+
|
|
309
|
+
---
|
|
310
|
+
|
|
311
|
+
## Rate Limits
|
|
312
|
+
|
|
313
|
+
No rate limiting observed in testing:
|
|
314
|
+
- 5 consecutive requests with no delay: all succeeded, avg 0.55s each.
|
|
315
|
+
- No `X-RateLimit-*` or `Retry-After` headers in responses.
|
|
316
|
+
- No auth headers needed for any working endpoint.
|
|
317
|
+
|
|
318
|
+
Response headers that are present: `X-Coursera-Request-Id`, `X-Coursera-Trace-Id-Hex`,
|
|
319
|
+
`x-envoy-upstream-service-time`. No rate-limit indicators.
|
|
320
|
+
|
|
321
|
+
Use a small delay (0.5s) between requests if doing bulk enumeration of the full 20K+
|
|
322
|
+
catalog as a courtesy, but no hard cap was observed.
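
The courtesy delay can be folded into a generic paginator. This is a sketch under stated assumptions: `iter_elements` is a hypothetical helper, `fetch` is any callable that takes a URL and returns the JSON body as text (for example `http_get`), and `fake_fetch` exists only so the example runs offline.

```python
import json
import time


def iter_elements(fetch, resource, fields, page_size=100, delay=0.5, sleep=time.sleep):
    """Yield every element of a list endpoint, pausing between page requests."""
    start = 0
    while True:
        url = (
            f"https://api.coursera.org/api/{resource}"
            f"?fields={fields}&limit={page_size}&start={start}"
        )
        data = json.loads(fetch(url))
        yield from data["elements"]
        # paging.next is a string offset, absent on the last page.
        nxt = data.get("paging", {}).get("next")
        if nxt is None:
            break
        start = int(nxt)
        sleep(delay)


def fake_fetch(url):
    # Offline stand-in for http_get: two pages, the second without "next".
    start = int(url.rsplit("start=", 1)[1])
    pages = {
        0: {"elements": [{"id": "a"}, {"id": "b"}], "paging": {"next": "2", "total": 3}},
        2: {"elements": [{"id": "c"}], "paging": {}},
    }
    return json.dumps(pages[start])


pauses = []
ids = [
    el["id"]
    for el in iter_elements(fake_fetch, "courses.v1", "name", page_size=2, sleep=pauses.append)
]
```

The loop relies only on `paging.next`, so the same function should also suit `onDemandSpecializations.v1`, which never returns `paging.total`.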
|
|
323
|
+
|
|
324
|
+
---
|
|
325
|
+
|
|
326
|
+
## Gotchas
|
|
327
|
+
|
|
328
|
+
- **`q=search` is POST-only**: All three resource types (courses, specializations,
|
|
329
|
+
partners) return 405 on GET when `q=search` is added. There is no documented public
|
|
330
|
+
POST endpoint. For keyword filtering, enumerate the catalog and filter client-side.
|
|
331
|
+
|
|
332
|
+
- **`paging.total` absent after page 1**: Only the first page response includes
|
|
333
|
+
`paging.total`. Subsequent pages have only `paging.next`. Check for the `"next"` key
|
|
334
|
+
being absent to detect end-of-list.
|
|
335
|
+
|
|
336
|
+
- **Specializations never include `paging.total`**: The `onDemandSpecializations.v1`
|
|
337
|
+
endpoint never returns `paging.total` in any page. Iterate until `"next"` is absent.
|
|
338
|
+
|
|
339
|
+
- **`workload` is free-text, unnormalized**: Values include `"4-8 hours/week"`,
|
|
340
|
+
`"1 hour 30 minutes"`, `"4 weeks of study, 1-2 hours/week"`. Do not parse as a number
|
|
341
|
+
without normalization logic.
|
|
342
|
+
|
|
343
|
+
- **`instructors.v1` list returns empty records**: The plain list endpoint returns many
|
|
344
|
+
instructors with empty `fullName`, `bio`, `title`. Always look up by `ids=` using
|
|
345
|
+
IDs from course records.
|
|
346
|
+
|
|
347
|
+
- **`degrees.v1` is 403**: Degree programs are not accessible via the public API.
|
|
348
|
+
|
|
349
|
+
- **HTML pages contain no embedded course data**: Both the search page and the course
|
|
350
|
+
detail page are React-rendered. `http_get` on `/search?query=...` returns an HTML
|
|
351
|
+
shell with no course listings. `http_get` on `/learn/{slug}` returns HTML with only
|
|
352
|
+
a FAQ JSON-LD and a breadcrumb JSON-LD — no course description, rating, price, or
|
|
353
|
+
enrollment data as machine-readable fields.
|
|
354
|
+
|
|
355
|
+
- **`linked` resources don't populate**: Passing `includes=partners.v1` to the courses
|
|
356
|
+
endpoint returns an empty `linked: {}` object. Cross-resource joins require separate
|
|
357
|
+
requests by IDs.
|
|
358
|
+
|
|
359
|
+
- **`previewLink` and `avgRating` fields**: These field names are accepted without error
|
|
360
|
+
but return no data in the response objects. Do not request them.
|
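
As an illustration of the normalization the `workload` gotcha calls for, the sketch below handles only the three formats quoted in this document. `parse_workload` and its return shape are hypothetical, and any other free-text format falls through to `None` rather than being guessed.

```python
import re

# "4-8 hours/week" or "2 hours/week"; the upper bound is optional.
_PER_WEEK = re.compile(
    r"(\d+(?:\.\d+)?)\s*(?:-\s*(\d+(?:\.\d+)?))?\s*hours?\s*/\s*week", re.I
)
_HOURS = re.compile(r"(\d+(?:\.\d+)?)\s*hours?", re.I)
_MINUTES = re.compile(r"(\d+(?:\.\d+)?)\s*minutes?", re.I)


def parse_workload(text):
    """Best-effort parse of a workload string into hours; None if unrecognized."""
    weekly = _PER_WEEK.search(text)
    if weekly:
        low = float(weekly.group(1))
        high = float(weekly.group(2)) if weekly.group(2) else low
        return {"kind": "hours_per_week", "min": low, "max": high}
    hours = _HOURS.search(text)
    minutes = _MINUTES.search(text)
    if hours or minutes:
        total = (float(hours.group(1)) if hours else 0.0) + (
            float(minutes.group(1)) / 60 if minutes else 0.0
        )
        return {"kind": "total_hours", "min": total, "max": total}
    return None


parsed = [
    parse_workload("4-8 hours/week"),
    parse_workload("1 hour 30 minutes"),
    parse_workload("4 weeks of study, 1-2 hours/week"),
    parse_workload("self-paced"),
]
```

The weekly pattern is tried first so that a string mentioning both a course length and a weekly rate is reported as hours per week.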